From: Rainer Orth
To: Andrew Burgess
Cc: Andrew Burgess, gdb-patches@sourceware.org
Subject: Re: [PATCH] Fix expected received signal message in testsuite
Date: Fri, 25 Nov 2022 10:58:11 +0100
In-Reply-To: <87sfi82vg4.fsf@redhat.com> (Andrew Burgess's message of "Thu, 24 Nov 2022 16:01:15 +0000")

Hi Andrew,

> I tried to take a look at this a little.  The only Solaris machines I
> have access to run on Sparc, not x86-64, but hopefully should still
> have much the same behaviour.

they do indeed.  I agree it's a shame that no Solaris/x86 system (or
up-to-date Solaris 11.4/SPARC system) is part of the GCC compile farm,
but my attempt to provide this failed primarily for
administrative/university policy reasons.

> I did manage to (eventually) build GDB on one of these machines, but,
> I'm not sure if I built it wrong, or if the Sparc/Solaris support is
> just bad, but GDB was crashing all over the place with assertion
> failures.
It depends on your definition of `all over the place', but yeah,
Solaris test results are pretty bad:

* sparcv9:

		=== gdb Summary ===

# of unexpected core files      1
# of expected passes            68020
# of unexpected failures        3535
# of unexpected successes       13
# of expected failures          61
# of unknown successes          2
# of known failures             88
# of untested testcases         189
# of unresolved testcases       52
# of unsupported tests          244
# of paths in test names        12
# of duplicate test names       7

* amd64:

		=== gdb Summary ===

# of unexpected core files      1
# of expected passes            72847
# of unexpected failures        2664
# of unexpected successes       13
# of expected failures          60
# of unknown successes          1
# of known failures             91
# of untested testcases         190
# of unresolved testcases       73
# of unsupported tests          254
# of paths in test names        13
# of duplicate test names       7

I had tried to attack the most glaring ones several times in the past,
but in the end it came to nothing: I really fear the complexity of the
GDB code base is beyond me for anything but fixing build failures and
minor feature patches ;-(

> Still, with some persistence I could see the behaviour you observe.
>
> Now, I've not done any Solaris work in >10 years, so I don't claim to
> be any kind of expert, but I wonder if the fix you're proposing here
> isn't simply hiding a GDB bug.
>
> I wrote a simple test program that starts 3 worker threads and then
> blocks.  Here's the 'info threads' output for GNU/Linux:
>
> (gdb) info threads
>   Id   Target Id                                   Frame
> * 1    Thread 0x7ffff7da3740 (LWP 2243115) "thr.x" 0x00007ffff7e74215 in nanosleep () from /lib64/libc.so.6
>   2    Thread 0x7ffff7da2700 (LWP 2243118) "thr.x" 0x00007ffff7e74215 in nanosleep () from /lib64/libc.so.6
>   3    Thread 0x7ffff75a1700 (LWP 2243119) "thr.x" 0x00007ffff7e74215 in nanosleep () from /lib64/libc.so.6
>   4    Thread 0x7ffff6da0700 (LWP 2243120) "thr.x" 0x00007ffff7e74215 in nanosleep () from /lib64/libc.so.6
>
> What you'd expect.  Now here's the same on Solaris:
>
> (gdb) info threads
>   Id   Target Id             Frame
> * 1    LWP 1                 0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>   2    LWP 4                 0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>   3    LWP 3                 0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>   4    LWP 2                 0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>   5    Thread 1 (LWP 1)      0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>   6    Thread 2 (LWP 2)      0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>   7    Thread 3 (LWP 3)      0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>   8    Thread 4 (LWP 4)      0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>
> This is in line with what you describe, but, I think we can all agree,
> this seems a little odd; are there really 8 thread-like things running
> as part of this process?  The output of `ps -aL` would suggest not:
>
> $ ps -aL
>   PID   LWP TTY        LTIME CMD
>  3855     1 pts/6       0:00 thr.x
>  3855     2 pts/6       0:00 thr.x
>  3855     3 pts/6       0:00 thr.x
>  3855     4 pts/6       0:00 thr.x
>  4132     1 pts/8       0:00 ps
>
> And also, when I run the same test application using the dbx debugger,
> I see this:
>
> (dbx) threads
> *>    t@1  a  l@1   ?()   signal SIGINT in  ___nanosleep()
>       t@2  a  l@2   thread_worker()   running   in  ___nanosleep()
>       t@3  a  l@3   thread_worker()   running   in  ___nanosleep()
>       t@4  a  l@4   thread_worker()   running   in  ___nanosleep()
>
> So here, the process is represented as just 4 thread-like things.
>
> So, why does GDB think there are 8, while every tool that ships with
> Solaris seems to think there are 4?  My guess is that it has something
> to do with the thread lookup code in sol-thread.c, and/or the operation
> of libthread-db.
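For reference, a test program along those lines is easy to recreate.
Here's my own sketch (just a guess at what the thr.x test does, not
Andrew's actual source; the thread_worker name is merely borrowed from
the dbx output above), built with e.g. g++ -pthread:

/* Sketch of a reproducer: three worker threads blocking in nanosleep,
   with the main thread doing the same.  */

#include <pthread.h>
#include <time.h>

static void *
thread_worker (void *)
{
  /* Block for a long time so all threads sit in nanosleep.  */
  struct timespec ts = { 600, 0 };
  nanosleep (&ts, nullptr);
  return nullptr;
}

int
main ()
{
  pthread_t thr[3];

  for (int i = 0; i < 3; i++)
    pthread_create (&thr[i], nullptr, thread_worker, nullptr);

  /* The main thread blocks, too.  */
  struct timespec ts = { 600, 0 };
  nanosleep (&ts, nullptr);
  return 0;
}

With current sol-thread.c I'd expect attaching GDB to this and running
info threads to show the same doubled LWP/Thread list as quoted above.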
It certainly has something to do with sol-thread.c.  However, the
output is not really wrong, just unnecessarily complex: as the head
comment in sol-thread.c alludes to, before Solaris 9, thread support in
Solaris was implemented as a two-level model scheduling M user-level
threads onto N kernel threads/LWPs.  The old complexity is still
visible in the info threads output above, which lists both user-level
threads and LWPs.  However, with Solaris 9 this all became obsolete,
replaced by a 1:1 mapping from user-level threads to LWPs.  In fact,
the new thread model was already visible in Solaris 8's alternative
thread library in /usr/lib/lwp, which gcc forcibly used since it was
the only one providing TLS support.

I tried at least once to rip out support for the two-level model from
sol-thread.c, but failed.  While the output still isn't wrong, it's
plain confusing, and especially the GDB testsuite isn't able to deal
with it, so it only causes problems.  linux-thread-db.c at some point
implemented a similar two-level model, but that has since been ripped
out.

> So, when I run your original selfkill test application, and use GDB to
> break on GDB's add_thread_with_info function (the thing that is
> responsible for printing the "New Thread ..." message), here's what I
> see:
>
> (gdb) bt
> #0  add_thread_with_info (targ=targ@entry=0x940678 , ptid=..., priv=priv@entry=0x0) at ../../src/gdb/thread.c:290
> #1  0x0053b61c in add_thread (targ=0x940678 , ptid=...) at ../../src/gdb/thread.c:305
> #2  0x004ab5f4 in sol_thread_target::wait (this=, ptid=..., ourstatus=0xffbff620, options=...) at ../../src/gdb/sol-thread.c:459
> #3  0x0053019c in target_wait (ptid=..., status=status@entry=0xffbff620, options=...) at ../../src/gdb/target.c:2598
> #4  0x00395478 in do_target_wait_1 (inf=inf@entry=0x969288, ptid=..., status=status@entry=0xffbff620, options=) at ../../src/gdb/infrun.c:3763
> #5  0x003a7e8c in ::operator() (inf=0x969288, __closure=) at ../../src/gdb/infrun.c:3822
> #6  do_target_wait (options=..., ecs=0xffbff600) at ../../src/gdb/infrun.c:3841
> #7  fetch_inferior_event () at ../../src/gdb/infrun.c:4201
> #8  0x001b0bd8 in check_async_event_handlers () at ../../src/gdb/async-event.c:337
> #9  0x006c4e3c in gdb_do_one_event (mstimeout=mstimeout@entry=-1) at ../../src/gdbsupport/event-loop.cc:221
> #10 0x003d7ea0 in start_event_loop () at ../../src/gdb/main.c:411
> #11 captured_command_loop () at ../../src/gdb/main.c:471
> #12 0x003d9fa8 in captured_main (data=0xffbff84c) at ../../src/gdb/main.c:1330
> #13 gdb_main (args=args@entry=0xffbff84c) at ../../src/gdb/main.c:1345
> #14 0x006f7c5c in main (argc=4, argv=0xffbff8bc) at ../../src/gdb/gdb.c:32
> (gdb) frame 2
> #2  0x004ab5f4 in sol_thread_target::wait (this=, ptid=..., ourstatus=0xffbff620, options=...) at ../../src/gdb/sol-thread.c:459
> 459           add_thread (proc_target, rtnval);
> (gdb) p rtnval
> $1 = {m_pid = 7218, m_lwp = 0, m_tid = 1}
> (gdb) p current_inferior_.m_obj->thread_list.m_front.ptid
> $2 = {m_pid = 7218, m_lwp = 1, m_tid = 0}
> (gdb)
>
> What this is telling us is that, when GDB stopped after the ::wait
> call, the ptid_t it got back was '{m_pid = 7218, m_lwp = 0, m_tid = 1}';
> however, the original thread that GDB found when starting the
> application was '{m_pid = 7218, m_lwp = 1, m_tid = 0}'.
>
> This difference is what causes GDB to add the new thread.
>
> My guess is that this m_lwp/m_tid difference is a bug somewhere in the
> stack, and that really, we should be seeing the same ptid_t here.  If
> we did, then GDB would not add the new thread, and the test messages
> would not change.
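To make the mismatch concrete, here's a standalone model of that
comparison.  This is not gdb's real ptid_t (see gdbsupport/ptid.h),
just an illustration of why the two values above are treated as
different threads:

#include <iostream>

/* Toy stand-in for gdb's ptid_t: a pid plus either an LWP id or a
   libthread_db thread id, with 0 meaning "not set".  */
struct ptid_model
{
  long pid, lwp, tid;

  bool operator== (const ptid_model &other) const
  {
    return pid == other.pid && lwp == other.lwp && tid == other.tid;
  }
};

int
main ()
{
  /* What gets registered for the initial thread: an LWP id only.  */
  ptid_model from_procfs = { 7218, 1, 0 };
  /* What sol_thread_target::wait hands back: a thread id only.  */
  ptid_model from_sol_thread = { 7218, 0, 1 };

  /* Both name the same underlying LWP, yet they never compare equal,
     so an equality-based thread lookup misses and a "new" thread gets
     added.  */
  std::cout << (from_procfs == from_sol_thread ? "same" : "different")
            << std::endl;
  return 0;
}

Since the equality check compares all three fields, the LWP-only and
tid-only spellings of the same thread can never match, which is exactly
the situation the backtrace above shows.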
> What are your thoughts on this analysis?

I do agree that this discrepancy is causing issues (certainly not only
here), but I've looked at it from a different angle: the "... received
signal" message comes from infrun.c:print_signal_received_reason, where
the "Thread ..." variant is emitted if
thread.c:show_thread_that_caused_stop returns true, which happens for
highest_thread_num > 1.

It could be (I do have long-winded notes from long ago that I can only
half make sense of now) that the fact that both the user-level thread
and the LWP are counted here causes this.  It seems that
highest_thread_num is incremented once from

Thread 2 hit Hardware watchpoint 3: 'thread.c'::highest_thread_num

Old value = 0
New value = 1
0x00000001009c83ec in thread_info::thread_info (this=0x100f8c100,
    inf_=0x100f06670, ptid_=...)
    at /vol/src/gnu/gdb/hg/master/local/gdb/thread.c:343
343       this->global_num = ++highest_thread_num;
(top-gdb) where
#0  0x00000001009c83ec in thread_info::thread_info (this=0x100f8c100,
    inf_=0x100f06670, ptid_=...)
    at /vol/src/gnu/gdb/hg/master/local/gdb/thread.c:343
#1  0x00000001009c8560 in new_thread (inf=0x100f06670, ptid=...)
    at /vol/src/gnu/gdb/hg/master/local/gdb/thread.c:248
#2  0x00000001009cbd7c in add_thread_silent (ptid=...)
    at /vol/src/gnu/gdb/hg/master/local/gdb/thread.c:310
#3  0x0000000100831604 in procfs_target::create_inferior (
    this=0x100d138c0 ,
    exec_file=0x100f13300 "/vol/obj/gnu/gdb/gdb/11.5-sparcv9-local/ua",
    allargs="", env=0x100f06f80, from_tty=)
    at /vol/src/gnu/gdb/hg/master/local/gdb/procfs.c:3093

and a second time from

Thread 2 hit Hardware watchpoint 3: 'thread.c'::highest_thread_num

Old value = 1
New value = 2
0x00000001009c83ec in thread_info::thread_info (this=0x101176ab0,
    inf_=0x100f06670, ptid_=...)
    at /vol/src/gnu/gdb/hg/master/local/gdb/thread.c:343
343       this->global_num = ++highest_thread_num;
(top-gdb) where
#0  0x00000001009c83ec in thread_info::thread_info (this=0x101176ab0,
    inf_=0x100f06670, ptid_=...)
    at /vol/src/gnu/gdb/hg/master/local/gdb/thread.c:343
#1  0x00000001009c8560 in new_thread (inf=0x100f06670, ptid=...)
    at /vol/src/gnu/gdb/hg/master/local/gdb/thread.c:248
#2  0x00000001009cbd7c in add_thread_silent (ptid=...)
    at /vol/src/gnu/gdb/hg/master/local/gdb/thread.c:310
#3  0x00000001009cbee4 in add_thread_with_info (ptid=..., priv=0x0)
    at /vol/src/gnu/gdb/hg/master/local/gdb/thread.c:319
#4  0x00000001009cbfe0 in add_thread (ptid=...)
    at /vol/src/gnu/gdb/hg/master/local/gdb/thread.c:333
#5  0x0000000100908b54 in sol_thread_target::wait (
    this=0x100d1ba08 , ptid=..., ourstatus=, options=)
    at /vol/src/gnu/gdb/hg/master/local/gdb/sol-thread.c:466
#6  0x000000010099f9e4 in target_wait (ptid=..., status=0xffffffff7fffe460,
    options=) at /vol/src/gnu/gdb/hg/master/local/gdb/target.h:1298
#7  0x0000000100729228 in do_target_wait (ptid=...,
    status=0xffffffff7fffe460, options=)
    at /vol/src/gnu/gdb/hg/master/local/gdb/infrun.c:3442
#8  0x0000000100737e28 in fetch_inferior_event (client_data=0x0)
    at /vol/src/gnu/gdb/hg/master/local/gdb/infrun.c:3725
#9  0x0000000100715210 in inferior_event_handler (event_type=,
    client_data=0x0) at /vol/src/gnu/gdb/hg/master/local/gdb/inf-loop.c:43
#10 0x00000001007273c4 in infrun_async_inferior_event_handler (data=0x0)
    at /vol/src/gnu/gdb/hg/master/local/gdb/infrun.c:8898
#11 0x0000000100691828 in check_async_event_handlers ()
    at /vol/src/gnu/gdb/hg/master/local/gdb/event-loop.c:1062
#12 gdb_do_one_event ()
    at /vol/src/gnu/gdb/hg/master/local/gdb/event-loop.c:325
#13 0x00000001006922ec in start_event_loop ()
    at /vol/src/gnu/gdb/hg/master/local/gdb/event-loop.c:370
#14 0x0000000100786090 in captured_command_loop ()
    at /vol/src/gnu/gdb/hg/master/local/gdb/main.c:331
#15 0x00000001007879fc in captured_main (data=)
    at /vol/src/gnu/gdb/hg/master/local/gdb/main.c:1171
#16 gdb_main (args=)
    at /vol/src/gnu/gdb/hg/master/local/gdb/main.c:1186
#17 0x0000000100470728 in main (argc=, argv=0xffffffff7fffeba8)
    at /vol/src/gnu/gdb/hg/master/local/gdb/gdb.c:32

As you see, those are old stacktraces and no longer match current code.
However, one call ultimately comes from procfs.c (i.e. for an lwp), the
second from sol-thread.c (for a user-level thread).  I strongly suspect
this needs to go away, but don't have the slightest idea how.
Certainly, the whole thread/lwp distinction is moot since Solaris 9:
tid and lwpid are identical by now and there's a 1:1 mapping between
the two.

	Rainer

-- 
-----------------------------------------------------------------------------
Rainer Orth, Center for Biotechnology, Bielefeld University