From: Rainer Orth
To: Andrew Burgess
Cc: Andrew Burgess, gdb-patches@sourceware.org
Subject: Re: [PATCH] Fix expected received signal message in testsuite
Date: Fri, 25 Nov 2022 10:58:11 +0100
In-Reply-To: <87sfi82vg4.fsf@redhat.com> (Andrew Burgess's message of "Thu, 24 Nov 2022 16:01:15 +0000")

Hi Andrew,

> I tried to take a look at this a little.  The only Solaris machines I
> have access to run on Sparc, not x86-64, but hopefully should still
> have much the same behaviour.

they do indeed.  I agree it's a shame that no Solaris/x86 system (or
up-to-date Solaris 11.4/SPARC system) is part of the GCC compile farm,
but my attempt to provide this failed primarily for
administrative/university policy reasons.

> I did manage to (eventually) build GDB on one of these machines, but,
> I'm not sure if I built it wrong, or if the Sparc/Solaris support is
> just bad, but GDB was crashing all over the place with assertion
> failures.
It depends on your definition of `all over the place', but yeah,
Solaris test results are pretty bad:

* sparcv9:

		=== gdb Summary ===

# of unexpected core files      1
# of expected passes            68020
# of unexpected failures        3535
# of unexpected successes       13
# of expected failures          61
# of unknown successes          2
# of known failures             88
# of untested testcases         189
# of unresolved testcases       52
# of unsupported tests          244
# of paths in test names        12
# of duplicate test names       7

* amd64:

		=== gdb Summary ===

# of unexpected core files      1
# of expected passes            72847
# of unexpected failures        2664
# of unexpected successes       13
# of expected failures          60
# of unknown successes          1
# of known failures             91
# of untested testcases         190
# of unresolved testcases       73
# of unsupported tests          254
# of paths in test names        13
# of duplicate test names       7

I had tried to attack the most glaring ones several times in the past,
but in the end it came to nothing: I really fear the complexity of the
GDB code base is beyond me for anything but fixing build failures and
minor feature patches ;-(

> Still, with some persistence I could see the behaviour you observe.
>
> Now, I've not done any Solaris work in >10 years, so I don't claim to
> be any kind of expert, but I wonder if the fix you're proposing here
> isn't simply hiding a GDB bug.
>
> I wrote a simple test program that starts 3 worker threads and then
> blocks.  Here's the 'info threads' output for GNU/Linux:
>
> (gdb) info threads
>   Id   Target Id                                   Frame
> * 1    Thread 0x7ffff7da3740 (LWP 2243115) "thr.x" 0x00007ffff7e74215 in nanosleep () from /lib64/libc.so.6
>   2    Thread 0x7ffff7da2700 (LWP 2243118) "thr.x" 0x00007ffff7e74215 in nanosleep () from /lib64/libc.so.6
>   3    Thread 0x7ffff75a1700 (LWP 2243119) "thr.x" 0x00007ffff7e74215 in nanosleep () from /lib64/libc.so.6
>   4    Thread 0x7ffff6da0700 (LWP 2243120) "thr.x" 0x00007ffff7e74215 in nanosleep () from /lib64/libc.so.6
>
> What you'd expect.  Now here's the same on Solaris:
>
> (gdb) info threads
>   Id   Target Id             Frame
> * 1    LWP 1                 0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>   2    LWP 4                 0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>   3    LWP 3                 0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>   4    LWP 2                 0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>   5    Thread 1 (LWP 1)      0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>   6    Thread 2 (LWP 2)      0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>   7    Thread 3 (LWP 3)      0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>   8    Thread 4 (LWP 4)      0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>
> This is in line with what you describe, but, I think we can all agree,
> this seems a little odd; are there really 8 thread-like things running
> as part of this process?  The output of `ps -aL` would suggest not:
>
> $ ps -aL
>   PID   LWP TTY        LTIME CMD
>  3855     1 pts/6       0:00 thr.x
>  3855     2 pts/6       0:00 thr.x
>  3855     3 pts/6       0:00 thr.x
>  3855     4 pts/6       0:00 thr.x
>  4132     1 pts/8       0:00 ps
>
> And also, when I run the same test application using the dbx debugger,
> I see this:
>
> (dbx) threads
> *>    t@1  a  l@1   ?()   signal SIGINT in  ___nanosleep()
>       t@2  a  l@2   thread_worker()   running   in  ___nanosleep()
>       t@3  a  l@3   thread_worker()   running   in  ___nanosleep()
>       t@4  a  l@4   thread_worker()   running   in  ___nanosleep()
>
> So here, the process is represented as just 4 thread-like things.
>
> So, why does GDB think there are 8, while every tool that ships with
> Solaris seems to think there are 4?  My guess is that it has something
> to do with the thread lookup code in sol-thread.c, and/or the operation
> of libthread-db.
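For reference, a test program along those lines is easy to recreate.
Here's my own sketch (just a guess at what the thr.x test does, not
Andrew's actual source; the thread_worker name is merely borrowed from
the dbx output above), built with e.g. g++ -pthread:

/* Sketch of a reproducer: three worker threads blocking in nanosleep,
   with the main thread doing the same.  */

#include <pthread.h>
#include <time.h>

static void *
thread_worker (void *)
{
  /* Block for a long time so all threads sit in nanosleep.  */
  struct timespec ts = { 600, 0 };
  nanosleep (&ts, nullptr);
  return nullptr;
}

int
main ()
{
  pthread_t thr[3];

  for (int i = 0; i < 3; i++)
    pthread_create (&thr[i], nullptr, thread_worker, nullptr);

  /* The main thread blocks, too.  */
  struct timespec ts = { 600, 0 };
  nanosleep (&ts, nullptr);
  return 0;
}

With current sol-thread.c I'd expect attaching GDB to this and running
info threads to show the same doubled LWP/Thread list as quoted above.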
It certainly has something to do with sol-thread.c.  However, the
output is not really wrong, just unnecessarily complex: as the head
comment in sol-thread.c alludes to, before Solaris 9, thread support in
Solaris was implemented as a two-level model scheduling M user-level
threads onto N kernel threads/LWPs.  The old complexity is still
visible in the info threads output above, which lists both user-level
threads and LWPs.  However, with Solaris 9 this all became obsolete,
replaced by a 1:1 mapping from user-level threads to LWPs.  In fact,
the new thread model was already visible in Solaris 8's alternative
thread library in /usr/lib/lwp, which gcc forcibly used since it was
the only one providing TLS support.

I tried at least once to rip out support for the two-level model from
sol-thread.c, but failed.  While the output still isn't wrong, it's
plain confusing, and especially the GDB testsuite isn't able to deal
with it, so it only causes problems.  linux-thread-db.c at some point
implemented a similar two-level model, but that has since been ripped
out.

> So, when I run your original selfkill test application, and use GDB to
> break on GDB's add_thread_with_info function (the thing that is
> responsible for printing the "New Thread ..." message), here's what I
> see:
>
> (gdb) bt
> #0  add_thread_with_info (targ=targ@entry=0x940678 , ptid=..., priv=priv@entry=0x0) at ../../src/gdb/thread.c:290
> #1  0x0053b61c in add_thread (targ=0x940678 , ptid=...) at ../../src/gdb/thread.c:305
> #2  0x004ab5f4 in sol_thread_target::wait (this=, ptid=..., ourstatus=0xffbff620, options=...) at ../../src/gdb/sol-thread.c:459
> #3  0x0053019c in target_wait (ptid=..., status=status@entry=0xffbff620, options=...) at ../../src/gdb/target.c:2598
> #4  0x00395478 in do_target_wait_1 (inf=inf@entry=0x969288, ptid=..., status=status@entry=0xffbff620, options=) at ../../src/gdb/infrun.c:3763
> #5  0x003a7e8c in ::operator() (inf=0x969288, __closure=) at ../../src/gdb/infrun.c:3822
> #6  do_target_wait (options=..., ecs=0xffbff600) at ../../src/gdb/infrun.c:3841
> #7  fetch_inferior_event () at ../../src/gdb/infrun.c:4201
> #8  0x001b0bd8 in check_async_event_handlers () at ../../src/gdb/async-event.c:337
> #9  0x006c4e3c in gdb_do_one_event (mstimeout=mstimeout@entry=-1) at ../../src/gdbsupport/event-loop.cc:221
> #10 0x003d7ea0 in start_event_loop () at ../../src/gdb/main.c:411
> #11 captured_command_loop () at ../../src/gdb/main.c:471
> #12 0x003d9fa8 in captured_main (data=0xffbff84c) at ../../src/gdb/main.c:1330
> #13 gdb_main (args=args@entry=0xffbff84c) at ../../src/gdb/main.c:1345
> #14 0x006f7c5c in main (argc=4, argv=0xffbff8bc) at ../../src/gdb/gdb.c:32
> (gdb) frame 2
> #2  0x004ab5f4 in sol_thread_target::wait (this=, ptid=..., ourstatus=0xffbff620, options=...) at ../../src/gdb/sol-thread.c:459
> 459           add_thread (proc_target, rtnval);
> (gdb) p rtnval
> $1 = {m_pid = 7218, m_lwp = 0, m_tid = 1}
> (gdb) p current_inferior_.m_obj->thread_list.m_front.ptid
> $2 = {m_pid = 7218, m_lwp = 1, m_tid = 0}
> (gdb)
>
> What this is telling us is that, when GDB stopped after the ::wait
> call, the ptid_t it got back was '{m_pid = 7218, m_lwp = 0, m_tid = 1}';
> however, the original thread that GDB found when starting the
> application was '{m_pid = 7218, m_lwp = 1, m_tid = 0}'.
>
> This difference is what causes GDB to add the new thread.
>
> My guess is that this m_lwp/m_tid difference is a bug somewhere in the
> stack, and that really, we should be seeing the same ptid_t here.  If
> we did, then GDB would not add the new thread, and the test messages
> would not change.
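To make the mismatch concrete, here's a standalone model of that
comparison.  This is not gdb's real ptid_t (see gdbsupport/ptid.h),
just an illustration of why the two values above are treated as
different threads:

#include <iostream>

/* Toy stand-in for gdb's ptid_t: a pid plus either an LWP id or a
   libthread_db thread id, with 0 meaning "not set".  */
struct ptid_model
{
  long pid, lwp, tid;

  bool operator== (const ptid_model &other) const
  {
    return pid == other.pid && lwp == other.lwp && tid == other.tid;
  }
};

int
main ()
{
  /* What gets registered for the initial thread: an LWP id only.  */
  ptid_model from_procfs = { 7218, 1, 0 };
  /* What sol_thread_target::wait hands back: a thread id only.  */
  ptid_model from_sol_thread = { 7218, 0, 1 };

  /* Both name the same underlying LWP, yet they never compare equal,
     so an equality-based thread lookup misses and a "new" thread gets
     added.  */
  std::cout << (from_procfs == from_sol_thread ? "same" : "different")
            << std::endl;
  return 0;
}

Since the equality check compares all three fields, the LWP-only and
tid-only spellings of the same thread can never match, which is exactly
the situation the backtrace above shows.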
> What are your thoughts on this analysis?

I do agree that this discrepancy is causing issues (certainly not only
here), but I've looked at it from a different angle: the "... received
signal" message comes from infrun.c:print_signal_received_reason, where
the "Thread ..." variant is emitted if
thread.c:show_thread_that_caused_stop returns true, which happens for
highest_thread_num > 1.

It could be (I do have long-winded notes from long ago that I can only
half make sense of now) that the fact that both the user-level thread
and the LWP are counted here causes this.  It seems that
highest_thread_num is incremented once from

Thread 2 hit Hardware watchpoint 3: 'thread.c'::highest_thread_num

Old value = 0
New value = 1
0x00000001009c83ec in thread_info::thread_info (this=0x100f8c100,
    inf_=0x100f06670, ptid_=...)
    at /vol/src/gnu/gdb/hg/master/local/gdb/thread.c:343
343       this->global_num = ++highest_thread_num;
(top-gdb) where
#0  0x00000001009c83ec in thread_info::thread_info (this=0x100f8c100,
    inf_=0x100f06670, ptid_=...)
    at /vol/src/gnu/gdb/hg/master/local/gdb/thread.c:343
#1  0x00000001009c8560 in new_thread (inf=0x100f06670, ptid=...)
    at /vol/src/gnu/gdb/hg/master/local/gdb/thread.c:248
#2  0x00000001009cbd7c in add_thread_silent (ptid=...)
    at /vol/src/gnu/gdb/hg/master/local/gdb/thread.c:310
#3  0x0000000100831604 in procfs_target::create_inferior (
    this=0x100d138c0 ,
    exec_file=0x100f13300 "/vol/obj/gnu/gdb/gdb/11.5-sparcv9-local/ua",
    allargs="", env=0x100f06f80, from_tty=)
    at /vol/src/gnu/gdb/hg/master/local/gdb/procfs.c:3093

and a second time from

Thread 2 hit Hardware watchpoint 3: 'thread.c'::highest_thread_num

Old value = 1
New value = 2
0x00000001009c83ec in thread_info::thread_info (this=0x101176ab0,
    inf_=0x100f06670, ptid_=...)
    at /vol/src/gnu/gdb/hg/master/local/gdb/thread.c:343
343       this->global_num = ++highest_thread_num;
(top-gdb) where
#0  0x00000001009c83ec in thread_info::thread_info (this=0x101176ab0,
    inf_=0x100f06670, ptid_=...)
    at /vol/src/gnu/gdb/hg/master/local/gdb/thread.c:343
#1  0x00000001009c8560 in new_thread (inf=0x100f06670, ptid=...)
    at /vol/src/gnu/gdb/hg/master/local/gdb/thread.c:248
#2  0x00000001009cbd7c in add_thread_silent (ptid=...)
    at /vol/src/gnu/gdb/hg/master/local/gdb/thread.c:310
#3  0x00000001009cbee4 in add_thread_with_info (ptid=..., priv=0x0)
    at /vol/src/gnu/gdb/hg/master/local/gdb/thread.c:319
#4  0x00000001009cbfe0 in add_thread (ptid=...)
    at /vol/src/gnu/gdb/hg/master/local/gdb/thread.c:333
#5  0x0000000100908b54 in sol_thread_target::wait (
    this=0x100d1ba08 , ptid=..., ourstatus=, options=)
    at /vol/src/gnu/gdb/hg/master/local/gdb/sol-thread.c:466
#6  0x000000010099f9e4 in target_wait (ptid=..., status=0xffffffff7fffe460,
    options=) at /vol/src/gnu/gdb/hg/master/local/gdb/target.h:1298
#7  0x0000000100729228 in do_target_wait (ptid=...,
    status=0xffffffff7fffe460, options=)
    at /vol/src/gnu/gdb/hg/master/local/gdb/infrun.c:3442
#8  0x0000000100737e28 in fetch_inferior_event (client_data=0x0)
    at /vol/src/gnu/gdb/hg/master/local/gdb/infrun.c:3725
#9  0x0000000100715210 in inferior_event_handler (event_type=,
    client_data=0x0) at /vol/src/gnu/gdb/hg/master/local/gdb/inf-loop.c:43
#10 0x00000001007273c4 in infrun_async_inferior_event_handler (data=0x0)
    at /vol/src/gnu/gdb/hg/master/local/gdb/infrun.c:8898
#11 0x0000000100691828 in check_async_event_handlers ()
    at /vol/src/gnu/gdb/hg/master/local/gdb/event-loop.c:1062
#12 gdb_do_one_event ()
    at /vol/src/gnu/gdb/hg/master/local/gdb/event-loop.c:325
#13 0x00000001006922ec in start_event_loop ()
    at /vol/src/gnu/gdb/hg/master/local/gdb/event-loop.c:370
#14 0x0000000100786090 in captured_command_loop ()
    at /vol/src/gnu/gdb/hg/master/local/gdb/main.c:331
#15 0x00000001007879fc in captured_main (data=)
    at /vol/src/gnu/gdb/hg/master/local/gdb/main.c:1171
#16 gdb_main (args=)
    at /vol/src/gnu/gdb/hg/master/local/gdb/main.c:1186
#17 0x0000000100470728 in main (argc=, argv=0xffffffff7fffeba8)
    at /vol/src/gnu/gdb/hg/master/local/gdb/gdb.c:32

As you see, those are old stacktraces and no longer match current code.
However, one call ultimately comes from procfs.c (i.e. for an lwp), the
second from sol-thread.c (for a user-level thread).  I strongly suspect
this needs to go away, but don't have the slightest idea how.
Certainly, the whole thread/lwp distinction is moot since Solaris 9:
tid and lwpid are identical by now and there's a 1:1 mapping between
the two.

	Rainer

-- 
-----------------------------------------------------------------------------
Rainer Orth, Center for Biotechnology, Bielefeld University