public inbox for gdb-patches@sourceware.org
 help / color / mirror / Atom feed
From: Andrew Burgess <aburgess@redhat.com>
To: Rainer Orth <ro@CeBiTec.Uni-Bielefeld.DE>,
	Andrew Burgess <andrew.burgess@embecosm.com>
Cc: gdb-patches@sourceware.org
Subject: Re: [PATCH] Fix expected received signal message in testsuite
Date: Thu, 24 Nov 2022 17:06:30 +0000	[thread overview]
Message-ID: <87o7sw2sfd.fsf@redhat.com> (raw)
In-Reply-To: <87sfi82vg4.fsf@redhat.com>

Andrew Burgess <aburgess@redhat.com> writes:

> Rainer Orth <ro@CeBiTec.Uni-Bielefeld.DE> writes:
>
>> Hi Andrew,
>>
>>> * Rainer Orth <ro@CeBiTec.Uni-Bielefeld.DE> [2019-09-05 14:04:06 +0200]:
>>>
>>>> Quite a number of tests FAIL on Solaris due to a mismatch between
>>>> expected and received messages: the testsuite expects something like
>>>> 
>>>> 	Program received signal SIGABRT, Aborted.
>>>> 
>>>> while on Solaris it gets
>>>> 
>>>> 	Thread 2 received signal SIGABRT, Aborted.
>>>> 
>>>> For a simple testcase, info threads shows
>>>> 
>>>> (gdb) info threads 
>>>>   Id   Target Id         Frame 
>>>>   1    LWP    1          main () at /vol/src/gnu/gdb/doc/bugs/ua.c:5
>>>> * 2    Thread 1 (LWP 1)  main () at /vol/src/gnu/gdb/doc/bugs/ua.c:5
>>>> 
>>>> I suspect this is due to support for the old pre-Solaris 9 MxN thread
>>>> model where user level threads were mapped to a different set of lwps.
>>>> 
>>>> For the moment, I'm dealing with this by allowing both forms of the
>>>> message in the testsuite.  The patch is almost completely mechanical,
>>>> with the exception of gdb.base/sigbpt.exp where the introduction of a
>>>> new group in the RE required adjustments in the $expect_out indices.
>>>
>>> I'm a little nervous about just allowing either "Thread" or "Program"
>>> for all tests for all targets.  Maybe others will disagree and think
>>> I'm worrying about nothing, but I wonder if we could be more
>>> conservative by adding a support function into lib/gdb.exp that takes
>>> the name of a signal and returns the string we expect from GDB, which
>>> we can then change based on Solaris/non-Solaris.
>>>
>>> I haven't looked through the patch in enough detail to know if there's
>>> any reason why this wouldn't work, so please push back if you think
>>> the idea is unworkable.
>>
>> sorry for letting the ball drop on this one.  Only recently did I
>> stumble across it again when looking into a related issue and now I
>> finally understand why Solaris is different here.
>>
>> [Thread starting at https://sourceware.org/ml/gdb-patches/2019-09/msg00050.html]
>>
>> * Consider the following testcase:
>>
>> $ cat selfkill.c 
>> #include <sys/types.h>
>> #include <signal.h>
>> #include <unistd.h>
>> #include <pthread.h>
>>
>> void *
>> selfkill (void *arg)
>> {
>>   kill (getpid (), SIGINT);
>>   return NULL;
>> }
>>
>> int
>> main (void)
>> {
>> #ifdef _REENTRANT
>>   pthread_t tid;
>>   pthread_create (&tid, NULL, selfkill, NULL);
>>   pthread_join (tid, NULL);
>> #else
>>   selfkill (NULL);
>> #endif
>>   return 0;
>> }
>>
>> * Now compile on Solaris 9, both without and with -pthread:
>>
>> $ gcc -o selfkill selfkill.c
>> $ gcc -pthread -o selfkill-mt selfkill.c
>>
>> * Run the identical binaries and versions of gdb (7.11 here) on both
>>   Solaris 9 and Solaris 10:
>>
>> $ gdb -q --batch -ex run selfkill{,-mt}
>>
>> ** Solaris 9, selfkill:
>>
>> Program received signal SIGINT, Interrupt.
>> 0xb5d54186 in _libc_kill () from /usr/lib/libc.so.1
>>
>> ** Solaris 9, selfkill-mt:
>>
>> [Thread debugging using libthread_db enabled]
>> [New Thread 1 (LWP 1)]
>> [New LWP    2        ]
>> [New Thread 2 (LWP 2)]
>>
>> Thread 2 received signal SIGINT, Interrupt.
>> [Switching to Thread 1 (LWP 1)]
>> 0xb5c9fad5 in _lwp_wait () from /usr/lib/libc.so.1
>>
>> ** Solaris 10, selfkill:
>>
>> [Thread debugging using libthread_db enabled]
>> [New Thread 1 (LWP 1)]
>>
>> Thread 2 received signal SIGINT, Interrupt.
>> [Switching to Thread 1 (LWP 1)]
>> 0xfef0c165 in kill () from /lib/libc.so.1
>>
>> ** Solaris 10, selfkill-mt:
>>
>> [Thread debugging using libthread_db enabled]
>> [New Thread 1 (LWP 1)]
>> [New LWP    2        ]
>> [New Thread 2 (LWP 2)]
>>
>> Thread 2 received signal SIGINT, Interrupt.
>> [Switching to Thread 1 (LWP 1)]
>> 0xfeedca05 in __lwp_wait () from /lib/libc.so.1
>>
>> ** Trying the same on Linux/x86_64, one sees the same behaviour as on
>>    Solaris 9: non-threaded and threaded programs behave differently.
>>
>> * As you can see, on Solaris 10 even the not explicitly threaded version
>>   of the test is shown as threaded, explaining the difference in the
>>   "... received signal" messages.
>>
>>   This is a consequence of the Thread Model Unification Project in
>>   Solaris 10, which removed the difference between non-threaded and
>>   threaded processes.  This has nothing to do with the removal of the
>>   pre-Solaris 9 MxN multilevel thread model as I'd originally
>>   suspected.
>
> I tried to take a look at this a little.  The only Solaris machines I
> have access to run on Sparc, not x86-64, but hopefully should still have
> much the same behaviour.
>
> I did manage to (eventually) build GDB on one of these machines, but,
> I'm not sure if I built it wrong, or if the Sparc/Solaris support is
> just bad, but GDB was crashing all over the place with assertion
> failures.
>
> Still, with some persistence I could see the behaviour you observe.
>
> Now, I've not done any Solaris work in >10years, so I don't claim to be
> any kind of expert, but I wonder if the fix you're proposing here isn't
> simply hiding a GDB bug.
>
> I wrote a simple test program that starts 3 worker threads and then
> blocks.  Here's the 'info threads' output for GNU/Linux:
>
>   (gdb) info threads 
>     Id   Target Id                                   Frame 
>   * 1    Thread 0x7ffff7da3740 (LWP 2243115) "thr.x" 0x00007ffff7e74215 in nanosleep () from /lib64/libc.so.6
>     2    Thread 0x7ffff7da2700 (LWP 2243118) "thr.x" 0x00007ffff7e74215 in nanosleep () from /lib64/libc.so.6
>     3    Thread 0x7ffff75a1700 (LWP 2243119) "thr.x" 0x00007ffff7e74215 in nanosleep () from /lib64/libc.so.6
>     4    Thread 0x7ffff6da0700 (LWP 2243120) "thr.x" 0x00007ffff7e74215 in nanosleep () from /lib64/libc.so.6
>
> What you'd expect.  Now here's the same on Solaris:
>
>   (gdb) info threads
>     Id   Target Id         Frame 
>   * 1    LWP    1          0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>     2    LWP    4          0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>     3    LWP    3          0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>     4    LWP    2          0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>     5    Thread 1 (LWP 1)  0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>     6    Thread 2 (LWP 2)  0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>     7    Thread 3 (LWP 3)  0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>     8    Thread 4 (LWP 4)  0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>
> This is inline with what you describe, but, I think we can all agree,
> this seems a little odd; are there really 8 thread like things running
> as part of this process?  The output of `ps -aL` would suggest not:
>
>   $ ps -aL
>      PID   LWP TTY        LTIME CMD
>     3855     1 pts/6       0:00 thr.x
>     3855     2 pts/6       0:00 thr.x
>     3855     3 pts/6       0:00 thr.x
>     3855     4 pts/6       0:00 thr.x
>     4132     1 pts/8       0:00 ps
>
> And also, when I run the same test application using the dbx debugger, I
> see this:
>
>   (dbx) threads
>   *>    t@1  a  l@1   ?()   signal SIGINT in  ___nanosleep() 
>         t@2  a  l@2   thread_worker()   running          in  ___nanosleep() 
>         t@3  a  l@3   thread_worker()   running          in  ___nanosleep() 
>         t@4  a  l@4   thread_worker()   running          in  ___nanosleep() 
>
> So here, the process is represented as just 4 thread like things.
>
> So, why does GDB think there are 8, while every tools that ships with
> Solaris seems to think there are 4?  My guess, is that is has something
> to do with the thread lookup code in sol-thread.c, and/or the operation
> of libthread-db.
>
> So, what I run your original selfkill test application, and use GDB to
> break on GDB's add_thread_with_info function (the thing that is
> responsible for printing the "New Thread ..." message), here's what I
> see:
>
>   (gdb) bt
>   #0  add_thread_with_info (targ=targ@entry=0x940678 <the_procfs_target>, ptid=..., priv=priv@entry=0x0) at ../../src/gdb/thread.c:290
>   #1  0x0053b61c in add_thread (targ=0x940678 <the_procfs_target>, ptid=...) at ../../src/gdb/thread.c:305
>   #2  0x004ab5f4 in sol_thread_target::wait (this=<optimized out>, ptid=..., ourstatus=0xffbff620, options=...) at ../../src/gdb/sol-thread.c:459
>   #3  0x0053019c in target_wait (ptid=..., status=status@entry=0xffbff620, options=...) at ../../src/gdb/target.c:2598
>   #4  0x00395478 in do_target_wait_1 (inf=inf@entry=0x969288, ptid=..., status=status@entry=0xffbff620, options=<error reading variable: Cannot access memory at address 0x0>) at ../../src/gdb/infrun.c:3763
>   #5  0x003a7e8c in <lambda(inferior*)>::operator() (inf=0x969288, __closure=<synthetic pointer>) at ../../src/gdb/infrun.c:3822
>   #6  do_target_wait (options=..., ecs=0xffbff600) at ../../src/gdb/infrun.c:3841
>   #7  fetch_inferior_event () at ../../src/gdb/infrun.c:4201
>   #8  0x001b0bd8 in check_async_event_handlers () at ../../src/gdb/async-event.c:337
>   #9  0x006c4e3c in gdb_do_one_event (mstimeout=mstimeout@entry=-1) at ../../src/gdbsupport/event-loop.cc:221
>   #10 0x003d7ea0 in start_event_loop () at ../../src/gdb/main.c:411
>   #11 captured_command_loop () at ../../src/gdb/main.c:471
>   #12 0x003d9fa8 in captured_main (data=0xffbff84c) at ../../src/gdb/main.c:1330
>   #13 gdb_main (args=args@entry=0xffbff84c) at ../../src/gdb/main.c:1345
>   #14 0x006f7c5c in main (argc=4, argv=0xffbff8bc) at ../../src/gdb/gdb.c:32
>   (gdb) frame 2
>   #2  0x004ab5f4 in sol_thread_target::wait (this=<optimized out>, ptid=..., ourstatus=0xffbff620, options=...) at ../../src/gdb/sol-thread.c:459
>   459                   add_thread (proc_target, rtnval);
>   (gdb) p rtnval
>   $1 = {m_pid = 7218, m_lwp = 0, m_tid = 1}
>   (gdb) p current_inferior_.m_obj->thread_list.m_front.ptid 
>   $2 = {m_pid = 7218, m_lwp = 1, m_tid = 0}
>   (gdb) 
>
> What this is telling us, is that, when GDB stopped after the ::wait
> call, the ptid_t it got back was '{m_pid = 7218, m_lwp = 0, m_tid = 1}',
> however, the original thread that GDB found when starting the
> application was '{m_pid = 7218, m_lwp = 1, m_tid = 0}'.
>
> This difference is what causes GDB to add the new thread.
>
> My guess is that this m_lwp/m_tid difference is a bug somewhere in the
> stack, and that really, we should be seeing the same ptid_t here.  If we
> did, then GDB would not add the new thread, and the test messages would
> not change.

So, to clarify this a little, the discrepancy seems to arise from
lwp_to_thread, this is where we query libthread-db.

Before this point, in sol_thread_target::wait, we call:

  ptid_t rtnval = beneath ()->wait (ptid, ourstatus, options);

this returns us the (maybe?) expected ptid_t {m_pid = 7218, m_lwp = 1,
m_tid = 0}, then when we call lwp_to_thread, we get back the alternative
ptid_t where the tid field is set, but the lwp field is not.

I don't know if this indicates a bug in libthread-db, or a bug in GDB.

Thanks,
Andrew


  reply	other threads:[~2022-11-24 17:06 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-05 12:04 Rainer Orth
2019-09-13 12:47 ` Rainer Orth
2019-09-13 22:18 ` Andrew Burgess
2022-11-22  9:18   ` Rainer Orth
2022-11-24 16:01     ` Andrew Burgess
2022-11-24 17:06       ` Andrew Burgess [this message]
2022-11-25 10:04         ` Rainer Orth
2022-11-25  9:58       ` Rainer Orth

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87o7sw2sfd.fsf@redhat.com \
    --to=aburgess@redhat.com \
    --cc=andrew.burgess@embecosm.com \
    --cc=gdb-patches@sourceware.org \
    --cc=ro@CeBiTec.Uni-Bielefeld.DE \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).