public inbox for gdb-patches@sourceware.org
 help / color / mirror / Atom feed
From: Andrew Burgess <aburgess@redhat.com>
To: Rainer Orth <ro@CeBiTec.Uni-Bielefeld.DE>,
	Andrew Burgess <andrew.burgess@embecosm.com>
Cc: gdb-patches@sourceware.org
Subject: Re: [PATCH] Fix expected received signal message in testsuite
Date: Thu, 24 Nov 2022 16:01:15 +0000	[thread overview]
Message-ID: <87sfi82vg4.fsf@redhat.com> (raw)
In-Reply-To: <yddmt8jqrdl.fsf@CeBiTec.Uni-Bielefeld.DE>

Rainer Orth <ro@CeBiTec.Uni-Bielefeld.DE> writes:

> Hi Andrew,
>
>> * Rainer Orth <ro@CeBiTec.Uni-Bielefeld.DE> [2019-09-05 14:04:06 +0200]:
>>
>>> Quite a number of tests FAIL on Solaris due to a mismatch between
>>> expected and received messages: the testsuite expects something like
>>> 
>>> 	Program received signal SIGABRT, Aborted.
>>> 
>>> while on Solaris it gets
>>> 
>>> 	Thread 2 received signal SIGABRT, Aborted.
>>> 
>>> For a simple testcase, info threads shows
>>> 
>>> (gdb) info threads 
>>>   Id   Target Id         Frame 
>>>   1    LWP    1          main () at /vol/src/gnu/gdb/doc/bugs/ua.c:5
>>> * 2    Thread 1 (LWP 1)  main () at /vol/src/gnu/gdb/doc/bugs/ua.c:5
>>> 
>>> I suspect this is due to support for the old pre-Solaris 9 MxN thread
>>> model where user level threads were mapped to a different set of lwps.
>>> 
>>> For the moment, I'm dealing with this by allowing both forms of the
>>> message in the testsuite.  The patch is almost completely mechanical,
>>> with the exception of gdb.base/sigbpt.exp where the introduction of a
>>> new group in the RE required adjustments in the $expect_out indices.
>>
>> I'm a little nervous about just allowing either "Thread" or "Program"
>> for all tests for all targets.  Maybe others will disagree and think
>> I'm worrying about nothing, but I wonder if we could be more
>> conservative by adding a support function into lib/gdb.exp that takes
>> the name of a signal and returns the string we expect from GDB, which
>> we can then change based on Solaris/non-Solaris.
>>
>> I haven't looked through the patch in enough detail to know if there's
>> any reason why this wouldn't work, so please push back if you think
>> the idea is unworkable.
>
> sorry for letting the ball drop on this one.  Only recently did I
> stumble across it again when looking into a related issue and now I
> finally understand why Solaris is different here.
>
> [Thread starting at https://sourceware.org/ml/gdb-patches/2019-09/msg00050.html]
>
> * Consider the following testcase:
>
> $ cat selfkill.c 
> #include <sys/types.h>
> #include <signal.h>
> #include <unistd.h>
> #include <pthread.h>
>
> void *
> selfkill (void *arg)
> {
>   kill (getpid (), SIGINT);
>   return NULL;
> }
>
> int
> main (void)
> {
> #ifdef _REENTRANT
>   pthread_t tid;
>   pthread_create (&tid, NULL, selfkill, NULL);
>   pthread_join (tid, NULL);
> #else
>   selfkill (NULL);
> #endif
>   return 0;
> }
>
> * Now compile on Solaris 9, both without and with -pthread:
>
> $ gcc -o selfkill selfkill.c
> $ gcc -pthread -o selfkill-mt selfkill.c
>
> * Run the identical binaries and versions of gdb (7.11 here) on both
>   Solaris 9 and Solaris 10:
>
> $ gdb -q --batch -ex run selfkill{,-mt}
>
> ** Solaris 9, selfkill:
>
> Program received signal SIGINT, Interrupt.
> 0xb5d54186 in _libc_kill () from /usr/lib/libc.so.1
>
> ** Solaris 9, selfkill-mt:
>
> [Thread debugging using libthread_db enabled]
> [New Thread 1 (LWP 1)]
> [New LWP    2        ]
> [New Thread 2 (LWP 2)]
>
> Thread 2 received signal SIGINT, Interrupt.
> [Switching to Thread 1 (LWP 1)]
> 0xb5c9fad5 in _lwp_wait () from /usr/lib/libc.so.1
>
> ** Solaris 10, selfkill:
>
> [Thread debugging using libthread_db enabled]
> [New Thread 1 (LWP 1)]
>
> Thread 2 received signal SIGINT, Interrupt.
> [Switching to Thread 1 (LWP 1)]
> 0xfef0c165 in kill () from /lib/libc.so.1
>
> ** Solaris 10, selfkill-mt:
>
> [Thread debugging using libthread_db enabled]
> [New Thread 1 (LWP 1)]
> [New LWP    2        ]
> [New Thread 2 (LWP 2)]
>
> Thread 2 received signal SIGINT, Interrupt.
> [Switching to Thread 1 (LWP 1)]
> 0xfeedca05 in __lwp_wait () from /lib/libc.so.1
>
> ** Trying the same on Linux/x86_64, one sees the same behaviour as on
>    Solaris 9: non-threaded and threaded programs behave differently.
>
> * As you can see, on Solaris 10 even the not explicitly threaded version
>   of the test is shown as threaded, explaining the difference in the
>   "... received signal" messages.
>
>   This is a consequence of the Thread Model Unification Project in
>   Solaris 10, which removed the difference between non-threaded and
>   threaded processes.  This has nothing to do with the removal of the
>   pre-Solaris 9 MxN multilevel thread model as I'd originally
>   suspected.

I tried to take a look at this a little.  The only Solaris machines I
have access to run on Sparc, not x86-64, but hopefully should still have
much the same behaviour.

I did manage to (eventually) build GDB on one of these machines, but,
I'm not sure if I built it wrong, or if the Sparc/Solaris support is
just bad, but GDB was crashing all over the place with assertion
failures.

Still, with some persistence I could see the behaviour you observe.

Now, I've not done any Solaris work in >10years, so I don't claim to be
any kind of expert, but I wonder if the fix you're proposing here isn't
simply hiding a GDB bug.

I wrote a simple test program that starts 3 worker threads and then
blocks.  Here's the 'info threads' output for GNU/Linux:

  (gdb) info threads 
    Id   Target Id                                   Frame 
  * 1    Thread 0x7ffff7da3740 (LWP 2243115) "thr.x" 0x00007ffff7e74215 in nanosleep () from /lib64/libc.so.6
    2    Thread 0x7ffff7da2700 (LWP 2243118) "thr.x" 0x00007ffff7e74215 in nanosleep () from /lib64/libc.so.6
    3    Thread 0x7ffff75a1700 (LWP 2243119) "thr.x" 0x00007ffff7e74215 in nanosleep () from /lib64/libc.so.6
    4    Thread 0x7ffff6da0700 (LWP 2243120) "thr.x" 0x00007ffff7e74215 in nanosleep () from /lib64/libc.so.6

What you'd expect.  Now here's the same on Solaris:

  (gdb) info threads
    Id   Target Id         Frame 
  * 1    LWP    1          0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
    2    LWP    4          0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
    3    LWP    3          0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
    4    LWP    2          0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
    5    Thread 1 (LWP 1)  0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
    6    Thread 2 (LWP 2)  0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
    7    Thread 3 (LWP 3)  0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
    8    Thread 4 (LWP 4)  0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1

This is inline with what you describe, but, I think we can all agree,
this seems a little odd; are there really 8 thread like things running
as part of this process?  The output of `ps -aL` would suggest not:

  $ ps -aL
     PID   LWP TTY        LTIME CMD
    3855     1 pts/6       0:00 thr.x
    3855     2 pts/6       0:00 thr.x
    3855     3 pts/6       0:00 thr.x
    3855     4 pts/6       0:00 thr.x
    4132     1 pts/8       0:00 ps

And also, when I run the same test application using the dbx debugger, I
see this:

  (dbx) threads
  *>    t@1  a  l@1   ?()   signal SIGINT in  ___nanosleep() 
        t@2  a  l@2   thread_worker()   running          in  ___nanosleep() 
        t@3  a  l@3   thread_worker()   running          in  ___nanosleep() 
        t@4  a  l@4   thread_worker()   running          in  ___nanosleep() 

So here, the process is represented as just 4 thread like things.

So, why does GDB think there are 8, while every tools that ships with
Solaris seems to think there are 4?  My guess, is that is has something
to do with the thread lookup code in sol-thread.c, and/or the operation
of libthread-db.

So, what I run your original selfkill test application, and use GDB to
break on GDB's add_thread_with_info function (the thing that is
responsible for printing the "New Thread ..." message), here's what I
see:

  (gdb) bt
  #0  add_thread_with_info (targ=targ@entry=0x940678 <the_procfs_target>, ptid=..., priv=priv@entry=0x0) at ../../src/gdb/thread.c:290
  #1  0x0053b61c in add_thread (targ=0x940678 <the_procfs_target>, ptid=...) at ../../src/gdb/thread.c:305
  #2  0x004ab5f4 in sol_thread_target::wait (this=<optimized out>, ptid=..., ourstatus=0xffbff620, options=...) at ../../src/gdb/sol-thread.c:459
  #3  0x0053019c in target_wait (ptid=..., status=status@entry=0xffbff620, options=...) at ../../src/gdb/target.c:2598
  #4  0x00395478 in do_target_wait_1 (inf=inf@entry=0x969288, ptid=..., status=status@entry=0xffbff620, options=<error reading variable: Cannot access memory at address 0x0>) at ../../src/gdb/infrun.c:3763
  #5  0x003a7e8c in <lambda(inferior*)>::operator() (inf=0x969288, __closure=<synthetic pointer>) at ../../src/gdb/infrun.c:3822
  #6  do_target_wait (options=..., ecs=0xffbff600) at ../../src/gdb/infrun.c:3841
  #7  fetch_inferior_event () at ../../src/gdb/infrun.c:4201
  #8  0x001b0bd8 in check_async_event_handlers () at ../../src/gdb/async-event.c:337
  #9  0x006c4e3c in gdb_do_one_event (mstimeout=mstimeout@entry=-1) at ../../src/gdbsupport/event-loop.cc:221
  #10 0x003d7ea0 in start_event_loop () at ../../src/gdb/main.c:411
  #11 captured_command_loop () at ../../src/gdb/main.c:471
  #12 0x003d9fa8 in captured_main (data=0xffbff84c) at ../../src/gdb/main.c:1330
  #13 gdb_main (args=args@entry=0xffbff84c) at ../../src/gdb/main.c:1345
  #14 0x006f7c5c in main (argc=4, argv=0xffbff8bc) at ../../src/gdb/gdb.c:32
  (gdb) frame 2
  #2  0x004ab5f4 in sol_thread_target::wait (this=<optimized out>, ptid=..., ourstatus=0xffbff620, options=...) at ../../src/gdb/sol-thread.c:459
  459                   add_thread (proc_target, rtnval);
  (gdb) p rtnval
  $1 = {m_pid = 7218, m_lwp = 0, m_tid = 1}
  (gdb) p current_inferior_.m_obj->thread_list.m_front.ptid 
  $2 = {m_pid = 7218, m_lwp = 1, m_tid = 0}
  (gdb) 

What this is telling us, is that, when GDB stopped after the ::wait
call, the ptid_t it got back was '{m_pid = 7218, m_lwp = 0, m_tid = 1}',
however, the original thread that GDB found when starting the
application was '{m_pid = 7218, m_lwp = 1, m_tid = 0}'.

This difference is what causes GDB to add the new thread.

My guess is that this m_lwp/m_tid difference is a bug somewhere in the
stack, and that really, we should be seeing the same ptid_t here.  If we
did, then GDB would not add the new thread, and the test messages would
not change.

What are your thoughts on this analysis?

Thanks,
Andrew


  reply	other threads:[~2022-11-24 16:01 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-05 12:04 Rainer Orth
2019-09-13 12:47 ` Rainer Orth
2019-09-13 22:18 ` Andrew Burgess
2022-11-22  9:18   ` Rainer Orth
2022-11-24 16:01     ` Andrew Burgess [this message]
2022-11-24 17:06       ` Andrew Burgess
2022-11-25 10:04         ` Rainer Orth
2022-11-25  9:58       ` Rainer Orth

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87sfi82vg4.fsf@redhat.com \
    --to=aburgess@redhat.com \
    --cc=andrew.burgess@embecosm.com \
    --cc=gdb-patches@sourceware.org \
    --cc=ro@CeBiTec.Uni-Bielefeld.DE \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).