From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=AZ7P=3Y=redhat.com=aburgess@sourceware.org>
Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124])
	by sourceware.org (Postfix) with ESMTPS id 44CAE382FC96
	for <gdb-patches@sourceware.org>; Thu, 24 Nov 2022 17:06:36 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 44CAE382FC96
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=redhat.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=redhat.com
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1669309595;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=2NDqFSQCVjm54N+U30ckCJswdRiAcrDfJuYIhvQJCyg=;
	b=XjgLJ+EFFN9mg5UctN4cmRvoJsHsb68zIpiazgqUreSIIQeUSzJQH7//Lr7c/XA8rxoXNk
	JjYx9WglVrAaeks07i8QbrbqYIFGBHV17ONYt5dltQbJ6VB7aKGA54goC2PPb8e5sdrsI5
	j00VJtTWmpAxkkOx7OmBW4OeB4pSOjU=
Received: from mail-wm1-f72.google.com (mail-wm1-f72.google.com
 [209.85.128.72]) by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.3, cipher=TLS_AES_128_GCM_SHA256) id
 us-mta-508-XBqHe4UoOFSsqmRY5nAKjQ-1; Thu, 24 Nov 2022 12:06:34 -0500
X-MC-Unique: XBqHe4UoOFSsqmRY5nAKjQ-1
Received: by mail-wm1-f72.google.com with SMTP id h4-20020a1c2104000000b003d01b66fe65so2881901wmh.2
        for <gdb-patches@sourceware.org>; Thu, 24 Nov 2022 09:06:34 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=mime-version:message-id:date:references:in-reply-to:subject:cc:to
         :from:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to;
        bh=2NDqFSQCVjm54N+U30ckCJswdRiAcrDfJuYIhvQJCyg=;
        b=jCIaR+YXJRBCkUNXIlVpfz7omE23gfAmjrleiZbYGwq1aEHVTC9yU/EP2XkWcKuAR3
         A+MdB4jmir0LplIvjP57am9JmRaB3MvxWsbjyP2z90EYc4H7/AeTzuuhGCx0AajWHzxJ
         B5pdo+rJf7PduktJxvUIbC2EfnilICiVpqoj+FnC6I5JWw/UghbRNItCjOGsv8RtvZ6h
         kpwdRO/CE69aH/veOWEIszR4hRqIMmcA2giRIpcaNXJW7idipEppVqdosSvjR4FxCCru
         UKG37Ki5bDAEsgdJP3/SBW1NUn/Ubn44LQTyhwJTXykpQtPPjRWRiw7C/d/hT7n1Fqgo
         9aQA==
X-Gm-Message-State: ANoB5plEBbsKc0YSTvPuBSMi3C25mpw8Fd37rPuT0ZQLJCG5jve7LIPR
	qQD6Pjx8Dw0ot0ApeOcekQqxt8t8K0u+7NKqPWY47EeO+/MVBNfWktMcLdUlXqlX66ZlWVOg1M9
	GbWbGQg+/UUrolXzNMpMGlg==
X-Received: by 2002:adf:ef45:0:b0:230:c987:138 with SMTP id c5-20020adfef45000000b00230c9870138mr9630635wrp.518.1669309593385;
        Thu, 24 Nov 2022 09:06:33 -0800 (PST)
X-Google-Smtp-Source: AA0mqf65FLpeXfmjagyUTpOWfo3FZobnOuSqmt9PGAVd9OEI7KklhU3vdUzWkPqPrkYL+0aPYnmQuQ==
X-Received: by 2002:adf:ef45:0:b0:230:c987:138 with SMTP id c5-20020adfef45000000b00230c9870138mr9630611wrp.518.1669309592934;
        Thu, 24 Nov 2022 09:06:32 -0800 (PST)
Received: from localhost ([31.111.84.238])
        by smtp.gmail.com with ESMTPSA id i8-20020a5d4388000000b00228d52b935asm1809720wrq.71.2022.11.24.09.06.32
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Thu, 24 Nov 2022 09:06:32 -0800 (PST)
From: Andrew Burgess <aburgess@redhat.com>
To: Rainer Orth <ro@CeBiTec.Uni-Bielefeld.DE>, Andrew Burgess
 <andrew.burgess@embecosm.com>
Cc: gdb-patches@sourceware.org
Subject: Re: [PATCH] Fix expected received signal message in testsuite
In-Reply-To: <87sfi82vg4.fsf@redhat.com>
References: <yddlfv3eyq1.fsf@CeBiTec.Uni-Bielefeld.DE>
 <20190913221823.GV6076@embecosm.com>
 <yddmt8jqrdl.fsf@CeBiTec.Uni-Bielefeld.DE> <87sfi82vg4.fsf@redhat.com>
Date: Thu, 24 Nov 2022 17:06:30 +0000
Message-ID: <87o7sw2sfd.fsf@redhat.com>
MIME-Version: 1.0
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Type: text/plain
X-Spam-Status: No, score=-5.7 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE,RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_NONE,TXREP,WEIRD_PORT autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gdb-patches.sourceware.org>

Andrew Burgess <aburgess@redhat.com> writes:

> Rainer Orth <ro@CeBiTec.Uni-Bielefeld.DE> writes:
>
>> Hi Andrew,
>>
>>> * Rainer Orth <ro@CeBiTec.Uni-Bielefeld.DE> [2019-09-05 14:04:06 +0200]:
>>>
>>>> Quite a number of tests FAIL on Solaris due to a mismatch between
>>>> expected and received messages: the testsuite expects something like
>>>> 
>>>> 	Program received signal SIGABRT, Aborted.
>>>> 
>>>> while on Solaris it gets
>>>> 
>>>> 	Thread 2 received signal SIGABRT, Aborted.
>>>> 
>>>> For a simple testcase, info threads shows
>>>> 
>>>> (gdb) info threads 
>>>>   Id   Target Id         Frame 
>>>>   1    LWP    1          main () at /vol/src/gnu/gdb/doc/bugs/ua.c:5
>>>> * 2    Thread 1 (LWP 1)  main () at /vol/src/gnu/gdb/doc/bugs/ua.c:5
>>>> 
>>>> I suspect this is due to support for the old pre-Solaris 9 MxN thread
>>>> model where user level threads were mapped to a different set of lwps.
>>>> 
>>>> For the moment, I'm dealing with this by allowing both forms of the
>>>> message in the testsuite.  The patch is almost completely mechanical,
>>>> with the exception of gdb.base/sigbpt.exp where the introduction of a
>>>> new group in the RE required adjustments in the $expect_out indices.
>>>
>>> I'm a little nervous about just allowing either "Thread" or "Program"
>>> for all tests for all targets.  Maybe others will disagree and think
>>> I'm worrying about nothing, but I wonder if we could be more
>>> conservative by adding a support function into lib/gdb.exp that takes
>>> the name of a signal and returns the string we expect from GDB, which
>>> we can then change based on Solaris/non-Solaris.
>>>
>>> I haven't looked through the patch in enough detail to know if there's
>>> any reason why this wouldn't work, so please push back if you think
>>> the idea is unworkable.
>>
>> sorry for letting the ball drop on this one.  Only recently did I
>> stumble across it again when looking into a related issue and now I
>> finally understand why Solaris is different here.
>>
>> [Thread starting at https://sourceware.org/ml/gdb-patches/2019-09/msg00050.html]
>>
>> * Consider the following testcase:
>>
>> $ cat selfkill.c 
>> #include <sys/types.h>
>> #include <signal.h>
>> #include <unistd.h>
>> #include <pthread.h>
>>
>> void *
>> selfkill (void *arg)
>> {
>>   kill (getpid (), SIGINT);
>>   return NULL;
>> }
>>
>> int
>> main (void)
>> {
>> #ifdef _REENTRANT
>>   pthread_t tid;
>>   pthread_create (&tid, NULL, selfkill, NULL);
>>   pthread_join (tid, NULL);
>> #else
>>   selfkill (NULL);
>> #endif
>>   return 0;
>> }
>>
>> * Now compile on Solaris 9, both without and with -pthread:
>>
>> $ gcc -o selfkill selfkill.c
>> $ gcc -pthread -o selfkill-mt selfkill.c
>>
>> * Run the identical binaries and versions of gdb (7.11 here) on both
>>   Solaris 9 and Solaris 10:
>>
>> $ gdb -q --batch -ex run selfkill{,-mt}
>>
>> ** Solaris 9, selfkill:
>>
>> Program received signal SIGINT, Interrupt.
>> 0xb5d54186 in _libc_kill () from /usr/lib/libc.so.1
>>
>> ** Solaris 9, selfkill-mt:
>>
>> [Thread debugging using libthread_db enabled]
>> [New Thread 1 (LWP 1)]
>> [New LWP    2        ]
>> [New Thread 2 (LWP 2)]
>>
>> Thread 2 received signal SIGINT, Interrupt.
>> [Switching to Thread 1 (LWP 1)]
>> 0xb5c9fad5 in _lwp_wait () from /usr/lib/libc.so.1
>>
>> ** Solaris 10, selfkill:
>>
>> [Thread debugging using libthread_db enabled]
>> [New Thread 1 (LWP 1)]
>>
>> Thread 2 received signal SIGINT, Interrupt.
>> [Switching to Thread 1 (LWP 1)]
>> 0xfef0c165 in kill () from /lib/libc.so.1
>>
>> ** Solaris 10, selfkill-mt:
>>
>> [Thread debugging using libthread_db enabled]
>> [New Thread 1 (LWP 1)]
>> [New LWP    2        ]
>> [New Thread 2 (LWP 2)]
>>
>> Thread 2 received signal SIGINT, Interrupt.
>> [Switching to Thread 1 (LWP 1)]
>> 0xfeedca05 in __lwp_wait () from /lib/libc.so.1
>>
>> ** Trying the same on Linux/x86_64, one sees the same behaviour as on
>>    Solaris 9: non-threaded and threaded programs behave differently.
>>
>> * As you can see, on Solaris 10 even the not explicitly threaded version
>>   of the test is shown as threaded, explaining the difference in the
>>   "... received signal" messages.
>>
>>   This is a consequence of the Thread Model Unification Project in
>>   Solaris 10, which removed the difference between non-threaded and
>>   threaded processes.  This has nothing to do with the removal of the
>>   pre-Solaris 9 MxN multilevel thread model as I'd originally
>>   suspected.
>
> I tried to take a look at this a little.  The only Solaris machines I
> have access to run on Sparc, not x86-64, but hopefully should still have
> much the same behaviour.
>
> I did (eventually) manage to build GDB on one of these machines, but
> either I built it wrong or the Sparc/Solaris support is just bad,
> because GDB was crashing all over the place with assertion failures.
>
> Still, with some persistence I could see the behaviour you observe.
>
> Now, I've not done any Solaris work in over 10 years, so I don't claim
> to be any kind of expert, but I wonder if the fix you're proposing here
> isn't simply hiding a GDB bug.
>
> I wrote a simple test program that starts 3 worker threads and then
> blocks.  Here's the 'info threads' output for GNU/Linux:
>
>   (gdb) info threads 
>     Id   Target Id                                   Frame 
>   * 1    Thread 0x7ffff7da3740 (LWP 2243115) "thr.x" 0x00007ffff7e74215 in nanosleep () from /lib64/libc.so.6
>     2    Thread 0x7ffff7da2700 (LWP 2243118) "thr.x" 0x00007ffff7e74215 in nanosleep () from /lib64/libc.so.6
>     3    Thread 0x7ffff75a1700 (LWP 2243119) "thr.x" 0x00007ffff7e74215 in nanosleep () from /lib64/libc.so.6
>     4    Thread 0x7ffff6da0700 (LWP 2243120) "thr.x" 0x00007ffff7e74215 in nanosleep () from /lib64/libc.so.6
>
> What you'd expect.  Now here's the same on Solaris:
>
>   (gdb) info threads
>     Id   Target Id         Frame 
>   * 1    LWP    1          0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>     2    LWP    4          0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>     3    LWP    3          0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>     4    LWP    2          0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>     5    Thread 1 (LWP 1)  0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>     6    Thread 2 (LWP 2)  0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>     7    Thread 3 (LWP 3)  0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>     8    Thread 4 (LWP 4)  0xfee4ddd4 in ___nanosleep () from /lib/libc.so.1
>
> This is in line with what you describe but, I think we can all agree,
> it seems a little odd; are there really 8 thread-like things running
> as part of this process?  The output of `ps -aL` would suggest not:
>
>   $ ps -aL
>      PID   LWP TTY        LTIME CMD
>     3855     1 pts/6       0:00 thr.x
>     3855     2 pts/6       0:00 thr.x
>     3855     3 pts/6       0:00 thr.x
>     3855     4 pts/6       0:00 thr.x
>     4132     1 pts/8       0:00 ps
>
> And also, when I run the same test application using the dbx debugger, I
> see this:
>
>   (dbx) threads
>   *>    t@1  a  l@1   ?()   signal SIGINT in  ___nanosleep() 
>         t@2  a  l@2   thread_worker()   running          in  ___nanosleep() 
>         t@3  a  l@3   thread_worker()   running          in  ___nanosleep() 
>         t@4  a  l@4   thread_worker()   running          in  ___nanosleep() 
>
> So here, the process is represented as just 4 thread like things.
>
> So, why does GDB think there are 8, while every tool that ships with
> Solaris seems to think there are 4?  My guess is that it has something
> to do with the thread lookup code in sol-thread.c, and/or the operation
> of libthread_db.
>
> So, when I run your original selfkill test application, and use GDB to
> break on GDB's add_thread_with_info function (the thing that is
> responsible for printing the "New Thread ..." message), here's what I
> see:
>
>   (gdb) bt
>   #0  add_thread_with_info (targ=targ@entry=0x940678 <the_procfs_target>, ptid=..., priv=priv@entry=0x0) at ../../src/gdb/thread.c:290
>   #1  0x0053b61c in add_thread (targ=0x940678 <the_procfs_target>, ptid=...) at ../../src/gdb/thread.c:305
>   #2  0x004ab5f4 in sol_thread_target::wait (this=<optimized out>, ptid=..., ourstatus=0xffbff620, options=...) at ../../src/gdb/sol-thread.c:459
>   #3  0x0053019c in target_wait (ptid=..., status=status@entry=0xffbff620, options=...) at ../../src/gdb/target.c:2598
>   #4  0x00395478 in do_target_wait_1 (inf=inf@entry=0x969288, ptid=..., status=status@entry=0xffbff620, options=<error reading variable: Cannot access memory at address 0x0>) at ../../src/gdb/infrun.c:3763
>   #5  0x003a7e8c in <lambda(inferior*)>::operator() (inf=0x969288, __closure=<synthetic pointer>) at ../../src/gdb/infrun.c:3822
>   #6  do_target_wait (options=..., ecs=0xffbff600) at ../../src/gdb/infrun.c:3841
>   #7  fetch_inferior_event () at ../../src/gdb/infrun.c:4201
>   #8  0x001b0bd8 in check_async_event_handlers () at ../../src/gdb/async-event.c:337
>   #9  0x006c4e3c in gdb_do_one_event (mstimeout=mstimeout@entry=-1) at ../../src/gdbsupport/event-loop.cc:221
>   #10 0x003d7ea0 in start_event_loop () at ../../src/gdb/main.c:411
>   #11 captured_command_loop () at ../../src/gdb/main.c:471
>   #12 0x003d9fa8 in captured_main (data=0xffbff84c) at ../../src/gdb/main.c:1330
>   #13 gdb_main (args=args@entry=0xffbff84c) at ../../src/gdb/main.c:1345
>   #14 0x006f7c5c in main (argc=4, argv=0xffbff8bc) at ../../src/gdb/gdb.c:32
>   (gdb) frame 2
>   #2  0x004ab5f4 in sol_thread_target::wait (this=<optimized out>, ptid=..., ourstatus=0xffbff620, options=...) at ../../src/gdb/sol-thread.c:459
>   459                   add_thread (proc_target, rtnval);
>   (gdb) p rtnval
>   $1 = {m_pid = 7218, m_lwp = 0, m_tid = 1}
>   (gdb) p current_inferior_.m_obj->thread_list.m_front.ptid 
>   $2 = {m_pid = 7218, m_lwp = 1, m_tid = 0}
>   (gdb) 
>
> What this tells us is that, when GDB stopped after the ::wait call,
> the ptid_t it got back was '{m_pid = 7218, m_lwp = 0, m_tid = 1}'.
> However, the original thread that GDB found when starting the
> application was '{m_pid = 7218, m_lwp = 1, m_tid = 0}'.
>
> This difference is what causes GDB to add the new thread.
>
> My guess is that this m_lwp/m_tid difference is a bug somewhere in the
> stack, and that really, we should be seeing the same ptid_t here.  If we
> did, then GDB would not add the new thread, and the test messages would
> not change.

So, to clarify this a little, the discrepancy seems to arise from
lwp_to_thread, which is where we query libthread_db.

Before this point, in sol_thread_target::wait, we call:

  ptid_t rtnval = beneath ()->wait (ptid, ourstatus, options);

this returns the (maybe?) expected ptid_t {m_pid = 7218, m_lwp = 1,
m_tid = 0}.  When we then call lwp_to_thread, we get back the alternative
ptid_t {m_pid = 7218, m_lwp = 0, m_tid = 1}, where the tid field is set
but the lwp field is not.
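
To make the effect of that mismatch concrete, here is a rough model of
the bookkeeping.  This is only a sketch, not GDB's actual code:
ptid_model, model_lwp_to_thread and model_wait_bookkeeping are made-up
names, and the 1:1 "tid N is lwp N" mapping is an assumption taken from
the 'info threads' output earlier in the thread.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Simplified stand-in for GDB's ptid_t: just the three fields that the
// 'p rtnval' output shows.  Not the real class.
struct ptid_model
{
  int pid;
  long lwp;
  long tid;

  bool operator== (const ptid_model &other) const
  {
    return pid == other.pid && lwp == other.lwp && tid == other.tid;
  }
};

// Hypothetical model of the conversion step: the result has the tid
// field set and the lwp field cleared.  ASSUMPTION: tid N maps to lwp N.
ptid_model
model_lwp_to_thread (const ptid_model &lwp_ptid)
{
  return ptid_model {lwp_ptid.pid, 0, lwp_ptid.lwp};
}

// Hypothetical model of the check done after the wait returns: if the
// reported ptid is not already in the thread list, record it as a new
// thread.  Returns true when a new entry had to be added.
bool
model_wait_bookkeeping (std::vector<ptid_model> &threads,
			const ptid_model &reported)
{
  if (std::find (threads.begin (), threads.end (), reported)
      != threads.end ())
    return false;

  threads.push_back (reported);
  return true;
}
```

Starting from a thread list holding only {7218, 1, 0}, passing the
converted value {7218, 0, 1} to model_wait_bookkeeping adds a second
entry, while passing the unconverted {7218, 1, 0} does not.  That matches
the one-LWP, two-entries picture we see on Solaris.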

I don't know if this indicates a bug in libthread-db, or a bug in GDB.

Thanks,
Andrew