From: Pedro Alves
Subject: Re: [PATCH 03/31] gdb/linux: Delete all other LWPs immediately on ptrace exec event
To: Andrew Burgess, gdb-patches@sourceware.org
Date: Tue, 4 Apr 2023 14:57:01 +0100
Message-ID: <7346b585-adb2-743e-fdaf-213fc595f93b@palves.net>
In-Reply-To: <87ileucg5f.fsf@redhat.com>
References: <20221212203101.1034916-1-pedro@palves.net> <20221212203101.1034916-4-pedro@palves.net> <87ileucg5f.fsf@redhat.com>

Hi Andrew,

Took me a bit to find time to investigate this.  See below.
On 2023-03-21 2:50 p.m., Andrew Burgess wrote:
> Pedro Alves writes:
>
>> I noticed that after a following patch ("Step over clone syscall w/
>> breakpoint, TARGET_WAITKIND_THREAD_CLONED"), the
>> gdb.threads/step-over-exec.exp testcase was passing cleanly, but still, we'd
>> end up with four new unexpected GDB core dumps:
>>
>>   === gdb Summary ===
>>
>>   # of unexpected core files  4
>>   # of expected passes        48
>>
>> That said patch is making the pre-existing
>> gdb.threads/step-over-exec.exp testcase (almost silently) expose a
>> latent problem in gdb/linux-nat.c, resulting in a GDB crash when:
>>
>>   #1 - a non-leader thread execs
>>   #2 - the post-exec program stops somewhere
>>   #3 - you kill the inferior
>>
>> Instead of #3 directly, the testcase just returns, which ends up in
>> gdb_exit, tearing down GDB, which kills the inferior, and is thus
>> equivalent to #3 above.
>>
>> Vis:
>>
>>   $ gdb --args ./gdb /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true
>>   ...
>>   (top-gdb) r
>>   ...
>>   (gdb) b main
>>   ...
>>   (gdb) r
>>   ...
>>   Breakpoint 1, main (argc=1, argv=0x7fffffffdb88) at /home/pedro/gdb/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/step-over-exec.c:69
>>   69        argv0 = argv[0];
>>   (gdb) c
>>   Continuing.
>>   [New Thread 0x7ffff7d89700 (LWP 2506975)]
>>   Other going in exec.
>>   Exec-ing /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true-execd
>>   process 2506769 is executing new program: /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true-execd
>>
>>   Thread 1 "step-over-exec-" hit Breakpoint 1, main () at /home/pedro/gdb/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/step-over-exec-execd.c:28
>>   28        foo ();
>>   (gdb) k
>>   ...
>>   Thread 1 "gdb" received signal SIGSEGV, Segmentation fault.
>>   0x000055555574444c in thread_info::has_pending_waitstatus (this=0x0) at ../../src/gdb/gdbthread.h:393
>>   393       return m_suspend.waitstatus_pending_p;
>>   (top-gdb) bt
>>   #0  0x000055555574444c in thread_info::has_pending_waitstatus (this=0x0) at ../../src/gdb/gdbthread.h:393
>>   #1  0x0000555555a884d1 in get_pending_child_status (lp=0x5555579b8230, ws=0x7fffffffd130) at ../../src/gdb/linux-nat.c:1345
>>   #2  0x0000555555a8e5e6 in kill_unfollowed_child_callback (lp=0x5555579b8230) at ../../src/gdb/linux-nat.c:3564
>>   #3  0x0000555555a92a26 in gdb::function_view::bind(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::operator()(gdb::fv_detail::erased_callable, lwp_info*) const (this=0x0, ecall=..., args#0=0x5555579b8230) at ../../src/gdb/../gdbsupport/function-view.h:284
>>   #4  0x0000555555a92a51 in gdb::function_view::bind(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::_FUN(gdb::fv_detail::erased_callable, lwp_info*) () at ../../src/gdb/../gdbsupport/function-view.h:278
>>   #5  0x0000555555a91f84 in gdb::function_view::operator()(lwp_info*) const (this=0x7fffffffd210, args#0=0x5555579b8230) at ../../src/gdb/../gdbsupport/function-view.h:247
>>   #6  0x0000555555a87072 in iterate_over_lwps(ptid_t, gdb::function_view) (filter=..., callback=...) at ../../src/gdb/linux-nat.c:864
>>   #7  0x0000555555a8e732 in linux_nat_target::kill (this=0x55555653af40 ) at ../../src/gdb/linux-nat.c:3590
>>   #8  0x0000555555cfdc11 in target_kill () at ../../src/gdb/target.c:911
>>   ...
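(For anyone who wants to reproduce the scenario without the testsuite: the
shape of the test program is simple.  Here's a minimal standalone sketch --
hypothetical, not the actual gdb.threads/step-over-exec.c source -- where a
non-leader thread execs.  /bin/true stands in for the exec'd binary the real
testcase uses.)

  /* Hypothetical reproducer sketch (not the actual testcase source): the
     second -- non-leader -- thread execs, so the pre-exec leader LWP's id
     is silently taken over.  /bin/true is a placeholder for the exec'd
     program.  Build with e.g. "gcc -pthread repro.c".  */
  #include <pthread.h>
  #include <stdio.h>
  #include <unistd.h>

  static void *
  exec_thread (void *arg)
  {
    char *const child_argv[] = { (char *) "/bin/true", NULL };

    printf ("Other going in exec.\n");
    execv (child_argv[0], child_argv);
    perror ("execv");            /* Only reached if the exec fails.  */
    return NULL;
  }

  int
  main (void)
  {
    pthread_t thr;

    pthread_create (&thr, NULL, exec_thread, NULL);
    pthread_join (&thr, NULL);   /* Never returns if the exec succeeds.  */
    return 0;
  }

Run it under gdb, continue past the exec, then "kill", and you're in
scenario #1..#3 above.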
>
> It wasn't 100% clear to me if the above session was supposed to show a
> failure with GDB prior to *this* commit, or was a demonstration of what
> would happen if this commit is skipped, and the later commits applied.

Yes, prior to this commit.

>
> I thought it was the second case, but I was so unsure that I tried the
> reproducer anyway.  Just in case I'm wrong, the above example doesn't
> seem to fail prior to this commit.

This surprised me, and when I tried it myself, I was even more surprised,
for I couldn't reproduce it either!  But I figured it out.

I'm usually using Ubuntu 22.04 for development nowadays, and on that system
I indeed can't reproduce it.  Right after the exec, GDB traps a load event
for "libc.so.6", which leads to GDB trying to open libthread_db for the
post-exec inferior, and it succeeds.  When we load libthread_db, we call
linux_stop_and_wait_all_lwps, which, as the name suggests, stops all LWPs
and then waits to see their stops.  While doing this, GDB detects that the
pre-exec stale LWP is gone, and deletes it.  The logs show:

  [linux-nat] linux_nat_wait_1: waitpid 1725529 received SIGTRAP - Trace/breakpoint trap (stopped)
  [linux-nat] save_stop_reason: 1725529.1725529.0 stopped by software breakpoint
  [linux-nat] linux_nat_wait_1: waitpid(-1, ...) returned 0, ERRNO-OK
  [linux-nat] resume_stopped_resumed_lwps: NOT resuming LWP 1725529.1725658.0, not stopped
  [linux-nat] resume_stopped_resumed_lwps: NOT resuming LWP 1725529.1725529.0, has pending status
  [linux-nat] linux_nat_wait_1: trap ptid is 1725529.1725529.0.
  [linux-nat] linux_nat_wait_1: exit
  [linux-nat] stop_callback: kill 1725529.1725658.0 ****
  [linux-nat] stop_callback: lwp kill -1 No such process
  [linux-nat] wait_lwp: 1725529.1725658.0 vanished.

And the backtrace is:

  (top-gdb) bt
  #0  wait_lwp (lp=0x555556f37350) at ../../src/gdb/linux-nat.c:2069
  #1  0x0000555555aa8fbf in stop_wait_callback (lp=0x555556f37350) at ../../src/gdb/linux-nat.c:2375
  #2  0x0000555555ab12b3 in gdb::function_view::bind(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::operator()(gdb::fv_detail::erased_callable, lwp_info*) const (__closure=0x0, ecall=..., args#0=0x555556f37350) at ../../src/gdb/../gdbsupport/function-view.h:326
  #3  0x0000555555ab12e2 in gdb::function_view::bind(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::_FUN(gdb::fv_detail::erased_callable, lwp_info*) () at ../../src/gdb/../gdbsupport/function-view.h:320
  #4  0x0000555555ab0610 in gdb::function_view::operator()(lwp_info*) const (this=0x7fffffffca90, args#0=0x555556f37350) at ../../src/gdb/../gdbsupport/function-view.h:289
  #5  0x0000555555aa4c2d in iterate_over_lwps(ptid_t, gdb::function_view) (filter=..., callback=...) at ../../src/gdb/linux-nat.c:867
  #6  0x0000555555aa8a03 in linux_stop_and_wait_all_lwps () at ../../src/gdb/linux-nat.c:2229
  #7  0x0000555555ac8525 in try_thread_db_load_1 (info=0x555556a66dd0) at ../../src/gdb/linux-thread-db.c:923
  #8  0x0000555555ac89d5 in try_thread_db_load (library=0x5555560eca27 "libthread_db.so.1", check_auto_load_safe=false) at ../../src/gdb/linux-thread-db.c:1024
  #9  0x0000555555ac8eda in try_thread_db_load_from_sdir () at ../../src/gdb/linux-thread-db.c:1108
  #10 0x0000555555ac9278 in thread_db_load_search () at ../../src/gdb/linux-thread-db.c:1163
  #11 0x0000555555ac9518 in thread_db_load () at ../../src/gdb/linux-thread-db.c:1225
  #12 0x0000555555ac95e1 in check_for_thread_db () at ../../src/gdb/linux-thread-db.c:1268
  #13 0x0000555555ac9657 in thread_db_new_objfile (objfile=0x555556943ed0) at ../../src/gdb/linux-thread-db.c:1297
  #14 0x000055555569e2d2 in std::__invoke_impl (__f=@0x5555567925d8: 0x555555ac95e8 ) at /usr/include/c++/11/bits/invoke.h:61
  #15 0x000055555569c44a in std::__invoke_r (__fn=@0x5555567925d8: 0x555555ac95e8 ) at /usr/include/c++/11/bits/invoke.h:111
  #16 0x0000555555699d69 in std::_Function_handler::_M_invoke(std::_Any_data const&, objfile*&&) (__functor=..., __args#0=@0x7fffffffce50: 0x555556943ed0) at /usr/include/c++/11/bits/std_function.h:290
  #17 0x0000555555b5f48b in std::function::operator()(objfile*) const (this=0x5555567925d8, __args#0=0x555556943ed0) at /usr/include/c++/11/bits/std_function.h:590
  #18 0x0000555555b5eba4 in gdb::observers::observable::notify (this=0x5555565b5680 , args#0=0x555556943ed0) at ../../src/gdb/../gdbsupport/observable.h:166
  #19 0x0000555555cdd85b in symbol_file_add_with_addrs (abfd=..., name=0x5555569794e0 "/lib/x86_64-linux-gnu/libc.so.6", add_flags=..., addrs=0x7fffffffd0c0, flags=..., parent=0x0) at ../../src/gdb/symfile.c:1131
  #20 0x0000555555cdd9c5 in symbol_file_add_from_bfd (abfd=..., name=0x5555569794e0 "/lib/x86_64-linux-gnu/libc.so.6", add_flags=..., addrs=0x7fffffffd0c0, flags=..., parent=0x0) at ../../src/gdb/symfile.c:1167
  #21 0x0000555555c9dd69 in solib_read_symbols (so=0x5555569792d0, flags=...) at ../../src/gdb/solib.c:730
  #22 0x0000555555c9e7b7 in solib_add (pattern=0x0, from_tty=0, readsyms=1) at ../../src/gdb/solib.c:1041
  #23 0x0000555555c9f61d in handle_solib_event () at ../../src/gdb/solib.c:1315
  #24 0x0000555555729c26 in bpstat_stop_status (aspace=0x555556606800, bp_addr=0x7ffff7fe7278, thread=0x555556816bd0, ws=..., stop_chain=0x0) at ../../src/gdb/breakpoint.c:5702
  #25 0x0000555555a62e41 in handle_signal_stop (ecs=0x7fffffffd670) at ../../src/gdb/infrun.c:6517
  #26 0x0000555555a61479 in handle_inferior_event (ecs=0x7fffffffd670) at ../../src/gdb/infrun.c:6000
  #27 0x0000555555a5c7b5 in fetch_inferior_event () at ../../src/gdb/infrun.c:4403
  #28 0x0000555555a35b65 in inferior_event_handler (event_type=INF_REG_EVENT) at ../../src/gdb/inf-loop.c:41
  #29 0x0000555555aae0c9 in handle_target_event (error=0, client_data=0x0) at ../../src/gdb/linux-nat.c:4231

Now, when I try the same on a Fedora 32 machine, I do see the GDB crash, due
to the stale LWP still being in the LWP list with no corresponding
thread_info.  On that machine, glibc predates the changes that make it
possible to use libthread_db with non-threaded processes, so
try_thread_db_load doesn't manage to open a connection to libthread_db.  We
therefore never reach linux_stop_and_wait_all_lwps, the stale LWP is not
deleted, and a subsequent "kill" command crashes.
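(As an aside, the "lwp kill -1 No such process" and "vanished" lines above
are just the kernel reporting that the pre-exec thread id no longer exists
once the non-leader thread took over the leader's id.  Here's a tiny
standalone sketch -- not GDB code -- of the kind of existence probe
stop_callback effectively ends up making; the pid/tid numbers are
placeholders copied from the log above.)

  /* Probe whether an LWP still exists by sending it signal 0 with tgkill.
     After a non-leader exec, the old tid yields ESRCH ("No such process"),
     matching the log above.  Sketch only; pid/tid are placeholders.  */
  #define _GNU_SOURCE
  #include <errno.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static void
  probe_lwp (pid_t pid, pid_t tid)
  {
    if (syscall (SYS_tgkill, pid, tid, 0) == -1)
      printf ("LWP %d.%d: %s\n", (int) pid, (int) tid, strerror (errno));
    else
      printf ("LWP %d.%d still exists\n", (int) pid, (int) tid);
  }

  int
  main (void)
  {
    probe_lwp (1725529, 1725658);   /* The stale pre-exec LWP from the log.  */
    return 0;
  }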
I wrote that patch originally on an Ubuntu 20.04 machine (vs the Ubuntu
22.04 I have now), and that version must also predate the glibc change, and
thus behave like this Fedora 32 box.  You are very likely using a newer
Fedora that has the glibc change.

>
>> The root of the problem is that when a non-leader LWP execs, it just
>> changes its tid to the tgid, replacing the pre-exec leader thread,
>> becoming the new leader.  There's no thread exit event for the execing
>> thread.  It's as if the old pre-exec LWP vanishes without trace.  The
>> ptrace man page says:
>>
>>   "PTRACE_O_TRACEEXEC (since Linux 2.5.46)
>>           Stop the tracee at the next execve(2).  A waitpid(2) by the
>>           tracer will return a status value such that
>>
>>             status>>8 == (SIGTRAP | (PTRACE_EVENT_EXEC<<8))
>>
>>           If the execing thread is not a thread group leader, the thread
>>           ID is reset to thread group leader's ID before this stop.
>>           Since Linux 3.0, the former thread ID can be retrieved with
>>           PTRACE_GETEVENTMSG."
>>
>> When the core of GDB processes an exec event, it deletes all the
>> threads of the inferior.  But that is too late -- deleting a thread
>> does not delete the corresponding LWP, so we end up leaving the
>> pre-exec non-leader LWP stale in the LWP list.  That's what leads to
>> the crash above -- linux_nat_target::kill iterates over all LWPs, and
>> after the patch in question, that code will look for the corresponding
>> thread_info for each LWP.  For the pre-exec non-leader LWP still
>> listed, it won't find one.
>>
>> This patch fixes it by deleting the pre-exec non-leader LWP (and
>> thread) from the LWP/thread lists as soon as we get an exec event out
>> of ptrace.
>
> Given that we don't have a test *right now* for this issue, and instead
> rely on a future patch not failing, I wondered if there was any way
> that we could trigger a failure.
>
> So I was poking around looking for places where we iterate over the
> all_lwps() list, wondering which one we could trigger to cause a
> failure...
>
> ... and then I thought: why not just have GDB tell us that the
> all_lwps() list is broken?
>
> So I hacked up a new 'maint info linux-lwps' command.  It's not very
> interesting right now; here's the output in a multi-threaded inferior
> prior to the exec:
>
>   (gdb) maintenance info linux-lwps
>   LWP Ptid            Thread ID
>   1707218.1707239.0   2
>   1707218.1707218.0   1
>
> And in your failure case (after the exec):
>
>   (gdb) maintenance info linux-lwps
>   LWP Ptid            Thread ID
>   1708883.1708895.0   None
>   1708883.1708883.0   1
>
> And then we can check this from the testscript, and now we have a test
> that fails before this commit, and passes afterwards.
>
> And in the future we might find other information we want to add to the
> new maintenance command.
>
> What are your thoughts on including this, or something like this, with
> this commit?  My patch, which applies on top of this commit, is included
> at the end of this email.  Please feel free to take any changes that you
> feel add value.

I'm totally fine with such a command, though the test I had added covers as
much as it would: the "kill" command fails whenever the maint command would
fail, and passes whenever it would pass.  But I'll incorporate it.
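(Footnote, for anyone not familiar with the ptrace side quoted above: the
sketch below -- a standalone illustration, not GDB code -- shows a tracer
catching the PTRACE_EVENT_EXEC stop and reading the former thread id with
PTRACE_GETEVENTMSG.  It only traces the initial thread; a real debugger
would also set PTRACE_O_TRACECLONE to follow the other threads, which is
where the non-leader-exec case comes from.)

  /* Minimal tracer sketch: fork the program given on the command line,
     enable exec tracing, and when the PTRACE_EVENT_EXEC stop arrives,
     retrieve the pre-exec thread id with PTRACE_GETEVENTMSG.  */
  #include <signal.h>
  #include <stdio.h>
  #include <sys/ptrace.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int
  main (int argc, char **argv)
  {
    if (argc < 2)
      {
        fprintf (stderr, "usage: %s PROGRAM [ARGS...]\n", argv[0]);
        return 1;
      }

    pid_t child = fork ();
    if (child == 0)
      {
        ptrace (PTRACE_TRACEME, 0, NULL, NULL);
        raise (SIGSTOP);                    /* Let the tracer set options.  */
        execv (argv[1], &argv[1]);
        _exit (127);
      }

    int status;
    waitpid (child, &status, 0);            /* Initial SIGSTOP.  */
    ptrace (PTRACE_SETOPTIONS, child, NULL, (void *) PTRACE_O_TRACEEXEC);
    ptrace (PTRACE_CONT, child, NULL, NULL);

    while (waitpid (child, &status, 0) > 0 && !WIFEXITED (status))
      {
        if (WIFSTOPPED (status)
            && status >> 8 == (SIGTRAP | (PTRACE_EVENT_EXEC << 8)))
          {
            unsigned long former_tid = 0;

            /* Since Linux 3.0 this is the execing thread's old tid; for a
               non-leader exec it differs from the reported (leader) id.  */
            ptrace (PTRACE_GETEVENTMSG, child, NULL, &former_tid);
            printf ("exec event in %d, former tid %lu\n",
                    (int) child, former_tid);
          }
        ptrace (PTRACE_CONT, child, NULL, NULL);
      }
    return 0;
  }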