From: Pedro Alves
Subject: Re: [PATCH 03/31] gdb/linux: Delete all other LWPs immediately on ptrace exec event
To: Andrew Burgess, gdb-patches@sourceware.org
Date: Tue, 4 Apr 2023 14:57:01 +0100
Message-ID: <7346b585-adb2-743e-fdaf-213fc595f93b@palves.net>
In-Reply-To: <87ileucg5f.fsf@redhat.com>
References: <20221212203101.1034916-1-pedro@palves.net> <20221212203101.1034916-4-pedro@palves.net> <87ileucg5f.fsf@redhat.com>

Hi Andrew,

Took me a bit to find time to investigate this.  See below.
On 2023-03-21 2:50 p.m., Andrew Burgess wrote:
> Pedro Alves writes:
>
>> I noticed that after a following patch ("Step over clone syscall w/
>> breakpoint, TARGET_WAITKIND_THREAD_CLONED"), the
>> gdb.threads/step-over-exec.exp testcase was passing cleanly, but still, we'd
>> end up with four new unexpected GDB core dumps:
>>
>>   === gdb Summary ===
>>
>>   # of unexpected core files  4
>>   # of expected passes        48
>>
>> That said patch is making the pre-existing
>> gdb.threads/step-over-exec.exp testcase (almost silently) expose a
>> latent problem in gdb/linux-nat.c, resulting in a GDB crash when:
>>
>>   #1 - a non-leader thread execs
>>   #2 - the post-exec program stops somewhere
>>   #3 - you kill the inferior
>>
>> Instead of #3 directly, the testcase just returns, which ends up in
>> gdb_exit, tearing down GDB, which kills the inferior, and is thus
>> equivalent to #3 above.
>>
>> Vis:
>>
>>   $ gdb --args ./gdb /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true
>>   ...
>>   (top-gdb) r
>>   ...
>>   (gdb) b main
>>   ...
>>   (gdb) r
>>   ...
>>   Breakpoint 1, main (argc=1, argv=0x7fffffffdb88) at /home/pedro/gdb/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/step-over-exec.c:69
>>   69        argv0 = argv[0];
>>   (gdb) c
>>   Continuing.
>>   [New Thread 0x7ffff7d89700 (LWP 2506975)]
>>   Other going in exec.
>>   Exec-ing /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true-execd
>>   process 2506769 is executing new program: /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true-execd
>>
>>   Thread 1 "step-over-exec-" hit Breakpoint 1, main () at /home/pedro/gdb/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/step-over-exec-execd.c:28
>>   28        foo ();
>>   (gdb) k
>>   ...
>>   Thread 1 "gdb" received signal SIGSEGV, Segmentation fault.
>>   0x000055555574444c in thread_info::has_pending_waitstatus (this=0x0) at ../../src/gdb/gdbthread.h:393
>>   393       return m_suspend.waitstatus_pending_p;
>>   (top-gdb) bt
>>   #0  0x000055555574444c in thread_info::has_pending_waitstatus (this=0x0) at ../../src/gdb/gdbthread.h:393
>>   #1  0x0000555555a884d1 in get_pending_child_status (lp=0x5555579b8230, ws=0x7fffffffd130) at ../../src/gdb/linux-nat.c:1345
>>   #2  0x0000555555a8e5e6 in kill_unfollowed_child_callback (lp=0x5555579b8230) at ../../src/gdb/linux-nat.c:3564
>>   #3  0x0000555555a92a26 in gdb::function_view::bind(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::operator()(gdb::fv_detail::erased_callable, lwp_info*) const (this=0x0, ecall=..., args#0=0x5555579b8230) at ../../src/gdb/../gdbsupport/function-view.h:284
>>   #4  0x0000555555a92a51 in gdb::function_view::bind(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::_FUN(gdb::fv_detail::erased_callable, lwp_info*) () at ../../src/gdb/../gdbsupport/function-view.h:278
>>   #5  0x0000555555a91f84 in gdb::function_view::operator()(lwp_info*) const (this=0x7fffffffd210, args#0=0x5555579b8230) at ../../src/gdb/../gdbsupport/function-view.h:247
>>   #6  0x0000555555a87072 in iterate_over_lwps(ptid_t, gdb::function_view) (filter=..., callback=...) at ../../src/gdb/linux-nat.c:864
>>   #7  0x0000555555a8e732 in linux_nat_target::kill (this=0x55555653af40 ) at ../../src/gdb/linux-nat.c:3590
>>   #8  0x0000555555cfdc11 in target_kill () at ../../src/gdb/target.c:911
>>   ...
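(For anyone who wants to reproduce the scenario without the testsuite: the
shape of the test program is simple.  Here's a minimal standalone sketch --
hypothetical, not the actual gdb.threads/step-over-exec.c source -- where a
non-leader thread execs.  /bin/true stands in for the exec'd binary the real
testcase uses.)

  /* Hypothetical reproducer sketch (not the actual testcase source): the
     second -- non-leader -- thread execs, so the pre-exec leader LWP's id
     is silently taken over.  /bin/true is a placeholder for the exec'd
     program.  Build with e.g. "gcc -pthread repro.c".  */
  #include <pthread.h>
  #include <stdio.h>
  #include <unistd.h>

  static void *
  exec_thread (void *arg)
  {
    char *const child_argv[] = { (char *) "/bin/true", NULL };

    printf ("Other going in exec.\n");
    execv (child_argv[0], child_argv);
    perror ("execv");            /* Only reached if the exec fails.  */
    return NULL;
  }

  int
  main (void)
  {
    pthread_t thr;

    pthread_create (&thr, NULL, exec_thread, NULL);
    pthread_join (&thr, NULL);   /* Never returns if the exec succeeds.  */
    return 0;
  }

Run it under gdb, continue past the exec, then "kill", and you're in
scenario #1..#3 above.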
>
> It wasn't 100% clear to me if the above session was supposed to show a
> failure with GDB prior to *this* commit, or was a demonstration of what
> would happen if this commit is skipped, and the later commits applied.

Yes, prior to this commit.

>
> I thought it was the second case, but I was so unsure that I tried the
> reproducer anyway.  Just in case I'm wrong, the above example doesn't
> seem to fail prior to this commit.

This surprised me, and when I tried it myself, I was even more surprised,
for I couldn't reproduce it either!  But I figured it out.

I'm usually using Ubuntu 22.04 for development nowadays, and on that system
I indeed can't reproduce it.  Right after the exec, GDB traps a load event
for "libc.so.6", which leads to GDB trying to open libthread_db for the
post-exec inferior, and it succeeds.  When we load libthread_db, we call
linux_stop_and_wait_all_lwps, which, as the name suggests, stops all LWPs
and then waits to see their stops.  While doing this, GDB detects that the
pre-exec stale LWP is gone, and deletes it.  The logs show:

  [linux-nat] linux_nat_wait_1: waitpid 1725529 received SIGTRAP - Trace/breakpoint trap (stopped)
  [linux-nat] save_stop_reason: 1725529.1725529.0 stopped by software breakpoint
  [linux-nat] linux_nat_wait_1: waitpid(-1, ...) returned 0, ERRNO-OK
  [linux-nat] resume_stopped_resumed_lwps: NOT resuming LWP 1725529.1725658.0, not stopped
  [linux-nat] resume_stopped_resumed_lwps: NOT resuming LWP 1725529.1725529.0, has pending status
  [linux-nat] linux_nat_wait_1: trap ptid is 1725529.1725529.0.
  [linux-nat] linux_nat_wait_1: exit
  [linux-nat] stop_callback: kill 1725529.1725658.0 ****
  [linux-nat] stop_callback: lwp kill -1 No such process
  [linux-nat] wait_lwp: 1725529.1725658.0 vanished.

And the backtrace is:

  (top-gdb) bt
  #0  wait_lwp (lp=0x555556f37350) at ../../src/gdb/linux-nat.c:2069
  #1  0x0000555555aa8fbf in stop_wait_callback (lp=0x555556f37350) at ../../src/gdb/linux-nat.c:2375
  #2  0x0000555555ab12b3 in gdb::function_view::bind(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::operator()(gdb::fv_detail::erased_callable, lwp_info*) const (__closure=0x0, ecall=..., args#0=0x555556f37350) at ../../src/gdb/../gdbsupport/function-view.h:326
  #3  0x0000555555ab12e2 in gdb::function_view::bind(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::_FUN(gdb::fv_detail::erased_callable, lwp_info*) () at ../../src/gdb/../gdbsupport/function-view.h:320
  #4  0x0000555555ab0610 in gdb::function_view::operator()(lwp_info*) const (this=0x7fffffffca90, args#0=0x555556f37350) at ../../src/gdb/../gdbsupport/function-view.h:289
  #5  0x0000555555aa4c2d in iterate_over_lwps(ptid_t, gdb::function_view) (filter=..., callback=...) at ../../src/gdb/linux-nat.c:867
  #6  0x0000555555aa8a03 in linux_stop_and_wait_all_lwps () at ../../src/gdb/linux-nat.c:2229
  #7  0x0000555555ac8525 in try_thread_db_load_1 (info=0x555556a66dd0) at ../../src/gdb/linux-thread-db.c:923
  #8  0x0000555555ac89d5 in try_thread_db_load (library=0x5555560eca27 "libthread_db.so.1", check_auto_load_safe=false) at ../../src/gdb/linux-thread-db.c:1024
  #9  0x0000555555ac8eda in try_thread_db_load_from_sdir () at ../../src/gdb/linux-thread-db.c:1108
  #10 0x0000555555ac9278 in thread_db_load_search () at ../../src/gdb/linux-thread-db.c:1163
  #11 0x0000555555ac9518 in thread_db_load () at ../../src/gdb/linux-thread-db.c:1225
  #12 0x0000555555ac95e1 in check_for_thread_db () at ../../src/gdb/linux-thread-db.c:1268
  #13 0x0000555555ac9657 in thread_db_new_objfile (objfile=0x555556943ed0) at ../../src/gdb/linux-thread-db.c:1297
  #14 0x000055555569e2d2 in std::__invoke_impl (__f=@0x5555567925d8: 0x555555ac95e8 ) at /usr/include/c++/11/bits/invoke.h:61
  #15 0x000055555569c44a in std::__invoke_r (__fn=@0x5555567925d8: 0x555555ac95e8 ) at /usr/include/c++/11/bits/invoke.h:111
  #16 0x0000555555699d69 in std::_Function_handler::_M_invoke(std::_Any_data const&, objfile*&&) (__functor=..., __args#0=@0x7fffffffce50: 0x555556943ed0) at /usr/include/c++/11/bits/std_function.h:290
  #17 0x0000555555b5f48b in std::function::operator()(objfile*) const (this=0x5555567925d8, __args#0=0x555556943ed0) at /usr/include/c++/11/bits/std_function.h:590
  #18 0x0000555555b5eba4 in gdb::observers::observable::notify (this=0x5555565b5680 , args#0=0x555556943ed0) at ../../src/gdb/../gdbsupport/observable.h:166
  #19 0x0000555555cdd85b in symbol_file_add_with_addrs (abfd=..., name=0x5555569794e0 "/lib/x86_64-linux-gnu/libc.so.6", add_flags=..., addrs=0x7fffffffd0c0, flags=..., parent=0x0) at ../../src/gdb/symfile.c:1131
  #20 0x0000555555cdd9c5 in symbol_file_add_from_bfd (abfd=..., name=0x5555569794e0 "/lib/x86_64-linux-gnu/libc.so.6", add_flags=..., addrs=0x7fffffffd0c0, flags=..., parent=0x0) at ../../src/gdb/symfile.c:1167
  #21 0x0000555555c9dd69 in solib_read_symbols (so=0x5555569792d0, flags=...) at ../../src/gdb/solib.c:730
  #22 0x0000555555c9e7b7 in solib_add (pattern=0x0, from_tty=0, readsyms=1) at ../../src/gdb/solib.c:1041
  #23 0x0000555555c9f61d in handle_solib_event () at ../../src/gdb/solib.c:1315
  #24 0x0000555555729c26 in bpstat_stop_status (aspace=0x555556606800, bp_addr=0x7ffff7fe7278, thread=0x555556816bd0, ws=..., stop_chain=0x0) at ../../src/gdb/breakpoint.c:5702
  #25 0x0000555555a62e41 in handle_signal_stop (ecs=0x7fffffffd670) at ../../src/gdb/infrun.c:6517
  #26 0x0000555555a61479 in handle_inferior_event (ecs=0x7fffffffd670) at ../../src/gdb/infrun.c:6000
  #27 0x0000555555a5c7b5 in fetch_inferior_event () at ../../src/gdb/infrun.c:4403
  #28 0x0000555555a35b65 in inferior_event_handler (event_type=INF_REG_EVENT) at ../../src/gdb/inf-loop.c:41
  #29 0x0000555555aae0c9 in handle_target_event (error=0, client_data=0x0) at ../../src/gdb/linux-nat.c:4231

Now, when I try the same on a Fedora 32 machine, I do see the GDB crash, due
to the stale LWP still being in the LWP list with no corresponding
thread_info.  On that machine, glibc predates the changes that make it
possible to use libthread_db with non-threaded processes, so
try_thread_db_load doesn't manage to open a connection to libthread_db.  We
therefore never reach linux_stop_and_wait_all_lwps, the stale LWP is not
deleted, and a subsequent "kill" command crashes.
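(As an aside, the "lwp kill -1 No such process" and "vanished" lines above
are just the kernel reporting that the pre-exec thread id no longer exists
once the non-leader thread took over the leader's id.  Here's a tiny
standalone sketch -- not GDB code -- of the kind of existence probe
stop_callback effectively ends up making; the pid/tid numbers are
placeholders copied from the log above.)

  /* Probe whether an LWP still exists by sending it signal 0 with tgkill.
     After a non-leader exec, the old tid yields ESRCH ("No such process"),
     matching the log above.  Sketch only; pid/tid are placeholders.  */
  #define _GNU_SOURCE
  #include <errno.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static void
  probe_lwp (pid_t pid, pid_t tid)
  {
    if (syscall (SYS_tgkill, pid, tid, 0) == -1)
      printf ("LWP %d.%d: %s\n", (int) pid, (int) tid, strerror (errno));
    else
      printf ("LWP %d.%d still exists\n", (int) pid, (int) tid);
  }

  int
  main (void)
  {
    probe_lwp (1725529, 1725658);   /* The stale pre-exec LWP from the log.  */
    return 0;
  }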
I wrote that patch originally on an Ubuntu 20.04 machine (vs the Ubuntu
22.04 I have now), and that version must also predate the glibc change, and
thus behave like this Fedora 32 box.  You are very likely using a newer
Fedora that has the glibc change.

>
>> The root of the problem is that when a non-leader LWP execs, it just
>> changes its tid to the tgid, replacing the pre-exec leader thread,
>> becoming the new leader.  There's no thread exit event for the execing
>> thread.  It's as if the old pre-exec LWP vanishes without trace.  The
>> ptrace man page says:
>>
>>   "PTRACE_O_TRACEEXEC (since Linux 2.5.46)
>>           Stop the tracee at the next execve(2).  A waitpid(2) by the
>>           tracer will return a status value such that
>>
>>             status>>8 == (SIGTRAP | (PTRACE_EVENT_EXEC<<8))
>>
>>           If the execing thread is not a thread group leader, the thread
>>           ID is reset to thread group leader's ID before this stop.
>>           Since Linux 3.0, the former thread ID can be retrieved with
>>           PTRACE_GETEVENTMSG."
>>
>> When the core of GDB processes an exec event, it deletes all the
>> threads of the inferior.  But that is too late -- deleting a thread
>> does not delete the corresponding LWP, so we end up leaving the
>> pre-exec non-leader LWP stale in the LWP list.  That's what leads to
>> the crash above -- linux_nat_target::kill iterates over all LWPs, and
>> after the patch in question, that code will look for the corresponding
>> thread_info for each LWP.  For the pre-exec non-leader LWP still
>> listed, it won't find one.
>>
>> This patch fixes it by deleting the pre-exec non-leader LWP (and
>> thread) from the LWP/thread lists as soon as we get an exec event out
>> of ptrace.
>
> Given that we don't have a test *right now* for this issue, and instead
> rely on a future patch not failing, I wondered if there was any way
> that we could trigger a failure.
>
> So I was poking around looking for places where we iterate over the
> all_lwps() list, wondering which one we could trigger to cause a
> failure...
>
> ... and then I thought: why not just have GDB tell us that the
> all_lwps() list is broken?
>
> So I hacked up a new 'maint info linux-lwps' command.  It's not very
> interesting right now; here's the output in a multi-threaded inferior
> prior to the exec:
>
>   (gdb) maintenance info linux-lwps
>   LWP Ptid            Thread ID
>   1707218.1707239.0   2
>   1707218.1707218.0   1
>
> And in your failure case (after the exec):
>
>   (gdb) maintenance info linux-lwps
>   LWP Ptid            Thread ID
>   1708883.1708895.0   None
>   1708883.1708883.0   1
>
> And then we can check this from the testscript, and now we have a test
> that fails before this commit, and passes afterwards.
>
> And in the future we might find other information we want to add to the
> new maintenance command.
>
> What are your thoughts on including this, or something like this, with
> this commit?  My patch, which applies on top of this commit, is included
> at the end of this email.  Please feel free to take any changes that you
> feel add value.

I'm totally fine with such a command, though the test I had added covers as
much as it would: the "kill" command fails whenever the maint command would
fail, and passes whenever it would pass.  But I'll incorporate it.
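(Footnote, for anyone not familiar with the ptrace side quoted above: the
sketch below -- a standalone illustration, not GDB code -- shows a tracer
catching the PTRACE_EVENT_EXEC stop and reading the former thread id with
PTRACE_GETEVENTMSG.  It only traces the initial thread; a real debugger
would also set PTRACE_O_TRACECLONE to follow the other threads, which is
where the non-leader-exec case comes from.)

  /* Minimal tracer sketch: fork the program given on the command line,
     enable exec tracing, and when the PTRACE_EVENT_EXEC stop arrives,
     retrieve the pre-exec thread id with PTRACE_GETEVENTMSG.  */
  #include <signal.h>
  #include <stdio.h>
  #include <sys/ptrace.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int
  main (int argc, char **argv)
  {
    if (argc < 2)
      {
        fprintf (stderr, "usage: %s PROGRAM [ARGS...]\n", argv[0]);
        return 1;
      }

    pid_t child = fork ();
    if (child == 0)
      {
        ptrace (PTRACE_TRACEME, 0, NULL, NULL);
        raise (SIGSTOP);                    /* Let the tracer set options.  */
        execv (argv[1], &argv[1]);
        _exit (127);
      }

    int status;
    waitpid (child, &status, 0);            /* Initial SIGSTOP.  */
    ptrace (PTRACE_SETOPTIONS, child, NULL, (void *) PTRACE_O_TRACEEXEC);
    ptrace (PTRACE_CONT, child, NULL, NULL);

    while (waitpid (child, &status, 0) > 0 && !WIFEXITED (status))
      {
        if (WIFSTOPPED (status)
            && status >> 8 == (SIGTRAP | (PTRACE_EVENT_EXEC << 8)))
          {
            unsigned long former_tid = 0;

            /* Since Linux 3.0 this is the execing thread's old tid; for a
               non-leader exec it differs from the reported (leader) id.  */
            ptrace (PTRACE_GETEVENTMSG, child, NULL, &former_tid);
            printf ("exec event in %d, former tid %lu\n",
                    (int) child, former_tid);
          }
        ptrace (PTRACE_CONT, child, NULL, NULL);
      }
    return 0;
  }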