From: Andrew Burgess
To: Pedro Alves, gdb-patches@sourceware.org
Subject: Re: [PATCH 03/31] gdb/linux: Delete all other LWPs immediately on ptrace exec event
In-Reply-To: <7346b585-adb2-743e-fdaf-213fc595f93b@palves.net>
References: <20221212203101.1034916-1-pedro@palves.net> <20221212203101.1034916-4-pedro@palves.net> <87ileucg5f.fsf@redhat.com> <7346b585-adb2-743e-fdaf-213fc595f93b@palves.net>
Date: Fri, 26 May 2023 15:45:59 +0100
Message-ID: <87ttvz5exk.fsf@redhat.com>

Hi Pedro,

Sorry for the delay in looking at this again.  I had to find some time
to investigate this a little more, as my results were still not
aligning with what you reported, but I think I understand what's going
on now...

Pedro Alves writes:

> Hi Andrew,
>
> Took me a bit to find time to investigate this.  See below.
>
> On 2023-03-21 2:50 p.m., Andrew Burgess wrote:
>> Pedro Alves writes:
>>
>>> I noticed that after a following patch ("Step over clone syscall w/
>>> breakpoint, TARGET_WAITKIND_THREAD_CLONED"), the
>>> gdb.threads/step-over-exec.exp was passing cleanly, but still, we'd
>>> end up with four new unexpected GDB core dumps:
>>>
>>>   === gdb Summary ===
>>>
>>>   # of unexpected core files   4
>>>   # of expected passes         48
>>>
>>> That said patch is making the pre-existing
>>> gdb.threads/step-over-exec.exp testcase (almost silently) expose a
>>> latent problem in gdb/linux-nat.c, resulting in a GDB crash when:
>>>
>>>   #1 - a non-leader thread execs
>>>   #2 - the post-exec program stops somewhere
>>>   #3 - you kill the inferior
>>>
>>> Instead of #3 directly, the testcase just returns, which ends up in
>>> gdb_exit, tearing down GDB, which kills the inferior, and is thus
>>> equivalent to #3 above.
>>>
>>> Vis:
>>>
>>>  $ gdb --args ./gdb /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true
>>>  ...
>>>  (top-gdb) r
>>>  ...
>>>  (gdb) b main
>>>  ...
>>>  (gdb) r
>>>  ...
>>>  Breakpoint 1, main (argc=1, argv=0x7fffffffdb88) at /home/pedro/gdb/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/step-over-exec.c:69
>>>  69        argv0 = argv[0];
>>>  (gdb) c
>>>  Continuing.
>>>  [New Thread 0x7ffff7d89700 (LWP 2506975)]
>>>  Other going in exec.
>>>  Exec-ing /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true-execd
>>>  process 2506769 is executing new program: /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true-execd
>>>
>>>  Thread 1 "step-over-exec-" hit Breakpoint 1, main () at /home/pedro/gdb/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/step-over-exec-execd.c:28
>>>  28        foo ();
>>>  (gdb) k
>>>  ...
>>>  Thread 1 "gdb" received signal SIGSEGV, Segmentation fault.
>>>  0x000055555574444c in thread_info::has_pending_waitstatus (this=0x0) at ../../src/gdb/gdbthread.h:393
>>>  393       return m_suspend.waitstatus_pending_p;
>>>  (top-gdb) bt
>>>  #0  0x000055555574444c in thread_info::has_pending_waitstatus (this=0x0) at ../../src/gdb/gdbthread.h:393
>>>  #1  0x0000555555a884d1 in get_pending_child_status (lp=0x5555579b8230, ws=0x7fffffffd130) at ../../src/gdb/linux-nat.c:1345
>>>  #2  0x0000555555a8e5e6 in kill_unfollowed_child_callback (lp=0x5555579b8230) at ../../src/gdb/linux-nat.c:3564
>>>  #3  0x0000555555a92a26 in gdb::function_view::bind(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::operator()(gdb::fv_detail::erased_callable, lwp_info*) const (this=0x0, ecall=..., args#0=0x5555579b8230) at ../../src/gdb/../gdbsupport/function-view.h:284
>>>  #4  0x0000555555a92a51 in gdb::function_view::bind(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::_FUN(gdb::fv_detail::erased_callable, lwp_info*) () at ../../src/gdb/../gdbsupport/function-view.h:278
>>>  #5  0x0000555555a91f84 in gdb::function_view::operator()(lwp_info*) const (this=0x7fffffffd210, args#0=0x5555579b8230) at ../../src/gdb/../gdbsupport/function-view.h:247
>>>  #6  0x0000555555a87072 in iterate_over_lwps(ptid_t, gdb::function_view) (filter=..., callback=...) at ../../src/gdb/linux-nat.c:864
>>>  #7  0x0000555555a8e732 in linux_nat_target::kill (this=0x55555653af40) at ../../src/gdb/linux-nat.c:3590
>>>  #8  0x0000555555cfdc11 in target_kill () at ../../src/gdb/target.c:911
>>>  ...
>>
>> It wasn't 100% clear to me if the above session was supposed to show a
>> failure with GDB prior to *this* commit, or was a demonstration of what
>> would happen if this commit is skipped, and the later commits applied.
>
> Yes, prior to this commit.

OK, but check your backtrace: notice that frame #2 is in
kill_unfollowed_child_callback.  That function is only added in a later
patch in this series.  If we roll back to just this patch then the
crash doesn't reproduce!  But I can explain that...

>
>>
>> I thought it was the second case, but I was so unsure that I tried the
>> reproducer anyway.  Just in case I'm wrong, the above example doesn't
>> seem to fail prior to this commit.
>
> This surprised me, and when I tried it myself, I was even more surprised,
> for I couldn't reproduce it either!
>
> But I figured it out.
>
> I'm usually using Ubuntu 22.04 for development nowadays, and in that system, indeed I can't
> reproduce it.  Right after the exec, GDB traps a load event for "libc.so.6", which leads to
> gdb trying to open libthread_db for the post-exec inferior, and, it succeeds.  When we load
> libthread_db, we call linux_stop_and_wait_all_lwps, which, as the name suggests, stops all lwps,
> and then waits to see their stops.  While doing this, GDB detects that the pre-exec stale
> LWP is gone, and deletes it.
>
> The logs show:
>
> [linux-nat] linux_nat_wait_1: waitpid 1725529 received SIGTRAP - Trace/breakpoint trap (stopped)
> [linux-nat] save_stop_reason: 1725529.1725529.0 stopped by software breakpoint
> [linux-nat] linux_nat_wait_1: waitpid(-1, ...) returned 0, ERRNO-OK
> [linux-nat] resume_stopped_resumed_lwps: NOT resuming LWP 1725529.1725658.0, not stopped
> [linux-nat] resume_stopped_resumed_lwps: NOT resuming LWP 1725529.1725529.0, has pending status
> [linux-nat] linux_nat_wait_1: trap ptid is 1725529.1725529.0.
> [linux-nat] linux_nat_wait_1: exit
> [linux-nat] stop_callback: kill 1725529.1725658.0 ****
> [linux-nat] stop_callback: lwp kill -1 No such process
> [linux-nat] wait_lwp: 1725529.1725658.0 vanished.

When we look at just this patch, the linux_nat_target::kill function
calls kill_unfollowed_fork_children, which, unlike your later updated
kill_unfollowed_child_callback, doesn't require us to look up a
thread_info object.  It is the lookup (and dereference) of the
thread_info object that causes the segfault you are seeing.

Once we skip that code, the current linux_nat_target::kill function
actually starts with exactly the same function calls as
linux_stop_and_wait_all_lwps (maybe we should be calling that
function?).  And so, for _me_, when I 'kill' I end up calling the
stop_callback followed by the stop_wait_callback, which cleans up the
rogue thread just like you see.

>
> And the backtrace is:
>
> (top-gdb) bt
> #0  wait_lwp (lp=0x555556f37350) at ../../src/gdb/linux-nat.c:2069
> #1  0x0000555555aa8fbf in stop_wait_callback (lp=0x555556f37350) at ../../src/gdb/linux-nat.c:2375
> #2  0x0000555555ab12b3 in gdb::function_view::bind(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::operator()(gdb::fv_detail::erased_callable, lwp_info*) const (__closure=0x0, ecall=..., args#0=0x555556f37350) at ../../src/gdb/../gdbsupport/function-view.h:326
> #3  0x0000555555ab12e2 in gdb::function_view::bind(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::_FUN(gdb::fv_detail::erased_callable, lwp_info*) () at ../../src/gdb/../gdbsupport/function-view.h:320
> #4  0x0000555555ab0610 in gdb::function_view::operator()(lwp_info*) const (this=0x7fffffffca90, args#0=0x555556f37350) at ../../src/gdb/../gdbsupport/function-view.h:289
> #5  0x0000555555aa4c2d in iterate_over_lwps(ptid_t, gdb::function_view) (filter=..., callback=...)
>     at ../../src/gdb/linux-nat.c:867
> #6  0x0000555555aa8a03 in linux_stop_and_wait_all_lwps () at ../../src/gdb/linux-nat.c:2229
> #7  0x0000555555ac8525 in try_thread_db_load_1 (info=0x555556a66dd0) at ../../src/gdb/linux-thread-db.c:923
> #8  0x0000555555ac89d5 in try_thread_db_load (library=0x5555560eca27 "libthread_db.so.1", check_auto_load_safe=false) at ../../src/gdb/linux-thread-db.c:1024
> #9  0x0000555555ac8eda in try_thread_db_load_from_sdir () at ../../src/gdb/linux-thread-db.c:1108
> #10 0x0000555555ac9278 in thread_db_load_search () at ../../src/gdb/linux-thread-db.c:1163
> #11 0x0000555555ac9518 in thread_db_load () at ../../src/gdb/linux-thread-db.c:1225
> #12 0x0000555555ac95e1 in check_for_thread_db () at ../../src/gdb/linux-thread-db.c:1268
> #13 0x0000555555ac9657 in thread_db_new_objfile (objfile=0x555556943ed0) at ../../src/gdb/linux-thread-db.c:1297
> #14 0x000055555569e2d2 in std::__invoke_impl (__f=@0x5555567925d8: 0x555555ac95e8) at /usr/include/c++/11/bits/invoke.h:61
> #15 0x000055555569c44a in std::__invoke_r (__fn=@0x5555567925d8: 0x555555ac95e8) at /usr/include/c++/11/bits/invoke.h:111
> #16 0x0000555555699d69 in std::_Function_handler::_M_invoke(std::_Any_data const&, objfile*&&) (__functor=..., __args#0=@0x7fffffffce50: 0x555556943ed0) at /usr/include/c++/11/bits/std_function.h:290
> #17 0x0000555555b5f48b in std::function::operator()(objfile*) const (this=0x5555567925d8, __args#0=0x555556943ed0) at /usr/include/c++/11/bits/std_function.h:590
> #18 0x0000555555b5eba4 in gdb::observers::observable::notify (this=0x5555565b5680, args#0=0x555556943ed0) at ../../src/gdb/../gdbsupport/observable.h:166
> #19 0x0000555555cdd85b in symbol_file_add_with_addrs (abfd=..., name=0x5555569794e0 "/lib/x86_64-linux-gnu/libc.so.6", add_flags=..., addrs=0x7fffffffd0c0, flags=..., parent=0x0) at ../../src/gdb/symfile.c:1131
> #20 0x0000555555cdd9c5 in symbol_file_add_from_bfd (abfd=..., name=0x5555569794e0 "/lib/x86_64-linux-gnu/libc.so.6", add_flags=..., addrs=0x7fffffffd0c0, flags=..., parent=0x0) at ../../src/gdb/symfile.c:1167
> #21 0x0000555555c9dd69 in solib_read_symbols (so=0x5555569792d0, flags=...) at ../../src/gdb/solib.c:730
> #22 0x0000555555c9e7b7 in solib_add (pattern=0x0, from_tty=0, readsyms=1) at ../../src/gdb/solib.c:1041
> #23 0x0000555555c9f61d in handle_solib_event () at ../../src/gdb/solib.c:1315
> #24 0x0000555555729c26 in bpstat_stop_status (aspace=0x555556606800, bp_addr=0x7ffff7fe7278, thread=0x555556816bd0, ws=..., stop_chain=0x0) at ../../src/gdb/breakpoint.c:5702
> #25 0x0000555555a62e41 in handle_signal_stop (ecs=0x7fffffffd670) at ../../src/gdb/infrun.c:6517
> #26 0x0000555555a61479 in handle_inferior_event (ecs=0x7fffffffd670) at ../../src/gdb/infrun.c:6000
> #27 0x0000555555a5c7b5 in fetch_inferior_event () at ../../src/gdb/infrun.c:4403
> #28 0x0000555555a35b65 in inferior_event_handler (event_type=INF_REG_EVENT) at ../../src/gdb/inf-loop.c:41
> #29 0x0000555555aae0c9 in handle_target_event (error=0, client_data=0x0) at ../../src/gdb/linux-nat.c:4231
>
> Now, when I try the same on a Fedora 32 machine, I see the GDB crash due to the stale
> LWP still in the LWP list with no corresponding thread_info.  On this
> machine, glibc predates the changes that make it possible to use libthread_db with
> non-threaded processes, so try_thread_db_load doesn't manage to open a connection
> to libthread_db, and thus we don't end up in linux_stop_and_wait_all_lwps, and thus
> the stale lwp is not deleted.  And so a subsequent "kill" command crashes.
>
> I wrote that patch originally on an Ubuntu 20.04 machine (vs the Ubuntu 22.04 I have now),
> and it must be that that version also predates the glibc change, and thus behaves like
> this Fedora 32 box.  You are very likely using a newer Fedora which has the glibc change.

To my shame, I'm actually running an even older Fedora install.  One
day GDB will be finished, then I'll find time to reinstall :)

So I'm not cleaning up the thread in the libthread_db code.

>
>>
>>>
>>> The root of the problem is that when a non-leader LWP execs, it just
>>> changes its tid to the tgid, replacing the pre-exec leader thread,
>>> becoming the new leader.  There's no thread exit event for the execing
>>> thread.  It's as if the old pre-exec LWP vanishes without trace.  The
>>> ptrace man page says:
>>>
>>>   "PTRACE_O_TRACEEXEC (since Linux 2.5.46)
>>>       Stop the tracee at the next execve(2).  A waitpid(2) by the
>>>       tracer will return a status value such that
>>>
>>>         status>>8 == (SIGTRAP | (PTRACE_EVENT_EXEC<<8))
>>>
>>>       If the execing thread is not a thread group leader, the thread
>>>       ID is reset to thread group leader's ID before this stop.
>>>       Since Linux 3.0, the former thread ID can be retrieved with
>>>       PTRACE_GETEVENTMSG."
>>>
>>> When the core of GDB processes an exec events, it deletes all the
>>> threads of the inferior.  But, that is too late -- deleting the thread
>>> does not delete the corresponding LWP, so we end leaving the pre-exec
>>> non-leader LWP stale in the LWP list.  That's what leads to the crash
>>> above -- linux_nat_target::kill iterates over all LWPs, and after the
>>> patch in question, that code will look for the corresponding
>>> thread_info for each LWP.  For the pre-exec non-leader LWP still
>>> listed, won't find one.
>>>
>>> This patch fixes it, by deleting the pre-exec non-leader LWP (and
>>> thread) from the LWP/thread lists as soon as we get an exec event out
>>> of ptrace.
>>
>> Given that we don't have a test *right now* for this issue, and instead
>> rely on a future patch not failing.  I wondered if there was any way
>> that we could trigger a failure.
>>
>> So I was poking around looking for places where we iterate over the
>> all_lwps() list wondering which we could trigger that might cause a
>> failure...
>>
>> ... and then I thought: why not just have GDB tell us that the
>> all_lwps() list is broken.
>>
>> So I hacked up a new 'maint info linux-lwps' command.  It's not very
>> interesting right now, here's the output in a multi-threaded inferior
>> prior to the exec:
>>
>>   (gdb) maintenance info linux-lwps
>>   LWP Ptid           Thread ID
>>   1707218.1707239.0  2
>>   1707218.1707218.0  1
>>
>> And in your failure case (after the exec):
>>
>>   (gdb) maintenance info linux-lwps
>>   LWP Ptid           Thread ID
>>   1708883.1708895.0  None
>>   1708883.1708883.0  1
>>
>> And then we can check this from the testscript, and now we have a test
>> that fails before this commit, and passes afterwards.
>>
>> And in the future we might find other information we want to add in the
>> new maintenance command.
>>
>> What are your thoughts on including this, or something like this with
>> this commit?  My patch, which applies on top of this commit, is included
>> at the end of this email.  Please feel free to take any changes that you
>> feel add value.
>
> I'm totally fine with such a command, though the test I had added covers
> as much as it would, as the "kill" command fails when the maint command
> would fail, and passes when the maint command passes.  But I'll incorporate
> it.
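(As an aside, for anyone following the thread who is less familiar with
the ptrace side of this: below is a rough, self-contained sketch of the
behaviour the man page text quoted above describes.  It is not GDB
code, just a bare tracer fragment that assumes PTRACE_O_TRACEEXEC was
already set on the tracee and skips all error handling; the
handle_exec_stop helper name is made up for the example.)

  /* Minimal sketch, not GDB code: how a ptrace tracer sees a non-leader
     thread exec.  Assumes PTRACE_O_TRACEEXEC was set at attach time;
     error handling omitted.  */
  #define _GNU_SOURCE
  #include <signal.h>
  #include <stdio.h>
  #include <sys/ptrace.h>
  #include <sys/types.h>
  #include <sys/wait.h>

  static void
  handle_exec_stop (void)  /* Hypothetical helper, illustration only.  */
  {
    int status;
    /* __WALL: wait for any child, threaded clone or not.  */
    pid_t pid = waitpid (-1, &status, __WALL);

    if (pid > 0
        && WIFSTOPPED (status)
        && status >> 8 == (SIGTRAP | (PTRACE_EVENT_EXEC << 8)))
      {
        /* By the time this stop is reported, PID is already the thread
           group leader's id, even if a non-leader thread called execve.
           Since Linux 3.0 the former thread id is still retrievable.  */
        unsigned long former_tid = 0;
        ptrace (PTRACE_GETEVENTMSG, pid, NULL, &former_tid);

        if ((pid_t) former_tid != pid)
          printf ("non-leader %lu exec'd; it now reports as leader %d\n",
                  former_tid, (int) pid);

        /* This is the point at which any per-LWP bookkeeping for the
           old, pre-exec thread ids has to be thrown away -- which is
           what this patch does for GDB's LWP list.  */
        ptrace (PTRACE_CONT, pid, NULL, NULL);
      }
  }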
Coming back to the test coverage point: given the above, this isn't
quite right.  It would appear that someone using the libthread_db code
would, as you point out, see both tests fail.  But without the
libthread_db code the 'maint' command will show the problem, while the
'kill' command isn't going to, even if you force the 'kill' earlier
like you propose in your new patch[1].

This doesn't actually change anything -- you did add the new 'maint'
command in your next patch, it was just that the behaviour I was seeing
didn't align with what you saw, and I think we now know why.

I'll reply separately to [1] with my feedback on that patch.

[1] https://inbox.sourceware.org/gdb-patches/5b80a2c3-3679-fb86-27f3-0dcc9c019562@palves.net/#t

Thanks,
Andrew