Hi!

On 2023-04-04 2:57 p.m., Pedro Alves wrote:
> On 2023-03-21 2:50 p.m., Andrew Burgess wrote:
>>
>> I thought it was the second case, but I was so unsure that I tried the
>> reproducer anyway. Just in case I'm wrong, the above example doesn't
>> seem to fail prior to this commit.
>
> This surprised me, and when I tried it myself, I was even more surprised,
> for I couldn't reproduce it either!
>
> But I figured it out.
>
> I'm usually using Ubuntu 22.04 for development nowadays, and in that system, indeed I can't
> reproduce it. Right after the exec, GDB traps a load event for "libc.so.6", which leads to
> gdb trying to open libthread_db for the post-exec inferior, and, it succeeds. When we load
> libthread_db, we call linux_stop_and_wait_all_lwps, which, as the name suggests, stops all lwps,
> and then waits to see their stops. While doing this, GDB detects that the pre-exec stale
> LWP is gone, and deletes it.
>
> The logs show:
>
> [linux-nat] linux_nat_wait_1: waitpid 1725529 received SIGTRAP - Trace/breakpoint trap (stopped)
> [linux-nat] save_stop_reason: 1725529.1725529.0 stopped by software breakpoint
> [linux-nat] linux_nat_wait_1: waitpid(-1, ...) returned 0, ERRNO-OK
> [linux-nat] resume_stopped_resumed_lwps: NOT resuming LWP 1725529.1725658.0, not stopped
> [linux-nat] resume_stopped_resumed_lwps: NOT resuming LWP 1725529.1725529.0, has pending status
> [linux-nat] linux_nat_wait_1: trap ptid is 1725529.1725529.0.
> [linux-nat] linux_nat_wait_1: exit
> [linux-nat] stop_callback: kill 1725529.1725658.0 ****
> [linux-nat] stop_callback: lwp kill -1 No such process
> [linux-nat] wait_lwp: 1725529.1725658.0 vanished.
>
> And the backtrace is:
>
> (top-gdb) bt
> #0 wait_lwp (lp=0x555556f37350) at ../../src/gdb/linux-nat.c:2069
> #1 0x0000555555aa8fbf in stop_wait_callback (lp=0x555556f37350) at ../../src/gdb/linux-nat.c:2375
> #2 0x0000555555ab12b3 in gdb::function_view::bind(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::operator()(gdb::fv_detail::erased_callable, lwp_info*) const (__closure=0x0, ecall=..., args#0=0x555556f37350) at ../../src/gdb/../gdbsupport/function-view.h:326
> #3 0x0000555555ab12e2 in gdb::function_view::bind(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::_FUN(gdb::fv_detail::erased_callable, lwp_info*) () at ../../src/gdb/../gdbsupport/function-view.h:320
> #4 0x0000555555ab0610 in gdb::function_view::operator()(lwp_info*) const (this=0x7fffffffca90, args#0=0x555556f37350) at ../../src/gdb/../gdbsupport/function-view.h:289
> #5 0x0000555555aa4c2d in iterate_over_lwps(ptid_t, gdb::function_view) (filter=..., callback=...) at ../../src/gdb/linux-nat.c:867
> #6 0x0000555555aa8a03 in linux_stop_and_wait_all_lwps () at ../../src/gdb/linux-nat.c:2229
> #7 0x0000555555ac8525 in try_thread_db_load_1 (info=0x555556a66dd0) at ../../src/gdb/linux-thread-db.c:923
> #8 0x0000555555ac89d5 in try_thread_db_load (library=0x5555560eca27 "libthread_db.so.1", check_auto_load_safe=false) at ../../src/gdb/linux-thread-db.c:1024
> #9 0x0000555555ac8eda in try_thread_db_load_from_sdir () at ../../src/gdb/linux-thread-db.c:1108
> #10 0x0000555555ac9278 in thread_db_load_search () at ../../src/gdb/linux-thread-db.c:1163
> #11 0x0000555555ac9518 in thread_db_load () at ../../src/gdb/linux-thread-db.c:1225
> #12 0x0000555555ac95e1 in check_for_thread_db () at ../../src/gdb/linux-thread-db.c:1268
> #13 0x0000555555ac9657 in thread_db_new_objfile (objfile=0x555556943ed0) at ../../src/gdb/linux-thread-db.c:1297
> #14 0x000055555569e2d2 in std::__invoke_impl (__f=@0x5555567925d8: 0x555555ac95e8 ) at /usr/include/c++/11/bits/invoke.h:61
> #15 0x000055555569c44a in std::__invoke_r (__fn=@0x5555567925d8: 0x555555ac95e8 ) at /usr/include/c++/11/bits/invoke.h:111
> #16 0x0000555555699d69 in std::_Function_handler::_M_invoke(std::_Any_data const&, objfile*&&) (__functor=..., __args#0=@0x7fffffffce50: 0x555556943ed0) at /usr/include/c++/11/bits/std_function.h:290
> #17 0x0000555555b5f48b in std::function::operator()(objfile*) const (this=0x5555567925d8, __args#0=0x555556943ed0) at /usr/include/c++/11/bits/std_function.h:590
> #18 0x0000555555b5eba4 in gdb::observers::observable::notify (this=0x5555565b5680 , args#0=0x555556943ed0) at ../../src/gdb/../gdbsupport/observable.h:166
> #19 0x0000555555cdd85b in symbol_file_add_with_addrs (abfd=..., name=0x5555569794e0 "/lib/x86_64-linux-gnu/libc.so.6", add_flags=..., addrs=0x7fffffffd0c0, flags=..., parent=0x0) at ../../src/gdb/symfile.c:1131
> #20 0x0000555555cdd9c5 in symbol_file_add_from_bfd (abfd=..., name=0x5555569794e0 "/lib/x86_64-linux-gnu/libc.so.6", add_flags=..., addrs=0x7fffffffd0c0, flags=..., parent=0x0) at ../../src/gdb/symfile.c:1167
> #21 0x0000555555c9dd69 in solib_read_symbols (so=0x5555569792d0, flags=...) at ../../src/gdb/solib.c:730
> #22 0x0000555555c9e7b7 in solib_add (pattern=0x0, from_tty=0, readsyms=1) at ../../src/gdb/solib.c:1041
> #23 0x0000555555c9f61d in handle_solib_event () at ../../src/gdb/solib.c:1315
> #24 0x0000555555729c26 in bpstat_stop_status (aspace=0x555556606800, bp_addr=0x7ffff7fe7278, thread=0x555556816bd0, ws=..., stop_chain=0x0) at ../../src/gdb/breakpoint.c:5702
> #25 0x0000555555a62e41 in handle_signal_stop (ecs=0x7fffffffd670) at ../../src/gdb/infrun.c:6517
> #26 0x0000555555a61479 in handle_inferior_event (ecs=0x7fffffffd670) at ../../src/gdb/infrun.c:6000
> #27 0x0000555555a5c7b5 in fetch_inferior_event () at ../../src/gdb/infrun.c:4403
> #28 0x0000555555a35b65 in inferior_event_handler (event_type=INF_REG_EVENT) at ../../src/gdb/inf-loop.c:41
> #29 0x0000555555aae0c9 in handle_target_event (error=0, client_data=0x0) at ../../src/gdb/linux-nat.c:4231
>
>
> Now, when I try the same on a Fedora 32 machine, I see the GDB crash due to the stale
> LWP still in the LWP list with no corresponding thread_info. On this
> machine, glibc predates the changes that make it possible to use libthread_db with
> non-threaded processes, so try_thread_db_load doesn't manage to open a connection
> to libthread_db, and thus we don't end up in linux_stop_and_wait_all_lwps, and thus
> the stale lwp is not deleted. And so a subsequent "kill" command crashes.
>
> I wrote that patch originally on an Ubuntu 20.04 machine (vs the Ubuntu 22.04 I have now),
> and it must be that that version also predates the glibc change, and thus behaves like
> this Fedora 32 box. You are very likely using a newer Fedora which has the glibc change.

...

>> What are your thoughts on including this, or something like this with
>> this commit? My patch, which applies on top of this commit, is included
>> at the end of this email. Please feel free to take any changes that you
>> feel add value.
>
> I'm totally fine with such a command, though the test I had added covers
> as much as it would, as the "kill" command fails when the maint command
> would fail, and passes when the maint command passes. But I'll incorporate
> it.
>

I realized that my description of the problem above practically suggests a way
to expose the crash everywhere -- just catch the exec event with "catch exec",
so that the post-exec program doesn't even get to the libc.so.6 load event, and
issue "kill" there, or use "maint info linux-lwps". So I've adjusted the patch
to add a new testcase doing that.

I've attached two patches, one adding your "maint info linux-lwps", now with
NEWS/docs, and the updated version of the crash fix and testcase.

WDYT?

Pedro Alves