Hi!

On 2023-04-04 2:57 p.m., Pedro Alves wrote:
> On 2023-03-21 2:50 p.m., Andrew Burgess wrote:
>>
>> I thought it was the second case, but I was so unsure that I tried the
>> reproducer anyway.  Just in case I'm wrong, the above example doesn't
>> seem to fail prior to this commit.
> 
> This surprised me, and when I tried it myself, I was even more surprised,
> for I couldn't reproduce it either!
> 
> But I figured it out.
> 
> I'm usually using Ubuntu 22.04 for development nowadays, and in that system, indeed I can't
> reproduce it.  Right after the exec, GDB traps a load event for "libc.so.6", which leads to
> gdb trying to open libthread_db for the post-exec inferior, and it succeeds.  When we load
> libthread_db, we call linux_stop_and_wait_all_lwps, which, as the name suggests, stops all lwps,
> and then waits to see their stops.  While doing this, GDB detects that the pre-exec stale
> LWP is gone, and deletes it.
> 
> The logs show:
> 
> [linux-nat] linux_nat_wait_1: waitpid 1725529 received SIGTRAP - Trace/breakpoint trap (stopped)
> [linux-nat] save_stop_reason: 1725529.1725529.0 stopped by software breakpoint
> [linux-nat] linux_nat_wait_1: waitpid(-1, ...) returned 0, ERRNO-OK
> [linux-nat] resume_stopped_resumed_lwps: NOT resuming LWP 1725529.1725658.0, not stopped
> [linux-nat] resume_stopped_resumed_lwps: NOT resuming LWP 1725529.1725529.0, has pending status
> [linux-nat] linux_nat_wait_1: trap ptid is 1725529.1725529.0.
> [linux-nat] linux_nat_wait_1: exit
> [linux-nat] stop_callback: kill 1725529.1725658.0 **<SIGSTOP>**
> [linux-nat] stop_callback: lwp kill -1 No such process
> [linux-nat] wait_lwp: 1725529.1725658.0 vanished.
> 
> And the backtrace is:
> 
> (top-gdb) bt
> #0  wait_lwp (lp=0x555556f37350) at ../../src/gdb/linux-nat.c:2069
> #1  0x0000555555aa8fbf in stop_wait_callback (lp=0x555556f37350) at ../../src/gdb/linux-nat.c:2375
> #2  0x0000555555ab12b3 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::operator()(gdb::fv_detail::erased_callable, lwp_info*) const (__closure=0x0, ecall=..., args#0=0x555556f37350) at ../../src/gdb/../gdbsupport/function-view.h:326
> #3  0x0000555555ab12e2 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::_FUN(gdb::fv_detail::erased_callable, lwp_info*) () at ../../src/gdb/../gdbsupport/function-view.h:320
> #4  0x0000555555ab0610 in gdb::function_view<int (lwp_info*)>::operator()(lwp_info*) const (this=0x7fffffffca90, args#0=0x555556f37350) at ../../src/gdb/../gdbsupport/function-view.h:289
> #5  0x0000555555aa4c2d in iterate_over_lwps(ptid_t, gdb::function_view<int (lwp_info*)>) (filter=..., callback=...) at ../../src/gdb/linux-nat.c:867
> #6  0x0000555555aa8a03 in linux_stop_and_wait_all_lwps () at ../../src/gdb/linux-nat.c:2229
> #7  0x0000555555ac8525 in try_thread_db_load_1 (info=0x555556a66dd0) at ../../src/gdb/linux-thread-db.c:923
> #8  0x0000555555ac89d5 in try_thread_db_load (library=0x5555560eca27 "libthread_db.so.1", check_auto_load_safe=false) at ../../src/gdb/linux-thread-db.c:1024
> #9  0x0000555555ac8eda in try_thread_db_load_from_sdir () at ../../src/gdb/linux-thread-db.c:1108
> #10 0x0000555555ac9278 in thread_db_load_search () at ../../src/gdb/linux-thread-db.c:1163
> #11 0x0000555555ac9518 in thread_db_load () at ../../src/gdb/linux-thread-db.c:1225
> #12 0x0000555555ac95e1 in check_for_thread_db () at ../../src/gdb/linux-thread-db.c:1268
> #13 0x0000555555ac9657 in thread_db_new_objfile (objfile=0x555556943ed0) at ../../src/gdb/linux-thread-db.c:1297
> #14 0x000055555569e2d2 in std::__invoke_impl<void, void (*&)(objfile*), objfile*> (__f=@0x5555567925d8: 0x555555ac95e8 <thread_db_new_objfile(objfile*)>) at /usr/include/c++/11/bits/invoke.h:61
> #15 0x000055555569c44a in std::__invoke_r<void, void (*&)(objfile*), objfile*> (__fn=@0x5555567925d8: 0x555555ac95e8 <thread_db_new_objfile(objfile*)>) at /usr/include/c++/11/bits/invoke.h:111
> #16 0x0000555555699d69 in std::_Function_handler<void (objfile*), void (*)(objfile*)>::_M_invoke(std::_Any_data const&, objfile*&&) (__functor=..., __args#0=@0x7fffffffce50: 0x555556943ed0) at /usr/include/c++/11/bits/std_function.h:290
> #17 0x0000555555b5f48b in std::function<void (objfile*)>::operator()(objfile*) const (this=0x5555567925d8, __args#0=0x555556943ed0) at /usr/include/c++/11/bits/std_function.h:590
> #18 0x0000555555b5eba4 in gdb::observers::observable<objfile*>::notify (this=0x5555565b5680 <gdb::observers::new_objfile>, args#0=0x555556943ed0) at ../../src/gdb/../gdbsupport/observable.h:166
> #19 0x0000555555cdd85b in symbol_file_add_with_addrs (abfd=..., name=0x5555569794e0 "/lib/x86_64-linux-gnu/libc.so.6", add_flags=..., addrs=0x7fffffffd0c0, flags=..., parent=0x0) at ../../src/gdb/symfile.c:1131
> #20 0x0000555555cdd9c5 in symbol_file_add_from_bfd (abfd=..., name=0x5555569794e0 "/lib/x86_64-linux-gnu/libc.so.6", add_flags=..., addrs=0x7fffffffd0c0, flags=..., parent=0x0) at ../../src/gdb/symfile.c:1167
> #21 0x0000555555c9dd69 in solib_read_symbols (so=0x5555569792d0, flags=...) at ../../src/gdb/solib.c:730
> #22 0x0000555555c9e7b7 in solib_add (pattern=0x0, from_tty=0, readsyms=1) at ../../src/gdb/solib.c:1041
> #23 0x0000555555c9f61d in handle_solib_event () at ../../src/gdb/solib.c:1315
> #24 0x0000555555729c26 in bpstat_stop_status (aspace=0x555556606800, bp_addr=0x7ffff7fe7278, thread=0x555556816bd0, ws=..., stop_chain=0x0) at ../../src/gdb/breakpoint.c:5702
> #25 0x0000555555a62e41 in handle_signal_stop (ecs=0x7fffffffd670) at ../../src/gdb/infrun.c:6517
> #26 0x0000555555a61479 in handle_inferior_event (ecs=0x7fffffffd670) at ../../src/gdb/infrun.c:6000
> #27 0x0000555555a5c7b5 in fetch_inferior_event () at ../../src/gdb/infrun.c:4403
> #28 0x0000555555a35b65 in inferior_event_handler (event_type=INF_REG_EVENT) at ../../src/gdb/inf-loop.c:41
> #29 0x0000555555aae0c9 in handle_target_event (error=0, client_data=0x0) at ../../src/gdb/linux-nat.c:4231
> 
> 
> Now, when I try the same on a Fedora 32 machine, I see the GDB crash due to the stale
> LWP still in the LWP list with no corresponding thread_info.  On this
> machine, glibc predates the changes that make it possible to use libthread_db with
> non-threaded processes, so try_thread_db_load doesn't manage to open a connection
> to libthread_db, and thus we don't end up in linux_stop_and_wait_all_lwps, and thus
> the stale lwp is not deleted.  And so a subsequent "kill" command crashes.
> 
> I wrote that patch originally on an Ubuntu 20.04 machine (vs the Ubuntu 22.04 I have now),
> and it must be that that version also predates the glibc change, and thus behaves like
> this Fedora 32 box.  You are very likely using a newer Fedora which has the glibc change.

...

>> What are your thoughts on including this, or something like this with
>> this commit?  My patch, which applies on top of this commit, is included
>> at the end of this email.  Please feel free to take any changes that you
>> feel add value.
> 
> I'm totally fine with such a command, though the test I had added covers
> as much as it would, as the "kill" command fails when the maint command
> would fail, and passes when the maint command passes.  But I'll incorporate
> it.
> 

I realized that my description of the problem above practically
suggests a way to expose the crash everywhere -- just catch the exec
event with "catch exec", so that the post-exec program doesn't even
get to the libc.so.6 load event, and then either issue "kill" there or
use "maint info linux-lwps".  So I've adjusted the patch to add a new
testcase doing that.  I've attached two patches: one adding your
"maint info linux-lwps" command, now with NEWS/docs, and the other with
the updated version of the crash fix and testcase.
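
For illustration only (this is not the attached testcase), here is a
rough sketch of the manual steps as a gdb batch command line.  The
helper name gdb_repro_argv and the program path ./exec-test are
made up for this example; the sketch assumes a multi-threaded test
program that execs, and a gdb that has the "maint info linux-lwps"
command from the attached patch.

```python
import shlex


def gdb_repro_argv(program, use_maint_cmd=False):
    """Build a gdb command line that stops at the exec catchpoint and
    then pokes at the LWP list, before libc.so.6 is ever loaded.

    PROGRAM is a hypothetical test binary; USE_MAINT_CMD selects
    "maint info linux-lwps" instead of "kill" as the probing command.
    """
    probe = "maint info linux-lwps" if use_maint_cmd else "kill"
    return [
        "gdb", "-q", "-batch",
        "-ex", "catch exec",   # stop right at the exec event
        "-ex", "run",
        "-ex", probe,          # run the probe while stopped at the exec
        program,
    ]


if __name__ == "__main__":
    # Print the command line rather than running it, since the
    # test program here is only a placeholder.
    print(shlex.join(gdb_repro_argv("./exec-test")))
```

Passing use_maint_cmd=True swaps "kill" for the maint command, which
only makes sense with the first patch applied.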

WDYT?

Pedro Alves