Hi Ulrich and community,

Please find the new patch [See:- 0001-Fix-multi-thread-bug-in-AIX.patch].
The last message was too long, so I am sending it again so that it can be
part of the mailing list as well.

I have moved the get_signaled_thread () function, which tells us which
kernel thread caused the event the debugger was waiting for, to
rs6000-aix-nat.c. So we now figure out which kernel thread caused an event
in the rs6000-aix-nat.c code itself. If the main thread becomes pthreaded,
or a user thread is created or deleted, I have taken care of what to
return in the sync_threadlists () code itself.

>>So, if you observe output 3 or 4, the program first multi threads,
>>I mean thread events are handled first and then the threads fork.
>>So, when this happens, I cannot return ptid_t (parent_pid). If I do
>>so, the GDB core will treat it as a new process and add it in my
>>threadlist as say process 100 despite existence of 'thread 1'
>>representing the same. So, I need to correctly send which thread
>>did the fork () event or which thread of the process is the one who
>>gave birth to a new inferior process [say 2 or 3 in output 3 below],
>>I mean which thread caused the multi-process event when the process
>>is multi-threaded. This has to be handled here as control from target.c
>>comes directly to rs6000-aix-nat::wait and not through
>>aix-thread.c::wait since fork () is a process event..

>So this last bit seems to be the problem.  Could you elaborate on
>what the exact call stack is?  I thought once the thread layer is
>initialized, calls to ::wait should always go through it ...

Kindly see the backtrace sections pasted below in this email:

* BT:- Thread_wait [a thread event, such as a new thread being born or
  the main process becoming pthreaded],
* BT:- Post thread wait in rs6000-aix-nat::wait [the beneath ()->wait ()
  call in aix_thread_target::wait],
* BT:- If direct rs6000-aix-nat::wait [in outputs 3 and 4 {below in the
  previous email} you can see that control comes directly to
  rs6000-aix-nat.c when the main process, after creating threads,
  calls fork ()].

>So the way this works e.g. on Linux is that the process layer handles
>both processes and the *kernel* aspect of threads, while the thread
>layer handles the *user-space* (libc/libpthread) aspect of threads.
>In terms of the GDB ptid_t, this means that both the "pid" and "lwp"
>field are "owned" by the process layer (which would be rs6000-aix-nat.c
>in your case), while only the "tid" field is owned by the thread
>layer (which would be aix-thread.c).

>Linux does that because it allows correctly debugging programs that
>only use the kernel threading capabilities without using libpthread,
>e.g. by directly calling the "clone" system call and not "pthread_create".
>Such threads won't be in the thread list managed by the user space
>library, but are still handled by the process layer in GDB, tracked
>as lwp without associated tid.

>Not sure if something like that is even possible in AIX.  If it does
>make sense to handle things similarly in AIX (one other reason would
>be ptrace commands that require LWPs, e.g. like the VSX register
>access you had in another thread), some code would indeed need
>to move, e.g. everything related to accessing *kernel* threads
>(fetch_regs_kernel_thread etc.), while code that accesses *user*
>threads via the libpthread accessors (fetch_regs_user_thread etc.)
>would still remain in aix-thread.c.

With this patch I have moved all my lwp checks into rs6000-aix-nat.c and
the user thread handling into aix-thread.c. Yes, this will help us in the
vector patch.
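
To make the ownership split quoted above concrete, here is a minimal, self-contained sketch. It is not GDB code: toy_ptid, process_layer_wait, thread_layer_wait and describe are hypothetical names made up for illustration, and the numeric values are only examples. It assumes the process layer fills in pid and lwp and the thread layer only adds tid on top of what it gets from beneath.

```cpp
#include <cassert>
#include <string>

// Toy stand-in for GDB's ptid_t (hypothetical, for illustration only).
// The three fields mirror the (pid, lwp, tid) triple discussed above.
struct toy_ptid
{
  int pid;            // owned by the process layer
  long lwp;           // kernel thread id, owned by the process layer
  unsigned long tid;  // user-space thread id, owned by the thread layer
};

// Process layer (rs6000-aix-nat.c in this discussion): reports the
// process and the kernel thread that caused the event.
toy_ptid
process_layer_wait (int pid, long kernel_thread)
{
  return toy_ptid {pid, kernel_thread, 0};
}

// Thread layer (aix-thread.c in this discussion): takes the result
// from beneath and only fills in the user thread id.
toy_ptid
thread_layer_wait (toy_ptid from_beneath, unsigned long user_thread)
{
  from_beneath.tid = user_thread;
  return from_beneath;
}

// Pick the most specific description available for a toy_ptid.
std::string
describe (const toy_ptid &p)
{
  if (p.tid != 0)
    return "Thread " + std::to_string (p.tid);
  if (p.lwp != 0)
    return "LWP " + std::to_string (p.lwp);
  return "Process " + std::to_string (p.pid);
}
```

With this model, an event reported only by the process layer still names a kernel thread, and the thread layer never has to touch pid or lwp; it just decorates the result with the user thread id.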

>>While debugging in depth over the last two days I realised our pid_to_str
>>is needed in rs6000-aix-nat.c as control comes here in search of it.
>>If it doesn't GDB treats all threads as process.

>This is again very suspicious.  We obviously already have
>threads, so the thread layer should be initialized.  This
>means that any "pid_to_str" call should go through the
>*thread* layer (implementation in aix-thread.c).  If that
>doesn't happen, we should understand why.  (This may be the
>same problem that causes "wait" to be called from the
>wrong layer, as seen above.)

Kindly check the backtrace section [BT:- pid_to_str stack] below in this
email. What is happening is that a thread event comes through the thread
layer and a process event comes through the process layer. For example,
when I press the interrupt key [Ctrl+C] in a multi-process scenario, the
GDB core needs to know which process is involved. Looking at the stack,
it is built on the assumption that the target will figure out, in the
process layer, the kernel thread that eventually caused the event.

Secondly, kindly look at aix-thread.c:pid_to_str. It calls
beneath ()->pid_to_str () in case the process is not threaded, so we need
an implementation in rs6000-aix-nat.c as well.

aix_thread_target::pid_to_str (ptid_t ptid)
{
  if (!PD_TID (ptid))
    return beneath ()->pid_to_str (ptid);

  return string_printf (_("Thread %s"), pulongest (ptid.tid ()));
}

>>I have added an assertion here just
>>to be sure. I get what you are thinking.

>Having an assertion is of course good, but it isn't obvious to
>me that this never can be hit.

While running a few unit tests I did not find any case where we might end
up swapping the pid. I added the assertion so that if anyone hits this in
the future, we are aware and can change the code accordingly.

>The point is if GDB stops because the target received a signal, it
>should automatically switch to the particular thread where the signal
>was in fact received.  I don't think this will actually happen in all
>cases with the current code.
>Shouldn't you instead check for *any* signal in get_signaled_thread?

Yes, kindly check get_signaled_thread_rs6000 ().

----------------------------------------------------

These are the changes I made, thinking about how we can handle
get_signaled_thread in one place. I have also attached the outputs and
programs below. Also, we now pass ptid instead of pid to some functions in
aix-thread.c. Kindly let me know what you think.

Have a nice day ahead.

Thanks and regards,
Aditya.

The outputs below were obtained by running ./gdb ./gdb

---------------------------------------------------------------------------------------------
BT:- Thread_wait

Thread 1 hit Breakpoint 1, aix_thread_target::wait (this=0x11001f758 <_aixthread.rw_+24>, ptid=..., status=0xffffffffffff360, options=...)
    at aix-thread.c:1051
1051      pid_to_prc (&ptid);
(gdb) bt
#0  aix_thread_target::wait (this=0x11001f758 <_aixthread.rw_+24>, ptid=..., status=0xffffffffffff360, options=...)
    at aix-thread.c:1051
#1  0x0000000100340778 in target_wait (ptid=..., status=0xffffffffffff360, options=...)
    at target.c:2598
#2  0x000000010037f158 in do_target_wait_1 (inf=0x1101713f0, ptid=..., status=0xffffffffffff360, options=...)
    at infrun.c:3763
#3  0x000000010037f41c in ::operator()(inferior *) const ( __closure=0xffffffffffff130, inf=0x1101713f0)
    at infrun.c:3822
#4  0x000000010037f85c in do_target_wait (ecs=0xffffffffffff338, options=...)
    at infrun.c:3841
#5  0x0000000100380cc8 in fetch_inferior_event () at infrun.c:4201
#6  0x0000000100a1e354 in inferior_event_handler (event_type=INF_REG_EVENT) at inf-loop.c:41
#7  0x0000000100392700 in infrun_async_inferior_event_handler (data=0x0) at infrun.c:9555
#8  0x0000000100677d88 in check_async_event_handlers () at async-event.c:337
#9  0x000000010067439c in gdb_do_one_event (mstimeout=-1) at event-loop.cc:221
#10 0x0000000100001dd0 in start_event_loop () at main.c:411
#11 0x0000000100001fd8 in captured_command_loop () at main.c:471
#12 0x0000000100004150 in captured_main (data=0xffffffffffff9f0) at main.c:1330
#13 0x0000000100004224 in gdb_main (args=0xffffffffffff9f0) at main.c:1345
#14 0x0000000100000aa0 in main (argc=2, argv=0xffffffffffffa90) at gdb.c:32

------------------------------------------------------------
BT:- Post thread wait in rs6000-aix-nat::wait

(gdb) c
Continuing.

Thread 1 hit Breakpoint 2, rs6000_nat_target::wait (this=0x1100a2e10 <_rs6000aixnat.rw_>, ptid=..., ourstatus=0xffffffffffff360, options=...)
    at rs6000-aix-nat.c:695
695       set_sigint_trap ();
(gdb) bt
#0  rs6000_nat_target::wait (this=0x1100a2e10 <_rs6000aixnat.rw_>, ptid=..., ourstatus=0xffffffffffff360, options=...)
    at rs6000-aix-nat.c:695
#1  0x0000000100599d68 in aix_thread_target::wait (this=0x11001f758 <_aixthread.rw_+24>, ptid=..., status=0xffffffffffff360, options=...)
    at aix-thread.c:1053
#2  0x0000000100340778 in target_wait (ptid=..., status=0xffffffffffff360, options=...)
    at target.c:2598
#3  0x000000010037f158 in do_target_wait_1 (inf=0x1101713f0, ptid=..., status=0xffffffffffff360, options=...)
    at infrun.c:3763
#4  0x000000010037f41c in ::operator()(inferior *) const ( __closure=0xffffffffffff130, inf=0x1101713f0)
    at infrun.c:3822
#5  0x000000010037f85c in do_target_wait (ecs=0xffffffffffff338, options=...)
    at infrun.c:3841
#6  0x0000000100380cc8 in fetch_inferior_event () at infrun.c:4201
#7  0x0000000100a1e354 in inferior_event_handler (event_type=INF_REG_EVENT) at inf-loop.c:41
#8  0x0000000100392700 in infrun_async_inferior_event_handler (data=0x0) at infrun.c:9555
#9  0x0000000100677d88 in check_async_event_handlers () at async-event.c:337
#10 0x000000010067439c in gdb_do_one_event (mstimeout=-1) at event-loop.cc:221
#11 0x0000000100001dd0 in start_event_loop () at main.c:411
#12 0x0000000100001fd8 in captured_command_loop () at main.c:471
#13 0x0000000100004150 in captured_main (data=0xffffffffffff9f0) at main.c:1330
#14 0x0000000100004224 in gdb_main (args=0xffffffffffff9f0) at main.c:1345
#15 0x0000000100000aa0 in main (argc=2, argv=0xffffffffffffa90) at gdb.c:32

-----------------------------------------------------------------------------------------------------
BT:- If direct rs6000-aix-nat::wait

Thread 1 hit Breakpoint 2, rs6000_nat_target::wait (this=0x1100a2e10 <_rs6000aixnat.rw_>, ptid=..., ourstatus=0xffffffffffff360, options=...)
    at rs6000-aix-nat.c:695
695       set_sigint_trap ();
(gdb) bt
#0  rs6000_nat_target::wait (this=0x1100a2e10 <_rs6000aixnat.rw_>, ptid=..., ourstatus=0xffffffffffff360, options=...)
    at rs6000-aix-nat.c:695
#1  0x0000000100340778 in target_wait (ptid=..., status=0xffffffffffff360, options=...)
    at target.c:2598
#2  0x000000010037f158 in do_target_wait_1 (inf=0x1105f4430, ptid=..., status=0xffffffffffff360, options=...)
    at infrun.c:3763
#3  0x000000010037f41c in ::operator()(inferior *) const ( __closure=0xffffffffffff130, inf=0x1105f4430)
    at infrun.c:3822
#4  0x000000010037f85c in do_target_wait (ecs=0xffffffffffff338, options=...)
    at infrun.c:3841
#5  0x0000000100380cc8 in fetch_inferior_event () at infrun.c:4201
#6  0x0000000100a1e354 in inferior_event_handler (event_type=INF_REG_EVENT) at inf-loop.c:41
#7  0x0000000100392700 in infrun_async_inferior_event_handler (data=0x0) at infrun.c:9555
#8  0x0000000100677d88 in check_async_event_handlers () at async-event.c:337
#9  0x000000010067439c in gdb_do_one_event (mstimeout=-1) at event-loop.cc:221
#10 0x0000000100001dd0 in start_event_loop () at main.c:411
#11 0x0000000100001fd8 in captured_command_loop () at main.c:471
#12 0x0000000100004150 in captured_main (data=0xffffffffffff9f0) at main.c:1330
#13 0x0000000100004224 in gdb_main (args=0xffffffffffff9f0) at main.c:1345
#14 0x0000000100000aa0 in main (argc=2, argv=0xffffffffffffa90) at gdb.c:32

---------------------------------------------------------
BT:- pid_to_str stack

(gdb) bt
#0  rs6000_nat_target::pid_to_str[abi:cxx11](ptid_t) (this=0x1100a2e10 <_rs6000aixnat.rw_>, ptid=...)
    at rs6000-aix-nat.c:674
#1  0x00000001003409ec in target_pid_to_str[abi:cxx11](ptid_t) (ptid=...)
    at target.c:2623
#2  0x000000010038fc08 in normal_stop () at infrun.c:8697
#3  0x0000000100380ff4 in fetch_inferior_event () at infrun.c:4266
#4  0x0000000100a1e354 in inferior_event_handler (event_type=INF_REG_EVENT) at inf-loop.c:41
#5  0x0000000100392700 in infrun_async_inferior_event_handler (data=0x0) at infrun.c:9555
#6  0x0000000100677d88 in check_async_event_handlers () at async-event.c:337
#7  0x000000010067439c in gdb_do_one_event (mstimeout=-1) at event-loop.cc:221
#8  0x0000000100001dd0 in start_event_loop () at main.c:411
#9  0x0000000100001fd8 in captured_command_loop () at main.c:471
#10 0x0000000100004150 in captured_main (data=0xffffffffffff9f0) at main.c:1330
#11 0x0000000100004224 in gdb_main (args=0xffffffffffff9f0) at main.c:1345
#12 0x0000000100000aa0 in main (argc=2, argv=0xffffffffffffa90) at gdb.c:32

________________________________
From: Aditya Kamath1
Sent: 08 December 2022 15:58
To: Ulrich Weigand; simark@simark.ca; gdb-patches@sourceware.org
Cc: Sangamesh Mallayya
Subject: Re: [PATCH] 0001-Fix-multi-thread-debug-bug-in-AIX.patch

---------------------------------------------------------------------
Program1:- [Credits gdb.threads/continuous-pending.c]

/* The header names were lost in the archived copy of this mail; the
   five below are the ones this program needs.  */
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

pthread_barrier_t barrier;

#define NUM_THREADS 3

void *
thread_function (void *arg)
{
  /* This ensures that the breakpoint is only hit after both threads
     are created, so the test can always switch to the non-event
     thread when the breakpoint triggers.  */
  pthread_barrier_wait (&barrier);

  while (1); /* break here */
}

int
main (void)
{
  int i;

  alarm (300);

  pthread_barrier_init (&barrier, NULL, NUM_THREADS);

  for (i = 0; i < NUM_THREADS; i++)
    {
      pthread_t thread;
      int res;

      res = pthread_create (&thread, NULL, thread_function, NULL);
      assert (res == 0);
    }

  while (1)
    sleep (1);

  return 0;
}

----------------------------------------------------------------
Output1:- Single process

Reading symbols from /home/aditya/gdb_tests/continue-pending-status...
(gdb) r
Starting program: /home/aditya/gdb_tests/continue-pending-status
^C[New Thread 258]
[New Thread 515]
[New Thread 772]

Thread 3 received signal SIGINT, Interrupt.
[Switching to Thread 515]
thread_function (arg=0x0) at /home/aditya/gdb_tests/continue-pending-status.c:36
36        while (1); /* break here */
(gdb) info threads
  Id   Target Id                          Frame
  1    Thread 1 (tid 24838585, running)   warning: (Internal error: pc 0x0 in read in psymtab, but not in symtab.)

  2    Thread 258 (tid 23134635, running) thread_function (arg=warning: (Internal error: pc 0x0 in read in psymtab, but not in symtab.)
0x0) at /home/aditya/gdb_tests/continue-pending-status.c:36
* 3    Thread 515 (tid 30146867, running) thread_function (arg=warning: (Internal error: pc 0x0 in read in psymtab, but not in symtab.)
0x0) at /home/aditya/gdb_tests/continue-pending-status.c:36
  4    Thread 772 (tid 27853165, running) thread_function (arg=warning: (Internal error: pc 0x0 in read in psymtab, but not in symtab.)
0x0) at /home/aditya/gdb_tests/continue-pending-status.c:36

---------------------------------------------------------------------------------
Program 2:- Multi process Code

/* The header names were lost in the archived copy of this mail; the
   five below are the ones this program needs.  */
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

pthread_barrier_t barrier;

#define NUM_THREADS 2

void *
thread_function (void *arg)
{
  /* This ensures that the breakpoint is only hit after both threads
     are created, so the test can always switch to the non-event
     thread when the breakpoint triggers.
   */
  pthread_barrier_wait (&barrier);

  pid_t child;

  child = fork ();
  if (child > 0)
    printf ("I am parent \n");
  else
    {
      printf (" Iam child \n");
      child = fork ();
      if (child > 0)
        printf ("From child I became a parent \n");
      else
        printf ("I am grandchild \n");
    }

  while (1); /* break here */
}

int
main (void)
{
  int i;

  alarm (300);

  pthread_barrier_init (&barrier, NULL, NUM_THREADS);

  for (i = 0; i < NUM_THREADS; i++)
    {
      pthread_t thread;
      int res;

      res = pthread_create (&thread, NULL, thread_function, NULL);
      assert (res == 0);
    }

  while (1)
    {
      sleep (15);
      break;
    }

  return 0;
}

-------------------------------------------------------------------------
Output 2:- With detach-on-fork on

Reading symbols from /home/aditya/gdb_tests/ultimate-multi-thread-fork...
(gdb) r
Starting program: /home/aditya/gdb_tests/ultimate-multi-thread-fork
[New Thread 258]
[New Thread 515]
[Detaching after fork from child process 8323572]
 Iam child
I am grandchild
From child I became a parent
I am parent
[Detaching after fork from child process 11665884]
 Iam child
I am grandchild
From child I became a parent
I am parent
^C
Thread 2 received signal SIGINT, Interrupt.
[Switching to Thread 258]
thread_function (arg=0x0) at /home/aditya/gdb_tests/ultimate-multi-thread-fork.c:32
32        while (1); /* break here */
(gdb) info threads
  Id   Target Id                          Frame
  1    Thread 1 (tid 27263269, running)   warning: (Internal error: pc 0x0 in read in psymtab, but not in symtab.)

* 2    Thread 258 (tid 28705075, running) thread_function (arg=warning: (Internal error: pc 0x0 in read in psymtab, but not in symtab.)
0x0) at /home/aditya/gdb_tests/ultimate-multi-thread-fork.c:32
  3    Thread 515 (tid 27853169, running) thread_function (arg=warning: (Internal error: pc 0x0 in read in psymtab, but not in symtab.)
0x0) at /home/aditya/gdb_tests/ultimate-multi-thread-fork.c:32

-------------------------------------------------------------------------
Output 3:- With detach-on-fork off

Reading symbols from /home/aditya/gdb_tests/ultimate-multi-thread-fork...
(gdb) set detach-on-fork off
(gdb) r
Starting program: /home/aditya/gdb_tests/ultimate-multi-thread-fork
[New Thread 258]
[New Thread 515]
[New inferior 2 (Process 15466928)]
[New inferior 3 (Process 13894048)]
I am parent
I am parent
^C
Thread 1.1 received signal SIGINT, Interrupt.
[Switching to Thread 1]
0xd0595fb0 in _p_nsleep () from /usr/lib/libpthread.a(shr_xpg5.o)
(gdb) info threads
  Id   Target Id         Frame
* 1.1  Thread 1          0xd0595fb0 in _p_nsleep () from /usr/lib/libpthread.a(shr_xpg5.o)
  1.2  Thread 258        0xd0595fb0 in _p_nsleep () from /usr/lib/libpthread.a(shr_xpg5.o)
  1.3  Thread 515        0xd0595fb0 in _p_nsleep () from /usr/lib/libpthread.a(shr_xpg5.o)
  2.1  Process 15466928  0xd0594fc8 in ?? ()
  3.1  Process 13894048  0xd0594fc8 in ?? ()

--------------------------------------------------
Output 4:- detach-on-fork off and following child

Reading symbols from /home/aditya/gdb_tests/ultimate-multi-thread-fork...
(gdb) set detach-on-fork off
(gdb) set follow-fork-mode child
(gdb) r
Starting program: /home/aditya/gdb_tests/ultimate-multi-thread-fork
[New Thread 258]
[New Thread 515]
[Attaching after Thread 515 fork to child Process 13894050]
[New inferior 2 (Process 13894050)]
 Iam child
[Attaching after Process 13894050 fork to child Process 11010474]
[New inferior 3 (Process 11010474)]
I am grandchild
^CReading symbols from /home/aditya/gdb_tests/ultimate-multi-thread-fork...

Thread 3.1 received signal SIGINT, Interrupt.
[Switching to Process 11010474]
thread_function (arg=0x0) at /home/aditya/gdb_tests/ultimate-multi-thread-fork.c:32
32        while (1); /* break here */
(gdb) info threads
  Id   Target Id         Frame
  1.1  Thread 1          0xd0594fc8 in _sigsetmask () from /usr/lib/libpthread.a(shr_xpg5.o)
  1.2  Thread 258        0xd0594fc8 in _sigsetmask () from /usr/lib/libpthread.a(shr_xpg5.o)
  1.3  Thread 515        0xd0594fc8 in _sigsetmask () from /usr/lib/libpthread.a(shr_xpg5.o)
  2.1  Process 13894050  0xd0594fc8 in ?? ()
* 3.1  Process 11010474  thread_function (arg=warning: (Internal error: pc 0x0 in read in psymtab, but not in symtab.)
0x0) at /home/aditya/gdb_tests/ultimate-multi-thread-fork.c:32

________________________________
From: Ulrich Weigand
Sent: 06 December 2022 00:03
To: simark@simark.ca; Aditya Kamath1; gdb-patches@sourceware.org
Cc: Sangamesh Mallayya
Subject: Re: [PATCH] 0001-Fix-multi-thread-debug-bug-in-AIX.patch

Aditya Kamath1 wrote:

>>I'm not sure why it is necessary to handle this in the process layer
>>(rs6000-aix-nat.c) instead of the thread layer (aix-thread.c).
>>What specifically breaks if you do not have these rs6000-aix-nat.c
>>changes?
>
>So, if you observe output 3 or 4, the program first multi threads,
>I mean thread events are handled first and then the threads fork.
>So, when this happens, I cannot return ptid_t (parent_pid). If I do
>so, the GDB core will treat it as a new process and add it in my
>threadlist as say process 100 despite existence of 'thread 1'
>representing the same. So, I need to correctly send which thread
>did the fork () event or which thread of the process is the one who
>gave birth to a new inferior process [say 2 or 3 in output 3 below],
>I mean which thread caused the multi-process event when the process
>is multi-threaded. This has to be handled here as control from target.c
>comes directly to rs6000-aix-nat::wait and not through
>aix-thread.c::wait since fork () is a process event..

So this last bit seems to be the problem.  Could you elaborate on
what the exact call stack is?  I thought once the thread layer is
initialized, calls to ::wait should always go through it ...

>>If you *do* need to handle LWPs (kernel thread IDs) in the process
>>layer (this can be a reasonable choice, and it is done by several other
>>native targets), then it should be *consistent*, and *all* LWP handling
>>should be in the process layer.
>>In particular, under no circumstances
>>does it make sense to duplicate the "find current/signalled thread"
>>code in *both* the process and thread layers.
>
>This is not straightforward to do. The reason being say our application is pthreaded
>We need our sync_threadlists() code to detect multiple threads and sync..
>We cannot handle this in rs6000-aix-nat.c with the current design of the code..
>Let's say child process is multi-threaded things can get complex..
>It will require us to move that whole GDB list and Pthread list sync code to
>rs6000-aix-nat.c code. The essence or most selling product or the USP
>[Unique Selling Proposition] of aix-thread.c code will be lost.

So the way this works e.g. on Linux is that the process layer handles
both processes and the *kernel* aspect of threads, while the thread
layer handles the *user-space* (libc/libpthread) aspect of threads.

In terms of the GDB ptid_t, this means that both the "pid" and "lwp"
fields are "owned" by the process layer (which would be rs6000-aix-nat.c
in your case), while only the "tid" field is owned by the thread
layer (which would be aix-thread.c).

Linux does that because it allows correctly debugging programs that
only use the kernel threading capabilities without using libpthread,
e.g. by directly calling the "clone" system call and not "pthread_create".
Such threads won't be in the thread list managed by the user space
library, but are still handled by the process layer in GDB, tracked
as lwp without associated tid.

Not sure if something like that is even possible in AIX.  If it does
make sense to handle things similarly in AIX (one other reason would
be ptrace commands that require LWPs, e.g. like the VSX register
access you had in another thread), some code would indeed need
to move, e.g. everything related to accessing *kernel* threads
(fetch_regs_kernel_thread etc.), while code that accesses *user*
threads via the libpthread accessors (fetch_regs_user_thread etc.)
would still remain in aix-thread.c.

>>>[Switching to process 16777620]
>
>>This outputs inferior_ptid ...
>
>Yes, you were right
>
>>>* 1.1 process 16777620 0xd0595fb0 in _p_nsleep ()
>>>   from /usr/lib/libpthread.a(shr_xpg5.o)
>>>  1.2 process 16777620 0xd0595fb0 in _p_nsleep ()
>>>   from /usr/lib/libpthread.a(shr_xpg5.o)
>>>  1.3 process 16777620 0xd0595fb0 in _p_nsleep ()
>>>   from /usr/lib/libpthread.a(shr_xpg5.o)
>>>  2.1 process 8323570 0xd0594fc8 in ?? ()
>>>  3.1 process 17957172 0xd0594fc8 in ?? ()
>
>>... and this outputs the ptid values for those threads.
>
>>If it says "process ...", then those ptid values have not
>>properly been switched over to the (pid, lwp, tid) format.
>While debugging in depth over the last two days I realised our pid_to_str
>is needed in rs6000-aix-nat.c as control comes here in search of it.
>If it doesn't GDB treats all threads as process.

This is again very suspicious.  We obviously already have threads,
so the thread layer should be initialized.  This means that any
"pid_to_str" call should go through the *thread* layer (implementation
in aix-thread.c).  If that doesn't happen, we should understand why.
(This may be the same problem that causes "wait" to be called from
the wrong layer, as seen above.)

>>You should verify that the sync_threadlists code handles
>>all multi-process cases correctly.  I haven't looked at
>>this in detail, but are you sure that here:
>
>>>@@ -841,8 +829,22 @@ sync_threadlists (int pid)
> >>         }
> >>       else if (cmp_result > 0)
> >>         {
>>>-         delete_thread (gbuf[gi]);
>
>
>>you never accidentally switch the *pid* part (if "gptid"
>>belongs to a different pid than "pptid")?
>
>So, this is not the reason. I have added an assertion here just
>to be sure. I get what you are thinking.

Having an assertion is of course good, but it isn't obvious to
me that this never can be hit.

>>Hmm.  So when "wait" returns, it needs to determine which thread
>>triggered the event that caused ptrace to stop.
>>On Linux, "wait"
>>will actually return the LWP of that thread, so it can be directly
>>used.  It seems on AIX, "wait" only returns a PID, and you do not
>>immediately know which thread caused the event?
>
>>In that case, I can see why you'd have to consider SIGINT as well
>>as SIGTRAP. However, it seems to me that even those two are not the
>>*only* cases that can cause "wait" to return - doesn't *any* signal
>>(potentially) trigger a ptrace intercept (causing wait to return)?
>
>>But that's probably a more general problem, and wouldn't occur in
>>this simple test case.
>
>Exactly. So I tried debugging a few examples causing a few other signals
>as mentioned in this document [https://www.ibm.com/docs/en/sdk-java-technology/8?topic=reference-signal-handling].
>In AIX we have most of them mentioned in the link. It does not block
>us from doing things or crash in case of a segmentation fault signal
>[from our debugger code]. Abort also works fine. Let me know what you think.

The point is if GDB stops because the target received a signal, it
should automatically switch to the particular thread where the signal
was in fact received.  I don't think this will actually happen in all
cases with the current code.

Shouldn't you instead check for *any* signal in get_signaled_thread?

Bye,
Ulrich
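
To make the last question concrete, here is a minimal, self-contained
sketch of the difference between checking only SIGTRAP/SIGINT and
checking for any signal.  It is not the code from the patch and does
not use the real AIX interfaces: the kernel_thread struct, its
pending_sig field, and both function names are hypothetical stand-ins
for whatever per-kernel-thread data the process layer actually reads.

```cpp
#include <cassert>
#include <csignal>
#include <vector>

// Hypothetical stand-in for a per-kernel-thread record; this is NOT
// the real AIX structure, only an illustration.
struct kernel_thread
{
  long lwp;         // kernel thread id
  int pending_sig;  // signal that stopped this thread, 0 if none
};

// Restrictive variant: only SIGTRAP and SIGINT identify the event
// thread.  Returns the lwp, or -1 if no thread matches.
long
signaled_thread_trap_or_int (const std::vector<kernel_thread> &threads)
{
  for (const kernel_thread &t : threads)
    if (t.pending_sig == SIGTRAP || t.pending_sig == SIGINT)
      return t.lwp;
  return -1;
}

// Variant suggested in the review: any nonzero signal identifies the
// event thread.  Returns the lwp, or -1 if no thread has a signal.
long
signaled_thread_any (const std::vector<kernel_thread> &threads)
{
  for (const kernel_thread &t : threads)
    if (t.pending_sig != 0)
      return t.lwp;
  return -1;
}
```

With a thread list in which only thread 258 stopped on SIGSEGV, the
restrictive variant finds no event thread, while the any-signal variant
returns 258, which is the thread GDB should switch to.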