Hi Ulrich and community, Please find attached the new patch [See:- 0001-Fix-multi-thread-debug-bug-in-AIX.patch ] So now the proposed patch attached in this email works fine for single process and multi process code with detach fork on and off options. Kindly see output 1 and program 1 for single process case and program 2 and outputs 2, 3 and 4 for multi process case [It points out that if a process is a thread, it is a thread. It is highlighted in bold]. The unit test cases run well. Output 4 is following child. It is attached to be sure things work in all cases. The proposed patch has a pid_to_str () function that helps us. So, when in a multi process case a switch happens between process due to any event the control checks in our rs6000-aix-nat.c for thread or process information from thread_target_id_str () from thread.c file. This part is now solved. There are some optimisations for which I need your advice since you are the expert. So here are my reasons.. >>ourstatus->set_forked (ptid_t (pid)); >>-            return ptid_t (parent_pid); >>+            return find_the_return_ptid (parent_pid); >> >>Kindly note that the control will always not come from aix-thread.c >>for such events. Hence, we cannot take care of the same there, >>though it will be a relief if we can do that. >I'm not sure why it is necessary to handle this in the process layer >(rs6000-aix-nat.c) instead of the thread layer (aix-thread.c). >What specifically breaks if you do not have these rs6000-aix-nat.c >changes? So, if you observe output 3 or 4, the program first multi threads, I mean thread events are handled first and then the threads fork. So, when this happens, I cannot return ptid_t (parent_pid). If I do so, the GDB core will treat it as a new process and add it in my threadlist as say process 100 despite existence of 'thread 1' representing the same. So, I need to correctly send which thread did the fork () event or which thread of the process is the one who gave birth to a new inferior process [say 2 or 3 in output 3 below], I mean which thread caused the mult process event when the process is mutli threaded. This has to handled here as control from target.c comes directly to rs6000-aix-nat::wait and not through aix-thread.c::wait since fork () is a process event.. >If you *do* need to handle LWPs (kernel thread IDs) in the process >layer (this can be a reasonable choice, and it done by several other >native targets), then it should be *consistent*, and *all* LWP handling >should be in the process layer. In particular, under no circumstances >does it make sense to duplicate the "find current/signalled thread" >code in *both* the process any thread layers. This not straightforward to do. The reason being say our application is pthreaded We need our sync_threadlists() code to detect multiple threads and sync.. We cannot handle this in rs6000-aix-nat.c with the current design of the code.. Let's say child process is multi-threaded things can get complex.. It will require us to move that whole GDB list and Pthread list sync code to rs6000-aix-nat.c code. The essence or most selling product or the USP [Unique Selling Proposition] of aix-thread.c code will be lost. Let me know what you think of this, and I will modify this patch as per how you guide or suggest.. This is where I wanted to reach out to all of you in the community as I think this might be a code design change needed. >>[Switching to process 16777620] >This outputs inferior_ptid ... Yes, you were right >>* 1.1 process 16777620 0xd0595fb0 in _p_nsleep () >> from /usr/lib/libpthread.a(shr_xpg5.o) >> 1.2 process 16777620 0xd0595fb0 in _p_nsleep () >> from /usr/lib/libpthread.a(shr_xpg5.o) >> 1.3 process 16777620 0xd0595fb0 in _p_nsleep () >> from /usr/lib/libpthread.a(shr_xpg5.o) >> 2.1 process 8323570 0xd0594fc8 in ?? () >> 3.1 process 17957172 0xd0594fc8 in ?? () >... and this outputs the ptid values for those threads. >If it says "process ...", then those ptid values have not >properly been switched over to the (pid, lwp, tid) format. >You should verify that the sync_threadlists code handles >all multi-process cases correctly. I haven't looked at >this in detail, but are you sure that here: >>@@ -841,8 +829,22 @@ sync_threadlists (int pid) >> } >> else if (cmp_result > 0) >> { >>- delete_thread (gbuf[gi]); >you never accidentally switch the *pid* part (if "gptid" >belows to a different pid than "pptid")? So, this is not the reason. I have added an assertion here just to be sure. I get what you are thinking. While debugged in depth last two days I realised our pid_to_str is needed in rs6000-aix-nat.c as control comes here in search of it. If it doesn't GDB treats all threads as process. It is the patch. Kindly see it and the outputs [3 and 4 below this email] as well. >Hmm. So when "wait" returns, it needs to determine which thread >triggered the event that caused ptrace to stop. On Linux, "wait" >will actually return the LWP of that thread, so it can be directly >used. It seems on AIX, "wait" only returns a PID, and you do not >immediately know which thread caused the event? >In that case, I can see why you'd have to consider SIGINT as well >as SIGTRAP. However, it seems to me that even those two are not the >*only* cases that can cause "wait" to return - doesn't *any* signal >(potentially) trigger a ptrace intercept (causing wait to return)? >But that's probably a more general problem, and wouldn't occur in >this simple test case. Exactly. So I tried debugging few examples causing a few other signals as mentioned in this document [https://www.ibm.com/docs/en/sdk-java-technology/8?topic=reference-signal-handling]. In AIX we have most of them mentioned in the link. It does not block us from doing things or crashes incase of a segment fault signal [from our debugger code]. Abort also works fine. Let me know what you think. ---------------------------------------------------------------- Program1:- [Credits gdb.threads/continuous-pending.c] #include #include #include #include #include pthread_barrier_t barrier; #define NUM_THREADS 3 void * thread_function (void *arg) { /* This ensures that the breakpoint is only hit after both threads are created, so the test can always switch to the non-event thread when the breakpoint triggers. */ pthread_barrier_wait (&barrier); while (1); /* break here */ } int main (void) { int i; alarm (300); pthread_barrier_init (&barrier, NULL, NUM_THREADS); for (i = 0; i < NUM_THREADS; i++) { pthread_t thread; int res; res = pthread_create (&thread, NULL, thread_function, NULL); assert (res == 0); } while (1) sleep (1); return 0; } ---------------------------------------------------------------- Output1:- Single process Reading symbols from /home/aditya/gdb_tests/continue-pending-status... (gdb) r Starting program: /home/aditya/gdb_tests/continue-pending-status ^C[New Thread 258] [New Thread 515] [New Thread 772] Thread 3 received signal SIGINT, Interrupt. [Switching to Thread 515] thread_function (arg=0x0) at /home/aditya/gdb_tests/continue-pending-status.c:36 36 while (1); /* break here */ (gdb) info threads Id Target Id Frame 1 Thread 1 (tid 24838585, running) warning: (Internal error: pc 0x0 in read in psymtab, but not in symtab.) 2 Thread 258 (tid 23134635, running) thread_function (arg=warning: (Internal error: pc 0x0 in read in psymtab, but not in symtab.) 0x0) at /home/aditya/gdb_tests/continue-pending-status.c:36 * 3 Thread 515 (tid 30146867, running) thread_function (arg=warning: (Internal error: pc 0x0 in read in psymtab, but not in symtab.) 0x0) at /home/aditya/gdb_tests/continue-pending-status.c:36 4 Thread 772 (tid 27853165, running) thread_function (arg=warning: (Internal error: pc 0x0 in read in psymtab, but not in symtab.) 0x0) at /home/aditya/gdb_tests/continue-pending-status.c:36 --------------------------------------------------------------------------------- Program 2:- Multi process Code #include #include #include #include #include pthread_barrier_t barrier; #define NUM_THREADS 2 void * thread_function (void *arg) { /* This ensures that the breakpoint is only hit after both threads are created, so the test can always switch to the non-event thread when the breakpoint triggers. */ pthread_barrier_wait (&barrier); pid_t child; child = fork (); if (child > 0) printf ("I am parent \n"); else{ printf (" Iam child \n"); child = fork (); if (child > 0) printf ("From child I became a parent \n"); else printf ("I am grandchild \n"); } while (1); /* break here */ } int main (void) { int i; alarm (300); pthread_barrier_init (&barrier, NULL, NUM_THREADS); for (i = 0; i < NUM_THREADS; i++) { pthread_t thread; int res; res = pthread_create (&thread, NULL, thread_function, NULL); assert (res == 0); } while (1) { sleep (15); break; } return 0; } ------------------------------------------------------------------------- Output 2:- With detach-on-fork on Reading symbols from /home/aditya/gdb_tests/ultimate-multi-thread-fork... (gdb) r Starting program: /home/aditya/gdb_tests/ultimate-multi-thread-fork [New Thread 258] [New Thread 515] [Detaching after fork from child process 8323572] Iam child I am grandchild From child I became a parent I am parent [Detaching after fork from child process 11665884] Iam child I am grandchild From child I became a parent I am parent ^C Thread 2 received signal SIGINT, Interrupt. [Switching to Thread 258] thread_function (arg=0x0) at /home/aditya/gdb_tests/ultimate-multi-thread-fork.c:32 32 while (1); /* break here */ (gdb) info threads Id Target Id Frame 1 Thread 1 (tid 27263269, running) warning: (Internal error: pc 0x0 in read in psymtab, but not in symtab.) * 2 Thread 258 (tid 28705075, running) thread_function (arg=warning: (Internal error: pc 0x0 in read in psymtab, but not in symtab.) 0x0) at /home/aditya/gdb_tests/ultimate-multi-thread-fork.c:32 3 Thread 515 (tid 27853169, running) thread_function (arg=warning: (Internal error: pc 0x0 in read in psymtab, but not in symtab.) 0x0) at /home/aditya/gdb_tests/ultimate-multi-thread-fork.c:32 ------------------------------------------------------------------------- Output 3:- With detach-on-fork off Reading symbols from /home/aditya/gdb_tests/ultimate-multi-thread-fork... (gdb) set detach-on-fork off (gdb) r Starting program: /home/aditya/gdb_tests/ultimate-multi-thread-fork [New Thread 258] [New Thread 515] [New inferior 2 (Process 15466928)] [New inferior 3 (Process 13894048)] I am parent I am parent ^C Thread 1.1 received signal SIGINT, Interrupt. [Switching to Thread 1] 0xd0595fb0 in _p_nsleep () from /usr/lib/libpthread.a(shr_xpg5.o) (gdb) info threads Id Target Id Frame * 1.1 Thread 1 0xd0595fb0 in _p_nsleep () from /usr/lib/libpthread.a(shr_xpg5.o) 1.2 Thread 258 0xd0595fb0 in _p_nsleep () from /usr/lib/libpthread.a(shr_xpg5.o) 1.3 Thread 515 0xd0595fb0 in _p_nsleep () from /usr/lib/libpthread.a(shr_xpg5.o) 2.1 Process 15466928 0xd0594fc8 in ?? () 3.1 Process 13894048 0xd0594fc8 in ?? () -------------------------------------------------- Output 4:- detach fork off and following child Reading symbols from /home/aditya/gdb_tests/ultimate-multi-thread-fork... (gdb) set detach-on-fork off (gdb) set follow-fork-mode child (gdb) r Starting program: /home/aditya/gdb_tests/ultimate-multi-thread-fork [New Thread 258] [New Thread 515] [Attaching after Thread 515 fork to child Process 13894050] [New inferior 2 (Process 13894050)] Iam child [Attaching after Process 13894050 fork to child Process 11010474] [New inferior 3 (Process 11010474)] I am grandchild ^CReading symbols from /home/aditya/gdb_tests/ultimate-multi-thread-fork... Thread 3.1 received signal SIGINT, Interrupt. [Switching to Process 11010474] thread_function (arg=0x0) at /home/aditya/gdb_tests/ultimate-multi-thread-fork.c:32 32 while (1); /* break here */ (gdb) info threads Id Target Id Frame 1.1 Thread 1 0xd0594fc8 in _sigsetmask () from /usr/lib/libpthread.a(shr_xpg5.o) 1.2 Thread 258 0xd0594fc8 in _sigsetmask () from /usr/lib/libpthread.a(shr_xpg5.o) 1.3 Thread 515 0xd0594fc8 in _sigsetmask () from /usr/lib/libpthread.a(shr_xpg5.o) 2.1 Process 13894050 0xd0594fc8 in ?? () * 3.1 Process 11010474 thread_function (arg=warning: (Internal error: pc 0x0 in read in psymtab, but not in symtab.) 0x0) at /home/aditya/gdb_tests/ultimate-multi-thread-fork.c:32 ________________________________ From: Ulrich Weigand Sent: 30 November 2022 20:27 To: simark@simark.ca ; Aditya Kamath1 ; gdb-patches@sourceware.org Cc: Sangamesh Mallayya Subject: Re: [PATCH] 0001-Fix-multi-thread-debug-bug-in-AIX.patch Aditya Kamath1 wrote: >Before we had an extra thread 'process 100' apart from 'thread 1'. >So, in case someone interrupted a thread with ctrl+c.. In the >pd_update () even if we don't have thread who signalled this interrupt >when we return ptid_t (pid) it was fine. But now with no 'process 100' >and only 'thread 1', we need to take care of interrupt as well, >otherwise GDB core will take ptid_t(100) as a new process. >So the change >+ if (thrinf.ti_cursig == SIGTRAP || thrinf.ti_cursig == SIGINT) Hmm. So when "wait" returns, it needs to determine which thread triggered the event that caused ptrace to stop. On Linux, "wait" will actually return the LWP of that thread, so it can be directly used. It seems on AIX, "wait" only returns a PID, and you do not immediately know which thread caused the event? In that case, I can see why you'd have to consider SIGINT as well as SIGTRAP. However, it seems to me that even those two are not the *only* cases that can cause "wait" to return - doesn't *any* signal (potentially) trigger a ptrace intercept (causing wait to return)? But that's probably a more general problem, and wouldn't occur in this simple test case. >In this case if we return a ptid_t (pid) then gdb core treats it >as a new process since our parent thread is pthreaded GDB core >only knows threads like ptid_t (pid, tid, pthid) and not >ptid_t (pid). In order to avoid this, the proposed patch uses a >function call find_the_return_ptid () to figure out the same. >The changes like below are for this reason. > >ourstatus->set_forked (ptid_t (pid)); >-            return ptid_t (parent_pid); >+            return find_the_return_ptid (parent_pid); > >Kindly note that the control will always not come from aix-thread.c >for such events. Hence, we cannot take care of the same there, >though it will be a relief if we can do that. I'm not sure why it is necessary to handle this in the process layer (rs6000-aix-nat.c) instead of the thread layer (aix-thread.c). What specifically breaks if you do not have these rs6000-aix-nat.c changes? If you *do* need to handle LWPs (kernel thread IDs) in the process layer (this can be a reasonable choice, and it done by several other native targets), then it should be *consistent*, and *all* LWP handling should be in the process layer. In particular, under no circumstances does it make sense to duplicate the "find current/signalled thread" code in *both* the process any thread layers. >[Switching to process 16777620] This outputs inferior_ptid ... >0xd0595fb0 in _p_nsleep () from /usr/lib/libpthread.a(shr_xpg5.o) >(gdb) info threads > Id Target Id Frame >* 1.1 process 16777620 0xd0595fb0 in _p_nsleep () > from /usr/lib/libpthread.a(shr_xpg5.o) > 1.2 process 16777620 0xd0595fb0 in _p_nsleep () > from /usr/lib/libpthread.a(shr_xpg5.o) > 1.3 process 16777620 0xd0595fb0 in _p_nsleep () > from /usr/lib/libpthread.a(shr_xpg5.o) > 2.1 process 8323570 0xd0594fc8 in ?? () > 3.1 process 17957172 0xd0594fc8 in ?? () ... and this outputs the ptid values for those threads. If it says "process ...", then those ptid values have not properly been switched over to the (pid, lwp, tid) format. You should verify that the sync_threadlists code handles all multi-process cases correctly. I haven't looked at this in detail, but are you sure that here: @@ -841,8 +829,22 @@ sync_threadlists (int pid) } else if (cmp_result > 0) { - delete_thread (gbuf[gi]); - gi++; + if (gptid.is_pid ()) + { + thread_change_ptid (proc_target, gptid, pptid); you never accidentally switch the *pid* part (if "gptid" belows to a different pid than "pptid")? Bye, Ulrich