Hi Ulrich, Tom and community, Thank you for all the guidance in this bug fix so far. So we have two problems to fix when we follow the child process before we commit which I felt needed a discussion before we make any adjustments in our AIX target code. Kindly find attached the patch where I have the latest version with me. I have removed the thread_info parameter in the call backs since you mentioned that pthdb_user_t is what is defined in the OS. Consider the program below and the output as well for the explanation. So what is happening is that the when after a new process is born, its pthread library is getting intialised and we have changed its ptid from ptid (pid, 0, 0) to ptid (pid, 0, tid). Since we follow fork the code in inferior.c file will switch to the thread child where the child is reported as ptid (pid, 0, 0) but exists as ptid (pid, 0, tid). This leads to this crash. We did try with two variables if you recall in the previous patch. But your point of pd_active being there for it was valid. So somehow something isn't correct that I did not understand. We have pd_activate () in only two places. So is follow_fork () is expecting us to switch to child process and then change the ptid of the child?? If yes, how do we go?? And if not where are we going wrong here. Also this ptid_t (pid, 0, 0) and our main thread being ptid_t (pid, 0, tid) might need a smarted way to switch to the main thread's process space and set the right current inferior process in pdc_read_memory. Kindly check it in this patch and let me know if we can do it better. I have done a nicer way in fetch_registers (). That is why I had changed the first parameter in previous patch version of pdc_read_memory (). So these are the only places that is blocking us from committing this change. Rest are same as before and remaing cases work correctly. -------------------------- >If you no longer use switch_to_thread, can't you then continue to >use ptid_t (user_current_pid)? So the reason is stated in the first paragraph of this email. >B.t.w. the fact that the message is now getting so long is >actually an indicator that it might be preferable to break >the patch out into multiple commits - the solib change is >one obvious stand-alone patch, and maybe it would make sense >to split off the aix-thread changes also needed for single- >inferior debugging from the multi-inferior support changes. >Given that we've already spent a long time on this patch, >I'm not insisting on this change - but something to keep >in mind for future patches. Yes, I have changed the commit message. Thank you. Sure will learn the art of splitting patches. -------------------------- Waiting for a reply soon. Have a nice day ahead. With curiosity and regards, Aditya. ----------------------------------------- Code :- #include #include #include #include #include pthread_barrier_t barrier; #define NUM_THREADS 2 void * thread_function (void *arg) { /* This ensures that the breakpoint is only hit after both threads are created, so the test can always switch to the non-event thread when the breakpoint triggers. */ pthread_barrier_wait (&barrier); pid_t child; child = fork (); if (child > 0) printf ("I am parent \n"); else { child = fork (); if (child > 0) printf ("I am child \n"); else printf ("I am grandchild \n"); } while (1); /* break here */ } int main (void) { int i; pthread_t thread[NUM_THREADS]; alarm (300); pthread_barrier_init (&barrier, NULL, NUM_THREADS); for (i = 0; i < NUM_THREADS; i++) { int res; res = pthread_create (&thread[i], NULL, thread_function, NULL); assert (res == 0); } while (1) { sleep (15); } return 0; } ----------------------------------------- problematic output:- Reading symbols from /home/aditya/gdb_tests/ultimate-multi-thread-fork... (gdb) set follow-fork-mode child (gdb) set detach-on-fork off (gdb) r Starting program: /home/aditya/gdb_tests/ultimate-multi-thread-fork [New Thread 258] [New Thread 515] [Attaching after Thread 515 fork to child process 14090746] [New inferior 2 (process 14090746)] thread.c:1337: internal-error: switch_to_thread: Assertion `thr != NULL' failed. A problem internal to GDB has been detected, ----------------------------------------- Procstack for the crash. ---------- tid# 33554919 (pthread ID: 1) ---------- 0x090000000016a474 __fd_poll(??, ??, ??) + 0xb4 0x0000000100674954 poll(0x1104b6e70, 0x4, 0xffffffffffffffff) + 0x2c 0x0000000100675f5c gdb_wait_for_event(int)(0x1ffffe9c8) + 0xd0 0x0000000100674df8 gdb_do_one_event(int)(0xffffffff1072b9d0) + 0x1d4 0x00000001000093b4 gdb_readline_wrapper(char const*)(0x11072b9d0) + 0xec 0x0000000100038e2c defaulted_query(char const*, char, char*)(0x101742c80, 0x101ab0850, 0xfffffffffffeca8) + 0x370 0x0000000100039218 query(char const*, ...)(0x101742c80) + 0x4c 0x00000001000375b0 internal_vproblem(internal_problem*, char const*, int, char const*, char*)(0x110002540, 0x10190eb30, 0x53910166090, 0x10190ea30, 0xfffffffffffef08) + 0x378 0x0000000100037a14 internal_verror(char const*, int, char const*, char*)(0x10190eb30, 0x539ffffed10, 0x10190ea30, 0xfffffffffffef08) + 0x40 0x00000001000365b0 internal_error_loc(char const*, int, char const*, ...)(0x10190eb30, 0x53900000000, 0x10190ea30) + 0x58 0x00000001005a6ddc switch_to_thread(thread_info*)(0x0) + 0x48 0x00000001003767cc follow_fork()() + 0x4c8 0x0000000100385f68 handle_inferior_event(execution_control_state*)(0xffffffffffff3a8) + 0xda8 0x0000000100380e40 fetch_inferior_event()() + 0x2f8 0x0000000100a1ef04 inferior_event_handler(inferior_event_type)(0x10207830) + 0x38 0x00000001003926fc infrun_async_inferior_event_handler(void*)(0x0) + 0x30 0x00000001006786c4 check_async_event_handlers()() + 0x94 0x0000000100674cd8 gdb_do_one_event(int)(0xfffffffffffff840) + 0xb4 0x0000000100001dcc start_event_loop()() + 0x28 0x0000000100001fd4 captured_command_loop()() + 0x58 0x000000010000414c captured_main(void*)(0xffffffffffffa60) + 0x2c 0x0000000100004220 gdb_main(captured_main_args*)(0xffffffffffffa60) + 0x20 0x0000000100000a9c main(0x200000000, 0xffffffffffffb00) + 0x58 ________________________________ From: Ulrich Weigand Sent: 14 February 2023 00:31 To: simark@simark.ca ; Aditya Kamath1 ; gdb-patches@sourceware.org Cc: Sangamesh Mallayya Subject: Re: [PATCH] 0001-Fix-multi-thread-debug-bug-in-AIX.patch Aditya Kamath1 wrote: >Also find attached a document that I have proposed as a commit message. Thanks for the details. However, this is too much detail for a commit message - you don't need to include full debug sessions documenting how you found a bug; instead, please concisely summarize *what* the bug *is*, and how (at a high level) the patch fixes the bug. To be more specific, I'd write a commit message for this patch somewhat along these lines: Fix multi-threaded debugging under AIX Multi-threaded debugging using the libpthdebug debug interface is currently broken due to multiple issues. When debugging a single inferior, we were getting assertion failures in get_aix_thread_info as no tp->priv structure was allocated for the main thread. Fix this by switching the main thread from a (pid, 0, 0) ptid_t to a (pid, 0, tid) ptid_t and allocaing the tp->priv structure in sync_threadlists. As a result, the switch_to_thread call in pdc_read_data could now fail since the main thread no longer uses (pid, 0, 0). Replace the call by only switching inferior_ptid, the current inferior, and the current address space (like proc-service.c). Add similar switching to pdc_write_data where it was missing completely. When debugging multiple inferiors, an additional set of problems prevented correct multi-threaded debugging: First of all, aix-thread.c used to have a number of global variables holding per-inferior information. Switch these to a per-inferior data structure instead. Also, sync_threadlists was getting confused as we were comparing the list of threads returned by libpthdebug for *one* process with GDB's list of threads for *all* processes. Only use the GDB threads of the current inferior instead. Finally, the presence of the thread library in any but the first inferior was not correctly detected due to a bug in solib-aix.c, where the BFD file name for shared library members was changed when the library was loaded for the first time, which caused the library to no longer be recognized by name when loaded a second time, (I'm not saying you need to use this exact message - maybe I'm missing something or getting some details wrong. But this is the style you should be roughly going for.) B.t.w. the fact that the message is now getting so long is actually an indicator that it might be preferable to break the patch out into multiple commits - the solib change is one obvious stand-alone patch, and maybe it would make sense to split off the aix-thread changes also needed for single- inferior debugging from the multi-inferior support changes. Given that we've already spent a long time on this patch, I'm not insisting on this change - but something to keep in mind for future patches. Some additional comments on your latest changes: >However, we also need to keep in mind that before we think this will >work, our libpthread library is only ready when the following condition >in the wait () of aix-thread.c is satisfied. > >/* Check whether libpthdebug might be ready to be initialized. */ > if (!data->pd_active && status->kind () == TARGET_WAITKIND_STOPPED > && status->sig () == GDB_SIGNAL_TRAP) > >Until then changing the “process ” to “thread ” is incorrect. >Even though the session is ready and initalised via pd_enable () and >pd_activate () >functions respectively. Therfore this made us to keep >a variable pthdebugready in all functions that lead to sync_threadlists () >so that we change the process thread to a thread with private data only >when libpthdebug is initialised for a particular process. I do not understand this change. The ->pd_active flag is supposed to track exactly that information, why do we need to duplicate it into yet another flag? Note that the point of the the "if" block above is that *it is calling pd_activacte()*, which will set ->pd_active if the library is in fact ready to be used. If there's anything wrong that causes pd_active to be set when the thread library is, in fact, not yet active, that's a bug we need to find and fix. Also, as long as the thread library is not ready, we should not be calling sync_threadlists in the first place. >So why did we make this change >- if (user_current_pid != 0) >- switch_to_thread (current_inferior ()->process_target (), >- ptid_t (user_current_pid)); > in pdc_read_data and change our user variable which was the process > ID to a thread? Wasn’t it already doing the job? >This now will use the ptid_t (user_current_pid) to switch the thread (). >However, our parent process or main thread of it, is threaded i.e is ptid_t >(user_current_pid, 0, tid). Hence, we will crash with an assertion >failure that thread ptid_t (user_current_pid) has not been found. Ah, I see. That makes sense. >-static int pdc_read_data (pthdb_user_t, void *, pthdb_addr_t, size_t); >-static int pdc_write_data (pthdb_user_t, void *, pthdb_addr_t, size_t); >+static int pdc_read_data (thread_info*, void *, pthdb_addr_t, size_t); >+static int pdc_write_data (thread_info*, void *, pthdb_addr_t, size_t); These changes are also confusing. First of all, my understanding has been that the signature of these functions is fixed by the OS, since they are passed as callbacks to pthdb_session_init. This means you cannot just go and change them arbitrarily ... In addition, I'm not sure the change makes sense semantically. Note that you create one pd_session object *per inferior*, not one per thread. The pthdb_user_t identifies the pd_session, so it doesn't make sense to use the thread_info pointer as pthdb_user_t, even if that were possible from an API perspective. What was the reason for not just continuing to use the PID here? >+ scoped_restore save_inferior_ptid = make_scoped_restore (&inferior_ptid); > /* Before the first inferior is added, we pass inferior_ptid.pid () > from pd_enable () which is 0. There is no need to switch threads > during first initialisation. In the rest of the callbacks the > current thread needs to be correct. */ >- if (user_current_pid != 0) >- switch_to_thread (current_inferior ()->process_target (), >- ptid_t (user_current_pid)); >+ inferior_ptid = user_current_thread->ptid; If you no longer use switch_to_thread, can't you then continue to use ptid_t (user_current_pid)? This is only used during the target_read_memory call, which should go down to the process target, which doesn't require TIDs. >+ tp = find_thread_ptid (proc_target, ptid_t (pid)); >+ >+ /* If the pthreadlibrary is not ready to debug >+ then this is just a main process which needs >+ a priv to be set. The if condition below does >+ the same. Otherwise we go to the for loop to >+ sync the pthread and GDB thread lists. */ This goes back to my question above, if the library is not yet ready, first of all we should never even get here, and second, all PTIDs should still be PID-only and nobody should ever look for any aix_thread_info ... > static ptid_t >-pd_activate (int pid) >+pd_activate (pid_t pid, int pthdebugready) I don't understand this flag - the point of this function is to *find out whether* the library is ready - either pthdb_session_init succeeds (and thus the library is ready) or it fails (and thus the library is not ready). > /* If we're debugging a core file or an attached inferior, the > pthread library may already have been initialized, so try to > activate thread debugging. */ >- pd_activate (inferior_ptid.pid ()); >+ pd_activate (inferior_ptid.pid (), 0); I guess this is the point? As the comment says, this should only ever make any difference for core files or attaching to an already running inferior, never for starting up an inferior under GDB. If this isn't correct, we need to understand why. Bye, Ulrich