public inbox for gdb-patches@sourceware.org
 help / color / mirror / Atom feed
* [PATCH 00/11] Linux: Fix issues around thread group leader exits
@ 2022-03-03 14:40 Pedro Alves
  2022-03-03 14:40 ` [PATCH 01/11] Fix gdbserver/linux target_waitstatus logging assert Pedro Alves
                   ` (10 more replies)
  0 siblings, 11 replies; 23+ messages in thread
From: Pedro Alves @ 2022-03-03 14:40 UTC (permalink / raw)
  To: gdb-patches

This series addresses some problems around thread group leader exits
on GNU/Linux (native and gdbserver):

 - fixes a race in zombie leader detection
 - makes sure only the leader's exit status is interpreted as the whole-process exit status
 - makes sure the whole-process exit is always the last event reported

Tested on GNU/Linux x86-64, native and extended-remote gdbserver.  New
testcase added at the end of the series.

Lancelot SIX (2):
  Ensure EXIT is last event, gdb/linux
  Process exit status is leader exit status testcase

Pedro Alves (9):
  Fix gdbserver/linux target_waitstatus logging assert
  Fix gdb.threads/clone-new-thread-event.exp race
  Fix gdb.threads/current-lwp-dead.exp race
  gdb: Reorganize linux_nat_filter_event
  gdbserver: Reorganize linux_process_target::filter_event
  gdbserver: Reindent check_zombie_leaders
  Re-add zombie leader on exit, gdb/linux
  Re-add zombie leader on exit, gdbserver/linux
  Ensure EXIT is last event, gdbserver/linux

 gdb/linux-nat.c                               | 226 ++++++++++----
 .../gdb.threads/clone-new-thread-event.c      |  16 +-
 .../gdb.threads/clone-new-thread-event.exp    |   2 +
 gdb/testsuite/gdb.threads/current-lwp-dead.c  | 101 ++++---
 .../gdb.threads/current-lwp-dead.exp          |  23 +-
 ...rocess-exit-status-is-leader-exit-status.c |  64 ++++
 ...cess-exit-status-is-leader-exit-status.exp |  45 +++
 gdbserver/linux-low.cc                        | 281 +++++++++++-------
 gdbserver/linux-low.h                         |  15 +-
 9 files changed, 563 insertions(+), 210 deletions(-)
 create mode 100644 gdb/testsuite/gdb.threads/process-exit-status-is-leader-exit-status.c
 create mode 100644 gdb/testsuite/gdb.threads/process-exit-status-is-leader-exit-status.exp


base-commit: c2b167b3d601dd20de0640e2fbd46835599a1537
-- 
2.26.2


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 01/11] Fix gdbserver/linux target_waitstatus logging assert
  2022-03-03 14:40 [PATCH 00/11] Linux: Fix issues around thread group leader exits Pedro Alves
@ 2022-03-03 14:40 ` Pedro Alves
  2022-03-03 14:40 ` [PATCH 02/11] Fix gdb.threads/clone-new-thread-event.exp race Pedro Alves
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Pedro Alves @ 2022-03-03 14:40 UTC (permalink / raw)
  To: gdb-patches

Turning on debug output in gdbserver leads to an assertion failure if
gdbserver reports a non-signal event:

    [threads] wait_1: LWP 3273770: extended event with waitstatus status->kind = EXECD, execd_pathname = gdb.threads/non-ldr-exc-1/non-ldr-exc-1
    [threads] wait_1: Hit a non-gdbserver trap event.
  ../../src/gdbserver/../gdb/target/waitstatus.h:365: A problem internal to GDBserver has been detected.
  sig: Assertion `m_kind == TARGET_WAITKIND_STOPPED || m_kind == TARGET_WAITKIND_SIGNALLED' failed.

Fix it in the obvious way, using target_waitstatus::to_string(),
resulting in, for example:

  [threads] wait_1: ret = LWP 1542412.1542412, status->kind = STOPPED, sig = GDB_SIGNAL_TRAP
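
For the record, the reason the assert fires: target_waitstatus::sig ()
is only meaningful for stopped/signalled events, so logging it
unconditionally trips the assertion for any other event kind.  A
minimal C sketch of the shape of that check (names mirror the
assertion text above; this is not the actual waitstatus.h source):

  #include <assert.h>

  enum target_waitkind { TARGET_WAITKIND_STOPPED,
                         TARGET_WAITKIND_SIGNALLED,
                         TARGET_WAITKIND_EXECD };

  struct target_waitstatus { enum target_waitkind m_kind; int m_sig; };

  static int
  waitstatus_sig (const struct target_waitstatus *ws)
  {
    /* Only stop/signal events carry a signal, so asking for one on
       an EXECD event asserts, exactly as in the log above.  */
    assert (ws->m_kind == TARGET_WAITKIND_STOPPED
            || ws->m_kind == TARGET_WAITKIND_SIGNALLED);
    return ws->m_sig;
  }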

Change-Id: Ia4832f9b4fa39f4af67fcaf21fd4d909a285a645
---
 gdbserver/linux-low.cc | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/gdbserver/linux-low.cc b/gdbserver/linux-low.cc
index 301e42a36f3..e7c01420ae3 100644
--- a/gdbserver/linux-low.cc
+++ b/gdbserver/linux-low.cc
@@ -3493,9 +3493,9 @@ linux_process_target::wait_1 (ptid_t ptid, target_waitstatus *ourstatus,
 
   gdb_assert (step_over_bkpt == null_ptid);
 
-  threads_debug_printf ("ret = %s, %d, %d",
+  threads_debug_printf ("ret = %s, %s",
 			target_pid_to_str (ptid_of (current_thread)).c_str (),
-			ourstatus->kind (), ourstatus->sig ());
+			ourstatus->to_string ().c_str ());
 
   if (ourstatus->kind () == TARGET_WAITKIND_EXITED)
     return filter_exit_event (event_child, ourstatus);
-- 
2.26.2


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 02/11] Fix gdb.threads/clone-new-thread-event.exp race
  2022-03-03 14:40 [PATCH 00/11] Linux: Fix issues around thread group leader exits Pedro Alves
  2022-03-03 14:40 ` [PATCH 01/11] Fix gdbserver/linux target_waitstatus logging assert Pedro Alves
@ 2022-03-03 14:40 ` Pedro Alves
  2022-03-03 14:40 ` [PATCH 03/11] Fix gdb.threads/current-lwp-dead.exp race Pedro Alves
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Pedro Alves @ 2022-03-03 14:40 UTC (permalink / raw)
  To: gdb-patches

If we make GDB report the process EXIT event for the leader thread,
instead of whatever is the last thread in the LWP list, as will be
done in a later patch of this series, then
gdb.threads/clone-new-thread-event.exp starts failing:

 (gdb) FAIL: gdb.threads/clone-new-thread-event.exp: catch SIGUSR1 (the program exited)

This is a testcase race -- the main thread does not wait for the
spawned clone "thread" to finish before exiting, so the main program
may exit before the second thread is scheduled and reports its
SIGUSR1.  With the change to make GDB report the EXIT for the leader,
the race is 100% reproducible by adding a sleep(), like so:

 --- c/gdb/testsuite/gdb.threads/clone-new-thread-event.c
 +++ w/gdb/testsuite/gdb.threads/clone-new-thread-event.c
 @@ -51,6 +51,7 @@ local_gettid (void)
  static int
  fn (void *unused)
  {
 +  sleep (1);
    tkill (local_gettid (), SIGUSR1);
    return 0;
  }

Resulting in:

 Breakpoint 1, main (argc=1, argv=0x7fffffffd418) at gdb.threads/clone-new-thread-event.c:65
 65        stack = malloc (STACK_SIZE);
 (gdb) continue
 Continuing.
 [New LWP 3715562]
 [Inferior 1 (process 3715555) exited normally]
 (gdb) FAIL: gdb.threads/clone-new-thread-event.exp: catch SIGUSR1 (the program exited)

The inferior exit reported is actually correct.  The main thread has
indeed exited, and that's the thread that has the right exit code to
report to the user, as that's the exit code that is reported to the
program's parent.  In this case, GDB managed to collect the exit code
for the leader thread before reaping the other thread, because in
reality, the testcase isn't creating standard threads, it is using raw
clone, and the new clones are put in their own thread group.
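
As an aside on the exit-code point, here is a minimal standalone
sketch (not part of the testcase) showing that what a parent collects
with waitpid is the thread group's -- i.e. the leader's -- exit
status:

  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int
  main (void)
  {
    pid_t pid = fork ();
    if (pid == 0)
      exit (42);	/* Group exit; this is what the parent sees.  */

    int status;
    waitpid (pid, &status, 0);
    if (WIFEXITED (status))
      printf ("child exited with %d\n", WEXITSTATUS (status)); /* 42 */
    return 0;
  }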

Fix it by making the main thread wait for the child to exit.  Also,
run the program to completion for completeness.

Change-Id: I315cd3dc2b9e860395dcab9658341ea868d7a6bf
---
 .../gdb.threads/clone-new-thread-event.c         | 16 +++++++++++++++-
 .../gdb.threads/clone-new-thread-event.exp       |  2 ++
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/gdb/testsuite/gdb.threads/clone-new-thread-event.c b/gdb/testsuite/gdb.threads/clone-new-thread-event.c
index 800514d0b6d..9e7ceb97a65 100644
--- a/gdb/testsuite/gdb.threads/clone-new-thread-event.c
+++ b/gdb/testsuite/gdb.threads/clone-new-thread-event.c
@@ -26,6 +26,7 @@
 #include <sys/types.h>
 #include <unistd.h>
 #include <sys/syscall.h>
+#include <sys/wait.h>
 
 #include <features.h>
 #ifdef __UCLIBC__
@@ -59,7 +60,7 @@ int
 main (int argc, char **argv)
 {
   unsigned char *stack;
-  int new_pid;
+  int new_pid, status, ret;
 
   stack = malloc (STACK_SIZE);
   assert (stack != NULL);
@@ -71,5 +72,18 @@ main (int argc, char **argv)
 		   , NULL, NULL, NULL, NULL);
   assert (new_pid > 0);
 
+  /* Note the clone call above didn't use CLONE_THREAD, so it actually
+     put the new thread in a new thread group.  However, the new clone
+     is still reported with PTRACE_EVENT_CLONE to GDB, since we didn't
+     use CLONE_VFORK (results in PTRACE_EVENT_VFORK) nor set the
+     termination signal to SIGCHLD (results in PTRACE_EVENT_FORK), so
+     GDB thinks of it as a new thread of the same inferior.  It's a
+     bit of an odd setup, but it's not important for what we're
+     testing, and it lets us conveniently use waitpid to wait for
+     the clone, which you can't with CLONE_THREAD.  */
+  ret = waitpid (new_pid, &status, __WALL);
+  assert (ret == new_pid);
+  assert (WIFSIGNALED (status) && WTERMSIG (status) == SIGUSR1);
+
   return 0;
 }
diff --git a/gdb/testsuite/gdb.threads/clone-new-thread-event.exp b/gdb/testsuite/gdb.threads/clone-new-thread-event.exp
index fdebd488d8a..db5ae39aded 100644
--- a/gdb/testsuite/gdb.threads/clone-new-thread-event.exp
+++ b/gdb/testsuite/gdb.threads/clone-new-thread-event.exp
@@ -31,3 +31,5 @@ if { ![runto_main] } {
 gdb_test "continue" \
     "Thread 2 received signal SIGUSR1, User defined signal 1.*" \
     "catch SIGUSR1"
+
+gdb_continue_to_end "" continue 1
-- 
2.26.2


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 03/11] Fix gdb.threads/current-lwp-dead.exp race
  2022-03-03 14:40 [PATCH 00/11] Linux: Fix issues around thread group leader exits Pedro Alves
  2022-03-03 14:40 ` [PATCH 01/11] Fix gdbserver/linux target_waitstatus logging assert Pedro Alves
  2022-03-03 14:40 ` [PATCH 02/11] Fix gdb.threads/clone-new-thread-event.exp race Pedro Alves
@ 2022-03-03 14:40 ` Pedro Alves
  2022-03-03 14:40 ` [PATCH 04/11] gdb: Reorganize linux_nat_filter_event Pedro Alves
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Pedro Alves @ 2022-03-03 14:40 UTC (permalink / raw)
  To: gdb-patches

If we make GDB report the process EXIT event for the leader thread, as
will be done in a later patch of this series, then
gdb.threads/current-lwp-dead.exp starts failing:

 (gdb) break fn_return
 Breakpoint 2 at 0x5555555551b5: file /home/pedro/rocm/gdb/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/current-lwp-dead.c, line 45.
 (gdb) continue
 Continuing.
 [New LWP 2138466]
 [Inferior 1 (process 2138459) exited normally]
 (gdb) FAIL: gdb.threads/current-lwp-dead.exp: continue to breakpoint: fn_return (the program exited)

The inferior exit reported is actually correct.  The main thread has
indeed exited, and that's the thread that has the right exit code to
report to the user, as that's the exit code that is reported to the
program's parent.  In this case, GDB managed to collect the exit code
for the leader thread before reaping the other thread, because in
reality, the testcase isn't creating standard threads, it is using raw
clone, and the new clones are put in their own thread group.

Fix it by making the main "thread" not exit until the scenario we're
exercising plays out.  Also, run the program to completion for
completeness.

The original program really wanted the leader thread to exit before
the fn_return function was reached -- it was important that the
current thread, as pointed to by inferior_ptid, was gone when infrun got
the breakpoint event.  I've tweaked the testcase to ensure that that
condition still holds, though it is no longer the main thread that
exits.  This required a bit of synchronization between the threads,
which meant using CLONE_VM unconditionally.  The #ifdef guards were
added as a fix for
https://sourceware.org/bugzilla/show_bug.cgi?id=11214, though I don't
think they were necessary because the program is not using TLS.  If it
turns out they were necessary, we can link the testcase with "-z now"
instead, which was mentioned as an alternative workaround in that
Bugzilla.
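
For reference, a minimal standalone sketch (assuming Linux and glibc;
this is not the testcase itself) of why CLONE_VM matters for the flag
handshake: without it, the child writes to its own copy of the
address space and the parent's spin loop never sees the store.

  #define _GNU_SOURCE
  #include <assert.h>
  #include <sched.h>
  #include <signal.h>
  #include <stdlib.h>
  #include <unistd.h>

  #define STACK_SIZE 0x1000

  volatile int flag = 0;

  static int
  child_fn (void *unused)
  {
    flag = 1;	/* Visible to the parent only because of CLONE_VM.  */
    return 0;
  }

  int
  main (void)
  {
    char *stack = malloc (STACK_SIZE);
    assert (stack != NULL);

    int pid = clone (child_fn, stack + STACK_SIZE, CLONE_VM | SIGCHLD,
                     NULL);
    assert (pid > 0);

    while (!flag)	/* Without CLONE_VM, this spins forever.  */
      usleep (1);
    return 0;
  }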

Change-Id: I7be2f0da4c2fe8f80a60bdde5e6c623d8bd5a0aa
---
 gdb/testsuite/gdb.threads/current-lwp-dead.c  | 101 ++++++++++++------
 .../gdb.threads/current-lwp-dead.exp          |  23 +++-
 2 files changed, 87 insertions(+), 37 deletions(-)

diff --git a/gdb/testsuite/gdb.threads/current-lwp-dead.c b/gdb/testsuite/gdb.threads/current-lwp-dead.c
index 76babc42ce8..ceb3ae47a4d 100644
--- a/gdb/testsuite/gdb.threads/current-lwp-dead.c
+++ b/gdb/testsuite/gdb.threads/current-lwp-dead.c
@@ -15,6 +15,18 @@
    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <http://www.gnu.org/licenses/>.
 
+
+   The original issue we're trying to test is described in this
+   thread:
+
+     https://sourceware.org/legacy-ml/gdb-patches/2009-06/msg00802.html
+
+   The NEW_THREAD_EVENT code the comments below refer to no longer
+   exists in GDB, so the following comments are kept for historical
+   reasons, and to guide future updates to the testcase.
+
+   ---
+
    Do not use threads as we need to exploit a bug in LWP code masked by the
    threads code otherwise.
 
@@ -29,60 +41,81 @@
 #include <assert.h>
 #include <unistd.h>
 #include <stdlib.h>
-
-#include <features.h>
-#ifdef __UCLIBC__
-#if !(defined(__UCLIBC_HAS_MMU__) || defined(__ARCH_HAS_MMU__))
-#define HAS_NOMMU
-#endif
-#endif
+#include <sys/types.h>
+#include <sys/wait.h>
 
 #define STACK_SIZE 0x1000
 
-static int
-fn_return (void *unused)
-{
-  return 0;	/* at-fn_return */
-}
+/* True if the 'fn_return' thread has been reached at the point after
+   its parent is already gone.  */
+volatile int fn_return_reached = 0;
+
+/* True if the 'fn' thread has exited.  */
+volatile int fn_exited = 0;
+
+/* Wrapper around clone.  */
 
 static int
-fn (void *unused)
+do_clone (int (*fn)(void *))
 {
-  int i;
   unsigned char *stack;
   int new_pid;
 
-  i = sleep (1);
-  assert (i == 0);
-
   stack = malloc (STACK_SIZE);
   assert (stack != NULL);
 
-  new_pid = clone (fn_return, stack + STACK_SIZE, CLONE_FILES
-#if defined(__UCLIBC__) && defined(HAS_NOMMU)
-		   | CLONE_VM
-#endif /* defined(__UCLIBC__) && defined(HAS_NOMMU) */
-		   , NULL, NULL, NULL, NULL);
+  new_pid = clone (fn, stack + STACK_SIZE, CLONE_FILES | CLONE_VM,
+		   NULL, NULL, NULL, NULL);
   assert (new_pid > 0);
 
+  return new_pid;
+}
+
+static int
+fn_return (void *unused)
+{
+  /* Wait until our direct parent exits.  We want the breakpoint set a
+     couple lines below to hit with the previously-selected thread
+     gone.  */
+  while (!fn_exited)
+    usleep (1);
+
+  fn_return_reached = 1; /* at-fn_return */
+  return 0;
+}
+
+static int
+fn (void *unused)
+{
+  do_clone (fn_return);
   return 0;
 }
 
 int
 main (int argc, char **argv)
 {
-  unsigned char *stack;
-  int new_pid;
-
-  stack = malloc (STACK_SIZE);
-  assert (stack != NULL);
-
-  new_pid = clone (fn, stack + STACK_SIZE, CLONE_FILES
-#if defined(__UCLIBC__) && defined(HAS_NOMMU)
-		   | CLONE_VM
-#endif /* defined(__UCLIBC__) && defined(HAS_NOMMU) */
-		   , NULL, NULL, NULL, NULL);
-  assert (new_pid > 0);
+  int new_pid, status, ret;
+
+  new_pid = do_clone (fn);
+
+  /* Note the clone call above didn't use CLONE_THREAD, so it actually
+     put the new child in a new thread group.  However, the new clone
+     is still reported with PTRACE_EVENT_CLONE to GDB, since we didn't
+     use CLONE_VFORK (results in PTRACE_EVENT_VFORK) nor set the
+     termination signal to SIGCHLD (results in PTRACE_EVENT_FORK), so
+     GDB thinks of it as a new thread of the same inferior.  It's a
+     bit of an odd setup, but it's not important for what we're
+     testing, and it lets us conveniently use waitpid to wait for
+     the child, which you can't with CLONE_THREAD.  */
+  ret = waitpid (new_pid, &status, __WALL);
+  assert (ret == new_pid);
+  assert (WIFEXITED (status) && WEXITSTATUS (status) == 0);
+
+  fn_exited = 1;
+
+  /* Don't exit before the breakpoint at fn_return triggers.  */
+  while (!fn_return_reached)
+    usleep (1);
 
   return 0;
 }
diff --git a/gdb/testsuite/gdb.threads/current-lwp-dead.exp b/gdb/testsuite/gdb.threads/current-lwp-dead.exp
index b69fdbb5988..6728dbe87ab 100644
--- a/gdb/testsuite/gdb.threads/current-lwp-dead.exp
+++ b/gdb/testsuite/gdb.threads/current-lwp-dead.exp
@@ -15,8 +15,14 @@
 # You should have received a copy of the GNU General Public License
 # along with this program.  If not, see <http://www.gnu.org/licenses/>.
 
-# Please email any bugs, comments, and/or additions to this file to:
-# bug-gdb@gnu.org
+# Regression test for issue originally described here:
+#
+#  https://sourceware.org/legacy-ml/gdb-patches/2009-06/msg00802.html
+#
+# The relevant code has since been removed from GDB, but it doesn't
+# hurt to keep the testcase.
+
+standard_testfile
 
 # This only works on Linux targets.
 if ![istarget *-*-linux*] then {
@@ -31,5 +37,16 @@ if {[runto_main] <= 0} {
     return -1
 }
 
-gdb_breakpoint "fn_return"
+# Run to "fn" so that thread 2 is made current.
+gdb_breakpoint "fn"
+gdb_continue_to_breakpoint "fn" ".*do_clone.*"
+
+# Run to thread 3, at a point where thread 2 is gone.
+set line [gdb_get_line_number "at-fn_return"]
+gdb_breakpoint $line
 gdb_continue_to_breakpoint "fn_return" ".*at-fn_return.*"
+
+# Confirm thread 2 is really gone.
+gdb_test "info threads 2" "No threads match '2'\\."
+
+gdb_continue_to_end "" continue 1
-- 
2.26.2


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 04/11] gdb: Reorganize linux_nat_filter_event
  2022-03-03 14:40 [PATCH 00/11] Linux: Fix issues around thread group leader exits Pedro Alves
                   ` (2 preceding siblings ...)
  2022-03-03 14:40 ` [PATCH 03/11] Fix gdb.threads/current-lwp-dead.exp race Pedro Alves
@ 2022-03-03 14:40 ` Pedro Alves
  2022-03-03 14:40 ` [PATCH 05/11] gdbserver: Reorganize linux_process_target::filter_event Pedro Alves
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Pedro Alves @ 2022-03-03 14:40 UTC (permalink / raw)
  To: gdb-patches

Reorganize linux-nat.c:linux_nat_filter_event such that all the
handling for events for LWPs not in the LWP list is together.  This
helps make a following patch clearer.  The comments and debug messages
have also been tweaked - the end goal is to have them synchronized
with the gdbserver counterpart.

Change-Id: I8586d8dcd76d8bd3795145e3056fc660e3b2cd22
---
 gdb/linux-nat.c | 75 ++++++++++++++++++++++++++-----------------------
 1 file changed, 40 insertions(+), 35 deletions(-)

diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
index 13682fcff43..1555d3a79e3 100644
--- a/gdb/linux-nat.c
+++ b/gdb/linux-nat.c
@@ -2776,46 +2776,51 @@ linux_nat_filter_event (int lwpid, int status)
 
   lp = find_lwp_pid (ptid_t (lwpid));
 
-  /* Check for stop events reported by a process we didn't already
-     know about - anything not already in our LWP list.
-
-     If we're expecting to receive stopped processes after
-     fork, vfork, and clone events, then we'll just add the
-     new one to our list and go back to waiting for the event
-     to be reported - the stopped process might be returned
-     from waitpid before or after the event is.
-
-     But note the case of a non-leader thread exec'ing after the
-     leader having exited, and gone from our lists.  The non-leader
-     thread changes its tid to the tgid.  */
-
-  if (WIFSTOPPED (status) && lp == NULL
-      && (WSTOPSIG (status) == SIGTRAP && event == PTRACE_EVENT_EXEC))
+  /* Check for events reported by anything not in our LWP list.  */
+  if (lp == nullptr)
     {
-      /* A multi-thread exec after we had seen the leader exiting.  */
-      linux_nat_debug_printf ("Re-adding thread group leader LWP %d.", lwpid);
+      if (WIFSTOPPED (status))
+	{
+	  if (WSTOPSIG (status) == SIGTRAP && event == PTRACE_EVENT_EXEC)
+	    {
+	      /* A non-leader thread exec'ed after we've seen the
+		 leader zombie, and removed it from our lists (in
+		 check_zombie_leaders).  The non-leader thread changes
+		 its tid to the tgid.  */
+	      linux_nat_debug_printf
+		("Re-adding thread group leader LWP %d after exec.",
+		 lwpid);
 
-      lp = add_lwp (ptid_t (lwpid, lwpid));
-      lp->stopped = 1;
-      lp->resumed = 1;
-      add_thread (linux_target, lp->ptid);
-    }
+	      lp = add_lwp (ptid_t (lwpid, lwpid));
+	      lp->stopped = 1;
+	      lp->resumed = 1;
+	      add_thread (linux_target, lp->ptid);
+	    }
+	  else
+	    {
+	      /* A process we are controlling has forked and the new
+		 child's stop was reported to us by the kernel.  Save
+		 its PID and go back to waiting for the fork event to
+		 be reported - the stopped process might be returned
+		 from waitpid before or after the fork event is.  */
+	      linux_nat_debug_printf
+		("Saving LWP %d status %s in stopped_pids list",
+		 lwpid, status_to_str (status).c_str ());
+	      add_to_pid_list (&stopped_pids, lwpid, status);
+	    }
+	}
+      else
+	{
+	  /* Don't report an event for the exit of an LWP not in our
+	     list, i.e. not part of any inferior we're debugging.
+	     This can happen if we detach from a program we originally
+	     forked and then it exits.  */
+	}
 
-  if (WIFSTOPPED (status) && !lp)
-    {
-      linux_nat_debug_printf ("saving LWP %ld status %s in stopped_pids list",
-			      (long) lwpid, status_to_str (status).c_str ());
-      add_to_pid_list (&stopped_pids, lwpid, status);
-      return;
+      if (lp == nullptr)
+	return;
     }
 
-  /* Make sure we don't report an event for the exit of an LWP not in
-     our list, i.e. not part of the current process.  This can happen
-     if we detach from a program we originally forked and then it
-     exits.  */
-  if (!WIFSTOPPED (status) && !lp)
-    return;
-
   /* This LWP is stopped now.  (And if dead, this prevents it from
      ever being continued.)  */
   lp->stopped = 1;
-- 
2.26.2


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 05/11] gdbserver: Reorganize linux_process_target::filter_event
  2022-03-03 14:40 [PATCH 00/11] Linux: Fix issues around thread group leader exits Pedro Alves
                   ` (3 preceding siblings ...)
  2022-03-03 14:40 ` [PATCH 04/11] gdb: Reorganize linux_nat_filter_event Pedro Alves
@ 2022-03-03 14:40 ` Pedro Alves
  2022-03-03 14:40 ` [PATCH 06/11] gdbserver: Reindent check_zombie_leaders Pedro Alves
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Pedro Alves @ 2022-03-03 14:40 UTC (permalink / raw)
  To: gdb-patches

Reorganize linux-low.cc:linux_process_target::filter_event such that
all the handling for events for LWPs not in the LWP list is together.
This helps make a following patch clearer.  The comments and debug
messages have also been tweaked to have them synchronized with the GDB
counterpart.

Change-Id: If9019635f63a846607cfda44b454b4254a404019
---
 gdbserver/linux-low.cc | 76 ++++++++++++++++++++++--------------------
 1 file changed, 40 insertions(+), 36 deletions(-)

diff --git a/gdbserver/linux-low.cc b/gdbserver/linux-low.cc
index e7c01420ae3..2d14a0254c7 100644
--- a/gdbserver/linux-low.cc
+++ b/gdbserver/linux-low.cc
@@ -2148,46 +2148,50 @@ linux_process_target::filter_event (int lwpid, int wstat)
 
   child = find_lwp_pid (ptid_t (lwpid));
 
-  /* Check for stop events reported by a process we didn't already
-     know about - anything not already in our LWP list.
-
-     If we're expecting to receive stopped processes after
-     fork, vfork, and clone events, then we'll just add the
-     new one to our list and go back to waiting for the event
-     to be reported - the stopped process might be returned
-     from waitpid before or after the event is.
-
-     But note the case of a non-leader thread exec'ing after the
-     leader having exited, and gone from our lists (because
-     check_zombie_leaders deleted it).  The non-leader thread
-     changes its tid to the tgid.  */
-
-  if (WIFSTOPPED (wstat) && child == NULL && WSTOPSIG (wstat) == SIGTRAP
-      && linux_ptrace_get_extended_event (wstat) == PTRACE_EVENT_EXEC)
+  /* Check for events reported by anything not in our LWP list.  */
+  if (child == nullptr)
     {
-      ptid_t child_ptid;
-
-      /* A multi-thread exec after we had seen the leader exiting.  */
-      threads_debug_printf ("Re-adding thread group leader LWP %d after exec.",
-			    lwpid);
+      if (WIFSTOPPED (wstat))
+	{
+	  if (WSTOPSIG (wstat) == SIGTRAP
+	      && linux_ptrace_get_extended_event (wstat) == PTRACE_EVENT_EXEC)
+	    {
+	      /* A non-leader thread exec'ed after we've seen the
+		 leader zombie, and removed it from our lists (in
+		 check_zombie_leaders).  The non-leader thread changes
+		 its tid to the tgid.  */
+	      threads_debug_printf
+		("Re-adding thread group leader LWP %d after exec.",
+		 lwpid);
 
-      child_ptid = ptid_t (lwpid, lwpid);
-      child = add_lwp (child_ptid);
-      child->stopped = 1;
-      switch_to_thread (child->thread);
-    }
+	      child = add_lwp (ptid_t (lwpid, lwpid));
+	      child->stopped = 1;
+	      switch_to_thread (child->thread);
+	    }
+	  else
+	    {
+	      /* A process we are controlling has forked and the new
+		 child's stop was reported to us by the kernel.  Save
+		 its PID and go back to waiting for the fork event to
+		 be reported - the stopped process might be returned
+		 from waitpid before or after the fork event is.  */
+	      threads_debug_printf
+		("Saving LWP %d status %s in stopped_pids list",
+		 lwpid, status_to_str (wstat).c_str ());
+	      add_to_pid_list (&stopped_pids, lwpid, wstat);
+	    }
+	}
+      else
+	{
+	  /* Don't report an event for the exit of an LWP not in our
+	     list, i.e. not part of any inferior we're debugging.
+	     This can happen if we detach from a program we originally
+	     forked and then it exits.  */
+	}
 
-  /* If we didn't find a process, one of two things presumably happened:
-     - A process we started and then detached from has exited.  Ignore it.
-     - A process we are controlling has forked and the new child's stop
-     was reported to us by the kernel.  Save its PID.  */
-  if (child == NULL && WIFSTOPPED (wstat))
-    {
-      add_to_pid_list (&stopped_pids, lwpid, wstat);
-      return;
+      if (child == nullptr)
+	return;
     }
-  else if (child == NULL)
-    return;
 
   thread = get_lwp_thread (child);
 
-- 
2.26.2


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 06/11] gdbserver: Reindent check_zombie_leaders
  2022-03-03 14:40 [PATCH 00/11] Linux: Fix issues around thread group leader exits Pedro Alves
                   ` (4 preceding siblings ...)
  2022-03-03 14:40 ` [PATCH 05/11] gdbserver: Reorganize linux_process_target::filter_event Pedro Alves
@ 2022-03-03 14:40 ` Pedro Alves
  2022-03-03 14:40 ` [PATCH 07/11] Re-add zombie leader on exit, gdb/linux Pedro Alves
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Pedro Alves @ 2022-03-03 14:40 UTC (permalink / raw)
  To: gdb-patches

This fixes the indentation of
linux_process_target::check_zombie_leaders, which will help with
keeping its comments in sync with the gdb/linux-nat.c counterpart.

Change-Id: I37332343bd80423d934249e3de2d04feefad1891
---
 gdbserver/linux-low.cc | 101 ++++++++++++++++++++---------------------
 1 file changed, 50 insertions(+), 51 deletions(-)

diff --git a/gdbserver/linux-low.cc b/gdbserver/linux-low.cc
index 2d14a0254c7..4635d310839 100644
--- a/gdbserver/linux-low.cc
+++ b/gdbserver/linux-low.cc
@@ -1721,57 +1721,56 @@ iterate_over_lwps (ptid_t filter,
 void
 linux_process_target::check_zombie_leaders ()
 {
-  for_each_process ([this] (process_info *proc) {
-    pid_t leader_pid = pid_of (proc);
-    struct lwp_info *leader_lp;
-
-    leader_lp = find_lwp_pid (ptid_t (leader_pid));
-
-    threads_debug_printf ("leader_pid=%d, leader_lp!=NULL=%d, "
-			  "num_lwps=%d, zombie=%d",
-			  leader_pid, leader_lp!= NULL, num_lwps (leader_pid),
-			  linux_proc_pid_is_zombie (leader_pid));
-
-    if (leader_lp != NULL && !leader_lp->stopped
-	/* Check if there are other threads in the group, as we may
-	   have raced with the inferior simply exiting.  */
-	&& !last_thread_of_process_p (leader_pid)
-	&& linux_proc_pid_is_zombie (leader_pid))
-      {
-	/* A leader zombie can mean one of two things:
-
-	   - It exited, and there's an exit status pending
-	   available, or only the leader exited (not the whole
-	   program).  In the latter case, we can't waitpid the
-	   leader's exit status until all other threads are gone.
-
-	   - There are 3 or more threads in the group, and a thread
-	   other than the leader exec'd.  On an exec, the Linux
-	   kernel destroys all other threads (except the execing
-	   one) in the thread group, and resets the execing thread's
-	   tid to the tgid.  No exit notification is sent for the
-	   execing thread -- from the ptracer's perspective, it
-	   appears as though the execing thread just vanishes.
-	   Until we reap all other threads except the leader and the
-	   execing thread, the leader will be zombie, and the
-	   execing thread will be in `D (disc sleep)'.  As soon as
-	   all other threads are reaped, the execing thread changes
-	   it's tid to the tgid, and the previous (zombie) leader
-	   vanishes, giving place to the "new" leader.  We could try
-	   distinguishing the exit and exec cases, by waiting once
-	   more, and seeing if something comes out, but it doesn't
-	   sound useful.  The previous leader _does_ go away, and
-	   we'll re-add the new one once we see the exec event
-	   (which is just the same as what would happen if the
-	   previous leader did exit voluntarily before some other
-	   thread execs).  */
-
-	threads_debug_printf ("Thread group leader %d zombie "
-			      "(it exited, or another thread execd).",
-			      leader_pid);
-
-	delete_lwp (leader_lp);
-      }
+  for_each_process ([this] (process_info *proc)
+    {
+      pid_t leader_pid = pid_of (proc);
+      lwp_info *leader_lp = find_lwp_pid (ptid_t (leader_pid));
+
+      threads_debug_printf ("leader_pid=%d, leader_lp!=NULL=%d, "
+			    "num_lwps=%d, zombie=%d",
+			    leader_pid, leader_lp!= NULL, num_lwps (leader_pid),
+			    linux_proc_pid_is_zombie (leader_pid));
+
+      if (leader_lp != NULL && !leader_lp->stopped
+	  /* Check if there are other threads in the group, as we may
+	     have raced with the inferior simply exiting.  */
+	  && !last_thread_of_process_p (leader_pid)
+	  && linux_proc_pid_is_zombie (leader_pid))
+	{
+	  /* A leader zombie can mean one of two things:
+
+	     - It exited, and there's an exit status pending
+	     available, or only the leader exited (not the whole
+	     program).  In the latter case, we can't waitpid the
+	     leader's exit status until all other threads are gone.
+
+	     - There are 3 or more threads in the group, and a thread
+	     other than the leader exec'd.  On an exec, the Linux
+	     kernel destroys all other threads (except the execing
+	     one) in the thread group, and resets the execing thread's
+	     tid to the tgid.  No exit notification is sent for the
+	     execing thread -- from the ptracer's perspective, it
+	     appears as though the execing thread just vanishes.
+	     Until we reap all other threads except the leader and the
+	     execing thread, the leader will be zombie, and the
+	     execing thread will be in `D (disc sleep)'.  As soon as
+	     all other threads are reaped, the execing thread changes
+	     it's tid to the tgid, and the previous (zombie) leader
+	     vanishes, giving place to the "new" leader.  We could try
+	     distinguishing the exit and exec cases, by waiting once
+	     more, and seeing if something comes out, but it doesn't
+	     sound useful.  The previous leader _does_ go away, and
+	     we'll re-add the new one once we see the exec event
+	     (which is just the same as what would happen if the
+	     previous leader did exit voluntarily before some other
+	     thread execs).  */
+
+	  threads_debug_printf ("Thread group leader %d zombie "
+				"(it exited, or another thread execd).",
+				leader_pid);
+
+	  delete_lwp (leader_lp);
+	}
     });
 }
 
-- 
2.26.2


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 07/11] Re-add zombie leader on exit, gdb/linux
  2022-03-03 14:40 [PATCH 00/11] Linux: Fix issues around thread group leader exits Pedro Alves
                   ` (5 preceding siblings ...)
  2022-03-03 14:40 ` [PATCH 06/11] gdbserver: Reindent check_zombie_leaders Pedro Alves
@ 2022-03-03 14:40 ` Pedro Alves
  2022-03-07 20:08   ` Simon Marchi
  2022-03-03 14:40 ` [PATCH 08/11] Re-add zombie leader on exit, gdbserver/linux Pedro Alves
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 23+ messages in thread
From: Pedro Alves @ 2022-03-03 14:40 UTC (permalink / raw)
  To: gdb-patches

The current zombie leader detection code in linux-nat.c has a race --
if a multi-threaded inferior exits just before check_zombie_leaders
finds that the leader is now zombie via checking /proc/PID/status,
check_zombie_leaders deletes the leader, assuming we won't get an
event for that exit (which we won't in some scenarios, but not in this
one).  That might seem mostly harmless, but it has some downsides:

 - later when we continue pulling events out of the kernel, we will
   collect the exit events of the non-leader threads, and once we see
   the last lwp in our list exit, we return _that_ lwp's exit code as
   whole-process exit code to infrun, instead of the leader's exit
   code.

 - this can cause a hang in stop_all_threads in infrun.c.  Say there
   are 2 threads in the process.  stop_all_threads stops each of those
   threads, and then waits for two stop or exit events, one for each
   thread.  If the whole process exits, and check_zombie_leaders hits
   the false-positive case, linux-nat.c will only return one event to
   GDB (the whole-process exit returned when we see the last thread,
   the non-leader thread, exit), making stop_all_threads hang forever
   waiting for a second event that will never come.

However, in this false-positive scenario, where the whole process is
exiting, as opposed to just the leader (with pthread_exit(), for
example), we _will_ get an exit event shortly for the leader, after we
collect the exit events of all the other non-leader threads.  Or put
another way, we _always_ get an event for the leader after we see it
become zombie.
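
For reference, the zombie check boils down to a /proc probe along
these lines (a hedged sketch of what linux_proc_pid_is_zombie does,
not the actual implementation; field format per proc(5)):

  #include <stdio.h>
  #include <string.h>

  /* Return non-zero if /proc/PID/status reports "State: Z (zombie)".  */

  static int
  pid_is_zombie (int pid)
  {
    char path[64], line[256];
    int zombie = 0;
    FILE *f;

    snprintf (path, sizeof path, "/proc/%d/status", pid);
    f = fopen (path, "r");
    if (f == NULL)
      return 0;
    while (fgets (line, sizeof line, f) != NULL)
      if (strncmp (line, "State:", 6) == 0)
        {
          zombie = strchr (line, 'Z') != NULL;
          break;
        }
    fclose (f);
    return zombie;
  }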

I tried a number of approaches to fix this:

#1 - My first thought to address the race was to make GDB always
report the whole-process exit status for the leader thread, not for
whatever is the last lwp in the list.  We _always_ get a final exit
(or exec) event for the leader, and when the race triggers, we're not
collecting it.

#2 - My second thought was to try to plug the race in the first place.

I thought of making GDB call waitpid/WNOHANG for all non-leader
threads immediately when the zombie leader is detected, assuming there
would be an exit event pending for each of them waiting to be
collected.  Turns out that that doesn't work -- you can see the leader
become zombie _before_ the kernel kills all other threads.  Waitpid in
that small time window returns 0, indicating no-event.  Thankfully we
hit that race window all the time, which avoided trading one race for
another.  Looking at the non-leader thread's status in /proc doesn't
help either, the threads are still in running state for a bit, for the
same reason.
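
Concretely, the probe approach #2 would have relied on is just this
(a sketch, assuming the caller is the ptracer of LWP), and it is
exactly this returning 0 inside the race window that kills the idea:

  #include <sys/wait.h>

  /* Poll LWP for a pending wait event without blocking.  Returns the
     LWP id if an event is pending, 0 if not (the racy window
     described above), -1 on error.  __WALL is needed to see clone
     children.  */

  static int
  poll_lwp_event (int lwp, int *status)
  {
    return waitpid (lwp, status, WNOHANG | __WALL);
  }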

#3 - My next attempt, which seemed promising, was to synchronously
stop and wait for the stop for each of the non-leader threads.  For
the scenario in question, this will collect all the exit statuses of
the non-leader threads.  Then, if we are left with only the zombie
leader in the lwp list, it means we either have a normal whole-process
exit or an exec, in which case we should not delete the leader.  If
_only_ the leader exited, like in gdb.threads/leader-exit.exp, then
after pausing threads, we will still have at least one live non-leader
thread in the list, and so we delete the leader lwp.  I got this
working and polished, and it was only after staring at the kernel code
to convince myself that this would really work (and it would, for the
scenario I considered), that I realized I had failed to account for
one scenario -- if any non-leader thread is _already_ stopped when
some thread triggers a group exit, like e.g., if you have some threads
stopped and then resume just one thread with scheduler-locking or
non-stop, and that thread exits the process.  I also played with
PTRACE_EVENT_EXIT, see if it would help in any way to plug the race,
and I couldn't find a way that it would result in any practical
difference compared to looking at /proc/PID/status, with respect to
having a race.

So I concluded that there's no way to plug the race, we just have to
deal with it.  Which means, going back to approach #1.  That is the
approach taken by this patch.

Change-Id: I6309fd4727da8c67951f9cea557724b77e8ee979
---
 gdb/linux-nat.c | 107 ++++++++++++++++++++++++++++++++++++------------
 1 file changed, 80 insertions(+), 27 deletions(-)

diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
index 1555d3a79e3..d97a770bf83 100644
--- a/gdb/linux-nat.c
+++ b/gdb/linux-nat.c
@@ -247,6 +247,14 @@ static void save_stop_reason (struct lwp_info *lp);
 static void close_proc_mem_file (pid_t pid);
 static void open_proc_mem_file (ptid_t ptid);
 
+/* Return TRUE if LWP is the leader thread of the process.  */
+
+static bool
+is_leader (lwp_info *lp)
+{
+  return lp->ptid.pid () == lp->ptid.lwp ();
+}
+
 \f
 /* LWP accessors.  */
 
@@ -2814,7 +2822,23 @@ linux_nat_filter_event (int lwpid, int status)
 	  /* Don't report an event for the exit of an LWP not in our
 	     list, i.e. not part of any inferior we're debugging.
 	     This can happen if we detach from a program we originally
-	     forked and then it exits.  */
+	     forked and then it exits.  However, note that we may have
+	     earlier deleted a leader of an inferior we're debugging,
+	     in check_zombie_leaders.  Re-add it back here if so.  */
+	  for (inferior *inf : all_inferiors (linux_target))
+	    {
+	      if (inf->pid == lwpid)
+		{
+		  linux_nat_debug_printf
+		    ("Re-adding thread group leader LWP %d after exit.",
+		     lwpid);
+
+		  lp = add_lwp (ptid_t (lwpid, lwpid));
+		  lp->resumed = 1;
+		  add_thread (linux_target, lp->ptid);
+		  break;
+		}
+	    }
 	}
 
       if (lp == nullptr)
@@ -2865,13 +2889,12 @@ linux_nat_filter_event (int lwpid, int status)
   /* Check if the thread has exited.  */
   if (WIFEXITED (status) || WIFSIGNALED (status))
     {
-      if (!report_thread_events
-	  && num_lwps (lp->ptid.pid ()) > 1)
+      if (!report_thread_events && !is_leader (lp))
 	{
 	  linux_nat_debug_printf ("%s exited.",
 				  lp->ptid.to_string ().c_str ());
 
-	  /* If there is at least one more LWP, then the exit signal
+	  /* If this was not the leader exiting, then the exit signal
 	     was not the end of the debugged application and should be
 	     ignored.  */
 	  exit_lwp (lp);
@@ -3014,33 +3037,63 @@ check_zombie_leaders (void)
       leader_lp = find_lwp_pid (ptid_t (inf->pid));
       if (leader_lp != NULL
 	  /* Check if there are other threads in the group, as we may
-	     have raced with the inferior simply exiting.  */
+	     have raced with the inferior simply exiting.  Note this
+	     isn't a watertight check.  If the inferior is
+	     multi-threaded and is exiting, it may be we see the
+	     leader as zombie before we reap all the non-leader
+	     threads.  See comments below.  */
 	  && num_lwps (inf->pid) > 1
 	  && linux_proc_pid_is_zombie (inf->pid))
 	{
+	  /* A zombie leader in a multi-threaded program can mean one
+	     of three things:
+
+	     #1 - Only the leader exited, not the whole program, e.g.,
+	     with pthread_exit.  Since we can't reap the leader's exit
+	     status until all other threads are gone and reaped too,
+	     we want to delete the zombie leader right away, as it
+	     can't be debugged, we can't read its registers, etc.
+	     This is the main reason we check for zombie leaders
+	     disappearing.
+
+	     #2 - The whole thread-group/process exited (a group exit,
+	     via e.g. exit(3)), and there is (or will be shortly) an
+	     exit reported for each thread in the process, and then
+	     finally an exit for the leader once the non-leaders are
+	     reaped.
+
+	     #3 - There are 3 or more threads in the group, and a
+	     thread other than the leader exec'd.  See comments on
+	     exec events at the top of the file.
+
+	     Ideally we would never delete the leader for case #2.
+	     Instead, we want to collect the exit status of each
+	     non-leader thread, and then finally collect the exit
+	     status of the leader as normal and use its exit code as
+	     whole-process exit code.  Unfortunately, there's no
+	     race-free way to distinguish cases #1 and #2.  We can't
+	     assume the exit events for the non-leader threads are
+	     already pending in the kernel, nor can we assume the
+	     non-leader threads are in zombie state already.  Between
+	     the leader becoming zombie and the non-leaders exiting
+	     and becoming zombie themselves, there's a small time
+	     window, so such a check would be racy.  Temporarily
+	     pausing all threads and checking to see if all threads
+	     exit or not before re-resuming them would work in the
+	     case that all threads are running right now, but it
+	     wouldn't work if some thread is currently already
+	     ptrace-stopped, e.g., due to scheduler-locking.
+
+	     So what we do is we delete the leader anyhow, and then
+	     later on when we see its exit status, we re-add it back.
+	     We also make sure that we only report a whole-process
+	     exit when we see the leader exiting, as opposed to when
+	     the last LWP in the LWP list exits, which can be a
+	     non-leader if we deleted the leader here.  */
 	  linux_nat_debug_printf ("Thread group leader %d zombie "
-				  "(it exited, or another thread execd).",
+				  "(it exited, or another thread execd), "
+				  "deleting it.",
 				  inf->pid);
-
-	  /* A leader zombie can mean one of two things:
-
-	     - It exited, and there's an exit status pending
-	     available, or only the leader exited (not the whole
-	     program).  In the latter case, we can't waitpid the
-	     leader's exit status until all other threads are gone.
-
-	     - There are 3 or more threads in the group, and a thread
-	     other than the leader exec'd.  See comments on exec
-	     events at the top of the file.  We could try
-	     distinguishing the exit and exec cases, by waiting once
-	     more, and seeing if something comes out, but it doesn't
-	     sound useful.  The previous leader _does_ go away, and
-	     we'll re-add the new one once we see the exec event
-	     (which is just the same as what would happen if the
-	     previous leader did exit voluntarily before some other
-	     thread execs).  */
-
-	  linux_nat_debug_printf ("Thread group leader %d vanished.", inf->pid);
 	  exit_lwp (leader_lp);
 	}
     }
@@ -3057,7 +3110,7 @@ filter_exit_event (struct lwp_info *event_child,
 {
   ptid_t ptid = event_child->ptid;
 
-  if (num_lwps (ptid.pid ()) > 1)
+  if (!is_leader (event_child))
     {
       if (report_thread_events)
 	ourstatus->set_thread_exited (0);
-- 
2.26.2


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 08/11] Re-add zombie leader on exit, gdbserver/linux
  2022-03-03 14:40 [PATCH 00/11] Linux: Fix issues around thread group leader exits Pedro Alves
                   ` (6 preceding siblings ...)
  2022-03-03 14:40 ` [PATCH 07/11] Re-add zombie leader on exit, gdb/linux Pedro Alves
@ 2022-03-03 14:40 ` Pedro Alves
  2022-03-03 14:40 ` [PATCH 09/11] Ensure EXIT is last event, gdb/linux Pedro Alves
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Pedro Alves @ 2022-03-03 14:40 UTC (permalink / raw)
  To: gdb-patches

Same as the previous patch, but for GDBserver.

In summary, the current zombie leader detection code in linux-low.cc
has a race -- if a multi-threaded inferior exits just before
check_zombie_leaders finds that the leader is now zombie via checking
/proc/PID/status, check_zombie_leaders deletes the leader, assuming we
won't get an event for that exit (which we won't in some scenarios,
but not in this one).  This is the false-positive scenario, where the
whole process is simply exiting.  Later, when we see the last LWP in
our list exit, we report that LWP's exit status as the whole-process
exit code, even though, for the (real) parent process, the exit code
that counts is the child's leader thread's exit code.

Like for GDB, the solution here is to:

   - only report whole-process exit events for the leader.
   - re-add the leader back to the LWP list when we finally see it
     exit.

Change-Id: Id2d7bbb51a415534e1294fff1d555b7ecaa87f1d
---
 gdbserver/linux-low.cc | 120 ++++++++++++++++++++++++++++-------------
 1 file changed, 82 insertions(+), 38 deletions(-)

diff --git a/gdbserver/linux-low.cc b/gdbserver/linux-low.cc
index 4635d310839..5c037881670 100644
--- a/gdbserver/linux-low.cc
+++ b/gdbserver/linux-low.cc
@@ -135,6 +135,15 @@ typedef struct
 /* Does the current host support PTRACE_GETREGSET?  */
 int have_ptrace_getregset = -1;
 
+/* Return TRUE if THREAD is the leader thread of the process.  */
+
+static bool
+is_leader (thread_info *thread)
+{
+  ptid_t ptid = ptid_of (thread);
+  return ptid.pid () == ptid.lwp ();
+}
+
 /* LWP accessors.  */
 
 /* See nat/linux-nat.h.  */
@@ -1733,42 +1742,63 @@ linux_process_target::check_zombie_leaders ()
 
       if (leader_lp != NULL && !leader_lp->stopped
 	  /* Check if there are other threads in the group, as we may
-	     have raced with the inferior simply exiting.  */
+	     have raced with the inferior simply exiting.  Note this
+	     isn't a watertight check.  If the inferior is
+	     multi-threaded and is exiting, it may be we see the
+	     leader as zombie before we reap all the non-leader
+	     threads.  See comments below.  */
 	  && !last_thread_of_process_p (leader_pid)
 	  && linux_proc_pid_is_zombie (leader_pid))
 	{
-	  /* A leader zombie can mean one of two things:
-
-	     - It exited, and there's an exit status pending
-	     available, or only the leader exited (not the whole
-	     program).  In the latter case, we can't waitpid the
-	     leader's exit status until all other threads are gone.
-
-	     - There are 3 or more threads in the group, and a thread
-	     other than the leader exec'd.  On an exec, the Linux
-	     kernel destroys all other threads (except the execing
-	     one) in the thread group, and resets the execing thread's
-	     tid to the tgid.  No exit notification is sent for the
-	     execing thread -- from the ptracer's perspective, it
-	     appears as though the execing thread just vanishes.
-	     Until we reap all other threads except the leader and the
-	     execing thread, the leader will be zombie, and the
-	     execing thread will be in `D (disc sleep)'.  As soon as
-	     all other threads are reaped, the execing thread changes
-	     it's tid to the tgid, and the previous (zombie) leader
-	     vanishes, giving place to the "new" leader.  We could try
-	     distinguishing the exit and exec cases, by waiting once
-	     more, and seeing if something comes out, but it doesn't
-	     sound useful.  The previous leader _does_ go away, and
-	     we'll re-add the new one once we see the exec event
-	     (which is just the same as what would happen if the
-	     previous leader did exit voluntarily before some other
-	     thread execs).  */
-
+	  /* A zombie leader in a multi-threaded program can mean one
+	     of three things:
+
+	     #1 - Only the leader exited, not the whole program, e.g.,
+	     with pthread_exit.  Since we can't reap the leader's exit
+	     status until all other threads are gone and reaped too,
+	     we want to delete the zombie leader right away, as it
+	     can't be debugged, we can't read its registers, etc.
+	     This is the main reason we check for zombie leaders
+	     disappearing.
+
+	     #2 - The whole thread-group/process exited (a group exit,
+	     via e.g. exit(3)), and there is (or will be shortly) an
+	     exit reported for each thread in the process, and then
+	     finally an exit for the leader once the non-leaders are
+	     reaped.
+
+	     #3 - There are 3 or more threads in the group, and a
+	     thread other than the leader exec'd.  See comments on
+	     exec events at the top of the file.
+
+	     Ideally we would never delete the leader for case #2.
+	     Instead, we want to collect the exit status of each
+	     non-leader thread, and then finally collect the exit
+	     status of the leader as normal and use its exit code as
+	     whole-process exit code.  Unfortunately, there's no
+	     race-free way to distinguish cases #1 and #2.  We can't
+	     assume the exit events for the non-leader threads are
+	     already pending in the kernel, nor can we assume the
+	     non-leader threads are in zombie state already.  Between
+	     the leader becoming zombie and the non-leaders exiting
+	     and becoming zombie themselves, there's a small time
+	     window, so such a check would be racy.  Temporarily
+	     pausing all threads and checking to see if all threads
+	     exit or not before re-resuming them would work in the
+	     case that all threads are running right now, but it
+	     wouldn't work if some thread is currently already
+	     ptrace-stopped, e.g., due to scheduler-locking.
+
+	     So what we do is we delete the leader anyhow, and then
+	     later on when we see its exit status, we re-add it back.
+	     We also make sure that we only report a whole-process
+	     exit when we see the leader exiting, as opposed to when
+	     the last LWP in the LWP list exits, which can be a
+	     non-leader if we deleted the leader here.  */
 	  threads_debug_printf ("Thread group leader %d zombie "
-				"(it exited, or another thread execd).",
+				"(it exited, or another thread execd), "
+				"deleting it.",
 				leader_pid);
-
 	  delete_lwp (leader_lp);
 	}
     });
@@ -2185,7 +2215,22 @@ linux_process_target::filter_event (int lwpid, int wstat)
 	  /* Don't report an event for the exit of an LWP not in our
 	     list, i.e. not part of any inferior we're debugging.
 	     This can happen if we detach from a program we originally
-	     forked and then it exits.  */
+	     forked and then it exits.  However, note that we may have
+	     earlier deleted a leader of an inferior we're debugging,
+	     in check_zombie_leaders.  Re-add it back here if so.  */
+	  find_process ([&] (process_info *proc)
+	    {
+	      if (proc->pid == lwpid)
+		{
+		  threads_debug_printf
+		    ("Re-adding thread group leader LWP %d after exit.",
+		     lwpid);
+
+		  child = add_lwp (ptid_t (lwpid, lwpid));
+		  return true;
+		}
+	      return false;
+	    });
 	}
 
       if (child == nullptr)
@@ -2209,11 +2254,10 @@ linux_process_target::filter_event (int lwpid, int wstat)
 	  unsuspend_all_lwps (child);
 	}
 
-      /* If there is at least one more LWP, then the exit signal was
-	 not the end of the debugged application and should be
-	 ignored, unless GDB wants to hear about thread exits.  */
-      if (cs.report_thread_events
-	  || last_thread_of_process_p (pid_of (thread)))
+      /* If this is not the leader LWP, then the exit signal was not
+	 the end of the debugged application and should be ignored,
+	 unless GDB wants to hear about thread exits.  */
+      if (cs.report_thread_events || is_leader (thread))
 	{
 	  /* Since events are serialized to GDB core, and we can't
 	     report this one right now.  Leave the status pending for
@@ -2780,7 +2824,7 @@ linux_process_target::filter_exit_event (lwp_info *event_child,
   struct thread_info *thread = get_lwp_thread (event_child);
   ptid_t ptid = ptid_of (thread);
 
-  if (!last_thread_of_process_p (pid_of (thread)))
+  if (!is_leader (thread))
     {
       if (cs.report_thread_events)
 	ourstatus->set_thread_exited (0);
-- 
2.26.2


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 09/11] Ensure EXIT is last event, gdb/linux
  2022-03-03 14:40 [PATCH 00/11] Linux: Fix issues around thread group leader exits Pedro Alves
                   ` (7 preceding siblings ...)
  2022-03-03 14:40 ` [PATCH 08/11] Re-add zombie leader on exit, gdbserver/linux Pedro Alves
@ 2022-03-03 14:40 ` Pedro Alves
  2022-03-07 20:24   ` Simon Marchi
  2022-03-03 14:40 ` [PATCH 10/11] Ensure EXIT is last event, gdbserver/linux Pedro Alves
  2022-03-03 14:40 ` [PATCH 11/11] Process exit status is leader exit status testcase Pedro Alves
  10 siblings, 1 reply; 23+ messages in thread
From: Pedro Alves @ 2022-03-03 14:40 UTC (permalink / raw)
  To: gdb-patches

From: Lancelot SIX <lancelot.six@amd.com>

When all threads of a multi-threaded process terminate, we can end up
with multiple LWPs having a pending exit event (let's say that the
events arrive faster than GDB processes them, which can be simulated
by introducing an artificial delay in GDB).  When we have multiple
pending events to report to the core, linux_nat_wait_1 uses
select_event_lwp to select randomly one of the pending events to
report.

If we have multiple pending exit events, and the randomization picks
the leader's exit to report, filter_exit_event sees that this is the
leader exiting, thus it is the exit of the whole process and thus
reports an EXITED event to the core, while leaving the other threads'
exit statuses pending.

This is problematic for infrun.c:stop_all_threads, which asks the
target to report all thread exit events to infrun.  For example, in
stop_all_threads, core GDB counts 2 threads that needs to be stopped.
It then asks the target to stop those 2 threads (with
target_stop(ptid)), and waits for 2 events to come back from the
target.  Unfortunately, when waiting for events, the linux-nat target,
due to the random event selection mentioned above, reports the
whole-process EXIT event even though the other thread has exited and
its exit hasn't been reported yet.  As a consequence, stop_all_threads
receives one event, but waits indefinitely for the second one to come.
Effectively, GDB hangs forever.

To fix this, this commit makes sure that a leader's exit event is not
considered for selection by select_event_lwp as long as there is at
least one other thread with a pending event remaining in the process.
Considering that the leader thread's exit event can only be generated
by the kernel once it has reaped all the non-leader threads, we are
guaranteed that all other threads do have an exit event which is ready
to be processed.  Once all other exit events are processed,
select_event_lwp will consider the leader's exit status.

Tested on Linux-x86_64 with no regression observed.

Co-Authored-By: Pedro Alves <pedro@palves.net>
Change-Id: Id17ad5de76518925a968c0902860646820679dfa
---
 gdb/linux-nat.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 41 insertions(+), 5 deletions(-)

diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
index d97a770bf83..5d8887b11f6 100644
--- a/gdb/linux-nat.c
+++ b/gdb/linux-nat.c
@@ -2482,15 +2482,50 @@ status_callback (struct lwp_info *lp)
   return 1;
 }
 
-/* Count the LWP's that have had events.  */
+/* Return true if another thread of the same process as LP has a
+   pending status ready to be processed.  LP is assumed to be the
+   leader of its process.  */
+
+static bool
+non_leader_lwp_in_process_with_pending_event (lwp_info *lp)
+{
+  gdb_assert (is_leader (lp));
+
+  for (lwp_info *other : all_lwps ())
+    {
+      if (other->ptid.pid () == lp->ptid.pid ()
+	  && !is_leader (other)
+	  && other->resumed
+	  && lwp_status_pending_p (other))
+	return true;
+    }
+
+  return false;
+}
+
+/* Indicate whether LP has a pending event which should be considered
+   for immediate processing.  Does not consider a leader thread's exit
+   event before the non-leader threads have reported their exits.  */
+
+static bool
+has_reportable_pending_event (struct lwp_info *lp)
+{
+  return (lp->resumed && lwp_status_pending_p (lp)
+	  && !(!WIFSTOPPED (lp->status)
+	       && is_leader (lp)
+	       && non_leader_lwp_in_process_with_pending_event (lp)));
+}
+
+/* Count the LWP's that have had reportable events.  */
 
 static int
 count_events_callback (struct lwp_info *lp, int *count)
 {
   gdb_assert (count != NULL);
 
-  /* Select only resumed LWPs that have an event pending.  */
-  if (lp->resumed && lwp_status_pending_p (lp))
+  /* Select only resumed LWPs that have a reportable event
+     pending.  */
+  if (has_reportable_pending_event (lp))
     (*count)++;
 
   return 0;
@@ -2526,8 +2561,9 @@ select_event_lwp_callback (struct lwp_info *lp, int *selector)
 {
   gdb_assert (selector != NULL);
 
-  /* Select only resumed LWPs that have an event pending.  */
-  if (lp->resumed && lwp_status_pending_p (lp))
+  /* Select only resumed LWPs that have a reportable event
+     pending.  */
+  if (has_reportable_pending_event (lp))
     if ((*selector)-- == 0)
       return 1;
 
-- 
2.26.2


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 10/11] Ensure EXIT is last event, gdbserver/linux
  2022-03-03 14:40 [PATCH 00/11] Linux: Fix issues around thread group leader exits Pedro Alves
                   ` (8 preceding siblings ...)
  2022-03-03 14:40 ` [PATCH 09/11] Ensure EXIT is last event, gdb/linux Pedro Alves
@ 2022-03-03 14:40 ` Pedro Alves
  2022-03-03 14:40 ` [PATCH 11/11] Process exit status is leader exit status testcase Pedro Alves
  10 siblings, 0 replies; 23+ messages in thread
From: Pedro Alves @ 2022-03-03 14:40 UTC (permalink / raw)
  To: gdb-patches

Same as the previous patch, but for GDBserver.

In a nutshell, don't report exit events for the leader thread until
all pending events for other threads are flushed.

Change-Id: I3f855f48bea15a527a87565f8b9f4000169cd6c0
---
 gdbserver/linux-low.cc | 46 +++++++++++++++++++++++++++++++++---------
 gdbserver/linux-low.h  | 15 ++++++++++++--
 2 files changed, 49 insertions(+), 12 deletions(-)

diff --git a/gdbserver/linux-low.cc b/gdbserver/linux-low.cc
index 5c037881670..15c2914e444 100644
--- a/gdbserver/linux-low.cc
+++ b/gdbserver/linux-low.cc
@@ -1655,16 +1655,10 @@ lwp_resumed (struct lwp_info *lwp)
 }
 
 bool
-linux_process_target::status_pending_p_callback (thread_info *thread,
-						 ptid_t ptid)
+linux_process_target::status_pending_p (thread_info *thread)
 {
   struct lwp_info *lp = get_thread_lwp (thread);
 
-  /* Check if we're only interested in events from a specific process
-     or a specific LWP.  */
-  if (!thread->id.matches (ptid))
-    return 0;
-
   if (!lwp_resumed (lp))
     return 0;
 
@@ -1678,6 +1672,38 @@ linux_process_target::status_pending_p_callback (thread_info *thread,
   return lp->status_pending_p;
 }
 
+bool
+linux_process_target::non_leader_lwp_in_process_with_pending_event
+  (thread_info *thread)
+{
+  gdb_assert (is_leader (thread));
+
+  thread_info *found = find_thread (ptid_t (pid_of (thread)),
+				    [&] (thread_info *other)
+    {
+      return !is_leader (other) && status_pending_p (other);
+    });
+
+  return found != nullptr;
+}
+
+bool
+linux_process_target::has_reportable_pending_event (thread_info *thread,
+						    ptid_t filter_ptid)
+{
+  /* Check if we're only interested in events from a specific process
+     or a specific thread.  */
+  if (!thread->id.matches (filter_ptid))
+    return false;
+
+  lwp_info *lp = get_thread_lwp (thread);
+
+  return (status_pending_p (thread)
+	  && !(!WIFSTOPPED (lp->status_pending)
+	       && is_leader (thread)
+	       && non_leader_lwp_in_process_with_pending_event (thread)));
+}
+
 struct lwp_info *
 find_lwp_pid (ptid_t ptid)
 {
@@ -2446,7 +2472,7 @@ linux_process_target::wait_for_event_filtered (ptid_t wait_ptid,
     {
       event_thread = find_thread_in_random ([&] (thread_info *thread)
 	{
-	  return status_pending_p_callback (thread, filter_ptid);
+	  return has_reportable_pending_event (thread, filter_ptid);
 	});
 
       if (event_thread != NULL)
@@ -2562,7 +2588,7 @@ linux_process_target::wait_for_event_filtered (ptid_t wait_ptid,
 	 any.  */
       event_thread = find_thread_in_random ([&] (thread_info *thread)
 	{
-	  return status_pending_p_callback (thread, filter_ptid);
+	  return has_reportable_pending_event (thread, filter_ptid);
 	});
 
       if (event_thread != NULL)
@@ -2902,7 +2928,7 @@ linux_process_target::wait_1 (ptid_t ptid, target_waitstatus *ourstatus,
 
   auto status_pending_p_any = [&] (thread_info *thread)
     {
-      return status_pending_p_callback (thread, minus_one_ptid);
+      return status_pending_p (thread);
     };
 
   auto not_stopped = [&] (thread_info *thread)
diff --git a/gdbserver/linux-low.h b/gdbserver/linux-low.h
index 27cc9641f12..bd71e673871 100644
--- a/gdbserver/linux-low.h
+++ b/gdbserver/linux-low.h
@@ -483,8 +483,19 @@ class linux_process_target : public process_stratum_target
      they should be re-issued if necessary.  */
   void resume_one_thread (thread_info *thread, bool leave_all_stopped);
 
-  /* Return true if this lwp has an interesting status pending.  */
-  bool status_pending_p_callback (thread_info *thread, ptid_t ptid);
+  /* Return true if THREAD has an interesting status pending.  */
+  bool status_pending_p (thread_info *thread);
+
+  /* Return true if another thread of the same process as THREAD has a
+     pending status ready to be processed.  THREAD is assumed to be
+     the leader of its process.  */
+  bool non_leader_lwp_in_process_with_pending_event (thread_info *thread);
+
+  /* Indicate if THREAD has a pending event which should be considered
+     for immediate processing.  Only threads that match FILTER_PTID
+     are considered.  Does not consider a leader thread's exit event
+     before the non-leader threads have reported their exits.  */
+  bool has_reportable_pending_event (thread_info *thread, ptid_t filter_ptid);
 
   /* Resume LWPs that are currently stopped without any pending status
      to report, but are resumed from the core's perspective.  */
-- 
2.26.2


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 11/11] Process exit status is leader exit status testcase
  2022-03-03 14:40 [PATCH 00/11] Linux: Fix issues around thread group leader exits Pedro Alves
                   ` (9 preceding siblings ...)
  2022-03-03 14:40 ` [PATCH 10/11] Ensure EXIT is last event, gdbserver/linux Pedro Alves
@ 2022-03-03 14:40 ` Pedro Alves
  2023-06-22 11:28   ` Ilya Leoshkevich
  10 siblings, 1 reply; 23+ messages in thread
From: Pedro Alves @ 2022-03-03 14:40 UTC (permalink / raw)
  To: gdb-patches

From: Lancelot SIX <lancelot.six@amd.com>

This adds a multi-threaded testcase that has all threads in the
process exit with a different exit code, and ensures that GDB reports
the thread group leader's exit status as the whole-process exit
status.  Before this set of patches, this would randomly report the
exit code of some other thread, and thus fail.

Tested on Linux-x86_64, native and gdbserver.

Co-Authored-By: Pedro Alves <pedro@palves.net>
Change-Id: I30cba2ff4576fb01b5169cc72667f3268d919557
---
 ...rocess-exit-status-is-leader-exit-status.c | 64 +++++++++++++++++++
 ...cess-exit-status-is-leader-exit-status.exp | 45 +++++++++++++
 2 files changed, 109 insertions(+)
 create mode 100644 gdb/testsuite/gdb.threads/process-exit-status-is-leader-exit-status.c
 create mode 100644 gdb/testsuite/gdb.threads/process-exit-status-is-leader-exit-status.exp

diff --git a/gdb/testsuite/gdb.threads/process-exit-status-is-leader-exit-status.c b/gdb/testsuite/gdb.threads/process-exit-status-is-leader-exit-status.c
new file mode 100644
index 00000000000..c27c266ef28
--- /dev/null
+++ b/gdb/testsuite/gdb.threads/process-exit-status-is-leader-exit-status.c
@@ -0,0 +1,64 @@
+/* This testcase is part of GDB, the GNU debugger.
+
+   Copyright 2022 Free Software Foundation, Inc.
+
+   This program is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
+
+#include <pthread.h>
+#include <unistd.h>
+#include <sys/syscall.h>
+
+#define NUM_THREADS 32
+
+pthread_barrier_t barrier;
+
+static void
+do_exit (int exitcode)
+{
+  /* Synchronize all threads up to here so that they all exit at
+     roughly the same time.  */
+  pthread_barrier_wait (&barrier);
+
+  /* All threads exit with SYS_exit, even the main thread, to avoid
+     exiting with a group-exit syscall, as that syscall changes the
+     exit status of all still-alive threads, thus potentially masking
+     a bug.  */
+  syscall (SYS_exit, exitcode);
+}
+
+static void *
+start (void *arg)
+{
+  int thread_return_value = *(int *) arg;
+
+  do_exit (thread_return_value);
+}
+
+int
+main (void)
+{
+  pthread_t threads[NUM_THREADS];
+  int thread_return_val[NUM_THREADS];
+  int i;
+
+  pthread_barrier_init (&barrier, NULL, NUM_THREADS + 1);
+
+  for (i = 0; i < NUM_THREADS; ++i)
+    {
+      thread_return_val[i] = i + 2;
+      pthread_create (&threads[i], NULL, start, &thread_return_val[i]);
+    }
+
+  do_exit (1);
+}
diff --git a/gdb/testsuite/gdb.threads/process-exit-status-is-leader-exit-status.exp b/gdb/testsuite/gdb.threads/process-exit-status-is-leader-exit-status.exp
new file mode 100644
index 00000000000..655e4bc209b
--- /dev/null
+++ b/gdb/testsuite/gdb.threads/process-exit-status-is-leader-exit-status.exp
@@ -0,0 +1,45 @@
+# Copyright (C) 2022 Free Software Foundation, Inc.
+
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+# GDB must always report the process's exit status based on the exit
+# status of the thread group leader thread.  Test that when multiple
+# threads exit simultaneously, GDB doesn't confuse the non-leader
+# threads' exit status for the process's exit status.  GDB used to
+# have a race condition that led to randomly handling this incorrectly.
+#
+# Since the improper behavior is racy in nature, this test is not
+# expected to be able to reproduce the error reliably.  Multiple
+# executions (or increasing the number of iterations) might be
+# required to reproduce the error with a misbehaving GDB.
+
+if { ![istarget "*-*-linux*"] } {
+    return 0
+}
+
+standard_testfile
+
+if {[prepare_for_testing "failed to prepare" $testfile $srcfile {debug pthreads}] == -1} {
+    return -1
+}
+
+for {set iteration 0} {$iteration < 10} {incr iteration} {
+    with_test_prefix "iteration=$iteration" {
+	if {![runto_main]} {
+	    return
+	}
+
+	gdb_test "continue" "\\\[Inferior 1 \\(.*\\) exited with code 01\\\]"
+    }
+}
-- 
2.26.2


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 07/11] Re-add zombie leader on exit, gdb/linux
  2022-03-03 14:40 ` [PATCH 07/11] Re-add zombie leader on exit, gdb/linux Pedro Alves
@ 2022-03-07 20:08   ` Simon Marchi
  2022-03-07 20:27     ` Pedro Alves
  0 siblings, 1 reply; 23+ messages in thread
From: Simon Marchi @ 2022-03-07 20:08 UTC (permalink / raw)
  To: Pedro Alves, gdb-patches

On 2022-03-03 09:40, Pedro Alves wrote:
> @@ -2814,7 +2822,23 @@ linux_nat_filter_event (int lwpid, int status)
>  	  /* Don't report an event for the exit of an LWP not in our
>  	     list, i.e. not part of any inferior we're debugging.
>  	     This can happen if we detach from a program we originally
> -	     forked and then it exits.  */
> +	     forked and then it exits.  However, note that we may have
> +	     earlier deleted a leader of an inferior we're debugging,
> +	     in check_zombie_leaders.  Re-add it back here if so.  */
> +	  for (inferior *inf : all_inferiors (linux_target))
> +	    {
> +	      if (inf->pid == lwpid)
> +		{
> +		  linux_nat_debug_printf
> +		    ("Re-adding thread group leader LWP %d after exit.",
> +		     lwpid);
> +
> +		  lp = add_lwp (ptid_t (lwpid, lwpid));
> +		  lp->resumed = 1;
> +		  add_thread (linux_target, lp->ptid);
> +		  break;
> +		}
> +	    }

Should we do this only if WIFEXITED?  We don't expect to get any other
event from a deleted leader, so it's more to be safe in case there is a
kernel bug or something that would make us see something else.

Otherwise, LGTM.

Simon

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 09/11] Ensure EXIT is last event, gdb/linux
  2022-03-03 14:40 ` [PATCH 09/11] Ensure EXIT is last event, gdb/linux Pedro Alves
@ 2022-03-07 20:24   ` Simon Marchi
  2022-03-09  0:21     ` Lancelot SIX
  0 siblings, 1 reply; 23+ messages in thread
From: Simon Marchi @ 2022-03-07 20:24 UTC (permalink / raw)
  To: Pedro Alves, gdb-patches

On 2022-03-03 09:40, Pedro Alves wrote:
> This is problematic for infrun.c:stop_all_threads, which asks the
> target to report all thread exit events to infrun.  For example, in
> stop_all_threads, core GDB counts 2 threads that need to be stopped.
> It then asks the target to stop those 2 threads (with
> target_stop(ptid)), and waits for 2 events to come back from the
> target.  Unfortunately, when waiting for events, the linux-nat target,
> due to the random event selection mentioned above, reports the
> whole-process EXIT event even though the other thread has exited and
> its exit hasn't been reported yet.  As a consequence, stop_all_threads
> receives one event, but waits indefinitely for the second one to come.
> Effectively, GDB hangs forever.

I don't really understand how we end up waiting forever.  Why does
reporting the leader exit event first make it so we discard the event
for the other thread?

Other than that, it looks sensible to me to ensure we return events in
this order.

Simon

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 07/11] Re-add zombie leader on exit, gdb/linux
  2022-03-07 20:08   ` Simon Marchi
@ 2022-03-07 20:27     ` Pedro Alves
  2022-03-07 20:31       ` Simon Marchi
  0 siblings, 1 reply; 23+ messages in thread
From: Pedro Alves @ 2022-03-07 20:27 UTC (permalink / raw)
  To: Simon Marchi, gdb-patches

On 2022-03-07 20:08, Simon Marchi wrote:
> On 2022-03-03 09:40, Pedro Alves wrote:
>> @@ -2814,7 +2822,23 @@ linux_nat_filter_event (int lwpid, int status)
>>  	  /* Don't report an event for the exit of an LWP not in our
>>  	     list, i.e. not part of any inferior we're debugging.
>>  	     This can happen if we detach from a program we originally
>> -	     forked and then it exits.  */
>> +	     forked and then it exits.  However, note that we may have
>> +	     earlier deleted a leader of an inferior we're debugging,
>> +	     in check_zombie_leaders.  Re-add it back here if so.  */
>> +	  for (inferior *inf : all_inferiors (linux_target))
>> +	    {
>> +	      if (inf->pid == lwpid)
>> +		{
>> +		  linux_nat_debug_printf
>> +		    ("Re-adding thread group leader LWP %d after exit.",
>> +		     lwpid);
>> +
>> +		  lp = add_lwp (ptid_t (lwpid, lwpid));
>> +		  lp->resumed = 1;
>> +		  add_thread (linux_target, lp->ptid);
>> +		  break;
>> +		}
>> +	    }
> 
> Should we do this only if WIFEXITED?  We don't expect to get any other
> event from a deleted leader, so it's more to be safe in case there is a
> kernel bug or something that would make us see something else.
> 

Hmm, can you clarify?  We're doing this for both WIFEXITED and WIFSIGNALED.  The
WIFSTOPPED case doesn't get here, it's handled in the if/then branch.  I do think
we need to do this for exits due to WIFSIGNALED just the same.  Do you think
otherwise?

> Otherwise, LGTM.
> 
> Simon
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 07/11] Re-add zombie leader on exit, gdb/linux
  2022-03-07 20:27     ` Pedro Alves
@ 2022-03-07 20:31       ` Simon Marchi
  2022-03-09 14:37         ` Pedro Alves
  0 siblings, 1 reply; 23+ messages in thread
From: Simon Marchi @ 2022-03-07 20:31 UTC (permalink / raw)
  To: Pedro Alves, gdb-patches

On 2022-03-07 15:27, Pedro Alves wrote:
> Hmm, can you clarify?  We're doing this for both WIFEXITED and WIFSIGNALED.  The
> WIFSTOPPED case doesn't get here, it's handled in the if/then branch.  I do think
> we need to do this for exits due to WIFSIGNALED just the same.  Do you think
> otherwise?

Ah, I missed the WIFSTOPPED check above.  The case I was thinking of was "what if we
get a WIFSTOPPED for a deleted zombie leader, that would probably be some problem
in the kernel, we don't want to add it back".  And I didn't think about WIFSIGNALED.

So effectively, I was proposing to have:

else if (WIFSIGNALED (status) || WIFEXITED (status))

instead of an "else".  But given that WIFSIGNALED and WIFEXITED are the only
plausible alternatives to WIFSTOPPED, it's not really necessary.

Simon

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 09/11] Ensure EXIT is last event, gdb/linux
  2022-03-07 20:24   ` Simon Marchi
@ 2022-03-09  0:21     ` Lancelot SIX
  2022-03-09 14:45       ` Pedro Alves
  0 siblings, 1 reply; 23+ messages in thread
From: Lancelot SIX @ 2022-03-09  0:21 UTC (permalink / raw)
  To: Simon Marchi; +Cc: Pedro Alves, gdb-patches

On Mon, Mar 07, 2022 at 03:24:49PM -0500, Simon Marchi wrote:
> On 2022-03-03 09:40, Pedro Alves wrote:
> > This is problematic for infrun.c:stop_all_threads, which asks the
> > target to report all thread exit events to infrun.  For example, in
> > stop_all_threads, core GDB counts 2 threads that need to be stopped.
> > It then asks the target to stop those 2 threads (with
> > target_stop(ptid)), and waits for 2 events to come back from the
> > target.  Unfortunately, when waiting for events, the linux-nat target,
> > due to the random event selection mentioned above, reports the
> > whole-process EXIT event even though the other thread has exited and
> > its exit hasn't been reported yet.  As a consequence, stop_all_threads
> > receives one event, but waits indefinitely for the second one to come.
> > Effectively, GDB hangs forever.
> 
> I don't really understand how we end up waiting forever.  Why does
> reporting the leader exit event first make it so we discard the event
> for the other thread?
> 
> Other than that, it looks sensible to me to ensure we return events in
> this order.
> 
> Simon

Hi,

You are right, this patch is not directly linked to GDB hanging forever.
It was written as part of a series originally started to fix a hang,
which ended up handling more.  I think there was a mixup when
reworking the patches before sending.

The hang can happen because of a false positive in the zombie
detection (which is worked around in another patch in this series).  In
this case, if GDB is waiting for stop events from 2 threads in
stop_all_threads and the zombie detection drops a thread at that moment,
then only the second thread reports an event, causing GDB to hang
forever waiting for an event from the first thread.  The hanging issue
aside, we end up with a process terminating with the exit event + exit
code of a non-leader thread, which is incorrect.

This led us to realize that there are other situations where the
process exit is not reported with the correct thread, and so the exit
code returned to infrun is not the expected one.

When a multithreaded process terminates, we are supposed to first see
the exit of each non-leader thread, and only then the exit of the leader
thread.  This is what we pulled from the kernel using waitpid in
linux-nat.  However linux-nat uses some "fairness" and randomly selects
one of the events it cached to report to infrun first.  In current GDB
(i.e.  before this series), if the random function chooses the exit from
the leader first, linux-nat generates a THREAD_EXITED event for it, and
will generate the EXITED event when it sees the last remaining
(non-leader) thread exit.  The exit code associated with this EXITED
event is the one of the non-leader thread, and this is wrong.

This patch improves the random selection and ensures that we do not
report the leader exit as long as there are non-leader exit events
pending (and there should be).  For any well-behaved process, this
ensures that we report the correct exit code with the EXITED event.

This overall helps reasoning in infrun: we receive the THREAD_EXITED
events first, and the EXITED event last.

We will rework the commit message and remove the mention of the hang in
this patch.

Best,
Lancelot.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 07/11] Re-add zombie leader on exit, gdb/linux
  2022-03-07 20:31       ` Simon Marchi
@ 2022-03-09 14:37         ` Pedro Alves
  0 siblings, 0 replies; 23+ messages in thread
From: Pedro Alves @ 2022-03-09 14:37 UTC (permalink / raw)
  To: Simon Marchi, gdb-patches

On 2022-03-07 20:31, Simon Marchi wrote:
> On 2022-03-07 15:27, Pedro Alves wrote:
>> Hmm, can you clarify?  We're doing this for both WIFEXITED and WIFSIGNALED.  The
>> WIFSTOPPED case doesn't get here, it's handled in the if/then branch.  I do think
>> we need to do this for exits due to WIFSIGNALED just the same.  Do you think
>> otherwise?
> 
> Ah, I missed the WIFSTOPPED check above.  The case I was thinking of was "what if we
> get a WIFSTOPPED for a deleted zombie leader, that would probably be some problem
> in the kernel, we don't want to add it back".  And I didn't think about WIFSIGNALED.
> 
> So effectively, I was proposing to have:
> 
> else if (WIFSIGNALED (status) || WIFEXITED (status))
> 
> instead of an "else".  But given that WIFSIGNALED and WIFEXITED are the only
> plausible alternatives to WIFSTOPPED, it's not really necessary.

Yeah, it's pretty standard to use !WIFSTOPPED to mean "exited for any reason".
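
For reference, a short sketch of that standard partitioning (the
describe_status helper name is made up for this example):

  #include <stdio.h>
  #include <sys/wait.h>

  static void
  describe_status (int status)
  {
    if (WIFSTOPPED (status))
      printf ("stopped by signal %d\n", WSTOPSIG (status));
    else if (WIFEXITED (status))
      printf ("exited normally, code %d\n", WEXITSTATUS (status));
    else if (WIFSIGNALED (status))
      printf ("killed by signal %d\n", WTERMSIG (status));
  }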



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 09/11] Ensure EXIT is last event, gdb/linux
  2022-03-09  0:21     ` Lancelot SIX
@ 2022-03-09 14:45       ` Pedro Alves
  2022-03-09 22:29         ` Lancelot SIX
  0 siblings, 1 reply; 23+ messages in thread
From: Pedro Alves @ 2022-03-09 14:45 UTC (permalink / raw)
  To: Lancelot SIX, Simon Marchi; +Cc: gdb-patches

On 2022-03-09 00:21, Lancelot SIX wrote:
> On Mon, Mar 07, 2022 at 03:24:49PM -0500, Simon Marchi wrote:
>> On 2022-03-03 09:40, Pedro Alves wrote:
>>> This is problematic for infrun.c:stop_all_threads, which asks the
>>> target to report all thread exit events to infrun.  For example, in
>>> stop_all_threads, core GDB counts 2 threads that need to be stopped.
>>> It then asks the target to stop those 2 threads (with
>>> target_stop(ptid)), and waits for 2 events to come back from the
>>> target.  Unfortunately, when waiting for events, the linux-nat target,
>>> due to the random event selection mentioned above, reports the
>>> whole-process EXIT event even though the other thread has exited and
>>> its exit hasn't been reported yet.  As a consequence, stop_all_threads
>>> receives one event, but waits indefinitely for the second one to come.
>>> Effectively, GDB hangs forever.
>>
>> I don't really understand how we end up waiting forever.  Why does
>> reporting the leader exit event first make it so we discard the event
>> for the other thread?
>>
>> Other than that, it looks sensible to me to ensure we return events in
>> this order.
>>
>> Simon
> 
> Hi,
> 
> You are right, this patch is not directly linked to GDB hanging forever.
> It was written as part of a series originally started to fix a hang,
> which ended up handling more.  I think there was a mixup when
> reworking the patches before sending.

My bad, I based the commit message off of the wrong patch's original commit
message, and never questioned it.  :-P

> 
> The hang can happen because of a false positive in the zombie
> detection (which is worked around in another patch in this series).  

Right, the zombie detection patch (#7) also ensures that we only ever
report the process-exit event for the leader thread, instead of whatever
is the last lwp in the list, which is what fixes it.

> In
> this case, if GDB is waiting for stop events from 2 threads in
> stop_all_threads and the zombie detection drops a thread at that moment,
> then only the second thread reports an event, causing GDB to hang
> forever waiting for an event from the first thread.  The hanging issue
> aside, we end up with a process terminating with the exit event + exit
> code of a non-leader thread, which is incorrect.

Thanks.  I double checked now that this is already described in patch #7's
commit log, so there's nothing to change in any other patch, AFAICT.

> 
> This led us to realize that there are other situations where the
> process exit is not reported with the correct thread, and so the exit
> code returned to infrun is not the expected one.
> 
> When a multithreaded process terminates, we are supposed to first see
> the exit of each non-leader thread, and only then the exit of the leader
> thread.  This is what we pulled from the kernel using waitpid in
> linux-nat.  However linux-nat uses some "fairness" and randomly selects
> one of the events it cached to report to infrun first.  In current GDB
> (i.e.  before this series), if the random function chooses the exit from
> the leader first, linux-nat generates a THREAD_EXITED event for it, and
> will generate the EXITED event when it sees the last remaining
> (non-leader) thread exit.  The exit code associated with this EXITED
> event is the one of the non-leader thread, and this is wrong.

Right.  This is no longer a concern after patch #7, though, as linux-nat
will only ever report the EXITED event for the leader.

> 
> This patch improves the random selection and ensures that we do not
> report the leader exit as long as there are non-leader exit events
> pending (and there should be).  For any well-behaved process, this
> ensures that we report the correct exit code with the EXITED event.

Right, agreed, if applied in isolation.  It would help in that case, though if the
zombie detection kicked in, the issue would still trigger, as you're aware.  But since
patch #7 ensured we only ever report the exit for the leader, that patch fixed both
the "well-behaved" case and the case where the leader really exits.

So the only remaining effect of this patch is ensuring a natural order
of events out of linux-nat, and we must evaluate it on those grounds alone now.

> 
> This overall helps reasoning in infrun: we receive the THREAD_EXITED
> events first, and the EXITED event last.
> 
> We will rework the commit message and remove the mention of the hang in
> this patch.
> 

As we discussed yesterday internally, for stop_all_threads, it actually doesn't
matter in which order the events are sent, as all events will be left pending.
Later on, infrun itself will pick one event at random from among all pending
events on the infrun side (infrun.c:random_pending_event_thread), and may well pick up
the EXITED event before the THREAD_EXITED event anyhow, so to ensure proper ordering
throughout we'd need to tweak infrun as well.

There's also another point that we're not addressing yet, which is that EXECD events are
similar to EXITED events, in that they are process-wide events.  For example, if a non-leader
thread returns some event to infrun and it is left pending, and then the leader execs, and
the target reports an EXECD event for it and that exec event is also left pending, we'd want to
be sure that the non-leader event is never ever processed after the EXECD event, as that
would result in GDB thinking that the post-exec program had a second thread, when it really
does not.

I think the "badness" with exec does not happen in practice on native Linux currently, because
before the leader sees an exec event out of the kernel, it must see an exit event for the other
thread, even if it was ptrace-stopped, and so linux-nat deletes the non-leader thread before the
leader sees the execd, and as the pending events as stored in the thread, the pending event is
deleted as well everywhere automatically.
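
A minimal standalone sketch of that kernel behavior, hypothetical and
independent of this series (build with -pthread): exec from a
non-leader thread silently reaps every other thread, and the execing
thread takes over the leader's pid:

  #include <pthread.h>
  #include <stdio.h>
  #include <unistd.h>
  #include <sys/syscall.h>

  static void *
  do_exec (void *arg)
  {
    printf ("before exec: pid=%ld tid=%ld\n",
            (long) getpid (), (long) syscall (SYS_gettid));

    /* After the exec, the process is single-threaded again, and this
       thread's tid has become the leader's pid.  */
    execl ("/bin/echo", "echo", "after exec: single-threaded", (char *) NULL);
    return NULL;
  }

  int
  main (void)
  {
    pthread_t thr;

    pthread_create (&thr, NULL, do_exec, NULL);
    pause ();   /* Never returns: the exec above destroys this thread.  */
  }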

However, for remote/gdbserver, it's not so clear something odd can't happen, because gdbserver's
linux-low.cc deleting a thread does not immediately result in gdb's corresponding thread being
deleted.  I'm not sure we're guaranteed to delete the non-leader's pending event in all stages 
that queue events.  I'm thinking of gdbserver's notification queue and remote.c's queue.

So overall, I'm thinking of dropping this patch from the series for now, as it's not really
needed AFAICT, and is incomplete.

I've dropped patches #9 and #10 locally, and left the testcase (patch #11) running in a loop
testing both native and gdbserver for over an hour and saw no failures.  Any objections
to pushing the series without those two patches?

Pedro Alves

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 09/11] Ensure EXIT is last event, gdb/linux
  2022-03-09 14:45       ` Pedro Alves
@ 2022-03-09 22:29         ` Lancelot SIX
  2022-03-10 11:46           ` Pedro Alves
  0 siblings, 1 reply; 23+ messages in thread
From: Lancelot SIX @ 2022-03-09 22:29 UTC (permalink / raw)
  To: Pedro Alves; +Cc: Simon Marchi, gdb-patches

> 
> So overall, I'm thinking of dropping this patch from the series for now, as it's not really
> needed AFAICT, and is incomplete.
> 
> I've dropped patches #9 and #10 locally, and left the testcase (patch #11) running in a loop
> testing both native and gdbserver for over an hour and saw no failures.  Any objections
> to pushing the series without those two patches?

FWIW, that would be fine by me.  As you pointed out, the real fix for the
issues is provided by other patches of this series anyway.

Lancelot.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 09/11] Ensure EXIT is last event, gdb/linux
  2022-03-09 22:29         ` Lancelot SIX
@ 2022-03-10 11:46           ` Pedro Alves
  0 siblings, 0 replies; 23+ messages in thread
From: Pedro Alves @ 2022-03-10 11:46 UTC (permalink / raw)
  To: Lancelot SIX; +Cc: Simon Marchi, gdb-patches

On 2022-03-09 22:29, Lancelot SIX wrote:
>>
>> So overall, I'm thinking of dropping this patch from the series for now, as it's not really
>> needed AFAICT, and is incomplete.
>>
>> I've dropped patches #9 and #10 locally, and left the testcase (patch #11) running in a loop
>> testing both native and gdbserver for over an hour and saw no failures.  Any objections
>> to pushing the series without those two patches?
> 
> WHIW, that would be fine by me.  As you pointed out the real fix for the
> issues is provided by other patches of this series anyway.

Alright, Simon also said offlist he was fine with the rest of the series, so I went ahead
and merged it.

Thanks for the reviews/discussions.  

Pedro Alves

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 11/11] Process exit status is leader exit status testcase
  2022-03-03 14:40 ` [PATCH 11/11] Process exit status is leader exit status testcase Pedro Alves
@ 2023-06-22 11:28   ` Ilya Leoshkevich
  2023-06-22 13:07     ` Tom de Vries
  0 siblings, 1 reply; 23+ messages in thread
From: Ilya Leoshkevich @ 2023-06-22 11:28 UTC (permalink / raw)
  To: Pedro Alves, gdb-patches

On Thu, 2022-03-03 at 14:40 +0000, Pedro Alves wrote:
> From: Lancelot SIX <lancelot.six@amd.com>
> 
> This adds a multi-threaded testcase that has all threads in the
> process exit with a different exit code, and ensures that GDB reports
> the thread group leader's exit status as the whole-process exit
> status.  Before this set of patches, this would randomly report the
> exit code of some other thread, and thus fail.
> 
> Tested on Linux-x86_64, native and gdbserver.
> 
> Co-Authored-By: Pedro Alves <pedro@palves.net>
> Change-Id: I30cba2ff4576fb01b5169cc72667f3268d919557
> ---
>  ...rocess-exit-status-is-leader-exit-status.c | 64
> +++++++++++++++++++
>  ...cess-exit-status-is-leader-exit-status.exp | 45 +++++++++++++
>  2 files changed, 109 insertions(+)
>  create mode 100644 gdb/testsuite/gdb.threads/process-exit-status-is-
> leader-exit-status.c
>  create mode 100644 gdb/testsuite/gdb.threads/process-exit-status-is-
> leader-exit-status.exp

This test fails on kernels >= 6.1.  The symptom is that the kernel
reports that all threads exit with the same status - that of the first
thread to exit.  Bisect points to:


commit d80f7d7b2c75c5954d335dffbccca62a5002c3e0
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Tue Jun 21 14:38:52 2022 -0500

signal: Guarantee that SIGNAL_GROUP_EXIT is set on process exit

Track how many threads have not started exiting and when the last
thread starts exiting set SIGNAL_GROUP_EXIT.

This guarantees that SIGNAL_GROUP_EXIT will get set when a process
exits.  In practice this achieves nothing as glibc's implementation of
_exit calls sys_group_exit then sys_exit.  While glibc's implementation
of pthread_exit calls exit (which cleans up and calls _exit) if it is
the last thread, and sys_exit if it is not the last thread.

This means the only way the kernel might observe a process that does
not call exit_group is if the language runtime does not use glibc.

With more cleanups I hope to move the decrement of quick_threads
earlier.

Link:
https://lkml.kernel.org/r/87bkukd4tc.fsf_-_@email.froward.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


I'm not sure what to do with it - can one claim that this is a kernel
regression?  Or should GDB be able to deal with that?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 11/11] Process exit status is leader exit status testcase
  2023-06-22 11:28   ` Ilya Leoshkevich
@ 2023-06-22 13:07     ` Tom de Vries
  0 siblings, 0 replies; 23+ messages in thread
From: Tom de Vries @ 2023-06-22 13:07 UTC (permalink / raw)
  To: Ilya Leoshkevich, Pedro Alves, gdb-patches

On 6/22/23 13:28, Ilya Leoshkevich via Gdb-patches wrote:
>>   create mode 100644 gdb/testsuite/gdb.threads/process-exit-status-is-
>> leader-exit-status.exp
> 
> This test fails on kernels >= 6.1.  The symptom is that the kernel
> reports that all threads exit with the same status - that of the first
> thread to exit.  Bisect points to:
> 
> 

There's a gdb PR open about this: 
https://sourceware.org/bugzilla/show_bug.cgi?id=29965.

There's an off-list conversation about this with Eric. Last update is 
from January.  I'll ping him.

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2023-06-22 13:07 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-03 14:40 [PATCH 00/11] Linux: Fix issues around thread group leader exits Pedro Alves
2022-03-03 14:40 ` [PATCH 01/11] Fix gdbserver/linux target_waitstatus logging assert Pedro Alves
2022-03-03 14:40 ` [PATCH 02/11] Fix gdb.threads/clone-new-thread-event.exp race Pedro Alves
2022-03-03 14:40 ` [PATCH 03/11] Fix gdb.threads/current-lwp-dead.exp race Pedro Alves
2022-03-03 14:40 ` [PATCH 04/11] gdb: Reorganize linux_nat_filter_event Pedro Alves
2022-03-03 14:40 ` [PATCH 05/11] gdbserver: Reorganize linux_process_target::filter_event Pedro Alves
2022-03-03 14:40 ` [PATCH 06/11] gdbserver: Reindent check_zombie_leaders Pedro Alves
2022-03-03 14:40 ` [PATCH 07/11] Re-add zombie leader on exit, gdb/linux Pedro Alves
2022-03-07 20:08   ` Simon Marchi
2022-03-07 20:27     ` Pedro Alves
2022-03-07 20:31       ` Simon Marchi
2022-03-09 14:37         ` Pedro Alves
2022-03-03 14:40 ` [PATCH 08/11] Re-add zombie leader on exit, gdbserver/linux Pedro Alves
2022-03-03 14:40 ` [PATCH 09/11] Ensure EXIT is last event, gdb/linux Pedro Alves
2022-03-07 20:24   ` Simon Marchi
2022-03-09  0:21     ` Lancelot SIX
2022-03-09 14:45       ` Pedro Alves
2022-03-09 22:29         ` Lancelot SIX
2022-03-10 11:46           ` Pedro Alves
2022-03-03 14:40 ` [PATCH 10/11] Ensure EXIT is last event, gdbserver/linux Pedro Alves
2022-03-03 14:40 ` [PATCH 11/11] Process exit status is leader exit status testcase Pedro Alves
2023-06-22 11:28   ` Ilya Leoshkevich
2023-06-22 13:07     ` Tom de Vries

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).