public inbox for gdb-cvs@sourceware.org
help / color / mirror / Atom feed
* [binutils-gdb] gdbserver/linux: set lwp !stopped when failing to resume
@ 2022-03-31 17:04 Simon Marchi
  0 siblings, 0 replies; only message in thread
From: Simon Marchi @ 2022-03-31 17:04 UTC (permalink / raw)
  To: gdb-cvs

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=20471e00e2ea400c5a7126a98fb58c6a83b50628

commit 20471e00e2ea400c5a7126a98fb58c6a83b50628
Author: Simon Marchi <simon.marchi@polymtl.ca>
Date:   Tue Jan 18 14:33:57 2022 -0500

    gdbserver/linux: set lwp !stopped when failing to resume
    
    I see some failures, at least in gdb.multi/multi-re-run.exp and
    gdb.threads/interrupted-hand-call.exp.  Running `stress -C $(nproc)` at
    the same time as the test makes those tests relatively frequent.
    
    Let's take gdb.multi/multi-re-run.exp as an example.  The failure looks
    like this, an unexpected "no resumed":
    
        continue
        Continuing.
        No unwaited-for children left.
        (gdb) FAIL: gdb.multi/multi-re-run.exp: re_run_inf=2: iter=1: continue until exit
    
    The situation is:
    
     - Inferior 1 is stopped somewhere, it won't really play a role here.
     - Inferior 2 has 2 threads, both stopped.
     - We resume inferior 2, the leader thread is expected to exit, making
       the process exit.
    
    From GDB's perspective, a failing run looks like this:
    
        [infrun] fetch_inferior_event: enter
          [infrun] scoped_disable_commit_resumed: reason=handling event
          [infrun] do_target_wait: Found 2 inferiors, starting at #1
          [infrun] random_pending_event_thread: None found.
          [remote] wait: enter
            [remote] Packet received: T0506:20dcffffff7f0000;07:20dcffffff7f0000;10:9551555555550000;thread:pae4cd.ae4cd;core:e;
          [remote] wait: exit
          [infrun] print_target_wait_results: target_wait (-1.0.0 [process -1], status) =
          [infrun] print_target_wait_results:   713933.713933.0 [Thread 713933.713933],
          [infrun] print_target_wait_results:   status->kind = STOPPED, sig = GDB_SIGNAL_TRAP
          [infrun] handle_inferior_event: status->kind = STOPPED, sig = GDB_SIGNAL_TRAP
          [infrun] clear_step_over_info: clearing step over info
          [infrun] context_switch: Switching context from 0.0.0 to 713933.713933.0
          [infrun] handle_signal_stop: stop_pc=0x555555555195
          [infrun] start_step_over: enter
            [infrun] start_step_over: stealing global queue of threads to step, length = 0
            [infrun] operator(): step-over queue now empty
          [infrun] start_step_over: exit
          [infrun] process_event_stop_test: no stepping, continue
          [remote] Sending packet: $Z0,555555555194,1#8e
          [remote] Packet received: OK
          [infrun] resume_1: step=0, signal=GDB_SIGNAL_0, trap_expected=0, current thread [713933.713933.0] at 0x555555555195
          [remote] Sending packet: $QPassSignals:e;10;14;17;1a;1b;1c;21;24;25;2c;4c;97;#0a
          [remote] Packet received: OK
          [remote] Sending packet: $vCont;c:pae4cd.-1#9f
          [infrun] prepare_to_wait: prepare_to_wait
          [infrun] reset: reason=handling event
          [infrun] maybe_set_commit_resumed_all_targets: enabling commit-resumed for target extended-remote
          [infrun] maybe_call_commit_resumed_all_targets: calling commit_resumed for target extended-remote
          [infrun] maybe_call_commit_resumed_all_targets: calling commit_resumed for target extended-remote
        [infrun] fetch_inferior_event: exit
        [infrun] fetch_inferior_event: enter
          [infrun] scoped_disable_commit_resumed: reason=handling event
          [infrun] do_target_wait: Found 2 inferiors, starting at #0
          [infrun] random_pending_event_thread: None found.
          [remote] wait: enter
            [remote] Packet received: N
          [remote] wait: exit
          [infrun] print_target_wait_results: target_wait (-1.0.0 [process -1], status) =
          [infrun] print_target_wait_results:   -1.0.0 [process -1],
          [infrun] print_target_wait_results:   status->kind = NO_RESUMED
          [infrun] handle_inferior_event: status->kind = NO_RESUMED
          [remote] Sending packet: $Hgp0.0#ad
          [remote] Packet received: OK
          [remote] Sending packet: $qXfer:threads:read::0,1000#92
          [remote] Packet received: l<threads>\n<thread id="pae4cb.ae4cb" core="3" name="multi-re-run-1" handle="40c7c6f7ff7f0000"/>\n<thread id="pae4cb.ae4cc" core="2" name="multi-re-run-1" handle="40b6c6f7ff7f0000"/>\n<thread id="pae4cd.ae4ce" core="1" name="multi-re-run-2" handle="40b6c6f7ff7f0000"/>\n</threads>\n
          [infrun] stop_waiting: stop_waiting
          [remote] Sending packet: $qXfer:threads:read::0,1000#92
          [remote] Packet received: l<threads>\n<thread id="pae4cb.ae4cb" core="3" name="multi-re-run-1" handle="40c7c6f7ff7f0000"/>\n<thread id="pae4cb.ae4cc" core="2" name="multi-re-run-1" handle="40b6c6f7ff7f0000"/>\n<thread id="pae4cd.ae4ce" core="1" name="multi-re-run-2" handle="40b6c6f7ff7f0000"/>\n</threads>\n
          [infrun] infrun_async: enable=0
          [infrun] reset: reason=handling event
          [infrun] maybe_set_commit_resumed_all_targets: enabling commit-resumed for target extended-remote
          [infrun] maybe_call_commit_resumed_all_targets: calling commit_resumed for target extended-remote
          [infrun] maybe_call_commit_resumed_all_targets: calling commit_resumed for target extended-remote
        [infrun] fetch_inferior_event: exit
    
    We can see that we resume the inferior with vCont;c, but got NO_RESUMED.
    When the test passes, we get an EXITED status to indicate the process
    has exited.
    
    From GDBserver's point of view, it looks like this.  The logs contain
    some logging I added and that are part of this patch.
    
        [remote] getpkt: getpkt ("vCont;c:pae4cf.-1");  [no ack sent]
        [threads] resume: enter
          [threads] thread_needs_step_over: Need step over [LWP 713931]? Ignoring, should remain stopped
          [threads] thread_needs_step_over: Need step over [LWP 713932]? Ignoring, should remain stopped
          [threads] get_pc: pc is 0x555555555195
          [threads] thread_needs_step_over: Need step over [LWP 713935]? No, no breakpoint found at 0x555555555195
          [threads] get_pc: pc is 0x7ffff7d35a95
          [threads] thread_needs_step_over: Need step over [LWP 713936]? No, no breakpoint found at 0x7ffff7d35a95
          [threads] resume: Resuming, no pending status or step over needed
          [threads] resume_one_thread: resuming LWP 713935
          [threads] proceed_one_lwp: lwp 713935
          [threads] resume_one_lwp_throw:   continue from pc 0x555555555195
          [threads] resume_one_lwp_throw: Resuming lwp 713935 (continue, signal 0, stop not expected)
          [threads] resume_one_lwp_throw: NOW ptid=713935.713935.0 stopped=0 resumed=0
          [threads] resume_one_thread: resuming LWP 713936
          [threads] proceed_one_lwp: lwp 713936
          [threads] resume_one_lwp_throw:   continue from pc 0x7ffff7d35a95
          [threads] resume_one_lwp_throw: Resuming lwp 713936 (continue, signal 0, stop not expected)
          [threads] resume_one_lwp_throw: ptrace errno = 3 (No such process)
        [threads] resume: exit
        [threads] wait_1: enter
          [threads] wait_1: [<all threads>]
          [threads] wait_for_event_filtered: waitpid(-1, ...) returned 0, ERRNO-OK
          [threads] resume_stopped_resumed_lwps: resuming stopped-resumed LWP LWP 713935.713936 at 7ffff7d35a95: step=0
          [threads] resume_one_lwp_throw:   continue from pc 0x7ffff7d35a95
          [threads] resume_one_lwp_throw: Resuming lwp 713936 (continue, signal 0, stop not expected)
          [threads] resume_one_lwp_throw: ptrace errno = 3 (No such process)
          [threads] operator(): check_zombie_leaders: leader_pid=713931, leader_lp!=NULL=1, num_lwps=2, zombie=0
          [threads] operator(): check_zombie_leaders: leader_pid=713935, leader_lp!=NULL=1, num_lwps=2, zombie=1
          [threads] operator(): Thread group leader 713935 zombie (it exited, or another thread execd).
          [threads] delete_lwp: deleting 713935
          [threads] wait_for_event_filtered: exit (no unwaited-for LWP)
        sigchld_handler
          [threads] wait_1: ret = null_ptid, TARGET_WAITKIND_NO_RESUMED
        [threads] wait_1: exit
    
    What happens is:
    
     - We resume the leader (713935) successfully.
     - The leader exits.
     - We resume the secondary thread (713936), we get ESRCH.  This is
       expected this the leader has exited.
     - resume_one_lwp_throw throws, it's caught by resume_one_lwp.
     - resume_one_lwp checks with check_ptrace_stopped_lwp_gone that the
       failure can be explained by the LWP becoming zombie, and swallows the
       error.
     - Note that this means that the secondary lwp still has stopped==1.
     - wait_1 is called, probably because linux_process_target::resume marks
       the async pipe at the end.
     - The exit event isn't ready yet, probably because the machine is under
       load, so waitpid returns nothing.
     - check_zombie_leaders detects that the leader is zombie and deletes
     - We try to find a resumed (non-stopped) LWP to get an event from,
       there's none since the leader (that was resumed) is now deleted, and
       the secondary thread is still marked stopped.
       wait_for_event_filtered returns -1, causing wait_1 to return
       NO_RESUMED.
    
    What I notice here is that there is some kind of race between the
    availability of the process' exit notification and the call to wait_1
    that results from marking the async pipe at the end of resume.
    
    I think what we want from this wait_1 invocation is to keep waiting, as
    we will eventually get thread exit notifications for both of our
    threads.
    
    The fix I came up with is to mark the secondary thread as !stopped (or
    resumed) when we fail to resume it.  This makes wait_1 see that there is
    at least one resume lwp, so it won't return NO_RESUMED.  I think this
    makes sense to consider it resumed, because we are going to receive an
    exit event for it.  Here's the GDBserver logs with the fix applied:
    
        [threads] resume: enter
          [threads] thread_needs_step_over: Need step over [LWP 724595]? Ignoring, should remain stopped
          [threads] thread_needs_step_over: Need step over [LWP 724596]? Ignoring, should remain stopped
          [threads] get_pc: pc is 0x555555555195
          [threads] thread_needs_step_over: Need step over [LWP 724597]? No, no breakpoint found at 0x555555555195
          [threads] get_pc: pc is 0x7ffff7d35a95
          [threads] thread_needs_step_over: Need step over [LWP 724598]? No, no breakpoint found at 0x7ffff7d35a95
          [threads] resume: Resuming, no pending status or step over needed
          [threads] resume_one_thread: resuming LWP 724597
          [threads] proceed_one_lwp: lwp 724597
          [threads] resume_one_lwp_throw:   continue from pc 0x555555555195
          [threads] resume_one_lwp_throw: Resuming lwp 724597 (continue, signal 0, stop not expected)
          [threads] resume_one_lwp_throw: NOW ptid=724597.724597.0 stopped=0 resumed=0
          [threads] resume_one_thread: resuming LWP 724598
          [threads] proceed_one_lwp: lwp 724598
          [threads] resume_one_lwp_throw:   continue from pc 0x7ffff7d35a95
          [threads] resume_one_lwp_throw: Resuming lwp 724598 (continue, signal 0, stop not expected)
          [threads] resume_one_lwp_throw: ptrace errno = 3 (No such process)
        [threads] resume: exit
        [threads] wait_1: enter
          [threads] wait_1: [<all threads>]
        sigchld_handler
          [threads] wait_for_event_filtered: waitpid(-1, ...) returned 0, ERRNO-OK
          [threads] operator(): check_zombie_leaders: leader_pid=724595, leader_lp!=NULL=1, num_lwps=2, zombie=0
          [threads] operator(): check_zombie_leaders: leader_pid=724597, leader_lp!=NULL=1, num_lwps=2, zombie=1
          [threads] operator(): Thread group leader 724597 zombie (it exited, or another thread execd).
          [threads] delete_lwp: deleting 724597
          [threads] wait_for_event_filtered: sigsuspend'ing
        sigchld_handler
          [threads] wait_for_event_filtered: waitpid(-1, ...) returned 724598, ERRNO-OK
          [threads] wait_for_event_filtered: waitpid 724598 received 0 (exited)
          [threads] filter_event: 724598 exited
          [threads] wait_for_event_filtered: waitpid(-1, ...) returned 724597, ERRNO-OK
          [threads] wait_for_event_filtered: waitpid 724597 received 0 (exited)
          [threads] wait_for_event_filtered: waitpid(-1, ...) returned 0, ERRNO-OK
        sigchld_handler
          [threads] wait_1: ret = LWP 724597.724598, exited with retcode 0
        [threads] wait_1: exit
    
    Change-Id: Idf0bdb4cb0313f1b49e4864071650cc83fb3c100

Diff:
---
 gdbserver/linux-low.cc | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/gdbserver/linux-low.cc b/gdbserver/linux-low.cc
index 7726a4a0c36..d68d4199f26 100644
--- a/gdbserver/linux-low.cc
+++ b/gdbserver/linux-low.cc
@@ -4091,7 +4091,15 @@ linux_process_target::resume_one_lwp_throw (lwp_info *lwp, int step,
 	  (PTRACE_TYPE_ARG4) (uintptr_t) signal);
 
   if (errno)
-    perror_with_name ("resuming thread");
+    {
+      int saved_errno = errno;
+
+      threads_debug_printf ("ptrace errno = %d (%s)",
+			    saved_errno, strerror (saved_errno));
+
+      errno = saved_errno;
+      perror_with_name ("resuming thread");
+    }
 
   /* Successfully resumed.  Clear state that no longer makes sense,
      and mark the LWP as running.  Must not do this before resuming
@@ -4152,7 +4160,15 @@ linux_process_target::resume_one_lwp (lwp_info *lwp, int step, int signal,
     }
   catch (const gdb_exception_error &ex)
     {
-      if (!check_ptrace_stopped_lwp_gone (lwp))
+      if (check_ptrace_stopped_lwp_gone (lwp))
+	{
+	  /* This could because we tried to resume an LWP after its leader
+	     exited.  Mark it as resumed, so we can collect an exit event
+	     from it.  */
+	  lwp->stopped = 0;
+	  lwp->stop_reason = TARGET_STOPPED_BY_NO_REASON;
+	}
+      else
 	throw;
     }
 }


^ permalink raw reply	[flat|nested] only message in thread

only message in thread, other threads:[~2022-03-31 17:04 UTC | newest]

Thread overview: (only message) (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-31 17:04 [binutils-gdb] gdbserver/linux: set lwp !stopped when failing to resume Simon Marchi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).