From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: by sourceware.org (Postfix, from userid 1879) id 20D9B385741C; Thu, 31 Mar 2022 17:04:38 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 20D9B385741C Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable From: Simon Marchi To: gdb-cvs@sourceware.org Subject: [binutils-gdb] gdbserver/linux: set lwp !stopped when failing to resume X-Act-Checkin: binutils-gdb X-Git-Author: Simon Marchi X-Git-Refname: refs/heads/master X-Git-Oldrev: 7820b634f74e4df874a00fe08afdb7ecc3d4747c X-Git-Newrev: 20471e00e2ea400c5a7126a98fb58c6a83b50628 Message-Id: <20220331170438.20D9B385741C@sourceware.org> Date: Thu, 31 Mar 2022 17:04:38 +0000 (GMT) X-BeenThere: gdb-cvs@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gdb-cvs mailing list List-Unsubscribe: , List-Archive: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 31 Mar 2022 17:04:38 -0000 https://sourceware.org/git/gitweb.cgi?p=3Dbinutils-gdb.git;h=3D20471e00e2ea= 400c5a7126a98fb58c6a83b50628 commit 20471e00e2ea400c5a7126a98fb58c6a83b50628 Author: Simon Marchi Date: Tue Jan 18 14:33:57 2022 -0500 gdbserver/linux: set lwp !stopped when failing to resume =20 I see some failures, at least in gdb.multi/multi-re-run.exp and gdb.threads/interrupted-hand-call.exp. Running `stress -C $(nproc)` at the same time as the test makes those tests relatively frequent. =20 Let's take gdb.multi/multi-re-run.exp as an example. The failure looks like this, an unexpected "no resumed": =20 continue Continuing. No unwaited-for children left. (gdb) FAIL: gdb.multi/multi-re-run.exp: re_run_inf=3D2: iter=3D1: c= ontinue until exit =20 The situation is: =20 - Inferior 1 is stopped somewhere, it won't really play a role here. - Inferior 2 has 2 threads, both stopped. - We resume inferior 2, the leader thread is expected to exit, making the process exit. =20 From GDB's perspective, a failing run looks like this: =20 [infrun] fetch_inferior_event: enter [infrun] scoped_disable_commit_resumed: reason=3Dhandling event [infrun] do_target_wait: Found 2 inferiors, starting at #1 [infrun] random_pending_event_thread: None found. [remote] wait: enter [remote] Packet received: T0506:20dcffffff7f0000;07:20dcffffff7= f0000;10:9551555555550000;thread:pae4cd.ae4cd;core:e; [remote] wait: exit [infrun] print_target_wait_results: target_wait (-1.0.0 [process = -1], status) =3D [infrun] print_target_wait_results: 713933.713933.0 [Thread 713= 933.713933], [infrun] print_target_wait_results: status->kind =3D STOPPED, s= ig =3D GDB_SIGNAL_TRAP [infrun] handle_inferior_event: status->kind =3D STOPPED, sig =3D= GDB_SIGNAL_TRAP [infrun] clear_step_over_info: clearing step over info [infrun] context_switch: Switching context from 0.0.0 to 713933.7= 13933.0 [infrun] handle_signal_stop: stop_pc=3D0x555555555195 [infrun] start_step_over: enter [infrun] start_step_over: stealing global queue of threads to s= tep, length =3D 0 [infrun] operator(): step-over queue now empty [infrun] start_step_over: exit [infrun] process_event_stop_test: no stepping, continue [remote] Sending packet: $Z0,555555555194,1#8e [remote] Packet received: OK [infrun] resume_1: step=3D0, signal=3DGDB_SIGNAL_0, trap_expected= =3D0, current thread [713933.713933.0] at 0x555555555195 [remote] Sending packet: $QPassSignals:e;10;14;17;1a;1b;1c;21;24;= 25;2c;4c;97;#0a [remote] Packet received: OK [remote] Sending packet: $vCont;c:pae4cd.-1#9f [infrun] prepare_to_wait: prepare_to_wait [infrun] reset: reason=3Dhandling event [infrun] maybe_set_commit_resumed_all_targets: enabling commit-re= sumed for target extended-remote [infrun] maybe_call_commit_resumed_all_targets: calling commit_re= sumed for target extended-remote [infrun] maybe_call_commit_resumed_all_targets: calling commit_re= sumed for target extended-remote [infrun] fetch_inferior_event: exit [infrun] fetch_inferior_event: enter [infrun] scoped_disable_commit_resumed: reason=3Dhandling event [infrun] do_target_wait: Found 2 inferiors, starting at #0 [infrun] random_pending_event_thread: None found. [remote] wait: enter [remote] Packet received: N [remote] wait: exit [infrun] print_target_wait_results: target_wait (-1.0.0 [process = -1], status) =3D [infrun] print_target_wait_results: -1.0.0 [process -1], [infrun] print_target_wait_results: status->kind =3D NO_RESUMED [infrun] handle_inferior_event: status->kind =3D NO_RESUMED [remote] Sending packet: $Hgp0.0#ad [remote] Packet received: OK [remote] Sending packet: $qXfer:threads:read::0,1000#92 [remote] Packet received: l\n\n\n\n\n [infrun] stop_waiting: stop_waiting [remote] Sending packet: $qXfer:threads:read::0,1000#92 [remote] Packet received: l\n\n\n\n\n [infrun] infrun_async: enable=3D0 [infrun] reset: reason=3Dhandling event [infrun] maybe_set_commit_resumed_all_targets: enabling commit-re= sumed for target extended-remote [infrun] maybe_call_commit_resumed_all_targets: calling commit_re= sumed for target extended-remote [infrun] maybe_call_commit_resumed_all_targets: calling commit_re= sumed for target extended-remote [infrun] fetch_inferior_event: exit =20 We can see that we resume the inferior with vCont;c, but got NO_RESUMED. When the test passes, we get an EXITED status to indicate the process has exited. =20 From GDBserver's point of view, it looks like this. The logs contain some logging I added and that are part of this patch. =20 [remote] getpkt: getpkt ("vCont;c:pae4cf.-1"); [no ack sent] [threads] resume: enter [threads] thread_needs_step_over: Need step over [LWP 713931]? Ig= noring, should remain stopped [threads] thread_needs_step_over: Need step over [LWP 713932]? Ig= noring, should remain stopped [threads] get_pc: pc is 0x555555555195 [threads] thread_needs_step_over: Need step over [LWP 713935]? No= , no breakpoint found at 0x555555555195 [threads] get_pc: pc is 0x7ffff7d35a95 [threads] thread_needs_step_over: Need step over [LWP 713936]? No= , no breakpoint found at 0x7ffff7d35a95 [threads] resume: Resuming, no pending status or step over needed [threads] resume_one_thread: resuming LWP 713935 [threads] proceed_one_lwp: lwp 713935 [threads] resume_one_lwp_throw: continue from pc 0x555555555195 [threads] resume_one_lwp_throw: Resuming lwp 713935 (continue, si= gnal 0, stop not expected) [threads] resume_one_lwp_throw: NOW ptid=3D713935.713935.0 stoppe= d=3D0 resumed=3D0 [threads] resume_one_thread: resuming LWP 713936 [threads] proceed_one_lwp: lwp 713936 [threads] resume_one_lwp_throw: continue from pc 0x7ffff7d35a95 [threads] resume_one_lwp_throw: Resuming lwp 713936 (continue, si= gnal 0, stop not expected) [threads] resume_one_lwp_throw: ptrace errno =3D 3 (No such proce= ss) [threads] resume: exit [threads] wait_1: enter [threads] wait_1: [] [threads] wait_for_event_filtered: waitpid(-1, ...) returned 0, E= RRNO-OK [threads] resume_stopped_resumed_lwps: resuming stopped-resumed L= WP LWP 713935.713936 at 7ffff7d35a95: step=3D0 [threads] resume_one_lwp_throw: continue from pc 0x7ffff7d35a95 [threads] resume_one_lwp_throw: Resuming lwp 713936 (continue, si= gnal 0, stop not expected) [threads] resume_one_lwp_throw: ptrace errno =3D 3 (No such proce= ss) [threads] operator(): check_zombie_leaders: leader_pid=3D713931, = leader_lp!=3DNULL=3D1, num_lwps=3D2, zombie=3D0 [threads] operator(): check_zombie_leaders: leader_pid=3D713935, = leader_lp!=3DNULL=3D1, num_lwps=3D2, zombie=3D1 [threads] operator(): Thread group leader 713935 zombie (it exite= d, or another thread execd). [threads] delete_lwp: deleting 713935 [threads] wait_for_event_filtered: exit (no unwaited-for LWP) sigchld_handler [threads] wait_1: ret =3D null_ptid, TARGET_WAITKIND_NO_RESUMED [threads] wait_1: exit =20 What happens is: =20 - We resume the leader (713935) successfully. - The leader exits. - We resume the secondary thread (713936), we get ESRCH. This is expected this the leader has exited. - resume_one_lwp_throw throws, it's caught by resume_one_lwp. - resume_one_lwp checks with check_ptrace_stopped_lwp_gone that the failure can be explained by the LWP becoming zombie, and swallows the error. - Note that this means that the secondary lwp still has stopped=3D=3D1. - wait_1 is called, probably because linux_process_target::resume marks the async pipe at the end. - The exit event isn't ready yet, probably because the machine is under load, so waitpid returns nothing. - check_zombie_leaders detects that the leader is zombie and deletes - We try to find a resumed (non-stopped) LWP to get an event from, there's none since the leader (that was resumed) is now deleted, and the secondary thread is still marked stopped. wait_for_event_filtered returns -1, causing wait_1 to return NO_RESUMED. =20 What I notice here is that there is some kind of race between the availability of the process' exit notification and the call to wait_1 that results from marking the async pipe at the end of resume. =20 I think what we want from this wait_1 invocation is to keep waiting, as we will eventually get thread exit notifications for both of our threads. =20 The fix I came up with is to mark the secondary thread as !stopped (or resumed) when we fail to resume it. This makes wait_1 see that there is at least one resume lwp, so it won't return NO_RESUMED. I think this makes sense to consider it resumed, because we are going to receive an exit event for it. Here's the GDBserver logs with the fix applied: =20 [threads] resume: enter [threads] thread_needs_step_over: Need step over [LWP 724595]? Ig= noring, should remain stopped [threads] thread_needs_step_over: Need step over [LWP 724596]? Ig= noring, should remain stopped [threads] get_pc: pc is 0x555555555195 [threads] thread_needs_step_over: Need step over [LWP 724597]? No= , no breakpoint found at 0x555555555195 [threads] get_pc: pc is 0x7ffff7d35a95 [threads] thread_needs_step_over: Need step over [LWP 724598]? No= , no breakpoint found at 0x7ffff7d35a95 [threads] resume: Resuming, no pending status or step over needed [threads] resume_one_thread: resuming LWP 724597 [threads] proceed_one_lwp: lwp 724597 [threads] resume_one_lwp_throw: continue from pc 0x555555555195 [threads] resume_one_lwp_throw: Resuming lwp 724597 (continue, si= gnal 0, stop not expected) [threads] resume_one_lwp_throw: NOW ptid=3D724597.724597.0 stoppe= d=3D0 resumed=3D0 [threads] resume_one_thread: resuming LWP 724598 [threads] proceed_one_lwp: lwp 724598 [threads] resume_one_lwp_throw: continue from pc 0x7ffff7d35a95 [threads] resume_one_lwp_throw: Resuming lwp 724598 (continue, si= gnal 0, stop not expected) [threads] resume_one_lwp_throw: ptrace errno =3D 3 (No such proce= ss) [threads] resume: exit [threads] wait_1: enter [threads] wait_1: [] sigchld_handler [threads] wait_for_event_filtered: waitpid(-1, ...) returned 0, E= RRNO-OK [threads] operator(): check_zombie_leaders: leader_pid=3D724595, = leader_lp!=3DNULL=3D1, num_lwps=3D2, zombie=3D0 [threads] operator(): check_zombie_leaders: leader_pid=3D724597, = leader_lp!=3DNULL=3D1, num_lwps=3D2, zombie=3D1 [threads] operator(): Thread group leader 724597 zombie (it exite= d, or another thread execd). [threads] delete_lwp: deleting 724597 [threads] wait_for_event_filtered: sigsuspend'ing sigchld_handler [threads] wait_for_event_filtered: waitpid(-1, ...) returned 7245= 98, ERRNO-OK [threads] wait_for_event_filtered: waitpid 724598 received 0 (exi= ted) [threads] filter_event: 724598 exited [threads] wait_for_event_filtered: waitpid(-1, ...) returned 7245= 97, ERRNO-OK [threads] wait_for_event_filtered: waitpid 724597 received 0 (exi= ted) [threads] wait_for_event_filtered: waitpid(-1, ...) returned 0, E= RRNO-OK sigchld_handler [threads] wait_1: ret =3D LWP 724597.724598, exited with retcode 0 [threads] wait_1: exit =20 Change-Id: Idf0bdb4cb0313f1b49e4864071650cc83fb3c100 Diff: --- gdbserver/linux-low.cc | 20 ++++++++++++++++++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/gdbserver/linux-low.cc b/gdbserver/linux-low.cc index 7726a4a0c36..d68d4199f26 100644 --- a/gdbserver/linux-low.cc +++ b/gdbserver/linux-low.cc @@ -4091,7 +4091,15 @@ linux_process_target::resume_one_lwp_throw (lwp_info= *lwp, int step, (PTRACE_TYPE_ARG4) (uintptr_t) signal); =20 if (errno) - perror_with_name ("resuming thread"); + { + int saved_errno =3D errno; + + threads_debug_printf ("ptrace errno =3D %d (%s)", + saved_errno, strerror (saved_errno)); + + errno =3D saved_errno; + perror_with_name ("resuming thread"); + } =20 /* Successfully resumed. Clear state that no longer makes sense, and mark the LWP as running. Must not do this before resuming @@ -4152,7 +4160,15 @@ linux_process_target::resume_one_lwp (lwp_info *lwp,= int step, int signal, } catch (const gdb_exception_error &ex) { - if (!check_ptrace_stopped_lwp_gone (lwp)) + if (check_ptrace_stopped_lwp_gone (lwp)) + { + /* This could because we tried to resume an LWP after its leader + exited. Mark it as resumed, so we can collect an exit event + from it. */ + lwp->stopped =3D 0; + lwp->stop_reason =3D TARGET_STOPPED_BY_NO_REASON; + } + else throw; } }