From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp.polymtl.ca (smtp.polymtl.ca [132.207.4.11]) by sourceware.org (Postfix) with ESMTPS id 490B238515FB for ; Wed, 18 Aug 2021 21:05:23 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 490B238515FB Received: from simark.ca (simark.ca [158.69.221.121]) (authenticated bits=0) by smtp.polymtl.ca (8.14.7/8.14.7) with ESMTP id 17IL5IeD017196 (version=TLSv1/SSLv3 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT) for ; Wed, 18 Aug 2021 17:05:22 -0400 DKIM-Filter: OpenDKIM Filter v2.11.0 smtp.polymtl.ca 17IL5IeD017196 Received: from [10.0.0.11] (192-222-157-6.qc.cable.ebox.net [192.222.157.6]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits)) (No client certificate requested) by simark.ca (Postfix) with ESMTPSA id E92E61E4A3 for ; Wed, 18 Aug 2021 17:05:17 -0400 (EDT) Subject: Re: [PATCH] gdb/linux-nat: use LP's inferior when handling vfork done in linux_handle_extended_wait To: gdb-patches@sourceware.org References: <20210812154823.966724-1-simon.marchi@polymtl.ca> From: Simon Marchi Message-ID: <46a82d89-a8cb-20a4-cbaf-11ca25f56025@polymtl.ca> Date: Wed, 18 Aug 2021 17:05:17 -0400 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.13.0 MIME-Version: 1.0 In-Reply-To: <20210812154823.966724-1-simon.marchi@polymtl.ca> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit X-Poly-FromMTA: (simark.ca [158.69.221.121]) at Wed, 18 Aug 2021 21:05:18 +0000 X-Spam-Status: No, score=-11.0 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, NICE_REPLY_A, RCVD_IN_MSPIKE_H3, RCVD_IN_MSPIKE_WL, SPF_HELO_PASS, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: gdb-patches@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gdb-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 Aug 2021 21:05:33 -0000 On 2021-08-12 11:48 a.m., Simon Marchi wrote: > I spotted this bug by reading the code and subsequently wrote a test to > reproduce it. The bug is caught by the assertions that are added. > Otherwise the bug wouldn't cause a visible problem, but GDB would still > be in a wrong state. > > When receiving a PTRACE_EVENT_VFORK_DONE, we do: > > if (event == PTRACE_EVENT_VFORK_DONE) > { > if (current_inferior ()->waiting_for_vfork_done) > { > linux_nat_debug_printf > ("Got expected PTRACE_EVENT_VFORK_DONE from LWP %ld: stopping", > lp->ptid.lwp ()); > > ourstatus->kind = TARGET_WAITKIND_VFORK_DONE; > return 0; > } > > linux_nat_debug_printf > ("Got PTRACE_EVENT_VFORK_DONE from LWP %ld: ignoring", lp->ptid.lwp ()); > > return 1; > } > > However, the current inferior may not be the inferior for which we have > received the event, it could be another inferior of the linux-nat > target. The current inferior is set in do_target_wait_1 before calling > target_wait, so that target_wait hits the target stack we want. And > there isn't anything making LP's inferior current between pulling the > event out of the kernel and that code that handles > PTRACE_EVENT_VFORK_DONE. > > Let's say that inferior 1 is waiting for a vfork done event while > inferior 2 is executing, doing things not vfork-related. Inferior 2 is > chosen by do_target_wait and made the current one prior to calling > target_wait. We pull the PTRACE_EVENT_VFORK_DONE event from the kernel > for inferior 1. We check the current_inferior (inferior 2)'s > waiting_for_vfork_done flag. It is false, so we (wrongfully) ignore the > event that was meant for inferior 1. > > I thought that this would end up in inferior 1 getting stuck, waiting > for the event forever. But we actually happen to attempt to resume > inferior 1 at various points (like in resume_stopped_resumed_lwps) after > having pulled the events), so inferior 1 resumes execution and is not > stuck. However, the inferior::waiting_for_vfork_done flag stays set for > inferior 1, which is not a correct state. A correct execution would > return TARGET_WAITKIND_VFORK_DONE to infrun for inferior 1, which would > clear the flag. When processing TARGET_WAITKIND_VFORK_DONE, infrun also > clears the program_space::breakpoints_not_allowed flag. That sounds > important, and failing to do that might lead to some other bad behavior. > > To catch this bad state, add some assertions in handle_inferior_event. > The idea is that if an inferior is a vfork parent, we expect that it > can't report any events other than TARGET_WAITKIND_VFORK_DONE or > TARGET_WAITKIND_SIGNALLED (which can happen if you kill -9 it during the > vfork period). The test program does multiple consecutive vforks. If > the bug explained above happens and waiting_for_vfork_done stays > wrongfully set, the assertion will fail when a different event. > > If I run the test without the fix in linux-nat.c, I get: > > run^M > Starting program: /home/simark/build/binutils-gdb/gdb/testsuite/outputs/gdb.base/vfork-multi-inferior/vfork-multi-inferior-vforker ^M > [Detaching after vfork from child process 822537]^M > /home/simark/src/binutils-gdb/gdb/infrun.c:5255: internal-error: void handle_inferior_event(execution_control_state*): Assertion `!inf->waiting_for_vfork_ done' failed.^M > > At some point during my investigation, I suspected the random inferior > selection in do_target_wait to not work properly (always choosing one of > the two inferiors), so I wrote the test to try with the vforker as > inferior 1 and then as inferior 2. In the end, I don't think that's > necessary, but I left the test as-is, as it will just test more > combinations, and the test runs quickly anyway. > > Change-Id: Ie5b922d4f988d3fabf9f41fdeb83166da00af269 > --- > gdb/infrun.c | 13 +++ > gdb/linux-nat.c | 4 +- > .../gdb.base/vfork-multi-inferior-other.c | 12 ++ > .../gdb.base/vfork-multi-inferior-vforker.c | 24 ++++ > .../gdb.base/vfork-multi-inferior.exp | 107 ++++++++++++++++++ > 5 files changed, 159 insertions(+), 1 deletion(-) > create mode 100644 gdb/testsuite/gdb.base/vfork-multi-inferior-other.c > create mode 100644 gdb/testsuite/gdb.base/vfork-multi-inferior-vforker.c > create mode 100644 gdb/testsuite/gdb.base/vfork-multi-inferior.exp > > diff --git a/gdb/infrun.c b/gdb/infrun.c > index 5ee650fa4645..87224f1e5096 100644 > --- a/gdb/infrun.c > +++ b/gdb/infrun.c > @@ -5242,6 +5242,19 @@ handle_inferior_event (struct execution_control_state *ecs) > > mark_non_executing_threads (ecs->target, ecs->ptid, ecs->ws); > > + /* If an inferior is a vfork parent, we expect it to be stopped, the only > + two possible outcomes for it is VFORK_DONE (when the vfork child exits > + or execs) or SIGNALLED, if is is forcefully killed (e.g. kill -9) from > + outside GDB. */ > + if (ecs->ws.kind != TARGET_WAITKIND_VFORK_DONE > + && ecs->ws.kind != TARGET_WAITKIND_SIGNALLED) > + { > + inferior *inf = find_inferior_ptid (ecs->target, ecs->ptid); > + gdb_assert (inf != nullptr); > + gdb_assert (inf->vfork_child == nullptr); > + gdb_assert (!inf->waiting_for_vfork_done); > + } New piece of information. Today I learned that when a multi-threaded program vforks, only the vfork-ing thread is frozen until the vfork child execs or exits (this is documented in the vfork man page on Linux). Other threads in the vfork-ing process still run. So while a thread is waiting for a vfork done event, other threads could in theory report some other event. But that leads to the question: is it really safe that the other threads in the vfork-ing process stay running? During the window of time where the vfork-child and parent share an address space, we remove the breakpoints from the parent: https://gitlab.com/gnutools/binutils-gdb/-/blob/master/gdb/infrun.c#L439-449 On non-stop targets, by keeping the other threads running, don't we risk them missing a breakpoint? On all-stop targets, when we resume the vfork parent (expecting to receive a vfork done event, which is the sign where it's finally safe to re-insert the breakpoints), we do resume all threads. That means the other threads of the vfork parent run free without breakpoints inserted, can't they also miss a breakpoint? I'm thinking that when handling a vfork where we detach the child, the safe thing to do would be: - non-stop target: stop all threads of the interior, remove breakpoints, resume only the vfork-ing thread, get the vfork done event, re-insert the breakpoints, resume all threads meant to be running - all-stop target: all threads are naturally stopped when the vfork event is reported, remove breakpoints, resume only the vfork-ing thread, get the vfork done event, re-insert breakpoints, resume all threads meant to be running That's pretty much the same thing as when doing an in-line single-step. Maybe that's actually what we do today, but I couldn't find signs of it. In the mean time, I think that the fix in linux-nat.c (shown below) is still right and would like to merge it, but I would remove the assertions. But that means I wouldn't know how to write a test. > + > switch (ecs->ws.kind) > { > case TARGET_WAITKIND_LOADED: > diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c > index 211e447dc6f4..1b45f4e6c875 100644 > --- a/gdb/linux-nat.c > +++ b/gdb/linux-nat.c > @@ -2027,7 +2027,9 @@ linux_handle_extended_wait (struct lwp_info *lp, int status) > > if (event == PTRACE_EVENT_VFORK_DONE) > { > - if (current_inferior ()->waiting_for_vfork_done) > + inferior *inf = find_inferior_ptid (linux_target, lp->ptid); > + > + if (inf->waiting_for_vfork_done) Simon