From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <simon.marchi@polymtl.ca>
Received: from smtp.polymtl.ca (smtp.polymtl.ca [132.207.4.11])
 by sourceware.org (Postfix) with ESMTPS id 490B238515FB
 for <gdb-patches@sourceware.org>; Wed, 18 Aug 2021 21:05:23 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 490B238515FB
Received: from simark.ca (simark.ca [158.69.221.121]) (authenticated bits=0)
 by smtp.polymtl.ca (8.14.7/8.14.7) with ESMTP id 17IL5IeD017196
 (version=TLSv1/SSLv3 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT)
 for <gdb-patches@sourceware.org>; Wed, 18 Aug 2021 17:05:22 -0400
DKIM-Filter: OpenDKIM Filter v2.11.0 smtp.polymtl.ca 17IL5IeD017196
Received: from [10.0.0.11] (192-222-157-6.qc.cable.ebox.net [192.222.157.6])
 (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits))
 (No client certificate requested)
 by simark.ca (Postfix) with ESMTPSA id E92E61E4A3
 for <gdb-patches@sourceware.org>; Wed, 18 Aug 2021 17:05:17 -0400 (EDT)
Subject: Re: [PATCH] gdb/linux-nat: use LP's inferior when handling vfork done
 in linux_handle_extended_wait
To: gdb-patches@sourceware.org
References: <20210812154823.966724-1-simon.marchi@polymtl.ca>
From: Simon Marchi <simon.marchi@polymtl.ca>
Message-ID: <46a82d89-a8cb-20a4-cbaf-11ca25f56025@polymtl.ca>
Date: Wed, 18 Aug 2021 17:05:17 -0400
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
 Thunderbird/78.13.0
MIME-Version: 1.0
In-Reply-To: <20210812154823.966724-1-simon.marchi@polymtl.ca>
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 7bit
X-Poly-FromMTA: (simark.ca [158.69.221.121]) at Wed, 18 Aug 2021 21:05:18 +0000
X-Spam-Status: No, score=-11.0 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, NICE_REPLY_A,
 RCVD_IN_MSPIKE_H3, RCVD_IN_MSPIKE_WL, SPF_HELO_PASS, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: gdb-patches@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gdb-patches mailing list <gdb-patches.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/gdb-patches>,
 <mailto:gdb-patches-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/gdb-patches/>
List-Post: <mailto:gdb-patches@sourceware.org>
List-Help: <mailto:gdb-patches-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/gdb-patches>,
 <mailto:gdb-patches-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Wed, 18 Aug 2021 21:05:33 -0000

On 2021-08-12 11:48 a.m., Simon Marchi wrote:
> I spotted this bug by reading the code and subsequently wrote a test to
> reproduce it.  The bug is caught by the assertions that are added.
> Otherwise the bug wouldn't cause a visible problem, but GDB would still
> be in a wrong state.
> 
> When receiving a PTRACE_EVENT_VFORK_DONE, we do:
> 
>   if (event == PTRACE_EVENT_VFORK_DONE)
>     {
>       if (current_inferior ()->waiting_for_vfork_done)
> 	{
> 	  linux_nat_debug_printf
> 	    ("Got expected PTRACE_EVENT_VFORK_DONE from LWP %ld: stopping",
> 	     lp->ptid.lwp ());
> 
> 	  ourstatus->kind = TARGET_WAITKIND_VFORK_DONE;
> 	  return 0;
> 	}
> 
>       linux_nat_debug_printf
> 	("Got PTRACE_EVENT_VFORK_DONE from LWP %ld: ignoring", lp->ptid.lwp ());
> 
>       return 1;
>     }
> 
> However, the current inferior may not be the inferior for which we have
> received the event, it could be another inferior of the linux-nat
> target.  The current inferior is set in do_target_wait_1 before calling
> target_wait, so that target_wait hits the target stack we want.  And
> there isn't anything making LP's inferior current between pulling the
> event out of the kernel and that code that handles
> PTRACE_EVENT_VFORK_DONE.
> 
> Let's say that inferior 1 is waiting for a vfork done event while
> inferior 2 is executing, doing things not vfork-related.  Inferior 2 is
> chosen by do_target_wait and made the current one prior to calling
> target_wait.  We pull the PTRACE_EVENT_VFORK_DONE event from the kernel
> for inferior 1.  We check the current_inferior (inferior 2)'s
> waiting_for_vfork_done flag.  It is false, so we (wrongfully) ignore the
> event that was meant for inferior 1.
> 
> I thought that this would end up in inferior 1 getting stuck, waiting
> for the event forever.  But we actually happen to attempt to resume
> inferior 1 at various points (like in resume_stopped_resumed_lwps) after
> having pulled the events), so inferior 1 resumes execution and is not
> stuck.  However, the inferior::waiting_for_vfork_done flag stays set for
> inferior 1, which is not a correct state.  A correct execution would
> return TARGET_WAITKIND_VFORK_DONE to infrun for inferior 1, which would
> clear the flag.  When processing TARGET_WAITKIND_VFORK_DONE, infrun also
> clears the program_space::breakpoints_not_allowed flag.  That sounds
> important, and failing to do that might lead to some other bad behavior.
> 
> To catch this bad state, add some assertions in handle_inferior_event.
> The idea is that if an inferior is a vfork parent, we expect that it
> can't report any events other than TARGET_WAITKIND_VFORK_DONE or
> TARGET_WAITKIND_SIGNALLED (which can happen if you kill -9 it during the
> vfork period).  The test program does multiple consecutive vforks.  If
> the bug explained above happens and waiting_for_vfork_done stays
> wrongfully set, the assertion will fail when a different event.
> 
> If I run the test without the fix in linux-nat.c, I get:
> 
>     run^M
>     Starting program: /home/simark/build/binutils-gdb/gdb/testsuite/outputs/gdb.base/vfork-multi-inferior/vfork-multi-inferior-vforker ^M
>     [Detaching after vfork from child process 822537]^M
>     /home/simark/src/binutils-gdb/gdb/infrun.c:5255: internal-error: void handle_inferior_event(execution_control_state*): Assertion `!inf->waiting_for_vfork_    done' failed.^M
> 
> At some point during my investigation, I suspected the random inferior
> selection in do_target_wait to not work properly (always choosing one of
> the two inferiors), so I wrote the test to try with the vforker as
> inferior 1 and then as inferior 2.  In the end, I don't think that's
> necessary, but I left the test as-is, as it will just test more
> combinations, and the test runs quickly anyway.
> 
> Change-Id: Ie5b922d4f988d3fabf9f41fdeb83166da00af269
> ---
>  gdb/infrun.c                                  |  13 +++
>  gdb/linux-nat.c                               |   4 +-
>  .../gdb.base/vfork-multi-inferior-other.c     |  12 ++
>  .../gdb.base/vfork-multi-inferior-vforker.c   |  24 ++++
>  .../gdb.base/vfork-multi-inferior.exp         | 107 ++++++++++++++++++
>  5 files changed, 159 insertions(+), 1 deletion(-)
>  create mode 100644 gdb/testsuite/gdb.base/vfork-multi-inferior-other.c
>  create mode 100644 gdb/testsuite/gdb.base/vfork-multi-inferior-vforker.c
>  create mode 100644 gdb/testsuite/gdb.base/vfork-multi-inferior.exp
> 
> diff --git a/gdb/infrun.c b/gdb/infrun.c
> index 5ee650fa4645..87224f1e5096 100644
> --- a/gdb/infrun.c
> +++ b/gdb/infrun.c
> @@ -5242,6 +5242,19 @@ handle_inferior_event (struct execution_control_state *ecs)
>  
>    mark_non_executing_threads (ecs->target, ecs->ptid, ecs->ws);
>  
> +  /* If an inferior is a vfork parent, we expect it to be stopped, the only
> +     two possible outcomes for it is VFORK_DONE (when the vfork child exits
> +     or execs) or SIGNALLED, if is is forcefully killed (e.g. kill -9) from
> +     outside GDB.  */
> +  if (ecs->ws.kind != TARGET_WAITKIND_VFORK_DONE
> +      && ecs->ws.kind != TARGET_WAITKIND_SIGNALLED)
> +    {
> +      inferior *inf = find_inferior_ptid (ecs->target, ecs->ptid);
> +      gdb_assert (inf != nullptr);
> +      gdb_assert (inf->vfork_child == nullptr);
> +      gdb_assert (!inf->waiting_for_vfork_done);
> +    }

New piece of information.  Today I learned that when a multi-threaded
program vforks, only the vfork-ing thread is frozen until the vfork
child execs or exits (this is documented in the vfork man page on
Linux).  Other threads in the vfork-ing process still run.  So while a
thread is waiting for a vfork done event, other threads could in theory
report some other event.

But that leads to the question: is it really safe that the other threads
in the vfork-ing process stay running?  During the window of time where
the vfork-child and parent share an address space, we remove the
breakpoints from the parent:

  https://gitlab.com/gnutools/binutils-gdb/-/blob/master/gdb/infrun.c#L439-449

On non-stop targets, by keeping the other threads running, don't we risk
them missing a breakpoint?

On all-stop targets, when we resume the vfork parent (expecting to
receive a vfork done event, which is the sign where it's finally safe to
re-insert the breakpoints), we do resume all threads.  That means the
other threads of the vfork parent run free without breakpoints inserted,
can't they also miss a breakpoint?

I'm thinking that when handling a vfork where we detach the child, the
safe thing to do would be:

 - non-stop target: stop all threads of the interior, remove
   breakpoints, resume only the vfork-ing thread, get the vfork done
   event, re-insert the breakpoints, resume all threads meant to be
   running

 - all-stop target: all threads are naturally stopped when the vfork
   event is reported, remove breakpoints, resume only the vfork-ing
   thread, get the vfork done event, re-insert breakpoints, resume all
   threads meant to be running

That's pretty much the same thing as when doing an in-line single-step.
Maybe that's actually what we do today, but I couldn't find signs of it.

In the mean time, I think that the fix in linux-nat.c (shown below) is
still right and would like to merge it, but I would remove the
assertions.  But that means I wouldn't know how to write a test.

> +
>    switch (ecs->ws.kind)
>      {
>      case TARGET_WAITKIND_LOADED:
> diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
> index 211e447dc6f4..1b45f4e6c875 100644
> --- a/gdb/linux-nat.c
> +++ b/gdb/linux-nat.c
> @@ -2027,7 +2027,9 @@ linux_handle_extended_wait (struct lwp_info *lp, int status)
>  
>    if (event == PTRACE_EVENT_VFORK_DONE)
>      {
> -      if (current_inferior ()->waiting_for_vfork_done)
> +      inferior *inf = find_inferior_ptid (linux_target, lp->ptid);
> +
> +      if (inf->waiting_for_vfork_done)

Simon