From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=G4XC=7T=simark.ca=simark@sourceware.org>
Received: from simark.ca (simark.ca [158.69.221.121])
	by sourceware.org (Postfix) with ESMTPS id 9ADC33858C1F
	for <gdb-patches@sourceware.org>; Mon, 27 Mar 2023 21:04:47 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 9ADC33858C1F
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=simark.ca
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=simark.ca
Received: from [10.0.0.170] (unknown [217.28.27.60])
	(using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits))
	(No client certificate requested)
	by simark.ca (Postfix) with ESMTPSA id 34F3E1E0D3;
	Mon, 27 Mar 2023 17:04:47 -0400 (EDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=simark.ca; s=mail;
	t=1679951087; bh=LlZB84lSC5jeeRhIYawLdB3uzNLTOFtUAhQnGaca9cM=;
	h=Date:Subject:To:References:From:In-Reply-To:From;
	b=OPOy4H9yiDhrjw4yEI4wjKMOnnHbOZaMGMv9zwSdETxLnQhKfNT49OcUa0mNiJEFa
	 y2/KQgNSLozqp13jrOu9nw9B84uvbY8Ah0renqzatZ3sbzFZk6dGLSvSD4OeD0PUxj
	 zaJ4TiQY+j3wekK9qOS5KCADNeuoE+BxmdTKzqlg=
Message-ID: <1bc805b6-04f7-a22a-c666-67054ad09dfa@simark.ca>
Date: Mon, 27 Mar 2023 17:04:46 -0400
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
 Thunderbird/102.9.0
Subject: Re: [PATCH 5/9] fbsd-nat: Defer any ineligible events reported by
 wait.
Content-Language: fr
To: John Baldwin <jhb@FreeBSD.org>, gdb-patches@sourceware.org
References: <20230228181845.99936-1-jhb@FreeBSD.org>
 <20230228181845.99936-6-jhb@FreeBSD.org>
 <e96945e0-ba6e-0568-7106-27696cd46a88@simark.ca>
 <fb0a6e52-1e39-7995-4115-471dcae4d1f3@FreeBSD.org>
From: Simon Marchi <simark@simark.ca>
In-Reply-To: <fb0a6e52-1e39-7995-4115-471dcae4d1f3@FreeBSD.org>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, score=-10.7 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,GIT_PATCH_0,NICE_REPLY_A,SPF_HELO_PASS,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gdb-patches.sourceware.org>

On 3/27/23 16:49, John Baldwin wrote:
> On 3/20/23 12:09 PM, Simon Marchi wrote:
>> On 2/28/23 13:18, John Baldwin wrote:
>>> If wait_1 finds an event for a thread or process that does not match
>>> the set of threads and processes previously resumed, defer the event.
>>> If the event is for a specific thread, suspend the thread and continue
>>> the associated process before waiting for another event.
>>
>> I do not understand this, because it doesn't match my mental model of
>> how things should work (which is derived from working with linux-nat).
>> In that model, a thread with a pending event should never be
>> ptrace-resumed.  So the last sentence doesn't make sense, it implies
>> that a thread has a pending and is ptrace-resumed (since we suspend it).
> 
> The difference here might be that on FreeBSD ptrace can only PT_CONTINUE
> and wait() for entire processes.  When a thread contains multiple threads,
> PT_CONTINUE resumes all threads belonging to that process.  Individual
> threads can be controlled by using PT_SUSPEND on a specific LWP.  Those
> LWPs will stay asleep after PT_CONTINUE.  If any thread encounters a
> stop event (breakpoint, fork event, etc.) the entire process stops (the
> kernel signals threads in userland if needed to get them back into the
> kernel to stop).  Once it is stopped it reports an event via wait().
> 
> Currently on FreeBSD wait() only reports a single event at once.  So
> if multiple threads hit a breakpoint, wait() returns for the first one,
> PT_LWPINFO is used on the pid to find out what specific event happened
> (and which LWP it happened for).  The other thread will keep its event
> pending in the kernel, and after PT_CONTINUE it will immediately trigger
> another stop and an event reported from wait().

Thanks for the explanation.  I'll keep that in mind when re-reading
(perhaps a v2, if you want to rebase and submit a fresh version).

>>> One specific example of such an event is if a thread is created while
>>> another thread in the same process hits a breakpoint.  If the second
>>> thread's event is reported first, the target resume method does not
>>> yet "know" about the new thread and will not suspend it via
>>> PT_SUSPEND.
>>
>> I was surprised to read that the resume method suspends some threads.  I
>> don't think it should, but I do see the current code.  The current code
>> appears to interpret the resume request as: "ensure the state of thread
>> resumption looks exactly like this", so it suspends threads as needed,
>> to match the requested resumption state.  I don't think it's supposed to
>> work like that.  The resume method should only resume some threads, and
>> if the core of GDB wants some threads stopped, it will call the stop
>> method.
> 
> See above.  I can't individually suspend/resume threads.  I can only
> resume processes, but decide which set of threads within a process
> to keep resumed when resuming the entire process.

Ack.

> 
>> Anyhow, just to be sure I understand the problem you describe: when
>> fbsd-nat reports the breakpoint hit from "another thread", is the
>> "thread created" event still in the kernel, not consumed by GDB?
> 
> Yes.  What happens is that GDB tries to single-step a thread over a
> breakpoint, so it does a resume() of a single ptid_t.  To implement
> this, fbsd-nat's resume has to explicitly PT_SUSPEND all the other
> LWPs in the process before doing a PT_CONTINUE so that only the
> stepped thread will execute.  However, this logic doesn't yet know
> about the new thread (and it hasn't started executing in userland
> yet).  However, once the new thread starts executing, it immediately
> reports a ptrace event.  This confuses GDB currently becuase that
> ptrace event can be reported right after the PT_CONTINUE.  I think
> in part this is because unlike on Linux, when thread A creates thread
> B, thread A doesn't report anything.  Instead, only thread B reports
> an event when it starts.  On Linux thread creation is more like fork
> where thread A reports an event saying it created B.

Ack.

>  
>>> When wait is called, it will probably return the event
>>> from the first thread before the result of the step from second
>>> thread.  This is the case reported in PR 21497.
>>
>>> ---
>>>   gdb/fbsd-nat.c | 34 +++++++++++++++++++++++++++++++++-
>>>   1 file changed, 33 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/gdb/fbsd-nat.c b/gdb/fbsd-nat.c
>>> index ca278b871ef..3f7278c6ea0 100644
>>> --- a/gdb/fbsd-nat.c
>>> +++ b/gdb/fbsd-nat.c
>>> @@ -1514,7 +1514,39 @@ fbsd_nat_target::wait (ptid_t ptid, struct target_waitstatus *ourstatus,
>>>     if (is_async_p ())
>>>       async_file_flush ();
>>>   -  wptid = wait_1 (ptid, ourstatus, target_options);
>>> +  while (1)
>>> +    {
>>> +      wptid = wait_1 (ptid, ourstatus, target_options);
>>> +
>>> +      /* If no event was found, just return.  */
>>> +      if (ourstatus->kind () == TARGET_WAITKIND_IGNORE
>>> +      || ourstatus->kind () == TARGET_WAITKIND_NO_RESUMED)
>>> +    break;
>>> +
>>> +      /* If an event is reported for a thread or process while
>>> +     stepping some other thread, suspend the thread reporting the
>>> +     event and defer the event until it can be reported to the
>>> +     core.  */
>>> +      if (!wptid.matches (resume_ptid))
>>> +    {
>>> +      add_pending_event (wptid, *ourstatus);
>>> +      fbsd_nat_debug_printf ("deferring event [%s], [%s]",
>>> +                 target_pid_to_str (wptid).c_str (),
>>> +                 ourstatus->to_string ().c_str ());
>>> +      if (wptid.pid () == resume_ptid.pid ())
>>> +        {
>>> +          fbsd_nat_debug_printf ("suspending thread [%s]",
>>> +                     target_pid_to_str (wptid).c_str ());
>>> +          if (ptrace (PT_SUSPEND, wptid.lwp (), NULL, 0) == -1)
>>> +        perror_with_name (("ptrace (PT_SUSPEND)"));
>>
>>
>> This is the part I don't understand.  After wait_1 returned an event for
>> a specific thread, I would expect this thread to be ptrace-stopped.  So,
>> I don't see the need to suspend it.  You'd just leave it in the "resumed
>> from the
>> until.
>>
>>> +          if (ptrace (PT_CONTINUE, wptid.pid (), (caddr_t) 1, 0) == -1)
>>> +        perror_with_name (("ptrace (PT_CONTINUE)"));
>>
>> I don't get why you continue `wptid.pid ()`.  What assures you that this
>> particular thread was stopped previously?  Perhaps I don't understand
>> because I assume somethings about pids/lwps and ptrace that are only
>> true on Linux.
> 
> In this specific case, here is the sequence of events:
> 
> - process P1 contains two threads, T1, and T2
> - T1 creates a new thread T3, but T3 isn't yet scheduled to execute
> - T2 hits a breakpoint so the process stops
> - wait() returns P1
> - PT_LWPINFO reports the breakpoint event for T2
> - GDB undoes the breakpoint and tries to single-step T2 over the
>   breakpoint
> - GDB calls resume with step=1 and scope_ptid=ptid(P1, T2)
> - fbsd_nat::resume iterates over the threads it knows about, but
>   it only knows about T1 and T2
>   - PT_SUSPEND T1
>   - PT_RESUME T2
> - the P1 process is now resumed via PT_CONTINUE
> - T3 starts executing and reports its "born" event
> - wait() returns P1
> - PT_LWPINFO reports the thread birth event for T3
> 
> Previously at this point fbsd_nat::wait() returned an event for T3
> which triggers the assertion failure in the PR.  With this patch
> what happens now is:
> 
> - fbsd_nat::wait() realizes T3 isn't "resumed" from the core's
>   perspective so explicitly suspends it via PT_SUSPEND
> - the P1 process is continued via PT_CONTINUE
> - T2 completes its step and reports the event
> - wait() returns P1
> - PT_LWPINFO reports the step-complete event for P2
> - the event for T2 is returned from fbsd_nat::wait() to the core

Ok, thanks, that makes sense.

Simon