From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from simark.ca (simark.ca [158.69.221.121]) by sourceware.org (Postfix) with ESMTPS id 9ADC33858C1F for ; Mon, 27 Mar 2023 21:04:47 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 9ADC33858C1F Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=simark.ca Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=simark.ca Received: from [10.0.0.170] (unknown [217.28.27.60]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits)) (No client certificate requested) by simark.ca (Postfix) with ESMTPSA id 34F3E1E0D3; Mon, 27 Mar 2023 17:04:47 -0400 (EDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=simark.ca; s=mail; t=1679951087; bh=LlZB84lSC5jeeRhIYawLdB3uzNLTOFtUAhQnGaca9cM=; h=Date:Subject:To:References:From:In-Reply-To:From; b=OPOy4H9yiDhrjw4yEI4wjKMOnnHbOZaMGMv9zwSdETxLnQhKfNT49OcUa0mNiJEFa y2/KQgNSLozqp13jrOu9nw9B84uvbY8Ah0renqzatZ3sbzFZk6dGLSvSD4OeD0PUxj zaJ4TiQY+j3wekK9qOS5KCADNeuoE+BxmdTKzqlg= Message-ID: <1bc805b6-04f7-a22a-c666-67054ad09dfa@simark.ca> Date: Mon, 27 Mar 2023 17:04:46 -0400 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.9.0 Subject: Re: [PATCH 5/9] fbsd-nat: Defer any ineligible events reported by wait. Content-Language: fr To: John Baldwin , gdb-patches@sourceware.org References: <20230228181845.99936-1-jhb@FreeBSD.org> <20230228181845.99936-6-jhb@FreeBSD.org> From: Simon Marchi In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=-10.7 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,GIT_PATCH_0,NICE_REPLY_A,SPF_HELO_PASS,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: On 3/27/23 16:49, John Baldwin wrote: > On 3/20/23 12:09 PM, Simon Marchi wrote: >> On 2/28/23 13:18, John Baldwin wrote: >>> If wait_1 finds an event for a thread or process that does not match >>> the set of threads and processes previously resumed, defer the event. >>> If the event is for a specific thread, suspend the thread and continue >>> the associated process before waiting for another event. >> >> I do not understand this, because it doesn't match my mental model of >> how things should work (which is derived from working with linux-nat). >> In that model, a thread with a pending event should never be >> ptrace-resumed. So the last sentence doesn't make sense, it implies >> that a thread has a pending and is ptrace-resumed (since we suspend it). > > The difference here might be that on FreeBSD ptrace can only PT_CONTINUE > and wait() for entire processes. When a thread contains multiple threads, > PT_CONTINUE resumes all threads belonging to that process. Individual > threads can be controlled by using PT_SUSPEND on a specific LWP. Those > LWPs will stay asleep after PT_CONTINUE. If any thread encounters a > stop event (breakpoint, fork event, etc.) the entire process stops (the > kernel signals threads in userland if needed to get them back into the > kernel to stop). Once it is stopped it reports an event via wait(). > > Currently on FreeBSD wait() only reports a single event at once. So > if multiple threads hit a breakpoint, wait() returns for the first one, > PT_LWPINFO is used on the pid to find out what specific event happened > (and which LWP it happened for). The other thread will keep its event > pending in the kernel, and after PT_CONTINUE it will immediately trigger > another stop and an event reported from wait(). Thanks for the explanation. I'll keep that in mind when re-reading (perhaps a v2, if you want to rebase and submit a fresh version). >>> One specific example of such an event is if a thread is created while >>> another thread in the same process hits a breakpoint. If the second >>> thread's event is reported first, the target resume method does not >>> yet "know" about the new thread and will not suspend it via >>> PT_SUSPEND. >> >> I was surprised to read that the resume method suspends some threads. I >> don't think it should, but I do see the current code. The current code >> appears to interpret the resume request as: "ensure the state of thread >> resumption looks exactly like this", so it suspends threads as needed, >> to match the requested resumption state. I don't think it's supposed to >> work like that. The resume method should only resume some threads, and >> if the core of GDB wants some threads stopped, it will call the stop >> method. > > See above. I can't individually suspend/resume threads. I can only > resume processes, but decide which set of threads within a process > to keep resumed when resuming the entire process. Ack. > >> Anyhow, just to be sure I understand the problem you describe: when >> fbsd-nat reports the breakpoint hit from "another thread", is the >> "thread created" event still in the kernel, not consumed by GDB? > > Yes. What happens is that GDB tries to single-step a thread over a > breakpoint, so it does a resume() of a single ptid_t. To implement > this, fbsd-nat's resume has to explicitly PT_SUSPEND all the other > LWPs in the process before doing a PT_CONTINUE so that only the > stepped thread will execute. However, this logic doesn't yet know > about the new thread (and it hasn't started executing in userland > yet). However, once the new thread starts executing, it immediately > reports a ptrace event. This confuses GDB currently becuase that > ptrace event can be reported right after the PT_CONTINUE. I think > in part this is because unlike on Linux, when thread A creates thread > B, thread A doesn't report anything. Instead, only thread B reports > an event when it starts. On Linux thread creation is more like fork > where thread A reports an event saying it created B. Ack. > >>> When wait is called, it will probably return the event >>> from the first thread before the result of the step from second >>> thread. This is the case reported in PR 21497. >> >>> --- >>> gdb/fbsd-nat.c | 34 +++++++++++++++++++++++++++++++++- >>> 1 file changed, 33 insertions(+), 1 deletion(-) >>> >>> diff --git a/gdb/fbsd-nat.c b/gdb/fbsd-nat.c >>> index ca278b871ef..3f7278c6ea0 100644 >>> --- a/gdb/fbsd-nat.c >>> +++ b/gdb/fbsd-nat.c >>> @@ -1514,7 +1514,39 @@ fbsd_nat_target::wait (ptid_t ptid, struct target_waitstatus *ourstatus, >>> if (is_async_p ()) >>> async_file_flush (); >>> - wptid = wait_1 (ptid, ourstatus, target_options); >>> + while (1) >>> + { >>> + wptid = wait_1 (ptid, ourstatus, target_options); >>> + >>> + /* If no event was found, just return. */ >>> + if (ourstatus->kind () == TARGET_WAITKIND_IGNORE >>> + || ourstatus->kind () == TARGET_WAITKIND_NO_RESUMED) >>> + break; >>> + >>> + /* If an event is reported for a thread or process while >>> + stepping some other thread, suspend the thread reporting the >>> + event and defer the event until it can be reported to the >>> + core. */ >>> + if (!wptid.matches (resume_ptid)) >>> + { >>> + add_pending_event (wptid, *ourstatus); >>> + fbsd_nat_debug_printf ("deferring event [%s], [%s]", >>> + target_pid_to_str (wptid).c_str (), >>> + ourstatus->to_string ().c_str ()); >>> + if (wptid.pid () == resume_ptid.pid ()) >>> + { >>> + fbsd_nat_debug_printf ("suspending thread [%s]", >>> + target_pid_to_str (wptid).c_str ()); >>> + if (ptrace (PT_SUSPEND, wptid.lwp (), NULL, 0) == -1) >>> + perror_with_name (("ptrace (PT_SUSPEND)")); >> >> >> This is the part I don't understand. After wait_1 returned an event for >> a specific thread, I would expect this thread to be ptrace-stopped. So, >> I don't see the need to suspend it. You'd just leave it in the "resumed >> from the >> until. >> >>> + if (ptrace (PT_CONTINUE, wptid.pid (), (caddr_t) 1, 0) == -1) >>> + perror_with_name (("ptrace (PT_CONTINUE)")); >> >> I don't get why you continue `wptid.pid ()`. What assures you that this >> particular thread was stopped previously? Perhaps I don't understand >> because I assume somethings about pids/lwps and ptrace that are only >> true on Linux. > > In this specific case, here is the sequence of events: > > - process P1 contains two threads, T1, and T2 > - T1 creates a new thread T3, but T3 isn't yet scheduled to execute > - T2 hits a breakpoint so the process stops > - wait() returns P1 > - PT_LWPINFO reports the breakpoint event for T2 > - GDB undoes the breakpoint and tries to single-step T2 over the > breakpoint > - GDB calls resume with step=1 and scope_ptid=ptid(P1, T2) > - fbsd_nat::resume iterates over the threads it knows about, but > it only knows about T1 and T2 > - PT_SUSPEND T1 > - PT_RESUME T2 > - the P1 process is now resumed via PT_CONTINUE > - T3 starts executing and reports its "born" event > - wait() returns P1 > - PT_LWPINFO reports the thread birth event for T3 > > Previously at this point fbsd_nat::wait() returned an event for T3 > which triggers the assertion failure in the PR. With this patch > what happens now is: > > - fbsd_nat::wait() realizes T3 isn't "resumed" from the core's > perspective so explicitly suspends it via PT_SUSPEND > - the P1 process is continued via PT_CONTINUE > - T2 completes its step and reports the event > - wait() returns P1 > - PT_LWPINFO reports the step-complete event for P2 > - the event for T2 is returned from fbsd_nat::wait() to the core Ok, thanks, that makes sense. Simon