From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <3e1e1db0-13d9-dd32-b4bb-051149ae6e76@simark.ca>
Date: Tue, 18 Jul 2023 16:42:36 -0400
Subject: Re: [PATCHv2 5/8] gdb: don't resume vfork parent while child is still running
To: Andrew Burgess, gdb-patches@sourceware.org
Cc: tankut.baris.aktemur@intel.com
References: <9b3303bed5b6afcbe2e11e65d9696e3b59a61826.1688484032.git.aburgess@redhat.com>
From: Simon Marchi
In-Reply-To: <9b3303bed5b6afcbe2e11e65d9696e3b59a61826.1688484032.git.aburgess@redhat.com>

On 2023-07-04 11:22, Andrew Burgess via Gdb-patches wrote:
> Like the last few commits, this fixes yet another vfork related issue.
> Like the commit titled:
>
>   gdb: don't restart vfork parent while waiting for child to finish
>
> which addressed a case in linux-nat where we would try to resume a
> vfork parent, this commit addresses a very similar case, but this time
> occurring in infrun.c.  Just like with that previous commit, this bug
> results in the assert:
>
>   x86-linux-dregs.c:146: internal-error: x86_linux_update_debug_registers: Assertion `lwp_is_stopped (lwp)' failed.
>
> In this case the issue occurs when target-non-stop is on, but non-stop
> is off, and again, schedule-multiple is on.  As with the previous
> commit, GDB is in follow-fork-mode child.  If you have not done so, it
> is worth reading the earlier commit as many of the problems leading to
> the failure are the same, they just appear in a different part of GDB.
>
> Here are the steps leading to the assertion failure:
>
>   1. The user performs a 'next' over a vfork, GDB stops in the vfork
>   child,
>
>   2. As we are planning to follow the child GDB sets the vfork_parent
>   and vfork_child member variables in the two inferiors, the
>   thread_waiting_for_vfork_done member is left as nullptr, that member
>   is only used when GDB is planning to follow the parent inferior,
>
>   3. The user does 'continue', our expectation is that the vfork child
>   should resume, and once that process has exited or execd, GDB should
>   detach from the vfork parent.  As a result of the 'continue' GDB
>   eventually enters the proceed function,
>
>   4. In proceed we select a ptid_t to resume, because
>   schedule-multiple is on we select minus_one_ptid (see
>   user_visible_resume_ptid),
>
>   5. As GDB is running in all-stop on top of non-stop mode, in the
>   proceed function we iterate over all threads that match the resume
>   ptid, which turns out to be all threads, and call
>   proceed_resume_thread_checked.  One of the threads we iterate over
>   is the vfork parent thread,
>
>   6. As the thread passed to proceed_resume_thread_checked doesn't
>   match any of the early return conditions, GDB will set the thread
>   resumed,
>
>   7. As we are resuming one thread at a time, this thread is seen by
>   the lower layers (e.g. linux-nat) as the "event thread", which means
>   we don't apply any of the checks, e.g. is this thread a
>   vfork parent, instead we assume that GDB core knows what it's doing,
>   and linux-nat will resume the thread, we have now incorrectly set
>   running the vfork parent thread when this thread should be waiting
>   for the vfork child to complete,
>
>   8. Back in the proceed function GDB continues to iterate over all
>   threads, and now (correctly) resumes the vfork child thread,
>
>   9. As the vfork child is still alive the kernel holds the vfork
>   parent stopped,
>
>   10. Eventually the child performs its exec and GDB is sent an EXECD
>   event.  However, because the parent is resumed, as soon as the child
>   performs its exec the vfork parent also sends a VFORK_DONE event to
>   GDB,
>
>   11. Depending on timing both of these events might seem to arrive in
>   GDB at the same time.  Normally GDB expects to see the EXECD or
>   EXITED/SIGNALED event from the vfork child before getting the
>   VFORK_DONE in the parent.  We know this because it is as a result of
>   the EXECD/EXITED/SIGNALED that GDB detaches from the parent (see
>   handle_vfork_child_exec_or_exit for details).  Further the comment
>   in target/waitstatus.h on TARGET_WAITKIND_VFORK_DONE indicates that
>   when we remain attached to the child (not the parent) we should not
>   expect to see a VFORK_DONE,
>
>   12. If both events arrive at the same time then GDB will randomly
>   choose one event to handle first, in some cases this will be the
>   VFORK_DONE.  As described above, upon seeing a VFORK_DONE GDB
>   expects that (a) the vfork child has finished, however, in this case
>   this is not completely true, the child has finished, but GDB has not
>   processed the event associated with the completion yet, and (b) upon
>   seeing a VFORK_DONE GDB assumes we are remaining attached to the
>   parent, and so resumes the parent process,
>
>   13. GDB now handles the EXECD event.  In our case we are detaching
>   from the parent, so GDB calls target_detach (see
>   handle_vfork_child_exec_or_exit),
>
>   14. While this has been going on the vfork parent is executing, and
>   might even exit,
>
>   15. In linux_nat_target::detach the first thing we do is stop all
>   threads in the process we're detaching from, the result of the stop
>   request will be cached on the lwp_info object,
>
>   16. In our case the vfork parent has exited though, so when GDB
>   waits for the thread, instead of a stop due to signal, we instead
>   get a thread exited status,
>
>   17. Later in the detach process we try to resume the threads just
>   prior to making the ptrace call to actually detach (see
>   detach_one_lwp), as part of the process to resume a thread we try to
>   touch some registers within the thread, and before doing this GDB
>   asserts that the thread is stopped,
>
>   18. An exited thread is not classified as stopped, and so the assert
>   triggers!
>
> Just like with the earlier commit, the fix is to spot the vfork parent
> status of the thread, and not resume such threads.  Where the earlier
> commit fixed this in linux-nat, in this case I think the fix should
> live in infrun.c, in proceed_resume_thread_checked.  This function
> already has a similar check to not resume the vfork parent in the case
> where we are planning to follow the vfork parent, I propose adding a
> similar case that checks for the vfork parent when we plan to follow
> the vfork child.
>
> This new check will mean that at step #6 above GDB doesn't try to
> resume the vfork parent thread, which prevents the VFORK_DONE from
> ever arriving.  Once GDB sees the EXECD/EXITED/SIGNALED event from
> the vfork child GDB will detach from the parent.
>
> There's no test included in this commit.  In a subsequent commit I
> will expand gdb.base/foll-vfork.exp which is when this bug would be
> exposed.
>
> If you do want to reproduce this failure then you will certainly
> need to run the gdb.base/foll-vfork.exp test in a loop as the failures
> are all very timing sensitive.  I've found that running multiple
> copies in parallel makes the failure more likely to appear, I usually
> run ~6 copies in parallel and expect to see a failure within
> 10 minutes.

Hi Andrew,

Since this commit, I see this on native-gdbserver and
native-extended-gdbserver:

    FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: continue to end of inferior 2 (timeout)
    FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: inferior 1 (timeout)
    FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: print unblock_parent = 1 (timeout)
    FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: continue to break_parent (timeout)

I haven't had the time to read this vfork series, but I look forward to
doing so, since I also did some vfork fixes not too long ago.

Simon
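
For readers skimming the quoted steps, here is a toy Python model of the
resume loop described in the commit message.  It is only a sketch: the
Inferior and Thread classes, proceed_all, and the skip_vfork_parent flag
are hypothetical names invented for illustration, not GDB's data
structures or the actual patch.  It assumes the proposed check amounts to
"do not resume a thread whose inferior still has a live vfork child".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Inferior:
    """Toy stand-in for an inferior (hypothetical, not GDB's type)."""
    num: int
    # Set on the parent while its vfork child has not yet exited or execd.
    vfork_child: Optional["Inferior"] = None


@dataclass
class Thread:
    """Toy stand-in for a thread (hypothetical, not GDB's type)."""
    name: str
    inf: Inferior
    resumed: bool = False


def is_pending_vfork_parent(thread: Thread) -> bool:
    """True if the thread's inferior still has a live vfork child."""
    return thread.inf.vfork_child is not None


def proceed_all(threads: List[Thread], skip_vfork_parent: bool) -> List[str]:
    """Resume every thread, as when the resume ptid matches all threads.

    With skip_vfork_parent=False this mimics the buggy behaviour from the
    quoted steps: the vfork parent is set running.  With True it mimics
    the assumed shape of the proposed check.
    """
    resumed = []
    for t in threads:
        if skip_vfork_parent and is_pending_vfork_parent(t):
            continue  # leave the vfork parent stopped
        t.resumed = True
        resumed.append(t.name)
    return resumed


def make_threads() -> List[Thread]:
    """Build one parent thread and one vfork child thread."""
    child = Inferior(num=2)
    parent = Inferior(num=1, vfork_child=child)
    return [Thread("parent", parent), Thread("child", child)]


without_check = proceed_all(make_threads(), skip_vfork_parent=False)
with_check = proceed_all(make_threads(), skip_vfork_parent=True)
print(without_check)
print(with_check)
```

Without the check both threads are resumed; with it only the child is,
so in this model the parent can never run ahead of the child's
exec/exit event.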