From: Andrew Burgess
To: Simon Marchi, gdb-patches@sourceware.org
Cc: tankut.baris.aktemur@intel.com
Subject: Re: [PATCHv2 5/8] gdb: don't resume vfork parent while child is still running
In-Reply-To: <3e1e1db0-13d9-dd32-b4bb-051149ae6e76@simark.ca>
References: <9b3303bed5b6afcbe2e11e65d9696e3b59a61826.1688484032.git.aburgess@redhat.com> <3e1e1db0-13d9-dd32-b4bb-051149ae6e76@simark.ca>
Date: Fri, 21 Jul 2023 10:47:04 +0100
Message-ID: <87y1j9fvfr.fsf@redhat.com>

Simon Marchi writes:

> On 2023-07-04 11:22, Andrew Burgess via Gdb-patches
> wrote:
>> Like the last few commits, this fixes yet another vfork related
>> issue.  Like the commit titled:
>>
>>   gdb: don't restart vfork parent while waiting for child to finish
>>
>> which addressed a case in linux-nat where we would try to resume a
>> vfork parent, this commit addresses a very similar case, but this
>> time occurring in infrun.c.  Just like with that previous commit,
>> this bug results in the assert:
>>
>>   x86-linux-dregs.c:146: internal-error: x86_linux_update_debug_registers: Assertion `lwp_is_stopped (lwp)' failed.
>>
>> In this case the issue occurs when target-non-stop is on, but
>> non-stop is off, and again, schedule-multiple is on.  As with the
>> previous commit, GDB is in follow-fork-mode child.  If you have not
>> done so, it is worth reading the earlier commit, as many of the
>> problems leading to the failure are the same; they just appear in a
>> different part of GDB.
>>
>> Here are the steps leading to the assertion failure:
>>
>> 1. The user performs a 'next' over a vfork, GDB stops in the vfork
>> child,
>>
>> 2. As we are planning to follow the child, GDB sets the
>> vfork_parent and vfork_child member variables in the two inferiors.
>> The thread_waiting_for_vfork_done member is left as nullptr; that
>> member is only used when GDB is planning to follow the parent
>> inferior,
>>
>> 3. The user does 'continue'.  Our expectation is that the vfork
>> child should resume, and once that process has exited or execd, GDB
>> should detach from the vfork parent.  As a result of the 'continue'
>> GDB eventually enters the proceed function,
>>
>> 4. In proceed we select a ptid_t to resume; because
>> schedule-multiple is on we select minus_one_ptid (see
>> user_visible_resume_ptid),
>>
>> 5. As GDB is running in all-stop on top of non-stop mode, in the
>> proceed function we iterate over all threads that match the resume
>> ptid, which turns out to be all threads, and call
>> proceed_resume_thread_checked.
>> One of the threads we iterate over is the vfork parent thread,
>>
>> 6. As the thread passed to proceed_resume_thread_checked doesn't
>> match any of the early return conditions, GDB will set the thread
>> resumed,
>>
>> 7. As we are resuming one thread at a time, this thread is seen by
>> the lower layers (e.g. linux-nat) as the "event thread", which
>> means we don't apply any of the checks (e.g. is this thread a vfork
>> parent); instead we assume that GDB core knows what it's doing, and
>> linux-nat will resume the thread.  We have now incorrectly set the
>> vfork parent thread running when this thread should be waiting for
>> the vfork child to complete,
>>
>> 8. Back in the proceed function GDB continues to iterate over all
>> threads, and now (correctly) resumes the vfork child thread.  As
>> the vfork child is still alive the kernel holds the vfork parent
>> stopped,
>>
>> 9. Eventually the child performs its exec and GDB is sent an EXECD
>> event.  However, because the parent is resumed, as soon as the
>> child performs its exec the vfork parent also sends a VFORK_DONE
>> event to GDB,
>>
>> 10. Depending on timing, both of these events might arrive in GDB
>> at the same time.  Normally GDB expects to see the EXECD or
>> EXITED/SIGNALLED event from the vfork child before getting the
>> VFORK_DONE in the parent.  We know this because it is as a result
>> of the EXECD/EXITED/SIGNALLED event that GDB detaches from the
>> parent (see handle_vfork_child_exec_or_exit for details).  Further,
>> the comment in target/waitstatus.h on TARGET_WAITKIND_VFORK_DONE
>> indicates that when we remain attached to the child (not the
>> parent) we should not expect to see a VFORK_DONE,
>>
>> 11. If both events arrive at the same time then GDB will randomly
>> choose one event to handle first; in some cases this will be the
>> VFORK_DONE.
>> As described above, upon seeing a VFORK_DONE GDB assumes
>> that (a) the vfork child has finished (in this case not completely
>> true: the child has finished, but GDB has not yet processed the
>> event associated with the completion), and (b) we are remaining
>> attached to the parent, and so GDB resumes the parent process,
>>
>> 12. GDB now handles the EXECD event.  In our case we are detaching
>> from the parent, so GDB calls target_detach (see
>> handle_vfork_child_exec_or_exit),
>>
>> 13. While this has been going on the vfork parent is executing, and
>> might even exit,
>>
>> 14. In linux_nat_target::detach the first thing we do is stop all
>> threads in the process we're detaching from; the result of the stop
>> request will be cached on the lwp_info object,
>>
>> 15. In our case the vfork parent has exited though, so when GDB
>> waits for the thread, instead of a stop due to signal, we instead
>> get a thread exited status,
>>
>> 16. Later in the detach process we try to resume the threads just
>> prior to making the ptrace call to actually detach (see
>> detach_one_lwp).  As part of the process to resume a thread we try
>> to touch some registers within the thread, and before doing this
>> GDB asserts that the thread is stopped,
>>
>> 17. An exited thread is not classified as stopped, and so the
>> assert triggers!
>>
>> Just like with the earlier commit, the fix is to spot the vfork
>> parent status of the thread, and not resume such threads.  Where
>> the earlier commit fixed this in linux-nat, in this case I think
>> the fix should live in infrun.c, in proceed_resume_thread_checked.
>> This function already has a similar check to not resume the vfork
>> parent in the case where we are planning to follow the vfork
>> parent; I propose adding a similar case that checks for the vfork
>> parent when we plan to follow the vfork child.
>>
>> This new check will mean that at step #6 above GDB doesn't try to
>> resume the vfork parent thread, which prevents the VFORK_DONE from
>> ever arriving.  Once GDB sees the EXECD/EXITED/SIGNALLED event from
>> the vfork child, GDB will detach from the parent.
>>
>> There's no test included in this commit.  In a subsequent commit I
>> will expand gdb.base/foll-vfork.exp, which is where this bug would
>> be exposed.
>>
>> If you do want to reproduce this failure then you will almost
>> certainly need to run the gdb.base/foll-vfork.exp test in a loop,
>> as the failures are all very timing sensitive.  I've found that
>> running multiple copies in parallel makes the failure more likely
>> to appear; I usually run ~6 copies in parallel and expect to see a
>> failure within 10 minutes.
>
> Hi Andrew,
>
> Since this commit, I see this on native-gdbserver and
> native-extended-gdbserver:
>
> FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: continue to end of inferior 2 (timeout)
> FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: inferior 1 (timeout)
> FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: print unblock_parent = 1 (timeout)
> FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: continue to break_parent (timeout)
>
> I haven't had the time to read this vfork series, but I look forward
> to it, since I also did some vfork fixes not too long ago.

Thanks, I'll take a look.

Andrew