From: "Aktemur, Tankut Baris" <tankut.baris.aktemur@intel.com>
To: Andrew Burgess <aburgess@redhat.com>,
"gdb-patches@sourceware.org" <gdb-patches@sourceware.org>
Subject: RE: [PATCHv2 2/8] gdb: don't restart vfork parent while waiting for child to finish
Date: Wed, 5 Jul 2023 10:08:51 +0000 [thread overview]
Message-ID: <DM4PR11MB73030084E89DAFC10937E832C42FA@DM4PR11MB7303.namprd11.prod.outlook.com> (raw)
In-Reply-To: <a9b31c5abcb5c63bb329c62be568ca0c3a139692.1688484032.git.aburgess@redhat.com>
On Tuesday, July 4, 2023 5:23 PM, Andrew Burgess wrote:
> While working on a later patch, which changes gdb.base/foll-vfork.exp,
> I noticed that sometimes I would hit this assert:
>
> x86_linux_update_debug_registers: Assertion `lwp_is_stopped (lwp)' failed.
>
> I eventually tracked it down to a combination of schedule-multiple
> mode being on, target-non-stop being off, follow-fork-mode being set
> to child, and some bad timing. The failing case is pretty simple, a
> single threaded application performs a vfork, the child process then
> execs some other application while the parent process (once the vfork
> child has completed its exec) just exits. As best I understand
> things, here's what happens when things go wrong:
>
> 1. The parent process performs a vfork, GDB sees the VFORKED event
> and creates an inferior and thread for the vfork child,
>
> 2. GDB resumes the vfork child process. As schedule-multiple is on
> and target-non-stop is off, this is translated into a request to
> start all processes (see user_visible_resume_ptid),
>
> 3. In the linux-nat layer we spot that one of the threads we are
> about to start is a vfork parent, and so don't start that
> thread (see resume_lwp), the vfork child thread is resumed,
>
> 4. GDB waits for the next event, eventually entering
> linux_nat_target::wait, which in turn calls linux_nat_wait_1,
>
> 5. In linux_nat_wait_1 we eventually call
> resume_stopped_resumed_lwps, this should restart threads that have
> stopped but don't actually have anything interesting to report.
>
> 6. Unfortunately, resume_stopped_resumed_lwps doesn't check for
> vfork parents like resume_lwp does, so at this point the vfork
> parent is resumed. This feels like the start of the bug, and this
> is where I'm proposing to fix things, but, resuming the vfork parent
> isn't the worst thing in the world because....
>
> 7. As the vfork child is still alive the kernel holds the vfork
> parent stopped,
>
> 8. Eventually the child performs its exec and GDB is sent and EXECD
> event. However, because the parent is resumed, as soon as the child
> performs its exec the vfork parent also sends a VFORK_DONE event to
> GDB,
>
> 9. Depending on timing both of these events might seem to arrive in
> GDB at the same time. Normally GDB expects to see the EXECD or
> EXITED/SIGNALED event from the vfork child before getting the
> VFORK_DONE in the parent. We know this because it is as a result of
> the EXECD/EXITED/SIGNALED that GDB detaches from the parent (see
> handle_vfork_child_exec_or_exit for details). Further the comment
> in target/waitstatus.h on TARGET_WAITKIND_VFORK_DONE indicates that
> when we remain attached to the child (not the parent) we should not
> expect to see a VFORK_DONE,
>
> 10. If both events arrive at the same time then GDB will randomly
> choose one event to handle first, in some cases this will be the
> VFORK_DONE. As described above, upon seeing a VFORK_DONE GDB
> expects that (a) the vfork child has finished, however, in this case
> this is not completely true, the child has finished, but GDB has not
> processed the event associated with the completion yet, and (b) upon
> seeing a VFORK_DONE GDB assumes we are remaining attached to the
> parent, and so resumes the parent process,
>
> 11. GDB now handles the EXECD event. In our case we are detaching
> from the parent, so GDB calls target_detach (see
> handle_vfork_child_exec_or_exit),
>
> 12. While this has been going on the vfork parent is executing, and
> might even exit,
>
> 13. In linux_nat_target::detach the first thing we do is stop all
> threads in the process we're detaching from, the result of the stop
> request will be cached on the lwp_info object,
>
> 14. In our case the vfork parent has exited though, so when GDB
> waits for the thread, instead of a stop due to signal, we instead
> get a thread exited status,
>
> 15. Later in the detach process we try to resume the threads just
> prior to making the ptrace call to actually detach (see
> detach_one_lwp), as part of the process to resume a thread we try to
> touch some registers within the thread, and before doing this GDB
> asserts that the thread is stopped,
>
> 16. An exited thread is not classified as stopped, and so the assert
> triggers!
>
> So there's two bugs I see here. The first, and most critical one here
> is in step #6. I think that resume_stopped_resumed_lwps should not
> resume a vfork parent, just like resume_lwp doesn't resume a vfork
> parent.
>
> With this change in place the vfork parent will remain stopped in step
> instead GDB will only see the EXECD/EXITED/SIGNALLED event. The
> problems in #9 and #10 are therefore skipped and we arrive at #11,
> handling the EXECD event. As the parent is still stopped #12 doesn't
> apply, and in #13 when we try to stop the process we will see that it
> is already stopped, there's no risk of the vfork parent exiting before
> we get to this point. And finally, in #15 we are safe to poke the
> process registers because it will not have exited by this point.
>
> However, I did mention two bugs.
>
> The second bug I've not yet managed to actually trigger, but I'm
> convinced it must exist: if we forget vforks for a moment, in step #13
> above, when linux_nat_target::detach is called, we first try to stop
> all threads in the process GDB is detaching from. If we imagine a
> multi-threaded inferior with many threads, and GDB running in non-stop
> mode, then, if the user tries to detach there is a chance that thread
> could exit just as linux_nat_target::detach is entered, in which case
> we should be able to trigger the same assert.
>
> But, like I said, I've not (yet) managed to trigger this second bug,
> and even if I could, the fix would not belong in this commit, so I'm
> pointing this out just for completeness.
>
> There's no test included in this commit. In a couple of commits time
> I will expand gdb.base/foll-vfork.exp which is when this bug would be
> exposed. Unfortunately there are at least two other bugs (separate
> from the ones discussed above) that need fixing first, these will be
> fixed in the next commits before the gdb.base/foll-vfork.exp test is
> expanded.
>
> If you do want to reproduce this failure then you will for certainly
> need to run the gdb.base/foll-vfork.exp test in a loop as the failures
> are all very timing sensitive. I've found that running multiple
> copies in parallel makes the failure more likely to appear, I usually
> run ~6 copies in parallel and expect to see a failure after within
> 10mins.
> ---
> gdb/linux-nat.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
> index 383ef58fa23..7e121b7ab41 100644
> --- a/gdb/linux-nat.c
> +++ b/gdb/linux-nat.c
> @@ -3346,7 +3346,14 @@ linux_nat_wait_1 (ptid_t ptid, struct target_waitstatus
> *ourstatus,
> static int
> resume_stopped_resumed_lwps (struct lwp_info *lp, const ptid_t wait_ptid)
> {
> - if (!lp->stopped)
> + struct inferior *inf = find_inferior_ptid (linux_target, lp->ptid);
Nit: The 'struct' keyword can be omitted.
-Baris
Intel Deutschland GmbH
Registered Address: Am Campeon 10, 85579 Neubiberg, Germany
Tel: +49 89 99 8853-0, www.intel.de <http://www.intel.de>
Managing Directors: Christin Eisenschmid, Sharon Heck, Tiffany Doon Silva
Chairperson of the Supervisory Board: Nicole Lau
Registered Office: Munich
Commercial Register: Amtsgericht Muenchen HRB 186928
next prev parent reply other threads:[~2023-07-05 10:08 UTC|newest]
Thread overview: 32+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-06-22 13:17 [PATCH 0/8] Some vfork related fixes Andrew Burgess
2023-06-22 13:17 ` [PATCH 1/8] gdb: catch more errors in gdb.base/foll-vfork.exp Andrew Burgess
2023-06-22 13:17 ` [PATCH 2/8] gdb: don't restart vfork parent while waiting for child to finish Andrew Burgess
2023-06-22 13:17 ` [PATCH 3/8] gdb: fix an issue with vfork in non-stop mode Andrew Burgess
2023-06-22 13:17 ` [PATCH 4/8] gdb, infrun: refactor part of `proceed` into separate function Andrew Burgess
2023-06-28 8:47 ` Aktemur, Tankut Baris
2023-07-04 15:24 ` Andrew Burgess
2023-06-22 13:17 ` [PATCH 5/8] gdb: don't resume vfork parent while child is still running Andrew Burgess
2023-06-22 13:17 ` [PATCH 6/8] gdb/testsuite: expand gdb.base/foll-vfork.exp Andrew Burgess
2023-06-22 13:17 ` [PATCH 7/8] gdb/testsuite: remove use of sleep from gdb.base/foll-vfork.exp Andrew Burgess
2023-06-22 13:17 ` [PATCH 8/8] gdb: additional debug output in infrun.c and linux-nat.c Andrew Burgess
2023-07-04 15:22 ` [PATCHv2 0/8] Some vfork related fixes Andrew Burgess
2023-07-04 15:22 ` [PATCHv2 1/8] gdb: catch more errors in gdb.base/foll-vfork.exp Andrew Burgess
2023-07-04 15:22 ` [PATCHv2 2/8] gdb: don't restart vfork parent while waiting for child to finish Andrew Burgess
2023-07-05 10:08 ` Aktemur, Tankut Baris [this message]
2023-07-04 15:22 ` [PATCHv2 3/8] gdb: fix an issue with vfork in non-stop mode Andrew Burgess
2023-07-04 15:22 ` [PATCHv2 4/8] gdb, infrun: refactor part of `proceed` into separate function Andrew Burgess
2023-07-04 15:22 ` [PATCHv2 5/8] gdb: don't resume vfork parent while child is still running Andrew Burgess
2023-07-18 20:42 ` Simon Marchi
2023-07-21 9:47 ` Andrew Burgess
2023-07-23 8:53 ` Andrew Burgess
2023-08-16 14:02 ` Andrew Burgess
2023-08-17 6:36 ` Tom de Vries
2023-08-17 7:01 ` Tom de Vries
2023-08-17 8:06 ` Tom de Vries
2023-08-17 8:22 ` Tom de Vries
2023-07-04 15:22 ` [PATCHv2 6/8] gdb/testsuite: expand gdb.base/foll-vfork.exp Andrew Burgess
2023-07-05 11:22 ` Aktemur, Tankut Baris
2023-07-04 15:22 ` [PATCHv2 7/8] gdb/testsuite: remove use of sleep from gdb.base/foll-vfork.exp Andrew Burgess
2023-07-04 15:22 ` [PATCHv2 8/8] gdb: additional debug output in infrun.c and linux-nat.c Andrew Burgess
2023-07-05 11:30 ` [PATCHv2 0/8] Some vfork related fixes Aktemur, Tankut Baris
2023-07-17 8:53 ` Andrew Burgess
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=DM4PR11MB73030084E89DAFC10937E832C42FA@DM4PR11MB7303.namprd11.prod.outlook.com \
--to=tankut.baris.aktemur@intel.com \
--cc=aburgess@redhat.com \
--cc=gdb-patches@sourceware.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).