public inbox for gdb-patches@sourceware.org
 help / color / mirror / Atom feed
From: "Aktemur, Tankut Baris" <tankut.baris.aktemur@intel.com>
To: Andrew Burgess <aburgess@redhat.com>,
	"gdb-patches@sourceware.org" <gdb-patches@sourceware.org>
Subject: RE: [PATCHv2 2/8] gdb: don't restart vfork parent while waiting for child to finish
Date: Wed, 5 Jul 2023 10:08:51 +0000	[thread overview]
Message-ID: <DM4PR11MB73030084E89DAFC10937E832C42FA@DM4PR11MB7303.namprd11.prod.outlook.com> (raw)
In-Reply-To: <a9b31c5abcb5c63bb329c62be568ca0c3a139692.1688484032.git.aburgess@redhat.com>

On Tuesday, July 4, 2023 5:23 PM, Andrew Burgess wrote:
> While working on a later patch, which changes gdb.base/foll-vfork.exp,
> I noticed that sometimes I would hit this assert:
> 
>   x86_linux_update_debug_registers: Assertion `lwp_is_stopped (lwp)' failed.
> 
> I eventually tracked it down to a combination of schedule-multiple
> mode being on, target-non-stop being off, follow-fork-mode being set
> to child, and some bad timing.  The failing case is pretty simple, a
> single threaded application performs a vfork, the child process then
> execs some other application while the parent process (once the vfork
> child has completed its exec) just exits.  As best I understand
> things, here's what happens when things go wrong:
> 
>   1. The parent process performs a vfork, GDB sees the VFORKED event
>   and creates an inferior and thread for the vfork child,
> 
>   2. GDB resumes the vfork child process.  As schedule-multiple is on
>   and target-non-stop is off, this is translated into a request to
>   start all processes (see user_visible_resume_ptid),
> 
>   3. In the linux-nat layer we spot that one of the threads we are
>   about to start is a vfork parent, and so don't start that
>   thread (see resume_lwp), the vfork child thread is resumed,
> 
>   4. GDB waits for the next event, eventually entering
>   linux_nat_target::wait, which in turn calls linux_nat_wait_1,
> 
>   5. In linux_nat_wait_1 we eventually call
>   resume_stopped_resumed_lwps, this should restart threads that have
>   stopped but don't actually have anything interesting to report.
> 
>   6. Unfortunately, resume_stopped_resumed_lwps doesn't check for
>   vfork parents like resume_lwp does, so at this point the vfork
>   parent is resumed.  This feels like the start of the bug, and this
>   is where I'm proposing to fix things, but, resuming the vfork parent
>   isn't the worst thing in the world because....
> 
>   7. As the vfork child is still alive the kernel holds the vfork
>   parent stopped,
> 
>   8. Eventually the child performs its exec and GDB is sent and EXECD
>   event.  However, because the parent is resumed, as soon as the child
>   performs its exec the vfork parent also sends a VFORK_DONE event to
>   GDB,
> 
>   9. Depending on timing both of these events might seem to arrive in
>   GDB at the same time.  Normally GDB expects to see the EXECD or
>   EXITED/SIGNALED event from the vfork child before getting the
>   VFORK_DONE in the parent.  We know this because it is as a result of
>   the EXECD/EXITED/SIGNALED that GDB detaches from the parent (see
>   handle_vfork_child_exec_or_exit for details).  Further the comment
>   in target/waitstatus.h on TARGET_WAITKIND_VFORK_DONE indicates that
>   when we remain attached to the child (not the parent) we should not
>   expect to see a VFORK_DONE,
> 
>   10. If both events arrive at the same time then GDB will randomly
>   choose one event to handle first, in some cases this will be the
>   VFORK_DONE.  As described above, upon seeing a VFORK_DONE GDB
>   expects that (a) the vfork child has finished, however, in this case
>   this is not completely true, the child has finished, but GDB has not
>   processed the event associated with the completion yet, and (b) upon
>   seeing a VFORK_DONE GDB assumes we are remaining attached to the
>   parent, and so resumes the parent process,
> 
>   11. GDB now handles the EXECD event.  In our case we are detaching
>   from the parent, so GDB calls target_detach (see
>   handle_vfork_child_exec_or_exit),
> 
>   12. While this has been going on the vfork parent is executing, and
>   might even exit,
> 
>   13. In linux_nat_target::detach the first thing we do is stop all
>   threads in the process we're detaching from, the result of the stop
>   request will be cached on the lwp_info object,
> 
>   14. In our case the vfork parent has exited though, so when GDB
>   waits for the thread, instead of a stop due to signal, we instead
>   get a thread exited status,
> 
>   15. Later in the detach process we try to resume the threads just
>   prior to making the ptrace call to actually detach (see
>   detach_one_lwp), as part of the process to resume a thread we try to
>   touch some registers within the thread, and before doing this GDB
>   asserts that the thread is stopped,
> 
>   16. An exited thread is not classified as stopped, and so the assert
>   triggers!
> 
> So there's two bugs I see here.  The first, and most critical one here
> is in step #6.  I think that resume_stopped_resumed_lwps should not
> resume a vfork parent, just like resume_lwp doesn't resume a vfork
> parent.
> 
> With this change in place the vfork parent will remain stopped in step
> instead GDB will only see the EXECD/EXITED/SIGNALLED event.  The
> problems in #9 and #10 are therefore skipped and we arrive at #11,
> handling the EXECD event.  As the parent is still stopped #12 doesn't
> apply, and in #13 when we try to stop the process we will see that it
> is already stopped, there's no risk of the vfork parent exiting before
> we get to this point.  And finally, in #15 we are safe to poke the
> process registers because it will not have exited by this point.
> 
> However, I did mention two bugs.
> 
> The second bug I've not yet managed to actually trigger, but I'm
> convinced it must exist: if we forget vforks for a moment, in step #13
> above, when linux_nat_target::detach is called, we first try to stop
> all threads in the process GDB is detaching from.  If we imagine a
> multi-threaded inferior with many threads, and GDB running in non-stop
> mode, then, if the user tries to detach there is a chance that thread
> could exit just as linux_nat_target::detach is entered, in which case
> we should be able to trigger the same assert.
> 
> But, like I said, I've not (yet) managed to trigger this second bug,
> and even if I could, the fix would not belong in this commit, so I'm
> pointing this out just for completeness.
> 
> There's no test included in this commit.  In a couple of commits time
> I will expand gdb.base/foll-vfork.exp which is when this bug would be
> exposed.  Unfortunately there are at least two other bugs (separate
> from the ones discussed above) that need fixing first, these will be
> fixed in the next commits before the gdb.base/foll-vfork.exp test is
> expanded.
> 
> If you do want to reproduce this failure then you will for certainly
> need to run the gdb.base/foll-vfork.exp test in a loop as the failures
> are all very timing sensitive.  I've found that running multiple
> copies in parallel makes the failure more likely to appear, I usually
> run ~6 copies in parallel and expect to see a failure after within
> 10mins.
> ---
>  gdb/linux-nat.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
> index 383ef58fa23..7e121b7ab41 100644
> --- a/gdb/linux-nat.c
> +++ b/gdb/linux-nat.c
> @@ -3346,7 +3346,14 @@ linux_nat_wait_1 (ptid_t ptid, struct target_waitstatus
> *ourstatus,
>  static int
>  resume_stopped_resumed_lwps (struct lwp_info *lp, const ptid_t wait_ptid)
>  {
> -  if (!lp->stopped)
> +  struct inferior *inf = find_inferior_ptid (linux_target, lp->ptid);

Nit: The 'struct' keyword can be omitted.

-Baris


Intel Deutschland GmbH
Registered Address: Am Campeon 10, 85579 Neubiberg, Germany
Tel: +49 89 99 8853-0, www.intel.de <http://www.intel.de>
Managing Directors: Christin Eisenschmid, Sharon Heck, Tiffany Doon Silva  
Chairperson of the Supervisory Board: Nicole Lau
Registered Office: Munich
Commercial Register: Amtsgericht Muenchen HRB 186928


  reply	other threads:[~2023-07-05 10:08 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-06-22 13:17 [PATCH 0/8] Some vfork related fixes Andrew Burgess
2023-06-22 13:17 ` [PATCH 1/8] gdb: catch more errors in gdb.base/foll-vfork.exp Andrew Burgess
2023-06-22 13:17 ` [PATCH 2/8] gdb: don't restart vfork parent while waiting for child to finish Andrew Burgess
2023-06-22 13:17 ` [PATCH 3/8] gdb: fix an issue with vfork in non-stop mode Andrew Burgess
2023-06-22 13:17 ` [PATCH 4/8] gdb, infrun: refactor part of `proceed` into separate function Andrew Burgess
2023-06-28  8:47   ` Aktemur, Tankut Baris
2023-07-04 15:24     ` Andrew Burgess
2023-06-22 13:17 ` [PATCH 5/8] gdb: don't resume vfork parent while child is still running Andrew Burgess
2023-06-22 13:17 ` [PATCH 6/8] gdb/testsuite: expand gdb.base/foll-vfork.exp Andrew Burgess
2023-06-22 13:17 ` [PATCH 7/8] gdb/testsuite: remove use of sleep from gdb.base/foll-vfork.exp Andrew Burgess
2023-06-22 13:17 ` [PATCH 8/8] gdb: additional debug output in infrun.c and linux-nat.c Andrew Burgess
2023-07-04 15:22 ` [PATCHv2 0/8] Some vfork related fixes Andrew Burgess
2023-07-04 15:22   ` [PATCHv2 1/8] gdb: catch more errors in gdb.base/foll-vfork.exp Andrew Burgess
2023-07-04 15:22   ` [PATCHv2 2/8] gdb: don't restart vfork parent while waiting for child to finish Andrew Burgess
2023-07-05 10:08     ` Aktemur, Tankut Baris [this message]
2023-07-04 15:22   ` [PATCHv2 3/8] gdb: fix an issue with vfork in non-stop mode Andrew Burgess
2023-07-04 15:22   ` [PATCHv2 4/8] gdb, infrun: refactor part of `proceed` into separate function Andrew Burgess
2023-07-04 15:22   ` [PATCHv2 5/8] gdb: don't resume vfork parent while child is still running Andrew Burgess
2023-07-18 20:42     ` Simon Marchi
2023-07-21  9:47       ` Andrew Burgess
2023-07-23  8:53       ` Andrew Burgess
2023-08-16 14:02         ` Andrew Burgess
2023-08-17  6:36           ` Tom de Vries
2023-08-17  7:01             ` Tom de Vries
2023-08-17  8:06               ` Tom de Vries
2023-08-17  8:22                 ` Tom de Vries
2023-07-04 15:22   ` [PATCHv2 6/8] gdb/testsuite: expand gdb.base/foll-vfork.exp Andrew Burgess
2023-07-05 11:22     ` Aktemur, Tankut Baris
2023-07-04 15:22   ` [PATCHv2 7/8] gdb/testsuite: remove use of sleep from gdb.base/foll-vfork.exp Andrew Burgess
2023-07-04 15:22   ` [PATCHv2 8/8] gdb: additional debug output in infrun.c and linux-nat.c Andrew Burgess
2023-07-05 11:30   ` [PATCHv2 0/8] Some vfork related fixes Aktemur, Tankut Baris
2023-07-17  8:53     ` Andrew Burgess

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=DM4PR11MB73030084E89DAFC10937E832C42FA@DM4PR11MB7303.namprd11.prod.outlook.com \
    --to=tankut.baris.aktemur@intel.com \
    --cc=aburgess@redhat.com \
    --cc=gdb-patches@sourceware.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).