public inbox for gdb-patches@sourceware.org
 help / color / mirror / Atom feed
From: Andrew Burgess <aburgess@redhat.com>
To: gdb-patches@sourceware.org
Cc: Simon Marchi <simark@simark.ca>
Subject: Re: [PATCH] gdb: fix auxv cache clearing from new_objfile observer
Date: Fri, 06 Oct 2023 17:43:17 +0100	[thread overview]
Message-ID: <87y1gfsnve.fsf@redhat.com> (raw)
In-Reply-To: <c3a8e6149dd66bbc380bc8e03cbd0110233a5abd.1696349903.git.aburgess@redhat.com>

Andrew Burgess <aburgess@redhat.com> writes:

> It was pointed out on the mailing list that a recently added
> test (gdb.python/py-progspace-events.exp) was failing when run with
> the native-extended-gdbserver board.  This test was added with this
> commit:
>
>   commit 59912fb2d22f8a4bb0862487f12a5cc65b6a013f
>   Date:   Tue Sep 19 11:45:36 2023 +0100
>
>       gdb: add Python events for program space addition and removal
>
> It turns out though that the test is failing due to a existing bug
> in GDB, the new test just exposes the problem.  Additionally, the
> failure really doesn't even rely on the new functionality added in the
> above commit.  I reduced the test to a simple set of steps that
> reproduced the failure and tested against GDB 13, and the test passes;
> so the bug was introduced since then.  In fact, the bug was introduced
> with this commit:
>
>   commit a2827364e2bf19910fa5a54364a594a5ba3033b8
>   Date:   Fri Sep 8 15:48:16 2023 +0100
>
>       gdb: remove final user of the executable_changed observer
>
> This commit changed how the per-inferior auxv data cache is managed,
> specifically, when the cache is cleared, and it is this that leads to
> the failure.
>
> This bug is interesting because it exposes a number of issues with
> GDB, I'll explain all of the problems I see, though ultimately, I only
> propose fixing one problem in this commit, which is enough to resolve
> the crash we are currently seeing.
>
> The crash that we are seeing manifests like this:
>
>   ...
>   [Inferior 2 (process 3970384) exited normally]
>   +inferior 1
>   [Switching to inferior 1 [process 3970383] (/tmp/build/gdb/testsuite/outputs/gdb.python/py-progspace-events/py-progspace-events)]
>   [Switching to thread 1.1 (Thread 3970383.3970383)]
>   #0  breakpt () at /tmp/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.python/py-progspace-events.c:28
>   28	{ /* Nothing.  */ }
>   (gdb) step
>   +step
>   terminate called after throwing an instance of 'gdb_exception_error'
>
>   Fatal signal: Aborted
>   ... etc ...
>
> What's happening is that GDB attempts to refile the auxv cache as a
> result of the gdbarch_has_shared_address_space call in
> program_space::~program_space, the backtrace looks like this:
>
>   #0  0x00007fb4f419a9a5 in raise () from /lib64/libpthread.so.0
>   #1  0x00000000008b635d in handle_fatal_signal (sig=6) at ../../src/gdb/event-top.c:912
>   #2  <signal handler called>
>   #3  0x00007fb4f38e3625 in raise () from /lib64/libc.so.6
>   #4  0x00007fb4f38cc8d9 in abort () from /lib64/libc.so.6
>   #5  0x00007fb4f3c70756 in __gnu_cxx::__verbose_terminate_handler() [clone .cold] () from /lib64/libstdc++.so.6
>   #6  0x00007fb4f3c7c6dc in __cxxabiv1::__terminate(void (*)()) () from /lib64/libstdc++.so.6
>   #7  0x00007fb4f3c7b6e9 in __cxa_call_terminate () from /lib64/libstdc++.so.6
>   #8  0x00007fb4f3c7c094 in __gxx_personality_v0 () from /lib64/libstdc++.so.6
>   #9  0x00007fb4f3a80c63 in _Unwind_RaiseException_Phase2 () from /lib64/libgcc_s.so.1
>   #10 0x00007fb4f3a8154e in _Unwind_Resume () from /lib64/libgcc_s.so.1
>   #11 0x0000000000e8832d in target_read_alloc_1<unsigned char> (ops=0x408a3a0, object=TARGET_OBJECT_AUXV, annex=0x0) at ../../src/gdb/target.c:2266
>   #12 0x0000000000e73dea in target_read_alloc (ops=0x408a3a0, object=TARGET_OBJECT_AUXV, annex=0x0) at ../../src/gdb/target.c:2315
>   #13 0x000000000058248c in target_read_auxv_raw (ops=0x408a3a0) at ../../src/gdb/auxv.c:379
>   #14 0x000000000058243d in target_read_auxv () at ../../src/gdb/auxv.c:368
>   #15 0x000000000058255c in target_auxv_search (match=0x0, valp=0x7ffdee17c598) at ../../src/gdb/auxv.c:415
>   #16 0x0000000000a464bb in linux_is_uclinux () at ../../src/gdb/linux-tdep.c:433
>   #17 0x0000000000a464f6 in linux_has_shared_address_space (gdbarch=0x409a2d0) at ../../src/gdb/linux-tdep.c:440
>   #18 0x0000000000510eae in gdbarch_has_shared_address_space (gdbarch=0x409a2d0) at ../../src/gdb/gdbarch.c:4889
>   #19 0x0000000000bc7558 in program_space::~program_space (this=0x4544aa0, __in_chrg=<optimized out>) at ../../src/gdb/progspace.c:124
>   #20 0x00000000009b245d in delete_inferior (inf=0x47b3de0) at ../../src/gdb/inferior.c:290
>   #21 0x00000000009b2c10 in prune_inferiors () at ../../src/gdb/inferior.c:480
>   #22 0x00000000009c5e3e in fetch_inferior_event () at ../../src/gdb/infrun.c:4558
>   #23 0x000000000099b4dc in inferior_event_handler (event_type=INF_REG_EVENT) at ../../src/gdb/inf-loop.c:42
>   #24 0x0000000000cbc64f in remote_async_serial_handler (scb=0x4090a30, context=0x408a6b0) at ../../src/gdb/remote.c:14859
>   #25 0x0000000000d83d3a in run_async_handler_and_reschedule (scb=0x4090a30) at ../../src/gdb/ser-base.c:138
>   #26 0x0000000000d83e1f in fd_event (error=0, context=0x4090a30) at ../../src/gdb/ser-base.c:189
>
> So this is problem #1, if we throw an exception while deleting a
> program_space then this is not caught, and is going to crash GDB.
>
> Problem #2 becomes evident when we ask why GDB is throwing an error in
> this case; the error is thrown because the remote target, operating in
> non-async mode, can't read the auxv data while an inferior is running
> and GDB is waiting for a stop reply.  The problem here then, is why
> does GDB get into a position where it tries to interact with the
> remote target in this way, at this time?  The problem is caused by the
> prune_inferiors call which can be seen in the above backtrace.
>
> In prune_inferiors we check if the inferior is deletable, and if it
> is, we delete it.  The problem is, I think, we should also check if
> the target is currently in a state that would allow us to delete the
> inferior.  We don't currently have such a check available, we'd need
> to add one, but for the remote target, this would return false if the
> remote is in async mode and the remote is currently waiting for a stop
> reply.  With this change in place GDB would defer deleting the
> inferior until the remote target has stopped, at which point GDB would
> be able to refill the auxv cache successfully.
>
> And then, problem #3 becomes evident when we ask why GDB is needing to
> refill the auxv cache now when it didn't need to for GDB 13.  This is
> where the second commit mentioned above (a2827364e2bf) comes in.
> Prior to this commit, the auxv cache was cleared by the
> executable_changed observer, while after that commit the auxv cache
> was cleared by the new_objfile observer -- but only when the
> new_objfile observer is used in the special mode that actually means
> that all objfiles have been unloaded (I know, the overloading of the
> new_objfile observer is horrible, and unnecessary, but it's not really
> important for this bug).
>
> The difference arises because the new_objfile observer is triggered
> from clear_symtab_users, which in turn is called from
> program_space::~program_space.  The new_objfile observer for auxv does
> this:
>
>   static void
>   auxv_new_objfile_observer (struct objfile *objfile)
>   {
>     if (objfile == nullptr)
>       invalidate_auxv_cache_inf (current_inferior ());
>   }
>
> That is, when all the objfiles are unloaded, we clear the auxv cache
> for the current inferior.
>
> The problem is, then when we look at the prune_inferiors ->
> delete_inferior -> ~program_space path, we see that the current
> inferior is not going to be an inferior that exists within the
> program_space being deleted; delete_inferior removes the deleted
> inferior from the global inferior list, and then only deletes the
> program_space if program_space::empty() returns true, which is only
> the case if the current inferior isn't within the program_space to
> delete, and no other inferior exists within that program_space
> either.
>
> What this means is that when the new_objfile observer is called we
> can't rely on the current inferior having any relationship with the
> program space in which the objfiles were removed.  This was an error
> in the commit a2827364e2bf, the only thing we can rely on is the
> current program space.  As a result of this mistake, after commit
> a2827364e2bf, GDB was sometimes clearing the auxv cache for a random
> inferior.  In the native target case this was harmless as we can
> always refill the cache when needed, but in the remote target case, if
> we need to refill the cache when the remote target is executing, then
> we get the crash we observed.
>
> And additionally, if we think about this a little more, we see that
> commit a2827364e2bf made another mistake.  When all the objfiles are
> removed, they are removed from a program_space, a program_space might
> contain multiple inferiors, so surely, we should clear the auxv cache
> for all of the matching inferiors?
>
> Given these two insights, that the current_inferior is not relevant,
> only the current_program_space, and that we should be clearing the
> cache for all inferiors in the current_program_space, we can update
> auxv_new_objfile_observer to:
>
>   if (objfile == nullptr)
>     {
>       for (inferior *inf : all_inferiors ())
> 	{
> 	  if (inf->pspace == current_program_space)
> 	    invalidate_auxv_cache_inf (inf);
> 	}
>     }
>
> With this change we now correctly clear the auxv cache for the correct
> inferiors, and GDB no longer needs to refill the cache at an
> inconvenient time, this avoids the crash we were seeing.
>
> And finally, we reach problem #4.  Inspired by the observation that
> using the current_inferior from within the ~program_space function was
> not correct, I added some debug to see if current_inferior() was
> called anywhere else (below ~program_space), and the answer is yes,
> it's called a often.  Mostly the culprit is GDB doing:
>
>   current_inferior ()->top_target ()-> ....
>
> But I think all of these calls are most likely doing the wrong thing,
> and only work because the top target in all these cases is shared
> between all inferiors, e.g. it's the native target, or the remote
> target for all inferiors.  But if we had a truly multi-connection
> setup, then we might start to see odd behaviour.
>
> Problem #1 I'm just ignoring for now, I guess at some point we might
> run into this again, and then we'd need to solve this.  But in this
> case I wasn't sure what a "good" solution would look like.  We need
> the auxv data in order to implement the linux_is_uclinux() function.
> If we can't get the auxv data then what should we do, assume yes, or
> assume no?  The right answer would probably be to propagate the error
> back up the stack, but then we reach ~program_space, and throwing
> exceptions from a destructor is problematic, so we'd need to catch and
> deal at this point.  The linux_is_uclinux() call is made from within
> gdbarch_has_shared_address_space(), which is used like:
>
>   if (!gdbarch_has_shared_address_space (target_gdbarch ()))
>     delete this->aspace;
>
> So, we would have to choose; delete the address space or not.  If we
> delete it on error, then we might delete an address space that is
> shared within another program space.  If we don't delete the address
> space, then we might leak it.  Neither choice is great.
>
> A better solution might be to have the address spaces be reference
> counted, then we could remove the gdbarch_has_shared_address_space
> call completely, and just rely on the reference count to auto-delete
> the address space when appropriate.
>
> The solution for problem #2 I already hinted at above, we should have
> a new target_can_delete_inferiors() call, which should be called from
> prune_inferiors, this would prevent GDB from trying to delete
> inferiors when a (remote) target is in a state where we know it can't
> delete the inferior.  Deleting an inferior often (always?) requires
> sending packets to the remote, and if the remote is waiting for a stop
> reply then this will never work, so the pruning should be deferred in
> this case.
>
> The solution for problem #3 is included in this commit.
>
> And, for problem #4, I'm not sure what the right solution is.  Maybe
> delete_inferior should ensure the inferior to be deleted is in place
> when ~program_space is called?  But that seems a little weird, as the
> current inferior would, in theory, still be using the current
> program_space...
>
> Anyway, after this commit, the gdb.python/py-progspace-events.exp test
> now passes when run with the native-extended-remote board.
>
> Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=30935

This commit was merged (1b28c0f488bb).

Thanks,
Andrew


> ---
>  gdb/auxv.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/gdb/auxv.c b/gdb/auxv.c
> index 9f599b04a4f..cfb44ffa0bb 100644
> --- a/gdb/auxv.c
> +++ b/gdb/auxv.c
> @@ -351,7 +351,16 @@ static void
>  auxv_new_objfile_observer (struct objfile *objfile)
>  {
>    if (objfile == nullptr)
> -    invalidate_auxv_cache_inf (current_inferior ());
> +    {
> +      /* When OBJFILE is nullptr, this indicates that all symbol files have
> +	 been unloaded from the current program space.  Discard cached auxv
> +	 information from any inferior within the effected program space.  */
> +      for (inferior *inf : all_inferiors ())
> +	{
> +	  if (inf->pspace == current_program_space)
> +	    invalidate_auxv_cache_inf (inf);
> +	}
> +    }
>  }
>  
>  /* See auxv.h.  */
>
> base-commit: e030ce2c79feee6a35665a6580a23dcf937ea46f
> -- 
> 2.25.4


      parent reply	other threads:[~2023-10-06 16:43 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-10-03 16:20 Andrew Burgess
2023-10-03 17:50 ` Simon Marchi
2023-10-05 17:20   ` Simon Marchi
2023-11-06  8:21   ` Tom de Vries
2023-10-06 16:43 ` Andrew Burgess [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87y1gfsnve.fsf@redhat.com \
    --to=aburgess@redhat.com \
    --cc=gdb-patches@sourceware.org \
    --cc=simark@simark.ca \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).