[PATCH] gdb: fix auxv cache clearing from new

public inbox for gdb-patches@sourceware.org
 help / color / mirror / Atom feed

* [PATCH] gdb: fix auxv cache clearing from new_objfile observer
@ 2023-10-03 16:20 Andrew Burgess
  2023-10-03 17:50 ` Simon Marchi
  2023-10-06 16:43 ` Andrew Burgess
  0 siblings, 2 replies; 5+ messages in thread
From: Andrew Burgess @ 2023-10-03 16:20 UTC (permalink / raw)
  To: gdb-patches; +Cc: Simon Marchi, Andrew Burgess

It was pointed out on the mailing list that a recently added
test (gdb.python/py-progspace-events.exp) was failing when run with
the native-extended-gdbserver board.  This test was added with this
commit:

  commit 59912fb2d22f8a4bb0862487f12a5cc65b6a013f
  Date:   Tue Sep 19 11:45:36 2023 +0100

      gdb: add Python events for program space addition and removal

It turns out though that the test is failing due to a existing bug
in GDB, the new test just exposes the problem.  Additionally, the
failure really doesn't even rely on the new functionality added in the
above commit.  I reduced the test to a simple set of steps that
reproduced the failure and tested against GDB 13, and the test passes;
so the bug was introduced since then.  In fact, the bug was introduced
with this commit:

  commit a2827364e2bf19910fa5a54364a594a5ba3033b8
  Date:   Fri Sep 8 15:48:16 2023 +0100

      gdb: remove final user of the executable_changed observer

This commit changed how the per-inferior auxv data cache is managed,
specifically, when the cache is cleared, and it is this that leads to
the failure.

This bug is interesting because it exposes a number of issues with
GDB, I'll explain all of the problems I see, though ultimately, I only
propose fixing one problem in this commit, which is enough to resolve
the crash we are currently seeing.

The crash that we are seeing manifests like this:

  ...
  [Inferior 2 (process 3970384) exited normally]
  +inferior 1
  [Switching to inferior 1 [process 3970383] (/tmp/build/gdb/testsuite/outputs/gdb.python/py-progspace-events/py-progspace-events)]
  [Switching to thread 1.1 (Thread 3970383.3970383)]
  #0  breakpt () at /tmp/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.python/py-progspace-events.c:28
  28	{ /* Nothing.  */ }
  (gdb) step
  +step
  terminate called after throwing an instance of 'gdb_exception_error'

  Fatal signal: Aborted
  ... etc ...

What's happening is that GDB attempts to refile the auxv cache as a
result of the gdbarch_has_shared_address_space call in
program_space::~program_space, the backtrace looks like this:

  #0  0x00007fb4f419a9a5 in raise () from /lib64/libpthread.so.0
  #1  0x00000000008b635d in handle_fatal_signal (sig=6) at ../../src/gdb/event-top.c:912
  #2  <signal handler called>
  #3  0x00007fb4f38e3625 in raise () from /lib64/libc.so.6
  #4  0x00007fb4f38cc8d9 in abort () from /lib64/libc.so.6
  #5  0x00007fb4f3c70756 in __gnu_cxx::__verbose_terminate_handler() [clone .cold] () from /lib64/libstdc++.so.6
  #6  0x00007fb4f3c7c6dc in __cxxabiv1::__terminate(void (*)()) () from /lib64/libstdc++.so.6
  #7  0x00007fb4f3c7b6e9 in __cxa_call_terminate () from /lib64/libstdc++.so.6
  #8  0x00007fb4f3c7c094 in __gxx_personality_v0 () from /lib64/libstdc++.so.6
  #9  0x00007fb4f3a80c63 in _Unwind_RaiseException_Phase2 () from /lib64/libgcc_s.so.1
  #10 0x00007fb4f3a8154e in _Unwind_Resume () from /lib64/libgcc_s.so.1
  #11 0x0000000000e8832d in target_read_alloc_1<unsigned char> (ops=0x408a3a0, object=TARGET_OBJECT_AUXV, annex=0x0) at ../../src/gdb/target.c:2266
  #12 0x0000000000e73dea in target_read_alloc (ops=0x408a3a0, object=TARGET_OBJECT_AUXV, annex=0x0) at ../../src/gdb/target.c:2315
  #13 0x000000000058248c in target_read_auxv_raw (ops=0x408a3a0) at ../../src/gdb/auxv.c:379
  #14 0x000000000058243d in target_read_auxv () at ../../src/gdb/auxv.c:368
  #15 0x000000000058255c in target_auxv_search (match=0x0, valp=0x7ffdee17c598) at ../../src/gdb/auxv.c:415
  #16 0x0000000000a464bb in linux_is_uclinux () at ../../src/gdb/linux-tdep.c:433
  #17 0x0000000000a464f6 in linux_has_shared_address_space (gdbarch=0x409a2d0) at ../../src/gdb/linux-tdep.c:440
  #18 0x0000000000510eae in gdbarch_has_shared_address_space (gdbarch=0x409a2d0) at ../../src/gdb/gdbarch.c:4889
  #19 0x0000000000bc7558 in program_space::~program_space (this=0x4544aa0, __in_chrg=<optimized out>) at ../../src/gdb/progspace.c:124
  #20 0x00000000009b245d in delete_inferior (inf=0x47b3de0) at ../../src/gdb/inferior.c:290
  #21 0x00000000009b2c10 in prune_inferiors () at ../../src/gdb/inferior.c:480
  #22 0x00000000009c5e3e in fetch_inferior_event () at ../../src/gdb/infrun.c:4558
  #23 0x000000000099b4dc in inferior_event_handler (event_type=INF_REG_EVENT) at ../../src/gdb/inf-loop.c:42
  #24 0x0000000000cbc64f in remote_async_serial_handler (scb=0x4090a30, context=0x408a6b0) at ../../src/gdb/remote.c:14859
  #25 0x0000000000d83d3a in run_async_handler_and_reschedule (scb=0x4090a30) at ../../src/gdb/ser-base.c:138
  #26 0x0000000000d83e1f in fd_event (error=0, context=0x4090a30) at ../../src/gdb/ser-base.c:189

So this is problem #1, if we throw an exception while deleting a
program_space then this is not caught, and is going to crash GDB.

Problem #2 becomes evident when we ask why GDB is throwing an error in
this case; the error is thrown because the remote target, operating in
non-async mode, can't read the auxv data while an inferior is running
and GDB is waiting for a stop reply.  The problem here then, is why
does GDB get into a position where it tries to interact with the
remote target in this way, at this time?  The problem is caused by the
prune_inferiors call which can be seen in the above backtrace.

In prune_inferiors we check if the inferior is deletable, and if it
is, we delete it.  The problem is, I think, we should also check if
the target is currently in a state that would allow us to delete the
inferior.  We don't currently have such a check available, we'd need
to add one, but for the remote target, this would return false if the
remote is in async mode and the remote is currently waiting for a stop
reply.  With this change in place GDB would defer deleting the
inferior until the remote target has stopped, at which point GDB would
be able to refill the auxv cache successfully.

And then, problem #3 becomes evident when we ask why GDB is needing to
refill the auxv cache now when it didn't need to for GDB 13.  This is
where the second commit mentioned above (a2827364e2bf) comes in.
Prior to this commit, the auxv cache was cleared by the
executable_changed observer, while after that commit the auxv cache
was cleared by the new_objfile observer -- but only when the
new_objfile observer is used in the special mode that actually means
that all objfiles have been unloaded (I know, the overloading of the
new_objfile observer is horrible, and unnecessary, but it's not really
important for this bug).

The difference arises because the new_objfile observer is triggered
from clear_symtab_users, which in turn is called from
program_space::~program_space.  The new_objfile observer for auxv does
this:

  static void
  auxv_new_objfile_observer (struct objfile *objfile)
  {
    if (objfile == nullptr)
      invalidate_auxv_cache_inf (current_inferior ());
  }

That is, when all the objfiles are unloaded, we clear the auxv cache
for the current inferior.

The problem is, then when we look at the prune_inferiors ->
delete_inferior -> ~program_space path, we see that the current
inferior is not going to be an inferior that exists within the
program_space being deleted; delete_inferior removes the deleted
inferior from the global inferior list, and then only deletes the
program_space if program_space::empty() returns true, which is only
the case if the current inferior isn't within the program_space to
delete, and no other inferior exists within that program_space
either.

What this means is that when the new_objfile observer is called we
can't rely on the current inferior having any relationship with the
program space in which the objfiles were removed.  This was an error
in the commit a2827364e2bf, the only thing we can rely on is the
current program space.  As a result of this mistake, after commit
a2827364e2bf, GDB was sometimes clearing the auxv cache for a random
inferior.  In the native target case this was harmless as we can
always refill the cache when needed, but in the remote target case, if
we need to refill the cache when the remote target is executing, then
we get the crash we observed.

And additionally, if we think about this a little more, we see that
commit a2827364e2bf made another mistake.  When all the objfiles are
removed, they are removed from a program_space, a program_space might
contain multiple inferiors, so surely, we should clear the auxv cache
for all of the matching inferiors?

Given these two insights, that the current_inferior is not relevant,
only the current_program_space, and that we should be clearing the
cache for all inferiors in the current_program_space, we can update
auxv_new_objfile_observer to:

  if (objfile == nullptr)
    {
      for (inferior *inf : all_inferiors ())
	{
	  if (inf->pspace == current_program_space)
	    invalidate_auxv_cache_inf (inf);
	}
    }

With this change we now correctly clear the auxv cache for the correct
inferiors, and GDB no longer needs to refill the cache at an
inconvenient time, this avoids the crash we were seeing.

And finally, we reach problem #4.  Inspired by the observation that
using the current_inferior from within the ~program_space function was
not correct, I added some debug to see if current_inferior() was
called anywhere else (below ~program_space), and the answer is yes,
it's called a often.  Mostly the culprit is GDB doing:

  current_inferior ()->top_target ()-> ....

But I think all of these calls are most likely doing the wrong thing,
and only work because the top target in all these cases is shared
between all inferiors, e.g. it's the native target, or the remote
target for all inferiors.  But if we had a truly multi-connection
setup, then we might start to see odd behaviour.

Problem #1 I'm just ignoring for now, I guess at some point we might
run into this again, and then we'd need to solve this.  But in this
case I wasn't sure what a "good" solution would look like.  We need
the auxv data in order to implement the linux_is_uclinux() function.
If we can't get the auxv data then what should we do, assume yes, or
assume no?  The right answer would probably be to propagate the error
back up the stack, but then we reach ~program_space, and throwing
exceptions from a destructor is problematic, so we'd need to catch and
deal at this point.  The linux_is_uclinux() call is made from within
gdbarch_has_shared_address_space(), which is used like:

  if (!gdbarch_has_shared_address_space (target_gdbarch ()))
    delete this->aspace;

So, we would have to choose; delete the address space or not.  If we
delete it on error, then we might delete an address space that is
shared within another program space.  If we don't delete the address
space, then we might leak it.  Neither choice is great.

A better solution might be to have the address spaces be reference
counted, then we could remove the gdbarch_has_shared_address_space
call completely, and just rely on the reference count to auto-delete
the address space when appropriate.

The solution for problem #2 I already hinted at above, we should have
a new target_can_delete_inferiors() call, which should be called from
prune_inferiors, this would prevent GDB from trying to delete
inferiors when a (remote) target is in a state where we know it can't
delete the inferior.  Deleting an inferior often (always?) requires
sending packets to the remote, and if the remote is waiting for a stop
reply then this will never work, so the pruning should be deferred in
this case.

The solution for problem #3 is included in this commit.

And, for problem #4, I'm not sure what the right solution is.  Maybe
delete_inferior should ensure the inferior to be deleted is in place
when ~program_space is called?  But that seems a little weird, as the
current inferior would, in theory, still be using the current
program_space...

Anyway, after this commit, the gdb.python/py-progspace-events.exp test
now passes when run with the native-extended-remote board.

Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=30935
---
 gdb/auxv.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/gdb/auxv.c b/gdb/auxv.c
index 9f599b04a4f..cfb44ffa0bb 100644
--- a/gdb/auxv.c
+++ b/gdb/auxv.c
@@ -351,7 +351,16 @@ static void
 auxv_new_objfile_observer (struct objfile *objfile)
 {
   if (objfile == nullptr)
-    invalidate_auxv_cache_inf (current_inferior ());
+    {
+      /* When OBJFILE is nullptr, this indicates that all symbol files have
+	 been unloaded from the current program space.  Discard cached auxv
+	 information from any inferior within the effected program space.  */
+      for (inferior *inf : all_inferiors ())
+	{
+	  if (inf->pspace == current_program_space)
+	    invalidate_auxv_cache_inf (inf);
+	}
+    }
 }

 /* See auxv.h.  */

base-commit: e030ce2c79feee6a35665a6580a23dcf937ea46f
-- 
2.25.4

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] gdb: fix auxv cache clearing from new_objfile observer
  2023-10-03 16:20 [PATCH] gdb: fix auxv cache clearing from new_objfile observer Andrew Burgess
@ 2023-10-03 17:50 ` Simon Marchi
  2023-10-05 17:20   ` Simon Marchi
  2023-11-06  8:21   ` Tom de Vries
  2023-10-06 16:43 ` Andrew Burgess
  1 sibling, 2 replies; 5+ messages in thread
From: Simon Marchi @ 2023-10-03 17:50 UTC (permalink / raw)
  To: Andrew Burgess, gdb-patches

On 10/3/23 12:20, Andrew Burgess wrote:
> It was pointed out on the mailing list that a recently added
> test (gdb.python/py-progspace-events.exp) was failing when run with
> the native-extended-gdbserver board.  This test was added with this
> commit:
> 
>   commit 59912fb2d22f8a4bb0862487f12a5cc65b6a013f
>   Date:   Tue Sep 19 11:45:36 2023 +0100
> 
>       gdb: add Python events for program space addition and removal
> 
> It turns out though that the test is failing due to a existing bug
> in GDB, the new test just exposes the problem.  Additionally, the
> failure really doesn't even rely on the new functionality added in the
> above commit.  I reduced the test to a simple set of steps that
> reproduced the failure and tested against GDB 13, and the test passes;
> so the bug was introduced since then.  In fact, the bug was introduced
> with this commit:
> 
>   commit a2827364e2bf19910fa5a54364a594a5ba3033b8
>   Date:   Fri Sep 8 15:48:16 2023 +0100
> 
>       gdb: remove final user of the executable_changed observer
> 
> This commit changed how the per-inferior auxv data cache is managed,
> specifically, when the cache is cleared, and it is this that leads to
> the failure.
> 
> This bug is interesting because it exposes a number of issues with
> GDB, I'll explain all of the problems I see, though ultimately, I only
> propose fixing one problem in this commit, which is enough to resolve
> the crash we are currently seeing.
> 
> The crash that we are seeing manifests like this:
> 
>   ...
>   [Inferior 2 (process 3970384) exited normally]
>   +inferior 1
>   [Switching to inferior 1 [process 3970383] (/tmp/build/gdb/testsuite/outputs/gdb.python/py-progspace-events/py-progspace-events)]
>   [Switching to thread 1.1 (Thread 3970383.3970383)]
>   #0  breakpt () at /tmp/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.python/py-progspace-events.c:28
>   28	{ /* Nothing.  */ }
>   (gdb) step
>   +step
>   terminate called after throwing an instance of 'gdb_exception_error'
> 
>   Fatal signal: Aborted
>   ... etc ...
> 
> What's happening is that GDB attempts to refile the auxv cache as a

refile -> refill?

> result of the gdbarch_has_shared_address_space call in
> program_space::~program_space, the backtrace looks like this:
> 
>   #0  0x00007fb4f419a9a5 in raise () from /lib64/libpthread.so.0
>   #1  0x00000000008b635d in handle_fatal_signal (sig=6) at ../../src/gdb/event-top.c:912
>   #2  <signal handler called>
>   #3  0x00007fb4f38e3625 in raise () from /lib64/libc.so.6
>   #4  0x00007fb4f38cc8d9 in abort () from /lib64/libc.so.6
>   #5  0x00007fb4f3c70756 in __gnu_cxx::__verbose_terminate_handler() [clone .cold] () from /lib64/libstdc++.so.6
>   #6  0x00007fb4f3c7c6dc in __cxxabiv1::__terminate(void (*)()) () from /lib64/libstdc++.so.6
>   #7  0x00007fb4f3c7b6e9 in __cxa_call_terminate () from /lib64/libstdc++.so.6
>   #8  0x00007fb4f3c7c094 in __gxx_personality_v0 () from /lib64/libstdc++.so.6
>   #9  0x00007fb4f3a80c63 in _Unwind_RaiseException_Phase2 () from /lib64/libgcc_s.so.1
>   #10 0x00007fb4f3a8154e in _Unwind_Resume () from /lib64/libgcc_s.so.1
>   #11 0x0000000000e8832d in target_read_alloc_1<unsigned char> (ops=0x408a3a0, object=TARGET_OBJECT_AUXV, annex=0x0) at ../../src/gdb/target.c:2266
>   #12 0x0000000000e73dea in target_read_alloc (ops=0x408a3a0, object=TARGET_OBJECT_AUXV, annex=0x0) at ../../src/gdb/target.c:2315
>   #13 0x000000000058248c in target_read_auxv_raw (ops=0x408a3a0) at ../../src/gdb/auxv.c:379
>   #14 0x000000000058243d in target_read_auxv () at ../../src/gdb/auxv.c:368
>   #15 0x000000000058255c in target_auxv_search (match=0x0, valp=0x7ffdee17c598) at ../../src/gdb/auxv.c:415
>   #16 0x0000000000a464bb in linux_is_uclinux () at ../../src/gdb/linux-tdep.c:433
>   #17 0x0000000000a464f6 in linux_has_shared_address_space (gdbarch=0x409a2d0) at ../../src/gdb/linux-tdep.c:440
>   #18 0x0000000000510eae in gdbarch_has_shared_address_space (gdbarch=0x409a2d0) at ../../src/gdb/gdbarch.c:4889
>   #19 0x0000000000bc7558 in program_space::~program_space (this=0x4544aa0, __in_chrg=<optimized out>) at ../../src/gdb/progspace.c:124
>   #20 0x00000000009b245d in delete_inferior (inf=0x47b3de0) at ../../src/gdb/inferior.c:290
>   #21 0x00000000009b2c10 in prune_inferiors () at ../../src/gdb/inferior.c:480
>   #22 0x00000000009c5e3e in fetch_inferior_event () at ../../src/gdb/infrun.c:4558
>   #23 0x000000000099b4dc in inferior_event_handler (event_type=INF_REG_EVENT) at ../../src/gdb/inf-loop.c:42
>   #24 0x0000000000cbc64f in remote_async_serial_handler (scb=0x4090a30, context=0x408a6b0) at ../../src/gdb/remote.c:14859
>   #25 0x0000000000d83d3a in run_async_handler_and_reschedule (scb=0x4090a30) at ../../src/gdb/ser-base.c:138
>   #26 0x0000000000d83e1f in fd_event (error=0, context=0x4090a30) at ../../src/gdb/ser-base.c:189
> 
> So this is problem #1, if we throw an exception while deleting a
> program_space then this is not caught, and is going to crash GDB.

I had forgotten that trying to throw exceptions from destructors causes
std::terminate to be called.

> Problem #2 becomes evident when we ask why GDB is throwing an error in
> this case; the error is thrown because the remote target, operating in
> non-async mode, can't read the auxv data while an inferior is running
> and GDB is waiting for a stop reply.  The problem here then, is why
> does GDB get into a position where it tries to interact with the
> remote target in this way, at this time?  The problem is caused by the
> prune_inferiors call which can be seen in the above backtrace.
> 
> In prune_inferiors we check if the inferior is deletable, and if it
> is, we delete it.  The problem is, I think, we should also check if
> the target is currently in a state that would allow us to delete the
> inferior.  We don't currently have such a check available, we'd need
> to add one, but for the remote target, this would return false if the
> remote is in async mode and the remote is currently waiting for a stop
> reply.  With this change in place GDB would defer deleting the
> inferior until the remote target has stopped, at which point GDB would
> be able to refill the auxv cache successfully.
> 
> And then, problem #3 becomes evident when we ask why GDB is needing to
> refill the auxv cache now when it didn't need to for GDB 13.  This is
> where the second commit mentioned above (a2827364e2bf) comes in.
> Prior to this commit, the auxv cache was cleared by the
> executable_changed observer, while after that commit the auxv cache
> was cleared by the new_objfile observer -- but only when the
> new_objfile observer is used in the special mode that actually means
> that all objfiles have been unloaded (I know, the overloading of the
> new_objfile observer is horrible, and unnecessary, but it's not really
> important for this bug).

Agreed, I started to write a patch to add a new "all_objfiles_removed"
observable and remove that special mode of new_objfile.

> 
> The difference arises because the new_objfile observer is triggered
> from clear_symtab_users, which in turn is called from
> program_space::~program_space.  The new_objfile observer for auxv does
> this:
> 
>   static void
>   auxv_new_objfile_observer (struct objfile *objfile)
>   {
>     if (objfile == nullptr)
>       invalidate_auxv_cache_inf (current_inferior ());
>   }
> 
> That is, when all the objfiles are unloaded, we clear the auxv cache
> for the current inferior.
> 
> The problem is, then when we look at the prune_inferiors ->
> delete_inferior -> ~program_space path, we see that the current
> inferior is not going to be an inferior that exists within the
> program_space being deleted; delete_inferior removes the deleted
> inferior from the global inferior list, and then only deletes the
> program_space if program_space::empty() returns true, which is only
> the case if the current inferior isn't within the program_space to
> delete, and no other inferior exists within that program_space
> either.
> 
> What this means is that when the new_objfile observer is called we
> can't rely on the current inferior having any relationship with the
> program space in which the objfiles were removed.  This was an error
> in the commit a2827364e2bf, the only thing we can rely on is the
> current program space.  As a result of this mistake, after commit
> a2827364e2bf, GDB was sometimes clearing the auxv cache for a random
> inferior.  In the native target case this was harmless as we can
> always refill the cache when needed, but in the remote target case, if
> we need to refill the cache when the remote target is executing, then
> we get the crash we observed.
> 
> And additionally, if we think about this a little more, we see that
> commit a2827364e2bf made another mistake.  When all the objfiles are
> removed, they are removed from a program_space, a program_space might
> contain multiple inferiors, so surely, we should clear the auxv cache
> for all of the matching inferiors?
> 
> Given these two insights, that the current_inferior is not relevant,
> only the current_program_space, and that we should be clearing the
> cache for all inferiors in the current_program_space, we can update
> auxv_new_objfile_observer to:
> 
>   if (objfile == nullptr)
>     {
>       for (inferior *inf : all_inferiors ())
> 	{
> 	  if (inf->pspace == current_program_space)
> 	    invalidate_auxv_cache_inf (inf);
> 	}
>     }

So, in my patch adding the all_objfiles_removed observer, I have:

    static void
    auxv_all_objfiles_removed (program_space *pspace)
    {
      for (inferior *inf : all_inferiors ())
        if (inf->pspace == pspace)
          invalidate_auxv_cache_inf (inf);
    }

... which is identical to what you came up with.

> 
> With this change we now correctly clear the auxv cache for the correct
> inferiors, and GDB no longer needs to refill the cache at an
> inconvenient time, this avoids the crash we were seeing.
> 
> And finally, we reach problem #4.  Inspired by the observation that
> using the current_inferior from within the ~program_space function was
> not correct, I added some debug to see if current_inferior() was
> called anywhere else (below ~program_space), and the answer is yes,
> it's called a often.  Mostly the culprit is GDB doing:
> 
>   current_inferior ()->top_target ()-> ....
> 
> But I think all of these calls are most likely doing the wrong thing,
> and only work because the top target in all these cases is shared
> between all inferiors, e.g. it's the native target, or the remote
> target for all inferiors.  But if we had a truly multi-connection
> setup, then we might start to see odd behaviour.
> 
> Problem #1 I'm just ignoring for now, I guess at some point we might
> run into this again, and then we'd need to solve this.  But in this
> case I wasn't sure what a "good" solution would look like.  We need
> the auxv data in order to implement the linux_is_uclinux() function.
> If we can't get the auxv data then what should we do, assume yes, or
> assume no?  The right answer would probably be to propagate the error
> back up the stack, but then we reach ~program_space, and throwing
> exceptions from a destructor is problematic, so we'd need to catch and
> deal at this point.  The linux_is_uclinux() call is made from within
> gdbarch_has_shared_address_space(), which is used like:
> 
>   if (!gdbarch_has_shared_address_space (target_gdbarch ()))
>     delete this->aspace;
> 
> So, we would have to choose; delete the address space or not.  If we
> delete it on error, then we might delete an address space that is
> shared within another program space.  If we don't delete the address
> space, then we might leak it.  Neither choice is great.
> 
> A better solution might be to have the address spaces be reference
> counted, then we could remove the gdbarch_has_shared_address_space
> call completely, and just rely on the reference count to auto-delete
> the address space when appropriate.

I advocate for the shared_ptr / refcount solution.  ~program_space is
really not a good place to do target calls, as you noted.

> The solution for problem #2 I already hinted at above, we should have
> a new target_can_delete_inferiors() call, which should be called from
> prune_inferiors, this would prevent GDB from trying to delete
> inferiors when a (remote) target is in a state where we know it can't
> delete the inferior.  Deleting an inferior often (always?) requires
> sending packets to the remote, and if the remote is waiting for a stop
> reply then this will never work, so the pruning should be deferred in
> this case.

What other packets do we need to send when deleting an inferior?  If an
inferior is prunable (or removed using the remove-inferiors command), I
expect that it will be dead and the work will all be GDB-side.

> The solution for problem #3 is included in this commit.

This change LGTM.

> And, for problem #4, I'm not sure what the right solution is.  Maybe
> delete_inferior should ensure the inferior to be deleted is in place
> when ~program_space is called?  But that seems a little weird, as the
> current inferior would, in theory, still be using the current
> program_space...

I would have to see which target calls are made exactly, until then I
don't have an opinion.

> Anyway, after this commit, the gdb.python/py-progspace-events.exp test
> now passes when run with the native-extended-remote board.
> 
> Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=30935
> ---
>  gdb/auxv.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/gdb/auxv.c b/gdb/auxv.c
> index 9f599b04a4f..cfb44ffa0bb 100644
> --- a/gdb/auxv.c
> +++ b/gdb/auxv.c
> @@ -351,7 +351,16 @@ static void
>  auxv_new_objfile_observer (struct objfile *objfile)
>  {
>    if (objfile == nullptr)
> -    invalidate_auxv_cache_inf (current_inferior ());
> +    {
> +      /* When OBJFILE is nullptr, this indicates that all symbol files have
> +	 been unloaded from the current program space.  Discard cached auxv
> +	 information from any inferior within the effected program space.  */
> +      for (inferior *inf : all_inferiors ())
> +	{
> +	  if (inf->pspace == current_program_space)
> +	    invalidate_auxv_cache_inf (inf);
> +	}
> +    }

This LGTM:

Approved-By: Simon Marchi <simon.marchi@efficios.com>

Simon

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] gdb: fix auxv cache clearing from new_objfile observer
  2023-10-03 17:50 ` Simon Marchi
@ 2023-10-05 17:20   ` Simon Marchi
  2023-11-06  8:21   ` Tom de Vries
  1 sibling, 0 replies; 5+ messages in thread
From: Simon Marchi @ 2023-10-05 17:20 UTC (permalink / raw)
  To: Andrew Burgess, gdb-patches

On 10/3/23 13:50, Simon Marchi via Gdb-patches wrote:
> Approved-By: Simon Marchi <simon.marchi@efficios.com>

I'll go ahead and push your patch, so that I can push my series here:

https://inbox.sourceware.org/gdb-patches/87sf6pvxfo.fsf@tromey.com/T/#m59e8d78f695a1ee9d24b4013a094884356b2c3ed

I have the equivalent change in my series, but I'd like for your commit
message to be in the git history.

Simon

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] gdb: fix auxv cache clearing from new_objfile observer
  2023-10-03 16:20 [PATCH] gdb: fix auxv cache clearing from new_objfile observer Andrew Burgess
  2023-10-03 17:50 ` Simon Marchi
@ 2023-10-06 16:43 ` Andrew Burgess
  1 sibling, 0 replies; 5+ messages in thread
From: Andrew Burgess @ 2023-10-06 16:43 UTC (permalink / raw)
  To: gdb-patches; +Cc: Simon Marchi

Andrew Burgess <aburgess@redhat.com> writes:

> It was pointed out on the mailing list that a recently added
> test (gdb.python/py-progspace-events.exp) was failing when run with
> the native-extended-gdbserver board.  This test was added with this
> commit:
>
>   commit 59912fb2d22f8a4bb0862487f12a5cc65b6a013f
>   Date:   Tue Sep 19 11:45:36 2023 +0100
>
>       gdb: add Python events for program space addition and removal
>
> It turns out though that the test is failing due to a existing bug
> in GDB, the new test just exposes the problem.  Additionally, the
> failure really doesn't even rely on the new functionality added in the
> above commit.  I reduced the test to a simple set of steps that
> reproduced the failure and tested against GDB 13, and the test passes;
> so the bug was introduced since then.  In fact, the bug was introduced
> with this commit:
>
>   commit a2827364e2bf19910fa5a54364a594a5ba3033b8
>   Date:   Fri Sep 8 15:48:16 2023 +0100
>
>       gdb: remove final user of the executable_changed observer
>
> This commit changed how the per-inferior auxv data cache is managed,
> specifically, when the cache is cleared, and it is this that leads to
> the failure.
>
> This bug is interesting because it exposes a number of issues with
> GDB, I'll explain all of the problems I see, though ultimately, I only
> propose fixing one problem in this commit, which is enough to resolve
> the crash we are currently seeing.
>
> The crash that we are seeing manifests like this:
>
>   ...
>   [Inferior 2 (process 3970384) exited normally]
>   +inferior 1
>   [Switching to inferior 1 [process 3970383] (/tmp/build/gdb/testsuite/outputs/gdb.python/py-progspace-events/py-progspace-events)]
>   [Switching to thread 1.1 (Thread 3970383.3970383)]
>   #0  breakpt () at /tmp/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.python/py-progspace-events.c:28
>   28	{ /* Nothing.  */ }
>   (gdb) step
>   +step
>   terminate called after throwing an instance of 'gdb_exception_error'
>
>   Fatal signal: Aborted
>   ... etc ...
>
> What's happening is that GDB attempts to refile the auxv cache as a
> result of the gdbarch_has_shared_address_space call in
> program_space::~program_space, the backtrace looks like this:
>
>   #0  0x00007fb4f419a9a5 in raise () from /lib64/libpthread.so.0
>   #1  0x00000000008b635d in handle_fatal_signal (sig=6) at ../../src/gdb/event-top.c:912
>   #2  <signal handler called>
>   #3  0x00007fb4f38e3625 in raise () from /lib64/libc.so.6
>   #4  0x00007fb4f38cc8d9 in abort () from /lib64/libc.so.6
>   #5  0x00007fb4f3c70756 in __gnu_cxx::__verbose_terminate_handler() [clone .cold] () from /lib64/libstdc++.so.6
>   #6  0x00007fb4f3c7c6dc in __cxxabiv1::__terminate(void (*)()) () from /lib64/libstdc++.so.6
>   #7  0x00007fb4f3c7b6e9 in __cxa_call_terminate () from /lib64/libstdc++.so.6
>   #8  0x00007fb4f3c7c094 in __gxx_personality_v0 () from /lib64/libstdc++.so.6
>   #9  0x00007fb4f3a80c63 in _Unwind_RaiseException_Phase2 () from /lib64/libgcc_s.so.1
>   #10 0x00007fb4f3a8154e in _Unwind_Resume () from /lib64/libgcc_s.so.1
>   #11 0x0000000000e8832d in target_read_alloc_1<unsigned char> (ops=0x408a3a0, object=TARGET_OBJECT_AUXV, annex=0x0) at ../../src/gdb/target.c:2266
>   #12 0x0000000000e73dea in target_read_alloc (ops=0x408a3a0, object=TARGET_OBJECT_AUXV, annex=0x0) at ../../src/gdb/target.c:2315
>   #13 0x000000000058248c in target_read_auxv_raw (ops=0x408a3a0) at ../../src/gdb/auxv.c:379
>   #14 0x000000000058243d in target_read_auxv () at ../../src/gdb/auxv.c:368
>   #15 0x000000000058255c in target_auxv_search (match=0x0, valp=0x7ffdee17c598) at ../../src/gdb/auxv.c:415
>   #16 0x0000000000a464bb in linux_is_uclinux () at ../../src/gdb/linux-tdep.c:433
>   #17 0x0000000000a464f6 in linux_has_shared_address_space (gdbarch=0x409a2d0) at ../../src/gdb/linux-tdep.c:440
>   #18 0x0000000000510eae in gdbarch_has_shared_address_space (gdbarch=0x409a2d0) at ../../src/gdb/gdbarch.c:4889
>   #19 0x0000000000bc7558 in program_space::~program_space (this=0x4544aa0, __in_chrg=<optimized out>) at ../../src/gdb/progspace.c:124
>   #20 0x00000000009b245d in delete_inferior (inf=0x47b3de0) at ../../src/gdb/inferior.c:290
>   #21 0x00000000009b2c10 in prune_inferiors () at ../../src/gdb/inferior.c:480
>   #22 0x00000000009c5e3e in fetch_inferior_event () at ../../src/gdb/infrun.c:4558
>   #23 0x000000000099b4dc in inferior_event_handler (event_type=INF_REG_EVENT) at ../../src/gdb/inf-loop.c:42
>   #24 0x0000000000cbc64f in remote_async_serial_handler (scb=0x4090a30, context=0x408a6b0) at ../../src/gdb/remote.c:14859
>   #25 0x0000000000d83d3a in run_async_handler_and_reschedule (scb=0x4090a30) at ../../src/gdb/ser-base.c:138
>   #26 0x0000000000d83e1f in fd_event (error=0, context=0x4090a30) at ../../src/gdb/ser-base.c:189
>
> So this is problem #1, if we throw an exception while deleting a
> program_space then this is not caught, and is going to crash GDB.
>
> Problem #2 becomes evident when we ask why GDB is throwing an error in
> this case; the error is thrown because the remote target, operating in
> non-async mode, can't read the auxv data while an inferior is running
> and GDB is waiting for a stop reply.  The problem here then, is why
> does GDB get into a position where it tries to interact with the
> remote target in this way, at this time?  The problem is caused by the
> prune_inferiors call which can be seen in the above backtrace.
>
> In prune_inferiors we check if the inferior is deletable, and if it
> is, we delete it.  The problem is, I think, we should also check if
> the target is currently in a state that would allow us to delete the
> inferior.  We don't currently have such a check available, we'd need
> to add one, but for the remote target, this would return false if the
> remote is in async mode and the remote is currently waiting for a stop
> reply.  With this change in place GDB would defer deleting the
> inferior until the remote target has stopped, at which point GDB would
> be able to refill the auxv cache successfully.
>
> And then, problem #3 becomes evident when we ask why GDB is needing to
> refill the auxv cache now when it didn't need to for GDB 13.  This is
> where the second commit mentioned above (a2827364e2bf) comes in.
> Prior to this commit, the auxv cache was cleared by the
> executable_changed observer, while after that commit the auxv cache
> was cleared by the new_objfile observer -- but only when the
> new_objfile observer is used in the special mode that actually means
> that all objfiles have been unloaded (I know, the overloading of the
> new_objfile observer is horrible, and unnecessary, but it's not really
> important for this bug).
>
> The difference arises because the new_objfile observer is triggered
> from clear_symtab_users, which in turn is called from
> program_space::~program_space.  The new_objfile observer for auxv does
> this:
>
>   static void
>   auxv_new_objfile_observer (struct objfile *objfile)
>   {
>     if (objfile == nullptr)
>       invalidate_auxv_cache_inf (current_inferior ());
>   }
>
> That is, when all the objfiles are unloaded, we clear the auxv cache
> for the current inferior.
>
> The problem is, then when we look at the prune_inferiors ->
> delete_inferior -> ~program_space path, we see that the current
> inferior is not going to be an inferior that exists within the
> program_space being deleted; delete_inferior removes the deleted
> inferior from the global inferior list, and then only deletes the
> program_space if program_space::empty() returns true, which is only
> the case if the current inferior isn't within the program_space to
> delete, and no other inferior exists within that program_space
> either.
>
> What this means is that when the new_objfile observer is called we
> can't rely on the current inferior having any relationship with the
> program space in which the objfiles were removed.  This was an error
> in the commit a2827364e2bf, the only thing we can rely on is the
> current program space.  As a result of this mistake, after commit
> a2827364e2bf, GDB was sometimes clearing the auxv cache for a random
> inferior.  In the native target case this was harmless as we can
> always refill the cache when needed, but in the remote target case, if
> we need to refill the cache when the remote target is executing, then
> we get the crash we observed.
>
> And additionally, if we think about this a little more, we see that
> commit a2827364e2bf made another mistake.  When all the objfiles are
> removed, they are removed from a program_space, a program_space might
> contain multiple inferiors, so surely, we should clear the auxv cache
> for all of the matching inferiors?
>
> Given these two insights, that the current_inferior is not relevant,
> only the current_program_space, and that we should be clearing the
> cache for all inferiors in the current_program_space, we can update
> auxv_new_objfile_observer to:
>
>   if (objfile == nullptr)
>     {
>       for (inferior *inf : all_inferiors ())
> 	{
> 	  if (inf->pspace == current_program_space)
> 	    invalidate_auxv_cache_inf (inf);
> 	}
>     }
>
> With this change we now correctly clear the auxv cache for the correct
> inferiors, and GDB no longer needs to refill the cache at an
> inconvenient time, this avoids the crash we were seeing.
>
> And finally, we reach problem #4.  Inspired by the observation that
> using the current_inferior from within the ~program_space function was
> not correct, I added some debug to see if current_inferior() was
> called anywhere else (below ~program_space), and the answer is yes,
> it's called a often.  Mostly the culprit is GDB doing:
>
>   current_inferior ()->top_target ()-> ....
>
> But I think all of these calls are most likely doing the wrong thing,
> and only work because the top target in all these cases is shared
> between all inferiors, e.g. it's the native target, or the remote
> target for all inferiors.  But if we had a truly multi-connection
> setup, then we might start to see odd behaviour.
>
> Problem #1 I'm just ignoring for now, I guess at some point we might
> run into this again, and then we'd need to solve this.  But in this
> case I wasn't sure what a "good" solution would look like.  We need
> the auxv data in order to implement the linux_is_uclinux() function.
> If we can't get the auxv data then what should we do, assume yes, or
> assume no?  The right answer would probably be to propagate the error
> back up the stack, but then we reach ~program_space, and throwing
> exceptions from a destructor is problematic, so we'd need to catch and
> deal at this point.  The linux_is_uclinux() call is made from within
> gdbarch_has_shared_address_space(), which is used like:
>
>   if (!gdbarch_has_shared_address_space (target_gdbarch ()))
>     delete this->aspace;
>
> So, we would have to choose; delete the address space or not.  If we
> delete it on error, then we might delete an address space that is
> shared within another program space.  If we don't delete the address
> space, then we might leak it.  Neither choice is great.
>
> A better solution might be to have the address spaces be reference
> counted, then we could remove the gdbarch_has_shared_address_space
> call completely, and just rely on the reference count to auto-delete
> the address space when appropriate.
>
> The solution for problem #2 I already hinted at above, we should have
> a new target_can_delete_inferiors() call, which should be called from
> prune_inferiors, this would prevent GDB from trying to delete
> inferiors when a (remote) target is in a state where we know it can't
> delete the inferior.  Deleting an inferior often (always?) requires
> sending packets to the remote, and if the remote is waiting for a stop
> reply then this will never work, so the pruning should be deferred in
> this case.
>
> The solution for problem #3 is included in this commit.
>
> And, for problem #4, I'm not sure what the right solution is.  Maybe
> delete_inferior should ensure the inferior to be deleted is in place
> when ~program_space is called?  But that seems a little weird, as the
> current inferior would, in theory, still be using the current
> program_space...
>
> Anyway, after this commit, the gdb.python/py-progspace-events.exp test
> now passes when run with the native-extended-remote board.
>
> Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=30935

This commit was merged (1b28c0f488bb).

Thanks,
Andrew


> ---
>  gdb/auxv.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/gdb/auxv.c b/gdb/auxv.c
> index 9f599b04a4f..cfb44ffa0bb 100644
> --- a/gdb/auxv.c
> +++ b/gdb/auxv.c
> @@ -351,7 +351,16 @@ static void
>  auxv_new_objfile_observer (struct objfile *objfile)
>  {
>    if (objfile == nullptr)
> -    invalidate_auxv_cache_inf (current_inferior ());
> +    {
> +      /* When OBJFILE is nullptr, this indicates that all symbol files have
> +	 been unloaded from the current program space.  Discard cached auxv
> +	 information from any inferior within the effected program space.  */
> +      for (inferior *inf : all_inferiors ())
> +	{
> +	  if (inf->pspace == current_program_space)
> +	    invalidate_auxv_cache_inf (inf);
> +	}
> +    }
>  }
>  
>  /* See auxv.h.  */
>
> base-commit: e030ce2c79feee6a35665a6580a23dcf937ea46f
> -- 
> 2.25.4


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] gdb: fix auxv cache clearing from new_objfile observer
  2023-10-03 17:50 ` Simon Marchi
  2023-10-05 17:20   ` Simon Marchi
@ 2023-11-06  8:21   ` Tom de Vries
  1 sibling, 0 replies; 5+ messages in thread
From: Tom de Vries @ 2023-11-06  8:21 UTC (permalink / raw)
  To: Simon Marchi, Andrew Burgess, gdb-patches

On 10/3/23 19:50, Simon Marchi via Gdb-patches wrote:
>> A better solution might be to have the address spaces be reference
>> counted, then we could remove the gdbarch_has_shared_address_space
>> call completely, and just rely on the reference count to auto-delete
>> the address space when appropriate.
> I advocate for the shared_ptr / refcount solution.  ~program_space is
> really not a good place to do target calls, as you noted.

I've implemented a refcount approach here ( 
https://sourceware.org/pipermail/gdb-patches/2023-November/203764.html ).

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2023-11-06  8:20 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-03 16:20 [PATCH] gdb: fix auxv cache clearing from new_objfile observer Andrew Burgess
2023-10-03 17:50 ` Simon Marchi
2023-10-05 17:20   ` Simon Marchi
2023-11-06  8:21   ` Tom de Vries
2023-10-06 16:43 ` Andrew Burgess

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).