* [Bug shlibs/30765] New: Recursive library loading problem when using glibc probes
@ 2023-08-15 13:43 aburgess at redhat dot com
  2023-08-15 14:31 ` [Bug shlibs/30765] " aburgess at redhat dot com
  0 siblings, 1 reply; 2+ messages in thread
From: aburgess at redhat dot com @ 2023-08-15 13:43 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=30765

            Bug ID: 30765
           Summary: Recursive library loading problem when using glibc
                    probes
           Product: gdb
           Version: HEAD
            Status: NEW
          Severity: normal
          Priority: P2
         Component: shlibs
          Assignee: unassigned at sourceware dot org
          Reporter: aburgess at redhat dot com
  Target Milestone: ---

Created attachment 15060
  --> https://sourceware.org/bugzilla/attachment.cgi?id=15060&action=edit
GDB test case that exposes the issue described in this bug.

This bug describes an issue with the mechanism GDB uses to detect
shared library loading, specifically with glibc's probe interface.  I think
the real problem is in glibc, though it may be possible to work around
this issue in GDB; I'm not sure how yet.

The attached patch applies to current(ish) HEAD of GDB (86dfe011797) and adds a
test which shows the problem.  When I run the test I see results like this:

                === gdb Summary ===

# of expected passes            4
# of known failures             3

Below is the description of the bug taken from the commit message included in
the patch:

    gdb/testsuite: expose issue with recursive dlopen

    This commit exposes an issue with GDB's handling of recursive dlopen.
    The bug is actually an issue in glibc, but I'm creating this patch so
    that I can file a GDB bug, which I'll then reference from a glibc bug.

    The bug is actually in glibc's reloc_complete probe, which the glibc
    documentation describes like this:

      reloc_complete:
        The linker has relocated all objects in the specified namespace.
        The namespace's r_debug structure is consistent and may be
        inspected, and all objects in the namespace's link-map are
        guaranteed to have been relocated.

    In this test we create a situation where a recursive dlopen occurs.
    This is done by overriding malloc.

    Inside the overridden malloc we dlopen a library (libbar), call a
    function from within that library, and then dlclose the library.  Care
    is taken so that we don't trigger this behaviour recursively: if the
    dlopen, call, dlclose sequence used within malloc triggers another
    malloc then, in that case, we just forward the request straight
    through to malloc.
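
The recursion protection described above can be sketched roughly as follows.
This is a minimal illustration, not the code from the attached test: to stay
self-contained it guards a separately named function (guarded_alloc) instead
of overriding malloc, and the dlopen/call/dlclose sequence is replaced by a
hypothetical callback (side_work).  All names here are assumptions.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the "dlopen libbar, call into it, dlclose"
   sequence.  In the real test this work happens inside malloc itself.  */
typedef void (*side_work_fn) (void);

static side_work_fn side_work;  /* Work to run on each outermost call.  */
static int in_side_work;        /* Nonzero while the side work is running.  */
static int side_work_runs;      /* How many times the side work has run.  */

/* Allocate SIZE bytes.  On an outermost call, run the side work first.
   If the side work itself allocates, the guard is already set, so that
   nested request goes straight through to malloc.  */
void *
guarded_alloc (size_t size)
{
  if (!in_side_work && side_work != NULL)
    {
      in_side_work = 1;
      side_work ();
      in_side_work = 0;
    }
  return malloc (size);
}

/* Example side work that allocates, as dlopen does.  */
static void
demo_side_work (void)
{
  side_work_runs++;
  free (guarded_alloc (16));  /* Nested call: must not re-enter.  */
}
```

Setting side_work to demo_side_work and calling guarded_alloc runs the side
work once per outermost call, however many allocations the side work itself
makes.  The plain static flag is not thread-safe; a multi-threaded variant
would need a per-thread guard.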

    Now, in the main() function we dlopen a different library (libfoo),
    call a function within it, and then dlclose the library.  There is no
    recursion protection here.  And so, the basic sequence of events is:

      In main, dlopen libfoo
        dlopen calls malloc
          In malloc, dlopen libbar
            dlopen calls malloc
              In malloc, allocate memory and return
            dlopen for libbar completes
          In malloc, call function from libbar
          In malloc, dlclose libbar
          In malloc, allocate memory and return
        dlopen for libfoo completes
      In main, call function from libfoo
      In main, dlclose libfoo

    It's not quite that simple: it turns out that dlopen calls malloc a
    number of times, and so we actually see repeated calls into malloc
    that each result in libbar being loaded, called, and closed.

    Within glibc, as each library is loaded, we pass through a number of
    probes:

      - map_start
      - map_complete
      - reloc_start
      - reloc_complete

    GDB only cares about the 'reloc_complete' probe, which is hit when all
    the libraries have been mapped and relocated.

    At some point after map_start the new library is added to the shared
    library list, but is not yet relocated.  Only when reloc_complete is
    hit are we guaranteed that all libraries have been relocated...

    The problem is, glibc calls malloc at some point between map_start and
    reloc_complete.  This call to malloc triggers the recursive dlopen.
    This recursive dlopen passes through all these probes, which means
    that GDB will be triggered by the reloc_complete probe.

    When the reloc_complete probe is hit the following things happen:

    First, GDB tries to only load information about the most recently
    added libraries.  To do this GDB tracks the known library list.  When
    reloc_complete is hit glibc passes GDB a pointer to the new library,
    which is part of a doubly linked list.

    GDB follows the back pointer for the new library and expects the
    previous library to be the last library that GDB knows was loaded.
    However, in our problem case this is not true.  The first
    library (libfoo) has already been added to the library list, but has
    not yet been announced (with reloc_complete) to GDB.  GDB is
    seeing the reloc_complete probe for libbar.  However, within glibc's
    data structure, the previous library is libfoo, and this is why we see
    the following warning from GDB:

      warning: Corrupted shared library list: 0x7ffff7ffd988 != 0x405ee0

    Now, when GDB emits that warning it falls back to performing a
    complete reload of all the shared libraries.  This is done by walking
    glibc's data structure to find all the libraries.  This will include
    libfoo, which has not yet been relocated.

    Unfortunately, there is nothing in glibc's data structure (that is
    visible to GDB) that can tell us that libfoo is not yet relocated.  As
    a result, GDB will believe that libfoo has been fully relocated, and
    will announce the library to the user.
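
The back-pointer check and the "reload everything" fallback can be modelled
with a small sketch.  This is a simplified model under stated assumptions,
not GDB's or glibc's actual code: the structure, its field names, and both
functions are hypothetical, and the 'relocated' flag exists only to make
visible the information that the debugger cannot see.

```c
#include <stddef.h>

/* Simplified model of one entry in the library list.  The real glibc
   structure has more fields; these names are illustrative only.  */
struct model_lib
{
  const char *name;
  struct model_lib *prev;
  struct model_lib *next;
  int relocated;  /* Known to the linker only; the debugger cannot see it.  */
};

/* Return nonzero if NEW_LIB directly follows LAST_KNOWN, i.e. an
   incremental update is possible.  */
int
incremental_update_ok (const struct model_lib *last_known,
                       const struct model_lib *new_lib)
{
  return new_lib->prev == last_known;
}

/* Fallback: walk the whole list from HEAD and count every entry as
   loaded.  The 'relocated' flag is deliberately not used to filter
   entries, since the debugger has no equivalent information; it is
   only tallied in *UNRELOCATED_SEEN to expose the problem.  */
int
full_reload (const struct model_lib *head, int *unrelocated_seen)
{
  int count = 0;

  *unrelocated_seen = 0;
  for (const struct model_lib *l = head; l != NULL; l = l->next)
    {
      count++;
      if (!l->relocated)
        (*unrelocated_seen)++;
    }
  return count;
}
```

With a list of main, libfoo (not yet relocated), libbar, and with main as the
last entry the debugger knows about, the check on libbar fails because its
back pointer is libfoo.  The fallback walk then reports all three entries,
including the unrelocated libfoo.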

    This test shows that the library is not fully relocated by stopping on
    the solib event, watching for GDB to tell us that libfoo has been
    loaded, and then printing a global variable from within the library.

    The global variable happens to be initialised with a pointer value,
    and so will not be correct unless relocation has been performed.  As
    we see, GDB can observe the global in an uninitialised state.
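
As an illustration of why such a global depends on relocation, consider the
sketch below.  The variable and function names are hypothetical and are not
taken from the attached test.  When code like this is built into a shared
library, the initial value of the pointer is an address that is only known
once the library has been placed in memory, so the dynamic linker has to
fill it in during relocation.

```c
/* Hypothetical contents of a library like libfoo.  */
int foo_target = 42;

/* The initial value is an address, which is only known once the library
   has been placed in memory, so the dynamic linker must patch this slot
   during relocation.  */
int *foo_global_ptr = &foo_target;

/* Read the value through the pointer; valid only after relocation.  */
int
foo_read_target (void)
{
  return *foo_global_ptr;
}
```

A debugger that prints foo_global_ptr before relocation has run would see
the value stored in the file rather than the final run-time address.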

    I don't know if there are wider implications from GDB seeing the
    library load earlier than it should.  We can, for sure, load the debug
    information at this point -- could we get anything wrong as a result
    of relocation not having been completed yet?  We could potentially
    trigger the loading of Python extensions from the library, and this
    could certainly run into problems if the Python code reads any globals
    that it expects to be initialised.

    In terms of fixing this, the only options I see would require GDB to
    be _more_ trusting of glibc, and even then, I don't think the solution
    would be perfect.  We could track the reloc_start/reloc_complete
    pairs to try and track recursion, and thus ignore libraries that have
    not been relocated yet, but this would mean we could not fall back (as
    we currently do) to just "reload everything", when we see some
    unexpected state -- as "everything" can include libraries that are not
    relocated yet.

    Also, if we attach to a process we're stuck: the only option is to
    walk the library list and "reload everything", but at that point we
    might end up finding a library that is not relocated yet.

    Ultimately, the right solution is for glibc to ensure that we really
    do only add the library to the library list just prior to hitting the
    reloc_complete probe.

    Well, to maintain the existing API, I think glibc would need to add
    the library to the list just prior to map_complete, then remove the
    library again just after reloc_start, before adding the libraries
    again at reloc_complete -- which really sucks.  Or maybe glibc needs
    to be smarter and "preallocate" its required memory ahead of time
    before mapping and relocating the library...

-- 
You are receiving this mail because:
You are on the CC list for the bug.


* [Bug shlibs/30765] Recursive library loading problem when using glibc probes
  2023-08-15 13:43 [Bug shlibs/30765] New: Recursive library loading problem when using glibc probes aburgess at redhat dot com
@ 2023-08-15 14:31 ` aburgess at redhat dot com
  0 siblings, 0 replies; 2+ messages in thread
From: aburgess at redhat dot com @ 2023-08-15 14:31 UTC (permalink / raw)
  To: gdb-prs

https://sourceware.org/bugzilla/show_bug.cgi?id=30765

Andrew Burgess <aburgess at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |30766

--- Comment #1 from Andrew Burgess <aburgess at redhat dot com> ---
I created glibc bug: https://sourceware.org/bugzilla/show_bug.cgi?id=30766 for
the glibc side of this issue.


Referenced Bugs:

https://sourceware.org/bugzilla/show_bug.cgi?id=30766
[Bug 30766] The reloc_complete probe can be hit when not all libraries have
been relocated