From: "aburgess at redhat dot com"
To: gdb-prs@sourceware.org
Subject: [Bug shlibs/30765] New: Recursive library loading problem when using glibc probes
Date: Tue, 15 Aug 2023 13:43:55 +0000

https://sourceware.org/bugzilla/show_bug.cgi?id=30765

            Bug ID: 30765
           Summary: Recursive library loading problem when using glibc probes
           Product: gdb
           Version: HEAD
            Status: NEW
          Severity: normal
          Priority: P2
         Component: shlibs
          Assignee: unassigned at sourceware dot org
          Reporter: aburgess at redhat dot com
  Target Milestone: ---

Created attachment 15060
  --> https://sourceware.org/bugzilla/attachment.cgi?id=15060&action=edit
GDB test case
that exposes the issue described in this bug.

This bug describes an issue with the mechanism GDB uses to detect
shared library loading, specifically with glibc's probe interface.  I think
the real problem is in glibc, though it may be possible to work around
the issue in GDB; I'm not sure how yet.

The attached patch applies to current(ish) HEAD of GDB (86dfe011797) and adds
a test which shows the problem.  When run, I see results like this:

                === gdb Summary ===

  # of expected passes            4
  # of known failures             3

Below is the description of the bug taken from the commit message included in
the patch:

gdb/testsuite: expose issue with recursive dlopen

This commit exposes an issue with GDB's handling of recursive dlopen.
The bug is actually an issue in glibc, but I'm creating this patch so
that I can file a GDB bug, which I'll then reference from a glibc bug.

The bug is in glibc's reloc_complete probe, which the glibc
documentation describes like this:

  reloc_complete: The linker has relocated all objects in the
  specified namespace.  The namespace's r_debug structure is
  consistent and may be inspected, and all objects in the namespace's
  link-map are guaranteed to have been relocated.

In this test we create a situation where a recursive dlopen occurs.
This is done by overriding malloc.  Inside the overridden malloc we
dlopen a library (libbar), call a function from within that library,
and then dlclose the library.

Care is taken so that we don't trigger this behaviour recursively: if
the dlopen, call, dlclose sequence used within malloc triggers another
malloc, then we just forward that request straight through to the real
malloc.

Now, in the main() function we dlopen a different library (libfoo),
call a function within it, and then dlclose the library.  There is no
recursion protection here.
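The setup described above can be sketched as a small toy model.  This is
not the C test from the attachment: it is plain Python with no real
dlopen or malloc, it assumes a single malloc call per dlopen, and every
name in it (model_malloc, model_dlopen, in_hook, and so on) is
hypothetical.  It only illustrates how the recursion guard shapes the
nesting of calls.

```python
# Toy model only: the real test is C code that overrides malloc and uses
# dlopen/dlclose.  All names here are hypothetical.

events = []
in_hook = False  # recursion guard, as described for the overridden malloc


def real_malloc(who):
    """Stand-in for forwarding the request to the real allocator."""
    events.append(f"In malloc, allocate memory and return (for {who})")


def model_malloc(who):
    """Overridden malloc: load, call and close libbar unless already inside."""
    global in_hook
    if in_hook:
        # Nested request: forward straight through, no further dlopen.
        real_malloc(who)
        return
    in_hook = True
    try:
        model_dlopen("libbar", where="malloc")
        events.append("In malloc, call function from libbar")
        events.append("In malloc, dlclose libbar")
    finally:
        in_hook = False
    real_malloc(who)


def model_dlopen(lib, where):
    """Model of dlopen, assumed here to call malloc exactly once."""
    events.append(f"In {where}, dlopen {lib}")
    events.append("dlopen calls malloc")
    model_malloc(lib)
    events.append(f"dlopen for {lib} completes")


def main():
    # No recursion protection in main, matching the description above.
    model_dlopen("libfoo", where="main")
    events.append("In main, call function from libfoo")
    events.append("In main, dlclose libfoo")


if __name__ == "__main__":
    main()
    print("\n".join(events))
```

In this model the load of libbar starts and finishes entirely inside the
load of libfoo, which is the nesting the rest of this report relies on.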
And so, the basic sequence of events is:

  In main, dlopen libfoo
  dlopen calls malloc
  In malloc, dlopen libbar
  dlopen calls malloc
  In malloc, allocate memory and return
  dlopen for libbar completes
  In malloc, call function from libbar
  In malloc, dlclose libbar
  In malloc, allocate memory and return
  dlopen for libfoo completes
  In main, call function from libfoo
  In main, dlclose libfoo

It's not quite that simple: dlopen calls malloc a number of times, so
we actually see repeated calls into malloc, each of which results in
libbar being loaded, called, and closed.

Within glibc, as each library is loaded, we pass through a number of
probes:

  - map_start
  - map_complete
  - reloc_start
  - reloc_complete

GDB only cares about the 'reloc_complete' probe, which is hit when all
the libraries have been mapped and relocated.

At some point after map_start the new library is added to the shared
library list, but is not yet relocated.  Only when reloc_complete is
hit are we guaranteed that all libraries have been relocated...

The problem is that glibc calls malloc at some point between map_start
and reloc_complete.  This call to malloc triggers the recursive dlopen.
The recursive dlopen passes through all of these probes, which means
that GDB will be triggered by the inner reloc_complete probe.

When the reloc_complete probe is hit the following things happen:

First, GDB tries to load information about only the most recently added
libraries.  To do this GDB tracks the known library list.  When
reloc_complete is hit glibc passes GDB a pointer to the new library,
which is part of a doubly linked list.  GDB follows the back pointer
for the new library and expects the previous library to be the last
library that GDB knows was loaded.

However, in our problem case this is not true.  The first library
(libfoo) has already been added to the library list, but has not yet
been announced (with reloc_complete) to GDB.  GDB is seeing the
reloc_complete probe for libbar.
However, within glibc's data structure, the previous library is libfoo,
and this is why we see the following warning from GDB:

  warning: Corrupted shared library list: 0x7ffff7ffd988 != 0x405ee0

Now, when GDB emits that warning it falls back to performing a complete
reload of all the shared libraries.  This is done by walking glibc's
data structure to find all the libraries.  This will include libfoo,
which has not yet been relocated.

Unfortunately, there is nothing in glibc's data structure (that is
visible to GDB) that can tell us that libfoo is not yet relocated.  As
a result, GDB will believe that libfoo has been fully relocated, and
will announce the library to the user.

This test shows that the library is not fully relocated by stopping on
the solib event, watching for GDB to tell us that libfoo has been
loaded, and then printing a global variable from within the library.
The global variable happens to be initialised with a pointer value, and
so will not be correct unless relocation has been performed.  As we
see, GDB can observe the global in an uninitialised state.

I don't know if there are wider implications from GDB seeing the
library load earlier than it should.  We can, for sure, load the debug
information at this point -- could we get anything wrong as a result of
relocation having not been completed yet?  We could potentially trigger
the loading of Python extensions from the library, and this for sure
could run into problems if the Python code reads any globals that it
expects to be initialised.

In terms of fixing this, the only options I see would require GDB to be
_more_ trusting of glibc, and even then, I don't think the solution
would be perfect.
We could track the reloc_start/reloc_complete pairs to try and track
recursion, and thus ignore libraries that have not been relocated yet,
but this would mean we could not fall back (as we currently do) to just
"reload everything" when we see some unexpected state -- as
"everything" can include libraries that are not relocated yet.

Also, if we attach to a process we're stuck: the only option is to walk
the library list and "reload everything", but at that point we might
end up finding a library that is not relocated yet.

Ultimately, the right solution is for glibc to ensure that the library
is only added to the library list just prior to hitting the
reloc_complete probe.  Well, to maintain the existing API, I think
glibc would need to add the library to the list just prior to
map_complete, then remove the library again just after reloc_start,
before adding the libraries again at reloc_complete -- which really
sucks.  Or maybe glibc needs to be smarter and "preallocate" its
required memory ahead of time before mapping and relocating the
library...
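To make the failure mode in this report concrete, here is a toy Python
model of the incremental update and its "reload everything" fallback.
It is only a sketch under stated assumptions, not GDB or glibc code:
the library list is a plain Python list rather than a doubly linked
list, the warning text is made up, and every name (Lib, Debugger,
on_reloc_complete, run_scenario) is hypothetical.

```python
# Toy model only: not GDB or glibc code.  All names are hypothetical.

class Lib:
    def __init__(self, name, relocated=False):
        self.name = name
        # The model tracks this flag; the modelled debugger never reads it,
        # mirroring the fact that GDB cannot see relocation state.
        self.relocated = relocated


class Debugger:
    def __init__(self, link_map):
        self.link_map = link_map      # stands in for glibc's library list
        self.known = list(link_map)   # libraries already announced
        self.warnings = []

    def on_reloc_complete(self, new_lib):
        """Incremental update, falling back to a full reload on a mismatch."""
        idx = self.link_map.index(new_lib)
        prev = self.link_map[idx - 1] if idx > 0 else None
        last_known = self.known[-1] if self.known else None
        if prev is last_known:
            self.known.append(new_lib)
        else:
            name = prev.name if prev is not None else "none"
            self.warnings.append("unexpected predecessor: " + name)
            # Fallback: walk the whole list, unrelocated entries included.
            self.known = list(self.link_map)


def run_scenario():
    ld = Lib("ld.so", relocated=True)
    link_map = [ld]
    dbg = Debugger(link_map)

    # Outer dlopen: libfoo is mapped and listed, but not yet relocated.
    libfoo = Lib("libfoo")
    link_map.append(libfoo)

    # Nested dlopen from malloc: libbar is mapped, relocated and announced.
    libbar = Lib("libbar")
    link_map.append(libbar)
    libbar.relocated = True
    dbg.on_reloc_complete(libbar)

    return dbg, libfoo


if __name__ == "__main__":
    dbg, libfoo = run_scenario()
    print(dbg.warnings)
    print([lib.name for lib in dbg.known], "libfoo relocated:", libfoo.relocated)
```

In this model the fallback leaves libfoo in the debugger's known list
while its relocated flag is still false, which corresponds to the early
announcement of libfoo described in this report.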