From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: by sourceware.org (Postfix, from userid 48) id 57D693858C50; Sat, 11 Mar 2023 11:15:40 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 57D693858C50 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1678533340; bh=aUNEBwrASE0ZeJp8BHtj+G+K/c6JBoVJu53gcjaemFE=; h=From:To:Subject:Date:In-Reply-To:References:From; b=F023e460eo3Z6PdcG78tqHOMHA5Nzn5GAlXdoxX1pssK2owwNAFvUkvxbPdsYnuGq e9pO/SqR6BCBF1h67dq5Iptxr1CcKTGDFu44IPWFbx+sny2nU+qIZJ56amNwi9Vu8G LiNP4KIbmh9G1BKBUAdS6DTy79PkxVWwJAnhNpJE= From: "parky at outlook dot com" To: glibc-bugs@sourceware.org Subject: [Bug dynamic-link/30186] RTLD_DEEPBIND interacts badly with LD_PRELOAD Date: Sat, 11 Mar 2023 11:15:38 +0000 X-Bugzilla-Reason: CC X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: glibc X-Bugzilla-Component: dynamic-link X-Bugzilla-Version: unspecified X-Bugzilla-Keywords: X-Bugzilla-Severity: normal X-Bugzilla-Who: parky at outlook dot com X-Bugzilla-Status: UNCONFIRMED X-Bugzilla-Resolution: X-Bugzilla-Priority: P2 X-Bugzilla-Assigned-To: unassigned at sourceware dot org X-Bugzilla-Target-Milestone: --- X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: http://sourceware.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 List-Id: https://sourceware.org/bugzilla/show_bug.cgi?id=3D30186 --- Comment #4 from Matthew Parkinson --- (In reply to Adhemerval Zanella from comment #3) Thank you for spending your time on this issue.=20=20 > This essentially adds another PLT indirection on *all* memory allocations, > even for default binaries that do no dlopen or do not use RTLD_DEEPBIND. = I > don't think this is a good tradeoff. >=20 > > The obvious question is how much performance would be lost in adding the > > indirection. It would not affect existing allocator overrides as they = could > > override both, but for programs using the libc allocator, there would b= e a > > cost. >=20 > I don't think the performance overhead is acceptable, and I am almost sure > other maintainers will frown upon it as well. Thanks for the feedback. It is really useful to know. I have modified the approach to reduce the cost, so I would really appreciate your opinion on t= his.=20 I have put up a simple example here: https://github.com/mjp41/deepbindexample/tree/main/solutionopt The basic idea is to have a library local global variable that specifies if= a direct call should be performed. With this approach the common case would = have a single double PLTed call, and then all subsequent calls would be direct (= and potentially inlinable by the compiler. Here is the main part of the code t= hat would be like a libc malloc implementation: ``` #include "stdio.h" #include "stdbool.h" __attribute__((visibility("hidden"))) bool direct_call =3D false; __attribute__((visibility("hidden"))) void message_impl() { puts("lib.c: message_impl"); } extern void internal_message() { direct_call =3D true; puts("lib.c: internal_message -> message_impl"); message_impl(); } extern void message() { if (!direct_call) { puts("lib.c: message -> internal_message"); internal_message(); return; } puts("lib.c: message -> message_impl"); message_impl(); } ``` The first call to `message` will call `internal_message` and then the implementation `message_impl`. The call to `internal_message` will set the library local variable `direct_call` and then any subsequent call to `messa= ge` can directly call `message_impl`. I think this would have similar performance to the `malloc_hook` code, which tested if the hook had been overwritten, and if it had called it. So this seems closer to acceptable performance, but still introduces a load, comparison and jump. One x64 it adds: ``` 1188: 80 3d a2 2e 00 00 00 cmpb $0x0,0x2ea2(%rip) # 4= 031 118f: 74 1f je 11b0 ``` I am not familiar with all the changes to dlmalloc that are in ptmalloc, bu= t in dlmalloc there is some initialisation checking code that could probably be combined with this test to improve performance. > >=20 > > Many allocators already provide symbols for __libc_malloc, e.g.: > > https://github.com/microsoft/mimalloc/blob/ > > dd7348066fe40e8bf372fa4e9538910a5e24a75f/src/alloc-override.c#L273-L285 > >=20 > > https://github.com/jemalloc/jemalloc/blob/ > > 09e4b38fb1f9a9b505e35ac13b8f99282990bc2c/src/jemalloc.c#L3143-L3175 > >=20 > > So these would just start working with RTLD_DEEPBIND. >=20 > __libc_malloc is a strong alias to malloc, so there is no need to override > for shared libraries. Jemalloc states it is required to enable static > linking, which I am not sure it is true any longer (as long the malloc > implementation implements all requires functions). Good to know, thank you. > I really think a tunable or an extra option on LD_PRELOAD would be better= to > instruct that RTLD_DEEPBIND should always consider LD_PRELOADS libraries = on > namespace resolution.=20=20 I think long term this would address a wider range of problems, and am completely in favour of this. I am concerned it might be a lot of work, so = take a while to address? Hence, why I am considering allocator specific solutio= ns. > It has the advantage of being backportable (where a > new symbol indirection would require another set of symbols). Just to clarify. Are you saying patching old versions cannot introduce new symbols? Would that be true if `__libc_malloc` and `malloc` were unaliased= ?=20 I.e. would repurposing the symbol be a valid backport? --=20 You are receiving this mail because: You are on the CC list for the bug.=