* Fix slow tls access after dlopen
@ 2023-01-06 10:06 Paul Zimmermann
2023-01-06 16:36 ` Szabolcs Nagy
0 siblings, 1 reply; 2+ messages in thread
From: Paul Zimmermann @ 2023-01-06 10:06 UTC (permalink / raw)
To: nsz; +Cc: libc-alpha
Hi Szabolcs,
please could you repost [1] as a distinct patch as requested by Carlos?
This issue affects the SageMath software tool, with a major slowdown:
https://trac.sagemath.org/ticket/34850
(see comment 97 which provides a short C program to reproduce the issue).
Best regards,
Paul
[1] https://patchwork.ozlabs.org/project/glibc/patch/b116855de71098ef7dd2875dd3237f8f3ecc12c2.1618301209.git.szabolcs.nagy@arm.com/
^ permalink raw reply [flat|nested] 2+ messages in thread
* Re: Fix slow tls access after dlopen
2023-01-06 10:06 Fix slow tls access after dlopen Paul Zimmermann
@ 2023-01-06 16:36 ` Szabolcs Nagy
0 siblings, 0 replies; 2+ messages in thread
From: Szabolcs Nagy @ 2023-01-06 16:36 UTC (permalink / raw)
To: Paul Zimmermann; +Cc: nsz, libc-alpha
The 01/06/2023 11:06, Paul Zimmermann via Libc-alpha wrote:
> Hi Szabolcs,
>
> please could you repost [1] as a distinct patch as requested by Carlos?
>
> This issue affects the SageMath software tool, with a major slowdown:
>
> https://trac.sagemath.org/ticket/34850
>
> (see comment 97 which provides a short C program to reproduce the issue).
i see, thanks.
>
> Best regards,
> Paul
>
> [1] https://patchwork.ozlabs.org/project/glibc/patch/b116855de71098ef7dd2875dd3237f8f3ecc12c2.1618301209.git.szabolcs.nagy@arm.com/
yeah i resent that patch (only updated the commit message):
https://patchwork.ozlabs.org/project/glibc/patch/20221019111439.63501-1-szabolcs.nagy@arm.com/
but it requires more work (to document the new sync mechanism).
and requires actual review. and i think there is a better way but
there are non-trivial trade-offs so reviewing the design would help:
in tls access slow path we currently update the thread dtv up to the
generation of the accessed module, but always do the slow path if the
dtv is older than the global generation. the plan is to update the dtv
to the global generation (latest dlopen/dlclose), this means we don't
keep hitting the slow path after a dlopen, but we do hit it once after
every dlopen/dlclose (even though an independent library is loaded,
not related to the current tls access, this is ok if dlopen/dlclose of
libs with tls is rare).
dtv update does two things: 1. resize the dtv array (of the current
thread) upto max modid + some, 2. go through the global dtv slotinfo
list and if the generation of entry i is newer than the dtv generation
then free/reset dtv[i]. (note that dtv[i] is not allocated here, it's
either reset or left alone.)
the original patch in bugzilla and my patch tries to do the dtv update
with lock-free atomics which means the global generation count has to
be accessed with acquire-load and release-store and a lot of racy
loads need to use relaxed mo atomics to avoid data races. this works
except out-of-thin-air values may break the code. OOTA value is only a
theoretical issue with relaxed mo access i think, here it can cause ub
down the line and that ub can cause arbitrary value to be stored
concurrently justifying the OOTA value in the first place, this is a
bit ugly and the use of atomics makes the code harder to read, but
otherwise i don't see a problem with this approach.
another approach is to take GL(dl_load_tls_lock) in tls access slow
path and then the sync is much simpler (no nasty relaxed atomics).
the drawback is that malloc/free is used under the lock which may
be interposed such that they use GD tls which makes lock ordering
deadlock more likely (note such issue already exists in dlopen and
dlclose and tls access already uses the same lock and malloc/free
in tls_get_addr_tail, however now the malloc/free would be under
the lock so deadlock can happen between two threads only doing tls
access, no dlopen/dlclose is needed). another problem is lock
contention when there are many tls accesses concurrently with a
dlopen/dlclose that bumps the generation count so all those accesses
have to wait for each other and dlopen/dlclose for the same lock.
so using a lock is not ideal but if we do bigger changes it may work
or we may get away with simpler code:
- ideally an independent dlopen should not slow tls access down.
i think this can be done by updating dtv[i] in all threads at
dlclose time and leaving dtv resize for the tls access slow path.
(moving dtv resize to dlopen/dlclose too might work but the sync
with tls access is non-trivial then), then generation count is not
needed at all. dtv[i] is valid if dtv array is large enough. first
access still needs to allocate. dlclose will be slower and has to
sync with thread creation and exit and with dtv resize at tls
access. this is a reasonable trade-off i think. requires dtv resize
change to avoid the malloc lock ordering problem (not use realloc
or use a per-thead-lock to sync with dlclose) and target specific
asm change (at least for tlsdesc) and removal of all the slotinfo
generation count code.
- tls_get_addr_tail should not use the tls lock: dlopen can just
set up map->l_tls_offset and have it const during the lifetime of
the module. (i'm not sure why tls offset is set up lazily i dont
think it buys us anything but i might missed something around dl
namespaces).
- _dl_update_slotinfo is called outside of tls access unnecessarily
(in dlopen and reloc processing) i think these should be removed.
what might make sense instead is to update the dtv once in rtld when
the process is still single threaded and all modules are relocated
but ctors are not yet run (this can ensure that libs loaded at start
time to have reliable and as-safe GD tls access: they will use
static tls without any locks or malloc on first access).
- completely avoiding locks and malloc in GD tls access would require
upfront allocation of tls for (and sync with) all threads, this is
likely too much memory use (and dlopen sync overhead), but even
with lazy tls allocation e.g. we can guarantee as-safe, no-malloc,
no-failure GD tls access when tls was already accessed in that
thread (so independent dlopen cannot force a dtv update only the
first access may require lock/malloc). not sure how useful this is
but then we can make dlsym report allocation failure for tls syms
so users can do an initial tls access in relevant threads to ensure
fast and reliable behaviour later (even in signal handlers).
if we want a fix for this release i can repost my previous patch (with
lock-free atomics that is likely backportable) but at some point a
larger rewrite would be better i think.
^ permalink raw reply [flat|nested] 2+ messages in thread
end of thread, other threads:[~2023-01-06 16:37 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-06 10:06 Fix slow tls access after dlopen Paul Zimmermann
2023-01-06 16:36 ` Szabolcs Nagy
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).