public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed
* Fix slow tls access after dlopen
@ 2023-01-06 10:06 Paul Zimmermann
  2023-01-06 16:36 ` Szabolcs Nagy
  0 siblings, 1 reply; 2+ messages in thread
From: Paul Zimmermann @ 2023-01-06 10:06 UTC (permalink / raw)
  To: nsz; +Cc: libc-alpha

       Hi Szabolcs,

please could you repost [1] as a distinct patch as requested by Carlos?

This issue affects the SageMath software tool, with a major slowdown:

https://trac.sagemath.org/ticket/34850

(see comment 97 which provides a short C program to reproduce the issue).

Best regards,
Paul

[1] https://patchwork.ozlabs.org/project/glibc/patch/b116855de71098ef7dd2875dd3237f8f3ecc12c2.1618301209.git.szabolcs.nagy@arm.com/

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: Fix slow tls access after dlopen
  2023-01-06 10:06 Fix slow tls access after dlopen Paul Zimmermann
@ 2023-01-06 16:36 ` Szabolcs Nagy
  0 siblings, 0 replies; 2+ messages in thread
From: Szabolcs Nagy @ 2023-01-06 16:36 UTC (permalink / raw)
  To: Paul Zimmermann; +Cc: nsz, libc-alpha

The 01/06/2023 11:06, Paul Zimmermann via Libc-alpha wrote:
>        Hi Szabolcs,
> 
> please could you repost [1] as a distinct patch as requested by Carlos?
> 
> This issue affects the SageMath software tool, with a major slowdown:
> 
> https://trac.sagemath.org/ticket/34850
> 
> (see comment 97 which provides a short C program to reproduce the issue).

i see, thanks.

> 
> Best regards,
> Paul
> 
> [1] https://patchwork.ozlabs.org/project/glibc/patch/b116855de71098ef7dd2875dd3237f8f3ecc12c2.1618301209.git.szabolcs.nagy@arm.com/

yeah i resent that patch (only updated the commit message):
https://patchwork.ozlabs.org/project/glibc/patch/20221019111439.63501-1-szabolcs.nagy@arm.com/

but it requires more work (to document the new sync mechanism).
and requires actual review. and i think there is a better way but
there are non-trivial trade-offs so reviewing the design would help:

in tls access slow path we currently update the thread dtv up to the
generation of the accessed module, but always do the slow path if the
dtv is older than the global generation. the plan is to update the dtv
to the global generation (latest dlopen/dlclose), this means we don't
keep hitting the slow path after a dlopen, but we do hit it once after
every dlopen/dlclose (even though an independent library is loaded,
not related to the current tls access, this is ok if dlopen/dlclose of
libs with tls is rare).

dtv update does two things: 1. resize the dtv array (of the current
thread) upto max modid + some, 2. go through the global dtv slotinfo
list and if the generation of entry i is newer than the dtv generation
then free/reset dtv[i]. (note that dtv[i] is not allocated here, it's
either reset or left alone.)

the original patch in bugzilla and my patch tries to do the dtv update
with lock-free atomics which means the global generation count has to
be accessed with acquire-load and release-store and a lot of racy
loads need to use relaxed mo atomics to avoid data races. this works
except out-of-thin-air values may break the code. OOTA value is only a
theoretical issue with relaxed mo access i think, here it can cause ub
down the line and that ub can cause arbitrary value to be stored
concurrently justifying the OOTA value in the first place, this is a
bit ugly and the use of atomics makes the code harder to read, but
otherwise i don't see a problem with this approach.

another approach is to take GL(dl_load_tls_lock) in tls access slow
path and then the sync is much simpler (no nasty relaxed atomics).
the drawback is that malloc/free is used under the lock which may
be interposed such that they use GD tls which makes lock ordering
deadlock more likely (note such issue already exists in dlopen and
dlclose and tls access already uses the same lock and malloc/free
in tls_get_addr_tail, however now the malloc/free would be under
the lock so deadlock can happen between two threads only doing tls
access, no dlopen/dlclose is needed). another problem is lock
contention when there are many tls accesses concurrently with a
dlopen/dlclose that bumps the generation count so all those accesses
have to wait for each other and dlopen/dlclose for the same lock.

so using a lock is not ideal but if we do bigger changes it may work
or we may get away with simpler code:

- ideally an independent dlopen should not slow tls access down.
  i think this can be done by updating dtv[i] in all threads at
  dlclose time and leaving dtv resize for the tls access slow path.
  (moving dtv resize to dlopen/dlclose too might work but the sync
  with tls access is non-trivial then), then generation count is not
  needed at all. dtv[i] is valid if dtv array is large enough. first
  access still needs to allocate. dlclose will be slower and has to
  sync with thread creation and exit and with dtv resize at tls
  access. this is a reasonable trade-off i think. requires dtv resize
  change to avoid the malloc lock ordering problem (not use realloc
  or use a per-thead-lock to sync with dlclose) and target specific
  asm change (at least for tlsdesc) and removal of all the slotinfo
  generation count code.

- tls_get_addr_tail should not use the tls lock: dlopen can just
  set up map->l_tls_offset and have it const during the lifetime of
  the module. (i'm not sure why tls offset is set up lazily i dont
  think it buys us anything but i might missed something around dl
  namespaces).

- _dl_update_slotinfo is called outside of tls access unnecessarily
  (in dlopen and reloc processing) i think these should be removed.
  what might make sense instead is to update the dtv once in rtld when
  the process is still single threaded and all modules are relocated
  but ctors are not yet run (this can ensure that libs loaded at start
  time to have reliable and as-safe GD tls access: they will use
  static tls without any locks or malloc on first access).

- completely avoiding locks and malloc in GD tls access would require
  upfront allocation of tls for (and sync with) all threads, this is
  likely too much memory use (and dlopen sync overhead), but even
  with lazy tls allocation e.g. we can guarantee as-safe, no-malloc,
  no-failure GD tls access when tls was already accessed in that
  thread (so independent dlopen cannot force a dtv update only the
  first access may require lock/malloc). not sure how useful this is
  but then we can make dlsym report allocation failure for tls syms
  so users can do an initial tls access in relevant threads to ensure
  fast and reliable behaviour later (even in signal handlers).

if we want a fix for this release i can repost my previous patch (with
lock-free atomics that is likely backportable) but at some point a
larger rewrite would be better i think.

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2023-01-06 16:37 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-06 10:06 Fix slow tls access after dlopen Paul Zimmermann
2023-01-06 16:36 ` Szabolcs Nagy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).