fixing dynamic tls races (BZ 19329 and friends)

public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed

* fixing dynamic tls races (BZ 19329 and friends)
@ 2020-12-24 16:55 Szabolcs Nagy
  2020-12-24 20:07 ` Florian Weimer
  2020-12-30  9:43 ` Szabolcs Nagy
  0 siblings, 2 replies; 5+ messages in thread
From: Szabolcs Nagy @ 2020-12-24 16:55 UTC (permalink / raw)
  To: libc-alpha

i'm trying to respin my patches for bug 19329 but
there may be disagreement about the approach and i may
not have time to finish this i wrote down a summary:

patches:
https://sourceware.org/legacy-ml/libc-alpha/2016-11/msg01093.html
https://sourceware.org/legacy-ml/libc-alpha/2016-11/msg01092.html
test case:
https://sourceware.org/legacy-ml/libc-alpha/2016-12/msg00456.html
(nptl/tst-stack4 sometimes shows the issue too)
discussions:
https://sourceware.org/pipermail/libc-alpha/2016-December/076990.html
https://sourceware.org/pipermail/libc-alpha/2017-January/077988.html

in short, the crux of the problem is that dtv updates
at thread creation and tls access time use dynlinker
internal state that is normally protected by locks.
this seems to be by design (the state is versioned by
a generation counter that can allow consistent view
without locks), but the synchronization is missing.

the deeper i dig into this the more issues pop up so i
mapped out some of them that can be relevant to a fix:

1) data races: tls related global dynamic linker state
that is modified under locks in dlopen and dlclose are
accessed without locks or atomics in pthread_create
and at tls access. (early startup does not take locks
either but that is fine: it is single-threaded). one
particular issue is that link_map structs that are
accessed without locks may be concurrently freed by
dlclose, this cannot be fixed just by using atomics.
globals:
  GL(dl_tls_generation) - size_t gen counter
  GL(dl_tls_max_dtv_idx) - size_t max modid
  GL(dl_tls_dtv_slotinfo_list) - (gen,map) list
    link_maps are accessed via this list.
locks:
  GL(dl_load_lock)
  GL(dl_load_write_lock)
slotinfo changes under lock:
  _dl_add_to_slotinfo - dlopen
  remove_slotinfo - dlclose
slotinfo and link_map access without locks:
  _dl_update_slotinfo - tls access
  tls_get_addr_tail - tls access
  _dl_allocate_tls_init - pthread_create
(similar story for generation counter and modid)
bugs:
  BZ 19329 - pthread_create vs dlopen
  BZ 27111 - pthread_create vs dlclose

2) fail safety: oom failures can abort pthread_create,
dlopen and tls access. tls access cannot poropagate
failures so it must not call failing apis, but lazy
tls and dtv allocation is a core part of the current
design. there is another abort issue: the generation
counter has overflow checks, which can realistically
trigger on 32bit targets. (64bit counter can fix this)
oom abort:
  _dl_resize_dtv - tls access, pthread_create, dlopen
  allocate_and_init - tls access
  _dl_add_to_slotinfo - dlopen
gen count abort:
  update_tls_slotinfo - dlopen
  dl_close_worker - dlclose
bugs:
  BZ 16134 - oom abort
  BZ 27112 - gen overflow

3) as-safety: tls access should be as-safe, but now it
may call malloc/realloc/free or take the dlopen lock.
bug:
  BZ 16133

i tried to fix BZ 19329 4 years ago without changing
the design, just introducing atomics where necessary.
the outcome is a bit hairy and does not solve the
dlclose problem (BZ 27111). (one issue with this is
that all accesses to the shared state has to be atomic
even the ones that are under locks, but sometimes
those can be relaxed atomic since there is no further
ordering requirement other than what is enforced by
the locks. unfortunately these accesses are scattered
around many places. but i still think this is a viable
approach.)

another approach is to take the dlopen lock at thread
creation and tls access time. this increases the
contention on that lock and may cause new deadlocks
because user code (interposed malloc and ctors) may
run while that lock is held. however this solution is
easier to reason about.

with some design changes the solution may get simpler:

is it necessary to free a link_map with static tls at
dlclose time? accessing such link_maps are the main
trouble at thread creation time (because static tls
has to be initialized early (?)), if dlclosing such
libs are rare then simply not freeing them avoids the
issues at a small cost. (however in glibc there is
dlmopen and static tls optimizations which make it
less obvious how bad this is.)

tlsdesc allows not only the allocation of tls/dtv, but
the symbol lookup and tls descriptor setup to be done
lazily too. this means allocating and accessing
map->l_mach.tlsdesc_table under locks, slotinfo list
walking and various other shared memory accesses at
tls access time. on arm and aarch64 this got removed
because it does not allow fast tls access with the
weak memory model, but on x86 lazy tlsdesc reloc is
still the default. (removing this does not fully fix
any of the bugs: the dtv update at tls access time
still has all the issues, but changing this would
reduce the complexity).

what is the verdict on as-safety of (dynamic) tls
access? currently this is not supported but it is not
clear if i can introduce new locks there or not.

can we avoid the generation count overflow checks if
it is 64bit? (it only counts up on dlopen/dlclose
time, i think it cannot realistically overflow.)

if i get time to dust down my old patches i will post
them, meanwhile feedback is welcome.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: fixing dynamic tls races (BZ 19329 and friends)
  2020-12-24 16:55 fixing dynamic tls races (BZ 19329 and friends) Szabolcs Nagy
@ 2020-12-24 20:07 ` Florian Weimer
  2020-12-30  9:43 ` Szabolcs Nagy
  1 sibling, 0 replies; 5+ messages in thread
From: Florian Weimer @ 2020-12-24 20:07 UTC (permalink / raw)
  To: Szabolcs Nagy via Libc-alpha

* Szabolcs Nagy via Libc-alpha:

> 2) fail safety: oom failures can abort pthread_create,
> dlopen and tls access. tls access cannot poropagate
> failures so it must not call failing apis, but lazy
> tls and dtv allocation is a core part of the current
> design. there is another abort issue: the generation
> counter has overflow checks, which can realistically
> trigger on 32bit targets. (64bit counter can fix this)
> oom abort:
>   _dl_resize_dtv - tls access, pthread_create, dlopen
>   allocate_and_init - tls access
>   _dl_add_to_slotinfo - dlopen
> gen count abort:
>   update_tls_slotinfo - dlopen
>   dl_close_worker - dlclose
> bugs:
>   BZ 16134 - oom abort
>   BZ 27112 - gen overflow

I think abort on TLS access is the only controversial matter.  The
only solution to that issue is upfront allocation, and some developers
seem to use lazy TLS as a way to produce allocations for only those
threads that need it.

The other things are just bugs that should be fixed, but the lazy
allocation issue is a genuine disagreement about how interfaces should
behave.

Maybe if we provide an easy-to-use way to deallocate TLS resources,
programmers could allocate more per-thread resources on demand, with
proper error checking.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: fixing dynamic tls races (BZ 19329 and friends)
  2020-12-24 16:55 fixing dynamic tls races (BZ 19329 and friends) Szabolcs Nagy
  2020-12-24 20:07 ` Florian Weimer
@ 2020-12-30  9:43 ` Szabolcs Nagy
  2021-02-18 14:58   ` Florian Weimer
  1 sibling, 1 reply; 5+ messages in thread
From: Szabolcs Nagy @ 2020-12-30  9:43 UTC (permalink / raw)
  To: libc-alpha

The 12/24/2020 16:55, Szabolcs Nagy via Libc-alpha wrote:
> i tried to fix BZ 19329 4 years ago without changing
> the design, just introducing atomics where necessary.
> the outcome is a bit hairy and does not solve the
> dlclose problem (BZ 27111). (one issue with this is
> that all accesses to the shared state has to be atomic
> even the ones that are under locks, but sometimes
> those can be relaxed atomic since there is no further

this likely needs to wait after the release, but to
avoid this fix to blow up in size i'd like to propose
a change to the glibc concurrency rules:

https://sourceware.org/glibc/wiki/Concurrency

currently it says:

- "All accesses to an object should use atomics if
  there is at least one atomic access."

- "Document at least [..] Why a certain memory order
  (MO) is sufficient. This especially applies to
  relaxed MO: why isn't a stronger MO required?"

i think we should allow non-racing memory accesses
to be non-atomic even if the same object may be
accessed atomically at some point. otherwise the
relaxed mo atomic annotations would make the dynamic
linker code harder to read and reason about.

=== c/c++ vs glibc memory model:

in c/c++ atomics is decided based on the object type.
(this allows different representation and alignment
for atomic and non-atomic types e.g. in case atomics
is implemented via locks.) as a consequence atomic
and non-atomic accesses cannot be mixed to an object.

glibc does not use atomic types, atomic is per access
(since there are abi types like pthread_spinlock_t
that are non-atomic historically but must be accessed
atomically). glibc only requires lock free atomic
access to naturally aligned int/unsigned/pointer.
this allows mixing atomic and non-atomic accesses.

=== problematic case:

a common pattern in the dynamic tls handling code is
to use shared variables as follows

(LR) many read accesses behind locks
(LW) few key write accesses behind locks
(NR) few key read accesses without locks

in such design (LW) and (NR) can be in a data-race
and need atomics for synchronization (acquire/release
mo pairing is usually enough to order the surrounding
memory accesses, but this requires careful analysis
on a case-by-case basis). however (LR) cannot be in a
data-race, c/c++ would allow non-atomic access, other
than the type based atomic requirement. in conforming
c/c++ code (LR) would typically be relaxed mo. i find
this problematic and something we can avoid in glibc:
relaxed mo is usually a big red flag due to unclear
and complicated semantics. however non-racing access
such as (LR) is an easy to check case that is common
and can be non-atomic. if we were to turn (LR) to
relaxed mo then glibc will be littered with red flags
and it will be hard to see which are the real tricky
relaxed mo accesses.

reading the list of identified valid uses of relaxed
mo atomics

http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2020/p2055r0.pdf

it seems relaxed atomics can even introduce problems
compared to non-atomic access when it is used for a
non-racing access (see 2.1 there).

=== data point:

in the dynamic tls code at least these objects need
to be atomic:

  GL(dl_tls_generation)
  GL(dl_tls_max_dtv_idx)
  dtv_slotinfo_list->next
  dtv_slotinfo_list->slotinfo[i].gen
  dtv_slotinfo_list->slotinfo[i].map

there are about 100 memory accesses to these objects
most of which are (LR) type: so fixing BZ 19329 would
need to add close to 100 relaxed mo atomic loads with
explanation.

=== proposal:

my proposal is to allow non-racing accesses to be
non-atomic in glibc even if the same object needs
atomics elsewhere (this seems to be standard practice
at least in the linux kernel). i think this rule also
covers early, single-threaded initialization, but not
concurrent initialization (like lazy binding).

=== alternatives:

use non_atomic_load (&obj) annotations instead of
relaxed. (but even this is very intrusive.)

redesign the dynamic tls handling code. (this is
not obvious.)

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: fixing dynamic tls races (BZ 19329 and friends)
  2020-12-30  9:43 ` Szabolcs Nagy
@ 2021-02-18 14:58   ` Florian Weimer
  2021-02-19  2:46     ` Carlos O'Donell
  0 siblings, 1 reply; 5+ messages in thread
From: Florian Weimer @ 2021-02-18 14:58 UTC (permalink / raw)
  To: Szabolcs Nagy via Libc-alpha

* Szabolcs Nagy via Libc-alpha:

> The 12/24/2020 16:55, Szabolcs Nagy via Libc-alpha wrote:
>> i tried to fix BZ 19329 4 years ago without changing
>> the design, just introducing atomics where necessary.
>> the outcome is a bit hairy and does not solve the
>> dlclose problem (BZ 27111). (one issue with this is
>> that all accesses to the shared state has to be atomic
>> even the ones that are under locks, but sometimes
>> those can be relaxed atomic since there is no further
>
> this likely needs to wait after the release, but to
> avoid this fix to blow up in size i'd like to propose
> a change to the glibc concurrency rules:
>
> https://sourceware.org/glibc/wiki/Concurrency
>
> currently it says:
>
> - "All accesses to an object should use atomics if
>   there is at least one atomic access."
>
> - "Document at least [..] Why a certain memory order
>   (MO) is sufficient. This especially applies to
>   relaxed MO: why isn't a stronger MO required?"
>
> i think we should allow non-racing memory accesses
> to be non-atomic even if the same object may be
> accessed atomically at some point. otherwise the
> relaxed mo atomic annotations would make the dynamic
> linker code harder to read and reason about.

I agree.  We really need absence of out-of-thin-air values for relaxed
MO loads, which is required for all kinds of things like a working
concurrent fork, or the torn read inside the kernel futex code for our
RW lock implementation.  The C/C++ memory model does not guarantee the
absence of out-of-thin-air values (see the paper you referenced).  So
I'm not sure to what extent it is helpful for glibc development.

So I think using direct accesses for relaxed loads is fine, as long as
there are compiler barriers (explicit or implied) that ensure that the
loads actually happen (and the stores do not happen unexpectedly).

I think the TLS ABI even needs load-load dependency tracking (or what it
is called, I mean the thing that the Alpha is for not having).  We had
to remove a barrier required by the C/C++ memory model and rely on the
load dependency instead, for performance reasons.  I expect TLS access
to be similar in that regard, in that we don't want an acquire barrier
there, either.

> === problematic case:
>
> a common pattern in the dynamic tls handling code is
> to use shared variables as follows
>
> (LR) many read accesses behind locks
> (LW) few key write accesses behind locks
> (NR) few key read accesses without locks
>
> in such design (LW) and (NR) can be in a data-race
> and need atomics for synchronization (acquire/release
> mo pairing is usually enough to order the surrounding
> memory accesses, but this requires careful analysis
> on a case-by-case basis). however (LR) cannot be in a
> data-race, c/c++ would allow non-atomic access, other
> than the type based atomic requirement. in conforming
> c/c++ code (LR) would typically be relaxed mo. i find
> this problematic and something we can avoid in glibc:
> relaxed mo is usually a big red flag due to unclear
> and complicated semantics. however non-racing access
> such as (LR) is an easy to check case that is common
> and can be non-atomic. if we were to turn (LR) to
> relaxed mo then glibc will be littered with red flags
> and it will be hard to see which are the real tricky
> relaxed mo accesses.

Seems reasonable.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: fixing dynamic tls races (BZ 19329 and friends)
  2021-02-18 14:58   ` Florian Weimer
@ 2021-02-19  2:46     ` Carlos O'Donell
  0 siblings, 0 replies; 5+ messages in thread
From: Carlos O'Donell @ 2021-02-19  2:46 UTC (permalink / raw)
  To: Florian Weimer, Szabolcs Nagy via Libc-alpha

On 2/18/21 9:58 AM, Florian Weimer via Libc-alpha wrote:
> * Szabolcs Nagy via Libc-alpha:
> 
>> The 12/24/2020 16:55, Szabolcs Nagy via Libc-alpha wrote:
>>> i tried to fix BZ 19329 4 years ago without changing
>>> the design, just introducing atomics where necessary.
>>> the outcome is a bit hairy and does not solve the
>>> dlclose problem (BZ 27111). (one issue with this is
>>> that all accesses to the shared state has to be atomic
>>> even the ones that are under locks, but sometimes
>>> those can be relaxed atomic since there is no further
>>
>> this likely needs to wait after the release, but to
>> avoid this fix to blow up in size i'd like to propose
>> a change to the glibc concurrency rules:
>>
>> https://sourceware.org/glibc/wiki/Concurrency
>>
>> currently it says:
>>
>> - "All accesses to an object should use atomics if
>>   there is at least one atomic access."
>>
>> - "Document at least [..] Why a certain memory order
>>   (MO) is sufficient. This especially applies to
>>   relaxed MO: why isn't a stronger MO required?"

The intent of writing the "Concurrency" section was to start
a more formal description of how glibc would handle concurrency.

If we are going to do something slightly different for entirely
logical reasons (because all modern hardware ensures it works)
then that makes perfect sense, but we should update the wiki
to reflect any changes in our expectations.

>> i think we should allow non-racing memory accesses
>> to be non-atomic even if the same object may be
>> accessed atomically at some point. otherwise the
>> relaxed mo atomic annotations would make the dynamic
>> linker code harder to read and reason about.
> 
> I agree.  We really need absence of out-of-thin-air values for relaxed
> MO loads, which is required for all kinds of things like a working
> concurrent fork, or the torn read inside the kernel futex code for our
> RW lock implementation.  The C/C++ memory model does not guarantee the
> absence of out-of-thin-air values (see the paper you referenced).  So
> I'm not sure to what extent it is helpful for glibc development.
> 
> So I think using direct accesses for relaxed loads is fine, as long as
> there are compiler barriers (explicit or implied) that ensure that the
> loads actually happen (and the stores do not happen unexpectedly).
> 
> I think the TLS ABI even needs load-load dependency tracking (or what it
> is called, I mean the thing that the Alpha is for not having).  We had
> to remove a barrier required by the C/C++ memory model and rely on the
> load dependency instead, for performance reasons.  I expect TLS access
> to be similar in that regard, in that we don't want an acquire barrier
> there, either.
> 
>> === problematic case:
>>
>> a common pattern in the dynamic tls handling code is
>> to use shared variables as follows
>>
>> (LR) many read accesses behind locks
>> (LW) few key write accesses behind locks
>> (NR) few key read accesses without locks
>>
>> in such design (LW) and (NR) can be in a data-race
>> and need atomics for synchronization (acquire/release
>> mo pairing is usually enough to order the surrounding
>> memory accesses, but this requires careful analysis
>> on a case-by-case basis). however (LR) cannot be in a
>> data-race, c/c++ would allow non-atomic access, other
>> than the type based atomic requirement. in conforming
>> c/c++ code (LR) would typically be relaxed mo. i find
>> this problematic and something we can avoid in glibc:
>> relaxed mo is usually a big red flag due to unclear
>> and complicated semantics. however non-racing access
>> such as (LR) is an easy to check case that is common
>> and can be non-atomic. if we were to turn (LR) to
>> relaxed mo then glibc will be littered with red flags
>> and it will be hard to see which are the real tricky
>> relaxed mo accesses.
> 
> Seems reasonable.
 
We should update Concurrency in the wiki to reflect
this direction, either specifically for the TLS code,
or more generally for the project.

I'm happy to review the changes.

-- 
Cheers,
Carlos.


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2021-02-19  2:46 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-24 16:55 fixing dynamic tls races (BZ 19329 and friends) Szabolcs Nagy
2020-12-24 20:07 ` Florian Weimer
2020-12-30  9:43 ` Szabolcs Nagy
2021-02-18 14:58   ` Florian Weimer
2021-02-19  2:46     ` Carlos O'Donell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).