* [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
@ 2015-04-22 12:27 Szabolcs Nagy
2015-04-22 15:07 ` Adhemerval Zanella
` (3 more replies)
0 siblings, 4 replies; 27+ messages in thread
From: Szabolcs Nagy @ 2015-04-22 12:27 UTC (permalink / raw)
To: libc-alpha; +Cc: Marcus Shawcroft, Ramana Radhakrishnan
[-- Attachment #1: Type: text/plain, Size: 5072 bytes --]
Lazy TLSDESC initialization needs to be synchronized with concurrent TLS
accesses.
The original ARM TLSDESC design did not discuss the race that
arises on systems with weak memory ordering guarantees:
http://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-ARM.txt
"Storing the final value of the TLS descriptor also needs care: the
resolver field must be set to its final value after the argument gets
its final value, such that, if any thread attempts to use the
descriptor before it gets its final value, it still goes to the hold
function."
The current glibc code (i386, x86_64, arm, aarch64) is
td->arg = ...;
td->entry = ...;
The ARM memory model allows store-store reordering, so a barrier is
needed between these two stores. (The code is broken on x86 as well
in principle: the x86 memory model does not help at the C level,
since the compiler may reorder non-aliasing stores.)
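In other words, nothing prevents the effective order from becoming
(illustrative only; final_arg/final_entry stand for the values being
installed):
td->entry = final_entry;  /* a racing TLS access can now call this...     */
td->arg = final_arg;      /* ...before the argument has reached its value */
so a concurrent TLS access may call the final entry function while
still seeing a stale td->arg.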
What is missing from the TLSDESC design spec is a description of the
ordering requirements on the read side: the TLS access sequence
(generated by the compiler) loads the descriptor function pointer
(td->entry), and then that function loads the argument (td->arg).
These two loads must observe the changes made on the write side in a
sequentially consistent way. The code in asm:
ldr x1, [x0] // load td->entry
blr [x0] // call it
entryfunc:
ldr x0, [x0,#8] // load td->arg
The branch (blr) does not provide a load-load ordering barrier, so the
second load may observe an old arg even though the first load observed
the new entry; this happens on existing aarch64 hardware and causes
invalid memory accesses due to the incorrect TLS offset.
Various alternatives were considered to force the load ordering in the
descriptor function:
(1) barrier
entryfunc:
dmb ishld
ldr x0, [x0,#8]
(2) address dependency (if the address of the second load depends on the
result of the first one the ordering is guaranteed):
entryfunc:
ldr x1,[x0]
and x1,x1,#8
orr x1,x1,#8
ldr x0,[x0,x1]
(3) load-acquire (ARMv8 instruction that is ordered before subsequent
loads and stores)
entryfunc:
ldar xzr,[x0]
ldr x0,[x0,#8]
Option (1) is the simplest but slowest (note: this runs on every TLS
access). Options (2) and (3) do one extra load from [x0] (same-address
loads are ordered, so it happens-after the load at the call site).
Option (2) clobbers x1, so I think (3) is the best solution: a load
into the zero register has the same effect as a load into a normal
register, but the result is discarded so nothing is clobbered. (Note
that the TLSDESC ABI allows the descriptor function to clobber x1, but
current gcc does not implement this correctly; gcc will be fixed
independently. The dynamic and undefweak descriptors currently
save/restore x1, so only static TLS would be affected by this
clobbering issue.)
On the write side I used a full barrier (__sync_synchronize), although
dmb ishst
would be enough; the write side is not performance critical.
I introduced a new _dl_tlsdesc_return_lazy descriptor function for
lazily relocated static TLS, so the non-lazy use case is not affected.
The dynamic and undefweak descriptors do enough work that the
additional ldar should not have a performance impact.
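For reference, the required ordering can be summarized with a small
standalone C11 sketch (illustrative only: the struct mirrors glibc's
struct tlsdesc, the function names are made up, and in the real code
the read side cannot be written in C because the first load is emitted
by the compiler, which is why the ldar has to live inside the
descriptor function):

#include <stdatomic.h>
#include <stddef.h>

struct tlsdesc_sketch
{
  ptrdiff_t (*_Atomic entry) (struct tlsdesc_sketch *);
  void *arg;
};

/* Write side (lazy resolver): arg must be visible before the final
   entry function is published.  */
static void
publish (struct tlsdesc_sketch *td, void *arg,
         ptrdiff_t (*fn) (struct tlsdesc_sketch *))
{
  td->arg = arg;
  atomic_store_explicit (&td->entry, fn, memory_order_release);
}

/* Read side (what the compiler-generated access plus the entry
   function must amount to): the acquire pairs with the release above,
   so the called function sees the matching td->arg.  */
static ptrdiff_t
tls_access (struct tlsdesc_sketch *td)
{
  ptrdiff_t (*fn) (struct tlsdesc_sketch *)
    = atomic_load_explicit (&td->entry, memory_order_acquire);
  return fn (td);
}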
Other thoughts:
- Lazy binding for static TLS may be unnecessary complexity: it seems
that gcc generates at most two static TLSDESC relocation entries per
translation unit (zero- and non-zero-initialized TLS), so a lot of
object files with static TLS would have to be linked into a shared
object to make the load-time relocation overhead significant. Unless
there is some evidence that lazy static TLS relocation makes sense, I
would suggest removing that logic (from all archs). (The dynamic case
is probably a similar micro-optimization, but there the lazy vs
non-lazy semantics differ for undefined symbols and some applications
may depend on that.)
- 32-bit ARM with -mtls-dialect=gnu2 is affected too; it will have to
go with option (1) or (2), with an additional twist: some existing
ARMv7 CPUs don't implement same-address load ordering reliably, see:
http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A_a9_read_read.pdf
- I don't understand the role of the hold function in the general
TLSDESC design: why is it not enough to let other threads wait on the
global lock in the initial resolver function? Is the global dl lock
implemented as a spin lock? Is it for some liveness/fairness property?
- There are some incorrect uses of "cfi_adjust_cfa_offset" in
dl-tlsdesc.S; that will be a separate patch.
Changelog:
2015-04-22 Szabolcs Nagy <szabolcs.nagy@arm.com>
[BZ #18034]
* sysdeps/aarch64/dl-tlsdesc.h (_dl_tlsdesc_return_lazy): Declare.
* sysdeps/aarch64/dl-tlsdesc.S (_dl_tlsdesc_return_lazy): Define.
(_dl_tlsdesc_undefweak): Guarantee load-load ordering.
(_dl_tlsdesc_dynamic): Likewise.
* sysdeps/aarch64/tlsdesc.c (_dl_tlsdesc_resolve_rela_fixup): Add
barrier between stores.
[-- Attachment #2: tlsdesc.diff --]
[-- Type: text/x-patch; name=tlsdesc.diff, Size: 3289 bytes --]
diff --git a/sysdeps/aarch64/dl-tlsdesc.S b/sysdeps/aarch64/dl-tlsdesc.S
index be9b9b3..ff74b52 100644
--- a/sysdeps/aarch64/dl-tlsdesc.S
+++ b/sysdeps/aarch64/dl-tlsdesc.S
@@ -79,6 +79,25 @@ _dl_tlsdesc_return:
cfi_endproc
.size _dl_tlsdesc_return, .-_dl_tlsdesc_return
+ /* Same as _dl_tlsdesc_return but with synchronization for
+ lazy relocation.
+ Prototype:
+ _dl_tlsdesc_return_lazy (tlsdesc *) ;
+ */
+ .hidden _dl_tlsdesc_return_lazy
+ .global _dl_tlsdesc_return_lazy
+ .type _dl_tlsdesc_return_lazy,%function
+ cfi_startproc
+ .align 2
+_dl_tlsdesc_return_lazy:
+ /* The ldar guarantees ordering between the load from [x0] at the
+ call site and the load from [x0,#8] here for lazy relocation. */
+ ldar xzr, [x0]
+ ldr x0, [x0, #8]
+ RET
+ cfi_endproc
+ .size _dl_tlsdesc_return_lazy, .-_dl_tlsdesc_return_lazy
+
/* Handler for undefined weak TLS symbols.
Prototype:
_dl_tlsdesc_undefweak (tlsdesc *);
@@ -96,6 +115,9 @@ _dl_tlsdesc_return:
_dl_tlsdesc_undefweak:
str x1, [sp, #-16]!
cfi_adjust_cfa_offset(16)
+ /* The ldar guarantees ordering between the load from [x0] at the
+ call site and the load from [x0,#8] here for lazy relocation. */
+ ldar xzr, [x0]
ldr x0, [x0, #8]
mrs x1, tpidr_el0
sub x0, x0, x1
@@ -152,6 +174,9 @@ _dl_tlsdesc_dynamic:
stp x3, x4, [sp, #32+16*1]
mrs x4, tpidr_el0
+ /* The ldar guarantees ordering between the load from [x0] at the
+ call site and the load from [x0,#8] here for lazy relocation. */
+ ldar xzr, [x0]
ldr x1, [x0,#8]
ldr x0, [x4]
ldr x3, [x1,#16]
diff --git a/sysdeps/aarch64/dl-tlsdesc.h b/sysdeps/aarch64/dl-tlsdesc.h
index 7a1285e..e6c0078 100644
--- a/sysdeps/aarch64/dl-tlsdesc.h
+++ b/sysdeps/aarch64/dl-tlsdesc.h
@@ -46,6 +46,9 @@ extern ptrdiff_t attribute_hidden
_dl_tlsdesc_return (struct tlsdesc *);
extern ptrdiff_t attribute_hidden
+_dl_tlsdesc_return_lazy (struct tlsdesc *);
+
+extern ptrdiff_t attribute_hidden
_dl_tlsdesc_undefweak (struct tlsdesc *);
extern ptrdiff_t attribute_hidden
diff --git a/sysdeps/aarch64/tlsdesc.c b/sysdeps/aarch64/tlsdesc.c
index 4821f8c..f738cc6 100644
--- a/sysdeps/aarch64/tlsdesc.c
+++ b/sysdeps/aarch64/tlsdesc.c
@@ -87,6 +87,8 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
if (!sym)
{
td->arg = (void*) reloc->r_addend;
+ /* Barrier so readers see the write above before the one below. */
+ __sync_synchronize ();
td->entry = _dl_tlsdesc_undefweak;
}
else
@@ -98,6 +100,8 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
{
td->arg = _dl_make_tlsdesc_dynamic (result, sym->st_value
+ reloc->r_addend);
+ /* Barrier so readers see the write above before the one below. */
+ __sync_synchronize ();
td->entry = _dl_tlsdesc_dynamic;
}
else
@@ -105,7 +109,9 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
{
td->arg = (void*) (sym->st_value + result->l_tls_offset
+ reloc->r_addend);
- td->entry = _dl_tlsdesc_return;
+ /* Barrier so readers see the write above before the one below. */
+ __sync_synchronize ();
+ td->entry = _dl_tlsdesc_return_lazy;
}
}
* Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
2015-04-22 12:27 [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix Szabolcs Nagy
@ 2015-04-22 15:07 ` Adhemerval Zanella
2015-04-22 16:42 ` Szabolcs Nagy
2015-04-22 16:08 ` Torvald Riegel
` (2 subsequent siblings)
3 siblings, 1 reply; 27+ messages in thread
From: Adhemerval Zanella @ 2015-04-22 15:07 UTC (permalink / raw)
To: libc-alpha
Hi
On 22-04-2015 09:27, Szabolcs Nagy wrote:
> Lazy TLSDESC initialization needs to be synchronized with concurrent TLS
> accesses.
>
> The original ARM TLSDESC design did not discuss this race that
> arises on systems with weak memory order guarantees
>
> http://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-ARM.txt
>
> "Storing the final value of the TLS descriptor also needs care: the
> resolver field must be set to its final value after the argument gets
> its final value, such that, if any thread attempts to use the
> descriptor before it gets its final value, it still goes to the hold
> function."
>
> The current glibc code (i386, x86_64, arm, aarch64) is
>
> td->arg = ...;
> td->entry = ...;
>
> the arm memory model allows store-store reordering, so a barrier is
> needed between these two stores (the code is broken on x86 as well in
> principle: x86 memory model does not help on the c code level, the
> compiler can reorder non-aliasing stores).
>
> What is missing from the TLSDESC design spec is a description of the
> ordering requirements on the read side: the TLS access sequence
> (generated by the compiler) loads the descriptor function pointer
> (td->entry) and then the argument is loaded (td->arg) in that function.
> These two loads must observe the changes made on the write side in a
> sequentially consistent way. The code in asm:
>
> ldr x1, [x0] // load td->entry
> blr [x0] // call it
>
> entryfunc:
> ldr x0, [x0,#8] // load td->arg
>
> The branch (blr) does not provide a load-load ordering barrier (so the
> second load may observe an old arg even though the first load observed
> the new entry, this happens with existing aarch64 hardware causing
> invalid memory access due to the incorrect TLS offset).
>
> Various alternatives were considered to force the load ordering in the
> descriptor function:
>
> (1) barrier
>
> entryfunc:
> dmb ishld
> ldr x0, [x0,#8]
>
> (2) address dependency (if the address of the second load depends on the
> result of the first one the ordering is guaranteed):
>
> entryfunc:
> ldr x1,[x0]
> and x1,x1,#8
> orr x1,x1,#8
> ldr x0,[x0,x1]
>
> (3) load-acquire (ARMv8 instruction that is ordered before subsequent
> loads and stores)
>
> entryfunc:
> ldar xzr,[x0]
> ldr x0,[x0,#8]
>
> Option (1) is the simplest but slowest (note: this runs at every TLS
> access), options (2) and (3) do one extra load from [x0] (same address
> loads are ordered so it happens-after the load on the call site),
> option (2) clobbers x1, so I think (3) is the best solution (a load
> into the zero register has the same effect as with a normal register,
> but the result is discarded so nothing is clobbered. Note that the
> TLSDESC abi allows the descriptor function to clobber x1, but current
> gcc does not implement this correctly, gcc will be fixed independently,
> the dynamic and undefweak descriptors currently save/restore x1 so only
> static TLS would be affected by this clobbering issue).
I see option (3) as the preferred one as well, since it fits better
with the C11 memory semantics.
>
> On the write side I used a full barrier (__sync_synchronize) although
>
> dmb ishst
>
> would be enough, but write side is not performance critical.
My understanding is that you do not need a sequentially consistent
barrier, but rather a store-release on 'td->arg', since the code only
requires that 'td->entry' not be reordered with 'td->arg'. So, to
adjust to C11 semantics, I would change:
diff --git a/sysdeps/aarch64/tlsdesc.c b/sysdeps/aarch64/tlsdesc.c
index 4821f8c..f738cc6 100644
--- a/sysdeps/aarch64/tlsdesc.c
+++ b/sysdeps/aarch64/tlsdesc.c
@@ -87,6 +87,8 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
if (!sym)
{
td->arg = (void*) reloc->r_addend;
+ /* Barrier so readers see the write above before the one below. */
+ __sync_synchronize ();
To 'atomic_store_release (td->arg, (void*) reloc->r_addend)'
>
> I introduced a new _dl_tlsdesc_return_lazy descriptor function for
> lazily relocated static TLS, so non-lazy use-case is not affected.
> The dynamic and undefweak descriptors do enough work so the additional
> ldar should not have a performance impact.
>
> Other thoughts:
>
> - Lazy binding for static TLS may be unnecessary complexity: it seems
> that gcc generates at most two static TLSDESC relocation entries for a
> translation unit (zero and non-zero initialized TLS), so there has to be
> a lot of object files with static TLS linked into a shared object to
> make the load time relocation overhead significant. Unless there is some
> evidence that lazy static TLS relocation makes sense I would suggest
> removing that logic (from all archs). (The dynamic case is probably a
> similar micro-optimization, but there the lazy vs non-lazy semantics are
> different in case of undefined symbols and some applications may depend
> on that).
Recently I have been seeing that lazy relocation in general yields
smaller gains than before and adds a lot of complexity. I would like
to see a use case where lazy TLS or even normal relocation shows a
real gain.
>
> - 32bit arm with -mtls-dialect=gnu2 is affected too, it will have to go
> with option (1) or (2) with an additional twist: some existing ARMv7 cpus
> don't implement the same address load ordering reliably, see:
> http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A_a9_read_read.pdf
>
> - I don't understand the role of the hold function in the general
> TLSDESC design, why is it not enough to let other threads wait on the
> global lock in the initial resolver function? Is the global dl lock
> implemented as a spin lock? Is it for some liveness/fairness property?
>
> - There are some incorrect uses of "cfi_adjust_cfa_offset" in
> dl-tlsdesc.S which is a separate patch.
>
> Changelog:
>
> 2015-04-22 Szabolcs Nagy <szabolcs.nagy@arm.com>
>
> [BZ #18034]
> * sysdeps/aarch64/dl-tlsdesc.h (_dl_tlsdesc_return_lazy): Declare.
> * sysdeps/aarch64/dl-tlsdesc.S (_dl_tlsdesc_return_lazy): Define.
> (_dl_tlsdesc_undefweak): Guarantee load-load ordering.
> (_dl_tlsdesc_dynamic): Likewise.
> * sysdeps/aarch64/tlsdesc.c (_dl_tlsdesc_resolve_rela_fixup): Add
> barrier between stores.
>
* Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
2015-04-22 12:27 [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix Szabolcs Nagy
2015-04-22 15:07 ` Adhemerval Zanella
@ 2015-04-22 16:08 ` Torvald Riegel
2015-04-22 17:14 ` Szabolcs Nagy
2015-04-27 15:19 ` Szabolcs Nagy
2015-06-09 6:30 ` Alexandre Oliva
2015-07-11 10:16 ` Ondřej Bílka
3 siblings, 2 replies; 27+ messages in thread
From: Torvald Riegel @ 2015-04-22 16:08 UTC (permalink / raw)
To: Szabolcs Nagy; +Cc: libc-alpha, Marcus Shawcroft, Ramana Radhakrishnan
On Wed, 2015-04-22 at 13:27 +0100, Szabolcs Nagy wrote:
> Lazy TLSDESC initialization needs to be synchronized with concurrent TLS
> accesses.
>
> The original ARM TLSDESC design did not discuss this race that
> arises on systems with weak memory order guarantees
>
> http://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-ARM.txt
>
> "Storing the final value of the TLS descriptor also needs care: the
> resolver field must be set to its final value after the argument gets
> its final value, such that, if any thread attempts to use the
> descriptor before it gets its final value, it still goes to the hold
> function."
>
> The current glibc code (i386, x86_64, arm, aarch64) is
>
> td->arg = ...;
> td->entry = ...;
>
> the arm memory model allows store-store reordering, so a barrier is
> needed between these two stores (the code is broken on x86 as well in
> principle: x86 memory model does not help on the c code level, the
> compiler can reorder non-aliasing stores).
Yes, this is not guaranteed to work. This should thus be fixed for all
the architectures.
> What is missing from the TLSDESC design spec is a description of the
> ordering requirements on the read side: the TLS access sequence
> (generated by the compiler) loads the descriptor function pointer
> (td->entry) and then the argument is loaded (td->arg) in that function.
> These two loads must observe the changes made on the write side in a
> sequentially consistent way.
Not strictly in a sequentially consistent way, I believe. What you are
probably looking for (I'm not familiar with TLS in detail) is some
invariant like: if you read update X, you will also read update Y or a
more recent update to Y.
> The code in asm:
>
> ldr x1, [x0] // load td->entry
> blr [x0] // call it
>
> entryfunc:
> ldr x0, [x0,#8] // load td->arg
>
> The branch (blr) does not provide a load-load ordering barrier (so the
> second load may observe an old arg even though the first load observed
> the new entry, this happens with existing aarch64 hardware causing
> invalid memory access due to the incorrect TLS offset).
>
> Various alternatives were considered to force the load ordering in the
> descriptor function:
>
> (1) barrier
>
> entryfunc:
> dmb ishld
> ldr x0, [x0,#8]
>
> (2) address dependency (if the address of the second load depends on the
> result of the first one the ordering is guaranteed):
>
> entryfunc:
> ldr x1,[x0]
> and x1,x1,#8
> orr x1,x1,#8
> ldr x0,[x0,x1]
The address dependencies could be useful in terms of performance, but
the disadvantage that I see is that there's currently no useful way to
use them in C code (C11's memory_order_consume is broken).
> (3) load-acquire (ARMv8 instruction that is ordered before subsequent
> loads and stores)
>
> entryfunc:
> ldar xzr,[x0]
> ldr x0,[x0,#8]
>
> Option (1) is the simplest but slowest (note: this runs at every TLS
> access), options (2) and (3) do one extra load from [x0] (same address
> loads are ordered so it happens-after the load on the call site),
> option (2) clobbers x1, so I think (3) is the best solution (a load
> into the zero register has the same effect as with a normal register,
> but the result is discarded so nothing is clobbered. Note that the
> TLSDESC abi allows the descriptor function to clobber x1, but current
> gcc does not implement this correctly, gcc will be fixed independently,
> the dynamic and undefweak descriptors currently save/restore x1 so only
> static TLS would be affected by this clobbering issue).
>
> On the write side I used a full barrier (__sync_synchronize) although
>
> dmb ishst
>
> would be enough, but write side is not performance critical.
Please get familiar with the concurrency wiki page:
https://sourceware.org/glibc/wiki/Concurrency
Fixes to existing code should use the C11 memory model for concurrent
code, especially if the fix is about a concurrency issue.
I agree with Adhemerval that a release MO store seems to be sufficient
in this case.
I haven't scanned through the TLS code to assess how much work it would
be to make it data-race-free (as defined by C11), but it would certainly
be helpful in showing similar issues (e.g., when documenting it) and
preventing compiler optimizations that are legal for non-concurrent
accesses but not what we need for TLS (e.g., speculative loads
introduced by the compiler on x86 in the case you mention above).
Given that you have looked at the code, could you give a rough estimate
of how much churn it would be to make the TLS code data-race-free?
It also looks as if the x86_64 variant of tlsdesc.c is, before your
changes and ignoring some additional comments, very similar to the
aarch64 variant. Can we get one proper tlsdesc.c (data-race-free and
using the C11 model) that can be used on several archs? This would
certainly decrease maintenance overhead.
I'm aware this might look like I'm requesting extra work that is not at
the core of what you are trying to fix. However, moving to portable
concurrent code that is shared across several archs is something we've
been doing elsewhere too, and in those cases ARM was on the benefiting
side (e.g., pthread_once, ongoing work in nscd, ...). I think overall,
some extra work and clean-up will be a good thing for ARM archs too.
Thanks.
* Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
2015-04-22 15:07 ` Adhemerval Zanella
@ 2015-04-22 16:42 ` Szabolcs Nagy
0 siblings, 0 replies; 27+ messages in thread
From: Szabolcs Nagy @ 2015-04-22 16:42 UTC (permalink / raw)
To: Adhemerval Zanella, libc-alpha
On 22/04/15 16:07, Adhemerval Zanella wrote:
> On 22-04-2015 09:27, Szabolcs Nagy wrote:
>> On the write side I used a full barrier (__sync_synchronize) although
>>
>> dmb ishst
>>
>> would be enough, but write side is not performance critical.
>
> My understanding is you do not need to push a seq-consistent, but rather
> a store release on the 'td->arg', since the code only requires that
> 'td->entry' should not be reordered with 'td->arg'. So, to adjust to
> C11 semantics I would change:
>
> diff --git a/sysdeps/aarch64/tlsdesc.c b/sysdeps/aarch64/tlsdesc.c
> index 4821f8c..f738cc6 100644
> --- a/sysdeps/aarch64/tlsdesc.c
> +++ b/sysdeps/aarch64/tlsdesc.c
> @@ -87,6 +87,8 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
> if (!sym)
> {
> td->arg = (void*) reloc->r_addend;
> + /* Barrier so readers see the write above before the one below. */
> + __sync_synchronize ();
>
> To 'atomic_store_release (td->arg, (void*) reloc->r_addend)'
>
>
i think
atomic_store_release (&td->entry, _dl_tlsdesc_undefweak)
is the correct replacement (the release has to be on td->entry, with
the corresponding entry function as the stored value), because later
stores may be moved before a store-release, so a release on td->arg
alone would not give the required ordering
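a minimal sketch of the undefweak branch after that change (assuming
glibc's atomic_store_release macro; the dynamic and static branches
would change the same way):

  td->arg = (void*) reloc->r_addend;
  /* the release store orders the arg write before the entry write,
     pairing with the ldar in the entry functions on the read side */
  atomic_store_release (&td->entry, _dl_tlsdesc_undefweak);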
>> - Lazy binding for static TLS may be unnecessary complexity: it seems
>> that gcc generates at most two static TLSDESC relocation entries for a
>> translation unit (zero and non-zero initialized TLS), so there has to be
>> a lot of object files with static TLS linked into a shared object to
>> make the load time relocation overhead significant. Unless there is some
>> evidence that lazy static TLS relocation makes sense I would suggest
>> removing that logic (from all archs). (The dynamic case is probably a
>> similar micro-optimization, but there the lazy vs non-lazy semantics are
>> different in case of undefined symbols and some applications may depend
>> on that).
>
> Recently I am seeing that all lazy relocation yields less gains than
> before and add a lot of complexity. I would like to see an usercase
> where lazy TLS or even normal relocation shows a real gain.
>
i'm not sure if glibc considers lazy relocation as part of the abi
contract, but i think static tls is always safe to do non-lazily
* Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
2015-04-22 16:08 ` Torvald Riegel
@ 2015-04-22 17:14 ` Szabolcs Nagy
2015-04-23 12:08 ` Szabolcs Nagy
2015-04-23 21:47 ` Torvald Riegel
2015-04-27 15:19 ` Szabolcs Nagy
1 sibling, 2 replies; 27+ messages in thread
From: Szabolcs Nagy @ 2015-04-22 17:14 UTC (permalink / raw)
To: Torvald Riegel; +Cc: libc-alpha, Marcus Shawcroft, Ramana Radhakrishnan
On 22/04/15 17:08, Torvald Riegel wrote:
> On Wed, 2015-04-22 at 13:27 +0100, Szabolcs Nagy wrote:
>> (2) address dependency (if the address of the second load depends on the
>> result of the first one the ordering is guaranteed):
>>
>> entryfunc:
>> ldr x1,[x0]
>> and x1,x1,#8
>> orr x1,x1,#8
>> ldr x0,[x0,x1]
>
> The address dependencies could be useful in terms of performance, but
> the disadvantage that I see are that there's currently no useful way to
> use them in C code (C11's memory_order_consume is broken).
>
this must be in asm so i don't think it matters if consume
is broken in c11.
(tlsdesc resolver cannot clobber registers).
>> (3) load-acquire (ARMv8 instruction that is ordered before subsequent
>> loads and stores)
>>
>> entryfunc:
>> ldar xzr,[x0]
>> ldr x0,[x0,#8]
>>
>> Option (1) is the simplest but slowest (note: this runs at every TLS
>> access), options (2) and (3) do one extra load from [x0] (same address
>> loads are ordered so it happens-after the load on the call site),
>> option (2) clobbers x1, so I think (3) is the best solution (a load
>> into the zero register has the same effect as with a normal register,
>> but the result is discarded so nothing is clobbered. Note that the
>> TLSDESC abi allows the descriptor function to clobber x1, but current
>> gcc does not implement this correctly, gcc will be fixed independently,
>> the dynamic and undefweak descriptors currently save/restore x1 so only
>> static TLS would be affected by this clobbering issue).
>>
>> On the write side I used a full barrier (__sync_synchronize) although
>>
>> dmb ishst
>>
>> would be enough, but write side is not performance critical.
>
> Please get familiar with https://sourceware.org/glibc/wiki/Concurrency
> Fixes to existing code should use the C11 memory model for concurrent
> code, especially if the fix is about a concurrency issue.
>
> I agree with Adhemerval that a release MO store seems to be sufficient
> in this case.
>
yes, the write side could use c11 atomics and i agree that
store-release is enough, but i'm not sure how much one can
rely on c11 memory model when interacting with asm code.
(so i thought a full barrier is the easiest to reason about)
i can update the code to use glibc atomics, but i will
have to look into how those work (are there type issues
or are they generic?)
> I haven't scanned through the TLS code to assess how much work it would
> be to make it data-race-free (as defined by C11), but it would certainly
> be helpful in showing similar issues (e.g., when documenting it) and
> preventing compiler optimizations that are legal for non-concurrent
> accesses but not what we need for TLS (e.g., speculative loads
> introduced by the compiler on x86 in the case you mention above).
>
> Given that you have looked at the code, could you give a rough estimate
> of how much churn it would be to make the TLS code data-race-free?
i think td->entry should only be accessed with atomics once
the loader has finished loading the dso containing it.
(i assume concurrent access is not possible during dso loading)
(the particular case i'm trying to fix is hard because
the access to td->entry is generated by the compiler, so
it has to be worked around inside the entry function.
i think x86 does not have this issue)
i think these function might need some attention:
elf/tlsdeschtab.h:
_dl_tlsdesc_resolve_early_return_p
sysdep/{x86_64,i386,arm,aarch64}/tlsdesc.c:
_dl_tlsdesc_resolve_rela_fixup
_dl_tlsdesc_resolve_hold_fixup
but i haven't done a detailed analysis
> It also looks as if the x86_64 variant of tlsdesc.c is, before your
> changes and ignoring some additional comments, very similar to the
> aarch64 variant. Can we get one proper tlsdesc.c (data-race-free and
> using the C11 model) that can be used on several archs? This would
> certainly decrease maintenance overhead.
>
should be possible i think: these functions are called
from asm to do the lazy resolution
but i have to check the arch specific details.
> I'm aware this might look like I'm requesting extra work that is not at
> the core of what you are trying to fix. However, moving to portable
> concurrent code that is shared across several archs is something we've
> been doing elsewhere too, and in those cases ARM was on the benefiting
> side (e.g., pthread_once, ongoing work in nscd, ...). I think overall,
> some extra work and clean-up will be a good thing for ARM archs too.
>
> Thanks.
>
* Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
2015-04-22 17:14 ` Szabolcs Nagy
@ 2015-04-23 12:08 ` Szabolcs Nagy
2015-04-24 17:37 ` Torvald Riegel
2015-04-23 21:47 ` Torvald Riegel
1 sibling, 1 reply; 27+ messages in thread
From: Szabolcs Nagy @ 2015-04-23 12:08 UTC (permalink / raw)
To: Torvald Riegel; +Cc: libc-alpha, Marcus Shawcroft, Ramana Radhakrishnan
On 22/04/15 18:14, Szabolcs Nagy wrote:
> On 22/04/15 17:08, Torvald Riegel wrote:
>> Given that you have looked at the code, could you give a rough estimate
>> of how much churn it would be to make the TLS code data-race-free?
>
> elf/tlsdeschtab.h:
> _dl_tlsdesc_resolve_early_return_p
> sysdep/{x86_64,i386,arm,aarch64}/tlsdesc.c:
> _dl_tlsdesc_resolve_rela_fixup
> _dl_tlsdesc_resolve_hold_fixup
>
i think these all need to use atomics for accessing
td->entry for c11 correctness, but it's hard to tell
what's right since c11 only talks about synchronizing
with c code, not asm.
the td->entry can be in 3 states during lazy resolution:
* init: retries the call of entry until caller==entry
(using a double-checked locking mechanism, then it
- grabs GL(dl_load_lock)
- changes the entry to the hold state
- does the resolver work
- changes arg and entry to the final value
- releases the lock to wake the waiters
- calls the new entry)
* hold: waits on the GL(dl_load_lock)
(then retries calling the entry when it's woken up)
* final state: (static, dynamic or undefweak callback)
calculates the tls offset based on arg
(currently without any locks which is bad on weak
memory models)
the code for tls access is generated by the compiler so there are no
atomics there: the loaded entry can be in the init, hold or final
state, and the guarantee about the state of arg must come from the
target arch memory model.
and to answer my earlier question about the hold state:
it is needed to avoid the spinning in the double-checked
locking in init.
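as a standalone illustration of that state machine (heavily
simplified: a plain pthread mutex stands in for GL(dl_load_lock), the
resolver work just stores a constant, the early-return/caller check is
folded into the double-checked lock, and all names are made up):

#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

struct desc;
typedef ptrdiff_t (*entry_fn) (struct desc *);

struct desc
{
  _Atomic entry_fn entry;
  void *arg;
};

static pthread_mutex_t load_lock = PTHREAD_MUTEX_INITIALIZER;

/* final state: compute the offset from arg; the caller's load of
   entry and this load of arg must be ordered (the ldar in the real
   aarch64 entry functions, the acquire load in tls_offset below).  */
static ptrdiff_t
final_entry (struct desc *td)
{
  return (ptrdiff_t) td->arg;
}

/* hold state: wait until the resolver releases the lock, then retry.  */
static ptrdiff_t
hold_entry (struct desc *td)
{
  pthread_mutex_lock (&load_lock);
  pthread_mutex_unlock (&load_lock);
  return atomic_load_explicit (&td->entry, memory_order_acquire) (td);
}

/* init state: double-checked locking around the lazy resolution.  */
static ptrdiff_t
init_entry (struct desc *td)
{
  pthread_mutex_lock (&load_lock);
  if (atomic_load_explicit (&td->entry, memory_order_relaxed) == init_entry)
    {
      atomic_store_explicit (&td->entry, hold_entry, memory_order_relaxed);
      td->arg = (void *) 42;   /* stand-in for the resolver work */
      atomic_store_explicit (&td->entry, final_entry, memory_order_release);
    }
  pthread_mutex_unlock (&load_lock);
  return atomic_load_explicit (&td->entry, memory_order_acquire) (td);
}

/* what the compiler-generated tls access amounts to.  */
static ptrdiff_t
tls_offset (struct desc *td)
{
  entry_fn e = atomic_load_explicit (&td->entry, memory_order_acquire);
  return e (td);
}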
>> It also looks as if the x86_64 variant of tlsdesc.c is, before your
>> changes and ignoring some additional comments, very similar to the
>> aarch64 variant. Can we get one proper tlsdesc.c (data-race-free and
>> using the C11 model) that can be used on several archs? This would
>> certainly decrease maintenance overhead.
>>
>
> should be possible i think: these functions are called
> from asm to do the lazy resolution
>
> but i have to check the arch specific details.
looking at this, but it seems to be significant work:
* i386 and arm determine the caller differently
* i386 uses special calling convention from asm
* arm handles local symbols specially
i think the way to move forward is
* fix the correctness bug now on aarch64
* decide if lazy static tls resolver is worth it
* do the refactoring so tlsdesc is common code
* use c11 atomics
the fix for arm is a lot harder (because atomics are
different on different versions of arm; an ifunc-based
dispatch could in principle solve the dmb vs no-dmb
issue for the lazy resolvers).
is this ok?
* Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
2015-04-22 17:14 ` Szabolcs Nagy
2015-04-23 12:08 ` Szabolcs Nagy
@ 2015-04-23 21:47 ` Torvald Riegel
2015-04-24 11:05 ` Szabolcs Nagy
1 sibling, 1 reply; 27+ messages in thread
From: Torvald Riegel @ 2015-04-23 21:47 UTC (permalink / raw)
To: Szabolcs Nagy; +Cc: libc-alpha, Marcus Shawcroft, Ramana Radhakrishnan
On Wed, 2015-04-22 at 18:14 +0100, Szabolcs Nagy wrote:
> On 22/04/15 17:08, Torvald Riegel wrote:
> > On Wed, 2015-04-22 at 13:27 +0100, Szabolcs Nagy wrote:
> >> (3) load-acquire (ARMv8 instruction that is ordered before subsequent
> >> loads and stores)
> >>
> >> entryfunc:
> >> ldar xzr,[x0]
> >> ldr x0,[x0,#8]
> >>
> >> Option (1) is the simplest but slowest (note: this runs at every TLS
> >> access), options (2) and (3) do one extra load from [x0] (same address
> >> loads are ordered so it happens-after the load on the call site),
> >> option (2) clobbers x1, so I think (3) is the best solution (a load
> >> into the zero register has the same effect as with a normal register,
> >> but the result is discarded so nothing is clobbered. Note that the
> >> TLSDESC abi allows the descriptor function to clobber x1, but current
> >> gcc does not implement this correctly, gcc will be fixed independently,
> >> the dynamic and undefweak descriptors currently save/restore x1 so only
> >> static TLS would be affected by this clobbering issue).
> >>
> >> On the write side I used a full barrier (__sync_synchronize) although
> >>
> >> dmb ishst
> >>
> >> would be enough, but write side is not performance critical.
> >
> > Please get familiar with https://sourceware.org/glibc/wiki/Concurrency
> > Fixes to existing code should use the C11 memory model for concurrent
> > code, especially if the fix is about a concurrency issue.
> >
> > I agree with Adhemerval that a release MO store seems to be sufficient
> > in this case.
> >
>
> yes, the write side could use c11 atomics and i agree that
> store-release is enough, but i'm not sure how much one can
> rely on c11 memory model when interacting with asm code.
> (so i thought a full barrier is the easiest to reason about)
IIUC, the asm that you need to interact with is generated by the
compiler. The C11 atomics implementation (that is, the asm / HW
features used in it) is effectively part of the ABI; C11 translation
units need to be able to synchronize with each other even if compiled by
different compiler versions.
Therefore, making sure that the asm follows what the compiler would
generate for the intended C11 atomics equivalent is a solution. That
also means that we can say clearly, in C11 memory model terms, what the
asm implementation on an arch should do.
> i can update the code to use glibc atomics, but i will
> have to look into how those work (are there type issues
> or are they generic?)
They are basically generic, and internally use either the __atomic
builtins for newer compilers and archs that provide them, or (fall back
to) the previous implementation based on the __sync builtins or custom
asm. One constraint is that we don't support sizes of atomic access
that aren't actually provided on the particular arch (either through HW
instructions, or with kernel helpers). So, for example, atomic access
to a naturally aligned pointer-sized type can be expected to work.
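For illustration, usage on the fields discussed here would just be
(assuming the type-generic atomic_load_acquire / atomic_store_release
macros from include/atomic.h; e and new_entry are placeholders):

  e = atomic_load_acquire (&td->entry);   /* works on function pointers too */
  atomic_store_release (&td->entry, new_entry);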
> > I haven't scanned through the TLS code to assess how much work it would
> > be to make it data-race-free (as defined by C11), but it would certainly
> > be helpful in showing similar issues (e.g., when documenting it) and
> > preventing compiler optimizations that are legal for non-concurrent
> > accesses but not what we need for TLS (e.g., speculative loads
> > introduced by the compiler on x86 in the case you mention above).
> >
> > Given that you have looked at the code, could you give a rough estimate
> > of how much churn it would be to make the TLS code data-race-free?
>
> i think td->entry should only be accessed with atomics once
> the loader finished loading the dso with it.
> (i assume during dso loading concurrent access is not possible)
If that's some kind of initialization you have in mind, using nonatomic
accesses may be fine. We don't have strict rules around that currently.
But if it's just a few lines of code, then simply using relaxed-MO
atomic accesses could be fine as well.
> (the particular case i'm trying to fix is hard because
> the access to td->entry is generated by the compiler, so
> it has to be worked around inside the entry function.
> i think x86 does not have this issue)
Do you mean that you want to affect the compiler-generated code after it
has been generated?
* Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
2015-04-23 21:47 ` Torvald Riegel
@ 2015-04-24 11:05 ` Szabolcs Nagy
2015-04-24 17:40 ` Torvald Riegel
0 siblings, 1 reply; 27+ messages in thread
From: Szabolcs Nagy @ 2015-04-24 11:05 UTC (permalink / raw)
To: Torvald Riegel; +Cc: libc-alpha, Marcus Shawcroft, Ramana Radhakrishnan
On 23/04/15 22:46, Torvald Riegel wrote:
> On Wed, 2015-04-22 at 18:14 +0100, Szabolcs Nagy wrote:
>> On 22/04/15 17:08, Torvald Riegel wrote:
>>> On Wed, 2015-04-22 at 13:27 +0100, Szabolcs Nagy wrote:
>>>> (3) load-acquire (ARMv8 instruction that is ordered before subsequent
>>>> loads and stores)
>>>>
>>>> entryfunc:
>>>> ldar xzr,[x0]
>>>> ldr x0,[x0,#8]
>>>>
>>>> Option (1) is the simplest but slowest (note: this runs at every TLS
>>>> access), options (2) and (3) do one extra load from [x0] (same address
>>>> loads are ordered so it happens-after the load on the call site),
>>>> option (2) clobbers x1, so I think (3) is the best solution (a load
>>>> into the zero register has the same effect as with a normal register,
>>>> but the result is discarded so nothing is clobbered. Note that the
>>>> TLSDESC abi allows the descriptor function to clobber x1, but current
>>>> gcc does not implement this correctly, gcc will be fixed independently,
>>>> the dynamic and undefweak descriptors currently save/restore x1 so only
>>>> static TLS would be affected by this clobbering issue).
>>>>
>>>> On the write side I used a full barrier (__sync_synchronize) although
>>>>
>>>> dmb ishst
>>>>
>>>> would be enough, but write side is not performance critical.
>>>
>>> Please get familiar with https://sourceware.org/glibc/wiki/Concurrency
>>> Fixes to existing code should use the C11 memory model for concurrent
>>> code, especially if the fix is about a concurrency issue.
>>>
>>> I agree with Adhemerval that a release MO store seems to be sufficient
>>> in this case.
>>>
>>
>> yes, the write side could use c11 atomics and i agree that
>> store-release is enough, but i'm not sure how much one can
>> rely on c11 memory model when interacting with asm code.
>> (so i thought a full barrier is the easiest to reason about)
>
> IIUC, the asm that you need to interact with is generated by the
> compiler. The C11 atomics implementation (that is, the asm / HW
to clarify:
// this bit is generated by the compiler for every tls access
// (the exact sequence is different and depends on tiny/small/large
// memory model, it is part of the abi so linkers can do relaxations)
// assume x0 is a pointer to the tls descriptor (td) in the GOT.
// (2 consecutive GOT entries are used: td->entry and td->arg)
// the entry function returns an offset (so tls data is at tp+off)
// note: no barriers are used here
mrs x2, tpidr_el0 // load the thread pointer
ldr x1, [x0] // load td->entry
blr x1 // call it, returns off in x0
ldr x0, [x2, x0] // access tls at tp+off
// this is a function implemented by the libc where td->entry points
// currently the options are
// _dl_tlsdesc_return,
// _dl_tlsdesc_undefweak,
// _dl_tlsdesc_dynamic
// originally td->entry points to a trampoline that calls into
// _dl_tlsdesc_resolve_rela_fixup which does the lazy relocation and sets
// td->entry to _dl_tlsdesc_resolve_hold temporarily.
entryfunc:
ldr x0, [x0,#8] // load td->arg
>> i can update the code to use glibc atomics, but i will
>> have to look into how those work (are there type issues
>> or are they generic?)
>
> There are basically generic, and internally use either the __atomic
> builtins for newer compilers and archs that provide them, or (fall back
> to) the previous implementation based on __sync-builtins or custom asm.
> One constraint is that we don't support sizes of atomic access that
> aren't actually provided on the particular arch (either through HW
> instructions, or with kernel helpers). So, for example, atomic access
> to a naturally aligned pointer-sized type can be expected to work.
ok
>> i think td->entry should only be accessed with atomics once
>> the loader finished loading the dso with it.
>> (i assume during dso loading concurrent access is not possible)
>
> If that's some kind of initialization you have in mind, using nonatomics
> accesses may be fine. We don't have strict rules around that currently.
> But if it's just a few lines of code, then simply using relaxed-MO
> atomic accesses could be fine as well.
>
at load time the td may be set up and then it is not
written again, but with lazy binding it will be written
later
care should be taken because the compiler generated
tls access code does not use atomics and can run
in parallel to lazy initialization.
so i guess any td->entry access in the lazy init code
path should use atomics for c11 correctness
>> (the particular case i'm trying to fix is hard because
>> the access to td->entry is generated by the compiler, so
>> it has to be worked around inside the entry function.
>> i think x86 does not have this issue)
>
> Do you mean that you want to affect the compiler-generated code after it
> has been generated?
>
i'm fixing the tlsdesc entry functions in libc
* Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
2015-04-23 12:08 ` Szabolcs Nagy
@ 2015-04-24 17:37 ` Torvald Riegel
2015-04-24 21:12 ` Szabolcs Nagy
0 siblings, 1 reply; 27+ messages in thread
From: Torvald Riegel @ 2015-04-24 17:37 UTC (permalink / raw)
To: Szabolcs Nagy; +Cc: libc-alpha, Marcus Shawcroft, Ramana Radhakrishnan
On Thu, 2015-04-23 at 13:08 +0100, Szabolcs Nagy wrote:
> On 22/04/15 18:14, Szabolcs Nagy wrote:
> > On 22/04/15 17:08, Torvald Riegel wrote:
> >> Given that you have looked at the code, could you give a rough estimate
> >> of how much churn it would be to make the TLS code data-race-free?
> >
> > elf/tlsdeschtab.h:
> > _dl_tlsdesc_resolve_early_return_p
> > sysdep/{x86_64,i386,arm,aarch64}/tlsdesc.c:
> > _dl_tlsdesc_resolve_rela_fixup
> > _dl_tlsdesc_resolve_hold_fixup
> >
>
> i think these all need to use atomics for accessing
> td->entry for c11 correctness, but it's hard to tell
> what's right since c11 only talks about synchronizing
> with c code, not asm.
That's true, but if we can require the asm code to be conceptually
equivalent to some C11 code, then we can specify and document the
concurrent algorithm or synchronization scheme based on C11 as
foundation, and either use C11 atomics to implement it or use code the
compiler would generate for C11 atomics if implementing in asm.
That doesn't cover cases where in the asm, we would want to use
synchronization that can't be represented through C11 atomics, or that
is incompatible with C11 atomics. But so far it doesn't look like this,
or have you observed such cases?
(consume MO might be a case, but I guess for that we could still wave
our hands and pretend compilers would simply generate the "intuitively
generated" code).
> the td->entry can be in 3 states during lazy resolution:
>
> * init: retries the call of entry until caller==entry
> (using a double-checked locking mechanism, then it
> - grabs GL(dl_load_lock)
> - changes the entry to the hold state
> - does the resolver work
> - changes arg and entry to the final value
> - releases the lock to wake the waiters
> - calls the new entry)
> * hold: waits on the GL(dl_load_lock)
> (then retries calling the entry when it's woken up)
> * final state: (static, dynamic or undefweak callback)
> calculates the tls offset based on arg
> (currently without any locks which is bad on weak
> memory models)
Could you point me to where (or confirm that) a new load of td->entry happens
if caller != td->entry? In the asm bits you posted so far, it seems
that if the call to *td->entry returns, actual TLS data is loaded. For
example, you posted this:
ldr x1, [x0] // load td->entry
blr [x0] // call it
entryfunc:
ldr x0, [x0,#8] // load td->arg
I'm asking because in tlsdesctab.h we have:
static int
_dl_tlsdesc_resolve_early_return_p (struct tlsdesc volatile *td, void *caller)
{
if (caller != td->entry)
return 1;
__rtld_lock_lock_recursive (GL(dl_load_lock));
if (caller != td->entry)
{
__rtld_lock_unlock_recursive (GL(dl_load_lock));
return 1;
....
If _dl_tlsdesc_resolve_early_return_p returns 1,
_dl_tlsdesc_resolve_rela_fixup just returns. If the next thing we do is
load td->arg, we need another acquire-MO load for td->entry in
_dl_tlsdesc_resolve_early_return_p. A relaxed-MO load (or the current
non-atomic access, ignoring DRF) would only be sufficient if all that
caller != atomic_load_relaxed (&td->entry) leads to is further busy
waiting (eg, calling *td->entry again). But if we really expect
caller != td->entry to have meaning and tell us something about other
state, we need to have another barrier there to not have a typical
double-checked-locking bug.
(This shows another benefit of using atomics for all concurrent
accesses: It forces us to make a conscious decision about the memory
orders, and document why. If a relaxed-MO access is sufficient, it
should better have a clear explanation...)
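To make this concrete, the acquire variant of that first check would
look roughly like this (a sketch only, with the rest of the function
unchanged; whether relaxed MO would do for a pure retry path is
exactly the question above):

static int
_dl_tlsdesc_resolve_early_return_p (struct tlsdesc volatile *td, void *caller)
{
  /* Acquire MO: if we observe a final entry written by another thread,
     a subsequent load of td->arg must see the matching value.  */
  if (caller != atomic_load_acquire (&td->entry))
    return 1;

  __rtld_lock_lock_recursive (GL(dl_load_lock));
  if (caller != atomic_load_acquire (&td->entry))
    {
      __rtld_lock_unlock_recursive (GL(dl_load_lock));
      return 1;
    }
  ....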
>
> the code for tls access is generated by the compiler so
> there is no atomics there: the loaded entry can be in
> init, hold or final state, the guarantee about the state
> of arg must come from the target arch memory model.
I understood there are no C11 atomics used in the asm implementation
bits. I'm thinking about abstract the synchronization scheme we want to
use in the implementation, represented through C11 atomics
synchronization; the asm would just do something equivalent (see my
comments about the atomics implementation being effectively part of the
ABI).
> and to answer my earlier question about the hold state:
> it is needed to avoid the spinning in the double-checked
> locking in init.
>
> >> It also looks as if the x86_64 variant of tlsdesc.c is, before your
> >> changes and ignoring some additional comments, very similar to the
> >> aarch64 variant. Can we get one proper tlsdesc.c (data-race-free and
> >> using the C11 model) that can be used on several archs? This would
> >> certainly decrease maintenance overhead.
> >>
> >
> > should be possible i think: these functions are called
> > from asm to do the lazy resolution
> >
> > but i have to check the arch specific details.
>
> looking at this, but it seems to be significant work:
>
> * i386 and arm determine the caller differently
> * i386 uses special calling convention from asm
> * arm handles local symbols specially
Okay. But do they still use the same abstract synchronization scheme,
and just differ in a few aspects? Or are the synchronization schemes
significantly different?
> i think the way to move forward is
I think step 1 should be to document the synchronization scheme on an
abstract, algorithmic level. Using C11 atomics in this pseudo-code. We
want to document the abstract level because it's typically much easier
to reason about the abstraction than the concrete implementation in case
of concurrent code, and because confirming that an implementation
matches the abstraction is typically also much easier than inferring the
abstraction from a concrete implementation.
So, for example, if we use a kind of double-checked locking here,
document that, show the pseudo code with C11 atomics, and then refer
back to this when documenting why certain memory orders on atomic
operations (or asm instructions) are sufficient and equivalent to the
abstract algorithm. In our case, this would mean saying that, for
example, we need load Y to use acquire MO so that it synchronizes with
release-MO store X, making sure that a load B that is sequenced after X
reads from store A (which is a core part of double-checked locking, so
could also be done at the abstract algorithm docs and then just referred
back to from the concrete implementation code).
> * fix the correctness bug now on aarch64
OK. When doing that, check and document equivalence to the abstract
synchronization scheme. When you do that, please use atomic_* for all
synchronization that you add or change.
It might be easiest to also change all atomic accesses that you checked
and documented to using atomic_*; this isn't much work compared to the
checking, and it becomes obvious what has been checked.
> * decide if lazy static tls resolver is worth it
> * do the refactoring so tlsdesc is common code
> * use c11 atomics
I'd prefer C11 atomics to be used right away when fixing things;
however, if you mean transforming and checking all other TLS code
unrelated to the bug you're looking at, then I agree that this can be
done incrementally and after you fixed this bug.
> the fix for arm is a lot harder (because atomics are
> different on different versions of arm, an ifunc based
> dispatch could in principle solve the dmb vs no-dmb
> issue for the lazy resolvers).
Is the synchronization used in the asm bits on the different versions of
arm different from how C11 atomics would look like on those versions of
arm? Or are we just dealing with different ways to implement acquire
loads, for example?
* Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
2015-04-24 11:05 ` Szabolcs Nagy
@ 2015-04-24 17:40 ` Torvald Riegel
2015-04-24 21:27 ` Szabolcs Nagy
0 siblings, 1 reply; 27+ messages in thread
From: Torvald Riegel @ 2015-04-24 17:40 UTC (permalink / raw)
To: Szabolcs Nagy; +Cc: libc-alpha, Marcus Shawcroft, Ramana Radhakrishnan
On Fri, 2015-04-24 at 12:05 +0100, Szabolcs Nagy wrote:
>
> On 23/04/15 22:46, Torvald Riegel wrote:
> > On Wed, 2015-04-22 at 18:14 +0100, Szabolcs Nagy wrote:
> >> (the particular case i'm trying to fix is hard because
> >> the access to td->entry is generated by the compiler, so
> >> it has to be worked around inside the entry function.
> >> i think x86 does not have this issue)
> >
> > Do you mean that you want to affect the compiler-generated code after it
> > has been generated?
> >
>
> i'm fixing the tlsdesc entry functions in libc
>
Sorry, I'm still not sure I can follow. IIUC, the compiler-generated
asm code for TLS accesses may lack acquire barriers in some cases. Do
you want to work around that by trying to adapt the glibc
implementation, or do you just want to fix all glibc-specific bugs, and
handle the compiler-side issues separately?
* Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
2015-04-24 17:37 ` Torvald Riegel
@ 2015-04-24 21:12 ` Szabolcs Nagy
0 siblings, 0 replies; 27+ messages in thread
From: Szabolcs Nagy @ 2015-04-24 21:12 UTC (permalink / raw)
To: Torvald Riegel; +Cc: libc-alpha, Marcus Shawcroft, Ramana Radhakrishnan
On 24/04/15 18:37, Torvald Riegel wrote:
> On Thu, 2015-04-23 at 13:08 +0100, Szabolcs Nagy wrote:
>> On 22/04/15 18:14, Szabolcs Nagy wrote:
>
>> the td->entry can be in 3 states during lazy resolution:
>>
>> * init: retries the call of entry until caller==entry
>> (using a double-checked locking mechanism, then it
>> - grabs GL(dl_load_lock)
>> - changes the entry to the hold state
>> - does the resolver work
>> - changes arg and entry to the final value
>> - releases the lock to wake the waiters
>> - calls the new entry)
>> * hold: waits on the GL(dl_load_lock)
>> (then retries calling the entry when it's woken up)
>> * final state: (static, dynamic or undefweak callback)
>> calculates the tls offset based on arg
>> (currently without any locks which is bad on weak
>> memory models)
>
> Could you point me to (or confirm that) a new load of td->entry happens
> if caller != td->entry? In the asm bits you posted so far, it seems
> that if the call to *td->entry returns, actual TLS data is loaded. For
> example, you posted this:
if td->entry is in init state it will do the retry
(when it ends up in _dl_tlsdesc_resolve_early_return_p)
>
> ldr x1, [x0] // load td->entry
> blr [x0] // call it
>
this is supposed to be 'blr x1', but you got the meaning
> entryfunc:
> ldr x0, [x0,#8] // load td->arg
>
>
> I'm asking because in tlsdesctab.h we have:
>
> static int
> _dl_tlsdesc_resolve_early_return_p (struct tlsdesc volatile *td, void *caller)
> {
> if (caller != td->entry)
> return 1;
>
> __rtld_lock_lock_recursive (GL(dl_load_lock));
> if (caller != td->entry)
> {
> __rtld_lock_unlock_recursive (GL(dl_load_lock));
> return 1;
> ....
>
> If _dl_tlsdesc_resolve_early_return_p returns 1,
> _dl_tlsdesc_resolve_rela_fixup just returns. If the next thing we do is
> load td->arg, we need another acquire-MO load for td->entry in
no, if _dl_tlsdesc_resolve_rela_fixup returns, it ends up back
in _dl_tlsdesc_resolve_rela (asm code), which loads td->entry again
and calls the entry again (this is the retry loop)
(so when the resolver has updated the entry, the new entry is
actually called, and in case of an early return there is a retry)
> _dl_tlsdesc_resolve_early_return_p. A relaxed-MO load (or the current
> non-atomic access, ingoring DRF) would only be sufficient if all that
> caller != atomic_load_relaxed (&td->entry) leads to is further busy
> waiting (eg, calling *td->entry again). But if we really expect
> caller != td->entry to have meaning and tell us something about other
> state, we need to have another barrier there to not have a typical
> double-checked-locking bug.
>
this is why i said you need atomics for accessing td->entry
but this is generic code and not related to the aarch64 bug
so i'd prefer if any such cleanup were done independently
> (This shows another benefit of using atomics for all concurrent
> accesses: It forces us to make a conscious decision about the memory
> orders, and document why. If a relaxed-MO access is sufficient, it
> should better have a clear explanation...)
>
>>
>> the code for tls access is generated by the compiler so
>> there is no atomics there: the loaded entry can be in
>> init, hold or final state, the guarantee about the state
>> of arg must come from the target arch memory model.
>
> I understood there are no C11 atomics used in the asm implementation
> bits. I'm thinking about abstract the synchronization scheme we want to
> use in the implementation, represented through C11 atomics
> synchronization; the asm would just do something equivalent (see my
> comments about the atomics implementation being effectively part of the
> ABI).
>
ok
>> and to answer my earlier question about the hold state:
>> it is needed to avoid the spinning in the double-checked
>> locking in init.
>>
>>>> It also looks as if the x86_64 variant of tlsdesc.c is, before your
>>>> changes and ignoring some additional comments, very similar to the
>>>> aarch64 variant. Can we get one proper tlsdesc.c (data-race-free and
>>>> using the C11 model) that can be used on several archs? This would
>>>> certainly decrease maintenance overhead.
>>>>
>>>
>>> should be possible i think: these functions are called
>>> from asm to do the lazy resolution
>>>
>>> but i have to check the arch specific details.
>>
>> looking at this, but it seems to be significant work:
>>
>> * i386 and arm determine the caller differently
>> * i386 uses special calling convention from asm
>> * arm handles local symbols specially
>
> Okay. But do they still use the same abstract synchronization scheme,
> and just differ in a few aspects? Or are the synchronization schemes
> significantly different?
>
the synchronization scheme is the same
but if the calling convention is non-C (regparm) then you
cannot use the same generic C code (without significant
changes somewhere).
>> i think the way to move forward is
>
> I think step 1 should be to document the synchronization scheme on an
> abstract, algorithmic level. Using C11 atomics in this pseudo-code. We
> want to document the abstract level because it's typically much easier
> to reason about the abstraction than the concrete implementation in case
> of concurrent code, and because confirming that an implementation
> matches the abstraction is typically also much easier than inferring the
> abstraction from a concrete implementation.
>
i think the documentation is the tlsdesc abi spec
(synchronization requirements can be derived from that)
i think we need to fix the bug instead of writing more documents
(you can build the theories later, it will inevitably cause a lot
of changes because the lazy dl code is far from perfect)
> So, for example, if we use a kind of double-checked locking here,
> document that, show the pseudo code with C11 atomics, and then refer
> back to this when documenting why certain memory orders on atomic
> operations (or asm instructions) are sufficient and equivalent to the
> abstract algorithm. In our case, this would mean saying that, for
> example, we need load Y to use acquire MO so that it synchronizes with
> release-MO store X, making sure that a load B that is sequenced after X
> reads from store A (which is a core part of double-checked locking, so
> could also be done at the abstract algorithm docs and then just referred
> back to from the concrete implementation code).
>
>> * fix the correctness bug now on aarch64
>
> OK. When doing that, check and document equivalence to the abstract
> synchronization scheme. When you do that, please use atomic_* for all
> synchronization that you add or change.
> It might be easiest to also change all atomic accesses that you checked
> and documented to using atomic_*; this isn't much work compared to the
> checking, and it becomes obvious what has been checked.
>
>> * decide if lazy static tls resolver is worth it
>> * do the refactoring so tlsdesc is common code
>> * use c11 atomics
>
> I'd prefer C11 atomics to be used right away when fixing things;
> however, if you mean transforming and checking all other TLS code
> unrelated to the bug you're looking at, then I agree that this can be
> done incrementally and after you fixed this bug.
>
ok i'll do the atomic_store on td->entry instead of __sync_synchronize
>> the fix for arm is a lot harder (because atomics are
>> different on different versions of arm, an ifunc based
>> dispatch could in principle solve the dmb vs no-dmb
>> issue for the lazy resolvers).
>
> Is the synchronization used in the asm bits on the different versions of
> arm different from how C11 atomics would look like on those versions of
> arm? Or are we just dealing with different ways to implement acquire
> loads, for example?
>
i don't know the details of c11 atomics on arm,
but i assume the emitted code depends on which cpu
you compile for (so lots of ifdefs in the asm, or
you can detect the current cpu at runtime with ifunc,
which is also ugly)
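just to sketch the ifunc idea (a sketch only, with made-up names; this is
not how the lazy resolvers are actually declared, and the cpu check is a
placeholder):

static void
lazy_resolver_dmb (void)
{
  /* variant that issues the barrier before the dependent load */
}

static void
lazy_resolver_nodmb (void)
{
  /* variant for cpus that do not need the barrier */
}

static int
cpu_needs_dmb (void)
{
  return 1;  /* placeholder for a real cpu/hwcap check */
}

static void (*select_lazy_resolver (void)) (void)
{
  return cpu_needs_dmb () ? lazy_resolver_dmb : lazy_resolver_nodmb;
}

void lazy_resolver (void) __attribute__ ((ifunc ("select_lazy_resolver")));

the resolver runs once at relocation time and picks the variant, so the
per-access cost of a runtime cpu check is avoided, but it is still extra
arch-specific machinery.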
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
2015-04-24 17:40 ` Torvald Riegel
@ 2015-04-24 21:27 ` Szabolcs Nagy
2015-04-25 4:27 ` Rich Felker
0 siblings, 1 reply; 27+ messages in thread
From: Szabolcs Nagy @ 2015-04-24 21:27 UTC (permalink / raw)
To: Torvald Riegel; +Cc: libc-alpha, Marcus Shawcroft, Ramana Radhakrishnan
On 24/04/15 18:40, Torvald Riegel wrote:
> On Fri, 2015-04-24 at 12:05 +0100, Szabolcs Nagy wrote:
>>
>> On 23/04/15 22:46, Torvald Riegel wrote:
>>> On Wed, 2015-04-22 at 18:14 +0100, Szabolcs Nagy wrote:
>>>> (the particular case i'm trying to fix is hard because
>>>> the access to td->entry is generated by the compiler, so
>>>> it has to be worked around inside the entry function.
>>>> i think x86 does not have this issue)
>>>
>>> Do you mean that you want to affect the compiler-generated code after it
>>> has been generated?
>>>
>>
>> i'm fixing the tlsdesc entry functions in libc
>>
>
> Sorry, I'm still not sure I can follow. IIUC, the compiler-generated
> asm code for TLS accesses may lack acquire barriers in some cases. Do
> you want to work around that by trying to adapt the glibc
> implementation, or do you just want to fix all glibc-specific bugs, and
> handle the compiler-side issues separately?
>
the compiler-generated code doesn't need any change
(that would break the arm and aarch64 abi)
i'm fixing the issue in the libc tlsdesc entry functions
(because glibc is currently broken; it is very easy to make it
crash)
the fix relies on the arm memory model
the real bug is the complexity of the dynamic linker code in glibc
and the design document that specified lazy binding for tlsdesc
without considering weak memory model issues; for these there is
an easy fix, but you won't like it: don't do lazy binding.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
2015-04-24 21:27 ` Szabolcs Nagy
@ 2015-04-25 4:27 ` Rich Felker
0 siblings, 0 replies; 27+ messages in thread
From: Rich Felker @ 2015-04-25 4:27 UTC (permalink / raw)
To: Szabolcs Nagy
Cc: Torvald Riegel, libc-alpha, Marcus Shawcroft, Ramana Radhakrishnan
On Fri, Apr 24, 2015 at 10:27:36PM +0100, Szabolcs Nagy wrote:
>
> On 24/04/15 18:40, Torvald Riegel wrote:
> > On Fri, 2015-04-24 at 12:05 +0100, Szabolcs Nagy wrote:
> >>
> >> On 23/04/15 22:46, Torvald Riegel wrote:
> >>> On Wed, 2015-04-22 at 18:14 +0100, Szabolcs Nagy wrote:
> >>>> (the particular case i'm trying to fix is hard because
> >>>> the access to td->entry is generated by the compiler, so
> >>>> it has to be worked around inside the entry function.
> >>>> i think x86 does not have this issue)
> >>>
> >>> Do you mean that you want to affect the compiler-generated code after it
> >>> has been generated?
> >>
> >> i'm fixing the tlsdesc entry functions in libc
> >
> > Sorry, I'm still not sure I can follow. IIUC, the compiler-generated
> > asm code for TLS accesses may lack acquire barriers in some cases. Do
> > you want to work around that by trying to adapt the glibc
> > implementation, or do you just want to fix all glibc-specific bugs, and
> > handle the compiler-side issues separately?
> >
>
> the compiler generated code dont need any change
> (that would break the arm and aarch64 abi)
>
> i'm fixing the issue in the libc tlsdesc entry functions
> (because glibc is broken currently, very easy to make it
> crash)
>
> the fix relies on the arm memory model
>
> the real bug is the complexity of the dynamic linker code in glibc
> and the design document that specified lazy binding for tlsdesc
> without considering weak memory model issues, for these there is
> an easy fix but you won't like it: don't do lazy binding.
Being that lazy tlsdesc binding seems to break AS-safety too, and
breaks fail-safety (there's an allocation that can fail at resolving
time), I think it should just be dropped. Then this whole issue goes
away and doesn't need complex arch-specific solutions.
Rich
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
2015-04-22 16:08 ` Torvald Riegel
2015-04-22 17:14 ` Szabolcs Nagy
@ 2015-04-27 15:19 ` Szabolcs Nagy
2015-05-26 23:26 ` Torvald Riegel
1 sibling, 1 reply; 27+ messages in thread
From: Szabolcs Nagy @ 2015-04-27 15:19 UTC (permalink / raw)
To: Torvald Riegel; +Cc: libc-alpha, Marcus Shawcroft, Ramana Radhakrishnan
[-- Attachment #1: Type: text/plain, Size: 1558 bytes --]
On 22/04/15 17:08, Torvald Riegel wrote:
> On Wed, 2015-04-22 at 13:27 +0100, Szabolcs Nagy wrote:
>> Lazy TLSDESC initialization needs to be synchronized with concurrent TLS
>> accesses.
>
> Please get familiar with https://sourceware.org/glibc/wiki/Concurrency
> Fixes to existing code should use the C11 memory model for concurrent
> code, especially if the fix is about a concurrency issue.
>
> I agree with Adhemerval that a release MO store seems to be sufficient
> in this case.
>
Updated the patch.
I used a fence instead of the suggested atomic store, because I
think this part of the concurrency wiki is problematic if glibc
ever tries to move to C11:
"We (currently) do not use special types for the variables accessed
by atomic operations, but the underlying non-atomic types with
suitable alignment on each architecture."
The release fence generates the same code (dmb ish) as the previous
__sync_synchronize (a release fence is slightly stronger than what arm
actually needs here: only a store-store barrier is needed, but a release
fence provides a load-store barrier too).
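for reference, the two write-side variants look like this in plain c11
(only a sketch; glibc spells these atomic_store_relaxed,
atomic_thread_fence_release and atomic_store_release, and td stands for
the tlsdesc):

#include <stdatomic.h>

struct desc
{
  void *(*_Atomic entry) (struct desc *);
  _Atomic (void *) arg;
};

/* variant used in this patch: fence between the two stores */
void
publish_with_fence (struct desc *td, void *arg, void *(*fn) (struct desc *))
{
  atomic_store_explicit (&td->arg, arg, memory_order_relaxed);
  atomic_thread_fence (memory_order_release);
  atomic_store_explicit (&td->entry, fn, memory_order_relaxed);
}

/* suggested alternative: release store on entry, no separate fence */
void
publish_with_release (struct desc *td, void *arg, void *(*fn) (struct desc *))
{
  atomic_store_explicit (&td->arg, arg, memory_order_relaxed);
  atomic_store_explicit (&td->entry, fn, memory_order_release);
}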
I haven't changed other parts of the patch.
Changelog:
2015-04-27 Szabolcs Nagy <szabolcs.nagy@arm.com>
[BZ #18034]
* sysdeps/aarch64/dl-tlsdesc.h (_dl_tlsdesc_return_lazy): Declare.
* sysdeps/aarch64/dl-tlsdesc.S (_dl_tlsdesc_return_lazy): Define.
(_dl_tlsdesc_undefweak): Guarantee load-load ordering.
(_dl_tlsdesc_dynamic): Likewise.
* sysdeps/aarch64/tlsdesc.c (_dl_tlsdesc_resolve_rela_fixup): Add
barrier between stores.
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: tlsdesc-v2.diff --]
[-- Type: text/x-patch; name=tlsdesc-v2.diff, Size: 3573 bytes --]
diff --git a/sysdeps/aarch64/dl-tlsdesc.S b/sysdeps/aarch64/dl-tlsdesc.S
index be9b9b3..ff74b52 100644
--- a/sysdeps/aarch64/dl-tlsdesc.S
+++ b/sysdeps/aarch64/dl-tlsdesc.S
@@ -79,6 +79,25 @@ _dl_tlsdesc_return:
cfi_endproc
.size _dl_tlsdesc_return, .-_dl_tlsdesc_return
+ /* Same as _dl_tlsdesc_return but with synchronization for
+ lazy relocation.
+ Prototype:
+ _dl_tlsdesc_return_lazy (tlsdesc *) ;
+ */
+ .hidden _dl_tlsdesc_return_lazy
+ .global _dl_tlsdesc_return_lazy
+ .type _dl_tlsdesc_return_lazy,%function
+ cfi_startproc
+ .align 2
+_dl_tlsdesc_return_lazy:
+ /* The ldar guarantees ordering between the load from [x0] at the
+ call site and the load from [x0,#8] here for lazy relocation. */
+ ldar xzr, [x0]
+ ldr x0, [x0, #8]
+ RET
+ cfi_endproc
+ .size _dl_tlsdesc_return_lazy, .-_dl_tlsdesc_return_lazy
+
/* Handler for undefined weak TLS symbols.
Prototype:
_dl_tlsdesc_undefweak (tlsdesc *);
@@ -96,6 +115,9 @@ _dl_tlsdesc_return:
_dl_tlsdesc_undefweak:
str x1, [sp, #-16]!
cfi_adjust_cfa_offset(16)
+ /* The ldar guarantees ordering between the load from [x0] at the
+ call site and the load from [x0,#8] here for lazy relocation. */
+ ldar xzr, [x0]
ldr x0, [x0, #8]
mrs x1, tpidr_el0
sub x0, x0, x1
@@ -152,6 +174,9 @@ _dl_tlsdesc_dynamic:
stp x3, x4, [sp, #32+16*1]
mrs x4, tpidr_el0
+ /* The ldar guarantees ordering between the load from [x0] at the
+ call site and the load from [x0,#8] here for lazy relocation. */
+ ldar xzr, [x0]
ldr x1, [x0,#8]
ldr x0, [x4]
ldr x3, [x1,#16]
diff --git a/sysdeps/aarch64/dl-tlsdesc.h b/sysdeps/aarch64/dl-tlsdesc.h
index 7a1285e..e6c0078 100644
--- a/sysdeps/aarch64/dl-tlsdesc.h
+++ b/sysdeps/aarch64/dl-tlsdesc.h
@@ -46,6 +46,9 @@ extern ptrdiff_t attribute_hidden
_dl_tlsdesc_return (struct tlsdesc *);
extern ptrdiff_t attribute_hidden
+_dl_tlsdesc_return_lazy (struct tlsdesc *);
+
+extern ptrdiff_t attribute_hidden
_dl_tlsdesc_undefweak (struct tlsdesc *);
extern ptrdiff_t attribute_hidden
diff --git a/sysdeps/aarch64/tlsdesc.c b/sysdeps/aarch64/tlsdesc.c
index 4821f8c..2e55c07 100644
--- a/sysdeps/aarch64/tlsdesc.c
+++ b/sysdeps/aarch64/tlsdesc.c
@@ -25,6 +25,7 @@
#include <dl-tlsdesc.h>
#include <dl-unmap-segments.h>
#include <tlsdeschtab.h>
+#include <atomic.h>
/* The following functions take an entry_check_offset argument. It's
computed by the caller as an offset between its entry point and the
@@ -87,6 +88,8 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
if (!sym)
{
td->arg = (void*) reloc->r_addend;
+ /* Store-store barrier so the two writes are not reordered. */
+ atomic_thread_fence_release ();
td->entry = _dl_tlsdesc_undefweak;
}
else
@@ -98,6 +101,8 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
{
td->arg = _dl_make_tlsdesc_dynamic (result, sym->st_value
+ reloc->r_addend);
+ /* Store-store barrier so the two writes are not reordered. */
+ atomic_thread_fence_release ();
td->entry = _dl_tlsdesc_dynamic;
}
else
@@ -105,7 +110,9 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
{
td->arg = (void*) (sym->st_value + result->l_tls_offset
+ reloc->r_addend);
- td->entry = _dl_tlsdesc_return;
+ /* Store-store barrier so the two writes are not reordered. */
+ atomic_thread_fence_release ();
+ td->entry = _dl_tlsdesc_return_lazy;
}
}
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
2015-04-27 15:19 ` Szabolcs Nagy
@ 2015-05-26 23:26 ` Torvald Riegel
2015-05-27 13:02 ` Szabolcs Nagy
2015-06-09 3:50 ` [PATCH] [BZ 18034] [AArch64] " Alexandre Oliva
0 siblings, 2 replies; 27+ messages in thread
From: Torvald Riegel @ 2015-05-26 23:26 UTC (permalink / raw)
To: Szabolcs Nagy; +Cc: libc-alpha, Marcus Shawcroft, Ramana Radhakrishnan
On Mon, 2015-04-27 at 16:19 +0100, Szabolcs Nagy wrote:
>
> On 22/04/15 17:08, Torvald Riegel wrote:
> > On Wed, 2015-04-22 at 13:27 +0100, Szabolcs Nagy wrote:
> >> Lazy TLSDESC initialization needs to be synchronized with
> concurrent TLS
> >> accesses.
> >
> > Please get familiar with
> https://sourceware.org/glibc/wiki/Concurrency
> > Fixes to existing code should use the C11 memory model for
> concurrent
> > code, especially if the fix is about a concurrency issue.
> >
> > I agree with Adhemerval that a release MO store seems to be
> sufficient
> > in this case.
> >
>
> Updated the patch.
>
> I used a fence instead of the suggested atomic store, because I
> think this part of the concurrency wiki is problematic if glibc
> ever tries to move to C11:
>
> "We (currently) do not use special types for the variables accessed
> by atomic operations, but the underlying non-atomic types with
> suitable alignment on each architecture."
No, I don't see how it would be problematic. Why do you think this is
the case?
First, there's no collision because glibc's atomic functions are named
differently. Second, we expect the underlying data type for both the
arch's atomic types and what we use now to be identical, so even
switching all of that over would be possible.
This should have relaxed atomic accesses for all things concurrently
accessed. As a result, you can drop the volatile qualification I
believe (I haven't checked, but it seems this was a pre-memory-model way
to avoid compiler reordering -- it's not actually observable behavior
that we'd need it for).
The release store would be clear, but you can use the release fence as
well.
>
> diff --git a/sysdeps/aarch64/dl-tlsdesc.S
> b/sysdeps/aarch64/dl-tlsdesc.S
> index be9b9b3..ff74b52 100644
> --- a/sysdeps/aarch64/dl-tlsdesc.S
> +++ b/sysdeps/aarch64/dl-tlsdesc.S
> @@ -79,6 +79,25 @@ _dl_tlsdesc_return:
> cfi_endproc
> .size _dl_tlsdesc_return, .-_dl_tlsdesc_return
>
> + /* Same as _dl_tlsdesc_return but with synchronization for
> + lazy relocation.
> + Prototype:
> + _dl_tlsdesc_return_lazy (tlsdesc *) ;
> + */
> + .hidden _dl_tlsdesc_return_lazy
> + .global _dl_tlsdesc_return_lazy
> + .type _dl_tlsdesc_return_lazy,%function
> + cfi_startproc
> + .align 2
> +_dl_tlsdesc_return_lazy:
> + /* The ldar guarantees ordering between the load from [x0] at
> the
> + call site and the load from [x0,#8] here for lazy
> relocation. */
You should point out where the matching release MO operation is. For
example, you could say somethign along the lines of:
"The ldar ensures that we synchronize-with with the release MO store in
_dl_tlsdesc_resolve_rela_fixup; as a result, the load from [x0,#8] will
happen after the initialization of td->arg in
_dl_tlsdesc_resolve_rela_fixup."
> + ldar xzr, [x0]
> + ldr x0, [x0, #8]
> + RET
> + cfi_endproc
> + .size _dl_tlsdesc_return_lazy, .-_dl_tlsdesc_return_lazy
> +
> /* Handler for undefined weak TLS symbols.
> Prototype:
> _dl_tlsdesc_undefweak (tlsdesc *);
> @@ -96,6 +115,9 @@ _dl_tlsdesc_return:
> _dl_tlsdesc_undefweak:
> str x1, [sp, #-16]!
> cfi_adjust_cfa_offset(16)
> + /* The ldar guarantees ordering between the load from [x0] at
> the
> + call site and the load from [x0,#8] here for lazy
> relocation. */
> + ldar xzr, [x0]
Likewise.
> ldr x0, [x0, #8]
> mrs x1, tpidr_el0
> sub x0, x0, x1
> @@ -152,6 +174,9 @@ _dl_tlsdesc_dynamic:
> stp x3, x4, [sp, #32+16*1]
>
> mrs x4, tpidr_el0
> + /* The ldar guarantees ordering between the load from [x0] at
> the
> + call site and the load from [x0,#8] here for lazy
> relocation. */
> + ldar xzr, [x0]
Likewise.
> ldr x1, [x0,#8]
> ldr x0, [x4]
> ldr x3, [x1,#16]
> diff --git a/sysdeps/aarch64/dl-tlsdesc.h
> b/sysdeps/aarch64/dl-tlsdesc.h
> index 7a1285e..e6c0078 100644
> --- a/sysdeps/aarch64/dl-tlsdesc.h
> +++ b/sysdeps/aarch64/dl-tlsdesc.h
> @@ -46,6 +46,9 @@ extern ptrdiff_t attribute_hidden
> _dl_tlsdesc_return (struct tlsdesc *);
>
> extern ptrdiff_t attribute_hidden
> +_dl_tlsdesc_return_lazy (struct tlsdesc *);
> +
> +extern ptrdiff_t attribute_hidden
> _dl_tlsdesc_undefweak (struct tlsdesc *);
>
> extern ptrdiff_t attribute_hidden
> diff --git a/sysdeps/aarch64/tlsdesc.c b/sysdeps/aarch64/tlsdesc.c
> index 4821f8c..2e55c07 100644
> --- a/sysdeps/aarch64/tlsdesc.c
> +++ b/sysdeps/aarch64/tlsdesc.c
> @@ -25,6 +25,7 @@
> #include <dl-tlsdesc.h>
> #include <dl-unmap-segments.h>
> #include <tlsdeschtab.h>
> +#include <atomic.h>
>
> /* The following functions take an entry_check_offset argument. It's
> computed by the caller as an offset between its entry point and
> the
> @@ -87,6 +88,8 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc
> volatile *td,
This volatile can be dropped when we use atomics properly, I believe.
> if (!sym)
> {
> td->arg = (void*) reloc->r_addend;
That needs to be an atomic access.
> + /* Store-store barrier so the two writes are not reordered. */
Say that this release MO store is meant to synchronize with the acquire
MO operation in _dl_tlsdesc_undefweak.
> + atomic_thread_fence_release ();
> td->entry = _dl_tlsdesc_undefweak;
Relaxed MO atomic store, or make it a release MO store and drop the
barrier.
> }
> else
> @@ -98,6 +101,8 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc
> volatile *td,
> {
> td->arg = _dl_make_tlsdesc_dynamic (result, sym->st_value
> + reloc->r_addend);
> + /* Store-store barrier so the two writes are not reordered.
> */
> + atomic_thread_fence_release ();
> td->entry = _dl_tlsdesc_dynamic;
Likewise.
> }
> else
> @@ -105,7 +110,9 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc
> volatile *td,
> {
> td->arg = (void*) (sym->st_value + result->l_tls_offset
> + reloc->r_addend);
> - td->entry = _dl_tlsdesc_return;
> + /* Store-store barrier so the two writes are not reordered.
> */
> + atomic_thread_fence_release ();
> + td->entry = _dl_tlsdesc_return_lazy;
Likewise.
> }
> }
>
>
You should also document why the relaxed MO load in
_dl_tlsdesc_resolve_early_return_p is sufficient (see the other part of
the email thread -- I now see what you meant but this isn't obvious at
all). The acquire MO operations that make it work are added by you in
this patch.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
2015-05-26 23:26 ` Torvald Riegel
@ 2015-05-27 13:02 ` Szabolcs Nagy
2015-05-27 14:38 ` Torvald Riegel
2015-06-09 3:50 ` [PATCH] [BZ 18034] [AArch64] " Alexandre Oliva
1 sibling, 1 reply; 27+ messages in thread
From: Szabolcs Nagy @ 2015-05-27 13:02 UTC (permalink / raw)
To: Torvald Riegel; +Cc: libc-alpha, Marcus Shawcroft, Ramana Radhakrishnan
On 26/05/15 21:37, Torvald Riegel wrote:
> On Mon, 2015-04-27 at 16:19 +0100, Szabolcs Nagy wrote:
>> On 22/04/15 17:08, Torvald Riegel wrote:
>>> On Wed, 2015-04-22 at 13:27 +0100, Szabolcs Nagy wrote:
>>>> Lazy TLSDESC initialization needs to be synchronized with
>> concurrent TLS
>>>> accesses.
>>>
>>> Please get familiar with
>> https://sourceware.org/glibc/wiki/Concurrency
>>> Fixes to existing code should use the C11 memory model for
>> concurrent
>>> code, especially if the fix is about a concurrency issue.
>>>
>>> I agree with Adhemerval that a release MO store seems to be
>> sufficient
>>> in this case.
>>
>> I used a fence instead of the suggested atomic store, because I
>> think this part of the concurrency wiki is problematic if glibc
>> ever tries to move to C11:
>>
>> "We (currently) do not use special types for the variables accessed
>> by atomic operations, but the underlying non-atomic types with
>> suitable alignment on each architecture."
>
> No, I don't see how it would be problematic. Why do you think this is
> the case?
ok, i was thinking about type system issues, but that's
probably not relevant.
(_Atomic int * is incompatible with int *, and changing
the type everywhere to _Atomic might not work if the
type is externally visible, like the pthread types.)
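a tiny example of the incompatibility (illustrative only, not glibc code):

#include <stdatomic.h>

struct obj_old { int state; };          /* existing public layout */
struct obj_new { _Atomic int state; };  /* hypothetical retyped layout */

void
example (struct obj_new *o)
{
  /* int *p = &o->state;  constraint violation: _Atomic int * vs int * */
  _Atomic int *p = &o->state;  /* every user's pointer type must change */
  atomic_store_explicit (p, 1, memory_order_relaxed);
}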
> This should have relaxed atomic accesses for all things concurrently
> accessed. As a result, you can drop the volatile qualification I
> believe (I haven't checked, but it seems this was a pre-memory-model way
> to avoid compiler reordering -- it's not actually observable behavior
> that we'd need it for).
i guess volatile prevents reordering wrt other volatiles
(but not wrt other non-volatiles).
the other use of volatile is to prevent spurious loads, eg.
x = *p;
y = x;
z = x;
can be "optimized" into
y = *p;
z = *p;
or even
y = z = *p & *p;
if every access to *p is changed to use an atomic function
then dropping volatile is ok, but in this case td is passed
around and used in _dl_tlsdesc_resolve_early_return_p too,
so i'm not sure if it's ok to remove volatile yet.
is it ok if that's fixed as a separate commit?
> The release store would be clear, but you can use the release fence as
> well.
>
ok, i'll use release store then.
>> +_dl_tlsdesc_return_lazy:
>> + /* The ldar guarantees ordering between the load from [x0] at
>> the
>> + call site and the load from [x0,#8] here for lazy
>> relocation. */
>
> You should point out where the matching release MO operation is. For
> example, you could say somethign along the lines of:
>
> "The ldar ensures that we synchronize-with with the release MO store in
> _dl_tlsdesc_resolve_rela_fixup; as a result, the load from [x0,#8] will
> happen after the initialization of td->arg in
> _dl_tlsdesc_resolve_rela_fixup."
>
ok
>> + /* Store-store barrier so the two writes are not reordered. */
>
> Say that this release MO store is meant to synchronize with the acquire
> MO operation in _dl_tlsdesc_undefweak.
ok
> You should also document why the relaxed MO load in
> _dl_tlsdesc_resolve_early_return_p is sufficient (see the other part of
> the email thread -- I now see what you meant but this isn't obvious at
> all). The acquire MO operations that make it work are added by you in
> this patch.
ok,
but i think that should be fixed separately
together with other archs.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
2015-05-27 13:02 ` Szabolcs Nagy
@ 2015-05-27 14:38 ` Torvald Riegel
2015-06-01 10:25 ` Szabolcs Nagy
0 siblings, 1 reply; 27+ messages in thread
From: Torvald Riegel @ 2015-05-27 14:38 UTC (permalink / raw)
To: Szabolcs Nagy; +Cc: libc-alpha, Marcus Shawcroft, Ramana Radhakrishnan
On Wed, 2015-05-27 at 12:27 +0100, Szabolcs Nagy wrote:
> On 26/05/15 21:37, Torvald Riegel wrote:
> > This should have relaxed atomic accesses for all things concurrently
> > accessed. As a result, you can drop the volatile qualification I
> > believe (I haven't checked, but it seems this was a pre-memory-model way
> > to avoid compiler reordering -- it's not actually observable behavior
> > that we'd need it for).
>
> i guess volatile prevents reordering wrt other volatiles
> (but not wrt other non-volatiles).
The atomic ops will take care of this (or, allow one to take care of
this).
> the other use of volatile is to prevent spurious loads, eg.
>
> x = *p;
> y = x;
> z = x;
>
> can be "optimized" into
>
> y = *p;
> z = *p;
>
> or even
>
> y = z = *p & *p;
Yes, that's allowed for non-atomic non-volatile accesses (either due to
the data-race-freedom requirement for non-atomics, or because non-volatile
accesses have no constraints regarding equality to the abstract machine).
If the load from
*p were even a memory_order_relaxed atomic load, the compiler would not
be allowed to re-load.
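In other words, already something like this small sketch rules out the
duplicated loads (plain C11 here, not glibc's atomic_* wrappers):

#include <stdatomic.h>

void
copy_twice (_Atomic int *p, int *y, int *z)
{
  /* Exactly one load of *p; the value is reused for both stores,
     unlike the plain non-volatile case above.  */
  int x = atomic_load_explicit (p, memory_order_relaxed);
  *y = x;
  *z = x;
}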
> if every access to *p is changed to use an atomic function
> then dropping volatile is ok, but in this case td is passed
> around and used in _dl_tlsdesc_resolve_early_return_p too,
> so i'm not sure if it's ok to remove volatile yet.
Yes, the clean thing to do would be to change it everywhere at once.
> is it ok if that's fixed as a separate commit?
Yes, if you promise to submit such a patch :)
Please still use mo_relaxed atomic accesses though, so that we can take
care of the DRF requirement for the code you changed.
> > You should also document why the relaxed MO load in
> > _dl_tlsdesc_resolve_early_return_p is sufficient (see the other part of
> > the email thread -- I now see what you meant but this isn't obvious at
> > all). The acquire MO operations that make it work are added by you in
> > this patch.
>
> ok,
>
> but i think that should be fixed separately
> together with other archs.
>
If you want to merge that into, for example, using a common tlsdesc.c,
then that's fine with me. I just don't want this information to get
lost; I think it's non-obvious and thus should be documented.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
2015-05-27 14:38 ` Torvald Riegel
@ 2015-06-01 10:25 ` Szabolcs Nagy
2015-06-03 10:49 ` Torvald Riegel
0 siblings, 1 reply; 27+ messages in thread
From: Szabolcs Nagy @ 2015-06-01 10:25 UTC (permalink / raw)
To: Torvald Riegel; +Cc: libc-alpha, Marcus Shawcroft, Ramana Radhakrishnan
[-- Attachment #1: Type: text/plain, Size: 2274 bytes --]
On 27/05/15 14:02, Torvald Riegel wrote:
> On Wed, 2015-05-27 at 12:27 +0100, Szabolcs Nagy wrote:
>> On 26/05/15 21:37, Torvald Riegel wrote:
>>> This should have relaxed atomic accesses for all things concurrently
>>> accessed. As a result, you can drop the volatile qualification I
removed volatile from aarch64 tlsdesc.c, but kept it in
elf/tlsdeschtab.h for now.
>>> You should also document why the relaxed MO load in
>>> _dl_tlsdesc_resolve_early_return_p is sufficient (see the other part of
i added a comment to the _dl_tlsdesc_resolve_early_return_p
call in aarch64 tlsdesc.c about the retry loop.
attached the updated patch:
- The c code now follows the glibc concurrency guidelines
(using relaxed atomics instead of volatile and release store
instead of barrier on the write side).
- Fixed the atomics in elf/tlsdeschtab.h too.
- Updated the comments.
the write side now compiles to the following (stlr has store-release semantics):
158: 91002261 add x1, x19, #0x8
15c: f9000020 str x0, [x1]
160: 90000000 adrp x0, 0 <_dl_tlsdesc_return_lazy>
160: R_AARCH64_ADR_PREL_PG_HI21 _dl_tlsdesc_return_lazy
164: 91000000 add x0, x0, #0x0
164: R_AARCH64_ADD_ABS_LO12_NC _dl_tlsdesc_return_lazy
168: c89ffe60 stlr x0, [x19]
the read side is: (ldar has load-acquire semantics)
0000000000000008 <_dl_tlsdesc_return_lazy>:
8: c8dffc1f ldar xzr, [x0]
c: f9400400 ldr x0, [x0,#8]
10: d65f03c0 ret
Changelog:
2015-06-01 Szabolcs Nagy <szabolcs.nagy@arm.com>
[BZ #18034]
* sysdeps/aarch64/dl-tlsdesc.h (_dl_tlsdesc_return_lazy): Declare.
* sysdeps/aarch64/dl-tlsdesc.S (_dl_tlsdesc_return_lazy): Define.
(_dl_tlsdesc_undefweak): Guarantee TLSDESC entry and argument load-load
ordering using ldar.
(_dl_tlsdesc_dynamic): Likewise.
(_dl_tlsdesc_return_lazy): Likewise.
* sysdeps/aarch64/tlsdesc.c (_dl_tlsdesc_resolve_rela_fixup): Use
relaxed atomics instead of volatile and synchronize with release store.
(_dl_tlsdesc_resolve_hold_fixup): Use relaxed atomics instead of
volatile.
* elf/tlsdeschtab.h (_dl_tlsdesc_resolve_early_return_p): Likewise.
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: tlsdesc-v3.diff --]
[-- Type: text/x-patch; name=tlsdesc-v3.diff, Size: 6863 bytes --]
diff --git a/elf/tlsdeschtab.h b/elf/tlsdeschtab.h
index d13b4e5..fb0eb88 100644
--- a/elf/tlsdeschtab.h
+++ b/elf/tlsdeschtab.h
@@ -20,6 +20,8 @@
#ifndef TLSDESCHTAB_H
# define TLSDESCHTAB_H 1
+#include <atomic.h>
+
# ifdef SHARED
# include <inline-hashtab.h>
@@ -138,17 +140,17 @@ _dl_make_tlsdesc_dynamic (struct link_map *map, size_t ti_offset)
static int
_dl_tlsdesc_resolve_early_return_p (struct tlsdesc volatile *td, void *caller)
{
- if (caller != td->entry)
+ if (caller != atomic_load_relaxed (&td->entry))
return 1;
__rtld_lock_lock_recursive (GL(dl_load_lock));
- if (caller != td->entry)
+ if (caller != atomic_load_relaxed (&td->entry))
{
__rtld_lock_unlock_recursive (GL(dl_load_lock));
return 1;
}
- td->entry = _dl_tlsdesc_resolve_hold;
+ atomic_store_relaxed (&td->entry, _dl_tlsdesc_resolve_hold);
return 0;
}
diff --git a/sysdeps/aarch64/dl-tlsdesc.S b/sysdeps/aarch64/dl-tlsdesc.S
index be9b9b3..c7adf79 100644
--- a/sysdeps/aarch64/dl-tlsdesc.S
+++ b/sysdeps/aarch64/dl-tlsdesc.S
@@ -79,6 +79,29 @@ _dl_tlsdesc_return:
cfi_endproc
.size _dl_tlsdesc_return, .-_dl_tlsdesc_return
+ /* Same as _dl_tlsdesc_return but with synchronization for
+ lazy relocation.
+ Prototype:
+ _dl_tlsdesc_return_lazy (tlsdesc *) ;
+ */
+ .hidden _dl_tlsdesc_return_lazy
+ .global _dl_tlsdesc_return_lazy
+ .type _dl_tlsdesc_return_lazy,%function
+ cfi_startproc
+ .align 2
+_dl_tlsdesc_return_lazy:
+ /* The ldar here happens after the load from [x0] at the call site
+ (that is generated by the compiler as part of the TLS access ABI),
+ so it reads the same value (this function is the final value of
+ td->entry) and thus it synchronizes with the release store to
+ td->entry in _dl_tlsdesc_resolve_rela_fixup ensuring that the load
+ from [x0,#8] here happens after the initialization of td->arg. */
+ ldar xzr, [x0]
+ ldr x0, [x0, #8]
+ RET
+ cfi_endproc
+ .size _dl_tlsdesc_return_lazy, .-_dl_tlsdesc_return_lazy
+
/* Handler for undefined weak TLS symbols.
Prototype:
_dl_tlsdesc_undefweak (tlsdesc *);
@@ -96,6 +119,13 @@ _dl_tlsdesc_return:
_dl_tlsdesc_undefweak:
str x1, [sp, #-16]!
cfi_adjust_cfa_offset(16)
+ /* The ldar here happens after the load from [x0] at the call site
+ (that is generated by the compiler as part of the TLS access ABI),
+ so it reads the same value (this function is the final value of
+ td->entry) and thus it synchronizes with the release store to
+ td->entry in _dl_tlsdesc_resolve_rela_fixup ensuring that the load
+ from [x0,#8] here happens after the initialization of td->arg. */
+ ldar xzr, [x0]
ldr x0, [x0, #8]
mrs x1, tpidr_el0
sub x0, x0, x1
@@ -152,6 +182,13 @@ _dl_tlsdesc_dynamic:
stp x3, x4, [sp, #32+16*1]
mrs x4, tpidr_el0
+ /* The ldar here happens after the load from [x0] at the call site
+ (that is generated by the compiler as part of the TLS access ABI),
+ so it reads the same value (this function is the final value of
+ td->entry) and thus it synchronizes with the release store to
+ td->entry in _dl_tlsdesc_resolve_rela_fixup ensuring that the load
+ from [x0,#8] here happens after the initialization of td->arg. */
+ ldar xzr, [x0]
ldr x1, [x0,#8]
ldr x0, [x4]
ldr x3, [x1,#16]
diff --git a/sysdeps/aarch64/dl-tlsdesc.h b/sysdeps/aarch64/dl-tlsdesc.h
index 7a1285e..e6c0078 100644
--- a/sysdeps/aarch64/dl-tlsdesc.h
+++ b/sysdeps/aarch64/dl-tlsdesc.h
@@ -46,6 +46,9 @@ extern ptrdiff_t attribute_hidden
_dl_tlsdesc_return (struct tlsdesc *);
extern ptrdiff_t attribute_hidden
+_dl_tlsdesc_return_lazy (struct tlsdesc *);
+
+extern ptrdiff_t attribute_hidden
_dl_tlsdesc_undefweak (struct tlsdesc *);
extern ptrdiff_t attribute_hidden
diff --git a/sysdeps/aarch64/tlsdesc.c b/sysdeps/aarch64/tlsdesc.c
index 4821f8c..d2f1cd5 100644
--- a/sysdeps/aarch64/tlsdesc.c
+++ b/sysdeps/aarch64/tlsdesc.c
@@ -25,6 +25,7 @@
#include <dl-tlsdesc.h>
#include <dl-unmap-segments.h>
#include <tlsdeschtab.h>
+#include <atomic.h>
/* The following functions take an entry_check_offset argument. It's
computed by the caller as an offset between its entry point and the
@@ -39,11 +40,12 @@
void
attribute_hidden
-_dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
- struct link_map *l)
+_dl_tlsdesc_resolve_rela_fixup (struct tlsdesc *td, struct link_map *l)
{
- const ElfW(Rela) *reloc = td->arg;
+ const ElfW(Rela) *reloc = atomic_load_relaxed (&td->arg);
+ /* Return and thus retry calling td->entry if it changes before
+ GL(dl_load_lock) is grabbed. */
if (_dl_tlsdesc_resolve_early_return_p
(td, (void*)(D_PTR (l, l_info[ADDRIDX (DT_TLSDESC_PLT)]) + l->l_addr)))
return;
@@ -86,8 +88,10 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
if (!sym)
{
- td->arg = (void*) reloc->r_addend;
- td->entry = _dl_tlsdesc_undefweak;
+ atomic_store_relaxed (&td->arg, (void *) reloc->r_addend);
+ /* This release store synchronizes with the ldar acquire load
+ instruction in _dl_tlsdesc_undefweak. */
+ atomic_store_release (&td->entry, _dl_tlsdesc_undefweak);
}
else
{
@@ -96,16 +100,22 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
# else
if (!TRY_STATIC_TLS (l, result))
{
- td->arg = _dl_make_tlsdesc_dynamic (result, sym->st_value
+ void *p = _dl_make_tlsdesc_dynamic (result, sym->st_value
+ reloc->r_addend);
- td->entry = _dl_tlsdesc_dynamic;
+ atomic_store_relaxed (&td->arg, p);
+ /* This release store synchronizes with the ldar acquire load
+ instruction in _dl_tlsdesc_dynamic. */
+ atomic_store_release (&td->entry, _dl_tlsdesc_dynamic);
}
else
# endif
{
- td->arg = (void*) (sym->st_value + result->l_tls_offset
+ void *p = (void*) (sym->st_value + result->l_tls_offset
+ reloc->r_addend);
- td->entry = _dl_tlsdesc_return;
+ atomic_store_relaxed (&td->arg, p);
+ /* This release store synchronizes with the ldar acquire load
+ instruction in _dl_tlsdesc_return_lazy. */
+ atomic_store_release (&td->entry, _dl_tlsdesc_return_lazy);
}
}
@@ -120,11 +130,10 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
void
attribute_hidden
-_dl_tlsdesc_resolve_hold_fixup (struct tlsdesc volatile *td,
- void *caller)
+_dl_tlsdesc_resolve_hold_fixup (struct tlsdesc *td, void *caller)
{
/* Maybe we're lucky and can return early. */
- if (caller != td->entry)
+ if (caller != atomic_load_relaxed (&td->entry))
return;
/* Locking here will stop execution until the running resolver runs
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
2015-06-01 10:25 ` Szabolcs Nagy
@ 2015-06-03 10:49 ` Torvald Riegel
2015-06-15 14:31 ` [PATCH v4][BZ 18034][AArch64] " Szabolcs Nagy
0 siblings, 1 reply; 27+ messages in thread
From: Torvald Riegel @ 2015-06-03 10:49 UTC (permalink / raw)
To: Szabolcs Nagy; +Cc: libc-alpha, Marcus Shawcroft, Ramana Radhakrishnan
On Mon, 2015-06-01 at 11:25 +0100, Szabolcs Nagy wrote:
> i added a comment to the _dl_tlsdesc_resolve_early_return_p
> call in aarch64 tlsdesc.c about the retry loop.
That's good, but it would have been better if you could have briefly
pointed out that this relates to mo_relaxed loads in
_dl_tlsdesc_resolve_early_return_p. And/or added a comment there saying
that the mo_relaxed loads are fine because of this retry loop in the
caller(s).
> diff --git a/sysdeps/aarch64/tlsdesc.c b/sysdeps/aarch64/tlsdesc.c
> index 4821f8c..d2f1cd5 100644
> --- a/sysdeps/aarch64/tlsdesc.c
> +++ b/sysdeps/aarch64/tlsdesc.c
> @@ -39,11 +40,12 @@
>
> void
> attribute_hidden
> -_dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
> - struct link_map *l)
> +_dl_tlsdesc_resolve_rela_fixup (struct tlsdesc *td, struct link_map
> *l)
> {
> - const ElfW(Rela) *reloc = td->arg;
> + const ElfW(Rela) *reloc = atomic_load_relaxed (&td->arg);
Good change. Can you add a brief comment saying why the mo_relaxed load
is sufficient? IIRC, this is because of the acquire loads done by the
caller.
OK with those changes.
Thanks!
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
2015-05-26 23:26 ` Torvald Riegel
2015-05-27 13:02 ` Szabolcs Nagy
@ 2015-06-09 3:50 ` Alexandre Oliva
2015-06-09 9:38 ` Torvald Riegel
1 sibling, 1 reply; 27+ messages in thread
From: Alexandre Oliva @ 2015-06-09 3:50 UTC (permalink / raw)
To: Torvald Riegel
Cc: Szabolcs Nagy, libc-alpha, Marcus Shawcroft, Ramana Radhakrishnan
On May 26, 2015, Torvald Riegel <triegel@redhat.com> wrote:
> As a result, you can drop the volatile qualification I
> believe (I haven't checked, but it seems this was a pre-memory-model way
> to avoid compiler reordering
That's correct. It was the only portable way available back then to
force stores and loads to take place in specific orders.
--
Alexandre Oliva, freedom fighter http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/ FSF Latin America board member
Free Software Evangelist|Red Hat Brasil GNU Toolchain Engineer
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
2015-04-22 12:27 [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix Szabolcs Nagy
2015-04-22 15:07 ` Adhemerval Zanella
2015-04-22 16:08 ` Torvald Riegel
@ 2015-06-09 6:30 ` Alexandre Oliva
2015-07-11 10:16 ` Ondřej Bílka
3 siblings, 0 replies; 27+ messages in thread
From: Alexandre Oliva @ 2015-06-09 6:30 UTC (permalink / raw)
To: Szabolcs Nagy; +Cc: libc-alpha, Marcus Shawcroft, Ramana Radhakrishnan
On Apr 22, 2015, Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:
> - I don't understand the role of the hold function in the general
> TLSDESC design
On machines that can't overwrite the entire descriptor atomically, it is
intended to stop a thread from using a partially-relocated descriptor.
Users of the descriptor must load the function word before the operand,
and the relocator must first set the function word to hold, then modify
the operand, and then set the function to its final value.
This is supposed to ensure that users of the descriptor can tell that the
relocation of the descriptor is already underway, and thus that the operand
may already be trashed.
implementations, but the x86 TLSDESC RFC (the first that couldn't update
the descriptor atomically) describes other envisioned implementations
that could take advantage of it.
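A rough sketch of that sequence in the C11 style used elsewhere in this
thread (illustrative only; the real hold function blocks on the dl load
lock, and the hold store happens under that lock):

#include <stdatomic.h>
#include <stddef.h>

struct desc
{
  void *(*_Atomic entry) (struct desc *);
  _Atomic (void *) arg;
};

/* Placeholder: the real hold function blocks until relocation is done.  */
void *
hold (struct desc *td)
{
  (void) td;
  return NULL;
}

void
relocate (struct desc *td, void *(*final_fn) (struct desc *), void *final_arg)
{
  /* 1. Park latecomers on the hold function first (done under the lock,
     so a relaxed store suffices) ...  */
  atomic_store_explicit (&td->entry, hold, memory_order_relaxed);
  /* 2. ... then the operand may be modified safely ...  */
  atomic_store_explicit (&td->arg, final_arg, memory_order_relaxed);
  /* 3. ... and only then publish the final function.  */
  atomic_store_explicit (&td->entry, final_fn, memory_order_release);
}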
--
Alexandre Oliva, freedom fighter http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/ FSF Latin America board member
Free Software Evangelist|Red Hat Brasil GNU Toolchain Engineer
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
2015-06-09 3:50 ` [PATCH] [BZ 18034] [AArch64] " Alexandre Oliva
@ 2015-06-09 9:38 ` Torvald Riegel
0 siblings, 0 replies; 27+ messages in thread
From: Torvald Riegel @ 2015-06-09 9:38 UTC (permalink / raw)
To: Alexandre Oliva
Cc: Szabolcs Nagy, libc-alpha, Marcus Shawcroft, Ramana Radhakrishnan
On Tue, 2015-06-09 at 00:10 -0300, Alexandre Oliva wrote:
> On May 26, 2015, Torvald Riegel <triegel@redhat.com> wrote:
>
> > As a result, you can drop the volatile qualification I
> > believe (I haven't checked, but it seems this was a pre-memory-model way
> > to avoid compiler reordering
I meant that I haven't checked what the intent behind it was, and why it
was chosen instead of other alternatives.
> That's correct. It was the only portable way available back then to
> force stores and loads to take place in specific orders.
An alternative would have been to do what Boehm's libatomic-ops did:
Have functions/macros for all atomic accesses, and implement them with
asm volatile.
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH v4][BZ 18034][AArch64] Lazy TLSDESC relocation data race fix
2015-06-03 10:49 ` Torvald Riegel
@ 2015-06-15 14:31 ` Szabolcs Nagy
2015-06-15 18:52 ` Torvald Riegel
2015-06-17 17:29 ` Joseph Myers
0 siblings, 2 replies; 27+ messages in thread
From: Szabolcs Nagy @ 2015-06-15 14:31 UTC (permalink / raw)
To: Torvald Riegel; +Cc: libc-alpha, Marcus Shawcroft, Ramana Radhakrishnan
[-- Attachment #1: Type: text/plain, Size: 3899 bytes --]
On 03/06/15 11:40, Torvald Riegel wrote:
> On Mon, 2015-06-01 at 11:25 +0100, Szabolcs Nagy wrote:
>> i added a comment to the _dl_tlsdesc_resolve_early_return_p
>> call in aarch64 tlsdesc.c about the retry loop.
>
> That's good, but it would have been better if you could have briefly
> pointed out that this relates to mo_relaxed loads in
> _dl_tlsdesc_resolve_early_return_p. And/or added a comment there saying
> that the mo_relaxed loads are fine because of this retry loop in the
> caller(s).
>
>> - const ElfW(Rela) *reloc = td->arg;
>> + const ElfW(Rela) *reloc = atomic_load_relaxed (&td->arg);
>
> Good change. Can you add a brief comment saying why the mo_relaxed load
> is sufficient? IIRC, this is because of the acquire loads done by the
> caller.
>
> OK with those changes.
>
updated the comments in the code and the description:
Lazy TLSDESC initialization needs to be synchronized with concurrent TLS
accesses. The TLS descriptor contains a function pointer (entry) and an
argument that is accessed from the entry function. With lazy initialization
the first call to the entry function updates the entry and the argument to
their final value. A final entry function must make sure that it accesses an
initialized argument, this needs synchronization on systems with weak memory
ordering otherwise the writes of the first call can be observed out of order.
There are at least two issues with the current code:
tlsdesc.c (i386, x86_64, arm, aarch64) uses volatile memory accesses on the
write side (in the initial entry function) instead of C11 atomics.
And on systems with weak memory ordering (arm, aarch64) the read side
synchronization is missing from the final entry functions (dl-tlsdesc.S).
This patch only deals with aarch64.
* Write side:
Volatile accesses were replaced with C11 relaxed atomics, and a release
store was used for the initialization of entry so the read side can
synchronize with it.
* Read side:
The TLS access code generated by the compiler and an entry function look roughly like this:
ldr x1, [x0] // load the entry
blr x1 // call it
entryfunc:
ldr x0, [x0,#8] // load the arg
ret
Various alternatives were considered to force the ordering in the entry
function between the two loads:
(1) barrier
entryfunc:
dmb ishld
ldr x0, [x0,#8]
(2) address dependency (if the address of the second load depends on the
result of the first one the ordering is guaranteed):
entryfunc:
ldr x1,[x0]
and x1,x1,#8
orr x1,x1,#8
ldr x0,[x0,x1]
(3) load-acquire (ARMv8 instruction that is ordered before subsequent
loads and stores)
entryfunc:
ldar xzr,[x0]
ldr x0,[x0,#8]
Option (1) is the simplest but slowest (note: this runs at every TLS
access). Options (2) and (3) do one extra load from [x0] (same-address
loads are ordered, so it happens-after the load at the call site).
Option (2) clobbers x1, which is problematic because existing gcc does
not expect that, so approach (3) was chosen.
A new _dl_tlsdesc_return_lazy entry function was introduced for lazily
relocated static TLS, so non-lazy static TLS can avoid the synchronization
cost.
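As a sketch of the intended abstract scheme in C11 form (not the glibc
sources; on AArch64 the acquire on the read side is provided by the ldar
inside the entry function, since the entry load itself is emitted by the
compiler):

#include <stdatomic.h>
#include <stddef.h>

struct desc
{
  ptrdiff_t (*_Atomic entry) (struct desc *);
  _Atomic (void *) arg;
};

/* Write side: runs once, in the lazy resolver, under the dl load lock.  */
void
resolve (struct desc *td, ptrdiff_t (*final_fn) (struct desc *), void *final_arg)
{
  atomic_store_explicit (&td->arg, final_arg, memory_order_relaxed);
  /* Release store: the arg store above is visible to any thread whose
     acquire load of entry observes final_fn.  */
  atomic_store_explicit (&td->entry, final_fn, memory_order_release);
}

/* Read side: every TLS access.  */
ptrdiff_t
tls_offset (struct desc *td)
{
  ptrdiff_t (*fn) (struct desc *)
    = atomic_load_explicit (&td->entry, memory_order_acquire);
  /* The entry function's relaxed load of td->arg now sees the final
     value because of the acquire/release pair above.  */
  return fn (td);
}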
Changelog:
2015-06-15 Szabolcs Nagy <szabolcs.nagy@arm.com>
[BZ #18034]
* sysdeps/aarch64/dl-tlsdesc.h (_dl_tlsdesc_return_lazy): Declare.
* sysdeps/aarch64/dl-tlsdesc.S (_dl_tlsdesc_return_lazy): Define.
(_dl_tlsdesc_undefweak): Guarantee TLSDESC entry and argument load-load
ordering using ldar.
(_dl_tlsdesc_dynamic): Likewise.
(_dl_tlsdesc_return_lazy): Likewise.
* sysdeps/aarch64/tlsdesc.c (_dl_tlsdesc_resolve_rela_fixup): Use
relaxed atomics instead of volatile and synchronize with release store.
(_dl_tlsdesc_resolve_hold_fixup): Use relaxed atomics instead of
volatile.
* elf/tlsdeschtab.h (_dl_tlsdesc_resolve_early_return_p): Likewise.
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: tlsdesc-v4.diff --]
[-- Type: text/x-patch; name=tlsdesc-v4.diff, Size: 7092 bytes --]
diff --git a/elf/tlsdeschtab.h b/elf/tlsdeschtab.h
index d13b4e5..fb0eb88 100644
--- a/elf/tlsdeschtab.h
+++ b/elf/tlsdeschtab.h
@@ -20,6 +20,8 @@
#ifndef TLSDESCHTAB_H
# define TLSDESCHTAB_H 1
+#include <atomic.h>
+
# ifdef SHARED
# include <inline-hashtab.h>
@@ -138,17 +140,17 @@ _dl_make_tlsdesc_dynamic (struct link_map *map, size_t ti_offset)
static int
_dl_tlsdesc_resolve_early_return_p (struct tlsdesc volatile *td, void *caller)
{
- if (caller != td->entry)
+ if (caller != atomic_load_relaxed (&td->entry))
return 1;
__rtld_lock_lock_recursive (GL(dl_load_lock));
- if (caller != td->entry)
+ if (caller != atomic_load_relaxed (&td->entry))
{
__rtld_lock_unlock_recursive (GL(dl_load_lock));
return 1;
}
- td->entry = _dl_tlsdesc_resolve_hold;
+ atomic_store_relaxed (&td->entry, _dl_tlsdesc_resolve_hold);
return 0;
}
diff --git a/sysdeps/aarch64/dl-tlsdesc.S b/sysdeps/aarch64/dl-tlsdesc.S
index be9b9b3..c7adf79 100644
--- a/sysdeps/aarch64/dl-tlsdesc.S
+++ b/sysdeps/aarch64/dl-tlsdesc.S
@@ -79,6 +79,29 @@ _dl_tlsdesc_return:
cfi_endproc
.size _dl_tlsdesc_return, .-_dl_tlsdesc_return
+ /* Same as _dl_tlsdesc_return but with synchronization for
+ lazy relocation.
+ Prototype:
+ _dl_tlsdesc_return_lazy (tlsdesc *) ;
+ */
+ .hidden _dl_tlsdesc_return_lazy
+ .global _dl_tlsdesc_return_lazy
+ .type _dl_tlsdesc_return_lazy,%function
+ cfi_startproc
+ .align 2
+_dl_tlsdesc_return_lazy:
+ /* The ldar here happens after the load from [x0] at the call site
+ (that is generated by the compiler as part of the TLS access ABI),
+ so it reads the same value (this function is the final value of
+ td->entry) and thus it synchronizes with the release store to
+ td->entry in _dl_tlsdesc_resolve_rela_fixup ensuring that the load
+ from [x0,#8] here happens after the initialization of td->arg. */
+ ldar xzr, [x0]
+ ldr x0, [x0, #8]
+ RET
+ cfi_endproc
+ .size _dl_tlsdesc_return_lazy, .-_dl_tlsdesc_return_lazy
+
/* Handler for undefined weak TLS symbols.
Prototype:
_dl_tlsdesc_undefweak (tlsdesc *);
@@ -96,6 +119,13 @@ _dl_tlsdesc_return:
_dl_tlsdesc_undefweak:
str x1, [sp, #-16]!
cfi_adjust_cfa_offset(16)
+ /* The ldar here happens after the load from [x0] at the call site
+ (that is generated by the compiler as part of the TLS access ABI),
+ so it reads the same value (this function is the final value of
+ td->entry) and thus it synchronizes with the release store to
+ td->entry in _dl_tlsdesc_resolve_rela_fixup ensuring that the load
+ from [x0,#8] here happens after the initialization of td->arg. */
+ ldar xzr, [x0]
ldr x0, [x0, #8]
mrs x1, tpidr_el0
sub x0, x0, x1
@@ -152,6 +182,13 @@ _dl_tlsdesc_dynamic:
stp x3, x4, [sp, #32+16*1]
mrs x4, tpidr_el0
+ /* The ldar here happens after the load from [x0] at the call site
+ (that is generated by the compiler as part of the TLS access ABI),
+ so it reads the same value (this function is the final value of
+ td->entry) and thus it synchronizes with the release store to
+ td->entry in _dl_tlsdesc_resolve_rela_fixup ensuring that the load
+ from [x0,#8] here happens after the initialization of td->arg. */
+ ldar xzr, [x0]
ldr x1, [x0,#8]
ldr x0, [x4]
ldr x3, [x1,#16]
diff --git a/sysdeps/aarch64/dl-tlsdesc.h b/sysdeps/aarch64/dl-tlsdesc.h
index 7a1285e..e6c0078 100644
--- a/sysdeps/aarch64/dl-tlsdesc.h
+++ b/sysdeps/aarch64/dl-tlsdesc.h
@@ -46,6 +46,9 @@ extern ptrdiff_t attribute_hidden
_dl_tlsdesc_return (struct tlsdesc *);
extern ptrdiff_t attribute_hidden
+_dl_tlsdesc_return_lazy (struct tlsdesc *);
+
+extern ptrdiff_t attribute_hidden
_dl_tlsdesc_undefweak (struct tlsdesc *);
extern ptrdiff_t attribute_hidden
diff --git a/sysdeps/aarch64/tlsdesc.c b/sysdeps/aarch64/tlsdesc.c
index 4821f8c..9f3ff9b 100644
--- a/sysdeps/aarch64/tlsdesc.c
+++ b/sysdeps/aarch64/tlsdesc.c
@@ -25,6 +25,7 @@
#include <dl-tlsdesc.h>
#include <dl-unmap-segments.h>
#include <tlsdeschtab.h>
+#include <atomic.h>
/* The following functions take an entry_check_offset argument. It's
computed by the caller as an offset between its entry point and the
@@ -39,11 +40,15 @@
void
attribute_hidden
-_dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
- struct link_map *l)
+_dl_tlsdesc_resolve_rela_fixup (struct tlsdesc *td, struct link_map *l)
{
- const ElfW(Rela) *reloc = td->arg;
+ const ElfW(Rela) *reloc = atomic_load_relaxed (&td->arg);
+ /* After GL(dl_load_lock) is grabbed only one caller can see td->entry in
+ initial state in _dl_tlsdesc_resolve_early_return_p, other concurrent
+ callers will return and retry calling td->entry. The updated td->entry
+ synchronizes with the single writer so all read accesses here can use
+ relaxed order. */
if (_dl_tlsdesc_resolve_early_return_p
(td, (void*)(D_PTR (l, l_info[ADDRIDX (DT_TLSDESC_PLT)]) + l->l_addr)))
return;
@@ -86,8 +91,10 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
if (!sym)
{
- td->arg = (void*) reloc->r_addend;
- td->entry = _dl_tlsdesc_undefweak;
+ atomic_store_relaxed (&td->arg, (void *) reloc->r_addend);
+ /* This release store synchronizes with the ldar acquire load
+ instruction in _dl_tlsdesc_undefweak. */
+ atomic_store_release (&td->entry, _dl_tlsdesc_undefweak);
}
else
{
@@ -96,16 +103,22 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
# else
if (!TRY_STATIC_TLS (l, result))
{
- td->arg = _dl_make_tlsdesc_dynamic (result, sym->st_value
+ void *p = _dl_make_tlsdesc_dynamic (result, sym->st_value
+ reloc->r_addend);
- td->entry = _dl_tlsdesc_dynamic;
+ atomic_store_relaxed (&td->arg, p);
+ /* This release store synchronizes with the ldar acquire load
+ instruction in _dl_tlsdesc_dynamic. */
+ atomic_store_release (&td->entry, _dl_tlsdesc_dynamic);
}
else
# endif
{
- td->arg = (void*) (sym->st_value + result->l_tls_offset
+ void *p = (void*) (sym->st_value + result->l_tls_offset
+ reloc->r_addend);
- td->entry = _dl_tlsdesc_return;
+ atomic_store_relaxed (&td->arg, p);
+ /* This release store synchronizes with the ldar acquire load
+ instruction in _dl_tlsdesc_return_lazy. */
+ atomic_store_release (&td->entry, _dl_tlsdesc_return_lazy);
}
}
@@ -120,11 +133,10 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
void
attribute_hidden
-_dl_tlsdesc_resolve_hold_fixup (struct tlsdesc volatile *td,
- void *caller)
+_dl_tlsdesc_resolve_hold_fixup (struct tlsdesc *td, void *caller)
{
/* Maybe we're lucky and can return early. */
- if (caller != td->entry)
+ if (caller != atomic_load_relaxed (&td->entry))
return;
/* Locking here will stop execution until the running resolver runs
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v4][BZ 18034][AArch64] Lazy TLSDESC relocation data race fix
2015-06-15 14:31 ` [PATCH v4][BZ 18034][AArch64] " Szabolcs Nagy
@ 2015-06-15 18:52 ` Torvald Riegel
2015-06-17 17:29 ` Joseph Myers
1 sibling, 0 replies; 27+ messages in thread
From: Torvald Riegel @ 2015-06-15 18:52 UTC (permalink / raw)
To: Szabolcs Nagy; +Cc: libc-alpha, Marcus Shawcroft, Ramana Radhakrishnan
On Mon, 2015-06-15 at 12:30 +0100, Szabolcs Nagy wrote:
> On 03/06/15 11:40, Torvald Riegel wrote:
> > On Mon, 2015-06-01 at 11:25 +0100, Szabolcs Nagy wrote:
> >> i added a comment to the _dl_tlsdesc_resolve_early_return_p
> >> call in aarch64 tlsdesc.c about the retry loop.
> >
> > That's good, but it would have been better if you could have briefly
> > pointed out that this relates to mo_relaxed loads in
> > _dl_tlsdesc_resolve_early_return_p. And/or added a comment there saying
> > that the mo_relaxed loads are fine because of this retry loop in the
> > caller(s).
> >
> >> - const ElfW(Rela) *reloc = td->arg;
> >> + const ElfW(Rela) *reloc = atomic_load_relaxed (&td->arg);
> >
> > Good change. Can you add a brief comment saying why the mo_relaxed load
> > is sufficient? IIRC, this is because of the acquire loads done by the
> > caller.
> >
> > OK with those changes.
> >
>
> updated the comments in the code and the description:
OK. Thanks!
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v4][BZ 18034][AArch64] Lazy TLSDESC relocation data race fix
2015-06-15 14:31 ` [PATCH v4][BZ 18034][AArch64] " Szabolcs Nagy
2015-06-15 18:52 ` Torvald Riegel
@ 2015-06-17 17:29 ` Joseph Myers
2015-06-22 8:54 ` Szabolcs Nagy
1 sibling, 1 reply; 27+ messages in thread
From: Joseph Myers @ 2015-06-17 17:29 UTC (permalink / raw)
To: Szabolcs Nagy
Cc: Torvald Riegel, libc-alpha, Marcus Shawcroft, Ramana Radhakrishnan
If your commit fully fixes bug 18034, please close it as fixed (as well as
updating the patchwork state for your patches).
--
Joseph S. Myers
joseph@codesourcery.com
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v4][BZ 18034][AArch64] Lazy TLSDESC relocation data race fix
2015-06-17 17:29 ` Joseph Myers
@ 2015-06-22 8:54 ` Szabolcs Nagy
0 siblings, 0 replies; 27+ messages in thread
From: Szabolcs Nagy @ 2015-06-22 8:54 UTC (permalink / raw)
To: Joseph Myers
Cc: Torvald Riegel, libc-alpha, Marcus Shawcroft, Ramana Radhakrishnan
On 17/06/15 18:21, Joseph Myers wrote:
> If your commit fully fixes bug 18034, please close it as fixed (as well as
> updating the patchwork state for your patches).
>
done.
patchwork feels like a bit of redundant manual work,
especially when it can't follow the various iterations of the same patch
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
2015-04-22 12:27 [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix Szabolcs Nagy
` (2 preceding siblings ...)
2015-06-09 6:30 ` Alexandre Oliva
@ 2015-07-11 10:16 ` Ondřej Bílka
3 siblings, 0 replies; 27+ messages in thread
From: Ondřej Bílka @ 2015-07-11 10:16 UTC (permalink / raw)
To: Szabolcs Nagy; +Cc: libc-alpha, Marcus Shawcroft, Ramana Radhakrishnan
On Wed, Apr 22, 2015 at 01:27:15PM +0100, Szabolcs Nagy wrote:
> Other thoughts:
>
> - Lazy binding for static TLS may be unnecessary complexity: it seems
> that gcc generates at most two static TLSDESC relocation entries for a
> translation unit (zero and non-zero initialized TLS), so there has to be
> a lot of object files with static TLS linked into a shared object to
> make the load time relocation overhead significant. Unless there is some
> evidence that lazy static TLS relocation makes sense I would suggest
> removing that logic (from all archs). (The dynamic case is probably a
> similar micro-optimization, but there the lazy vs non-lazy semantics are
> different in case of undefined symbols and some applications may depend
> on that).
>
I agree with this. As for undefined symbols, there is a bug where we
segfault with a weak tls variable, so I doubt that anyone depends on that.
You didn't mention one essential argument in your analysis: the overhead
is caused by unused tls variables. When a variable is used you need to do
the relocation anyway, and a lazy one could be an order of magnitude more
expensive than an eager one.
There is a project on my backlog to improve tls access in libraries,
which is still slow even with gnu2, so it would need to introduce
-mtls-dialect=gnu3.
Eager binding of TLS is a prerequisite. The current tls models are a
flawed design, as they try to save space by being lazy without any
evidence that there is an application that uses it.
The main idea is that total tls usage is less than 65536 bytes for most
applications. So unless an application uses more, we could preallocate
that when loading a dso. That makes the code in a dynamic library require
only one extra read versus a tls access in the main binary when you use
the following.
static int foo_offset;
#define foo *(foo_offset > 0 ? tcb + foo_offset: get_tls_ptr (- foo_offset))
Here you could do the lazy allocation in get_tls_ptr without taking a lock.
There is a bug that when you do the allocation in a signal handler it calls
malloc, which could result in a deadlock; there was a patch to use mmap to
fix it, which was reverted.
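To make the sketch above a bit more concrete (still hypothetical:
get_tls_ptr and tcb are assumed to be provided by the runtime, and
foo_offset would be fixed up at load time):

extern char *tcb;                          /* thread control block base */
extern int *get_tls_ptr (int module_key);  /* lock-free lazy slow path */

static int foo_offset;  /* > 0: preallocated tcb slot, < 0: negated module key */

static inline int *
foo_addr (void)
{
  return foo_offset > 0
         ? (int *) (tcb + foo_offset)
         : get_tls_ptr (-foo_offset);
}

A tls access in a library is then *foo_addr () = 42; which costs one extra
read of foo_offset compared to an access in the main binary.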
^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads:[~2015-07-11 10:16 UTC | newest]
Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-22 12:27 [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix Szabolcs Nagy
2015-04-22 15:07 ` Adhemerval Zanella
2015-04-22 16:42 ` Szabolcs Nagy
2015-04-22 16:08 ` Torvald Riegel
2015-04-22 17:14 ` Szabolcs Nagy
2015-04-23 12:08 ` Szabolcs Nagy
2015-04-24 17:37 ` Torvald Riegel
2015-04-24 21:12 ` Szabolcs Nagy
2015-04-23 21:47 ` Torvald Riegel
2015-04-24 11:05 ` Szabolcs Nagy
2015-04-24 17:40 ` Torvald Riegel
2015-04-24 21:27 ` Szabolcs Nagy
2015-04-25 4:27 ` Rich Felker
2015-04-27 15:19 ` Szabolcs Nagy
2015-05-26 23:26 ` Torvald Riegel
2015-05-27 13:02 ` Szabolcs Nagy
2015-05-27 14:38 ` Torvald Riegel
2015-06-01 10:25 ` Szabolcs Nagy
2015-06-03 10:49 ` Torvald Riegel
2015-06-15 14:31 ` [PATCH v4][BZ 18034][AArch64] " Szabolcs Nagy
2015-06-15 18:52 ` Torvald Riegel
2015-06-17 17:29 ` Joseph Myers
2015-06-22 8:54 ` Szabolcs Nagy
2015-06-09 3:50 ` [PATCH] [BZ 18034] [AArch64] " Alexandre Oliva
2015-06-09 9:38 ` Torvald Riegel
2015-06-09 6:30 ` Alexandre Oliva
2015-07-11 10:16 ` Ondřej Bílka