public inbox for libc-alpha@sourceware.org
* [PATCH] Protect _dl_profile_fixup data-dependency order [BZ #23690]
@ 2018-09-19 21:48 Tulio Magno Quites Machado Filho
  2018-09-20  0:04 ` Joseph Myers
  2018-09-20  0:16 ` Carlos O'Donell
  0 siblings, 2 replies; 40+ messages in thread
From: Tulio Magno Quites Machado Filho @ 2018-09-19 21:48 UTC (permalink / raw)
  To: libc-alpha

The field reloc_result->addr is used to indicate if the rest of the
fields of reloc_result have already been written, creating a
data-dependency order.
The read of reloc_result->addr into the variable value must complete
before the rest of the fields of reloc_result are read.
Likewise, the writes to the other fields of reloc_result must
complete before reloc_result->addr is updated.

2018-09-19  Tulio Magno Quites Machado Filho  <tuliom@linux.ibm.com>

	[BZ #23690]
	* elf/dl-runtime.c (_dl_profile_fixup): Guarantee memory
	modification order when accessing reloc_result->addr.

Signed-off-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
---
 elf/dl-runtime.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/elf/dl-runtime.c b/elf/dl-runtime.c
index 63bbc89776..6518e66fd6 100644
--- a/elf/dl-runtime.c
+++ b/elf/dl-runtime.c
@@ -183,9 +183,16 @@ _dl_profile_fixup (
   /* This is the address in the array where we store the result of previous
      relocations.  */
   struct reloc_result *reloc_result = &l->l_reloc_result[reloc_index];
-  DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
 
-  DL_FIXUP_VALUE_TYPE value = *resultp;
+  /* CONCURRENCY NOTES:
+
+     The following code uses reloc_result->addr to indicate if it is the first
+     time this object is being relocated.
+     Reading/Writing from/to reloc_result->addr must not happen before previous
+     writes to reloc_result complete as they could end-up with an incomplete
+     struct.  */
+  DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
+  DL_FIXUP_VALUE_TYPE value = atomic_load_acquire(resultp);
   if (DL_FIXUP_VALUE_CODE_ADDR (value) == 0)
     {
       /* This is the first time we have to relocate this object.  */
@@ -346,7 +353,10 @@ _dl_profile_fixup (
 
       /* Store the result for later runs.  */
       if (__glibc_likely (! GLRO(dl_bind_not)))
-	*resultp = value;
+	/* Guarantee all previous writes complete before
+	   resultp (reloc_result->addr) is updated.  See CONCURRENCY NOTES
+	   earlier  */
+	atomic_store_release(resultp, value);
     }
 
   /* By default we do not call the pltexit function.  */
-- 
2.14.4
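
For readers unfamiliar with the pattern, the ordering requirement in the
commit message can be sketched with ISO C11 atomics.  This is only an
illustration, not glibc code: struct cached_result, publish and try_consume
are hypothetical names, and glibc uses its own internal atomic macros
rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct reloc_result: 'addr' doubles as the
   "entry is complete" guard, the other fields are the payload.  */
struct cached_result
{
  _Atomic uintptr_t addr;   /* Guard: 0 means "not resolved yet".  */
  unsigned int boundndx;    /* Payload, written before the guard.  */
  unsigned int flags;       /* Payload, written before the guard.  */
};

/* Writer: fill in the payload first, then publish it with a release
   store to the guard.  */
void
publish (struct cached_result *r, uintptr_t addr,
         unsigned int boundndx, unsigned int flags)
{
  assert (addr != 0);       /* 0 is reserved for "not resolved".  */
  r->boundndx = boundndx;
  r->flags = flags;
  atomic_store_explicit (&r->addr, addr, memory_order_release);
}

/* Reader: acquire-load the guard.  If it is non-zero, the payload
   written before the matching release store is visible.  */
bool
try_consume (struct cached_result *r, uintptr_t *addr,
             unsigned int *boundndx, unsigned int *flags)
{
  uintptr_t a = atomic_load_explicit (&r->addr, memory_order_acquire);
  if (a == 0)
    return false;
  *addr = a;
  *boundndx = r->boundndx;
  *flags = r->flags;
  return true;
}
```

A reader that sees a non-zero guard may use the payload; a reader that sees
zero has to take the slow path and resolve the symbol itself.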

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-09-19 21:48 [PATCH] Protect _dl_profile_fixup data-dependency order [BZ #23690] Tulio Magno Quites Machado Filho
@ 2018-09-20  0:04 ` Joseph Myers
  2018-09-20 13:03   ` Tulio Magno Quites Machado Filho
  2018-09-20  0:16 ` Carlos O'Donell
  1 sibling, 1 reply; 40+ messages in thread
From: Joseph Myers @ 2018-09-20  0:04 UTC (permalink / raw)
  To: Tulio Magno Quites Machado Filho; +Cc: libc-alpha

On Wed, 19 Sep 2018, Tulio Magno Quites Machado Filho wrote:

> +  DL_FIXUP_VALUE_TYPE value = atomic_load_acquire(resultp);

> +	atomic_store_release(resultp, value);

Missing spaces before '('.

-- 
Joseph S. Myers
joseph@codesourcery.com

* Re: [PATCH] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-09-19 21:48 [PATCH] Protect _dl_profile_fixup data-dependency order [BZ #23690] Tulio Magno Quites Machado Filho
  2018-09-20  0:04 ` Joseph Myers
@ 2018-09-20  0:16 ` Carlos O'Donell
  2018-09-20  1:59   ` John David Anglin
                     ` (2 more replies)
  1 sibling, 3 replies; 40+ messages in thread
From: Carlos O'Donell @ 2018-09-20  0:16 UTC (permalink / raw)
  To: Tulio Magno Quites Machado Filho, libc-alpha, John David Anglin,
	Adhemerval Zanella

On 09/19/2018 05:48 PM, Tulio Magno Quites Machado Filho wrote:
> The field reloc_result->addr is used to indicate if the rest of the
> fields of reloc_result have already been written, creating a
> data-dependency order.
> Reading reloc_result->addr to the variable value requires to complete
> before reading the rest of the fields of reloc_result.
> Likewise, the writes to the other fields of the reloc_result must
> complete before reloc_result-addr is updated.

Good catch. This needs just a little more work and it's done.

When you're ready please post a v2 and TO me for review.

> 2018-09-19  Tulio Magno Quites Machado Filho  <tuliom@linux.ibm.com>
> 
> 	[BZ #23690]
> 	* elf/dl-runtime.c (_dl_profile_fixup): Guarantee memory
> 	modification order when accessing reloc_result->addr.
> 
> Signed-off-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
> ---
>  elf/dl-runtime.c | 16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/elf/dl-runtime.c b/elf/dl-runtime.c
> index 63bbc89776..6518e66fd6 100644
> --- a/elf/dl-runtime.c
> +++ b/elf/dl-runtime.c
> @@ -183,9 +183,16 @@ _dl_profile_fixup (
>    /* This is the address in the array where we store the result of previous
>       relocations.  */
>    struct reloc_result *reloc_result = &l->l_reloc_result[reloc_index];
> -  DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
>  
> -  DL_FIXUP_VALUE_TYPE value = *resultp;
> +  /* CONCURRENCY NOTES:
> +
> +     The following code uses reloc_result->addr to indicate if it is the first
> +     time this object is being relocated.
> +     Reading/Writing from/to reloc_result->addr must not happen before previous
> +     writes to reloc_result complete as they could end-up with an incomplete
> +     struct.  */

This is not quite accurate. The following code uses DL_FIXUP_VALUE_CODE_ADDR to
access a potential member of addr to indicate it is the first time the object is
being relocated. The "guard" variable in this access is either the address itself
or the ip of a function descriptor (the only two implementations of "code addr").

Also we don't explain what other data is being accessed? What non-function-locale
data is being updated as part of the guard variable check?

In the case of the function descriptor it's easy to know that the fdesc's gp is
going to be written to, and without an acquire we won't see that write, or the new
ip value, so we might have two threads update the same thing, in theory a benign
data race which we should fix.

In theory elf_ifunc_invoke() is called and that could write to arbitrary memory
with the IFUNC resolver, whose values should be seen by any future thread via
the new load acquire you are putting in place. You might be able to run the IFUNC
resolver twice which could be bad?


> +  DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
> +  DL_FIXUP_VALUE_TYPE value = atomic_load_acquire(resultp);

You are potentially requiring an atomic load of a structure whose size is
arbitrary, depending on machine and ABI design (64-bit for 32-bit hppa,
and 128-bit for 64-bit ia64). These architectures might not have such wide
atomic operations, and ...

>    if (DL_FIXUP_VALUE_CODE_ADDR (value) == 0)

... they only actually look at the value returned by DL_FIXUP_VALUE_CODE_ADDR(value).

The "guard" in this case is the IP value in the address or function descriptor
(only two implementations we have in glibc).

You need an atomic load acquire of the guard (ip), but it's hidden behind macros.

>      {
>        /* This is the first time we have to relocate this object.  */
> @@ -346,7 +353,10 @@ _dl_profile_fixup (
>  
>        /* Store the result for later runs.  */
>        if (__glibc_likely (! GLRO(dl_bind_not)))
> -	*resultp = value;
> +	/* Guarantee all previous writes complete before
> +	   resultp (reloc_result->addr) is updated.  See CONCURRENCY NOTES
> +	   earlier  */
> +	atomic_store_release(resultp, value);

Likewise you need an atomic store release of the ip value in the function
descriptor to avoid needing arbitrarily large atomic load/stores.

>      }
>  
>    /* By default we do not call the pltexit function.  */
> 

What I'd like to see:

- Refactor macros to get access to the "code addr" (ip of the fdesc or regular
  addr), and do the atomic loads and stores against that. Or feel free to
  delegate this to more macros at the machine level e.g.
  DL_FIXUP_ATOMIC_LOAD_ACQUIRE_REALLY_LONG_NAME_VALUE_CODE_ADDR (value),
  but I feel this loses out on the fact that we have generic atomic macros
  and so could use something like DL_FIXUP_VALUE_ADDR_CODE which returns
  the address of the code addr to load, and then you operate atomically on
  that to do the load.

- Fix ./sysdeps/hppa/dl-machine.h (elf_machine_fixup_plt) to do an atomic
  store release to update ip. Likewise for ia64. Both should reference the
  concurrency notes in elf/dl-runtime.c.

- Can you write a test for this? A test which starts several threads, barriers
  them all, then starts them, and lets them try to resolve (test is compiled
  lazy binding forced) the plt entries, then repeat for a while, each iteration
  with a different function that hasn't been bound yet (maybe synthetically
  generate a few thousand functions). If we ever see this fail we'll know we
  did something wrong.

And also feel free to tell me I'm wrong :-)

-- 
Cheers,
Carlos.
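
The first request in the review above (operate atomically on the code
address only, not on the whole descriptor) might look roughly like the
following ISO C11 sketch.  Everything here is an assumption made for
illustration: struct fdesc is a simplified two-word descriptor, and
fdesc_code_addr, fdesc_publish and fdesc_lookup are hypothetical helpers,
not existing glibc macros.  The sketch also ignores what happens when two
threads resolve the same entry at once.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical two-word function descriptor.  */
struct fdesc
{
  _Atomic uintptr_t ip;   /* Code address; also the guard (0 = unresolved).  */
  uintptr_t gp;           /* Global pointer; plain data.  */
};

/* Hypothetical analogue of the suggested DL_FIXUP_VALUE_ADDR_CODE:
   return the address of the word that acts as the guard, so that the
   generic atomic operations can be applied to it.  */
_Atomic uintptr_t *
fdesc_code_addr (struct fdesc *fd)
{
  return &fd->ip;
}

/* Write gp first, then release-store ip.  Only a single machine word
   is accessed atomically, so no 64-bit or 128-bit atomics are needed.  */
void
fdesc_publish (struct fdesc *fd, uintptr_t ip, uintptr_t gp)
{
  assert (ip != 0);       /* 0 is reserved for "unresolved".  */
  fd->gp = gp;
  atomic_store_explicit (fdesc_code_addr (fd), ip, memory_order_release);
}

/* Return the code address, or 0 if the descriptor is unresolved.
   When the result is non-zero, *gp holds the matching global pointer.  */
uintptr_t
fdesc_lookup (struct fdesc *fd, uintptr_t *gp)
{
  uintptr_t ip = atomic_load_explicit (fdesc_code_addr (fd),
                                       memory_order_acquire);
  if (ip != 0)
    *gp = fd->gp;
  return ip;
}
```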

* Re: [PATCH] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-09-20  0:16 ` Carlos O'Donell
@ 2018-09-20  1:59   ` John David Anglin
  2018-09-20  2:01     ` Carlos O'Donell
  2018-09-20 16:42   ` Tulio Magno Quites Machado Filho
  2018-10-08 19:28   ` [PATCHv2] " Tulio Magno Quites Machado Filho
  2 siblings, 1 reply; 40+ messages in thread
From: John David Anglin @ 2018-09-20  1:59 UTC (permalink / raw)
  To: Carlos O'Donell, Tulio Magno Quites Machado Filho,
	libc-alpha, Adhemerval Zanella

On 2018-09-19 8:16 PM, Carlos O'Donell wrote:
>>    DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
>> +  DL_FIXUP_VALUE_TYPE value = atomic_load_acquire(resultp);
> You are potentially requiring an atomic load of a structure whose size can be
> an arbitrary size depending on machine and ABI design (64-bit fore 32-bit hppa,
> and 128-bit for 64-bit ia64). These architectures might not have such wide
> atomic operations, and ...
We have implemented 64-bit atomic loads and stores for 32-bit hppa. They
are not well tested but they might work.  They use floating point loads
and stores, and a kernel helper.  The code is pretty horrific :-(

Dave

-- 
John David Anglin  dave.anglin@bell.net

* Re: [PATCH] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-09-20  1:59   ` John David Anglin
@ 2018-09-20  2:01     ` Carlos O'Donell
  2018-09-20 13:34       ` John David Anglin
  0 siblings, 1 reply; 40+ messages in thread
From: Carlos O'Donell @ 2018-09-20  2:01 UTC (permalink / raw)
  To: John David Anglin, Tulio Magno Quites Machado Filho, libc-alpha,
	Adhemerval Zanella

On 09/19/2018 09:59 PM, John David Anglin wrote:
> On 2018-09-19 8:16 PM, Carlos O'Donell wrote:
>>>    DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
>>> +  DL_FIXUP_VALUE_TYPE value = atomic_load_acquire(resultp);
>> You are potentially requiring an atomic load of a structure whose size can be
>> an arbitrary size depending on machine and ABI design (64-bit fore 32-bit hppa,
>> and 128-bit for 64-bit ia64). These architectures might not have such wide
>> atomic operations, and ...
> We have implemented 64-bit atomic loads and stores for 32-bit hppa. They are not well tested but
> they might work.  They use floating point loads and stores, and kernel helper.  The code is pretty horrific :-(

We only need to use the fdesc->ip as the guard, so we don't really need the 64-bit
atomic, but other algorithms like the new pthread condvars can use them effectively
to accelerate and avoid 2 lws kernel helper calls and instead use 1 lws kernel helper
64-bit atomic.

-- 
Cheers,
Carlos.

* Re: [PATCH] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-09-20  0:04 ` Joseph Myers
@ 2018-09-20 13:03   ` Tulio Magno Quites Machado Filho
  0 siblings, 0 replies; 40+ messages in thread
From: Tulio Magno Quites Machado Filho @ 2018-09-20 13:03 UTC (permalink / raw)
  To: Joseph Myers; +Cc: libc-alpha

Joseph Myers <joseph@codesourcery.com> writes:

> On Wed, 19 Sep 2018, Tulio Magno Quites Machado Filho wrote:
>
>> +  DL_FIXUP_VALUE_TYPE value = atomic_load_acquire(resultp);
>
>> +	atomic_store_release(resultp, value);
>
> Missing spaces before '('.

Argh!  Fixed both lines.

Thanks!

-- 
Tulio Magno

* Re: [PATCH] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-09-20  2:01     ` Carlos O'Donell
@ 2018-09-20 13:34       ` John David Anglin
  2018-10-11 21:26         ` Carlos O'Donell
  0 siblings, 1 reply; 40+ messages in thread
From: John David Anglin @ 2018-09-20 13:34 UTC (permalink / raw)
  To: Carlos O'Donell, Tulio Magno Quites Machado Filho,
	libc-alpha, Adhemerval Zanella

On 2018-09-19 10:00 PM, Carlos O'Donell wrote:
> On 09/19/2018 09:59 PM, John David Anglin wrote:
>> On 2018-09-19 8:16 PM, Carlos O'Donell wrote:
>>>>     DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
>>>> +  DL_FIXUP_VALUE_TYPE value = atomic_load_acquire(resultp);
>>> You are potentially requiring an atomic load of a structure whose size can be
>>> an arbitrary size depending on machine and ABI design (64-bit fore 32-bit hppa,
>>> and 128-bit for 64-bit ia64). These architectures might not have such wide
>>> atomic operations, and ...
>> We have implemented 64-bit atomic loads and stores for 32-bit hppa. They are not well tested but
>> they might work.  They use floating point loads and stores, and kernel helper.  The code is pretty horrific :-(
> We only need to use the fdesc->ip as the guard, so we don't really need the 64-bit
> atomic, but other algorithms like the new pthread condvars can use them effectively
> to accelerate and avoid 2 lws kernel helper calls and instead use 1 lws kernel helper
> 64-bit atomic.
Regarding using fdesc->ip as the guard, the gp is loaded both before and
after the ip on hppa.  For example, $$dyncall loads gp before the branch.
This could be changed at the cost of one instruction.  Stubs load gp after
ip.  I don't think this is easy to change.

Dave

-- 
John David Anglin  dave.anglin@bell.net

* Re: [PATCH] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-09-20  0:16 ` Carlos O'Donell
  2018-09-20  1:59   ` John David Anglin
@ 2018-09-20 16:42   ` Tulio Magno Quites Machado Filho
  2018-09-20 17:04     ` Florian Weimer
  2018-10-08 19:28   ` [PATCHv2] " Tulio Magno Quites Machado Filho
  2 siblings, 1 reply; 40+ messages in thread
From: Tulio Magno Quites Machado Filho @ 2018-09-20 16:42 UTC (permalink / raw)
  To: Carlos O'Donell, libc-alpha, John David Anglin, Adhemerval Zanella

Carlos O'Donell <carlos@redhat.com> writes:

> On 09/19/2018 05:48 PM, Tulio Magno Quites Machado Filho wrote:
>> The field reloc_result->addr is used to indicate if the rest of the
>> fields of reloc_result have already been written, creating a
>> data-dependency order.
>> Reading reloc_result->addr to the variable value requires to complete
>> before reading the rest of the fields of reloc_result.
>> Likewise, the writes to the other fields of the reloc_result must
>> complete before reloc_result-addr is updated.
>
> Good catch. This needs just a little more work and it's done.
>
> When you're ready please post a v2 and TO me for review.

Ack.

>> diff --git a/elf/dl-runtime.c b/elf/dl-runtime.c
>> index 63bbc89776..6518e66fd6 100644
>> --- a/elf/dl-runtime.c
>> +++ b/elf/dl-runtime.c
>> @@ -183,9 +183,16 @@ _dl_profile_fixup (
>>    /* This is the address in the array where we store the result of previous
>>       relocations.  */
>>    struct reloc_result *reloc_result = &l->l_reloc_result[reloc_index];
>> -  DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
>>  
>> -  DL_FIXUP_VALUE_TYPE value = *resultp;
>> +  /* CONCURRENCY NOTES:
>> +
>> +     The following code uses reloc_result->addr to indicate if it is the first
>> +     time this object is being relocated.
>> +     Reading/Writing from/to reloc_result->addr must not happen before previous
>> +     writes to reloc_result complete as they could end-up with an incomplete
>> +     struct.  */
>
> This is not quite accurate. The following code uses DL_FIXUP_VALUE_CODE_ADDR to
> access a potential member of addr to indicate it is the first time the object is
> being relocated. The "guard" variable in this access is either the address itself
> or the ip of a function descriptor (the only two implementations of "code addr").

Ack.  I'll elaborate the comments in the next version of the patch.

> Also we don't explain what other data is being accessed? What non-function-locale
> data is being updated as part of the guard variable check?

I don't follow you.
Do you mean other data being updated but not in reloc_result?
What do you mean by non-function-locale?

> In the case of the function descriptor it's easy to know that the fdesc's gp is 
> going to be written to, and without a acquire we won't see that write, or the new 
> ip value, so we might have two threads update the same thing, in theory a benign 
> data race which we should fix.

Why should we fix it?

> In theory elf_ifunc_invoke() is called and that could write to arbitrary memory
> with the IFUNC resolver, whose values should be seen by any future thread via
> the new load acquire you are putting in place. You might be able to run the IFUNC
> resolver twice which could be bad?

Good point.  I haven't tested IFUNCs yet.

>> +  DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
>> +  DL_FIXUP_VALUE_TYPE value = atomic_load_acquire(resultp);
>
> You are potentially requiring an atomic load of a structure whose size can be
> an arbitrary size depending on machine and ABI design (64-bit fore 32-bit hppa,
> and 128-bit for 64-bit ia64). These architectures might not have such wide
> atomic operations, and ...

Would it help hppa and ia64 if I replaced it with just memory fences? e.g.:

  DL_FIXUP_VALUE_TYPE value = *resultp;
  atomic_thread_fence_acquire ();

Likewise for the atomic_store_release () later in the code.

>>    /* By default we do not call the pltexit function.  */
>> 
>
> What I'd like to see:
>
> - Refactor macros to get access to the "code addr" (ip of the fdesc or regular
>   addr), and do the atomic loads and stores against that. Or feel free to
>   delegate this to more macros at the machine level e.g.
>   DL_FIXUP_ATOMIC_LOAD_ACQUIRE_REALLY_LONG_NAME_VALUE_CODE_ADDR (value),
>   but I feel this looses out on the fact that we have generic atomic macros
>   and so could use something like DL_FIXUP_VALUE_ADDR_CODE which returns
>   the address of the code addr to load, and then you operate atomically on
>   that to do the load.
>
> - Fix ./sysdeps/hppa/dl-machine.h (elf_machine_fixup_plt) to do an atomic
>   store release to update ip. Likewise for ia64. Both should reference the
>   concurrency notes in elf/dl-runtime.c.

If we use memory fences, I believe this won't be necessary.
Do you agree?

> - Can you write a test for this? A test which starts several threads, barriers
>   them all, then starts them, and lets them try to resolve (test is compiled
>   lazy binding forced) the plt entries, then repeat for a while, each iteration
>   with a different function that hasn't been bound yet (maybe synthetically
>   generate a few thousand functions). If we ever see this fail we'll know we
>   did something wrong.

Yes.  There is already something attached to the bug report, but it does
require dozens of threads to reproduce here.
If you're fine with that, I can adapt it.

-- 
Tulio Magno
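
The fence-based alternative proposed in this message can be sketched in
ISO C11 as follows.  The names (struct guarded_entry, publish_with_fence,
lookup_with_fence) are hypothetical.  One difference from the snippet in
the message is deliberate: in ISO C11 terms the guard itself still has to
be an atomic object for the two fences to synchronize, so the sketch uses
relaxed atomic accesses where the message uses plain ones.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical entry: 'addr' is the guard, 'boundndx' the payload.  */
struct guarded_entry
{
  _Atomic uintptr_t addr;   /* Guard: 0 means "not resolved yet".  */
  unsigned int boundndx;    /* Payload, written before the guard.  */
};

/* Writer: payload, then a release fence, then a relaxed store to the
   guard.  The fence orders the payload write before the guard write.  */
void
publish_with_fence (struct guarded_entry *e, uintptr_t addr,
                    unsigned int boundndx)
{
  assert (addr != 0);       /* 0 is reserved for "not resolved".  */
  e->boundndx = boundndx;
  atomic_thread_fence (memory_order_release);
  atomic_store_explicit (&e->addr, addr, memory_order_relaxed);
}

/* Reader: relaxed load of the guard, then an acquire fence, then the
   payload.  Returns the guard value; *boundndx is set when non-zero.  */
uintptr_t
lookup_with_fence (struct guarded_entry *e, unsigned int *boundndx)
{
  uintptr_t a = atomic_load_explicit (&e->addr, memory_order_relaxed);
  atomic_thread_fence (memory_order_acquire);
  if (a != 0)
    *boundndx = e->boundndx;
  return a;
}
```

The guard is still read and written as a single word here, so the width
concern raised for hppa and ia64 applies to whichever word is chosen as
the guard, not to the whole descriptor.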

* Re: [PATCH] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-09-20 16:42   ` Tulio Magno Quites Machado Filho
@ 2018-09-20 17:04     ` Florian Weimer
  2018-09-21 17:21       ` Tulio Magno Quites Machado Filho
  0 siblings, 1 reply; 40+ messages in thread
From: Florian Weimer @ 2018-09-20 17:04 UTC (permalink / raw)
  To: Tulio Magno Quites Machado Filho
  Cc: Carlos O'Donell, libc-alpha, John David Anglin, Adhemerval Zanella

* Tulio Magno Quites Machado Filho:

> Yes.  There is already something attached to the bug report, but it does
> require dozens of threads to reproduce here.
> If you're fine with that, I can adapt it.

That's okay (depending on the exact value of “dozens” 8-).

I assume it's burning CPU all the time?  Then you should put it into
nptl/ because those tests run serialized.  If you think it should run
longer than five seconds, we perhaps should make it an xtest.

Thanks,
Florian

* Re: [PATCH] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-09-20 17:04     ` Florian Weimer
@ 2018-09-21 17:21       ` Tulio Magno Quites Machado Filho
  2018-09-21 17:24         ` Florian Weimer
  2018-09-21 17:36         ` Tulio Magno Quites Machado Filho
  0 siblings, 2 replies; 40+ messages in thread
From: Tulio Magno Quites Machado Filho @ 2018-09-21 17:21 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Carlos O'Donell, libc-alpha, John David Anglin, Adhemerval Zanella

Florian Weimer <fweimer@redhat.com> writes:

> * Tulio Magno Quites Machado Filho:
>
>> Yes.  There is already something attached to the bug report, but it does
>> require dozens of threads to reproduce here.
>> If you're fine with that, I can adapt it.
>
> That's okay (depending on the exact value of “dozens” 8-).

I'm using 70, but I believe we should limit this according to the machine.
I don't think we need to run this test on processors with sequential con

> I assume it's burning CPU all the time?  Then you should put it into
> nptl/ because those tests run serialized.  If you think it should run
> longer than five second, we perhaps should make it an xtest.

Yes, it does burn CPU for ~30s on a POWER8.

-- 
Tulio Magno
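
The shape of the stress test discussed in this subthread (several threads
released together from a barrier, each iteration racing on an entry that
has not been resolved yet) can be modelled in a standalone program.  The
sketch below is not the test attached to the bug report: it uses neither
LD_AUDIT nor lazily bound PLT entries, and NTHREADS, NSLOTS, struct slot
and run_stress are hypothetical.  It only exercises the guard/payload
protocol, with the payload made atomic so that concurrent writers of the
same value are well defined in ISO C11.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define NTHREADS 8      /* Hypothetical; the thread mentions about 70.  */
#define NSLOTS 1000     /* Hypothetical number of unresolved entries.  */

struct slot
{
  _Atomic uintptr_t addr;     /* Guard: 0 means "not resolved yet".  */
  _Atomic uintptr_t payload;  /* Must be visible once addr is non-zero.  */
};

static struct slot slots[NSLOTS];
static pthread_barrier_t barrier;
static atomic_int failures;

static void *
worker (void *arg)
{
  (void) arg;
  for (int i = 0; i < NSLOTS; i++)
    {
      struct slot *s = &slots[i];
      uintptr_t expected = (uintptr_t) i + 1;

      /* Line all threads up so they race on the same fresh slot.  */
      pthread_barrier_wait (&barrier);

      uintptr_t a = atomic_load_explicit (&s->addr, memory_order_acquire);
      if (a == 0)
        {
          /* Slow path: write the payload, then publish the guard.  */
          atomic_store_explicit (&s->payload, expected * 2,
                                 memory_order_relaxed);
          atomic_store_explicit (&s->addr, expected, memory_order_release);
        }
      else if (a != expected
               || atomic_load_explicit (&s->payload, memory_order_relaxed)
                  != expected * 2)
        /* Guard was set but the payload was stale or wrong.  */
        atomic_fetch_add (&failures, 1);
    }
  return NULL;
}

/* Run the stress loop once and return the number of inconsistencies.  */
int
run_stress (void)
{
  pthread_t threads[NTHREADS];

  atomic_store (&failures, 0);
  for (int i = 0; i < NSLOTS; i++)
    {
      atomic_store_explicit (&slots[i].addr, 0, memory_order_relaxed);
      atomic_store_explicit (&slots[i].payload, 0, memory_order_relaxed);
    }

  int rc = pthread_barrier_init (&barrier, NULL, NTHREADS);
  assert (rc == 0);
  for (int i = 0; i < NTHREADS; i++)
    {
      rc = pthread_create (&threads[i], NULL, worker, NULL);
      assert (rc == 0);
    }
  for (int i = 0; i < NTHREADS; i++)
    {
      rc = pthread_join (threads[i], NULL);
      assert (rc == 0);
    }
  pthread_barrier_destroy (&barrier);
  return atomic_load (&failures);
}
```

Calling run_stress () from main and treating a non-zero result as a
failure gives a test of the same general shape.  Replacing the
release/acquire pair with relaxed accesses would model the original bug,
which is expected to show up mainly on machines with weaker memory
ordering.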

* Re: [PATCH] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-09-21 17:21       ` Tulio Magno Quites Machado Filho
@ 2018-09-21 17:24         ` Florian Weimer
  2018-09-21 17:37           ` Tulio Magno Quites Machado Filho
  2018-09-21 17:36         ` Tulio Magno Quites Machado Filho
  1 sibling, 1 reply; 40+ messages in thread
From: Florian Weimer @ 2018-09-21 17:24 UTC (permalink / raw)
  To: Tulio Magno Quites Machado Filho
  Cc: Carlos O'Donell, libc-alpha, John David Anglin, Adhemerval Zanella

* Tulio Magno Quites Machado Filho:

>> I assume it's burning CPU all the time?  Then you should put it into
>> nptl/ because those tests run serialized.  If you think it should run
>> longer than five second, we perhaps should make it an xtest.
>
> Yes, it does burn CPU for ~30s on a POWER8.

30 second wall clock time?  That's a bit borderline, but not excessive
considering the overall nptl testing time.  But maybe it should still be
an xtest.

Thanks,
Florian

* Re: [PATCH] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-09-21 17:21       ` Tulio Magno Quites Machado Filho
  2018-09-21 17:24         ` Florian Weimer
@ 2018-09-21 17:36         ` Tulio Magno Quites Machado Filho
  1 sibling, 0 replies; 40+ messages in thread
From: Tulio Magno Quites Machado Filho @ 2018-09-21 17:36 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Carlos O'Donell, libc-alpha, John David Anglin, Adhemerval Zanella

Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com> writes:

> Florian Weimer <fweimer@redhat.com> writes:
>
>> * Tulio Magno Quites Machado Filho:
>>
>>> Yes.  There is already something attached to the bug report, but it does
>>> require dozens of threads to reproduce here.
>>> If you're fine with that, I can adapt it.
>>
>> That's okay (depending on the exact value of “dozens” 8-).
>
> I'm using 70, but I believe we should limit this according to the machine.
> I don't think we need to run this test on processors with sequential con

Oops.  I hit send too soon. I meant to say: I don't think we need to run this
test on processors with stronger memory ordering.  Except for the extra time
when testing, it wouldn't hurt, though.

-- 
Tulio Magno

* Re: [PATCH] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-09-21 17:24         ` Florian Weimer
@ 2018-09-21 17:37           ` Tulio Magno Quites Machado Filho
  0 siblings, 0 replies; 40+ messages in thread
From: Tulio Magno Quites Machado Filho @ 2018-09-21 17:37 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Carlos O'Donell, libc-alpha, John David Anglin, Adhemerval Zanella

Florian Weimer <fweimer@redhat.com> writes:

> * Tulio Magno Quites Machado Filho:
>
>>> I assume it's burning CPU all the time?  Then you should put it into
>>> nptl/ because those tests run serialized.  If you think it should run
>>> longer than five second, we perhaps should make it an xtest.
>>
>> Yes, it does burn CPU for ~30s on a POWER8.
>
> 30 second wall clock time?  That's a bit borderline, but not excessive
> considering the overall nptl testing time.  But maybe it should still be
> an xtest.

Yes.

-- 
Tulio Magno

* [PATCHv2] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-09-20  0:16 ` Carlos O'Donell
  2018-09-20  1:59   ` John David Anglin
  2018-09-20 16:42   ` Tulio Magno Quites Machado Filho
@ 2018-10-08 19:28   ` Tulio Magno Quites Machado Filho
  2018-10-08 19:45     ` Florian Weimer
  2 siblings, 1 reply; 40+ messages in thread
From: Tulio Magno Quites Machado Filho @ 2018-10-08 19:28 UTC (permalink / raw)
  To: Carlos O'Donell, libc-alpha, John David Anglin,
	Adhemerval Zanella, Joseph Myers, Florian Weimer

I suspect this patch doesn't address all the comments from v1.
However, I believe some of the open questions/comments may not be
necessary anymore after the latest changes.

I've decided to not add the new test to xtests, because it executes in
less than 3s in most of my tests.  There is just a single case that
takes up to 30s.

Changes since v1:

 - Fixed the coding style issues.
 - Replaced atomic loads/store with memory fences.
 - Added a test.

---- 8< ----

The field reloc_result->addr is used to indicate if the rest of the
fields of reloc_result have already been written, creating a
data-dependency order.
The read of reloc_result->addr into the variable value must complete
before the rest of the fields of reloc_result are read.
Likewise, the writes to the other fields of reloc_result must
complete before reloc_result->addr is updated.

2018-10-08  Tulio Magno Quites Machado Filho  <tuliom@linux.ibm.com>

	[BZ #23690]
	* elf/dl-runtime.c (_dl_profile_fixup): Guarantee memory
	modification order when accessing reloc_result->addr.
	* nptl/Makefile (tests): Add tst-audit-threads.
	(modules-names): Add tst-audit-threads-mod1 and
	tst-audit-threads-mod2.
	Add rules to build tst-audit-threads.
	* nptl/tst-audit-threads-mod1.c: New file.
	* nptl/tst-audit-threads-mod2.c: Likewise.
	* nptl/tst-audit-threads.c: Likewise.
	* nptl/tst-audit-threads.h: Likewise.

Signed-off-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
---
 elf/dl-runtime.c              | 19 ++++++++-
 nptl/Makefile                 | 14 ++++++-
 nptl/tst-audit-threads-mod1.c | 38 ++++++++++++++++++
 nptl/tst-audit-threads-mod2.c | 22 +++++++++++
 nptl/tst-audit-threads.c      | 92 +++++++++++++++++++++++++++++++++++++++++++
 nptl/tst-audit-threads.h      | 86 ++++++++++++++++++++++++++++++++++++++++
 6 files changed, 267 insertions(+), 4 deletions(-)
 create mode 100644 nptl/tst-audit-threads-mod1.c
 create mode 100644 nptl/tst-audit-threads-mod2.c
 create mode 100644 nptl/tst-audit-threads.c
 create mode 100644 nptl/tst-audit-threads.h

diff --git a/elf/dl-runtime.c b/elf/dl-runtime.c
index 63bbc89776..c1ba372bd7 100644
--- a/elf/dl-runtime.c
+++ b/elf/dl-runtime.c
@@ -183,9 +183,18 @@ _dl_profile_fixup (
   /* This is the address in the array where we store the result of previous
      relocations.  */
   struct reloc_result *reloc_result = &l->l_reloc_result[reloc_index];
-  DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
 
+  /* CONCURRENCY NOTES:
+
+     The following code uses DL_FIXUP_VALUE_CODE_ADDR to access a potential
+     member of reloc_result->addr to indicate if it is the first time this
+     object is being relocated.
+     Reading/Writing from/to reloc_result->addr must not happen before previous
+     writes to reloc_result complete as they could end-up with an incomplete
+     struct.  */
+  DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
   DL_FIXUP_VALUE_TYPE value = *resultp;
+  atomic_thread_fence_acquire ();
   if (DL_FIXUP_VALUE_CODE_ADDR (value) == 0)
     {
       /* This is the first time we have to relocate this object.  */
@@ -346,7 +355,13 @@ _dl_profile_fixup (
 
       /* Store the result for later runs.  */
       if (__glibc_likely (! GLRO(dl_bind_not)))
-	*resultp = value;
+	{
+	  /* Guarantee all previous writes complete before
+	     resultp (aka. reloc_result->addr) is updated.  See CONCURRENCY
+	     NOTES earlier  */
+	  atomic_thread_fence_release ();
+	  *resultp = value;
+	}
     }
 
   /* By default we do not call the pltexit function.  */
diff --git a/nptl/Makefile b/nptl/Makefile
index be8066524c..5aed247a9d 100644
--- a/nptl/Makefile
+++ b/nptl/Makefile
@@ -382,7 +382,8 @@ tests += tst-cancelx2 tst-cancelx3 tst-cancelx4 tst-cancelx5 \
 	 tst-cleanupx0 tst-cleanupx1 tst-cleanupx2 tst-cleanupx3 tst-cleanupx4 \
 	 tst-oncex3 tst-oncex4
 ifeq ($(build-shared),yes)
-tests += tst-atfork2 tst-tls4 tst-_res1 tst-fini1 tst-compat-forwarder
+tests += tst-atfork2 tst-tls4 tst-_res1 tst-fini1 tst-compat-forwarder \
+	 tst-audit-threads
 tests-internal += tst-tls3 tst-tls3-malloc tst-tls5 tst-stackguard1
 tests-nolibpthread += tst-fini1
 ifeq ($(have-z-execstack),yes)
@@ -394,7 +395,8 @@ modules-names = tst-atfork2mod tst-tls3mod tst-tls4moda tst-tls4modb \
 		tst-tls5mod tst-tls5moda tst-tls5modb tst-tls5modc \
 		tst-tls5modd tst-tls5mode tst-tls5modf tst-stack4mod \
 		tst-_res1mod1 tst-_res1mod2 tst-execstack-mod tst-fini1mod \
-		tst-join7mod tst-compat-forwarder-mod
+		tst-join7mod tst-compat-forwarder-mod tst-audit-threads-mod1 \
+		tst-audit-threads-mod2
 extra-test-objs += $(addsuffix .os,$(strip $(modules-names))) \
 		   tst-cleanup4aux.o tst-cleanupx4aux.o
 test-extras += tst-cleanup4aux tst-cleanupx4aux
@@ -709,6 +711,14 @@ endif
 
 $(objpfx)tst-compat-forwarder: $(objpfx)tst-compat-forwarder-mod.so
 
+ifeq ($(run-built-tests),yes)
+ifeq (yes,$(build-shared))
+$(objpfx)tst-audit-threads.out: $(objpfx)tst-audit-threads-mod1.so
+$(objpfx)tst-audit-threads: $(objpfx)tst-audit-threads-mod2.so
+tst-audit-threads-ENV = LD_AUDIT=$(objpfx)tst-audit-threads-mod1.so
+endif
+endif
+
 # The tests here better do not run in parallel
 ifneq ($(filter %tests,$(MAKECMDGOALS)),)
 .NOTPARALLEL:
diff --git a/nptl/tst-audit-threads-mod1.c b/nptl/tst-audit-threads-mod1.c
new file mode 100644
index 0000000000..e2d3f78bae
--- /dev/null
+++ b/nptl/tst-audit-threads-mod1.c
@@ -0,0 +1,38 @@
+/* Dummy audit library for test-audit-threads.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <elf.h>
+#include <link.h>
+#include <stdio.h>
+#include <assert.h>
+#include <string.h>
+
+volatile int count = 0;
+
+unsigned int
+la_version(unsigned int ver)
+{
+  return 1;
+}
+
+unsigned int
+la_objopen(struct link_map *map, Lmid_t lmid, uintptr_t *cookie)
+{
+  return LA_FLG_BINDTO | LA_FLG_BINDFROM;
+}
diff --git a/nptl/tst-audit-threads-mod2.c b/nptl/tst-audit-threads-mod2.c
new file mode 100644
index 0000000000..6ceedb0196
--- /dev/null
+++ b/nptl/tst-audit-threads-mod2.c
@@ -0,0 +1,22 @@
+/* Shared object with a huge number of functions for test-audit-threads.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* Define all the retNumN functions.  */
+#define definenum
+#include "tst-audit-threads.h"
diff --git a/nptl/tst-audit-threads.c b/nptl/tst-audit-threads.c
new file mode 100644
index 0000000000..93ddebaecb
--- /dev/null
+++ b/nptl/tst-audit-threads.c
@@ -0,0 +1,92 @@
+/* Test multi-threading using LD_AUDIT.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* This test uses a dummy LD_AUDIT library (test-audit-threads-mod1) and a
+   library with a huge number of functions in order to validate lazy symbol
+   binding with an audit library.  */
+
+#include <pthread.h>
+#include <strings.h>
+#include <stdlib.h>
+#include <sys/sysinfo.h>
+
+static int do_test (void);
+
+/* This test usually takes less than 3s to run.  However, there are cases that
+   take up to 30s.  */
+#define TIMEOUT 60
+#define TEST_FUNCTION do_test ()
+#include "../test-skeleton.c"
+
+#define externnum
+#include "tst-audit-threads.h"
+#undef externnum
+
+int num_threads;
+pthread_barrier_t barrier;
+
+void
+sync_all (int num)
+{
+  pthread_barrier_wait (&barrier);
+}
+
+void
+call_all_ret_nums (void)
+{
+#define callnum
+#include "tst-audit-threads.h"
+#undef callnum
+}
+
+void *
+thread_main (void *unused)
+{
+  call_all_ret_nums ();
+  return NULL;
+}
+
+#define STR2(X) #X
+#define STR(X) STR2(X)
+
+static int
+do_test (void)
+{
+  int i;
+  pthread_t *threads;
+
+  num_threads = get_nprocs ();
+  if (num_threads <= 1)
+    num_threads = 2;
+
+  /* Used to synchronize all the threads after calling each retNumN.  */
+  pthread_barrier_init (&barrier, NULL, num_threads);
+
+  threads = (pthread_t *) malloc (num_threads * sizeof(pthread_t));
+  bzero (threads, num_threads * sizeof(pthread_t));
+  for (i = 0; i < num_threads; i++)
+    pthread_create(threads + i, NULL, thread_main, NULL);
+
+  for (i = 0; i < num_threads; i++)
+    pthread_join(threads[i], NULL);
+
+  free (threads);
+
+  return 0;
+}
diff --git a/nptl/tst-audit-threads.h b/nptl/tst-audit-threads.h
new file mode 100644
index 0000000000..c2b4d1d589
--- /dev/null
+++ b/nptl/tst-audit-threads.h
@@ -0,0 +1,86 @@
+/* Helper header for test-audit-threads.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#define CONCAT(a, b) a ## b
+#define NUM(x, y) CONCAT (x, y)
+
+#define FUNC10(x)	\
+  FUNC (NUM (x, 0));	\
+  FUNC (NUM (x, 1));	\
+  FUNC (NUM (x, 2));	\
+  FUNC (NUM (x, 3));	\
+  FUNC (NUM (x, 4));	\
+  FUNC (NUM (x, 5));	\
+  FUNC (NUM (x, 6));	\
+  FUNC (NUM (x, 7));	\
+  FUNC (NUM (x, 8));	\
+  FUNC (NUM (x, 9))
+
+#define FUNC100(x)	\
+  FUNC10 (NUM (x, 0));	\
+  FUNC10 (NUM (x, 1));	\
+  FUNC10 (NUM (x, 2));	\
+  FUNC10 (NUM (x, 3));	\
+  FUNC10 (NUM (x, 4));	\
+  FUNC10 (NUM (x, 5));	\
+  FUNC10 (NUM (x, 6));	\
+  FUNC10 (NUM (x, 7));	\
+  FUNC10 (NUM (x, 8));	\
+  FUNC10 (NUM (x, 9))
+
+#define FUNC1000(x)		\
+  FUNC100 (NUM (x, 0));		\
+  FUNC100 (NUM (x, 1));		\
+  FUNC100 (NUM (x, 2));		\
+  FUNC100 (NUM (x, 3));		\
+  FUNC100 (NUM (x, 4));		\
+  FUNC100 (NUM (x, 5));		\
+  FUNC100 (NUM (x, 6));		\
+  FUNC100 (NUM (x, 7));		\
+  FUNC100 (NUM (x, 8));		\
+  FUNC100 (NUM (x, 9))
+
+#define FUNC10000()	\
+  FUNC1000 (1);		\
+  FUNC1000 (2);		\
+  FUNC1000 (3);		\
+  FUNC1000 (4);		\
+  FUNC1000 (5);		\
+  FUNC1000 (6);		\
+  FUNC1000 (7);		\
+  FUNC1000 (8);		\
+  FUNC1000 (9)
+
+#ifdef FUNC
+# undef FUNC
+#endif
+
+#ifdef externnum
+# define FUNC(x) extern int CONCAT (retNum, x) (void)
+#endif
+
+#ifdef definenum
+# define FUNC(x) int CONCAT (retNum, x) (void) { return x; }
+#endif
+
+#ifdef callnum
+# define FUNC(x) CONCAT (retNum, x) (); sync_all (x)
+#endif
+
+FUNC10000 ();
-- 
2.14.4

* Re: [PATCHv2] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-10-08 19:28   ` [PATCHv2] " Tulio Magno Quites Machado Filho
@ 2018-10-08 19:45     ` Florian Weimer
  2018-10-11  6:15       ` [PATCHv3] " Tulio Magno Quites Machado Filho
  0 siblings, 1 reply; 40+ messages in thread
From: Florian Weimer @ 2018-10-08 19:45 UTC (permalink / raw)
  To: Tulio Magno Quites Machado Filho
  Cc: Carlos O'Donell, libc-alpha, John David Anglin,
	Adhemerval Zanella, Joseph Myers, Florian Weimer

* Tulio Magno Quites Machado Filho:

> I suspect this patch doesn't address all the comments from v1.
> However, I believe some of the open questions/comments may not be
> necessary anymore after the latest changes.
>
> I've decided to not add the new test to xtests, because it executes in
> less than 3s in most of my tests.  There is just a single case that
> takes up to 30s.
>
> Changes since v1:
>
>  - Fixed the coding style issues.
>  - Replaced atomic loads/store with memory fences.
>  - Added a test.

I don't think the fences are correct, they still need to be combined
with relaxed MO loads and stores.

Does the issue that Carlos mentioned really show up in cross-builds?

> diff --git a/nptl/tst-audit-threads-mod1.c b/nptl/tst-audit-threads-mod1.c
> new file mode 100644
> index 0000000000..e2d3f78bae
> --- /dev/null
> +++ b/nptl/tst-audit-threads-mod1.c

> +la_version(unsigned int ver)

> +la_objopen(struct link_map *map, Lmid_t lmid, uintptr_t *cookie)

Style: missing space before (.

> diff --git a/nptl/tst-audit-threads.c b/nptl/tst-audit-threads.c
> new file mode 100644
> index 0000000000..93ddebaecb
> --- /dev/null
> +++ b/nptl/tst-audit-threads.c

> +  /* Used to synchronize all the threads after calling each retNumN.  */
> +  pthread_barrier_init (&barrier, NULL, num_threads);

xpthread_barrier_init.

> +  threads = (pthread_t *) malloc (num_threads * sizeof(pthread_t));
> +  bzero (threads, num_threads * sizeof(pthread_t));

xcalloc or xmalloc.  But bzero does not appear to be required.

> +  for (i = 0; i < num_threads; i++)
> +    pthread_create(threads + i, NULL, thread_main, NULL);

xpthread_create.

> +  for (i = 0; i < num_threads; i++)
> +    pthread_join(threads[i], NULL);

xpthread_join.

Rest looks okay.

Thanks,
Florian

* [PATCHv3] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-10-08 19:45     ` Florian Weimer
@ 2018-10-11  6:15       ` Tulio Magno Quites Machado Filho
  2018-10-12  1:08         ` Carlos O'Donell
  0 siblings, 1 reply; 40+ messages in thread
From: Tulio Magno Quites Machado Filho @ 2018-10-11  6:15 UTC (permalink / raw)
  To: Florian Weimer, Carlos O'Donell, libc-alpha,
	John David Anglin, Adhemerval Zanella, Joseph Myers,
	Florian Weimer

Florian Weimer <fw@deneb.enyo.de> writes:

> * Tulio Magno Quites Machado Filho:
>
>> I suspect this patch doesn't address all the comments from v1.
>> However, I believe some of the open questions/comments may not be
>> necessary anymore after the latest changes.
>>
>> I've decided to not add the new test to xtests, because it executes in
>> less than 3s in most of my tests.  There is just a single case that
>> takes up to 30s.
>>
>> Changes since v1:
>>
>>  - Fixed the coding style issues.
>>  - Replaced atomic loads/store with memory fences.
>>  - Added a test.
>
> I don't think the fences are correct, they still need to be combined
> with relaxed MO loads and stores.
>
> Does the issue that Carlos mentioned really show up in cross-builds?

Yes, it does fail on hppa and ia64.
But v3 (using thread fences) passes on build-many-glibcs.

Changes since v2:

 - Fixed coding style in nptl/tst-audit-threads-mod1.c.
 - Replaced pthread.h functions with their support/xthread.h counterparts.
 - Replaced malloc() with xcalloc() in nptl/tst-audit-threads.c.
 - Removed bzero().
 - Reduced the number of functions to 7k in order to fit the relocation
   limit of some architectures, e.g. m68k, mips.
 - Fixed issues in nptl/Makefile.

Changes since v1:

 - Fixed the coding style issues.
 - Replaced atomic loads/store with memory fences.
 - Added a test.

---- 8< ----

The field reloc_result->addr is used to indicate if the rest of the
fields of reloc_result have already been written, creating a
data-dependency order.
The read of reloc_result->addr into the variable value must complete
before the rest of the fields of reloc_result are read.
Likewise, the writes to the other fields of reloc_result must
complete before reloc_result->addr is updated.

Tested with build-many-glibcs.

2018-10-10  Tulio Magno Quites Machado Filho  <tuliom@linux.ibm.com>

	[BZ #23690]
	* elf/dl-runtime.c (_dl_profile_fixup): Guarantee memory
	modification order when accessing reloc_result->addr.
	* nptl/Makefile (tests): Add tst-audit-threads.
	(modules-names): Add tst-audit-threads-mod1 and
	tst-audit-threads-mod2.
	Add rules to build tst-audit-threads.
	* nptl/tst-audit-threads-mod1.c: New file.
	* nptl/tst-audit-threads-mod2.c: Likewise.
	* nptl/tst-audit-threads.c: Likewise.
	* nptl/tst-audit-threads.h: Likewise.

Signed-off-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
---
 elf/dl-runtime.c              | 19 ++++++++-
 nptl/Makefile                 | 10 ++++-
 nptl/tst-audit-threads-mod1.c | 38 ++++++++++++++++++
 nptl/tst-audit-threads-mod2.c | 22 +++++++++++
 nptl/tst-audit-threads.c      | 91 +++++++++++++++++++++++++++++++++++++++++++
 nptl/tst-audit-threads.h      | 84 +++++++++++++++++++++++++++++++++++++++
 6 files changed, 260 insertions(+), 4 deletions(-)
 create mode 100644 nptl/tst-audit-threads-mod1.c
 create mode 100644 nptl/tst-audit-threads-mod2.c
 create mode 100644 nptl/tst-audit-threads.c
 create mode 100644 nptl/tst-audit-threads.h

diff --git a/elf/dl-runtime.c b/elf/dl-runtime.c
index 63bbc89776..c1ba372bd7 100644
--- a/elf/dl-runtime.c
+++ b/elf/dl-runtime.c
@@ -183,9 +183,18 @@ _dl_profile_fixup (
   /* This is the address in the array where we store the result of previous
      relocations.  */
   struct reloc_result *reloc_result = &l->l_reloc_result[reloc_index];
-  DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
 
+  /* CONCURRENCY NOTES:
+
+     The following code uses DL_FIXUP_VALUE_CODE_ADDR to access a potential
+     member of reloc_result->addr to indicate if it is the first time this
+     object is being relocated.
+     Reading/Writing from/to reloc_result->addr must not happen before previous
+     writes to reloc_result complete as they could end-up with an incomplete
+     struct.  */
+  DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
   DL_FIXUP_VALUE_TYPE value = *resultp;
+  atomic_thread_fence_acquire ();
   if (DL_FIXUP_VALUE_CODE_ADDR (value) == 0)
     {
       /* This is the first time we have to relocate this object.  */
@@ -346,7 +355,13 @@ _dl_profile_fixup (
 
       /* Store the result for later runs.  */
       if (__glibc_likely (! GLRO(dl_bind_not)))
-	*resultp = value;
+	{
+	  /* Guarantee all previous writes complete before
+	     resultp (aka. reloc_result->addr) is updated.  See CONCURRENCY
+	     NOTES earlier  */
+	  atomic_thread_fence_release ();
+	  *resultp = value;
+	}
     }
 
   /* By default we do not call the pltexit function.  */
diff --git a/nptl/Makefile b/nptl/Makefile
index be8066524c..48aba579c0 100644
--- a/nptl/Makefile
+++ b/nptl/Makefile
@@ -382,7 +382,8 @@ tests += tst-cancelx2 tst-cancelx3 tst-cancelx4 tst-cancelx5 \
 	 tst-cleanupx0 tst-cleanupx1 tst-cleanupx2 tst-cleanupx3 tst-cleanupx4 \
 	 tst-oncex3 tst-oncex4
 ifeq ($(build-shared),yes)
-tests += tst-atfork2 tst-tls4 tst-_res1 tst-fini1 tst-compat-forwarder
+tests += tst-atfork2 tst-tls4 tst-_res1 tst-fini1 tst-compat-forwarder \
+	 tst-audit-threads
 tests-internal += tst-tls3 tst-tls3-malloc tst-tls5 tst-stackguard1
 tests-nolibpthread += tst-fini1
 ifeq ($(have-z-execstack),yes)
@@ -394,7 +395,8 @@ modules-names = tst-atfork2mod tst-tls3mod tst-tls4moda tst-tls4modb \
 		tst-tls5mod tst-tls5moda tst-tls5modb tst-tls5modc \
 		tst-tls5modd tst-tls5mode tst-tls5modf tst-stack4mod \
 		tst-_res1mod1 tst-_res1mod2 tst-execstack-mod tst-fini1mod \
-		tst-join7mod tst-compat-forwarder-mod
+		tst-join7mod tst-compat-forwarder-mod tst-audit-threads-mod1 \
+		tst-audit-threads-mod2
 extra-test-objs += $(addsuffix .os,$(strip $(modules-names))) \
 		   tst-cleanup4aux.o tst-cleanupx4aux.o
 test-extras += tst-cleanup4aux tst-cleanupx4aux
@@ -709,6 +711,10 @@ endif
 
 $(objpfx)tst-compat-forwarder: $(objpfx)tst-compat-forwarder-mod.so
 
+$(objpfx)tst-audit-threads: $(objpfx)tst-audit-threads-mod2.so
+$(objpfx)tst-audit-threads.out: $(objpfx)tst-audit-threads-mod1.so
+tst-audit-threads-ENV = LD_AUDIT=$(objpfx)tst-audit-threads-mod1.so
+
 # The tests here better do not run in parallel
 ifneq ($(filter %tests,$(MAKECMDGOALS)),)
 .NOTPARALLEL:
diff --git a/nptl/tst-audit-threads-mod1.c b/nptl/tst-audit-threads-mod1.c
new file mode 100644
index 0000000000..194c65a6bb
--- /dev/null
+++ b/nptl/tst-audit-threads-mod1.c
@@ -0,0 +1,38 @@
+/* Dummy audit library for test-audit-threads.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <elf.h>
+#include <link.h>
+#include <stdio.h>
+#include <assert.h>
+#include <string.h>
+
+volatile int count = 0;
+
+unsigned int
+la_version (unsigned int ver)
+{
+  return 1;
+}
+
+unsigned int
+la_objopen (struct link_map *map, Lmid_t lmid, uintptr_t *cookie)
+{
+  return LA_FLG_BINDTO | LA_FLG_BINDFROM;
+}
diff --git a/nptl/tst-audit-threads-mod2.c b/nptl/tst-audit-threads-mod2.c
new file mode 100644
index 0000000000..6ceedb0196
--- /dev/null
+++ b/nptl/tst-audit-threads-mod2.c
@@ -0,0 +1,22 @@
+/* Shared object with a huge number of functions for test-audit-threads.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* Define all the retNumN functions.  */
+#define definenum
+#include "tst-audit-threads.h"
diff --git a/nptl/tst-audit-threads.c b/nptl/tst-audit-threads.c
new file mode 100644
index 0000000000..0c81edc762
--- /dev/null
+++ b/nptl/tst-audit-threads.c
@@ -0,0 +1,91 @@
+/* Test multi-threading using LD_AUDIT.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* This test uses a dummy LD_AUDIT library (test-audit-threads-mod1) and a
+   library with a huge number of functions in order to validate lazy symbol
+   binding with an audit library.  */
+
+#include <support/xthread.h>
+#include <strings.h>
+#include <stdlib.h>
+#include <sys/sysinfo.h>
+
+static int do_test (void);
+
+/* This test usually takes less than 3s to run.  However, there are cases that
+   take up to 30s.  */
+#define TIMEOUT 60
+#define TEST_FUNCTION do_test ()
+#include "../test-skeleton.c"
+
+#define externnum
+#include "tst-audit-threads.h"
+#undef externnum
+
+int num_threads;
+pthread_barrier_t barrier;
+
+void
+sync_all (int num)
+{
+  pthread_barrier_wait (&barrier);
+}
+
+void
+call_all_ret_nums (void)
+{
+#define callnum
+#include "tst-audit-threads.h"
+#undef callnum
+}
+
+void *
+thread_main (void *unused)
+{
+  call_all_ret_nums ();
+  return NULL;
+}
+
+#define STR2(X) #X
+#define STR(X) STR2(X)
+
+static int
+do_test (void)
+{
+  int i;
+  pthread_t *threads;
+
+  num_threads = get_nprocs ();
+  if (num_threads <= 1)
+    num_threads = 2;
+
+  /* Used to synchronize all the threads after calling each retNumN.  */
+  xpthread_barrier_init (&barrier, NULL, num_threads);
+
+  threads = (pthread_t *) xcalloc (num_threads, sizeof(pthread_t));
+  for (i = 0; i < num_threads; i++)
+    threads[i] = xpthread_create(NULL, thread_main, NULL);
+
+  for (i = 0; i < num_threads; i++)
+    xpthread_join(threads[i]);
+
+  free (threads);
+
+  return 0;
+}
diff --git a/nptl/tst-audit-threads.h b/nptl/tst-audit-threads.h
new file mode 100644
index 0000000000..cb17645f4b
--- /dev/null
+++ b/nptl/tst-audit-threads.h
@@ -0,0 +1,84 @@
+/* Helper header for test-audit-threads.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#define CONCAT(a, b) a ## b
+#define NUM(x, y) CONCAT (x, y)
+
+#define FUNC10(x)	\
+  FUNC (NUM (x, 0));	\
+  FUNC (NUM (x, 1));	\
+  FUNC (NUM (x, 2));	\
+  FUNC (NUM (x, 3));	\
+  FUNC (NUM (x, 4));	\
+  FUNC (NUM (x, 5));	\
+  FUNC (NUM (x, 6));	\
+  FUNC (NUM (x, 7));	\
+  FUNC (NUM (x, 8));	\
+  FUNC (NUM (x, 9))
+
+#define FUNC100(x)	\
+  FUNC10 (NUM (x, 0));	\
+  FUNC10 (NUM (x, 1));	\
+  FUNC10 (NUM (x, 2));	\
+  FUNC10 (NUM (x, 3));	\
+  FUNC10 (NUM (x, 4));	\
+  FUNC10 (NUM (x, 5));	\
+  FUNC10 (NUM (x, 6));	\
+  FUNC10 (NUM (x, 7));	\
+  FUNC10 (NUM (x, 8));	\
+  FUNC10 (NUM (x, 9))
+
+#define FUNC1000(x)		\
+  FUNC100 (NUM (x, 0));		\
+  FUNC100 (NUM (x, 1));		\
+  FUNC100 (NUM (x, 2));		\
+  FUNC100 (NUM (x, 3));		\
+  FUNC100 (NUM (x, 4));		\
+  FUNC100 (NUM (x, 5));		\
+  FUNC100 (NUM (x, 6));		\
+  FUNC100 (NUM (x, 7));		\
+  FUNC100 (NUM (x, 8));		\
+  FUNC100 (NUM (x, 9))
+
+#define FUNC7000()	\
+  FUNC1000 (1);		\
+  FUNC1000 (2);		\
+  FUNC1000 (3);		\
+  FUNC1000 (4);		\
+  FUNC1000 (5);		\
+  FUNC1000 (6);		\
+  FUNC1000 (7);
+
+#ifdef FUNC
+# undef FUNC
+#endif
+
+#ifdef externnum
+# define FUNC(x) extern int CONCAT (retNum, x) (void)
+#endif
+
+#ifdef definenum
+# define FUNC(x) int CONCAT (retNum, x) (void) { return x; }
+#endif
+
+#ifdef callnum
+# define FUNC(x) CONCAT (retNum, x) (); sync_all (x)
+#endif
+
+FUNC7000 ();
-- 
2.14.4

* Re: [PATCH] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-09-20 13:34       ` John David Anglin
@ 2018-10-11 21:26         ` Carlos O'Donell
  0 siblings, 0 replies; 40+ messages in thread
From: Carlos O'Donell @ 2018-10-11 21:26 UTC (permalink / raw)
  To: John David Anglin, Tulio Magno Quites Machado Filho, libc-alpha,
	Adhemerval Zanella

On 9/20/18 9:34 AM, John David Anglin wrote:
> On 2018-09-19 10:00 PM, Carlos O'Donell wrote:
>> On 09/19/2018 09:59 PM, John David Anglin wrote:
>>> On 2018-09-19 8:16 PM, Carlos O'Donell wrote:
>>>>>     DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
>>>>> +  DL_FIXUP_VALUE_TYPE value = atomic_load_acquire(resultp);
>>>> You are potentially requiring an atomic load of a structure whose size can be
>>>> an arbitrary size depending on machine and ABI design (64-bit for 32-bit hppa,
>>>> and 128-bit for 64-bit ia64). These architectures might not have such wide
>>>> atomic operations, and ...
>>> We have implemented 64-bit atomic loads and stores for 32-bit hppa. They are not well tested but
>>> they might work.  They use floating point loads and stores, and kernel helper.  The code is pretty horrific :-(
>> We only need to use the fdesc->ip as the guard, so we don't really need the 64-bit
>> atomic, but other algorithms like the new pthread condvars can use them effectively
>> to accelerate and avoid 2 lws kernel helper calls and instead use 1 lws kernel helper
>> 64-bit atomic.
> Regarding using fdesc->ip as the guard, The gp is loaded both before and after the ip on hppa.
> For example, $$dyncall loads gp before the branch.  This could be changed at the cost of one
> instruction.  Stubs load gp after ip.  I don't think this is easy to change.

This is fine. The point is that ip is the guard, and so all other elements of the fdesc (only
gp in this case) need to be written to before the guard is updated to indicate that the
relocation of the fdesc is complete.

This is now fixed with Tulio's changes, which just use atomic_thread_fence_acquire and
atomic_thread_fence_release in _dl_profile_fixup.
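
As an illustration of the ordering requirement above, here is a minimal
sketch using C11 <stdatomic.h> rather than glibc's internal atomic macros.
The struct fdesc layout and the fdesc_publish/fdesc_consume helpers are
hypothetical names invented for this example. It only models software
readers and writers of a descriptor, not how PLT stubs or $$dyncall load
ip and gp.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical two-word function descriptor; the real layout is
   architecture specific.  */
struct fdesc
{
  _Atomic uintptr_t ip;   /* Guard: zero until the descriptor is complete.  */
  uintptr_t gp;           /* Payload: must be written before the guard.  */
};

/* Writer: fill in the payload, then publish it through the guard.  */
static void
fdesc_publish (struct fdesc *fd, uintptr_t ip, uintptr_t gp)
{
  fd->gp = gp;
  /* Order the write of gp before the store to the guard.  */
  atomic_thread_fence (memory_order_release);
  atomic_store_explicit (&fd->ip, ip, memory_order_relaxed);
}

/* Reader: return 1 and fill *ip and *gp if the descriptor has been
   published, 0 otherwise.  */
static int
fdesc_consume (struct fdesc *fd, uintptr_t *ip, uintptr_t *gp)
{
  uintptr_t guard = atomic_load_explicit (&fd->ip, memory_order_relaxed);
  /* Order the load of the guard before the read of gp.  */
  atomic_thread_fence (memory_order_acquire);
  if (guard == 0)
    return 0;
  *ip = guard;
  *gp = fd->gp;
  return 1;
}
```

A reader that observes a non-zero ip is then expected to also observe the
gp written before it, because the release fence precedes the guard store
and the acquire fence follows the guard load.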

-- 
Cheers,
Carlos.

* Re: [PATCHv3] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-10-11  6:15       ` [PATCHv3] " Tulio Magno Quites Machado Filho
@ 2018-10-12  1:08         ` Carlos O'Donell
  2018-10-15 13:01           ` Florian Weimer
  2018-10-18  2:02           ` [PATCHv4] " Tulio Magno Quites Machado Filho
  0 siblings, 2 replies; 40+ messages in thread
From: Carlos O'Donell @ 2018-10-12  1:08 UTC (permalink / raw)
  To: Tulio Magno Quites Machado Filho, Florian Weimer, libc-alpha,
	John David Anglin, Adhemerval Zanella, Joseph Myers,
	Florian Weimer

On 10/10/18 10:57 PM, Tulio Magno Quites Machado Filho wrote:
> Florian Weimer <fw@deneb.enyo.de> writes:
> 
>> * Tulio Magno Quites Machado Filho:
>>
>>> I suspect this patch doesn't address all the comments from v1.
>>> However, I believe some of the open questions/comments may not be
>>> necessary anymore after the latest changes.
>>>
>>> I've decided to not add the new test to xtests, because it executes in
>>> less than 3s in most of my tests.  There is just a single case that
>>> takes up to 30s.
>>>
>>> Changes since v1:
>>>
>>>  - Fixed the coding style issues.
>>>  - Replaced atomic loads/store with memory fences.
>>>  - Added a test.
>>
>> I don't think the fences are correct, they still need to be combined
>> with relaxed MO loads and stores.
>>
>> Does the issue that Carlos mentioned really show up in cross-builds?
> 
> Yes, it does fail on hppa and ia64.
> But v3 (using thread fences) pass on build-many-glibcs.

We will need a v4. Please review (1), (2) and (3) carefully, feel free to
ignore (4).

(1) I added a bunch of comments.

Comments added inline.

(2) -Wl,-z,now worries.

Added some things for you to check.

(3) Fence-to-fence sync.

For fence-to-fence synchronization to work we need an acquire and release
fence, and we have that.

We are missing the atomic read and write of the guard. Please review below.
Florian mentioned this in his review. He is correct.

And all the problems are back again because you can't do atomic loads of
the large guards, since they are actually the function descriptor structures.
However, this is just laziness; we used the addr because it was convenient.
It is no longer convenient. Just add an 'init' field to reloc_result and use
that as the guard to synchronize the threads against for initialization of
the results. This should solve the reloc_result problem (ignoring the issues
hppa and ia64 have with the fdesc updates across multiple threads in _dl_fixup).

(4) Review of elf_machine_fixup_plt, and DL_FIXUP_MAKE_VALUE.

I reviewed the uses of elf_machine_fixup_plt, and DL_FIXUP_MAKE_VALUE to
see if there was any other case of this problem, particularly where there
might be a case where a write happens on one thread that might not be
seen in another.

I also looked at _dl_relocate_object and the initialization of all 
l_reloc_result via calloc, and that is also covered because the
atomic_thread_fence_acquire ensures any secondary thread sees the
initialization.

So just _dl_fixup for hppa and ia64 (the case not related to this issue)
still has potential ordering issues if the compiler writes ip before gp.

Nothing for you to worry about.

> Changes since v2:
> 
>  - Fixed coding style in nptl/tst-audit-threads-mod1.c.
>  - Replaced pthreads.h functions with respective support/xthread.h ones.
>  - Replaced malloc() with xcalloc() in nptl/tst-audit-threads.c.
>  - Removed bzero().
>  - Reduced the amount of functions to 7k in order to fit the relocation
>    limit  of some architectures, e.g. m68k, mips.
>  - Fixed issues in nptl/Makefile.
> 
> Changes since v1:
> 
>  - Fixed the coding style issues.
>  - Replaced atomic loads/store with memory fences.
>  - Added a test.
> 
> ---- 8< ----
> 
> The field reloc_result->addr is used to indicate if the rest of the
> fields of reloc_result have already been written, creating a
> data-dependency order.
> Reading reloc_result->addr to the variable value requires to complete
> before reading the rest of the fields of reloc_result.
> Likewise, the writes to the other fields of the reloc_result must
> complete before reloc_result-addr is updated.
> 
> Tested with build-many-glibcs.
> 
> 2018-10-10  Tulio Magno Quites Machado Filho  <tuliom@linux.ibm.com>
> 
> 	[BZ #23690]
> 	* elf/dl-runtime.c (_dl_profile_fixup): Guarantee memory
> 	modification order when accessing reloc_result->addr.
> 	* nptl/Makefile (tests): Add tst-audit-threads.
> 	(modules-names): Add tst-audit-threads-mod1 and
> 	tst-audit-threads-mod2.
> 	Add rules to build tst-audit-threads.
> 	* nptl/tst-audit-threads-mod1.c: New file.
> 	* nptl/tst-audit-threads-mod2.c: Likewise.
> 	* nptl/tst-audit-threads.c: Likewise.
> 	* nptl/tst-audit-threads.h: Likewise.
> 
> Signed-off-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>

Please send v4.

> ---
>  elf/dl-runtime.c              | 19 ++++++++-
>  nptl/Makefile                 | 10 ++++-
>  nptl/tst-audit-threads-mod1.c | 38 ++++++++++++++++++
>  nptl/tst-audit-threads-mod2.c | 22 +++++++++++
>  nptl/tst-audit-threads.c      | 91 +++++++++++++++++++++++++++++++++++++++++++
>  nptl/tst-audit-threads.h      | 84 +++++++++++++++++++++++++++++++++++++++
>  6 files changed, 260 insertions(+), 4 deletions(-)
>  create mode 100644 nptl/tst-audit-threads-mod1.c
>  create mode 100644 nptl/tst-audit-threads-mod2.c
>  create mode 100644 nptl/tst-audit-threads.c
>  create mode 100644 nptl/tst-audit-threads.h
> 
> diff --git a/elf/dl-runtime.c b/elf/dl-runtime.c
> index 63bbc89776..c1ba372bd7 100644
> --- a/elf/dl-runtime.c
> +++ b/elf/dl-runtime.c
> @@ -183,9 +183,18 @@ _dl_profile_fixup (
>    /* This is the address in the array where we store the result of previous
>       relocations.  */
>    struct reloc_result *reloc_result = &l->l_reloc_result[reloc_index];
> -  DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
>  
> +  /* CONCURRENCY NOTES:
> +

Suggest adding:

Multiple threads may be calling the same PLT sequence and with LD_AUDIT enabled
they will be calling into _dl_profile_fixup to update the reloc_result with the
result of the lazy resolution. The reloc_result guard variable is addr, and we
use relaxed MO loads and stores to it along with atomic_thread_fence_acquire and
atomic_thread_fence_release fences to ensure that the results of the structure
are consistent with the loaded value of the guard.

> +     The following code uses DL_FIXUP_VALUE_CODE_ADDR to access a potential
> +     member of reloc_result->addr to indicate if it is the first time this
> +     object is being relocated.
> +     Reading/Writing from/to reloc_result->addr must not happen before previous
> +     writes to reloc_result complete as they could end-up with an incomplete
> +     struct.  */

OK.

> +  DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;

OK.

>    DL_FIXUP_VALUE_TYPE value = *resultp;

Not OK. This is a guard. You read it here, and write to it below.
That's a data race. Both need to be atomic accesses with any MO you want.
On hppa this will require a new enough compiler to get a 64-bit atomic load.
On ia64 I don't know if there is a usable 128-bit atomic.

The key problem is that addr is being overloaded as a guard because
it was convenient. It's non-zero when the symbol is initialized, and zero
otherwise. However, for arches with function descriptors you've found
out that using it is causing problems because it's too big for traditional atomic
operations.

What you really need is a new "init" field in reloc_result, make it a word,
and then use word-sized atomics on that with relaxed MO, and keep the fences.
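
A minimal sketch of the guard pattern suggested above, written with
standard C11 <stdatomic.h> rather than glibc's internal atomic macros.
The struct, the type chosen for 'init', and the helper names are all
hypothetical stand-ins, not the real reloc_result definition, and the
sketch does not address what happens when two threads initialize the
same entry at once.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for struct reloc_result; only the shape matters.  */
struct demo_result
{
  uintptr_t addr;      /* Payload; may be wider than a word with fdescs.  */
  unsigned int flags;  /* More payload written before publication.  */
  atomic_uint init;    /* Word-sized guard: 0 = not ready, 1 = ready.  */
};

/* Reader side: relaxed load of the guard, then an acquire fence so the
   payload read cannot be ordered before the guard load.  Returns 1 and
   fills *out if the entry was already published, 0 otherwise.  */
static int
demo_lookup (struct demo_result *r, uintptr_t *out)
{
  if (atomic_load_explicit (&r->init, memory_order_relaxed) == 0)
    return 0;
  atomic_thread_fence (memory_order_acquire);
  *out = r->addr;
  return 1;
}

/* Writer side: fill in the payload, issue a release fence, then do a
   relaxed store to the guard.  */
static void
demo_publish (struct demo_result *r, uintptr_t addr, unsigned int flags)
{
  r->addr = addr;
  r->flags = flags;
  atomic_thread_fence (memory_order_release);
  atomic_store_explicit (&r->init, 1, memory_order_relaxed);
}
```

The guard stays word-sized no matter how large the payload is, which is
the point of not reusing addr for this purpose.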

> +  atomic_thread_fence_acquire ();

OK, this acquire ensures all previous writes on threads are visible.

>    if (DL_FIXUP_VALUE_CODE_ADDR (value) == 0)

OK, either this is zero, and we redo the initialization, or it's not
and we see all the results of the previous writes because of the
atomic_thread_fence_acquire.

>      {
>        /* This is the first time we have to relocate this object.  */
> @@ -346,7 +355,13 @@ _dl_profile_fixup (
>  
>        /* Store the result for later runs.  */
>        if (__glibc_likely (! GLRO(dl_bind_not)))
> -	*resultp = value;

OK.

> +	{
> +	  /* Guarantee all previous writes complete before
> +	     resultp (aka. reloc_result->addr) is updated.  See CONCURRENCY
> +	     NOTES earlier  */
> +	  atomic_thread_fence_release ();

OK, this ensures that any writes done by the auditors are seen by
subsequent threads attempting a resolution of the same function, and this
sequences-before all the writes with the earlier acquire.

> +	  *resultp = value;

Not OK, see above, this needs to be an atomic relaxed-MO store to 'init'
or something smaller than value.

You need a guard small enough that arches will have an atomic load/store
to the size.

> +	}
>      }
>  
>    /* By default we do not call the pltexit function.  */
> diff --git a/nptl/Makefile b/nptl/Makefile
> index be8066524c..48aba579c0 100644
> --- a/nptl/Makefile
> +++ b/nptl/Makefile
> @@ -382,7 +382,8 @@ tests += tst-cancelx2 tst-cancelx3 tst-cancelx4 tst-cancelx5 \
>  	 tst-cleanupx0 tst-cleanupx1 tst-cleanupx2 tst-cleanupx3 tst-cleanupx4 \
>  	 tst-oncex3 tst-oncex4
>  ifeq ($(build-shared),yes)
> -tests += tst-atfork2 tst-tls4 tst-_res1 tst-fini1 tst-compat-forwarder
> +tests += tst-atfork2 tst-tls4 tst-_res1 tst-fini1 tst-compat-forwarder \
> +	 tst-audit-threads

OK.

>  tests-internal += tst-tls3 tst-tls3-malloc tst-tls5 tst-stackguard1
>  tests-nolibpthread += tst-fini1
>  ifeq ($(have-z-execstack),yes)
> @@ -394,7 +395,8 @@ modules-names = tst-atfork2mod tst-tls3mod tst-tls4moda tst-tls4modb \
>  		tst-tls5mod tst-tls5moda tst-tls5modb tst-tls5modc \
>  		tst-tls5modd tst-tls5mode tst-tls5modf tst-stack4mod \
>  		tst-_res1mod1 tst-_res1mod2 tst-execstack-mod tst-fini1mod \
> -		tst-join7mod tst-compat-forwarder-mod
> +		tst-join7mod tst-compat-forwarder-mod tst-audit-threads-mod1 \
> +		tst-audit-threads-mod2

OK.

>  extra-test-objs += $(addsuffix .os,$(strip $(modules-names))) \
>  		   tst-cleanup4aux.o tst-cleanupx4aux.o
>  test-extras += tst-cleanup4aux tst-cleanupx4aux
> @@ -709,6 +711,10 @@ endif
>  
>  $(objpfx)tst-compat-forwarder: $(objpfx)tst-compat-forwarder-mod.so
>  
> +$(objpfx)tst-audit-threads: $(objpfx)tst-audit-threads-mod2.so
> +$(objpfx)tst-audit-threads.out: $(objpfx)tst-audit-threads-mod1.so
> +tst-audit-threads-ENV = LD_AUDIT=$(objpfx)tst-audit-threads-mod1.so

Do we need to add -Wl,-z,lazy?

Users might have -Wl,-z,now as the default for their build?

With BIND_NOW the test doesn't test what we want.

> +
>  # The tests here better do not run in parallel
>  ifneq ($(filter %tests,$(MAKECMDGOALS)),)
>  .NOTPARALLEL:
> diff --git a/nptl/tst-audit-threads-mod1.c b/nptl/tst-audit-threads-mod1.c
> new file mode 100644
> index 0000000000..194c65a6bb
> --- /dev/null
> +++ b/nptl/tst-audit-threads-mod1.c
> @@ -0,0 +1,38 @@
> +/* Dummy audit library for test-audit-threads.
> +
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include <elf.h>
> +#include <link.h>
> +#include <stdio.h>
> +#include <assert.h>
> +#include <string.h>
> +

Suggest:

/* We must use a dummy LD_AUDIT module to force the dynamic loader to
   *not* update the real PLT, and instead use a cached value for the
   lazy resolution result. It is the update of that cached value that
   we are testing for correctness by doing this.  */

> +volatile int count = 0;
> +
> +unsigned int
> +la_version (unsigned int ver)
> +{
> +  return 1;
> +}
> +
> +unsigned int
> +la_objopen (struct link_map *map, Lmid_t lmid, uintptr_t *cookie)
> +{
> +  return LA_FLG_BINDTO | LA_FLG_BINDFROM;
> +}

I'm worried binutils will optimize away the PLT entries and this test will
pass without failing but the lazy resolution will not be tested.

Can we just *count* the number of PLT resolutions and see if they match?

Counting the PLT resolutions and using -Wl,-z,lazy (above) will mean we have
done our best to test what we intended to test.
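
A rough sketch of what counting could look like in the audit module,
assuming a 64-bit target (la_symbind64; a 32-bit target would use
la_symbind32 instead). The counter and the demo_symbind_count accessor
are hypothetical names, and how the count is compared against an
expected value is left open here.

```c
#define _GNU_SOURCE
#include <link.h>
#include <stdatomic.h>
#include <stdint.h>

/* Number of symbol bindings reported to this audit module.  */
static atomic_uint symbind_calls;

unsigned int
la_version (unsigned int version)
{
  return LAV_CURRENT;
}

unsigned int
la_objopen (struct link_map *map, Lmid_t lmid, uintptr_t *cookie)
{
  /* Ask to be notified about bindings to and from every object.  */
  return LA_FLG_BINDTO | LA_FLG_BINDFROM;
}

uintptr_t
la_symbind64 (Elf64_Sym *sym, unsigned int ndx, uintptr_t *refcook,
	      uintptr_t *defcook, unsigned int *flags, const char *symname)
{
  /* Several threads may bind concurrently, so count atomically.  */
  atomic_fetch_add_explicit (&symbind_calls, 1, memory_order_relaxed);
  /* Do not redirect the symbol; return its normal address.  */
  return sym->st_value;
}

/* Hypothetical accessor so the count can be inspected.  */
unsigned int
demo_symbind_count (void)
{
  return atomic_load_explicit (&symbind_calls, memory_order_relaxed);
}
```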

> diff --git a/nptl/tst-audit-threads-mod2.c b/nptl/tst-audit-threads-mod2.c
> new file mode 100644
> index 0000000000..6ceedb0196
> --- /dev/null
> +++ b/nptl/tst-audit-threads-mod2.c
> @@ -0,0 +1,22 @@
> +/* Shared object with a huge number of functions for test-audit-threads.
> +
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +

Suggest:

/* Define all the retNumN functions in a library.  */

Just to be clear that this must be distinct from the executable.

> +/* Define all the retNumN functions.  */
> +#define definenum
> +#include "tst-audit-threads.h"
> diff --git a/nptl/tst-audit-threads.c b/nptl/tst-audit-threads.c
> new file mode 100644
> index 0000000000..0c81edc762
> --- /dev/null
> +++ b/nptl/tst-audit-threads.c
> @@ -0,0 +1,91 @@
> +/* Test multi-threading using LD_AUDIT.
> +
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +

Suggest:

/* This test uses a dummy LD_AUDIT library (test-audit-threads-mod1) and a
   library with a huge number of functions in order to validate lazy symbol
   binding with an audit library.  We use one thread per CPU to test that
   concurrent lazy resolution does not have any defects which would cause
   the process to fail.  We use an LD_AUDIT library to force the testing of
   the relocation resolution caching code in the dynamic loader i.e. 
   _dl_runtime_profile and _dl_profile_fixup.  */

> +/* This test uses a dummy LD_AUDIT library (test-audit-threads-mod1) and a
> +   library with a huge number of functions in order to validate lazy symbol
> +   binding with an audit library.  */
> +
> +#include <support/xthread.h>
> +#include <strings.h>
> +#include <stdlib.h>
> +#include <sys/sysinfo.h>
> +
> +static int do_test (void);
> +
> +/* This test usually takes less than 3s to run.  However, there are cases that
> +   take up to 30s.  */
> +#define TIMEOUT 60
> +#define TEST_FUNCTION do_test ()
> +#include "../test-skeleton.c"
> +

Suggest:

/* Declare the functions we are going to call.  */

> +#define externnum
> +#include "tst-audit-threads.h"
> +#undef externnum
> +
> +int num_threads;
> +pthread_barrier_t barrier;
> +
> +void
> +sync_all (int num)
> +{
> +  pthread_barrier_wait (&barrier);
> +}
> +
> +void
> +call_all_ret_nums (void)
> +{

Suggest:

/* Call each function one at a time from all threads.  */

> +#define callnum
> +#include "tst-audit-threads.h"
> +#undef callnum
> +}
> +
> +void *
> +thread_main (void *unused)
> +{
> +  call_all_ret_nums ();
> +  return NULL;
> +}
> +
> +#define STR2(X) #X
> +#define STR(X) STR2(X)
> +
> +static int
> +do_test (void)
> +{
> +  int i;
> +  pthread_t *threads;
> +
> +  num_threads = get_nprocs ();
> +  if (num_threads <= 1)
> +    num_threads = 2;

OK.

> +
> +  /* Used to synchronize all the threads after calling each retNumN.  */
> +  xpthread_barrier_init (&barrier, NULL, num_threads);

OK.

> +
> +  threads = (pthread_t *) xcalloc (num_threads, sizeof(pthread_t));
> +  for (i = 0; i < num_threads; i++)
> +    threads[i] = xpthread_create(NULL, thread_main, NULL);
> +
> +  for (i = 0; i < num_threads; i++)
> +    xpthread_join(threads[i]);
> +
> +  free (threads);
> +
> +  return 0;

OK.

> +}
> diff --git a/nptl/tst-audit-threads.h b/nptl/tst-audit-threads.h
> new file mode 100644
> index 0000000000..cb17645f4b
> --- /dev/null
> +++ b/nptl/tst-audit-threads.h
> @@ -0,0 +1,84 @@
> +/* Helper header for test-audit-threads.
> +
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +

Suggest adding:

/* We use this helper to create a large number of functions, all of
   which will be resolved lazily and thus have their PLT updated.
   This is done to provide enough functions that we can statistically
   observe a thread vs. PLT resolution failure if one exists.  */

> +#define CONCAT(a, b) a ## b
> +#define NUM(x, y) CONCAT (x, y)
> +
> +#define FUNC10(x)	\
> +  FUNC (NUM (x, 0));	\
> +  FUNC (NUM (x, 1));	\
> +  FUNC (NUM (x, 2));	\
> +  FUNC (NUM (x, 3));	\
> +  FUNC (NUM (x, 4));	\
> +  FUNC (NUM (x, 5));	\
> +  FUNC (NUM (x, 6));	\
> +  FUNC (NUM (x, 7));	\
> +  FUNC (NUM (x, 8));	\
> +  FUNC (NUM (x, 9))
> +
> +#define FUNC100(x)	\
> +  FUNC10 (NUM (x, 0));	\
> +  FUNC10 (NUM (x, 1));	\
> +  FUNC10 (NUM (x, 2));	\
> +  FUNC10 (NUM (x, 3));	\
> +  FUNC10 (NUM (x, 4));	\
> +  FUNC10 (NUM (x, 5));	\
> +  FUNC10 (NUM (x, 6));	\
> +  FUNC10 (NUM (x, 7));	\
> +  FUNC10 (NUM (x, 8));	\
> +  FUNC10 (NUM (x, 9))
> +
> +#define FUNC1000(x)		\
> +  FUNC100 (NUM (x, 0));		\
> +  FUNC100 (NUM (x, 1));		\
> +  FUNC100 (NUM (x, 2));		\
> +  FUNC100 (NUM (x, 3));		\
> +  FUNC100 (NUM (x, 4));		\
> +  FUNC100 (NUM (x, 5));		\
> +  FUNC100 (NUM (x, 6));		\
> +  FUNC100 (NUM (x, 7));		\
> +  FUNC100 (NUM (x, 8));		\
> +  FUNC100 (NUM (x, 9))
> +
> +#define FUNC7000()	\
> +  FUNC1000 (1);		\
> +  FUNC1000 (2);		\
> +  FUNC1000 (3);		\
> +  FUNC1000 (4);		\
> +  FUNC1000 (5);		\
> +  FUNC1000 (6);		\
> +  FUNC1000 (7);
> +
> +#ifdef FUNC
> +# undef FUNC
> +#endif
> +
> +#ifdef externnum
> +# define FUNC(x) extern int CONCAT (retNum, x) (void)
> +#endif

OK.

> +
> +#ifdef definenum
> +# define FUNC(x) int CONCAT (retNum, x) (void) { return x; }
> +#endif

OK.

> +
> +#ifdef callnum
> +# define FUNC(x) CONCAT (retNum, x) (); sync_all (x)
> +#endif

OK.

> +
> +FUNC7000 ();
> 

OK, 7000 functions to test, all of which need resolution.
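
For readers unfamiliar with the trick, here is a miniature sketch of
the same technique with 10 functions instead of 7000. It keeps
everything in one file and uses static functions, so unlike the real
test nothing here goes through the PLT; call_all is a hypothetical
name, and only the macro expansion is being illustrated.

```c
#define CONCAT(a, b) a ## b
#define NUM(x, y) CONCAT (x, y)

/* Expand FUNC once for each of the ten suffixes x0 .. x9.  */
#define FUNC10(x) \
  FUNC (NUM (x, 0)) FUNC (NUM (x, 1)) FUNC (NUM (x, 2)) FUNC (NUM (x, 3)) \
  FUNC (NUM (x, 4)) FUNC (NUM (x, 5)) FUNC (NUM (x, 6)) FUNC (NUM (x, 7)) \
  FUNC (NUM (x, 8)) FUNC (NUM (x, 9))

/* Pass 1: define retNum10 .. retNum19, each returning its own number.  */
#define FUNC(x) static int CONCAT (retNum, x) (void) { return x; }
FUNC10 (1)
#undef FUNC

/* Pass 2: call every generated function and add up the results.  */
static int
call_all (void)
{
  int sum = 0;
#define FUNC(x) sum += CONCAT (retNum, x) ();
  FUNC10 (1)
#undef FUNC
  return sum;
}
```

Redefining FUNC between expansions is what lets one list of names
produce declarations, definitions, and calls in turn.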


-- 
Cheers,
Carlos.

* Re: [PATCHv3] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-10-12  1:08         ` Carlos O'Donell
@ 2018-10-15 13:01           ` Florian Weimer
  2018-10-15 15:10             ` Carlos O'Donell
  2018-10-18  2:02           ` [PATCHv4] " Tulio Magno Quites Machado Filho
  1 sibling, 1 reply; 40+ messages in thread
From: Florian Weimer @ 2018-10-15 13:01 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: Tulio Magno Quites Machado Filho, libc-alpha, John David Anglin,
	Adhemerval Zanella, Joseph Myers

* Carlos O'Donell:

> (3) Fence-to-fence sync.
>
> For fence-to-fence synchronization to work we need an acquire and release
> fence, and we have that.
>
> We are missing the atomic read and write of the guard. Please review below.
> Florian mentioned this in his review. He is correct.
>
> And all the problems are back again because you can't do atomic loads of
> the large guards because they are actually the function descriptor structures.
> However, this is just laziness, we used the addr because it was convenient.
> It is no longer convenient. Just add a 'init' field to reloc_result and use
> that as the guard to synchronize the threads against for initialization of
> the results. This should solve the reloc_result problem (ignoring the issues
> hppa and ia64 have with the fdesc updates across multiple threads in _dl_fixup).

I think due to various external factors, we should go with the
fence-based solution for now, and change it later to something which
uses an acquire/release on the code address, using proper atomics.

I don't want to see this bug fix blocked by ia64 and hppa.  The proper
fix needs some reshuffling of the macros here, or maybe use an unused
bit in the flags field as an indicator for initialization.

> (4) Review of elf_machine_fixup_plt, and DL_FIXUP_MAKE_VALUE.	
> 
> I reviewed the uses of elf_machine_fixup_plt, and DL_FIXUP_MAKE_VALUE to
> see if there was any other case of this problem, particularly where there
> might be a case where a write happens on one thread that might not be
> seen in another.
> 
> I also looked at _dl_relocate_object and the initialization of all 
> l_reloc_result via calloc, and that is also covered because the
> atomic_thread_fence_acquire ensures any secondary thread sees the
> initialization.

I don't think the analysis is correct.  It's up to the application to
ensure that the dlopen (or at least the call to an ELF constructor in
the new DSO) happens before a call to any function in the DSO, and this
is why there is no need to synchronize the calloc with the profiling
code.

Thanks,
Florian

* Re: [PATCHv3] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-10-15 13:01           ` Florian Weimer
@ 2018-10-15 15:10             ` Carlos O'Donell
  2018-10-17 21:25               ` Florian Weimer
  0 siblings, 1 reply; 40+ messages in thread
From: Carlos O'Donell @ 2018-10-15 15:10 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Tulio Magno Quites Machado Filho, libc-alpha, John David Anglin,
	Adhemerval Zanella, Joseph Myers

On 10/15/18 8:57 AM, Florian Weimer wrote:
> * Carlos O'Donell:
> 
>> (3) Fence-to-fence sync.
>>
>> For fence-to-fence synchronization to work we need an acquire and release
>> fence, and we have that.
>>
>> We are missing the atomic read and write of the guard. Please review below.
>> Florian mentioned this in his review. He is correct.
>>
>> And all the problems are back again because you can't do atomic loads of
>> the large guards because they are actually the function descriptor structures.
>> However, this is just laziness, we used the addr because it was convenient.
>> It is no longer convenient. Just add a 'init' field to reloc_result and use
>> that as the guard to synchronize the threads against for initialization of
>> the results. This should solve the reloc_result problem (ignoring the issues
>> hppa and ia64 have with the fdesc updates across multiple threads in _dl_fixup).
> 
> I think due to various external factors, we should go with the
> fence-based solution for now, and change it later to something which
> uses an acquire/release on the code address, using proper atomics.

Let me clarify.

The fence fix as proposed in v3 is wrong for all architectures.

We are emulating C/C++ 11 atomics within glibc, and a fence-to-fence sync
*requires* an atomic load / store of the guard, you can't use a non-atomic
access. The point of the atomic load/store is to ensure you don't have a
data race.

> I don't want to see this bug fix blocked by ia64 and hppa.  The proper
> fix needs some reshuffling of the macros here, or maybe use an unused
> bit in the flags field as an indicator for initialization.

The fix for this is straightforward.

Add a new initializer field to the reloc_result, it's an internal data
structure. It can be as big as we want and we can optimize it later.

You don't need to do any big cleanups, but we *do* have to get the
synchronization correct.

>> (4) Review of elf_machine_fixup_plt, and DL_FIXUP_MAKE_VALUE.	
>>
>> I reviewed the uses of elf_machine_fixup_plt, and DL_FIXUP_MAKE_VALUE to
>> see if there was any other case of this problem, particularly where there
>> might be a case where a write happens on one thread that might not be
>> seen in another.
>>
>> I also looked at _dl_relocate_object and the initialization of all 
>> l_reloc_result via calloc, and that is also covered because the
>> atomic_thread_fence_acquire ensures any secondary thread sees the
>> initialization.
> 
> I don't think the analysis is correct.  It's up to the application to
> ensure that the dlopen (or at least the call to an ELF constructor in
> the new DSO) happens before a call to any function in the DSO, and this
> is why there is no need to synchronize the calloc with the profiling
> code.

I agree, you would need some inter-thread synchronization to ensure all
other threads knew the dlopen was complete, and that would ensure that
the writes would be seen.

-- 
Cheers,
Carlos.

* Re: [PATCHv3] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-10-15 15:10             ` Carlos O'Donell
@ 2018-10-17 21:25               ` Florian Weimer
  2018-10-18  2:14                 ` Carlos O'Donell
  0 siblings, 1 reply; 40+ messages in thread
From: Florian Weimer @ 2018-10-17 21:25 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: Tulio Magno Quites Machado Filho, libc-alpha, John David Anglin,
	Adhemerval Zanella, Joseph Myers

* Carlos O'Donell:

> On 10/15/18 8:57 AM, Florian Weimer wrote:
>> * Carlos O'Donell:
>> 
>>> (3) Fence-to-fence sync.
>>>
>>> For fence-to-fence synchronization to work we need an acquire and release
>>> fence, and we have that.
>>>
>>> We are missing the atomic read and write of the guard. Please review below.
>>> Florian mentioned this in his review. He is correct.
>>>
>>> And all the problems are back again because you can't do atomic loads of
>>> the large guards because they are actually the function descriptor structures.
>>> However, this is just laziness, we used the addr because it was convenient.
>>> It is no longer convenient. Just add a 'init' field to reloc_result and use
>>> that as the guard to synchronize the threads against for initialization of
>>> the results. This should solve the reloc_result problem (ignoring the issues
>>> hppa and ia64 have with the fdesc updates across multiple threads in _dl_fixup).
>> 
>> I think due to various external factors, we should go with the
>> fence-based solution for now, and change it later to something which
>> uses an acquire/release on the code address, using proper atomics.
>
> Let me clarify.
>
> The fence fix as proposed in v3 is wrong for all architectures.
>
> We are emulating C/C++ 11 atomics within glibc, and a fence-to-fence sync
> *requires* an atomic load / store of the guard, you can't use a non-atomic
> access. The point of the atomic load/store is to ensure you don't have a
> data race.

Carlos, I'm sorry, but I think your position is logically inconsistent.

Formally, you cannot follow the memory model here without a substantial
rewrite of the code, breaking up the struct fdesc abstraction.  The
reason is that without blocking synchronization, you still end up with
two non-atomic writes to the same object, which is a data race, and
undefined, even if both threads write the same value.

As far as I can see, POWER is !USE_ATOMIC_COMPILER_BUILTINS, so our
relaxed MO store is just a regular store, without a compiler barrier.
That means after all that rewriting, we basically end up with the same
code and the same formal data race that we would have when we just used
fences.

This is different for USE_ATOMIC_COMPILER_BUILTINS architectures, where
we do use actual atomic stores.  But for !USE_ATOMIC_COMPILER_BUILTINS,
the fence-based approach is as good as we can get, with or without
breaking the abstractions.

So as I said, given the constraints we are working under, we should go
with the solution based on fences, and have that tested on AArch64 as
well.

>> I don't want to see this bug fix blocked by ia64 and hppa.  The proper
>> fix needs some reshuffling of the macros here, or maybe use an unused
>> bit in the flags field as an indicator for initialization.
>
> The fix for this is straightforward.
>
> Add a new initializer field to the reloc_result, it's an internal data
> structure. It can be as big as we want and we can optimize it later.
>
> You don't need to do any big cleanups, but we *do* have to get the
> synchronization correct.

See above; I don't think we can get the synchronization formally
correct, even with any level of cleanups.  In the data race case, we
would have

  atomic acquire MO load of initializer field
  non-atomic writes to various struct fields
  atomic release MO store to initializer field

in each thread.  That's still undefined behavior due to the blocking
stores in the middle.
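
To make the sequence concrete, here is a sketch in standard C11
atomics. The struct and function names are hypothetical, and this
illustrates the argument rather than the actual _dl_profile_fixup
code.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for reloc_result with a separate guard field.  */
struct race_demo
{
  uintptr_t addr;
  unsigned int flags;
  atomic_uint init;
};

static uintptr_t
race_demo_fixup (struct race_demo *r, uintptr_t resolved)
{
  /* (1) Atomic acquire MO load of the initializer field.  */
  if (atomic_load_explicit (&r->init, memory_order_acquire) == 0)
    {
      /* Two threads can both observe init == 0 and both get here.
	 (2) Non-atomic writes to the struct fields.  Nothing orders one
	 thread's writes against the other's, so formally this is a data
	 race even when both threads write the same values.  */
      r->addr = resolved;
      r->flags = 1;
      /* (3) Atomic release MO store to the initializer field.  */
      atomic_store_explicit (&r->init, 1, memory_order_release);
    }
  return r->addr;
}
```

A single-threaded run behaves as expected; the problem only exists
when two threads enter the initialization branch for the same entry.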

Let me reiterate: Just because you say our atomics are C11, it doesn't
make them so.  They are syntactically different, and they are not
presented to the compiler as atomics for !USE_ATOMIC_COMPILER_BUILTINS.
I know that you and Torvald didn't consider this a problem in the past,
but maybe you can reconsider your position?

Thanks,
Florian

* [PATCHv4] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-10-12  1:08         ` Carlos O'Donell
  2018-10-15 13:01           ` Florian Weimer
@ 2018-10-18  2:02           ` Tulio Magno Quites Machado Filho
  2018-10-18  2:17             ` Carlos O'Donell
  1 sibling, 1 reply; 40+ messages in thread
From: Tulio Magno Quites Machado Filho @ 2018-10-18  2:02 UTC (permalink / raw)
  To: Carlos O'Donell, Florian Weimer, libc-alpha,
	John David Anglin, Adhemerval Zanella, Joseph Myers,
	Florian Weimer

Carlos O'Donell <carlos@redhat.com> writes:

> On 10/10/18 10:57 PM, Tulio Magno Quites Machado Filho wrote:
>
> We will need a v4. Please review (1), (2) and (3) carefully, feel free to
> ignore (4).
>
> (1) I added a bunch of comments.
>
> Comments added inline.

Ack.

> (2) -Wl,-z,now worries.
>
> Added some things for you to check.

Fixed.

> (3) Fence-to-fence sync.
>
> For fence-to-fence synchronization to work we need an acquire and release
> fence, and we have that.
>
> We are missing the atomic read and write of the guard. Please review below.
> Florian mentioned this in his review. He is correct.
>
> And all the problems are back again because you can't do atomic loads of
> the large guards because they are actually the function descriptor structures.
> However, this is just laziness, we used the addr because it was convenient.
> It is no longer convenient. Just add a 'init' field to reloc_result and use
> that as the guard to synchronize the threads against for initialization of
> the results. This should solve the reloc_result problem (ignoring the issues
> hppa and ia64 have with the fdesc updates across multiple threads in _dl_fixup).

Ack.

> (4) Review of elf_machine_fixup_plt, and DL_FIXUP_MAKE_VALUE.
>
> I reviewed the uses of elf_machine_fixup_plt, and DL_FIXUP_MAKE_VALUE to
> see if there was any other case of this problem, particularly where there
> might be a case where a write happens on one thread that might not be
> seen in another.
>
> I also looked at _dl_relocate_object and the initialization of all
> l_reloc_result via calloc, and that is also covered because the
> atomic_thread_fence_acquire ensures any secondary thread sees the
> initialization.

Related to this: I'm skeptical about the usage of l->l_relocated in
_dl_relocate_object, but I have not been able to reproduce a failure there yet.

>> diff --git a/elf/dl-runtime.c b/elf/dl-runtime.c
>> index 63bbc89776..c1ba372bd7 100644
>> --- a/elf/dl-runtime.c
>> +++ b/elf/dl-runtime.c
>> @@ -183,9 +183,18 @@ _dl_profile_fixup (
>>    /* This is the address in the array where we store the result of previous
>>       relocations.  */
>>    struct reloc_result *reloc_result = &l->l_reloc_result[reloc_index];
>> -  DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
>>
>> +  /* CONCURRENCY NOTES:
>> +
>
> Suggest adding:
>
> Multiple threads may be calling the same PLT sequence and with LD_AUDIT enabled
> they will be calling into _dl_profile_fixup to update the reloc_result with the
> result of the lazy resolution. The reloc_result guard variable is addr, and we
> use relaxed MO loads and stores to it, along with atomic_thread_fence_acquire
> and atomic_thread_fence_release fences, to ensure that the results of the
> structure are consistent with the loaded value of the guard.

I added this comment with small changes based on your suggestion for
reloc_result.init.

>> +     The following code uses DL_FIXUP_VALUE_CODE_ADDR to access a potential
>> +     member of reloc_result->addr to indicate if it is the first time this
>> +     object is being relocated.
>> +     Reading/Writing from/to reloc_result->addr must not happen before previous
>> +     writes to reloc_result complete as they could end-up with an incomplete
>> +     struct.  */
>
> OK.

After adding your previous comment and the changes to use reloc_result.init,
this comment is obsolete.  I removed it.

>>    DL_FIXUP_VALUE_TYPE value = *resultp;
>
> Not OK. This is a guard. You read it here, and write to it below.
> That's a data race. Both need to be atomic accesses with any MO you want.

Agreed.

> On hppa this will require a new enough compiler to get a 64-bit atomic load.
> On ia64 I don't know if there is a usable 128-bit atomic.
>
> The key problem is that addr is being overloaded as a guard because
> it was convenient. It's non-zero when the symbol is initialized, and zero
> otherwise. However, for arches with function descriptors you've found
> out that using it is causing problems because it's too big for traditional atomic
> operations.
>
> What you really need is a new "init" field in reloc_result, make it a word,
> and then use word-sized atomics on that with relaxed MO, and keep the fences.

Ack.  Fixed.

>> +	  *resultp = value;
>
> Not OK, see above, this needs to be an atomic relaxed-MO store to 'init'
> or something smaller than value.
>
> You need a guard small enough that arches will have an atomic load/store
> to the size.

Ack.  Fixed.

>>  extra-test-objs += $(addsuffix .os,$(strip $(modules-names))) \
>>  		   tst-cleanup4aux.o tst-cleanupx4aux.o
>>  test-extras += tst-cleanup4aux tst-cleanupx4aux
>> @@ -709,6 +711,10 @@ endif
>>
>>  $(objpfx)tst-compat-forwarder: $(objpfx)tst-compat-forwarder-mod.so
>>
>> +$(objpfx)tst-audit-threads: $(objpfx)tst-audit-threads-mod2.so
>> +$(objpfx)tst-audit-threads.out: $(objpfx)tst-audit-threads-mod1.so
>> +tst-audit-threads-ENV = LD_AUDIT=$(objpfx)tst-audit-threads-mod1.so
>
> Do we need to add -Wl,-z,lazy?
>
> Users might have -Wl,-z,now as the default for their build?
>
> With BIND_NOW the test doesn't test what we want.

Indeed.  Fixed.

>> diff --git a/nptl/tst-audit-threads-mod1.c b/nptl/tst-audit-threads-mod1.c
>> +#include <elf.h>
>> +#include <link.h>
>> +#include <stdio.h>
>> +#include <assert.h>
>> +#include <string.h>
>> +
>
> Suggest:
>
> /* We must use a dummy LD_AUDIT module to force the dynamic loader to
>    *not* update the real PLT, and instead use a cached value for the
>    lazy resolution result. It is the update of that cached value that
>    we are testing for correctness by doing this.  */

Fixed.

>> +volatile int count = 0;
>> +
>> +unsigned int
>> +la_version (unsigned int ver)
>> +{
>> +  return 1;
>> +}
>> +
>> +unsigned int
>> +la_objopen (struct link_map *map, Lmid_t lmid, uintptr_t *cookie)
>> +{
>> +  return LA_FLG_BINDTO | LA_FLG_BINDFROM;
>> +}
>
> I'm worried binutils will optimize away the PLT entries and this test will
> pass without failing but the lazy resolution will not be tested.
>
> Can we just *count* the number of PLT resolutions and see if they match?

With so many threads calling these functions, la_symbind is called many times.
As the intention was to avoid optimizing away the PLT entries, I added another
validation: it tests if it's always increasing and that it's never higher than
the expected value.
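
One possible reading of that validation, sketched under assumptions:
the check parses the numeric suffix of each retNumN symbol, requires
it never to go backwards, and bounds it by 7999 (the largest name
FUNC7000 generates). The helper name and its interface are
hypothetical, and the sketch is not thread-safe on its own.

```c
#include <stdio.h>

/* Largest N among the generated retNumN functions (1000 .. 7999).  */
#define DEMO_MAX_NUM 7999u

/* Check one symbol name as it would be seen by la_symbind.  *previous
   holds the highest number accepted so far.  Returns 1 if the name is
   acceptable and 0 on a violation.  */
static int
demo_check_symbol (const char *symname, unsigned int *previous)
{
  unsigned int num;

  /* Symbols that are not retNumN functions are ignored.  */
  if (sscanf (symname, "retNum%u", &num) != 1)
    return 1;

  /* Numbers must never decrease and never exceed the maximum.  */
  if (num < *previous || num > DEMO_MAX_NUM)
    return 0;

  *previous = num;
  return 1;
}
```

Repeated bindings of the same function by several threads are accepted
because the comparison allows equal values.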

>> diff --git a/nptl/tst-audit-threads-mod2.c b/nptl/tst-audit-threads-mod2.c
>
> Suggest:
>
> /* Define all the retNumN functions in a library.  */
>
> Just to be clear that this must be distinct from the executable.

Fixed.

>> diff --git a/nptl/tst-audit-threads.c b/nptl/tst-audit-threads.c
>
> Suggest:
>
> /* This test uses a dummy LD_AUDIT library (test-audit-threads-mod1) and a
>    library with a huge number of functions in order to validate lazy symbol
>    binding with an audit library.  We use one thread per CPU to test that
>    concurrent lazy resolution does not have any defects which would cause
>    the process to fail.  We use an LD_AUDIT library to force the testing of
>    the relocation resolution caching code in the dynamic loader i.e.
>    _dl_runtime_profile and _dl_profile_fixup.  */

Fixed.

>> +/* This test uses a dummy LD_AUDIT library (test-audit-threads-mod1) and a
>> +   library with a huge number of functions in order to validate lazy symbol
>> +   binding with an audit library.  */
>> +
>> +#include <support/xthread.h>
>> +#include <strings.h>
>> +#include <stdlib.h>
>> +#include <sys/sysinfo.h>
>> +
>> +static int do_test (void);
>> +
>> +/* This test usually takes less than 3s to run.  However, there are cases that
>> +   take up to 30s.  */
>> +#define TIMEOUT 60
>> +#define TEST_FUNCTION do_test ()
>> +#include "../test-skeleton.c"
>> +
>
> Suggest:
>
> /* Declare the functions we are going to call.  */

Fixed.

>> +void
>> +call_all_ret_nums (void)
>> +{
>
> Suggest:
>
> /* Call each function one at a time from all threads.  */

Fixed.

>> diff --git a/nptl/tst-audit-threads.h b/nptl/tst-audit-threads.h
> Suggest adding:
>
> /* We use this helper to create a large number of functions, all of
>    which will be resolved lazily and thus have their PLT updated.
>    This is done to provide enough functions that we can statistically
>    observe a thread vs. PLT resolution failure if one exists.  */

Fixed.
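
For readers unfamiliar with the technique, the helper header re-expands the
same list of names under different definitions of FUNC.  A scaled-down sketch
with ten functions instead of 7000 (sum_all is a hypothetical name used only
for this illustration, and the functions are made static to keep the sketch in
one file):

```c
#include <assert.h>

#define CONCAT(a, b) a ## b
#define NUM(x, y) CONCAT (x, y)

/* Expand FUNC once for each of the ten numbers x0 .. x9.  */
#define FUNC10(x)			\
  FUNC (NUM (x, 0)) FUNC (NUM (x, 1))	\
  FUNC (NUM (x, 2)) FUNC (NUM (x, 3))	\
  FUNC (NUM (x, 4)) FUNC (NUM (x, 5))	\
  FUNC (NUM (x, 6)) FUNC (NUM (x, 7))	\
  FUNC (NUM (x, 8)) FUNC (NUM (x, 9))

/* First expansion: define retNum10 .. retNum19, each returning its own
   number.  */
#define FUNC(x) static int CONCAT (retNum, x) (void) { return x; }
FUNC10 (1)
#undef FUNC

/* Second expansion: call every generated function and add up the
   results.  */
static int
sum_all (void)
{
  int sum = 0;
#define FUNC(x) sum += CONCAT (retNum, x) ();
  FUNC10 (1)
#undef FUNC
  return sum;
}
```

In the real test the definitions live in a separate shared object, so that
each call goes through a PLT entry and is resolved lazily.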

Changes since v3:

 - Improved comments.
 - Started to use -Wl,-z,lazy.
 - Added field init to l_reloc_result to be used as a guard.

Changes since v2:

 - Fixed coding style in nptl/tst-audit-threads-mod1.c.
 - Replaced pthreads.h functions with respective support/xthread.h ones.
 - Replaced malloc() with xcalloc() in nptl/tst-audit-threads.c.
 - Removed bzero().
 - Reduced the number of functions to 7k in order to fit the relocation
   limit of some architectures, e.g. m68k, mips.
 - Fixed issues in nptl/Makefile.

Changes since v1:

 - Fixed the coding style issues.
 - Replaced atomic loads/store with memory fences.
 - Added a test.

---- 8< ----

The field reloc_result->addr is used to indicate if the rest of the
fields of reloc_result have already been written, creating a
data-dependency order.
The read of reloc_result->addr into the variable value must complete
before the rest of the fields of reloc_result are read.
Likewise, the writes to the other fields of reloc_result must
complete before reloc_result->addr is updated.

Tested with build-many-glibcs.

2018-10-17  Tulio Magno Quites Machado Filho  <tuliom@linux.ibm.com>

	[BZ #23690]
	* elf/dl-runtime.c (_dl_profile_fixup): Guarantee memory
	modification order when accessing reloc_result->addr.
	* include/link.h (reloc_result): Add field init.
	* nptl/Makefile (tests): Add tst-audit-threads.
	(modules-names): Add tst-audit-threads-mod1 and
	tst-audit-threads-mod2.
	Add rules to build tst-audit-threads.
	* nptl/tst-audit-threads-mod1.c: New file.
	* nptl/tst-audit-threads-mod2.c: Likewise.
	* nptl/tst-audit-threads.c: Likewise.
	* nptl/tst-audit-threads.h: Likewise.

Signed-off-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
---
 elf/dl-runtime.c              | 30 +++++++++++--
 include/link.h                |  2 +
 nptl/Makefile                 | 14 ++++++-
 nptl/tst-audit-threads-mod1.c | 74 +++++++++++++++++++++++++++++++++
 nptl/tst-audit-threads-mod2.c | 22 ++++++++++
 nptl/tst-audit-threads.c      | 97 +++++++++++++++++++++++++++++++++++++++++++
 nptl/tst-audit-threads.h      | 89 +++++++++++++++++++++++++++++++++++++++
 7 files changed, 322 insertions(+), 6 deletions(-)
 create mode 100644 nptl/tst-audit-threads-mod1.c
 create mode 100644 nptl/tst-audit-threads-mod2.c
 create mode 100644 nptl/tst-audit-threads.c
 create mode 100644 nptl/tst-audit-threads.h

diff --git a/elf/dl-runtime.c b/elf/dl-runtime.c
index 63bbc89776..a760a00a62 100644
--- a/elf/dl-runtime.c
+++ b/elf/dl-runtime.c
@@ -183,10 +183,22 @@ _dl_profile_fixup (
   /* This is the address in the array where we store the result of previous
      relocations.  */
   struct reloc_result *reloc_result = &l->l_reloc_result[reloc_index];
+
+  /* CONCURRENCY NOTES:
+
+     Multiple threads may be calling the same PLT sequence and with LD_AUDIT
+     enabled they will be calling into _dl_profile_fixup to update the
+     reloc_result with the result of the lazy resolution.  The reloc_result
+     guard variable is init, and we use relaxed MO loads and stores to it
+     along with atomic_thread_fence_acquire and atomic_thread_fence_release
+     fences to ensure that the results of the structure are consistent with
+     the loaded value of the guard.  */
   DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
+  DL_FIXUP_VALUE_TYPE value;
+  unsigned int init = atomic_load_relaxed (&reloc_result->init);
+  atomic_thread_fence_acquire ();
 
-  DL_FIXUP_VALUE_TYPE value = *resultp;
-  if (DL_FIXUP_VALUE_CODE_ADDR (value) == 0)
+  if (init == 0)
     {
       /* This is the first time we have to relocate this object.  */
       const ElfW(Sym) *const symtab
@@ -346,16 +358,26 @@ _dl_profile_fixup (
 
       /* Store the result for later runs.  */
       if (__glibc_likely (! GLRO(dl_bind_not)))
-	*resultp = value;
+	{
+	  *resultp = value;
+	  atomic_thread_fence_release ();
+	  /* Guarantee all previous writes complete before
+	     init is updated.  See CONCURRENCY NOTES earlier.  */
+	  atomic_store_relaxed (&reloc_result->init, 1);
+	}
+      init = 1;
     }
+  else
+    value = *resultp;
 
   /* By default we do not call the pltexit function.  */
   long int framesize = -1;
 
+
 #ifdef SHARED
   /* Auditing checkpoint: report the PLT entering and allow the
      auditors to change the value.  */
-  if (DL_FIXUP_VALUE_CODE_ADDR (value) != 0 && GLRO(dl_naudit) > 0
+  if (init != 0 && GLRO(dl_naudit) > 0
       /* Don't do anything if no auditor wants to intercept this call.  */
       && (reloc_result->enterexit & LA_SYMB_NOPLTENTER) == 0)
     {
diff --git a/include/link.h b/include/link.h
index 5924594548..1d13d02637 100644
--- a/include/link.h
+++ b/include/link.h
@@ -216,6 +216,8 @@ struct link_map
       unsigned int boundndx;
       uint32_t enterexit;
       unsigned int flags;
+      /* Indicates if reloc_result fields have been initialized.  */
+      unsigned int init;
     } *l_reloc_result;
 
     /* Pointer to the version information if available.  */
diff --git a/nptl/Makefile b/nptl/Makefile
index be8066524c..9862ef53fc 100644
--- a/nptl/Makefile
+++ b/nptl/Makefile
@@ -382,7 +382,8 @@ tests += tst-cancelx2 tst-cancelx3 tst-cancelx4 tst-cancelx5 \
 	 tst-cleanupx0 tst-cleanupx1 tst-cleanupx2 tst-cleanupx3 tst-cleanupx4 \
 	 tst-oncex3 tst-oncex4
 ifeq ($(build-shared),yes)
-tests += tst-atfork2 tst-tls4 tst-_res1 tst-fini1 tst-compat-forwarder
+tests += tst-atfork2 tst-tls4 tst-_res1 tst-fini1 tst-compat-forwarder \
+	 tst-audit-threads
 tests-internal += tst-tls3 tst-tls3-malloc tst-tls5 tst-stackguard1
 tests-nolibpthread += tst-fini1
 ifeq ($(have-z-execstack),yes)
@@ -394,7 +395,8 @@ modules-names = tst-atfork2mod tst-tls3mod tst-tls4moda tst-tls4modb \
 		tst-tls5mod tst-tls5moda tst-tls5modb tst-tls5modc \
 		tst-tls5modd tst-tls5mode tst-tls5modf tst-stack4mod \
 		tst-_res1mod1 tst-_res1mod2 tst-execstack-mod tst-fini1mod \
-		tst-join7mod tst-compat-forwarder-mod
+		tst-join7mod tst-compat-forwarder-mod tst-audit-threads-mod1 \
+		tst-audit-threads-mod2
 extra-test-objs += $(addsuffix .os,$(strip $(modules-names))) \
 		   tst-cleanup4aux.o tst-cleanupx4aux.o
 test-extras += tst-cleanup4aux tst-cleanupx4aux
@@ -709,6 +711,14 @@ endif
 
 $(objpfx)tst-compat-forwarder: $(objpfx)tst-compat-forwarder-mod.so
 
+# Protect against a build using -Wl,-z,now.
+LDFLAGS-tst-audit-threads-mod1.so = -Wl,-z,lazy
+LDFLAGS-tst-audit-threads-mod2.so = -Wl,-z,lazy
+LDFLAGS-tst-audit-threads = -Wl,-z,lazy
+$(objpfx)tst-audit-threads: $(objpfx)tst-audit-threads-mod2.so
+$(objpfx)tst-audit-threads.out: $(objpfx)tst-audit-threads-mod1.so
+tst-audit-threads-ENV = LD_AUDIT=$(objpfx)tst-audit-threads-mod1.so
+
 # The tests here better do not run in parallel
 ifneq ($(filter %tests,$(MAKECMDGOALS)),)
 .NOTPARALLEL:
diff --git a/nptl/tst-audit-threads-mod1.c b/nptl/tst-audit-threads-mod1.c
new file mode 100644
index 0000000000..6fa0c0c6c4
--- /dev/null
+++ b/nptl/tst-audit-threads-mod1.c
@@ -0,0 +1,74 @@
+/* Dummy audit library for test-audit-threads.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <elf.h>
+#include <link.h>
+#include <stdio.h>
+#include <assert.h>
+#include <string.h>
+
+/* We must use a dummy LD_AUDIT module to force the dynamic loader to
+   *not* update the real PLT, and instead use a cached value for the
+   lazy resolution result.  It is the update of that cached value that
+   we are testing for correctness by doing this.  */
+
+/* Library to be audited.  */
+#define LIB "tst-audit-threads-mod2.so"
+/* CALLNUM is the number of retNum functions.  */
+#define CALLNUM 7999
+
+#define CONCATX(a, b) __CONCAT (a, b)
+
+static int previous = 0;
+
+unsigned int
+la_version (unsigned int ver)
+{
+  return 1;
+}
+
+unsigned int
+la_objopen (struct link_map *map, Lmid_t lmid, uintptr_t *cookie)
+{
+  return LA_FLG_BINDTO | LA_FLG_BINDFROM;
+}
+
+uintptr_t
+CONCATX(la_symbind, __ELF_NATIVE_CLASS) (ElfW(Sym) *sym,
+					unsigned int ndx,
+					uintptr_t *refcook,
+					uintptr_t *defcook,
+					unsigned int *flags,
+					const char *symname)
+{
+  const char * retnum = "retNum";
+  char * num = strstr (symname, retnum);
+  int n;
+  /* Validate if the symbols are getting called in the correct order.
+     This code is here just to guarantee Binutils will not optimize out this
+     function.  */
+  if (num != NULL)
+    {
+      n = atoi (num + strlen (retnum));
+      assert (n >= previous);
+      assert (n <= CALLNUM);
+      previous = n;
+    }
+  return sym->st_value;
+}
diff --git a/nptl/tst-audit-threads-mod2.c b/nptl/tst-audit-threads-mod2.c
new file mode 100644
index 0000000000..f9817dd3dc
--- /dev/null
+++ b/nptl/tst-audit-threads-mod2.c
@@ -0,0 +1,22 @@
+/* Shared object with a huge number of functions for test-audit-threads.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* Define all the retNumN functions in a library.  */
+#define definenum
+#include "tst-audit-threads.h"
diff --git a/nptl/tst-audit-threads.c b/nptl/tst-audit-threads.c
new file mode 100644
index 0000000000..e4bf433bd8
--- /dev/null
+++ b/nptl/tst-audit-threads.c
@@ -0,0 +1,97 @@
+/* Test multi-threading using LD_AUDIT.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* This test uses a dummy LD_AUDIT library (test-audit-threads-mod1) and a
+   library with a huge number of functions in order to validate lazy symbol
+   binding with an audit library.  We use one thread per CPU to test that
+   concurrent lazy resolution does not have any defects which would cause
+   the process to fail.  We use an LD_AUDIT library to force the testing of
+   the relocation resolution caching code in the dynamic loader i.e.
+   _dl_runtime_profile and _dl_profile_fixup.  */
+
+#include <support/xthread.h>
+#include <strings.h>
+#include <stdlib.h>
+#include <sys/sysinfo.h>
+
+static int do_test (void);
+
+/* This test usually takes less than 3s to run.  However, there are cases that
+   take up to 30s.  */
+#define TIMEOUT 60
+#define TEST_FUNCTION do_test ()
+#include "../test-skeleton.c"
+
+/* Declare the functions we are going to call.  */
+#define externnum
+#include "tst-audit-threads.h"
+#undef externnum
+
+int num_threads;
+pthread_barrier_t barrier;
+
+void
+sync_all (int num)
+{
+  pthread_barrier_wait (&barrier);
+}
+
+void
+call_all_ret_nums (void)
+{
+  /* Call each function one at a time from all threads.  */
+#define callnum
+#include "tst-audit-threads.h"
+#undef callnum
+}
+
+void *
+thread_main (void *unused)
+{
+  call_all_ret_nums ();
+  return NULL;
+}
+
+#define STR2(X) #X
+#define STR(X) STR2(X)
+
+static int
+do_test (void)
+{
+  int i;
+  pthread_t *threads;
+
+  num_threads = get_nprocs ();
+  if (num_threads <= 1)
+    num_threads = 2;
+
+  /* Used to synchronize all the threads after calling each retNumN.  */
+  xpthread_barrier_init (&barrier, NULL, num_threads);
+
+  threads = (pthread_t *) xcalloc (num_threads, sizeof (pthread_t));
+  for (i = 0; i < num_threads; i++)
+    threads[i] = xpthread_create (NULL, thread_main, NULL);
+
+  for (i = 0; i < num_threads; i++)
+    xpthread_join (threads[i]);
+
+  free (threads);
+
+  return 0;
+}
diff --git a/nptl/tst-audit-threads.h b/nptl/tst-audit-threads.h
new file mode 100644
index 0000000000..491d0dcbf0
--- /dev/null
+++ b/nptl/tst-audit-threads.h
@@ -0,0 +1,89 @@
+/* Helper header for test-audit-threads.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* We use this helper to create a large number of functions, all of
+   which will be resolved lazily and thus have their PLT updated.
+   This is done to provide enough functions that we can statistically
+   observe a thread vs. PLT resolution failure if one exists.  */
+
+#define CONCAT(a, b) a ## b
+#define NUM(x, y) CONCAT (x, y)
+
+#define FUNC10(x)	\
+  FUNC (NUM (x, 0));	\
+  FUNC (NUM (x, 1));	\
+  FUNC (NUM (x, 2));	\
+  FUNC (NUM (x, 3));	\
+  FUNC (NUM (x, 4));	\
+  FUNC (NUM (x, 5));	\
+  FUNC (NUM (x, 6));	\
+  FUNC (NUM (x, 7));	\
+  FUNC (NUM (x, 8));	\
+  FUNC (NUM (x, 9))
+
+#define FUNC100(x)	\
+  FUNC10 (NUM (x, 0));	\
+  FUNC10 (NUM (x, 1));	\
+  FUNC10 (NUM (x, 2));	\
+  FUNC10 (NUM (x, 3));	\
+  FUNC10 (NUM (x, 4));	\
+  FUNC10 (NUM (x, 5));	\
+  FUNC10 (NUM (x, 6));	\
+  FUNC10 (NUM (x, 7));	\
+  FUNC10 (NUM (x, 8));	\
+  FUNC10 (NUM (x, 9))
+
+#define FUNC1000(x)		\
+  FUNC100 (NUM (x, 0));		\
+  FUNC100 (NUM (x, 1));		\
+  FUNC100 (NUM (x, 2));		\
+  FUNC100 (NUM (x, 3));		\
+  FUNC100 (NUM (x, 4));		\
+  FUNC100 (NUM (x, 5));		\
+  FUNC100 (NUM (x, 6));		\
+  FUNC100 (NUM (x, 7));		\
+  FUNC100 (NUM (x, 8));		\
+  FUNC100 (NUM (x, 9))
+
+#define FUNC7000()	\
+  FUNC1000 (1);		\
+  FUNC1000 (2);		\
+  FUNC1000 (3);		\
+  FUNC1000 (4);		\
+  FUNC1000 (5);		\
+  FUNC1000 (6);		\
+  FUNC1000 (7);
+
+#ifdef FUNC
+# undef FUNC
+#endif
+
+#ifdef externnum
+# define FUNC(x) extern int CONCAT (retNum, x) (void)
+#endif
+
+#ifdef definenum
+# define FUNC(x) int CONCAT (retNum, x) (void) { return x; }
+#endif
+
+#ifdef callnum
+# define FUNC(x) CONCAT (retNum, x) (); sync_all (x)
+#endif
+
+FUNC7000 ();
-- 
2.14.4


* Re: [PATCHv3] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-10-17 21:25               ` Florian Weimer
@ 2018-10-18  2:14                 ` Carlos O'Donell
  2018-10-18  7:24                   ` Carlos O'Donell
  2018-10-18 10:21                   ` Florian Weimer
  0 siblings, 2 replies; 40+ messages in thread
From: Carlos O'Donell @ 2018-10-18  2:14 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Tulio Magno Quites Machado Filho, libc-alpha, John David Anglin,
	Adhemerval Zanella, Joseph Myers

[-- Attachment #1: Type: text/plain, Size: 7062 bytes --]

On 10/17/18 4:12 PM, Florian Weimer wrote:
> * Carlos O'Donell:
> 
>> On 10/15/18 8:57 AM, Florian Weimer wrote:
>>> * Carlos O'Donell:
>>>
>>>> (3) Fence-to-fence sync.
>>>>
>>>> For fence-to-fence synchronization to work we need an acquire and release
>>>> fence, and we have that.
>>>>
>>>> We are missing the atomic read and write of the guard. Please review below.
>>>> Florian mentioned this in his review. He is correct.
>>>>
>>>> And all the problems are back again because you can't do atomic loads of
>>>> the large guards because they are actually the function descriptor structures.
>>>> However, this is just laziness, we used the addr because it was convenient.
>>>> It is no longer convenient. Just add a 'init' field to reloc_result and use
>>>> that as the guard to synchronize the threads against for initialization of
>>>> the results. This should solve the reloc_result problem (ignoring the issues
>>>> hppa and ia64 have with the fdesc updates across multiple threads in _dl_fixup).
>>>
>>> I think due to various external factors, we should go with the
>>> fence-based solution for now, and change it later to something which
>>> uses an acquire/release on the code address, using proper atomics.
>>
>> Let me clarify.
>>
>> The fence fix as proposed in v3 is wrong for all architectures.
>>
>> We are emulating C/C++ 11 atomics within glibc, and a fence-to-fence sync
>> *requires* an atomic load / store of the guard, you can't use a non-atomic
>> access. The point of the atomic load/store is to ensure you don't have a
>> data race.
> 
> Carlos, I'm sorry, but I think your position is logically inconsistent.

Yes, it *is* logically inconsistent. I agree with you.

However, to *be* logically consistent I'd have to fix all data races in
this code in one go, and I can't, it's too much work.

All I want is for any *changes* to follow C11 semantics, and I think we
can do that without major surgery.

Consider it an ideological flaw that I want everyone to practice following
a consistent memory model and think about these problems in terms of that
memory model, and evaluate patches using that model.

> Formally, you cannot follow the memory model here without a substantial
> rewrite of the code, breaking up the struct fdesc abstraction.  The
> reason is that without blocking synchronization, you still end up with
> two non-atomic writes to the same object, which is a data race, and
> undefined, even if both threads write the same value.

There are two distinct problems here, and each can be handled distinctly.

The first is the problem at hand, that there is a data-dependency issue
with the update of the struct reloc_result structure. We have multiple
threads writing to the reloc_result structure, in general those threads
write the same value (locks are taken in _dl_lookup_symbol_x), and while
this is a data race, I don't care about it and we aren't going to fix
it. The only thing we should ensure is that a thread that determines the
reloc_result is initialized sees all the correct values in the
structure and not a sheared result. That is all that we are fixing
here, call that the "change."

We can follow the memory model far enough to avoid a sheared result
being read out of the struct reloc_result.

We have not fixed the data races that occur when two threads read a
zero addr value and both use non-atomic writes to update reloc_result,
and I don't intend the patch to fix that. I don't require that.
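
To make the scope of the "change" concrete, here is a minimal sketch of the
guard pattern using C11 <stdatomic.h> rather than glibc's internal atomics.
struct result, publish and try_read are hypothetical names for illustration
only and are not the glibc code.  As discussed above, two threads that both
call publish would still race on the plain field writes; the sketch only shows
that a reader which observes the guard set also observes the fields.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for struct reloc_result; the field names are
   illustrative only.  */
struct result
{
  atomic_uint init;		/* Guard: 0 until the fields are written.  */
  uintptr_t addr;
  unsigned int boundndx;
};

/* Writer side: fill in the fields, then publish them via the guard.  */
static void
publish (struct result *r, uintptr_t addr, unsigned int boundndx)
{
  r->addr = addr;
  r->boundndx = boundndx;
  /* Release store: the writes above are visible to any thread whose
     acquire load of the guard reads 1.  */
  atomic_store_explicit (&r->init, 1, memory_order_release);
}

/* Reader side: return 1 and copy out the fields if they have been
   published, 0 otherwise.  */
static int
try_read (struct result *r, uintptr_t *addr, unsigned int *boundndx)
{
  if (atomic_load_explicit (&r->init, memory_order_acquire) == 0)
    return 0;
  *addr = r->addr;
  *boundndx = r->boundndx;
  return 1;
}
```

A reader that gets 0 back would fall through to the slow path and resolve the
symbol itself, which is where the remaining data race lives.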

> As far as I can see, POWER is !USE_ATOMIC_COMPILER_BUILTINS, so our
> relaxed MO store is just a regular store, without a compiler barrier.
> That means after all that rewriting, we basically end up with the same
> code and the same formal data race that we would have when we just used
> fences.

That's fine.

The use of the guard+fence-to-fence sync is, from a C11 perspective,
correct. However, I recommend adding a reloc_result->reloc_init and
using that with release stores and acquire loads.
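
For comparison, the guard+fence-to-fence form can be sketched like this, again
with C11 <stdatomic.h> and hypothetical names (struct cached,
store_with_fences, load_with_fences) instead of the glibc macros.  The guard
accesses are relaxed and the ordering comes from the two fences.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical cache entry; the names are illustrative only.  */
struct cached
{
  atomic_uint init;	/* Guard, accessed with relaxed MO.  */
  long value;		/* Payload, written non-atomically.  */
};

static void
store_with_fences (struct cached *c, long value)
{
  c->value = value;
  /* Release fence: orders the payload write before the guard store.  */
  atomic_thread_fence (memory_order_release);
  atomic_store_explicit (&c->init, 1, memory_order_relaxed);
}

/* Return 1 and set *VALUE if the entry was initialized, 0 otherwise.  */
static int
load_with_fences (struct cached *c, long *value)
{
  unsigned int init = atomic_load_explicit (&c->init, memory_order_relaxed);
  /* Acquire fence: orders the guard load before the payload read.  */
  atomic_thread_fence (memory_order_acquire);
  if (init == 0)
    return 0;
  *value = c->value;
  return 1;
}
```

Both forms need the guard itself to be accessed atomically; the fences alone
do not make a plain read of the guard race-free.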

> This is different for USE_ATOMIC_COMPILER_BUILTINS architectures, where
> we do use actual atomic stores.  But for !USE_ATOMIC_COMPILER_BUILTINS,
> the fence-based approach is as good as we can get, with or without
> breaking the abstractions.

We can do better.

> So as I said, given the constraints we are working under, we should go
> with the solution based on fences, and have that tested on Aarch64 as
> well.

I wholeheartedly appreciate a pragmatic approach to these problems, but
I still contend that we can do better without much more work.

>>> I don't want to see this bug fix blocked by ia64 and hppa.  The proper
>>> fix needs some reshuffling of the macros here, or maybe use an unused
>>> bit in the flags field as an indicator for initialization.
>>
>> The fix for this is straightforward.
>>
>> Add a new initializer field to the reloc_result, it's an internal data
>> structure. It can be as big as we want and we can optimize it later.
>>
>> You don't need to do any big cleanups, but we *do* have to get the
>> synchronization correct.
> 
> See above; I don't think we can get the synchronization formally
> correct, even with any level of cleanups.  In the data race case, we
> would have
> 
>   atomic acquire MO load of initializer field
>   non-atomic writes to various struct fields
>   atomic release MO store to initializer field
> 
> in each thread.  That's still undefined behavior due to the blocking
> stores in the middle.

That's fine. The changes made are correct, even if the whole algorithm
itself is not.

> Let me reiterate: Just because you say our atomics are C11, it doesn't
> make them so.  They are syntactically different, and they are not
> presented to the compiler as atomics for !USE_ATOMIC_COMPILER_BUILTINS.
> I know that you and Torvald didn't consider this a problem in the past,
> but maybe you can reconsider your position?

My position is unchanged.

If I could summarize our positions, I would write:

(1) The "Pragmatic approach"

- Since we don't have C11 atomics, we should just use the fences because
  they fix the data dependency issue, and stop there. We need not go any
  further until we are ready to fix the underlying algorithm of the result
  updates and *then* we can follow C11.

(2) The "Incremental C11 approach"

- Assume we are C11, and take an incremental approach where we fix the
  data dependency issue using correct synchronization primitives, even
  if it doesn't solve all of the data races.

Did I summarize your position accurately?

I prefer (2).

Tulio can have a preference also.

I spoke to Tulio on IRC and he says he has a working v4 in build-many-glibcs.

I figure we'll commit something probably tomorrow for this.

Just for the record I'm attaching my WIP v4 so you can see what I mean about
the solution. Yes, I did make the structure 4 bytes larger, and that will have
a real impact. However, it removes the dependency on the size of the function
pointer, which I like for future maintenance, and we'll likely rewrite the whole
struct when we fix 23790 to get a more optimal layout.

-- 
Cheers,
Carlos.

[-- Attachment #2: swbz23690.patch --]
[-- Type: text/x-patch, Size: 14510 bytes --]

diff --git a/elf/dl-runtime.c b/elf/dl-runtime.c
index 63bbc89776..b115f738a4 100644
--- a/elf/dl-runtime.c
+++ b/elf/dl-runtime.c
@@ -158,6 +158,7 @@ _dl_profile_fixup (
 		   struct link_map *l, ElfW(Word) reloc_arg,
 		   ElfW(Addr) retaddr, void *regs, long int *framesizep)
 {
+  unsigned int reloc_init;
   void (*mcount_fct) (ElfW(Addr), ElfW(Addr)) = _dl_mcount;
 
   if (l->l_reloc_result == NULL)
@@ -183,10 +184,36 @@ _dl_profile_fixup (
   /* This is the address in the array where we store the result of previous
      relocations.  */
   struct reloc_result *reloc_result = &l->l_reloc_result[reloc_index];
-  DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
 
-  DL_FIXUP_VALUE_TYPE value = *resultp;
-  if (DL_FIXUP_VALUE_CODE_ADDR (value) == 0)
+  DL_FIXUP_VALUE_TYPE value;
+  /* CONCURRENCY NOTES:
+
+     Multiple threads may be calling the same PLT sequence and with
+     LD_AUDIT enabled they will be calling into _dl_profile_fixup to
+     update the reloc_result with the result of the lazy resolution.
+     The reloc_result guard variable is reloc_init, and we use
+     acquire loads and release stores to it to ensure that the results
+     of the structure are consistent with the loaded value of the guard.
+     This does not fix all of the data races that occur when two or more
+     threads read reloc_result->reloc_init with a value of zero and read
+     and write to that reloc_result concurrently.  The expectation is
+     generally that while this is a data race it works because the
+     threads write the same values.  Until the data races are fixed
+     there is a potential for problems to arise from these data races.
+     The reloc result updates should happen in parallel but there should
+     be an atomic RMW which does the final update to the real result
+     entry (see bug 23790).
+
+     The following code uses reloc_init set to 0 to indicate that it is
+     the first time this object is being relocated, otherwise 1, which
+     indicates the object has already been relocated.
+
+     Reading/Writing from/to reloc_result->reloc_init must not happen
+     before previous writes to reloc_result complete as they could
+     end-up with an incomplete struct.  */
+  reloc_init = atomic_load_acquire (&reloc_result->reloc_init);
+
+  if (reloc_init == 0)
     {
       /* This is the first time we have to relocate this object.  */
       const ElfW(Sym) *const symtab
@@ -346,8 +373,16 @@ _dl_profile_fixup (
 
       /* Store the result for later runs.  */
       if (__glibc_likely (! GLRO(dl_bind_not)))
-	*resultp = value;
+	{
+	  /* Guarantee all previous writes complete before
+	     reloc_result->reloc_init is updated.  See CONCURRENCY
+	     NOTES earlier.  */
+	  reloc_result->addr = value;
+	  atomic_store_release (&reloc_result->reloc_init, 1);
+	}
     }
+  else
+    value = reloc_result->addr;
 
   /* By default we do not call the pltexit function.  */
   long int framesize = -1;
@@ -355,7 +390,7 @@ _dl_profile_fixup (
 #ifdef SHARED
   /* Auditing checkpoint: report the PLT entering and allow the
      auditors to change the value.  */
-  if (DL_FIXUP_VALUE_CODE_ADDR (value) != 0 && GLRO(dl_naudit) > 0
+  if (reloc_init != 0 && GLRO(dl_naudit) > 0
       /* Don't do anything if no auditor wants to intercept this call.  */
       && (reloc_result->enterexit & LA_SYMB_NOPLTENTER) == 0)
     {
diff --git a/include/link.h b/include/link.h
index 5924594548..a4895ecb2f 100644
--- a/include/link.h
+++ b/include/link.h
@@ -211,10 +211,20 @@ struct link_map
     /* Collected results of relocation while profiling.  */
     struct reloc_result
     {
+      /* CONCURRENCY NOTE: This is used to guard the concurrent initialization
+	 of the relocation result across multiple threads. See the more
+	 detailed notes in elf/dl-runtime.c.  */
+      unsigned int reloc_init;
       DL_FIXUP_VALUE_TYPE addr;
       struct link_map *bound;
       unsigned int boundndx;
+      /* We use this to store the la_symbind flag results.  */
       uint32_t enterexit;
+      /* The enterexit field holds the plt exit/enter results from the symbol
+	 binding, which leaves the two low bits of the following flag unused.
+	 Low 2-bits of each char holds plt exit/enter tracing information for
+	 each dynamic loader namespace (and because of that we don't want to
+	 use this as the guard variable for initialization).  */
       unsigned int flags;
     } *l_reloc_result;
 
diff --git a/nptl/Makefile b/nptl/Makefile
index be8066524c..ba2ce125c3 100644
--- a/nptl/Makefile
+++ b/nptl/Makefile
@@ -382,7 +382,8 @@ tests += tst-cancelx2 tst-cancelx3 tst-cancelx4 tst-cancelx5 \
 	 tst-cleanupx0 tst-cleanupx1 tst-cleanupx2 tst-cleanupx3 tst-cleanupx4 \
 	 tst-oncex3 tst-oncex4
 ifeq ($(build-shared),yes)
-tests += tst-atfork2 tst-tls4 tst-_res1 tst-fini1 tst-compat-forwarder
+tests += tst-atfork2 tst-tls4 tst-_res1 tst-fini1 tst-compat-forwarder \
+	 tst-audit-threads
 tests-internal += tst-tls3 tst-tls3-malloc tst-tls5 tst-stackguard1
 tests-nolibpthread += tst-fini1
 ifeq ($(have-z-execstack),yes)
@@ -394,7 +395,8 @@ modules-names = tst-atfork2mod tst-tls3mod tst-tls4moda tst-tls4modb \
 		tst-tls5mod tst-tls5moda tst-tls5modb tst-tls5modc \
 		tst-tls5modd tst-tls5mode tst-tls5modf tst-stack4mod \
 		tst-_res1mod1 tst-_res1mod2 tst-execstack-mod tst-fini1mod \
-		tst-join7mod tst-compat-forwarder-mod
+		tst-join7mod tst-compat-forwarder-mod tst-audit-threads-mod1 \
+		tst-audit-threads-mod2
 extra-test-objs += $(addsuffix .os,$(strip $(modules-names))) \
 		   tst-cleanup4aux.o tst-cleanupx4aux.o
 test-extras += tst-cleanup4aux tst-cleanupx4aux
@@ -709,6 +711,11 @@ endif
 
 $(objpfx)tst-compat-forwarder: $(objpfx)tst-compat-forwarder-mod.so
 
+$(objpfx)tst-audit-threads: $(objpfx)tst-audit-threads-mod2.so
+$(objpfx)tst-audit-threads.out: $(objpfx)tst-audit-threads-mod1.so
+tst-audit-threads-ENV = LD_AUDIT=$(objpfx)tst-audit-threads-mod1.so
+LDFLAGS-tst-audit-threads = -Wl,-z,lazy
+
 # The tests here better do not run in parallel
 ifneq ($(filter %tests,$(MAKECMDGOALS)),)
 .NOTPARALLEL:
diff --git a/nptl/tst-audit-threads-mod1.c b/nptl/tst-audit-threads-mod1.c
new file mode 100644
index 0000000000..45f89017d3
--- /dev/null
+++ b/nptl/tst-audit-threads-mod1.c
@@ -0,0 +1,43 @@
+/* Dummy audit library for test-audit-threads.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <elf.h>
+#include <link.h>
+#include <stdio.h>
+#include <assert.h>
+#include <string.h>
+
+/* We must use a dummy LD_AUDIT module to force the dynamic loader to
+   *not* update the real PLT, and instead use a cached value for the
+   lazy resolution result. It is the update of that cached value that
+   we are testing for correctness by doing this.  */
+
+volatile int count = 0;
+
+unsigned int
+la_version (unsigned int ver)
+{
+  return 1;
+}
+
+unsigned int
+la_objopen (struct link_map *map, Lmid_t lmid, uintptr_t *cookie)
+{
+  return LA_FLG_BINDTO | LA_FLG_BINDFROM;
+}
diff --git a/nptl/tst-audit-threads-mod2.c b/nptl/tst-audit-threads-mod2.c
new file mode 100644
index 0000000000..f9817dd3dc
--- /dev/null
+++ b/nptl/tst-audit-threads-mod2.c
@@ -0,0 +1,22 @@
+/* Shared object with a huge number of functions for test-audit-threads.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* Define all the retNumN functions in a library.  */
+#define definenum
+#include "tst-audit-threads.h"
diff --git a/nptl/tst-audit-threads.c b/nptl/tst-audit-threads.c
new file mode 100644
index 0000000000..d303648c96
--- /dev/null
+++ b/nptl/tst-audit-threads.c
@@ -0,0 +1,97 @@
+/* Test multi-threading using LD_AUDIT.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* This test uses a dummy LD_AUDIT library (test-audit-threads-mod1) and a
+   library with a huge number of functions in order to validate lazy symbol
+   binding with an audit library.  We use one thread per CPU to test that
+   concurrent lazy resolution does not have any defects which would cause
+   the process to fail.  We use an LD_AUDIT library to force the testing of
+   the relocation resolution caching code in the dynamic loader i.e.
+   _dl_runtime_profile and _dl_profile_fixup.  */
+
+#include <support/xthread.h>
+#include <strings.h>
+#include <stdlib.h>
+#include <sys/sysinfo.h>
+
+static int do_test (void);
+
+/* This test usually takes less than 3s to run.  However, there are cases that
+   take up to 30s.  */
+#define TIMEOUT 60
+#define TEST_FUNCTION do_test ()
+#include "../test-skeleton.c"
+
+/* Declare the functions we are going to call.  */
+#define externnum
+#include "tst-audit-threads.h"
+#undef externnum
+
+int num_threads;
+pthread_barrier_t barrier;
+
+void
+sync_all (int num)
+{
+  pthread_barrier_wait (&barrier);
+}
+
+void
+call_all_ret_nums (void)
+{
+/* Call each function one at a time from all threads.  */
+#define callnum
+#include "tst-audit-threads.h"
+#undef callnum
+}
+
+void *
+thread_main (void *unused)
+{
+  call_all_ret_nums ();
+  return NULL;
+}
+
+#define STR2(X) #X
+#define STR(X) STR2(X)
+
+static int
+do_test (void)
+{
+  int i;
+  pthread_t *threads;
+
+  num_threads = get_nprocs ();
+  if (num_threads <= 1)
+    num_threads = 2;
+
+  /* Used to synchronize all the threads after calling each retNumN.  */
+  xpthread_barrier_init (&barrier, NULL, num_threads);
+
+  threads = (pthread_t *) xcalloc (num_threads, sizeof(pthread_t));
+  for (i = 0; i < num_threads; i++)
+    threads[i] = xpthread_create(NULL, thread_main, NULL);
+
+  for (i = 0; i < num_threads; i++)
+    xpthread_join(threads[i]);
+
+  free (threads);
+
+  return 0;
+}
diff --git a/nptl/tst-audit-threads.h b/nptl/tst-audit-threads.h
new file mode 100644
index 0000000000..491d0dcbf0
--- /dev/null
+++ b/nptl/tst-audit-threads.h
@@ -0,0 +1,89 @@
+/* Helper header for test-audit-threads.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* We use this helper to create a large number of functions, all of
+   which will be resolved lazily and thus have their PLT updated.
+   This is done to provide enough functions that we can statistically
+   observe a thread vs. PLT resolution failure if one exists.  */
+
+#define CONCAT(a, b) a ## b
+#define NUM(x, y) CONCAT (x, y)
+
+#define FUNC10(x)	\
+  FUNC (NUM (x, 0));	\
+  FUNC (NUM (x, 1));	\
+  FUNC (NUM (x, 2));	\
+  FUNC (NUM (x, 3));	\
+  FUNC (NUM (x, 4));	\
+  FUNC (NUM (x, 5));	\
+  FUNC (NUM (x, 6));	\
+  FUNC (NUM (x, 7));	\
+  FUNC (NUM (x, 8));	\
+  FUNC (NUM (x, 9))
+
+#define FUNC100(x)	\
+  FUNC10 (NUM (x, 0));	\
+  FUNC10 (NUM (x, 1));	\
+  FUNC10 (NUM (x, 2));	\
+  FUNC10 (NUM (x, 3));	\
+  FUNC10 (NUM (x, 4));	\
+  FUNC10 (NUM (x, 5));	\
+  FUNC10 (NUM (x, 6));	\
+  FUNC10 (NUM (x, 7));	\
+  FUNC10 (NUM (x, 8));	\
+  FUNC10 (NUM (x, 9))
+
+#define FUNC1000(x)		\
+  FUNC100 (NUM (x, 0));		\
+  FUNC100 (NUM (x, 1));		\
+  FUNC100 (NUM (x, 2));		\
+  FUNC100 (NUM (x, 3));		\
+  FUNC100 (NUM (x, 4));		\
+  FUNC100 (NUM (x, 5));		\
+  FUNC100 (NUM (x, 6));		\
+  FUNC100 (NUM (x, 7));		\
+  FUNC100 (NUM (x, 8));		\
+  FUNC100 (NUM (x, 9))
+
+#define FUNC7000()	\
+  FUNC1000 (1);		\
+  FUNC1000 (2);		\
+  FUNC1000 (3);		\
+  FUNC1000 (4);		\
+  FUNC1000 (5);		\
+  FUNC1000 (6);		\
+  FUNC1000 (7);
+
+#ifdef FUNC
+# undef FUNC
+#endif
+
+#ifdef externnum
+# define FUNC(x) extern int CONCAT (retNum, x) (void)
+#endif
+
+#ifdef definenum
+# define FUNC(x) int CONCAT (retNum, x) (void) { return x; }
+#endif
+
+#ifdef callnum
+# define FUNC(x) CONCAT (retNum, x) (); sync_all (x)
+#endif
+
+FUNC7000 ();

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv4] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-10-18  2:02           ` [PATCHv4] " Tulio Magno Quites Machado Filho
@ 2018-10-18  2:17             ` Carlos O'Donell
  2018-10-23 21:07               ` [PATCHv5] " Tulio Magno Quites Machado Filho
  0 siblings, 1 reply; 40+ messages in thread
From: Carlos O'Donell @ 2018-10-18  2:17 UTC (permalink / raw)
  To: Tulio Magno Quites Machado Filho, Florian Weimer, libc-alpha,
	John David Anglin, Adhemerval Zanella, Joseph Myers,
	Florian Weimer

On 10/17/18 10:00 PM, Tulio Magno Quites Machado Filho wrote:
> Changes since v3:
> 
>  - Improved comments.
>  - Started to use -Wl,-z,now.
>  - Added field init to l_reloc_result to be used as a guard.

Thank you for working through this!

OK, comments below.

Please send v5, and I think we're done.

> Changes since v2:
> 
>  - Fixed coding style in nptl/tst-audit-threads-mod1.c.
>  - Replaced pthreads.h functions with respective support/xthread.h ones.
>  - Replaced malloc() with xcalloc() in nptl/tst-audit-threads.c.
>  - Removed bzero().
>  - Reduced the amount of functions to 7k in order to fit the relocation
>    limit of some architectures, e.g. m68k, mips.
>  - Fixed issues in nptl/Makefile.
> 
> Changes since v1:
> 
>  - Fixed the coding style issues.
>  - Replaced atomic loads/store with memory fences.
>  - Added a test.
> 
> ---- 8< ----
> 
> The field reloc_result->addr is used to indicate if the rest of the
> fields of reloc_result have already been written, creating a
> data-dependency order.
> Reading reloc_result->addr to the variable value requires to complete
> before reading the rest of the fields of reloc_result.
> Likewise, the writes to the other fields of the reloc_result must
> complete before reloc_result->addr is updated.

Commit message is incorrect. Could you please update the patch and send
out with a more verbose commit message which covers the key points of
the issue and the solution? Also please mention that data races remain.

> Tested with build-many-glibcs.

Perfect.

> 2018-10-17  Tulio Magno Quites Machado Filho  <tuliom@linux.ibm.com>
> 
> 	[BZ #23690]
> 	* elf/dl-runtime.c (_dl_profile_fixup): Guarantee memory
> 	modification order when accessing reloc_result->addr.
> 	* include/link.h (reloc_result): Add field init.
> 	* nptl/Makefile (tests): Add tst-audit-threads.
> 	(modules-names): Add tst-audit-threads-mod1 and
> 	tst-audit-threads-mod2.
> 	Add rules to build tst-audit-threads.
> 	* nptl/tst-audit-threads-mod1.c: New file.
> 	* nptl/tst-audit-threads-mod2.c: Likewise.
> 	* nptl/tst-audit-threads.c: Likewise.
> 	* nptl/tst-audit-threads.h: Likewise.
> 
> Signed-off-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
> ---
>  elf/dl-runtime.c              | 30 +++++++++++--
>  include/link.h                |  2 +
>  nptl/Makefile                 | 14 ++++++-
>  nptl/tst-audit-threads-mod1.c | 74 +++++++++++++++++++++++++++++++++
>  nptl/tst-audit-threads-mod2.c | 22 ++++++++++
>  nptl/tst-audit-threads.c      | 97 +++++++++++++++++++++++++++++++++++++++++++
>  nptl/tst-audit-threads.h      | 89 +++++++++++++++++++++++++++++++++++++++
>  7 files changed, 322 insertions(+), 6 deletions(-)
>  create mode 100644 nptl/tst-audit-threads-mod1.c
>  create mode 100644 nptl/tst-audit-threads-mod2.c
>  create mode 100644 nptl/tst-audit-threads.c
>  create mode 100644 nptl/tst-audit-threads.h
> 
> diff --git a/elf/dl-runtime.c b/elf/dl-runtime.c
> index 63bbc89776..a760a00a62 100644
> --- a/elf/dl-runtime.c
> +++ b/elf/dl-runtime.c
> @@ -183,10 +183,22 @@ _dl_profile_fixup (
>    /* This is the address in the array where we store the result of previous
>       relocations.  */
>    struct reloc_result *reloc_result = &l->l_reloc_result[reloc_index];
> +
> +  /* CONCURRENCY NOTES:
> +
> +     Multiple threads may be calling the same PLT sequence and with LD_AUDIT
> +     enabled they will be calling into _dl_profile_fixup to update the
> +     reloc_result with the result of the lazy resolution.  The reloc_result
> +     guard variable is init, and we use relaxed MO loads and stores to it
> +     along with an atomic_thread_acquire and atomic_thread_release fence to
> +     ensure that the results of the structure are consistent with the
> +     loaded value of the guard.  */

Suggest:

  /* CONCURRENCY NOTES:

     Multiple threads may be calling the same PLT sequence and with
     LD_AUDIT enabled they will be calling into _dl_profile_fixup to
     update the reloc_result with the result of the lazy resolution.
     The reloc_result guard variable is reloc_init, and we use
     acquire/release loads and stores to it to ensure that the results of
     the structure are consistent with the loaded value of the guard.
     This does not fix all of the data races that occur when two or more
     threads read reloc_result->reloc_init with a value of zero and read
     and write to that reloc_result concurrently.  The expectation is
     generally that while this is a data race it works because the
     threads write the same values.  Until the data races are fixed
     there is a potential for problems to arise from these data races.
     The reloc result updates should happen in parallel but there should
     be an atomic RMW which does the final update to the real result
     entry (see bug 23790).

     The following code uses reloc_init set to 0 to indicate if it is the
     first time this object is being relocated, otherwise 1 which
     indicates the object has already been relocated.

     Reading/Writing from/to reloc_result->reloc_init must not happen
     before previous writes to reloc_result complete as they could
     end-up with an incomplete struct.  */

>    DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;

Please see my patch:
https://www.sourceware.org/ml/libc-alpha/2018-10/msg00320.html

You can get rid of *resultp entirely.


> +  DL_FIXUP_VALUE_TYPE value;
> +  unsigned int init = atomic_load_relaxed (&reloc_result->init);

OK.

> +  atomic_thread_fence_acquire ();

Can't we just use atomic_load_acquire and atomic_store_release now?
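
For illustration only, here is a minimal sketch of the acquire/release
guard pattern written against C11 <stdatomic.h> rather than glibc's
internal atomic_load_acquire/atomic_store_release macros.  The names
struct cached_result, publish, lookup and check_publish are
hypothetical and do not come from the patch; the struct is not the
real reloc_result layout.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for struct reloc_result; not the glibc layout.  */
struct cached_result
{
  uintptr_t addr;         /* Plain payload, published via the guard.  */
  unsigned int boundndx;  /* Another plain payload field.  */
  atomic_uint init;       /* Guard: 0 = not yet resolved, 1 = resolved.  */
};

/* Writer: fill in the payload first, then publish it with a release
   store to the guard.  */
static void
publish (struct cached_result *r, uintptr_t addr, unsigned int ndx)
{
  r->addr = addr;
  r->boundndx = ndx;
  atomic_store_explicit (&r->init, 1, memory_order_release);
}

/* Reader: an acquire load that observes 1 also observes the payload
   written before the release store.  Returns 1 and sets *ADDR if the
   entry was already published, 0 otherwise.  */
static int
lookup (struct cached_result *r, uintptr_t *addr)
{
  if (atomic_load_explicit (&r->init, memory_order_acquire) == 0)
    return 0;
  *addr = r->addr;
  return 1;
}

/* Single-threaded sanity check of the guard protocol.  */
static void
check_publish (void)
{
  struct cached_result r = { 0 };
  uintptr_t addr = 0;
  assert (lookup (&r, &addr) == 0);
  publish (&r, 0x1234, 7);
  assert (lookup (&r, &addr) == 1 && addr == 0x1234);
}
```

Note that this only orders the payload against the guard.  Two threads
that both see the guard as 0 would still both call publish, which is
the remaining data race discussed elsewhere in this thread.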

>  
> -  DL_FIXUP_VALUE_TYPE value = *resultp;
> -  if (DL_FIXUP_VALUE_CODE_ADDR (value) == 0)
> +  if (init == 0)

OK.

>      {
>        /* This is the first time we have to relocate this object.  */
>        const ElfW(Sym) *const symtab
> @@ -346,16 +358,26 @@ _dl_profile_fixup (
>  
>        /* Store the result for later runs.  */
>        if (__glibc_likely (! GLRO(dl_bind_not)))
> -	*resultp = value;
> +	{
> +	  *resultp = value;
> +	  atomic_thread_fence_release ();
> +	  /* Guarantee all previous writes complete before
> +	     init is updated.  See CONCURRENCY NOTES earlier  */
> +	  atomic_store_relaxed (&reloc_result->init, 1);

This just becomes an atomic_store_release on init.

> +	}
> +      init = 1;

OK (Ooops! I forgot this in my patch! :-))

>      }
> +  else
> +    value = *resultp;

This can just be value = reloc_result->addr;

>  
>    /* By default we do not call the pltexit function.  */
>    long int framesize = -1;
>  
> +
>  #ifdef SHARED
>    /* Auditing checkpoint: report the PLT entering and allow the
>       auditors to change the value.  */
> -  if (DL_FIXUP_VALUE_CODE_ADDR (value) != 0 && GLRO(dl_naudit) > 0
> +  if (init != 0 && GLRO(dl_naudit) > 0

OK.

>        /* Don't do anything if no auditor wants to intercept this call.  */
>        && (reloc_result->enterexit & LA_SYMB_NOPLTENTER) == 0)
>      {
> diff --git a/include/link.h b/include/link.h
> index 5924594548..1d13d02637 100644
> --- a/include/link.h
> +++ b/include/link.h
> @@ -216,6 +216,8 @@ struct link_map
>        unsigned int boundndx;
>        uint32_t enterexit;
>        unsigned int flags;
> +      /* Indicates if reloc_result fields have been initialized.  */
> +      unsigned int init;

Suggest:

/* CONCURRENCY NOTE: This is used to guard the concurrent initialization
   of the relocation result across multiple threads. See the more
   detailed notes in elf/dl-runtime.c.  */

>      } *l_reloc_result;
>  
>      /* Pointer to the version information if available.  */
> diff --git a/nptl/Makefile b/nptl/Makefile
> index be8066524c..9862ef53fc 100644
> --- a/nptl/Makefile
> +++ b/nptl/Makefile
> @@ -382,7 +382,8 @@ tests += tst-cancelx2 tst-cancelx3 tst-cancelx4 tst-cancelx5 \
>  	 tst-cleanupx0 tst-cleanupx1 tst-cleanupx2 tst-cleanupx3 tst-cleanupx4 \
>  	 tst-oncex3 tst-oncex4
>  ifeq ($(build-shared),yes)
> -tests += tst-atfork2 tst-tls4 tst-_res1 tst-fini1 tst-compat-forwarder
> +tests += tst-atfork2 tst-tls4 tst-_res1 tst-fini1 tst-compat-forwarder \
> +	 tst-audit-threads

OK.

>  tests-internal += tst-tls3 tst-tls3-malloc tst-tls5 tst-stackguard1
>  tests-nolibpthread += tst-fini1
>  ifeq ($(have-z-execstack),yes)
> @@ -394,7 +395,8 @@ modules-names = tst-atfork2mod tst-tls3mod tst-tls4moda tst-tls4modb \
>  		tst-tls5mod tst-tls5moda tst-tls5modb tst-tls5modc \
>  		tst-tls5modd tst-tls5mode tst-tls5modf tst-stack4mod \
>  		tst-_res1mod1 tst-_res1mod2 tst-execstack-mod tst-fini1mod \
> -		tst-join7mod tst-compat-forwarder-mod
> +		tst-join7mod tst-compat-forwarder-mod tst-audit-threads-mod1 \
> +		tst-audit-threads-mod2

OK.

>  extra-test-objs += $(addsuffix .os,$(strip $(modules-names))) \
>  		   tst-cleanup4aux.o tst-cleanupx4aux.o
>  test-extras += tst-cleanup4aux tst-cleanupx4aux
> @@ -709,6 +711,14 @@ endif
>  
>  $(objpfx)tst-compat-forwarder: $(objpfx)tst-compat-forwarder-mod.so
>  
> +# Protect against a build using -Wl,-z,now.
> +LDFLAGS-tst-audit-threads-mod1.so = -Wl,-z,lazy
> +LDFLAGS-tst-audit-threads-mod2.so = -Wl,-z,lazy
> +LDFLAGS-tst-audit-threads = -Wl,-z,lazy

OK. Yes! Perfect, good belt-and-suspenders.

> +$(objpfx)tst-audit-threads: $(objpfx)tst-audit-threads-mod2.so
> +$(objpfx)tst-audit-threads.out: $(objpfx)tst-audit-threads-mod1.so
> +tst-audit-threads-ENV = LD_AUDIT=$(objpfx)tst-audit-threads-mod1.so
> +

OK.

>  # The tests here better do not run in parallel
>  ifneq ($(filter %tests,$(MAKECMDGOALS)),)
>  .NOTPARALLEL:
> diff --git a/nptl/tst-audit-threads-mod1.c b/nptl/tst-audit-threads-mod1.c
> new file mode 100644
> index 0000000000..6fa0c0c6c4
> --- /dev/null
> +++ b/nptl/tst-audit-threads-mod1.c
> @@ -0,0 +1,74 @@
> +/* Dummy audit library for test-audit-threads.
> +
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include <elf.h>
> +#include <link.h>
> +#include <stdio.h>
> +#include <assert.h>
> +#include <string.h>
> +
> +/* We must use a dummy LD_AUDIT module to force the dynamic loader to
> +   *not* update the real PLT, and instead use a cached value for the
> +   lazy resolution result.  It is the update of that cached value that
> +   we are testing for correctness by doing this.  */
> +
> +/* Library to be audited.  */
> +#define LIB "tst-audit-threads-mod2.so"
> +/* CALLNUM is the number of retNum functions.  */
> +#define CALLNUM 7999
> +
> +#define CONCATX(a, b) __CONCAT (a, b)
> +
> +static int previous = 0;
> +
> +unsigned int
> +la_version (unsigned int ver)
> +{
> +  return 1;
> +}
> +
> +unsigned int
> +la_objopen (struct link_map *map, Lmid_t lmid, uintptr_t *cookie)
> +{
> +  return LA_FLG_BINDTO | LA_FLG_BINDFROM;
> +}
> +
> +uintptr_t
> +CONCATX(la_symbind, __ELF_NATIVE_CLASS) (ElfW(Sym) *sym,
> +					unsigned int ndx,
> +					uintptr_t *refcook,
> +					uintptr_t *defcook,
> +					unsigned int *flags,
> +					const char *symname)
> +{
> +  const char * retnum = "retNum";
> +  char * num = strstr (symname, retnum);
> +  int n;
> +  /* Validate if the symbols are getting called in the correct order.
> +     This code is here just to guarantee Binutils will not optimize out this
> +     function.  */

Slight clarification:

"This code is here to verify binutils does not optimize out the PLT
 entries that require the symbol binding."


> +  if (num != NULL)
> +    {
> +      n = atoi (num);
> +      assert (n >= previous);
> +      assert (n <= CALLNUM);
> +      previous = n;
> +    }
> +  return sym->st_value;
> +}
> diff --git a/nptl/tst-audit-threads-mod2.c b/nptl/tst-audit-threads-mod2.c
> new file mode 100644
> index 0000000000..f9817dd3dc
> --- /dev/null
> +++ b/nptl/tst-audit-threads-mod2.c
> @@ -0,0 +1,22 @@
> +/* Shared object with a huge number of functions for test-audit-threads.
> +
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +/* Define all the retNumN functions in a library.  */
> +#define definenum
> +#include "tst-audit-threads.h"
> diff --git a/nptl/tst-audit-threads.c b/nptl/tst-audit-threads.c
> new file mode 100644
> index 0000000000..e4bf433bd8
> --- /dev/null
> +++ b/nptl/tst-audit-threads.c
> @@ -0,0 +1,97 @@
> +/* Test multi-threading using LD_AUDIT.
> +
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +/* This test uses a dummy LD_AUDIT library (test-audit-threads-mod1) and a
> +   library with a huge number of functions in order to validate lazy symbol
> +   binding with an audit library.  We use one thread per CPU to test that
> +   concurrent lazy resolution does not have any defects which would cause
> +   the process to fail.  We use an LD_AUDIT library to force the testing of
> +   the relocation resolution caching code in the dynamic loader i.e.
> +   _dl_runtime_profile and _dl_profile_fixup.  */

OK.

> +
> +#include <support/xthread.h>
> +#include <strings.h>
> +#include <stdlib.h>
> +#include <sys/sysinfo.h>
> +
> +static int do_test (void);
> +
> +/* This test usually takes less than 3s to run.  However, there are cases that
> +   take up to 30s.  */
> +#define TIMEOUT 60
> +#define TEST_FUNCTION do_test ()
> +#include "../test-skeleton.c"
> +
> +/* Declare the functions we are going to call.  */
> +#define externnum
> +#include "tst-audit-threads.h"
> +#undef externnum
> +
> +int num_threads;
> +pthread_barrier_t barrier;
> +
> +void
> +sync_all (int num)
> +{
> +  pthread_barrier_wait (&barrier);
> +}
> +
> +void
> +call_all_ret_nums (void)
> +{
> +  /* Call each function one at a time from all threads.  */
> +#define callnum
> +#include "tst-audit-threads.h"
> +#undef callnum
> +}
> +
> +void *
> +thread_main (void *unused)
> +{
> +  call_all_ret_nums ();
> +  return NULL;
> +}
> +
> +#define STR2(X) #X
> +#define STR(X) STR2(X)
> +
> +static int
> +do_test (void)
> +{
> +  int i;
> +  pthread_t *threads;
> +
> +  num_threads = get_nprocs ();
> +  if (num_threads <= 1)
> +    num_threads = 2;
> +
> +  /* Used to synchronize all the threads after calling each retNumN.  */
> +  xpthread_barrier_init (&barrier, NULL, num_threads);
> +
> +  threads = (pthread_t *) xcalloc (num_threads, sizeof(pthread_t));
> +  for (i = 0; i < num_threads; i++)
> +    threads[i] = xpthread_create(NULL, thread_main, NULL);
> +
> +  for (i = 0; i < num_threads; i++)
> +    xpthread_join(threads[i]);
> +
> +  free (threads);
> +
> +  return 0;
> +}

OK.

> diff --git a/nptl/tst-audit-threads.h b/nptl/tst-audit-threads.h
> new file mode 100644
> index 0000000000..491d0dcbf0
> --- /dev/null
> +++ b/nptl/tst-audit-threads.h
> @@ -0,0 +1,89 @@
> +/* Helper header for test-audit-threads.
> +
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +/* We use this helper to create a large number of functions, all of
> +   which will be resolved lazily and thus have their PLT updated.
> +   This is done to provide enough functions that we can statistically
> +   observe a thread vs. PLT resolution failure if one exists.  */
> +
> +#define CONCAT(a, b) a ## b
> +#define NUM(x, y) CONCAT (x, y)
> +
> +#define FUNC10(x)	\
> +  FUNC (NUM (x, 0));	\
> +  FUNC (NUM (x, 1));	\
> +  FUNC (NUM (x, 2));	\
> +  FUNC (NUM (x, 3));	\
> +  FUNC (NUM (x, 4));	\
> +  FUNC (NUM (x, 5));	\
> +  FUNC (NUM (x, 6));	\
> +  FUNC (NUM (x, 7));	\
> +  FUNC (NUM (x, 8));	\
> +  FUNC (NUM (x, 9))
> +
> +#define FUNC100(x)	\
> +  FUNC10 (NUM (x, 0));	\
> +  FUNC10 (NUM (x, 1));	\
> +  FUNC10 (NUM (x, 2));	\
> +  FUNC10 (NUM (x, 3));	\
> +  FUNC10 (NUM (x, 4));	\
> +  FUNC10 (NUM (x, 5));	\
> +  FUNC10 (NUM (x, 6));	\
> +  FUNC10 (NUM (x, 7));	\
> +  FUNC10 (NUM (x, 8));	\
> +  FUNC10 (NUM (x, 9))
> +
> +#define FUNC1000(x)		\
> +  FUNC100 (NUM (x, 0));		\
> +  FUNC100 (NUM (x, 1));		\
> +  FUNC100 (NUM (x, 2));		\
> +  FUNC100 (NUM (x, 3));		\
> +  FUNC100 (NUM (x, 4));		\
> +  FUNC100 (NUM (x, 5));		\
> +  FUNC100 (NUM (x, 6));		\
> +  FUNC100 (NUM (x, 7));		\
> +  FUNC100 (NUM (x, 8));		\
> +  FUNC100 (NUM (x, 9))
> +
> +#define FUNC7000()	\
> +  FUNC1000 (1);		\
> +  FUNC1000 (2);		\
> +  FUNC1000 (3);		\
> +  FUNC1000 (4);		\
> +  FUNC1000 (5);		\
> +  FUNC1000 (6);		\
> +  FUNC1000 (7);
> +
> +#ifdef FUNC
> +# undef FUNC
> +#endif
> +
> +#ifdef externnum
> +# define FUNC(x) extern int CONCAT (retNum, x) (void)
> +#endif
> +
> +#ifdef definenum
> +# define FUNC(x) int CONCAT (retNum, x) (void) { return x; }
> +#endif
> +
> +#ifdef callnum
> +# define FUNC(x) CONCAT (retNum, x) (); sync_all (x)
> +#endif
> +
> +FUNC7000 ();
> -- 2.14.4


-- 
Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv3] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-10-18  2:14                 ` Carlos O'Donell
@ 2018-10-18  7:24                   ` Carlos O'Donell
  2018-10-18 10:21                   ` Florian Weimer
  1 sibling, 0 replies; 40+ messages in thread
From: Carlos O'Donell @ 2018-10-18  7:24 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Tulio Magno Quites Machado Filho, libc-alpha, John David Anglin,
	Adhemerval Zanella, Joseph Myers

On 10/17/18 10:02 PM, Carlos O'Donell wrote:
> My position is unchanged.
> 
> If I could summarize our positions, I would write:
> 
> (1) The "Pragmatic approach"
> 
> - Since we don't have C11 atomics, we should just use the fences because
>   they fix the data dependency issue, and stop there. We need not go any
>   further until we are ready to fix the underlying algorithm of the result
>   updates and *then* we can follow C11.
> 
> (2) The "Incremental C11 approach"
> 
> - Assume we are C11, and take an incremental approach where we fix the
>   data dependency issue using correct synchronization primitives, even
>   if it doesn't solve all of the data races.
> 
> Did I summarize your position accurately?
> 
> I prefer (2).
> 
> Tulio can have a preference also.
> 
> I spoke to Tulio on IRC and he says he has a working v4 in build-many-glibcs.
> 
> I figure we'll commit something probably tomorrow for this.
> 
> Just for the record I'm attaching my WIP v4 so you can see what I mean about
> the solution. Yes, I did make the structure 4 bytes larger, and that will have
> a real impact, but it removes the dependency on the size of the function pointer,
> and I like that for future maintenance, and we'll likely rewrite the whole struct
> when we fix 23790 to get a more optimal layout.

I'm happy to see that Tulio and I basically had the same solution for v4.

Actually, I have a bug in mine where I didn't set local reloc_init to 1
which results in the pltenter not being called for all threads doing the
initialization, but Tulio's code has this fixed :-)

I've reviewed Tulio's patch and so with v5 I think we're done.

-- 
Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv3] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-10-18  2:14                 ` Carlos O'Donell
  2018-10-18  7:24                   ` Carlos O'Donell
@ 2018-10-18 10:21                   ` Florian Weimer
  2018-10-18 16:56                     ` Carlos O'Donell
  1 sibling, 1 reply; 40+ messages in thread
From: Florian Weimer @ 2018-10-18 10:21 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: Tulio Magno Quites Machado Filho, libc-alpha, John David Anglin,
	Adhemerval Zanella, Joseph Myers

* Carlos O'Donell:

> The use of the guard+fence-to-fence sync is, from a C11 perspective,
> correct.

I really don't think this is true:

| Two expression evaluations conflict if one of them modifies a memory
| location and the other one reads or modifies the same memory location.

(C11 5.1.2.4p4)

| The execution of a program contains a data race if it contains two
| conflicting actions in different threads, at least one of which is not
| atomic, and neither happens before the other. Any such data race
| results in undefined behavior.

(C11 5.1.2.4p25)

We still have unordered conflicting non-atomic writes after Tulio's
patch.  I don't think they matter to us.  But this is *not* correct for
C11.
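
As a rough sketch of one way the conflicting plain writes could be
avoided (this is an assumption for illustration, not the fix proposed
for this bug or for bug 23790), a compare-and-exchange on the guard
can elect a single writer.  The names struct slot, try_cache,
read_cached and check_try_cache, and the three-state guard, are
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

enum { SLOT_EMPTY, SLOT_BUSY, SLOT_READY };

/* Hypothetical cache slot; not the glibc reloc_result layout.  */
struct slot
{
  uintptr_t addr;    /* Plain field, written by at most one thread.  */
  atomic_int state;  /* SLOT_EMPTY -> SLOT_BUSY -> SLOT_READY.  */
};

/* Try to cache a locally computed result.  Returns 1 if this thread
   stored it, 0 if another thread had already claimed the slot (the
   caller then simply keeps using its local value).  */
static int
try_cache (struct slot *s, uintptr_t addr)
{
  int expected = SLOT_EMPTY;
  /* Atomic RMW: only one thread can move the slot out of SLOT_EMPTY.  */
  if (!atomic_compare_exchange_strong_explicit (&s->state, &expected,
						SLOT_BUSY,
						memory_order_relaxed,
						memory_order_relaxed))
    return 0;
  s->addr = addr;
  /* Publish the plain write to readers that acquire SLOT_READY.  */
  atomic_store_explicit (&s->state, SLOT_READY, memory_order_release);
  return 1;
}

/* Returns 1 and sets *ADDR only once the slot has been published.  */
static int
read_cached (struct slot *s, uintptr_t *addr)
{
  if (atomic_load_explicit (&s->state, memory_order_acquire) != SLOT_READY)
    return 0;
  *addr = s->addr;
  return 1;
}

/* Single-threaded sanity check: the second writer loses.  */
static void
check_try_cache (void)
{
  struct slot s = { 0 };
  uintptr_t addr = 0;
  assert (read_cached (&s, &addr) == 0);
  assert (try_cache (&s, 0x1000) == 1);
  assert (try_cache (&s, 0x2000) == 0);
  assert (read_cached (&s, &addr) == 1 && addr == 0x1000);
}
```

With this shape no two threads ever write addr, so the conflict
described by 5.1.2.4p4 does not arise for that field.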

Thanks,
Florian

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv3] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-10-18 10:21                   ` Florian Weimer
@ 2018-10-18 16:56                     ` Carlos O'Donell
  2018-10-18 18:22                       ` Adhemerval Zanella
  0 siblings, 1 reply; 40+ messages in thread
From: Carlos O'Donell @ 2018-10-18 16:56 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Tulio Magno Quites Machado Filho, libc-alpha, John David Anglin,
	Adhemerval Zanella, Joseph Myers

On 10/18/18 3:24 AM, Florian Weimer wrote:
> * Carlos O'Donell:
> 
>> The use of the guard+fence-to-fence sync is, from a C11 perspective,
>> correct.
> 
> I really don't think this is true:
> 
> | Two expression evaluations conflict if one of them modifies a memory
> | location and the other one reads or modifies the same memory location.
> 
> (C11 5.1.2.4p4)
> 
> | The execution of a program contains a data race if it contains two
> | conflicting actions in different threads, at least one of which is not
> | atomic, and neither happens before the other. Any such data race
> | results in undefined behavior.
> 
> (C11 5.1.2.4p25)
> 
> We still have unordered conflicting non-atomic writes after Tulio's
> patch.  I don't think they matter to us.  But this is *not* correct for
> C11.

I agree completely. My point is that the change, the specific lines Tulio
is touching, and the changes made, are correct: a fence-to-fence sync
requires an atomic guard access. I agree it doesn't fix the actual problem
of multiple threads doing the same updates to the reloc_result.

glibc is *full* of data races, and that doesn't mean we will just give up
on using C11 semantics until we can fix them all. Any changes we do, we
should do them so they are correct.

It really feels like we agree, but we're talking past each other.  Did my
previous email clarify our positions and which one I choose and why?

See:
https://www.sourceware.org/ml/libc-alpha/2018-10/msg00320.html

If I didn't understand your position correctly, please correct what I 
wrote so I can understand your suggestion.

-- 
Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv3] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-10-18 16:56                     ` Carlos O'Donell
@ 2018-10-18 18:22                       ` Adhemerval Zanella
  2018-10-18 19:25                         ` Carlos O'Donell
  0 siblings, 1 reply; 40+ messages in thread
From: Adhemerval Zanella @ 2018-10-18 18:22 UTC (permalink / raw)
  To: Carlos O'Donell, Florian Weimer
  Cc: Tulio Magno Quites Machado Filho, libc-alpha, John David Anglin,
	Joseph Myers



On 18/10/2018 10:39, Carlos O'Donell wrote:
> On 10/18/18 3:24 AM, Florian Weimer wrote:
>> * Carlos O'Donell:
>>
>>> The use of the guard+fence-to-fence sync is, from a C11 perspective,
>>> correct.
>>
>> I really don't think this is true:
>>
>> | Two expression evaluations conflict if one of them modifies a memory
>> | location and the other one reads or modifies the same memory location.
>>
>> (C11 5.1.2.4p4)
>>
>> | The execution of a program contains a data race if it contains two
>> | conflicting actions in different threads, at least one of which is not
>> | atomic, and neither happens before the other. Any such data race
>> | results in undefined behavior.
>>
>> (C11 5.1.2.4p25)
>>
>> We still have unordered conflicting non-atomic writes after Tulio's
>> patch.  I don't think they matter to us.  But this is *not* correct for
>> C11.
> 
> I agree completely. My point is that the change, the specific lines Tulio
> is touching, and the changes made, are correct: a fence-to-fence sync
> requires an atomic guard access. I agree it doesn't fix the actual problem
> of multiple threads doing the same updates to the reloc_result.
> 
> glibc is *full* of data races, and that doesn't mean we will just give up
> on using C11 semantics until we can fix them all. Any changes we do, we
> should do them so they are correct.
> 
> It really feels like we agree, but we're talking past each other.  Did my
> previous email clarify our positions and which one I choose and why?
> 
> See:
> https://www.sourceware.org/ml/libc-alpha/2018-10/msg00320.html
> 
> If I didn't understand your position correctly, please correct what I 
> wrote so I can understand your suggestion.
> 

Wouldn't just disabling lazy-resolution for LD_AUDIT be a simpler solution?
More and more distributions set bind-now as the default build option, and
auditing already implies some performance overhead (not to mention that the
lazy-resolution performance gain might not hold in real-world cases).

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv3] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-10-18 18:22                       ` Adhemerval Zanella
@ 2018-10-18 19:25                         ` Carlos O'Donell
  2018-10-18 20:01                           ` Adhemerval Zanella
  0 siblings, 1 reply; 40+ messages in thread
From: Carlos O'Donell @ 2018-10-18 19:25 UTC (permalink / raw)
  To: Adhemerval Zanella, Florian Weimer
  Cc: Tulio Magno Quites Machado Filho, libc-alpha, John David Anglin,
	Joseph Myers

On 10/18/18 2:21 PM, Adhemerval Zanella wrote:
> 
> 
> On 18/10/2018 10:39, Carlos O'Donell wrote:
>> On 10/18/18 3:24 AM, Florian Weimer wrote:
>>> * Carlos O'Donell:
>>>
>>>> The use of the guard+fence-to-fence sync is, from a C11 perspective,
>>>> correct.
>>>
>>> I really don't think this is true:
>>>
>>> | Two expression evaluations conflict if one of them modifies a memory
>>> | location and the other one reads or modifies the same memory location.
>>>
>>> (C11 5.1.2.4p4)
>>>
>>> | The execution of a program contains a data race if it contains two
>>> | conflicting actions in different threads, at least one of which is not
>>> | atomic, and neither happens before the other. Any such data race
>>> | results in undefined behavior.
>>>
>>> (C11 5.1.2.4p25)
>>>
>>> We still have unordered conflicting non-atomic writes after Tulio's
>>> patch.  I don't think they matter to us.  But this is *not* correct for
>>> C11.
>>
>> I agree completely. My point is that the change, the specific lines Tulio
>> is touching, and the changes made, are correct: a fence-to-fence sync
>> requires an atomic guard access. I agree it doesn't fix the actual problem
>> of multiple threads doing the same updates to the reloc_result.
>>
>> glibc is *full* of data races, and that doesn't mean we will just give up
>> on using C11 semantics until we can fix them all. Any changes we do, we
>> should do them so they are correct.
>>
>> It really feels like we agree, but we're talking past each other.  Did my
>> previous email clarify our positions and which one I choose and why?
>>
>> See:
>> https://www.sourceware.org/ml/libc-alpha/2018-10/msg00320.html
>>
>> If I didn't understand your position correctly, please correct what I 
>> wrote so I can understand your suggestion.
>>
> 
> Wouldn't just disabling lazy-resolution for LD_AUDIT be a simpler solution?

This is not the question I would ask myself in this case.

Consider that auditing is independent of the manner in which the application
is deployed by the user (built with or without lazy binding).

Thus enabling auditing should have as little impact on the underlying
application deployment as possible.

Forcing immediate binding for LD_AUDIT has an impact we cannot measure,
because we aren't the user with the application.

The point of these features is to allow for users to customize their choices
to meet their application needs. It is not a one-size-fits-all.

> More and more distributions set bind-now as the default build option, and
> auditing already implies some performance overhead (not to mention that the
> lazy-resolution performance gain might not hold in real-world cases).
 
Distribution choices are different from user application choices.

Sometimes we make unilateral choices, but only if it's a clear win.

The most recent case was AArch64 TLSDESC, where Arm decided that TLSDESC
would always be resolved non-lazily (Szabolcs will correct me if I'm wrong).
This was a case where the synchronization required to update the TLSDESC
was so costly on a per-function-call basis that it was clearly always a
win to force TLSDESC to always be immediately bound, and drop the required
synchronization (a cost you always had to pay).

Here the situation is less clear, and we have less data with which to make
the choice. Selection of lazy vs. non-lazy is still a choice we give users
and it is independent of auditing.

In summary:

- Selection of lazy vs non-lazy binding is presently an orthogonal user
  choice from auditing.

- Distribution choices are about general solutions that work best for a
  large number of users.

- Lastly, a one-size-fits-all solution doesn't work best for all users.

Unless there is a very strong and compelling reason to force non-lazy-binding
for LD_AUDIT, I would not recommend we do it. It's just a question of user
choice.

I also think that the new reloc_result.init field can now be used to
implement a lockless algorithm to update the relocs without data races,
but it would be "part 2" of fixing P&C for LD_AUDIT.
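
To make the idea concrete, here is a minimal sketch of one way such a
lockless update could look. It is an assumption for illustration only, not
the glibc implementation: struct demo_slot, the three DEMO_* states,
demo_fixup and demo_resolve_stub are all hypothetical names, and the sketch
ignores the side effects of calling the audit callbacks from more than one
thread. Every thread resolves the symbol locally, but only the thread that
wins the compare-and-exchange writes the shared fields, so there are no
conflicting non-atomic writes left.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical guard states; the patch itself only uses 0 and 1.  */
enum { DEMO_UNINIT = 0, DEMO_BUSY = 1, DEMO_READY = 2 };

/* Hypothetical stand-in for struct reloc_result.  */
struct demo_slot
{
  uintptr_t addr;     /* Payload, written by exactly one thread.  */
  atomic_uint init;   /* Guard for the payload.  */
};

/* Hypothetical resolver standing in for the symbol lookup.  */
static uintptr_t
demo_resolve_stub (void)
{
  return 0x1000;
}

static uintptr_t
demo_fixup (struct demo_slot *s, uintptr_t (*resolve) (void))
{
  assert (s != NULL && resolve != NULL);

  /* Fast path: the acquire load synchronizes with the release store
     below, so the payload is fully visible once we see DEMO_READY.  */
  if (atomic_load_explicit (&s->init, memory_order_acquire) == DEMO_READY)
    return s->addr;

  /* Slow path: compute the result without touching shared state.  */
  uintptr_t value = resolve ();

  /* Claim the slot.  Only one thread can move UNINIT -> BUSY.  */
  unsigned int expected = DEMO_UNINIT;
  if (atomic_compare_exchange_strong_explicit (&s->init, &expected,
                                               DEMO_BUSY,
                                               memory_order_relaxed,
                                               memory_order_relaxed))
    {
      /* Sole writer: publish the payload, then release the guard.  */
      s->addr = value;
      atomic_store_explicit (&s->init, DEMO_READY, memory_order_release);
    }

  /* Threads that lost the claim never touch the slot; they simply use
     the value they resolved themselves.  */
  return value;
}
```

A caller would use it as demo_fixup (&slot, demo_resolve_stub). The cost of
this design is that racing threads may each run the resolver once, which is
acceptable only if the resolver is idempotent.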

-- 
Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv3] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-10-18 19:25                         ` Carlos O'Donell
@ 2018-10-18 20:01                           ` Adhemerval Zanella
  2018-10-23  1:33                             ` Carlos O'Donell
  0 siblings, 1 reply; 40+ messages in thread
From: Adhemerval Zanella @ 2018-10-18 20:01 UTC (permalink / raw)
  To: Carlos O'Donell, Florian Weimer
  Cc: Tulio Magno Quites Machado Filho, libc-alpha, John David Anglin,
	Joseph Myers



On 18/10/2018 15:43, Carlos O'Donell wrote:
> On 10/18/18 2:21 PM, Adhemerval Zanella wrote:
>>
>>
>> On 18/10/2018 10:39, Carlos O'Donell wrote:
>>> On 10/18/18 3:24 AM, Florian Weimer wrote:
>>>> * Carlos O'Donell:
>>>>
>>>>> The use of the guard+fence-to-fence sync is, from a C11 perspective,
>>>>> correct.
>>>>
>>>> I really don't think this is true:
>>>>
>>>> | Two expression evaluations conflict if one of them modifies a memory
>>>> | location and the other one reads or modifies the same memory location.
>>>>
>>>> (C11 5.1.2.4p4)
>>>>
>>>> | The execution of a program contains a data race if it contains two
>>>> | conflicting actions in different threads, at least one of which is not
>>>> | atomic, and neither happens before the other. Any such data race
>>>> | results in undefined behavior.
>>>>
>>>> (C11 5.1.2.4p25)
>>>>
>>>> We still have unordered conflicting non-atomic writes after Tulio's
>>>> patch.  I don't think they matter to us.  But this is *not* correct for
>>>> C11.
>>>
>>> I agree completely. My point is that the change, the specific lines Tulio
>>> is touching, and the changes made, are correct: a fence-to-fence sync
>>> requires an atomic guard access. I agree it doesn't fix the actual problem
>>> of multiple threads doing the same updates to the reloc_result.
>>>
>>> glibc is *full* of data races, and that doesn't mean we will just give up
>>> on using C11 semantics until we can fix them all. Any changes we do, we
>>> should do them so they are correct.
>>>
>>> It really feels like we agree, but we're talking past each other.  Did my
>>> previous email clarify our positions and which one I choose and why?
>>>
>>> See:
>>> https://www.sourceware.org/ml/libc-alpha/2018-10/msg00320.html
>>>
>>> If I didn't understand your position correctly, please correct what I 
>>> wrote so I can understand your suggestion.
>>>
>>
>> Wouldn't just disabling lazy-resolution for LD_AUDIT be a simpler solution?
> 
> This is not the question I would ask myself in this case.
> 
> Consider that auditing is independent of the manner in which the application
> is deployed by the user (built with or without lazy binding).

I disagree: each user option we support incurs extra maintenance cost, and
in this case the possible combinations of current trampoline types and
arch-specific code further increase the burden of not only providing the
option, but also ensuring its correctness and testability.

> 
> Thus enabling auditing should have as little impact on the underlying
> application deployment as possible.
> 
> Forcing immediate binding for LD_AUDIT has an impact we cannot measure,
> because we aren't the user with the application.

I agree, but I constantly hear that lazy-binding might show performance
advantages without much data to actually back this up. Do we have actual
benchmarks and data that show it is still a relevant feature?

> 
> The point of these features is to allow for users to customize their choices
> to meet their application needs. It is not a one-size-fits-all.
> 
>> More and more distributions set bind-now as the default build option, and
>> auditing already implies some performance overhead (not to mention that the
>> lazy-resolution performance gain might not hold in real-world cases).
>  
> Distribution choices are different from user application choices.
> 
> Sometimes we make unilateral choices, but only if it's a clear win.
> 
> The most recent case was AArch64 TLSDESC, where Arm decided that TLSDESC
> would always be resolved non-lazily (Szabolcs will correct me if I'm wrong).
> This was a case where the synchronization required to update the TLSDESC
> was so costly on a per-function-call basis that it was clearly always a
> win to force TLSDESC to always be immediately bound, and drop the required
> synchronization (a cost you always had to pay).
> 
> Here the situation is less clear, and we have less data with which to make
> the choice. Selection of lazy vs. non-lazy is still a choice we give users
> and it is independent of auditing.
> 
> In summary:
> 
> - Selection of lazy vs non-lazy binding is presently an orthogonal user
>   choice from auditing.
> 
> - Distribution choices are about general solutions that work best for a
>   large number of users.
> 
> - Lastly, a one-size-fits-all solution doesn't work best for all users.
> 
> Unless there is a very strong and compelling reason to force non-lazy-binding
> for LD_AUDIT, I would not recommend we do it. It's just a question of user
> choice.

My point is that since we have limited resources, especially for
synchronization issues, which require an extra level of care, I think we
should prioritize better and reevaluate some past decisions. Some decisions
were made to handle a very specific issue in the past which might not be
relevant for current use cases, where the trade-off between performance,
usability, and maintainability might have changed.

We already had some lazy-bind issues in the past (BZ#19129, BZ#18034, BZ#726),
still have some (BZ#23296, BZ#23240, BZ#21349, BZ#20107), and might still have
others not recorded in bugzilla for less widely used options (ld audit,
ifunc, tlsdesc, etc.). These are just the ones I got from a very basic bugzilla
search; we might have more.

This leads me to ask whether lazy-bind is still worth all the required internal
complexity, and which real-world gains we are trying to obtain besides just the
option itself. I do agree that giving users more choices is a good thing, but we
need to balance usefulness, usability, and maintenance.

> 
> I also think that the new reloc_result.init field can now be used to
> implement a lockless algorithm to update the relocs without data races,
> but it would be "part 2" of fixing P&C for LD_AUDIT.
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv3] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-10-18 20:01                           ` Adhemerval Zanella
@ 2018-10-23  1:33                             ` Carlos O'Donell
  2018-10-23 14:11                               ` Adhemerval Zanella
  0 siblings, 1 reply; 40+ messages in thread
From: Carlos O'Donell @ 2018-10-23  1:33 UTC (permalink / raw)
  To: Adhemerval Zanella, Florian Weimer
  Cc: Tulio Magno Quites Machado Filho, libc-alpha, John David Anglin,
	Joseph Myers

On 10/18/18 3:40 PM, Adhemerval Zanella wrote:
> I disagree: each user option we support incurs extra maintenance cost, and
> in this case the possible combinations of current trampoline types and
> arch-specific code further increase the burden of not only providing the
> option, but also ensuring its correctness and testability.

I agree with you on this.

>> Thus enabling auditing should have as little impact on the underlying
>> application deployment as possible.
>>
>> Forcing immediate binding for LD_AUDIT has an impact we cannot measure,
>> because we aren't the user with the application.
> 
> I agree, but I constantly hear that lazy-binding might show performance
> advantages without much data to actually back this up. Do we have actual
> benchmarks and data that show it is still a relevant feature?

There are two issues at hand.

(1) Lazy-binding provides a hook for developer tooling.

(2) Lazy-binding speeds up application startup.

We have concrete evidence for (1), it's LD_AUDIT, and latrace/ltrace, and
a bunch of other smaller developer tooling.

There are even production systems using it like Spindle:
https://computation.llnl.gov/projects/spindle

Spindle has immediate examples of where all aspects of the dynamic loading
process are slowed down by large scientific workloads.

However, we don't have any good microbenchmarks to show the difference
between lazy and non-lazy. I should write some so we can have a concrete
discussion.

I see rented cloud environments as places where lazy-binding would help
reduce CPU usage costs.

I see distribution usage of BIND_NOW as a security measure that while
important is not always relevant to users running services inside their
own networks. Why pay the performance cost of security relevant features
if you don't need them?

>>
>> The point of these features is to allow for users to customize their choices
>> to meet their application needs. It is not a one-size-fits-all.
>>
>>> More and more distributions set bind-now as the default build option, and
>>> auditing already implies some performance overhead (not to mention that the
>>> lazy-resolution performance gain might not hold in real-world cases).
>>  
>> Distribution choices are different from user application choices.
>>
>> Sometimes we make unilateral choices, but only if it's a clear win.
>>
>> The most recent case was AArch64 TLSDESC, where Arm decided that TLSDESC
>> would always be resolved non-lazily (Szabolcs will correct me if I'm wrong).
>> This was a case where the synchronization required to update the TLSDESC
>> was so costly on a per-function-call basis that it was clearly always a
>> win to force TLSDESC to always be immediately bound, and drop the required
>> synchronization (a cost you always had to pay).
>>
>> Here the situation is less clear, and we have less data with which to make
>> the choice. Selection of lazy vs. non-lazy is still a choice we give users
>> and it is independent of auditing.
>>
>> In summary:
>>
>> - Selection of lazy vs non-lazy binding is presently an orthogonal user
>>   choice from auditing.
>>
>> - Distribution choices are about general solutions that work best for a
>>   large number of users.
>>
>> - Lastly, a one-size-fits-all solution doesn't work best for all users.
>>
>> Unless there is a very strong and compelling reason to force non-lazy-binding
>> for LD_AUDIT, I would not recommend we do it. It's just a question of user
>> choice.
> 
> My point is that since we have limited resources, especially for
> synchronization issues, which require an extra level of care, I think we
> should prioritize better and reevaluate some past decisions. Some decisions
> were made to handle a very specific issue in the past which might not be
> relevant for current use cases, where the trade-off between performance,
> usability, and maintainability might have changed.

Agreed. I think we need some benchmarks here to have a real discussion.

> We already had some lazy-bind issues in the past (BZ#19129, BZ#18034, BZ#726),
> still have some (BZ#23296, BZ#23240, BZ#21349, BZ#20107), and might still have
> others not recorded in bugzilla for less widely used options (ld audit,
> ifunc, tlsdesc, etc.). These are just the ones I got from a very basic bugzilla
> search; we might have more.

I agree, it is complicated by the fact that multiple threads resolve the symbols
at the same time.

> This leads me to ask whether lazy-bind is still worth all the required internal
> complexity, and which real-world gains we are trying to obtain besides just the
> option itself. I do agree that giving users more choices is a good thing, but we
> need to balance usefulness, usability, and maintenance.

I don't disagree, *but* if we are going to get rid of lazy-binding, something
we have supported for a long time, it's going to have to be with good evidence
to show our users that it really doesn't matter anymore.

I hope that makes my position clearer.

In summary:

- If we are going to make a change to remove lazy-binding it has to be in an
  informed manner with results from benchmarking that allow us to give
  evidence to our users.

-- 
Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv3] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-10-23  1:33                             ` Carlos O'Donell
@ 2018-10-23 14:11                               ` Adhemerval Zanella
  2018-10-23 15:56                                 ` Carlos O'Donell
  0 siblings, 1 reply; 40+ messages in thread
From: Adhemerval Zanella @ 2018-10-23 14:11 UTC (permalink / raw)
  To: Carlos O'Donell, Florian Weimer
  Cc: Tulio Magno Quites Machado Filho, libc-alpha, John David Anglin,
	Joseph Myers



On 22/10/2018 21:17, Carlos O'Donell wrote:
> On 10/18/18 3:40 PM, Adhemerval Zanella wrote:
>> I disagree: each user option we support incurs extra maintenance cost, and
>> in this case the possible combinations of current trampoline types and
>> arch-specific code further increase the burden of not only providing the
>> option, but also ensuring its correctness and testability.
> 
> I agree with you on this.
> 
>>> Thus enabling auditing should have as little impact on the underlying
>>> application deployment as possible.
>>>
>>> Forcing immediate binding for LD_AUDIT has an impact we cannot measure,
>>> because we aren't the user with the application.
>>
>> I agree, but I constantly hear that lazy-binding might show performance
>> advantages without much data to actually back this up. Do we have actual
>> benchmarks and data that show it is still a relevant feature?
> 
> There are two issues at hand.
> 
> (1) Lazy-binding provides a hook for developer tooling.
> 
> (2) Lazy-binding speeds up application startup.
> 
> We have concrete evidence for (1), it's LD_AUDIT, and latrace/ltrace, and
> a bunch of other smaller developer tooling.
> 
> There are even production systems using it like Spindle:
> https://computation.llnl.gov/projects/spindle
> 
> Spindle has immediate examples of where all aspects of the dynamic loading
> process are slowed down by large scientific workloads.

Correct me if I am wrong, but from the paper it seems it intercepts the
file operations and uses a shared caching mechanism to avoid duplicating
the loading time. My understanding is that the issue they are trying to solve
is not relocation runtime overhead, but rather parallel file system operations
when multiple processes load a bulk of shared libraries and python modules,
incurring I/O concurrency overhead.

Also, it says rtld-audit PLT interposition is in fact a performance issue,
so they had to make the spindle client handle the GOT setup itself. They do
seem to use the symbol binding to intercept open* calls to handle script
language loading.

I understand that they adapted rtld-audit to their needs; however, it
does not really require lazy-binding to intercept the library calls that
perform file operations (readdir for instance). Also I think it
would be feasible to call la_symbind* on first symbol resolution for
non-lazy mode.

> 
> However, we don't have any good microbenchmarks to show the difference
> between lazy and non-lazy. I should write some so we can have a concrete
> discussion.
> 
> I see rented cloud environments as places where lazy-binding would help
> reduce CPU usage costs.
> 
> I see distribution usage of BIND_NOW as a security measure that while
> important is not always relevant to users running services inside their
> own networks. Why pay the performance cost of security relevant features
> if you don't need them?

I do agree with you, but my point is 1. maybe the performance gains do not
really outweigh the code complexity and its maintainability costs and 2. the
other factor (security in this case) might be more cost effective.

> 
>>>
>>> The point of these features is to allow for users to customize their choices
>>> to meet their application needs. It is not a one-size-fits-all.
>>>
>>>> More and more distributions set bind-now as the default build option, and
>>>> auditing already implies some performance overhead (not to mention that the
>>>> lazy-resolution performance gain might not hold in real-world cases).
>>>  
>>> Distribution choices are different from user application choices.
>>>
>>> Sometimes we make unilateral choices, but only if it's a clear win.
>>>
>>> The most recent case was AArch64 TLSDESC, where Arm decided that TLSDESC
>>> would always be resolved non-lazily (Szabolcs will correct me if I'm wrong).
>>> This was a case where the synchronization required to update the TLSDESC
>>> was so costly on a per-function-call basis that it was clearly always a
>>> win to force TLSDESC to always be immediately bound, and drop the required
>>> synchronization (a cost you always had to pay).
>>>
>>> Here the situation is less clear, and we have less data with which to make
>>> the choice. Selection of lazy vs. non-lazy is still a choice we give users
>>> and it is independent of auditing.
>>>
>>> In summary:
>>>
>>> - Selection of lazy vs non-lazy binding is presently an orthogonal user
>>>   choice from auditing.
>>>
>>> - Distribution choices are about general solutions that work best for a
>>>   large number of users.
>>>
>>> - Lastly, a one-size-fits-all solution doesn't work best for all users.
>>>
>>> Unless there is a very strong and compelling reason to force non-lazy-binding
>>> for LD_AUDIT, I would not recommend we do it. It's just a question of user
>>> choice.
>>
>> My point is that since we have limited resources, especially for
>> synchronization issues, which require an extra level of care, I think we
>> should prioritize better and reevaluate some past decisions. Some decisions
>> were made to handle a very specific issue in the past which might not be
>> relevant for current use cases, where the trade-off between performance,
>> usability, and maintainability might have changed.
> 
> Agreed. I think we need some benchmarks here to have a real discussion.

Agreed.

> 
>> We already had some lazy-bind issues in the past (BZ#19129, BZ#18034, BZ#726),
>> still have some (BZ#23296, BZ#23240, BZ#21349, BZ#20107), and might still have
>> others not recorded in bugzilla for less widely used options (ld audit,
>> ifunc, tlsdesc, etc.). These are just the ones I got from a very basic bugzilla
>> search; we might have more.
> 
> I agree, it is complicated by the fact that multiple threads resolve the symbols
> at the same time.
> 
>> This leads me to ask whether lazy-bind is still worth all the required internal
>> complexity, and which real-world gains we are trying to obtain besides just the
>> option itself. I do agree that giving users more choices is a good thing, but we
>> need to balance usefulness, usability, and maintenance.
> 
> I don't disagree, *but* if we are going to get rid of lazy-binding, something
> we have supported for a long time, it's going to have to be with good evidence
> to show our users that it really doesn't matter anymore.
> 
> I hope that makes my position clearer.
> 
> In summary:
> 
> - If we are going to make a change to remove lazy-binding it has to be in an
>   informed manner with results from benchmarking that allow us to give
>   evidence to our users.
> 

Your position is clear, and to make mine clear as well: I just want to check
whether removing lazy-binding as the default might be an option. I also don't
want to block this patch, so I might move this discussion to another thread.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv3] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-10-23 14:11                               ` Adhemerval Zanella
@ 2018-10-23 15:56                                 ` Carlos O'Donell
  0 siblings, 0 replies; 40+ messages in thread
From: Carlos O'Donell @ 2018-10-23 15:56 UTC (permalink / raw)
  To: Adhemerval Zanella, Florian Weimer
  Cc: Tulio Magno Quites Machado Filho, libc-alpha, John David Anglin,
	Joseph Myers

On 10/23/18 10:08 AM, Adhemerval Zanella wrote:
> Your position is clear, and to make mine clear as well: I just want to check
> whether removing lazy-binding as the default might be an option. I also don't
> want to block this patch, so I might move this discussion to another thread.
 
Yes, removing lazy-binding is an option. We just need to evaluate all the
consequences of this and act to ensure certain features like LD_AUDIT
keep working.

-- 
Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCHv5] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-10-18  2:17             ` Carlos O'Donell
@ 2018-10-23 21:07               ` Tulio Magno Quites Machado Filho
  2018-11-07 21:54                 ` Carlos O'Donell
  0 siblings, 1 reply; 40+ messages in thread
From: Tulio Magno Quites Machado Filho @ 2018-10-23 21:07 UTC (permalink / raw)
  To: Carlos O'Donell, Florian Weimer, libc-alpha,
	John David Anglin, Adhemerval Zanella, Joseph Myers,
	Florian Weimer

Changes since v4:
 - Updated commit message.
 - Replaced memory fences with atomic load/store acquire/release.
 - Removed resultp from _dl_profile_fixup.
 - Improved source code comments.

Changes since v3:

 - Improved comments.
 - Started to use -Wl,-z,now.
 - Added field init to l_reloc_result to be used as a guard.

Changes since v2:

 - Fixed coding style in nptl/tst-audit-threads-mod1.c.
 - Replaced pthreads.h functions with respective support/xthread.h ones.
 - Replaced malloc() with xcalloc() in nptl/tst-audit-threads.c.
 - Removed bzero().
 - Reduced the number of functions to 7k in order to fit the relocation
   limit of some architectures, e.g. m68k, mips.
 - Fixed issues in nptl/Makefile.

Changes since v1:

 - Fixed the coding style issues.
 - Replaced atomic loads/stores with memory fences.
 - Added a test.

---- 8< ----

There is a data-dependency order between the fields of struct
l_reloc_result and the field used as its initialization guard.

The read of the initialization guard must complete before any read of
the rest of the fields of l_reloc_result.
Likewise, the writes to the other fields of the reloc_result must
complete before the initialization guard is updated.

The previous implementation used DL_FIXUP_VALUE_ADDR (l_reloc_result->addr)
as the initialization guard, making it impossible for some architectures,
i.e. hppa and ia64, to load and store it atomically due to its larger size.

This commit adds an unsigned int to l_reloc_result to be used as the new
initialization guard of the struct, making it possible to load and store
it atomically in all architectures.
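
The ordering described above can be sketched in standalone C11 as follows.
This is an illustration under assumptions rather than the code in the
patch: struct demo_result, demo_publish and demo_lookup are hypothetical
names, and the sketch uses <stdatomic.h> where glibc uses its internal
atomic_load_acquire and atomic_store_release macros. The writer fills in
the payload and then sets the guard with a release store; the reader checks
the guard with an acquire load before touching the payload.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct reloc_result; not the glibc layout.  */
struct demo_result
{
  uintptr_t addr;       /* Payload that may be too wide to load atomically.  */
  unsigned int flags;   /* More payload.  */
  atomic_uint init;     /* Guard: 0 = not initialized, 1 = initialized.  */
};

/* Writer side: all payload writes come first, then the release store
   makes them visible to any thread that later observes init == 1.  */
static void
demo_publish (struct demo_result *r, uintptr_t addr, unsigned int flags)
{
  assert (r != NULL);
  r->addr = addr;
  r->flags = flags;
  atomic_store_explicit (&r->init, 1, memory_order_release);
}

/* Reader side: returns 1 and copies the payload only if the guard is
   set.  The acquire load orders the payload reads after the guard read.  */
static int
demo_lookup (struct demo_result *r, uintptr_t *addr, unsigned int *flags)
{
  assert (r != NULL && addr != NULL && flags != NULL);
  if (atomic_load_explicit (&r->init, memory_order_acquire) == 0)
    return 0;
  *addr = r->addr;
  *flags = r->flags;
  return 1;
}
```

Because the guard is a plain unsigned int, the acquire and release accesses
are single-word operations. Note that this sketch, like the patch, does not
stop two threads that both observe a zero guard from writing the payload
concurrently.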

Tested with build-many-glibcs and on powerpc, powerpc64 and powerpc64le.

2018-10-23  Tulio Magno Quites Machado Filho  <tuliom@linux.ibm.com>

	[BZ #23690]
	* elf/dl-runtime.c (_dl_profile_fixup): Guarantee memory
	modification order when accessing reloc_result->addr.
	* include/link.h (reloc_result): Add field init.
	* nptl/Makefile (tests): Add tst-audit-threads.
	(modules-names): Add tst-audit-threads-mod1 and
	tst-audit-threads-mod2.
	Add rules to build tst-audit-threads.
	* nptl/tst-audit-threads-mod1.c: New file.
	* nptl/tst-audit-threads-mod2.c: Likewise.
	* nptl/tst-audit-threads.c: Likewise.
	* nptl/tst-audit-threads.h: Likewise.

Signed-off-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
---
 elf/dl-runtime.c              | 45 +++++++++++++++++---
 include/link.h                |  4 ++
 nptl/Makefile                 | 14 ++++++-
 nptl/tst-audit-threads-mod1.c | 74 +++++++++++++++++++++++++++++++++
 nptl/tst-audit-threads-mod2.c | 22 ++++++++++
 nptl/tst-audit-threads.c      | 97 +++++++++++++++++++++++++++++++++++++++++++
 nptl/tst-audit-threads.h      | 89 +++++++++++++++++++++++++++++++++++++++
 7 files changed, 338 insertions(+), 7 deletions(-)
 create mode 100644 nptl/tst-audit-threads-mod1.c
 create mode 100644 nptl/tst-audit-threads-mod2.c
 create mode 100644 nptl/tst-audit-threads.c
 create mode 100644 nptl/tst-audit-threads.h

diff --git a/elf/dl-runtime.c b/elf/dl-runtime.c
index 63bbc89776..757a865092 100644
--- a/elf/dl-runtime.c
+++ b/elf/dl-runtime.c
@@ -183,10 +183,36 @@ _dl_profile_fixup (
   /* This is the address in the array where we store the result of previous
      relocations.  */
   struct reloc_result *reloc_result = &l->l_reloc_result[reloc_index];
-  DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
 
-  DL_FIXUP_VALUE_TYPE value = *resultp;
-  if (DL_FIXUP_VALUE_CODE_ADDR (value) == 0)
+ /* CONCURRENCY NOTES:
+
+  Multiple threads may be calling the same PLT sequence and with
+  LD_AUDIT enabled they will be calling into _dl_profile_fixup to
+  update the reloc_result with the result of the lazy resolution.
+  The reloc_result guard variable is init, and we use acquire loads
+  and release stores on it to ensure that the contents of the
+  structure are consistent with the loaded value of the guard.
+  This does not fix all of the data races that occur when two or more
+  threads read reloc_result->init with a value of zero and read
+  and write to that reloc_result concurrently.  The expectation is
+  generally that while this is a data race it works because the
+  threads write the same values.  Until the data races are fixed
+  there is a potential for problems to arise from these data races.
+  The reloc result updates should happen in parallel but there should
+  be an atomic RMW which does the final update to the real result
+  entry (see bug 23790).
+
+  The following code uses reloc_result->init to record whether this
+  object has been relocated: 0 means this is the first time, 1 means
+  the object has already been relocated.
+
+  The store to reloc_result->init must not become visible before the
+  earlier writes to reloc_result, and the other fields must not be read
+  before init is loaded, or a thread could see an incomplete struct.  */
+  DL_FIXUP_VALUE_TYPE value;
+  unsigned int init = atomic_load_acquire (&reloc_result->init);
+
+  if (init == 0)
     {
       /* This is the first time we have to relocate this object.  */
       const ElfW(Sym) *const symtab
@@ -346,16 +372,25 @@ _dl_profile_fixup (
 
       /* Store the result for later runs.  */
       if (__glibc_likely (! GLRO(dl_bind_not)))
-	*resultp = value;
+	{
+	  reloc_result->addr = value;
+	  /* Guarantee all previous writes complete before
+	     init is updated.  See CONCURRENCY NOTES earlier.  */
+	  atomic_store_release (&reloc_result->init, 1);
+	}
+      init = 1;
     }
+  else
+    value = reloc_result->addr;
 
   /* By default we do not call the pltexit function.  */
   long int framesize = -1;
 
+
 #ifdef SHARED
   /* Auditing checkpoint: report the PLT entering and allow the
      auditors to change the value.  */
-  if (DL_FIXUP_VALUE_CODE_ADDR (value) != 0 && GLRO(dl_naudit) > 0
+  if (init != 0 && GLRO(dl_naudit) > 0
       /* Don't do anything if no auditor wants to intercept this call.  */
       && (reloc_result->enterexit & LA_SYMB_NOPLTENTER) == 0)
     {
diff --git a/include/link.h b/include/link.h
index 5924594548..83b1c34b7b 100644
--- a/include/link.h
+++ b/include/link.h
@@ -216,6 +216,10 @@ struct link_map
       unsigned int boundndx;
       uint32_t enterexit;
       unsigned int flags;
+      /* CONCURRENCY NOTE: This is used to guard the concurrent initialization
+	 of the relocation result across multiple threads.  See the more
+	 detailed notes in elf/dl-runtime.c.  */
+      unsigned int init;
     } *l_reloc_result;
 
     /* Pointer to the version information if available.  */
diff --git a/nptl/Makefile b/nptl/Makefile
index 49b6faa330..ee720960d1 100644
--- a/nptl/Makefile
+++ b/nptl/Makefile
@@ -382,7 +382,8 @@ tests += tst-cancelx2 tst-cancelx3 tst-cancelx4 tst-cancelx5 \
 	 tst-cleanupx0 tst-cleanupx1 tst-cleanupx2 tst-cleanupx3 tst-cleanupx4 \
 	 tst-oncex3 tst-oncex4
 ifeq ($(build-shared),yes)
-tests += tst-atfork2 tst-tls4 tst-_res1 tst-fini1 tst-compat-forwarder
+tests += tst-atfork2 tst-tls4 tst-_res1 tst-fini1 tst-compat-forwarder \
+	 tst-audit-threads
 tests-internal += tst-tls3 tst-tls3-malloc tst-tls5 tst-stackguard1
 tests-nolibpthread += tst-fini1
 ifeq ($(have-z-execstack),yes)
@@ -394,7 +395,8 @@ modules-names = tst-atfork2mod tst-tls3mod tst-tls4moda tst-tls4modb \
 		tst-tls5mod tst-tls5moda tst-tls5modb tst-tls5modc \
 		tst-tls5modd tst-tls5mode tst-tls5modf tst-stack4mod \
 		tst-_res1mod1 tst-_res1mod2 tst-execstack-mod tst-fini1mod \
-		tst-join7mod tst-compat-forwarder-mod
+		tst-join7mod tst-compat-forwarder-mod tst-audit-threads-mod1 \
+		tst-audit-threads-mod2
 extra-test-objs += $(addsuffix .os,$(strip $(modules-names))) \
 		   tst-cleanup4aux.o tst-cleanupx4aux.o
 test-extras += tst-cleanup4aux tst-cleanupx4aux
@@ -711,6 +713,14 @@ $(objpfx)tst-compat-forwarder: $(objpfx)tst-compat-forwarder-mod.so
 
 tst-mutex10-ENV = GLIBC_TUNABLES=glibc.elision.enable=1
 
+# Protect against a build using -Wl,-z,now.
+LDFLAGS-tst-audit-threads-mod1.so = -Wl,-z,lazy
+LDFLAGS-tst-audit-threads-mod2.so = -Wl,-z,lazy
+LDFLAGS-tst-audit-threads = -Wl,-z,lazy
+$(objpfx)tst-audit-threads: $(objpfx)tst-audit-threads-mod2.so
+$(objpfx)tst-audit-threads.out: $(objpfx)tst-audit-threads-mod1.so
+tst-audit-threads-ENV = LD_AUDIT=$(objpfx)tst-audit-threads-mod1.so
+
 # The tests here better do not run in parallel
 ifneq ($(filter %tests,$(MAKECMDGOALS)),)
 .NOTPARALLEL:
diff --git a/nptl/tst-audit-threads-mod1.c b/nptl/tst-audit-threads-mod1.c
new file mode 100644
index 0000000000..615d5ee512
--- /dev/null
+++ b/nptl/tst-audit-threads-mod1.c
@@ -0,0 +1,74 @@
+/* Dummy audit library for tst-audit-threads.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <elf.h>
+#include <link.h>
+#include <stdio.h>
+#include <assert.h>
+#include <string.h>
+
+/* We must use a dummy LD_AUDIT module to force the dynamic loader to
+   *not* update the real PLT, and instead use a cached value for the
+   lazy resolution result.  It is the update of that cached value that
+   we are testing for correctness by doing this.  */
+
+/* Library to be audited.  */
+#define LIB "tst-audit-threads-mod2.so"
+/* CALLNUM is the highest index used by the retNum functions.  */
+#define CALLNUM 7999
+
+#define CONCATX(a, b) __CONCAT (a, b)
+
+static int previous = 0;
+
+unsigned int
+la_version (unsigned int ver)
+{
+  return 1;
+}
+
+unsigned int
+la_objopen (struct link_map *map, Lmid_t lmid, uintptr_t *cookie)
+{
+  return LA_FLG_BINDTO | LA_FLG_BINDFROM;
+}
+
+uintptr_t
+CONCATX(la_symbind, __ELF_NATIVE_CLASS) (ElfW(Sym) *sym,
+					unsigned int ndx,
+					uintptr_t *refcook,
+					uintptr_t *defcook,
+					unsigned int *flags,
+					const char *symname)
+{
+  const char * retnum = "retNum";
+  char * num = strstr (symname, retnum);
+  int n;
+  /* Validate if the symbols are getting called in the correct order.
+     This code is here to verify binutils does not optimize out the PLT
+     entries that require the symbol binding.  */
+  if (num != NULL)
+    {
+      n = atoi (num);
+      assert (n >= previous);
+      assert (n <= CALLNUM);
+      previous = n;
+    }
+  return sym->st_value;
+}
diff --git a/nptl/tst-audit-threads-mod2.c b/nptl/tst-audit-threads-mod2.c
new file mode 100644
index 0000000000..f9817dd3dc
--- /dev/null
+++ b/nptl/tst-audit-threads-mod2.c
@@ -0,0 +1,22 @@
+/* Shared object with a huge number of functions for tst-audit-threads.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* Define all the retNumN functions in a library.  */
+#define definenum
+#include "tst-audit-threads.h"
diff --git a/nptl/tst-audit-threads.c b/nptl/tst-audit-threads.c
new file mode 100644
index 0000000000..e4bf433bd8
--- /dev/null
+++ b/nptl/tst-audit-threads.c
@@ -0,0 +1,97 @@
+/* Test multi-threading using LD_AUDIT.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* This test uses a dummy LD_AUDIT library (tst-audit-threads-mod1) and a
+   library with a huge number of functions in order to validate lazy symbol
+   binding with an audit library.  We use one thread per CPU to test that
+   concurrent lazy resolution does not have any defects which would cause
+   the process to fail.  We use an LD_AUDIT library to force the testing of
+   the relocation resolution caching code in the dynamic loader i.e.
+   _dl_runtime_profile and _dl_profile_fixup.  */
+
+#include <support/xthread.h>
+#include <strings.h>
+#include <stdlib.h>
+#include <sys/sysinfo.h>
+
+static int do_test (void);
+
+/* This test usually takes less than 3s to run.  However, there are cases that
+   take up to 30s.  */
+#define TIMEOUT 60
+#define TEST_FUNCTION do_test ()
+#include "../test-skeleton.c"
+
+/* Declare the functions we are going to call.  */
+#define externnum
+#include "tst-audit-threads.h"
+#undef externnum
+
+int num_threads;
+pthread_barrier_t barrier;
+
+void
+sync_all (int num)
+{
+  pthread_barrier_wait (&barrier);
+}
+
+void
+call_all_ret_nums (void)
+{
+  /* Call each function one at a time from all threads.  */
+#define callnum
+#include "tst-audit-threads.h"
+#undef callnum
+}
+
+void *
+thread_main (void *unused)
+{
+  call_all_ret_nums ();
+  return NULL;
+}
+
+#define STR2(X) #X
+#define STR(X) STR2(X)
+
+static int
+do_test (void)
+{
+  int i;
+  pthread_t *threads;
+
+  num_threads = get_nprocs ();
+  if (num_threads <= 1)
+    num_threads = 2;
+
+  /* Used to synchronize all the threads after calling each retNumN.  */
+  xpthread_barrier_init (&barrier, NULL, num_threads);
+
+  threads = (pthread_t *) xcalloc (num_threads, sizeof (pthread_t));
+  for (i = 0; i < num_threads; i++)
+    threads[i] = xpthread_create (NULL, thread_main, NULL);
+
+  for (i = 0; i < num_threads; i++)
+    xpthread_join (threads[i]);
+
+  free (threads);
+
+  return 0;
+}
diff --git a/nptl/tst-audit-threads.h b/nptl/tst-audit-threads.h
new file mode 100644
index 0000000000..491d0dcbf0
--- /dev/null
+++ b/nptl/tst-audit-threads.h
@@ -0,0 +1,89 @@
+/* Helper header for tst-audit-threads.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* We use this helper to create a large number of functions, all of
+   which will be resolved lazily and thus have their PLT updated.
+   This is done to provide enough functions that we can statistically
+   observe a thread vs. PLT resolution failure if one exists.  */
+
+#define CONCAT(a, b) a ## b
+#define NUM(x, y) CONCAT (x, y)
+
+#define FUNC10(x)	\
+  FUNC (NUM (x, 0));	\
+  FUNC (NUM (x, 1));	\
+  FUNC (NUM (x, 2));	\
+  FUNC (NUM (x, 3));	\
+  FUNC (NUM (x, 4));	\
+  FUNC (NUM (x, 5));	\
+  FUNC (NUM (x, 6));	\
+  FUNC (NUM (x, 7));	\
+  FUNC (NUM (x, 8));	\
+  FUNC (NUM (x, 9))
+
+#define FUNC100(x)	\
+  FUNC10 (NUM (x, 0));	\
+  FUNC10 (NUM (x, 1));	\
+  FUNC10 (NUM (x, 2));	\
+  FUNC10 (NUM (x, 3));	\
+  FUNC10 (NUM (x, 4));	\
+  FUNC10 (NUM (x, 5));	\
+  FUNC10 (NUM (x, 6));	\
+  FUNC10 (NUM (x, 7));	\
+  FUNC10 (NUM (x, 8));	\
+  FUNC10 (NUM (x, 9))
+
+#define FUNC1000(x)		\
+  FUNC100 (NUM (x, 0));		\
+  FUNC100 (NUM (x, 1));		\
+  FUNC100 (NUM (x, 2));		\
+  FUNC100 (NUM (x, 3));		\
+  FUNC100 (NUM (x, 4));		\
+  FUNC100 (NUM (x, 5));		\
+  FUNC100 (NUM (x, 6));		\
+  FUNC100 (NUM (x, 7));		\
+  FUNC100 (NUM (x, 8));		\
+  FUNC100 (NUM (x, 9))
+
+#define FUNC7000()	\
+  FUNC1000 (1);		\
+  FUNC1000 (2);		\
+  FUNC1000 (3);		\
+  FUNC1000 (4);		\
+  FUNC1000 (5);		\
+  FUNC1000 (6);		\
+  FUNC1000 (7);
+
+#ifdef FUNC
+# undef FUNC
+#endif
+
+#ifdef externnum
+# define FUNC(x) extern int CONCAT (retNum, x) (void)
+#endif
+
+#ifdef definenum
+# define FUNC(x) int CONCAT (retNum, x) (void) { return x; }
+#endif
+
+#ifdef callnum
+# define FUNC(x) CONCAT (retNum, x) (); sync_all (x)
+#endif
+
+FUNC7000 ();
-- 
2.14.4


* Re: [PATCHv5] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-10-23 21:07               ` [PATCHv5] " Tulio Magno Quites Machado Filho
@ 2018-11-07 21:54                 ` Carlos O'Donell
  2018-11-22 18:21                   ` Tulio Magno Quites Machado Filho
  0 siblings, 1 reply; 40+ messages in thread
From: Carlos O'Donell @ 2018-11-07 21:54 UTC (permalink / raw)
  To: Tulio Magno Quites Machado Filho, Florian Weimer, libc-alpha,
	John David Anglin, Adhemerval Zanella, Joseph Myers,
	Florian Weimer

On 10/23/18 4:52 PM, Tulio Magno Quites Machado Filho wrote:
> Changes since v4:
>  - Updated commit message.
>  - Replace memory fences with atomic load/store acquire/release.
>  - Removed resultp from _dl_profile_fixup.
>  - Improved source code comments.

Please post and commit v6 if you:
* Accept my comment in tst-audit-threads.h about arbitrary values.
* Accept my suggested fix for running the auditor (switch back to checking value).

Reviewed-by: Carlos O'Donell

> Changes since v3:
> 
>  - Improved comments.
>  - Started to use -Wl,-z,lazy.
>  - Added field init to l_reloc_result to be used as a guard.
> 
> Changes since v2:
> 
>  - Fixed coding style in nptl/tst-audit-threads-mod1.c.
>  - Replaced pthreads.h functions with respective support/xthread.h ones.
>  - Replaced malloc() with xcalloc() in nptl/tst-audit-threads.c.
>  - Removed bzero().
>  - Reduced the number of functions to 7k in order to fit the relocation
>    limit of some architectures, e.g. m68k, mips.
>  - Fixed issues in nptl/Makefile.
> 
> Changes since v1:
> 
>  - Fixed the coding style issues.
>  - Replaced atomic loads/stores with memory fences.
>  - Added a test.
> 
> ---- 8< ----
> 
> There is a data-dependency order between the fields of struct
> l_reloc_result and the field used as its initialization guard.

> Reading from the initialization guard requires to complete before
> reading the rest of the fields of l_reloc_result.
> Likewise, the writes to the other fields of the reloc_result must
> complete before the initialization guard is updated.
 
> The previous implementation used DL_FIXUP_VALUE_ADDR (l_reloc_result->addr)
> as the initialization guard, making it impossible for some architectures
> to load and store it atomically, i.e. hppa and ia64, due to its larger size.
> 
> This commit adds an unsigned int to l_reloc_result to be used as the new
> initialization guard of the struct, making it possible to load and store
> it atomically in all architectures.
> 
> Tested with build-many-glibcs and on powerpc, powerpc64 and powerpc64le.

This is my suggested wording. It doesn't block commit, but some of what 
you write needs clarification.

Suggested commit message:

Fix _dl_profile_fixup data-dependency issue (Bug 23690)

There is a data-dependency between the fields of struct l_reloc_result
and the field used as the initialization guard. Users of the guard
expect writes to the structure to be observable when they also observe
the guard initialized. The solution for this problem is to use an acquire
load and a release store on the guard to ensure previous writes to the
structure are observable if the guard is initialized.

The previous implementation used DL_FIXUP_VALUE_ADDR (l_reloc_result->addr)
as the initialization guard, making it impossible for some architectures
to load and store it atomically, i.e. hppa and ia64, due to its larger size.

This commit adds an unsigned int to l_reloc_result to be used as the new
initialization guard of the struct, making it possible to load and store
it atomically in all architectures. The fix ensures that the values
observed in l_reloc_result are consistent and do not lead to crashes.
The algorithm is documented in the code in elf/dl-runtime.c 
(_dl_profile_fixup). Not all data races have been eliminated.

Tested with build-many-glibcs and on powerpc, powerpc64, and powerpc64le.
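
For readers following the thread, the acquire/release guard described in
the suggested message can be sketched in portable C11.  This is only an
illustration under stated assumptions: struct cache_entry, cache_publish
and cache_lookup are hypothetical names, and <stdatomic.h> stands in for
glibc's internal atomic_load_acquire and atomic_store_release macros.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache entry, loosely modeled on struct reloc_result.  */
struct cache_entry
{
  uintptr_t addr;       /* Payload, written before the guard.  */
  unsigned int bound;   /* More payload.  */
  atomic_uint init;     /* Guard: 0 = not published, 1 = published.  */
};

/* Writer: fill in the payload first, then publish it with a release
   store so the payload writes are visible to any thread that observes
   init == 1 through an acquire load.  */
static void
cache_publish (struct cache_entry *e, uintptr_t addr, unsigned int bound)
{
  assert (e != NULL);
  e->addr = addr;
  e->bound = bound;
  atomic_store_explicit (&e->init, 1, memory_order_release);
}

/* Reader: acquire-load the guard.  Only when it is set is the payload
   read.  Returns 1 and stores the address in *ADDR when the entry has
   been published, 0 otherwise.  */
static int
cache_lookup (struct cache_entry *e, uintptr_t *addr)
{
  assert (e != NULL && addr != NULL);
  if (atomic_load_explicit (&e->init, memory_order_acquire) == 0)
    return 0;
  *addr = e->addr;
  return 1;
}
```

The guard is a plain unsigned integer, so it can be loaded and stored
atomically even where the payload cannot.  As in the patch, two threads
that both see the guard at 0 can still write the payload concurrently;
the sketch does not address that.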
 
> 2018-10-23  Tulio Magno Quites Machado Filho  <tuliom@linux.ibm.com>
> 
> 	[BZ #23690]
> 	* elf/dl-runtime.c (_dl_profile_fixup): Guarantee memory
> 	modification order when accessing reloc_result->addr.
> 	* include/link.h (reloc_result): Add field init.
> 	* nptl/Makefile (tests): Add tst-audit-threads.
> 	(modules-names): Add tst-audit-threads-mod1 and
> 	tst-audit-threads-mod2.
> 	Add rules to build tst-audit-threads.
> 	* nptl/tst-audit-threads-mod1.c: New file.
> 	* nptl/tst-audit-threads-mod2.c: Likewise.
> 	* nptl/tst-audit-threads.c: Likewise.
> 	* nptl/tst-audit-threads.h: Likewise.
> 
> Signed-off-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
> ---
>  elf/dl-runtime.c              | 45 +++++++++++++++++---
>  include/link.h                |  4 ++
>  nptl/Makefile                 | 14 ++++++-
>  nptl/tst-audit-threads-mod1.c | 74 +++++++++++++++++++++++++++++++++
>  nptl/tst-audit-threads-mod2.c | 22 ++++++++++
>  nptl/tst-audit-threads.c      | 97 +++++++++++++++++++++++++++++++++++++++++++
>  nptl/tst-audit-threads.h      | 89 +++++++++++++++++++++++++++++++++++++++
>  7 files changed, 338 insertions(+), 7 deletions(-)
>  create mode 100644 nptl/tst-audit-threads-mod1.c
>  create mode 100644 nptl/tst-audit-threads-mod2.c
>  create mode 100644 nptl/tst-audit-threads.c
>  create mode 100644 nptl/tst-audit-threads.h
> 
> diff --git a/elf/dl-runtime.c b/elf/dl-runtime.c
> index 63bbc89776..757a865092 100644
> --- a/elf/dl-runtime.c
> +++ b/elf/dl-runtime.c
> @@ -183,10 +183,36 @@ _dl_profile_fixup (
>    /* This is the address in the array where we store the result of previous
>       relocations.  */
>    struct reloc_result *reloc_result = &l->l_reloc_result[reloc_index];
> -  DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
>  
> -  DL_FIXUP_VALUE_TYPE value = *resultp;
> -  if (DL_FIXUP_VALUE_CODE_ADDR (value) == 0)
> + /* CONCURRENCY NOTES:
> +
> +  Multiple threads may be calling the same PLT sequence and with
> +  LD_AUDIT enabled they will be calling into _dl_profile_fixup to
> +  update the reloc_result with the result of the lazy resolution.
> +  The reloc_result guard variable is init, and we use acquire loads
> +  and release stores on it to ensure that the contents of the
> +  structure are consistent with the loaded value of the guard.
> +  This does not fix all of the data races that occur when two or more
> +  threads read reloc_result->init with a value of zero and read
> +  and write to that reloc_result concurrently.  The expectation is
> +  generally that while this is a data race it works because the
> +  threads write the same values.  Until the data races are fixed
> +  there is a potential for problems to arise from these data races.
> +  The reloc result updates should happen in parallel but there should
> +  be an atomic RMW which does the final update to the real result
> +  entry (see bug 23790).
> +
> +  The following code uses reloc_result->init to record whether this
> +  object has been relocated: 0 means this is the first time, 1 means
> +  the object has already been relocated.
> +
> +  The store to reloc_result->init must not become visible before the
> +  earlier writes to reloc_result, and the other fields must not be read
> +  before init is loaded, or a thread could see an incomplete struct.  */

OK.


> +  DL_FIXUP_VALUE_TYPE value;
> +  unsigned int init = atomic_load_acquire (&reloc_result->init);

OK.

> +
> +  if (init == 0)
>      {
>        /* This is the first time we have to relocate this object.  */
>        const ElfW(Sym) *const symtab
> @@ -346,16 +372,25 @@ _dl_profile_fixup (
>  
>        /* Store the result for later runs.  */
>        if (__glibc_likely (! GLRO(dl_bind_not)))
> -	*resultp = value;
> +	{
> +	  reloc_result->addr = value;
> +	  /* Guarantee all previous writes complete before
> +	     init is updated.  See CONCURRENCY NOTES earlier.  */
> +	  atomic_store_release (&reloc_result->init, 1);

OK.

> +	}
> +      init = 1;


OK, and now set init to indicate locally that initialization is complete.

>      }
> +  else
> +    value = reloc_result->addr;

OK, already initialized, just load addr.

>  
>    /* By default we do not call the pltexit function.  */
>    long int framesize = -1;
>  
> +
>  #ifdef SHARED
>    /* Auditing checkpoint: report the PLT entering and allow the
>       auditors to change the value.  */
> -  if (DL_FIXUP_VALUE_CODE_ADDR (value) != 0 && GLRO(dl_naudit) > 0
> +  if (init != 0 && GLRO(dl_naudit) > 0

This bit worries me; most of my review time went into thinking it
through and reviewing the surrounding code.

Isn't 'init != 0' always going to be true?

Up above if it's 0 then we do the initialization, and set it to 1.

Otherwise it's non-zero and we load value.

In both cases it's non-zero by the time we reach here.

The previous check had some interesting side-effects in that if value
was relocated to a NULL value, we would skip running the auditor.

The test here is probably not about the initialization guard, but
rather about running the auditor only if value is non-NULL.

I think this needs restoring to 'DL_FIXUP_VALUE_CODE_ADDR (value) != 0'.

What do you think?
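
To make the concern concrete, here is a small single-threaded model of
the v5 control flow.  It is only a sketch: struct model_result, struct
audit_decision and model_fixup are hypothetical names, a plain integer
stands in for DL_FIXUP_VALUE_TYPE, and the atomics, the dl_bind_not
check and the symbol lookup itself are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the cached relocation result; only the two
   fields that matter for the condition are modeled.  */
struct model_result
{
  uintptr_t addr;     /* Resolved code address, 0 if bound to NULL.  */
  unsigned int init;  /* Guard: 0 before the first resolution.  */
};

/* Outcome of the two candidate conditions for running the auditor.  */
struct audit_decision
{
  bool by_init;   /* v5 condition: init != 0.  */
  bool by_value;  /* Previous condition: value != 0.  */
};

/* Follow the shape of the v5 code: resolve on first use, otherwise load
   the cached address.  RESOLVED is what the omitted symbol lookup would
   have produced.  */
static struct audit_decision
model_fixup (struct model_result *r, uintptr_t resolved)
{
  uintptr_t value;
  unsigned int init = r->init;

  if (init == 0)
    {
      value = resolved;
      r->addr = value;
      r->init = 1;
      init = 1;
    }
  else
    value = r->addr;

  /* Both paths leave init non-zero, so the v5 condition cannot fail.  */
  assert (init != 0);
  return (struct audit_decision) { init != 0, value != 0 };
}
```

For a symbol that resolves to address 0, by_init is true while by_value
is false, on the first call and on every later one.  That is the
behavior change described above: the auditor would now run for a NULL
value where it was previously skipped.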

>        /* Don't do anything if no auditor wants to intercept this call.  */
>        && (reloc_result->enterexit & LA_SYMB_NOPLTENTER) == 0)
>      {
> diff --git a/include/link.h b/include/link.h
> index 5924594548..83b1c34b7b 100644
> --- a/include/link.h
> +++ b/include/link.h
> @@ -216,6 +216,10 @@ struct link_map
>        unsigned int boundndx;
>        uint32_t enterexit;
>        unsigned int flags;
> +      /* CONCURRENCY NOTE: This is used to guard the concurrent initialization
> +	 of the relocation result across multiple threads.  See the more
> +	 detailed notes in elf/dl-runtime.c.  */
> +      unsigned int init;

OK.

>      } *l_reloc_result;
>  
>      /* Pointer to the version information if available.  */
> diff --git a/nptl/Makefile b/nptl/Makefile
> index 49b6faa330..ee720960d1 100644
> --- a/nptl/Makefile
> +++ b/nptl/Makefile
> @@ -382,7 +382,8 @@ tests += tst-cancelx2 tst-cancelx3 tst-cancelx4 tst-cancelx5 \
>  	 tst-cleanupx0 tst-cleanupx1 tst-cleanupx2 tst-cleanupx3 tst-cleanupx4 \
>  	 tst-oncex3 tst-oncex4
>  ifeq ($(build-shared),yes)
> -tests += tst-atfork2 tst-tls4 tst-_res1 tst-fini1 tst-compat-forwarder
> +tests += tst-atfork2 tst-tls4 tst-_res1 tst-fini1 tst-compat-forwarder \
> +	 tst-audit-threads

OK.

>  tests-internal += tst-tls3 tst-tls3-malloc tst-tls5 tst-stackguard1
>  tests-nolibpthread += tst-fini1
>  ifeq ($(have-z-execstack),yes)
> @@ -394,7 +395,8 @@ modules-names = tst-atfork2mod tst-tls3mod tst-tls4moda tst-tls4modb \
>  		tst-tls5mod tst-tls5moda tst-tls5modb tst-tls5modc \
>  		tst-tls5modd tst-tls5mode tst-tls5modf tst-stack4mod \
>  		tst-_res1mod1 tst-_res1mod2 tst-execstack-mod tst-fini1mod \
> -		tst-join7mod tst-compat-forwarder-mod
> +		tst-join7mod tst-compat-forwarder-mod tst-audit-threads-mod1 \
> +		tst-audit-threads-mod2

OK.

>  extra-test-objs += $(addsuffix .os,$(strip $(modules-names))) \
>  		   tst-cleanup4aux.o tst-cleanupx4aux.o
>  test-extras += tst-cleanup4aux tst-cleanupx4aux
> @@ -711,6 +713,14 @@ $(objpfx)tst-compat-forwarder: $(objpfx)tst-compat-forwarder-mod.so
>  
>  tst-mutex10-ENV = GLIBC_TUNABLES=glibc.elision.enable=1
>  
> +# Protect against a build using -Wl,-z,now.
> +LDFLAGS-tst-audit-threads-mod1.so = -Wl,-z,lazy
> +LDFLAGS-tst-audit-threads-mod2.so = -Wl,-z,lazy
> +LDFLAGS-tst-audit-threads = -Wl,-z,lazy
> +$(objpfx)tst-audit-threads: $(objpfx)tst-audit-threads-mod2.so
> +$(objpfx)tst-audit-threads.out: $(objpfx)tst-audit-threads-mod1.so
> +tst-audit-threads-ENV = LD_AUDIT=$(objpfx)tst-audit-threads-mod1.so
> +

OK.

>  # The tests here better do not run in parallel
>  ifneq ($(filter %tests,$(MAKECMDGOALS)),)
>  .NOTPARALLEL:
> diff --git a/nptl/tst-audit-threads-mod1.c b/nptl/tst-audit-threads-mod1.c
> new file mode 100644
> index 0000000000..615d5ee512
> --- /dev/null
> +++ b/nptl/tst-audit-threads-mod1.c
> @@ -0,0 +1,74 @@
> +/* Dummy audit library for tst-audit-threads.
> +
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include <elf.h>
> +#include <link.h>
> +#include <stdio.h>
> +#include <assert.h>
> +#include <string.h>
> +
> +/* We must use a dummy LD_AUDIT module to force the dynamic loader to
> +   *not* update the real PLT, and instead use a cached value for the
> +   lazy resolution result.  It is the update of that cached value that
> +   we are testing for correctness by doing this.  */

OK.

> +
> +/* Library to be audited.  */
> +#define LIB "tst-audit-threads-mod2.so"
> +/* CALLNUM is the highest index used by the retNum functions.  */
> +#define CALLNUM 7999
> +
> +#define CONCATX(a, b) __CONCAT (a, b)
> +
> +static int previous = 0;
> +
> +unsigned int
> +la_version (unsigned int ver)
> +{
> +  return 1;
> +}
> +
> +unsigned int
> +la_objopen (struct link_map *map, Lmid_t lmid, uintptr_t *cookie)
> +{
> +  return LA_FLG_BINDTO | LA_FLG_BINDFROM;
> +}
> +
> +uintptr_t
> +CONCATX(la_symbind, __ELF_NATIVE_CLASS) (ElfW(Sym) *sym,
> +					unsigned int ndx,
> +					uintptr_t *refcook,
> +					uintptr_t *defcook,
> +					unsigned int *flags,
> +					const char *symname)
> +{
> +  const char * retnum = "retNum";
> +  char * num = strstr (symname, retnum);
> +  int n;
> +  /* Validate if the symbols are getting called in the correct order.
> +     This code is here to verify binutils does not optimize out the PLT
> +     entries that require the symbol binding.  */

OK.

> +  if (num != NULL)
> +    {
> +      n = atoi (num);
> +      assert (n >= previous);
> +      assert (n <= CALLNUM);

OK. Great job, I like the bounded assert here.
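
One detail that may be worth double-checking in this hunk: strstr
returns a pointer to the start of the match, so num points at the 'r'
of "retNum...", and atoi stops at the first non-digit.  If the intent is
to compare the numeric suffix, the prefix has to be skipped first.  A
minimal sketch, where retnum_index is a hypothetical helper and not part
of the patch:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return the numeric suffix of a symbol name such as "retNum1234", or
   -1 if the name does not contain the "retNum" prefix.  */
static int
retnum_index (const char *symname)
{
  static const char prefix[] = "retNum";
  assert (symname != NULL);
  const char *p = strstr (symname, prefix);
  if (p == NULL)
    return -1;
  /* Skip the prefix before converting: atoi on "retNum1234" itself
     performs no conversion and returns 0.  */
  return atoi (p + strlen (prefix));
}
```

With the pointer advanced past the prefix, the ordering and upper-bound
asserts would compare the actual function index rather than 0.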

> +      previous = n;
> +    }
> +  return sym->st_value;
> +}
> diff --git a/nptl/tst-audit-threads-mod2.c b/nptl/tst-audit-threads-mod2.c
> new file mode 100644
> index 0000000000..f9817dd3dc
> --- /dev/null
> +++ b/nptl/tst-audit-threads-mod2.c
> @@ -0,0 +1,22 @@
> +/* Shared object with a huge number of functions for tst-audit-threads.
> +
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +/* Define all the retNumN functions in a library.  */
> +#define definenum
> +#include "tst-audit-threads.h"

OK.

> diff --git a/nptl/tst-audit-threads.c b/nptl/tst-audit-threads.c
> new file mode 100644
> index 0000000000..e4bf433bd8
> --- /dev/null
> +++ b/nptl/tst-audit-threads.c
> @@ -0,0 +1,97 @@
> +/* Test multi-threading using LD_AUDIT.
> +
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +/* This test uses a dummy LD_AUDIT library (test-audit-threads-mod1) and a
> +   library with a huge number of functions in order to validate lazy symbol
> +   binding with an audit library.  We use one thread per CPU to test that
> +   concurrent lazy resolution does not have any defects which would cause
> +   the process to fail.  We use an LD_AUDIT library to force the testing of
> +   the relocation resolution caching code in the dynamic loader i.e.
> +   _dl_runtime_profile and _dl_profile_fixup.  */

OK.
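
For anyone reading along, the lockstep idea described in that comment can be
sketched in isolation.  This is only an illustration under my own
assumptions, not the test itself: worker, run_lockstep, NSTEPS and the
counters are hypothetical names, and an atomic increment stands in for the
first call to each retNumN function.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for the number of retNumN functions.  */
enum { NSTEPS = 8 };

static pthread_barrier_t barrier;
static int num_workers;
static atomic_int calls[NSTEPS];
static atomic_int violations;

static void *
worker (void *unused)
{
  (void) unused;
  for (int step = 0; step < NSTEPS; step++)
    {
      /* Stand-in for calling retNumN through its PLT entry.  */
      atomic_fetch_add (&calls[step], 1);
      /* Nobody moves on to step + 1 until every thread finished step.  */
      pthread_barrier_wait (&barrier);
      if (atomic_load (&calls[step]) != num_workers)
        atomic_fetch_add (&violations, 1);
    }
  return NULL;
}

/* Run NTHREADS workers in lockstep.  Returns how many times a thread
   got past the barrier before all threads had made the call for that
   step; 0 is expected.  */
int
run_lockstep (int nthreads)
{
  pthread_t *threads = calloc (nthreads, sizeof (pthread_t));
  if (threads == NULL
      || pthread_barrier_init (&barrier, NULL, nthreads) != 0)
    abort ();

  num_workers = nthreads;
  atomic_store (&violations, 0);
  for (int step = 0; step < NSTEPS; step++)
    atomic_store (&calls[step], 0);

  for (int i = 0; i < nthreads; i++)
    if (pthread_create (&threads[i], NULL, worker, NULL) != 0)
      abort ();
  for (int i = 0; i < nthreads; i++)
    pthread_join (threads[i], NULL);

  pthread_barrier_destroy (&barrier);
  free (threads);
  return atomic_load (&violations);
}
```

The barrier after every call is what makes all threads hit the same
unresolved entry at roughly the same time, which is the window the test
is trying to exercise.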

> +
> +#include <support/xthread.h>
> +#include <strings.h>
> +#include <stdlib.h>
> +#include <sys/sysinfo.h>
> +
> +static int do_test (void);
> +
> +/* This test usually takes less than 3s to run.  However, there are cases that
> +   take up to 30s.  */

OK.

> +#define TIMEOUT 60
> +#define TEST_FUNCTION do_test ()
> +#include "../test-skeleton.c"
> +
> +/* Declare the functions we are going to call.  */
> +#define externnum
> +#include "tst-audit-threads.h"
> +#undef externnum
> +
> +int num_threads;
> +pthread_barrier_t barrier;
> +
> +void
> +sync_all (int num)
> +{
> +  pthread_barrier_wait (&barrier);
> +}
> +
> +void
> +call_all_ret_nums (void)
> +{
> +  /* Call each function one at a time from all threads.  */
> +#define callnum
> +#include "tst-audit-threads.h"
> +#undef callnum
> +}
> +
> +void *
> +thread_main (void *unused)
> +{
> +  call_all_ret_nums ();
> +  return NULL;
> +}
> +
> +#define STR2(X) #X
> +#define STR(X) STR2(X)
> +
> +static int
> +do_test (void)
> +{
> +  int i;
> +  pthread_t *threads;
> +
> +  num_threads = get_nprocs ();
> +  if (num_threads <= 1)
> +    num_threads = 2;
> +
> +  /* Used to synchronize all the threads after calling each retNumN.  */
> +  xpthread_barrier_init (&barrier, NULL, num_threads);
> +
> +  threads = (pthread_t *) xcalloc (num_threads, sizeof(pthread_t));
> +  for (i = 0; i < num_threads; i++)
> +    threads[i] = xpthread_create(NULL, thread_main, NULL);
> +
> +  for (i = 0; i < num_threads; i++)
> +    xpthread_join(threads[i]);
> +
> +  free (threads);
> +
> +  return 0;
> +}
> diff --git a/nptl/tst-audit-threads.h b/nptl/tst-audit-threads.h
> new file mode 100644
> index 0000000000..491d0dcbf0
> --- /dev/null
> +++ b/nptl/tst-audit-threads.h
> @@ -0,0 +1,89 @@
> +/* Helper header for test-audit-threads.
> +
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +/* We use this helper to create a large number of functions, all of
> +   which will be resolved lazily and thus have their PLT updated.
> +   This is done to provide enough functions that we can statistically
> +   observe a thread vs. PLT resolution failure if one exists.  */
> +
> +#define CONCAT(a, b) a ## b
> +#define NUM(x, y) CONCAT (x, y)
> +
> +#define FUNC10(x)	\
> +  FUNC (NUM (x, 0));	\
> +  FUNC (NUM (x, 1));	\
> +  FUNC (NUM (x, 2));	\
> +  FUNC (NUM (x, 3));	\
> +  FUNC (NUM (x, 4));	\
> +  FUNC (NUM (x, 5));	\
> +  FUNC (NUM (x, 6));	\
> +  FUNC (NUM (x, 7));	\
> +  FUNC (NUM (x, 8));	\
> +  FUNC (NUM (x, 9))
> +
> +#define FUNC100(x)	\
> +  FUNC10 (NUM (x, 0));	\
> +  FUNC10 (NUM (x, 1));	\
> +  FUNC10 (NUM (x, 2));	\
> +  FUNC10 (NUM (x, 3));	\
> +  FUNC10 (NUM (x, 4));	\
> +  FUNC10 (NUM (x, 5));	\
> +  FUNC10 (NUM (x, 6));	\
> +  FUNC10 (NUM (x, 7));	\
> +  FUNC10 (NUM (x, 8));	\
> +  FUNC10 (NUM (x, 9))
> +
> +#define FUNC1000(x)		\
> +  FUNC100 (NUM (x, 0));		\
> +  FUNC100 (NUM (x, 1));		\
> +  FUNC100 (NUM (x, 2));		\
> +  FUNC100 (NUM (x, 3));		\
> +  FUNC100 (NUM (x, 4));		\
> +  FUNC100 (NUM (x, 5));		\
> +  FUNC100 (NUM (x, 6));		\
> +  FUNC100 (NUM (x, 7));		\
> +  FUNC100 (NUM (x, 8));		\
> +  FUNC100 (NUM (x, 9))
> +
> +#define FUNC7000()	\
> +  FUNC1000 (1);		\
> +  FUNC1000 (2);		\
> +  FUNC1000 (3);		\
> +  FUNC1000 (4);		\
> +  FUNC1000 (5);		\
> +  FUNC1000 (6);		\
> +  FUNC1000 (7);
> +
> +#ifdef FUNC
> +# undef FUNC
> +#endif
> +
> +#ifdef externnum
> +# define FUNC(x) extern int CONCAT (retNum, x) (void)
> +#endif
> +
> +#ifdef definenum
> +# define FUNC(x) int CONCAT (retNum, x) (void) { return x; }
> +#endif
> +
> +#ifdef callnum
> +# define FUNC(x) CONCAT (retNum, x) (); sync_all (x)
> +#endif
> +

Please add:

/* A value of 7000 functions is chosen as an arbitrarily large
   number of functions that will allow us enough attempts to
   verify lazy resolution operation.  */

> +FUNC7000 ();

OK.
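
As an aside for readers of the header, the multi-pass trick can be tried at
a smaller scale.  The sketch below is my own reduction and not part of the
patch: it keeps CONCAT, NUM and the retNum naming from the header, but only
expands ten functions, drops the separating semicolons, and adds a
hypothetical sum_all helper in place of the sync_all calls.

```c
#define CONCAT(a, b) a ## b
#define NUM(x, y) CONCAT (x, y)

/* Expand FUNC ten times, appending the digits 0..9 to the prefix X.  */
#define FUNC10(x) \
  FUNC (NUM (x, 0)) FUNC (NUM (x, 1)) FUNC (NUM (x, 2)) \
  FUNC (NUM (x, 3)) FUNC (NUM (x, 4)) FUNC (NUM (x, 5)) \
  FUNC (NUM (x, 6)) FUNC (NUM (x, 7)) FUNC (NUM (x, 8)) \
  FUNC (NUM (x, 9))

/* First pass: define retNum10 .. retNum19, each returning its number.  */
#define FUNC(x) int CONCAT (retNum, x) (void) { return x; }
FUNC10 (1)
#undef FUNC

/* Second pass: reuse the same list to call every function once.  */
int
sum_all (void)
{
  int sum = 0;
#define FUNC(x) sum += CONCAT (retNum, x) ();
  FUNC10 (1)
#undef FUNC
  return sum;
}
```

By the same expansion, FUNC1000 (1) through FUNC1000 (7) in the real header
should produce retNum1000 through retNum7999, which is consistent with the
CALLNUM value of 7999 used in the audit module.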

> 


-- 
Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv5] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-11-07 21:54                 ` Carlos O'Donell
@ 2018-11-22 18:21                   ` Tulio Magno Quites Machado Filho
  2018-11-29 14:28                     ` Carlos O'Donell
  0 siblings, 1 reply; 40+ messages in thread
From: Tulio Magno Quites Machado Filho @ 2018-11-22 18:21 UTC (permalink / raw)
  To: Carlos O'Donell, Florian Weimer, libc-alpha,
	John David Anglin, Adhemerval Zanella, Joseph Myers,
	Florian Weimer

Carlos O'Donell <carlos@redhat.com> writes:

> On 10/23/18 4:52 PM, Tulio Magno Quites Machado Filho wrote:
> This is my suggested wording. It doesn't block commit, but some of what 
> you write needs clarification.
>
> Suggested commit message:
>
> Fix _dl_profile_fixup data-dependency issue (Bug 23690)
>
> There is a data-dependency between the fields of struct l_reloc_result
> and the field used as the initialization guard. Users of the guard
> expect writes to the structure to be observable when they also observe
> the guard initialized. The solution for this problem is to use an acquire
> load and a release store on the guard to ensure previous writes to the
> structure are observable if the guard is initialized.
>
> The previous implementation used DL_FIXUP_VALUE_ADDR (l_reloc_result->addr)
> as the initialization guard, making it impossible for some architectures
> to load and store it atomically, e.g. hppa and ia64, due to its larger size.
>
> This commit adds an unsigned int to l_reloc_result to be used as the new
> initialization guard of the struct, making it possible to load and store
> it atomically on all architectures. The fix ensures that the values
> observed in l_reloc_result are consistent and do not lead to crashes.
> The algorithm is documented in the code in elf/dl-runtime.c 
> (_dl_profile_fixup). Not all data races have been eliminated.

Fixed.

>>    /* By default we do not call the pltexit function.  */
>>    long int framesize = -1;
>>  
>> +
>>  #ifdef SHARED
>>    /* Auditing checkpoint: report the PLT entering and allow the
>>       auditors to change the value.  */
>> -  if (DL_FIXUP_VALUE_CODE_ADDR (value) != 0 && GLRO(dl_naudit) > 0
>> +  if (init != 0 && GLRO(dl_naudit) > 0
>
> This bit worries me, and took most of my review time to think up and
> review surrounding code.
>
> Isn't 'init != 0' always going to be true?
>
> Up above if it's 0 then we do the initialization, and set it to 1.
>
> Otherwise it's non-zero and we load value.
>
> In both cases it's non-zero by the time we reach here.

Indeed.

> The previous check had some interesting side-effects in that if value
> was relocated to a NULL value, we would skip running the auditor.

I haven't seen this happening yet.

> The test here is probably not about the initialization guard, but
> rather if value is non-NULL then run the auditor.
>
> I think this needs restoring to 'DL_FIXUP_VALUE_CODE_ADDR (value) != 0'
>
> What do you think?

AFAICS, the same analysis applies to 'DL_FIXUP_VALUE_CODE_ADDR (value)'.
It will never be 0 at this point.

I think we can safely add 'assert (DL_FIXUP_VALUE_CODE_ADDR (value) != 0)' and
remove both tests for init and DL_FIXUP_VALUE_CODE_ADDR (value) from here.
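
To make the argument concrete, here is a stripped-down sketch of the control
flow.  It uses C11 atomics rather than glibc's internal atomic_load_acquire
and atomic_store_release macros, and the names cache_entry, slow_resolve,
resolve_calls and fixup are made up for illustration.  It also ignores
dl_bind_not and the auditors entirely, and it assumes the lookup never
yields 0.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical cut-down version of struct reloc_result.  */
struct cache_entry
{
  uintptr_t addr;     /* Cached resolution result.  */
  atomic_uint init;   /* Guard: 0 = not resolved yet, 1 = addr is valid.  */
};

/* Counts how often the slow path ran.  */
atomic_int resolve_calls;

/* Hypothetical stand-in for the symbol lookup; never returns 0.  */
static uintptr_t
slow_resolve (void)
{
  atomic_fetch_add (&resolve_calls, 1);
  return 0x1000;
}

uintptr_t
fixup (struct cache_entry *e)
{
  uintptr_t value;

  if (atomic_load_explicit (&e->init, memory_order_acquire) == 0)
    {
      /* First time: resolve, fill in the entry, then publish the guard.
         Two threads may both get here and write the same addr; like the
         patch, this sketch does not remove that race.  */
      value = slow_resolve ();
      e->addr = value;
      atomic_store_explicit (&e->init, 1, memory_order_release);
    }
  else
    /* The acquire load above makes the earlier write to addr visible.  */
    value = e->addr;

  /* Both branches assign a resolved address, so this cannot fire.  */
  assert (value != 0);
  return value;
}
```

Whichever branch is taken, value has been assigned a resolved address
before the check, so a test on either init or value at that point is
redundant.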

What do you think?

>> +#ifdef callnum
>> +# define FUNC(x) CONCAT (retNum, x) (); sync_all (x)
>> +#endif
>> +
>
> Please add:
>
> /* A value of 7000 functions is chosen as an arbitrarily large
>    number of functions that will allow us enough attempts to
>    verify lazy resolution operation.  */

Fixed.

-- 
Tulio Magno

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv5] Protect _dl_profile_fixup data-dependency order [BZ #23690]
  2018-11-22 18:21                   ` Tulio Magno Quites Machado Filho
@ 2018-11-29 14:28                     ` Carlos O'Donell
  2018-11-30 16:23                       ` [PATCHv6] Fix _dl_profile_fixup data-dependency issue (Bug 23690) Tulio Magno Quites Machado Filho
  0 siblings, 1 reply; 40+ messages in thread
From: Carlos O'Donell @ 2018-11-29 14:28 UTC (permalink / raw)
  To: Tulio Magno Quites Machado Filho, Florian Weimer, libc-alpha,
	John David Anglin, Adhemerval Zanella, Joseph Myers,
	Florian Weimer

On 11/22/18 1:21 PM, Tulio Magno Quites Machado Filho wrote:
>>>    /* By default we do not call the pltexit function.  */
>>>    long int framesize = -1;
>>>  
>>> +
>>>  #ifdef SHARED
>>>    /* Auditing checkpoint: report the PLT entering and allow the
>>>       auditors to change the value.  */
>>> -  if (DL_FIXUP_VALUE_CODE_ADDR (value) != 0 && GLRO(dl_naudit) > 0
>>> +  if (init != 0 && GLRO(dl_naudit) > 0
>>
>> This bit worries me, and took most of my review time to think up and
>> review surrounding code.
>>
>> Isn't 'init != 0' always going to be true?
>>
>> Up above if it's 0 then we do the initialization, and set it to 1.
>>
>> Otherwise it's non-zero and we load value.
>>
>> In both cases it's non-zero by the time we reach here.
> 
> Indeed.
> 
>> The previous check had some interesting side-effects in that if value
>> was relocated to a NULL value, we would skip running the auditor.
> 
> I haven't seen this happening yet.

OK.

>> The test here is probably not about the initialization guard, but
>> rather if value is non-NULL then run the auditor.
>>
>> I think this needs restoring to 'DL_FIXUP_VALUE_CODE_ADDR (value) != 0'
>>
>> What do you think?
> 
> AFAICS, the same analysis applies to 'DL_FIXUP_VALUE_CODE_ADDR (value)'.
> It will never be 0 at this point.

Agreed.

> I think we can safely add 'assert (DL_FIXUP_VALUE_CODE_ADDR (value) != 0)' and
> remove both tests for init and DL_FIXUP_VALUE_CODE_ADDR (value) from here.
> 
> What do you think?

That's fine with me.

>>> +#ifdef callnum
>>> +# define FUNC(x) CONCAT (retNum, x) (); sync_all (x)
>>> +#endif
>>> +
>>
>> Please add:
>>
>> /* A value of 7000 functions is chosen as an arbitrarily large
>>    number of functions that will allow us enough attempts to
>>    verify lazy resolution operation.  */
> 
> Fixed.
> 

Please post a v6 and I'll do one final check over it and then we'll commit.

I'm eager to see this fixed for our downstream users that reported this bug :-)

-- 
Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCHv6] Fix _dl_profile_fixup data-dependency issue (Bug 23690)
  2018-11-29 14:28                     ` Carlos O'Donell
@ 2018-11-30 16:23                       ` Tulio Magno Quites Machado Filho
  2018-11-30 19:56                         ` Carlos O'Donell
  0 siblings, 1 reply; 40+ messages in thread
From: Tulio Magno Quites Machado Filho @ 2018-11-30 16:23 UTC (permalink / raw)
  To: Carlos O'Donell, libc-alpha
  Cc: Florian Weimer, John David Anglin, Adhemerval Zanella,
	Joseph Myers, Florian Weimer

Changes since v5:
 - Changed commit message.
 - New source code comment explaining why 7k functions.
 - dl-runtime.c: replaced a test for init != 0 with an assert() on
   DL_FIXUP_VALUE_CODE_ADDR (value).

Changes since v4:
 - Updated commit message.
 - Replaced memory fences with atomic load/store acquire/release.
 - Removed resultp from _dl_profile_fixup.
 - Improved source code comments.

Changes since v3:

 - Improved comments.
 - Started to use -Wl,-z,lazy.
 - Added field init to l_reloc_result to be used as a guard.

Changes since v2:

 - Fixed coding style in nptl/tst-audit-threads-mod1.c.
 - Replaced pthread.h functions with respective support/xthread.h ones.
 - Replaced malloc() with xcalloc() in nptl/tst-audit-threads.c.
 - Removed bzero().
 - Reduced the number of functions to 7k in order to fit the relocation
   limit of some architectures, e.g. m68k, mips.
 - Fixed issues in nptl/Makefile.

Changes since v1:

 - Fixed the coding style issues.
 - Replaced atomic loads/store with memory fences.
 - Added a test.

---- 8< ----

There is a data-dependency between the fields of struct l_reloc_result
and the field used as the initialization guard. Users of the guard
expect writes to the structure to be observable when they also observe
the guard initialized. The solution for this problem is to use an acquire
load and a release store on the guard to ensure previous writes to the
structure are observable if the guard is initialized.

The previous implementation used DL_FIXUP_VALUE_ADDR (l_reloc_result->addr)
as the initialization guard, making it impossible for some architectures
to load and store it atomically, e.g. hppa and ia64, due to its larger size.

This commit adds an unsigned int to l_reloc_result to be used as the new
initialization guard of the struct, making it possible to load and store
it atomically on all architectures. The fix ensures that the values
observed in l_reloc_result are consistent and do not lead to crashes.
The algorithm is documented in the code in elf/dl-runtime.c
(_dl_profile_fixup). Not all data races have been eliminated.

Tested with build-many-glibcs and on powerpc, powerpc64, and powerpc64le.

2018-11-30  Tulio Magno Quites Machado Filho  <tuliom@linux.ibm.com>

	[BZ #23690]
	* elf/dl-runtime.c (_dl_profile_fixup): Guarantee memory
	modification order when accessing reloc_result->addr.
	* include/link.h (reloc_result): Add field init.
	* nptl/Makefile (tests): Add tst-audit-threads.
	(modules-names): Add tst-audit-threads-mod1 and
	tst-audit-threads-mod2.
	Add rules to build tst-audit-threads.
	* nptl/tst-audit-threads-mod1.c: New file.
	* nptl/tst-audit-threads-mod2.c: Likewise.
	* nptl/tst-audit-threads.c: Likewise.
	* nptl/tst-audit-threads.h: Likewise.

Signed-off-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
---
 elf/dl-runtime.c              | 48 ++++++++++++++++++---
 include/link.h                |  4 ++
 nptl/Makefile                 | 14 ++++++-
 nptl/tst-audit-threads-mod1.c | 74 +++++++++++++++++++++++++++++++++
 nptl/tst-audit-threads-mod2.c | 22 ++++++++++
 nptl/tst-audit-threads.c      | 97 +++++++++++++++++++++++++++++++++++++++++++
 nptl/tst-audit-threads.h      | 92 ++++++++++++++++++++++++++++++++++++++++
 7 files changed, 344 insertions(+), 7 deletions(-)
 create mode 100644 nptl/tst-audit-threads-mod1.c
 create mode 100644 nptl/tst-audit-threads-mod2.c
 create mode 100644 nptl/tst-audit-threads.c
 create mode 100644 nptl/tst-audit-threads.h

diff --git a/elf/dl-runtime.c b/elf/dl-runtime.c
index 63bbc89776..3d2f4a7a76 100644
--- a/elf/dl-runtime.c
+++ b/elf/dl-runtime.c
@@ -183,10 +183,36 @@ _dl_profile_fixup (
   /* This is the address in the array where we store the result of previous
      relocations.  */
   struct reloc_result *reloc_result = &l->l_reloc_result[reloc_index];
-  DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
 
-  DL_FIXUP_VALUE_TYPE value = *resultp;
-  if (DL_FIXUP_VALUE_CODE_ADDR (value) == 0)
+ /* CONCURRENCY NOTES:
+
+  Multiple threads may be calling the same PLT sequence and with
+  LD_AUDIT enabled they will be calling into _dl_profile_fixup to
+  update the reloc_result with the result of the lazy resolution.
+  The reloc_result guard variable is init, and we use acquire loads
+  and release stores on it to ensure that the contents of the
+  structure are consistent with the loaded value of the guard.
+  This does not fix all of the data races that occur when two or more
+  threads read reloc_result->init with a value of zero and read
+  and write to that reloc_result concurrently.  The expectation is
+  generally that while this is a data race it works because the
+  threads write the same values.  Until the data races are fixed
+  there is a potential for problems to arise from these data races.
+  The reloc result updates should happen in parallel but there should
+  be an atomic RMW which does the final update to the real result
+  entry (see bug 23790).
+
+  The following code uses reloc_result->init set to 0 to indicate if it is
+  the first time this object is being relocated, otherwise 1 which
+  indicates the object has already been relocated.
+
+  Reading/Writing from/to reloc_result->init must not happen
+  before previous writes to reloc_result complete as they could
+  end up with an incomplete struct.  */
+  DL_FIXUP_VALUE_TYPE value;
+  unsigned int init = atomic_load_acquire (&reloc_result->init);
+
+  if (init == 0)
     {
       /* This is the first time we have to relocate this object.  */
       const ElfW(Sym) *const symtab
@@ -346,19 +372,31 @@ _dl_profile_fixup (
 
       /* Store the result for later runs.  */
       if (__glibc_likely (! GLRO(dl_bind_not)))
-	*resultp = value;
+	{
+	  reloc_result->addr = value;
+	  /* Guarantee all previous writes complete before
+	     init is updated.  See CONCURRENCY NOTES earlier.  */
+	  atomic_store_release (&reloc_result->init, 1);
+	}
+      init = 1;
     }
+  else
+    value = reloc_result->addr;
 
   /* By default we do not call the pltexit function.  */
   long int framesize = -1;
 
+
 #ifdef SHARED
   /* Auditing checkpoint: report the PLT entering and allow the
      auditors to change the value.  */
-  if (DL_FIXUP_VALUE_CODE_ADDR (value) != 0 && GLRO(dl_naudit) > 0
+  if (GLRO(dl_naudit) > 0
       /* Don't do anything if no auditor wants to intercept this call.  */
       && (reloc_result->enterexit & LA_SYMB_NOPLTENTER) == 0)
     {
+      /* Sanity check:  DL_FIXUP_VALUE_CODE_ADDR (value) should have been
+	 initialized earlier in this function or in another thread.  */
+      assert (DL_FIXUP_VALUE_CODE_ADDR (value) != 0);
       ElfW(Sym) *defsym = ((ElfW(Sym) *) D_PTR (reloc_result->bound,
 						l_info[DT_SYMTAB])
 			   + reloc_result->boundndx);
diff --git a/include/link.h b/include/link.h
index 5924594548..83b1c34b7b 100644
--- a/include/link.h
+++ b/include/link.h
@@ -216,6 +216,10 @@ struct link_map
       unsigned int boundndx;
       uint32_t enterexit;
       unsigned int flags;
+      /* CONCURRENCY NOTE: This is used to guard the concurrent initialization
+	 of the relocation result across multiple threads.  See the more
+	 detailed notes in elf/dl-runtime.c.  */
+      unsigned int init;
     } *l_reloc_result;
 
     /* Pointer to the version information if available.  */
diff --git a/nptl/Makefile b/nptl/Makefile
index 982e43adfa..98b0aa01c7 100644
--- a/nptl/Makefile
+++ b/nptl/Makefile
@@ -382,7 +382,8 @@ tests += tst-cancelx2 tst-cancelx3 tst-cancelx4 tst-cancelx5 \
 	 tst-cleanupx0 tst-cleanupx1 tst-cleanupx2 tst-cleanupx3 tst-cleanupx4 \
 	 tst-oncex3 tst-oncex4
 ifeq ($(build-shared),yes)
-tests += tst-atfork2 tst-tls4 tst-_res1 tst-fini1 tst-compat-forwarder
+tests += tst-atfork2 tst-tls4 tst-_res1 tst-fini1 tst-compat-forwarder \
+	 tst-audit-threads
 tests-internal += tst-tls3 tst-tls3-malloc tst-tls5 tst-stackguard1
 tests-nolibpthread += tst-fini1
 ifeq ($(have-z-execstack),yes)
@@ -394,7 +395,8 @@ modules-names = tst-atfork2mod tst-tls3mod tst-tls4moda tst-tls4modb \
 		tst-tls5mod tst-tls5moda tst-tls5modb tst-tls5modc \
 		tst-tls5modd tst-tls5mode tst-tls5modf tst-stack4mod \
 		tst-_res1mod1 tst-_res1mod2 tst-execstack-mod tst-fini1mod \
-		tst-join7mod tst-compat-forwarder-mod
+		tst-join7mod tst-compat-forwarder-mod tst-audit-threads-mod1 \
+		tst-audit-threads-mod2
 extra-test-objs += $(addsuffix .os,$(strip $(modules-names))) \
 		   tst-cleanup4aux.o tst-cleanupx4aux.o
 test-extras += tst-cleanup4aux tst-cleanupx4aux
@@ -712,6 +714,14 @@ $(objpfx)tst-compat-forwarder: $(objpfx)tst-compat-forwarder-mod.so
 
 tst-mutex10-ENV = GLIBC_TUNABLES=glibc.elision.enable=1
 
+# Protect against a build using -Wl,-z,now.
+LDFLAGS-tst-audit-threads-mod1.so = -Wl,-z,lazy
+LDFLAGS-tst-audit-threads-mod2.so = -Wl,-z,lazy
+LDFLAGS-tst-audit-threads = -Wl,-z,lazy
+$(objpfx)tst-audit-threads: $(objpfx)tst-audit-threads-mod2.so
+$(objpfx)tst-audit-threads.out: $(objpfx)tst-audit-threads-mod1.so
+tst-audit-threads-ENV = LD_AUDIT=$(objpfx)tst-audit-threads-mod1.so
+
 # The tests here better do not run in parallel
 ifneq ($(filter %tests,$(MAKECMDGOALS)),)
 .NOTPARALLEL:
diff --git a/nptl/tst-audit-threads-mod1.c b/nptl/tst-audit-threads-mod1.c
new file mode 100644
index 0000000000..615d5ee512
--- /dev/null
+++ b/nptl/tst-audit-threads-mod1.c
@@ -0,0 +1,74 @@
+/* Dummy audit library for tst-audit-threads.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <elf.h>
+#include <link.h>
+#include <stdio.h>
+#include <assert.h>
+#include <string.h>
+
+/* We must use a dummy LD_AUDIT module to force the dynamic loader to
+   *not* update the real PLT, and instead use a cached value for the
+   lazy resolution result.  It is the update of that cached value that
+   we are testing for correctness by doing this.  */
+
+/* Library to be audited.  */
+#define LIB "tst-audit-threads-mod2.so"
+/* CALLNUM is the number of retNum functions.  */
+#define CALLNUM 7999
+
+#define CONCATX(a, b) __CONCAT (a, b)
+
+static int previous = 0;
+
+unsigned int
+la_version (unsigned int ver)
+{
+  return 1;
+}
+
+unsigned int
+la_objopen (struct link_map *map, Lmid_t lmid, uintptr_t *cookie)
+{
+  return LA_FLG_BINDTO | LA_FLG_BINDFROM;
+}
+
+uintptr_t
+CONCATX(la_symbind, __ELF_NATIVE_CLASS) (ElfW(Sym) *sym,
+					unsigned int ndx,
+					uintptr_t *refcook,
+					uintptr_t *defcook,
+					unsigned int *flags,
+					const char *symname)
+{
+  const char * retnum = "retNum";
+  char * num = strstr (symname, retnum);
+  int n;
+  /* Validate if the symbols are getting called in the correct order.
+     This code is here to verify binutils does not optimize out the PLT
+     entries that require the symbol binding.  */
+  if (num != NULL)
+    {
+      n = atoi (num + strlen (retnum));
+      assert (n >= previous);
+      assert (n <= CALLNUM);
+      previous = n;
+    }
+  return sym->st_value;
+}
diff --git a/nptl/tst-audit-threads-mod2.c b/nptl/tst-audit-threads-mod2.c
new file mode 100644
index 0000000000..f9817dd3dc
--- /dev/null
+++ b/nptl/tst-audit-threads-mod2.c
@@ -0,0 +1,22 @@
+/* Shared object with a huge number of functions for tst-audit-threads.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* Define all the retNumN functions in a library.  */
+#define definenum
+#include "tst-audit-threads.h"
diff --git a/nptl/tst-audit-threads.c b/nptl/tst-audit-threads.c
new file mode 100644
index 0000000000..e4bf433bd8
--- /dev/null
+++ b/nptl/tst-audit-threads.c
@@ -0,0 +1,97 @@
+/* Test multi-threading using LD_AUDIT.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* This test uses a dummy LD_AUDIT library (tst-audit-threads-mod1) and a
+   library with a huge number of functions in order to validate lazy symbol
+   binding with an audit library.  We use one thread per CPU to test that
+   concurrent lazy resolution does not have any defects which would cause
+   the process to fail.  We use an LD_AUDIT library to force the testing of
+   the relocation resolution caching code in the dynamic loader, i.e.
+   _dl_runtime_profile and _dl_profile_fixup.  */
+
+#include <support/xthread.h>
+#include <strings.h>
+#include <stdlib.h>
+#include <sys/sysinfo.h>
+
+static int do_test (void);
+
+/* This test usually takes less than 3s to run.  However, there are cases that
+   take up to 30s.  */
+#define TIMEOUT 60
+#define TEST_FUNCTION do_test ()
+#include "../test-skeleton.c"
+
+/* Declare the functions we are going to call.  */
+#define externnum
+#include "tst-audit-threads.h"
+#undef externnum
+
+int num_threads;
+pthread_barrier_t barrier;
+
+void
+sync_all (int num)
+{
+  pthread_barrier_wait (&barrier);
+}
+
+void
+call_all_ret_nums (void)
+{
+  /* Call each function one at a time from all threads.  */
+#define callnum
+#include "tst-audit-threads.h"
+#undef callnum
+}
+
+void *
+thread_main (void *unused)
+{
+  call_all_ret_nums ();
+  return NULL;
+}
+
+#define STR2(X) #X
+#define STR(X) STR2(X)
+
+static int
+do_test (void)
+{
+  int i;
+  pthread_t *threads;
+
+  num_threads = get_nprocs ();
+  if (num_threads <= 1)
+    num_threads = 2;
+
+  /* Used to synchronize all the threads after calling each retNumN.  */
+  xpthread_barrier_init (&barrier, NULL, num_threads);
+
+  threads = (pthread_t *) xcalloc (num_threads, sizeof (pthread_t));
+  for (i = 0; i < num_threads; i++)
+    threads[i] = xpthread_create (NULL, thread_main, NULL);
+
+  for (i = 0; i < num_threads; i++)
+    xpthread_join (threads[i]);
+
+  free (threads);
+
+  return 0;
+}
diff --git a/nptl/tst-audit-threads.h b/nptl/tst-audit-threads.h
new file mode 100644
index 0000000000..1c9ecc08df
--- /dev/null
+++ b/nptl/tst-audit-threads.h
@@ -0,0 +1,92 @@
+/* Helper header for tst-audit-threads.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* We use this helper to create a large number of functions, all of
+   which will be resolved lazily and thus have their PLT updated.
+   This is done to provide enough functions that we can statistically
+   observe a thread vs. PLT resolution failure if one exists.  */
+
+#define CONCAT(a, b) a ## b
+#define NUM(x, y) CONCAT (x, y)
+
+#define FUNC10(x)	\
+  FUNC (NUM (x, 0));	\
+  FUNC (NUM (x, 1));	\
+  FUNC (NUM (x, 2));	\
+  FUNC (NUM (x, 3));	\
+  FUNC (NUM (x, 4));	\
+  FUNC (NUM (x, 5));	\
+  FUNC (NUM (x, 6));	\
+  FUNC (NUM (x, 7));	\
+  FUNC (NUM (x, 8));	\
+  FUNC (NUM (x, 9))
+
+#define FUNC100(x)	\
+  FUNC10 (NUM (x, 0));	\
+  FUNC10 (NUM (x, 1));	\
+  FUNC10 (NUM (x, 2));	\
+  FUNC10 (NUM (x, 3));	\
+  FUNC10 (NUM (x, 4));	\
+  FUNC10 (NUM (x, 5));	\
+  FUNC10 (NUM (x, 6));	\
+  FUNC10 (NUM (x, 7));	\
+  FUNC10 (NUM (x, 8));	\
+  FUNC10 (NUM (x, 9))
+
+#define FUNC1000(x)		\
+  FUNC100 (NUM (x, 0));		\
+  FUNC100 (NUM (x, 1));		\
+  FUNC100 (NUM (x, 2));		\
+  FUNC100 (NUM (x, 3));		\
+  FUNC100 (NUM (x, 4));		\
+  FUNC100 (NUM (x, 5));		\
+  FUNC100 (NUM (x, 6));		\
+  FUNC100 (NUM (x, 7));		\
+  FUNC100 (NUM (x, 8));		\
+  FUNC100 (NUM (x, 9))
+
+#define FUNC7000()	\
+  FUNC1000 (1);		\
+  FUNC1000 (2);		\
+  FUNC1000 (3);		\
+  FUNC1000 (4);		\
+  FUNC1000 (5);		\
+  FUNC1000 (6);		\
+  FUNC1000 (7);
+
+#ifdef FUNC
+# undef FUNC
+#endif
+
+#ifdef externnum
+# define FUNC(x) extern int CONCAT (retNum, x) (void)
+#endif
+
+#ifdef definenum
+# define FUNC(x) int CONCAT (retNum, x) (void) { return x; }
+#endif
+
+#ifdef callnum
+# define FUNC(x) CONCAT (retNum, x) (); sync_all (x)
+#endif
+
+/* A value of 7000 functions is chosen as an arbitrarily large
+   number of functions that will allow us enough attempts to
+   verify lazy resolution operation.  */
+FUNC7000 ();
-- 
2.14.5

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv6] Fix _dl_profile_fixup data-dependency issue (Bug 23690)
  2018-11-30 16:23                       ` [PATCHv6] Fix _dl_profile_fixup data-dependency issue (Bug 23690) Tulio Magno Quites Machado Filho
@ 2018-11-30 19:56                         ` Carlos O'Donell
  2018-11-30 20:10                           ` Tulio Magno Quites Machado Filho
  0 siblings, 1 reply; 40+ messages in thread
From: Carlos O'Donell @ 2018-11-30 19:56 UTC (permalink / raw)
  To: Tulio Magno Quites Machado Filho, libc-alpha
  Cc: Florian Weimer, John David Anglin, Adhemerval Zanella,
	Joseph Myers, Florian Weimer

On 11/30/18 11:22 AM, Tulio Magno Quites Machado Filho wrote:
> Changes since v5:
>  - Changed commit message.
>  - New source code comment explaining why 7k functions.
>  - dl-runtime.c: replaced a test for init != 0 with an assert() on
>    DL_FIXUP_VALUE_CODE_ADDR (value).
> 
> Changes since v4:
>  - Updated commit message.
>  - Replaced memory fences with atomic load/store acquire/release.
>  - Removed resultp from _dl_profile_fixup.
>  - Improved source code comments.
> 
> Changes since v3:
> 
>  - Improved comments.
>  - Started to use -Wl,-z,lazy.
>  - Added field init to l_reloc_result to be used as a guard.
> 
> Changes since v2:
> 
>  - Fixed coding style in nptl/tst-audit-threads-mod1.c.
>  - Replaced pthreads.h functions with respective support/xthread.h ones.
>  - Replaced malloc() with xcalloc() in nptl/tst-audit-threads.c.
>  - Removed bzero().
>  - Reduced the amount of functions to 7k in order to fit the relocation
>    limit  of some architectures, e.g. m68k, mips.
>  - Fixed issues in nptl/Makefile.
> 
> Changes since v1:
> 
>  - Fixed the coding style issues.
>  - Replaced atomic loads/store with memory fences.
>  - Added a test.
> 
> ---- 8< ----
> 
> There is a data-dependency between the fields of struct l_reloc_result
> and the field used as the initialization guard. Users of the guard
> expect writes to the structure to be observable when they also observe
> the guard initialized. The solution for this problem is to use an acquire
> and release load and store to ensure previous writes to the structure are
> observable if the guard is initialized.
> 
> The previous implementation used DL_FIXUP_VALUE_ADDR (l_reloc_result->addr)
> as the initialization guard, making it impossible for some architectures
> to load and store it atomically, i.e. hppa and ia64, due to its larger size.
> 
> This commit adds an unsigned int to l_reloc_result to be used as the new
> initialization guard of the struct, making it possible to load and store
> it atomically in all architectures. The fix ensures that the values
> observed in l_reloc_result are consistent and do not lead to crashes.
> The algorithm is documented in the code in elf/dl-runtime.c
> (_dl_profile_fixup). Not all data races have been eliminated.
> 
> Tested with build-many-glibcs and on powerpc, powerpc64, and powerpc64le.

Perfect. Please commit to master :-)

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
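
For readers following the thread, the guard scheme described in the
commit message can be sketched in portable C11 as below.  This is only
an illustration under stated assumptions: struct cached_result, publish
and try_read are hypothetical names, and <stdatomic.h> stands in for
glibc's internal atomic_load_acquire / atomic_store_release macros.  It
is not the code in elf/dl-runtime.c.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for struct reloc_result; the field layout
   is illustrative only and is not the glibc definition.  */
struct cached_result
{
  uintptr_t addr;         /* Payload: resolved address.  */
  unsigned int boundndx;  /* Payload: some other cached field.  */
  atomic_uint init;       /* Guard: 0 = unpublished, 1 = published.  */
};

/* Writer: fill in the payload with plain stores, then publish it with
   a release store to the guard.  */
static void
publish (struct cached_result *r, uintptr_t addr, unsigned int ndx)
{
  r->addr = addr;
  r->boundndx = ndx;
  atomic_store_explicit (&r->init, 1, memory_order_release);
}

/* Reader: an acquire load of the guard.  If it observes 1, the
   payload written before the release store is visible as well.
   Returns 1 and fills *addr and *ndx on success, 0 otherwise.  */
static int
try_read (struct cached_result *r, uintptr_t *addr, unsigned int *ndx)
{
  if (atomic_load_explicit (&r->init, memory_order_acquire) == 0)
    return 0;
  *addr = r->addr;
  *ndx = r->boundndx;
  return 1;
}
```

As in the patch, this only orders the payload against the guard.  Two
threads that both observe the guard as 0 may still write the payload
concurrently, which is the remaining data race the commit message
mentions.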

> 2018-11-30  Tulio Magno Quites Machado Filho  <tuliom@linux.ibm.com>
> 
> 	[BZ #23690]
> 	* elf/dl-runtime.c (_dl_profile_fixup): Guarantee memory
> 	modification order when accessing reloc_result->addr.
> 	* include/link.h (reloc_result): Add field init.
> 	* nptl/Makefile (tests): Add tst-audit-threads.
> 	(modules-names): Add tst-audit-threads-mod1 and
> 	tst-audit-threads-mod2.
> 	Add rules to build tst-audit-threads.
> 	* nptl/tst-audit-threads-mod1.c: New file.
> 	* nptl/tst-audit-threads-mod2.c: Likewise.
> 	* nptl/tst-audit-threads.c: Likewise.
> 	* nptl/tst-audit-threads.h: Likewise.
> 
> Signed-off-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
> ---
>  elf/dl-runtime.c              | 48 ++++++++++++++++++---
>  include/link.h                |  4 ++
>  nptl/Makefile                 | 14 ++++++-
>  nptl/tst-audit-threads-mod1.c | 74 +++++++++++++++++++++++++++++++++
>  nptl/tst-audit-threads-mod2.c | 22 ++++++++++
>  nptl/tst-audit-threads.c      | 97 +++++++++++++++++++++++++++++++++++++++++++
>  nptl/tst-audit-threads.h      | 92 ++++++++++++++++++++++++++++++++++++++++
>  7 files changed, 344 insertions(+), 7 deletions(-)
>  create mode 100644 nptl/tst-audit-threads-mod1.c
>  create mode 100644 nptl/tst-audit-threads-mod2.c
>  create mode 100644 nptl/tst-audit-threads.c
>  create mode 100644 nptl/tst-audit-threads.h
> 
> diff --git a/elf/dl-runtime.c b/elf/dl-runtime.c
> index 63bbc89776..3d2f4a7a76 100644
> --- a/elf/dl-runtime.c
> +++ b/elf/dl-runtime.c
> @@ -183,10 +183,36 @@ _dl_profile_fixup (
>    /* This is the address in the array where we store the result of previous
>       relocations.  */
>    struct reloc_result *reloc_result = &l->l_reloc_result[reloc_index];
> -  DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
>  
> -  DL_FIXUP_VALUE_TYPE value = *resultp;
> -  if (DL_FIXUP_VALUE_CODE_ADDR (value) == 0)
> + /* CONCURRENCY NOTES:
> +
> +  Multiple threads may be calling the same PLT sequence and with
> +  LD_AUDIT enabled they will be calling into _dl_profile_fixup to
> +  update the reloc_result with the result of the lazy resolution.
> +  The reloc_result guard variable is init, and we use
> +  acquire/release loads and stores to it to ensure that the results of
> +  the structure are consistent with the loaded value of the guard.
> +  This does not fix all of the data races that occur when two or more
> +  threads read reloc_result->init with a value of zero and read
> +  and write to that reloc_result concurrently.  The expectation is
> +  generally that while this is a data race it works because the
> +  threads write the same values.  Until the data races are fixed
> +  there is a potential for problems to arise from these data races.
> +  The reloc result updates should happen in parallel but there should
> +  be an atomic RMW which does the final update to the real result
> +  entry (see bug 23790).
> +
> +  The following code uses reloc_result->init set to 0 to indicate if it is
> +  the first time this object is being relocated, otherwise 1 which
> +  indicates the object has already been relocated.
> +
> +  Reading/Writing from/to reloc_result->init must not happen
> +  before previous writes to reloc_result complete as they could
> +  end-up with an incomplete struct.  */

OK.

> +  DL_FIXUP_VALUE_TYPE value;
> +  unsigned int init = atomic_load_acquire (&reloc_result->init);

OK.

> +
> +  if (init == 0)
>      {
>        /* This is the first time we have to relocate this object.  */
>        const ElfW(Sym) *const symtab
> @@ -346,19 +372,31 @@ _dl_profile_fixup (
>  
>        /* Store the result for later runs.  */
>        if (__glibc_likely (! GLRO(dl_bind_not)))
> -	*resultp = value;
> +	{
> +	  reloc_result->addr = value;
> +	  /* Guarantee all previous writes complete before
> +	     init is updated.  See CONCURRENCY NOTES earlier  */
> +	  atomic_store_release (&reloc_result->init, 1);
> +	}
> +      init = 1;
>      }
> +  else
> +    value = reloc_result->addr;
>  
>    /* By default we do not call the pltexit function.  */
>    long int framesize = -1;
>  
> +
>  #ifdef SHARED
>    /* Auditing checkpoint: report the PLT entering and allow the
>       auditors to change the value.  */
> -  if (DL_FIXUP_VALUE_CODE_ADDR (value) != 0 && GLRO(dl_naudit) > 0
> +  if (GLRO(dl_naudit) > 0

OK.

>        /* Don't do anything if no auditor wants to intercept this call.  */
>        && (reloc_result->enterexit & LA_SYMB_NOPLTENTER) == 0)
>      {
> +      /* Sanity check:  DL_FIXUP_VALUE_CODE_ADDR (value) should have been
> +	 initialized earlier in this function or in another thread.  */
> +      assert (DL_FIXUP_VALUE_CODE_ADDR (value) != 0);

OK.

>        ElfW(Sym) *defsym = ((ElfW(Sym) *) D_PTR (reloc_result->bound,
>  						l_info[DT_SYMTAB])
>  			   + reloc_result->boundndx);
> diff --git a/include/link.h b/include/link.h
> index 5924594548..83b1c34b7b 100644
> --- a/include/link.h
> +++ b/include/link.h
> @@ -216,6 +216,10 @@ struct link_map
>        unsigned int boundndx;
>        uint32_t enterexit;
>        unsigned int flags;
> +      /* CONCURRENCY NOTE: This is used to guard the concurrent initialization
> +	 of the relocation result across multiple threads.  See the more
> +	 detailed notes in elf/dl-runtime.c.  */
> +      unsigned int init;

OK.

>      } *l_reloc_result;
>  
>      /* Pointer to the version information if available.  */
> diff --git a/nptl/Makefile b/nptl/Makefile
> index 982e43adfa..98b0aa01c7 100644
> --- a/nptl/Makefile
> +++ b/nptl/Makefile
> @@ -382,7 +382,8 @@ tests += tst-cancelx2 tst-cancelx3 tst-cancelx4 tst-cancelx5 \
>  	 tst-cleanupx0 tst-cleanupx1 tst-cleanupx2 tst-cleanupx3 tst-cleanupx4 \
>  	 tst-oncex3 tst-oncex4
>  ifeq ($(build-shared),yes)
> -tests += tst-atfork2 tst-tls4 tst-_res1 tst-fini1 tst-compat-forwarder
> +tests += tst-atfork2 tst-tls4 tst-_res1 tst-fini1 tst-compat-forwarder \
> +	 tst-audit-threads

OK.

>  tests-internal += tst-tls3 tst-tls3-malloc tst-tls5 tst-stackguard1
>  tests-nolibpthread += tst-fini1
>  ifeq ($(have-z-execstack),yes)
> @@ -394,7 +395,8 @@ modules-names = tst-atfork2mod tst-tls3mod tst-tls4moda tst-tls4modb \
>  		tst-tls5mod tst-tls5moda tst-tls5modb tst-tls5modc \
>  		tst-tls5modd tst-tls5mode tst-tls5modf tst-stack4mod \
>  		tst-_res1mod1 tst-_res1mod2 tst-execstack-mod tst-fini1mod \
> -		tst-join7mod tst-compat-forwarder-mod
> +		tst-join7mod tst-compat-forwarder-mod tst-audit-threads-mod1 \
> +		tst-audit-threads-mod2

OK.

>  extra-test-objs += $(addsuffix .os,$(strip $(modules-names))) \
>  		   tst-cleanup4aux.o tst-cleanupx4aux.o
>  test-extras += tst-cleanup4aux tst-cleanupx4aux
> @@ -712,6 +714,14 @@ $(objpfx)tst-compat-forwarder: $(objpfx)tst-compat-forwarder-mod.so
>  
>  tst-mutex10-ENV = GLIBC_TUNABLES=glibc.elision.enable=1
>  
> +# Protect against a build using -Wl,-z,now.
> +LDFLAGS-tst-audit-threads-mod1.so = -Wl,-z,lazy
> +LDFLAGS-tst-audit-threads-mod2.so = -Wl,-z,lazy
> +LDFLAGS-tst-audit-threads = -Wl,-z,lazy
> +$(objpfx)tst-audit-threads: $(objpfx)tst-audit-threads-mod2.so
> +$(objpfx)tst-audit-threads.out: $(objpfx)tst-audit-threads-mod1.so
> +tst-audit-threads-ENV = LD_AUDIT=$(objpfx)tst-audit-threads-mod1.so

OK.

> +
>  # The tests here better do not run in parallel
>  ifneq ($(filter %tests,$(MAKECMDGOALS)),)
>  .NOTPARALLEL:
> diff --git a/nptl/tst-audit-threads-mod1.c b/nptl/tst-audit-threads-mod1.c
> new file mode 100644
> index 0000000000..615d5ee512
> --- /dev/null
> +++ b/nptl/tst-audit-threads-mod1.c
> @@ -0,0 +1,74 @@
> +/* Dummy audit library for test-audit-threads.
> +
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include <elf.h>
> +#include <link.h>
> +#include <stdio.h>
> +#include <assert.h>
> +#include <string.h>
> +
> +/* We must use a dummy LD_AUDIT module to force the dynamic loader to
> +   *not* update the real PLT, and instead use a cached value for the
> +   lazy resolution result.  It is the update of that cached value that
> +   we are testing for correctness by doing this.  */
> +
> +/* Library to be audited.  */
> +#define LIB "tst-audit-threads-mod2.so"
> +/* CALLNUM is the number of retNum functions.  */
> +#define CALLNUM 7999
> +
> +#define CONCATX(a, b) __CONCAT (a, b)
> +
> +static int previous = 0;
> +
> +unsigned int
> +la_version (unsigned int ver)
> +{
> +  return 1;
> +}
> +
> +unsigned int
> +la_objopen (struct link_map *map, Lmid_t lmid, uintptr_t *cookie)
> +{
> +  return LA_FLG_BINDTO | LA_FLG_BINDFROM;
> +}
> +
> +uintptr_t
> +CONCATX(la_symbind, __ELF_NATIVE_CLASS) (ElfW(Sym) *sym,
> +					unsigned int ndx,
> +					uintptr_t *refcook,
> +					uintptr_t *defcook,
> +					unsigned int *flags,
> +					const char *symname)
> +{
> +  const char * retnum = "retNum";
> +  char * num = strstr (symname, retnum);
> +  int n;
> +  /* Validate if the symbols are getting called in the correct order.
> +     This code is here to verify binutils does not optimize out the PLT
> +     entries that require the symbol binding.  */
> +  if (num != NULL)
> +    {
> +      n = atoi (num);
> +      assert (n >= previous);
> +      assert (n <= CALLNUM);
> +      previous = n;
> +    }
> +  return sym->st_value;
> +}
> diff --git a/nptl/tst-audit-threads-mod2.c b/nptl/tst-audit-threads-mod2.c
> new file mode 100644
> index 0000000000..f9817dd3dc
> --- /dev/null
> +++ b/nptl/tst-audit-threads-mod2.c
> @@ -0,0 +1,22 @@
> +/* Shared object with a huge number of functions for test-audit-threads.
> +
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +/* Define all the retNumN functions in a library.  */
> +#define definenum
> +#include "tst-audit-threads.h"
> diff --git a/nptl/tst-audit-threads.c b/nptl/tst-audit-threads.c
> new file mode 100644
> index 0000000000..e4bf433bd8
> --- /dev/null
> +++ b/nptl/tst-audit-threads.c
> @@ -0,0 +1,97 @@
> +/* Test multi-threading using LD_AUDIT.
> +
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +/* This test uses a dummy LD_AUDIT library (test-audit-threads-mod1) and a
> +   library with a huge number of functions in order to validate lazy symbol
> +   binding with an audit library.  We use one thread per CPU to test that
> +   concurrent lazy resolution does not have any defects which would cause
> +   the process to fail.  We use an LD_AUDIT library to force the testing of
> +   the relocation resolution caching code in the dynamic loader i.e.
> +   _dl_runtime_profile and _dl_profile_fixup.  */
> +
> +#include <support/xthread.h>
> +#include <strings.h>
> +#include <stdlib.h>
> +#include <sys/sysinfo.h>
> +
> +static int do_test (void);
> +
> +/* This test usually takes less than 3s to run.  However, there are cases that
> +   take up to 30s.  */
> +#define TIMEOUT 60
> +#define TEST_FUNCTION do_test ()
> +#include "../test-skeleton.c"
> +
> +/* Declare the functions we are going to call.  */
> +#define externnum
> +#include "tst-audit-threads.h"
> +#undef externnum
> +
> +int num_threads;
> +pthread_barrier_t barrier;
> +
> +void
> +sync_all (int num)
> +{
> +  pthread_barrier_wait (&barrier);
> +}
> +
> +void
> +call_all_ret_nums (void)
> +{
> +  /* Call each function one at a time from all threads.  */
> +#define callnum
> +#include "tst-audit-threads.h"
> +#undef callnum
> +}
> +
> +void *
> +thread_main (void *unused)
> +{
> +  call_all_ret_nums ();
> +  return NULL;
> +}
> +
> +#define STR2(X) #X
> +#define STR(X) STR2(X)
> +
> +static int
> +do_test (void)
> +{
> +  int i;
> +  pthread_t *threads;
> +
> +  num_threads = get_nprocs ();
> +  if (num_threads <= 1)
> +    num_threads = 2;
> +
> +  /* Used to synchronize all the threads after calling each retNumN.  */
> +  xpthread_barrier_init (&barrier, NULL, num_threads);
> +
> +  threads = (pthread_t *) xcalloc (num_threads, sizeof(pthread_t));
> +  for (i = 0; i < num_threads; i++)
> +    threads[i] = xpthread_create(NULL, thread_main, NULL);
> +
> +  for (i = 0; i < num_threads; i++)
> +    xpthread_join(threads[i]);
> +
> +  free (threads);
> +
> +  return 0;
> +}
> diff --git a/nptl/tst-audit-threads.h b/nptl/tst-audit-threads.h
> new file mode 100644
> index 0000000000..1c9ecc08df
> --- /dev/null
> +++ b/nptl/tst-audit-threads.h
> @@ -0,0 +1,92 @@
> +/* Helper header for test-audit-threads.
> +
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +/* We use this helper to create a large number of functions, all of
> +   which will be resolved lazily and thus have their PLT updated.
> +   This is done to provide enough functions that we can statistically
> +   observe a thread vs. PLT resolution failure if one exists.  */
> +
> +#define CONCAT(a, b) a ## b
> +#define NUM(x, y) CONCAT (x, y)
> +
> +#define FUNC10(x)	\
> +  FUNC (NUM (x, 0));	\
> +  FUNC (NUM (x, 1));	\
> +  FUNC (NUM (x, 2));	\
> +  FUNC (NUM (x, 3));	\
> +  FUNC (NUM (x, 4));	\
> +  FUNC (NUM (x, 5));	\
> +  FUNC (NUM (x, 6));	\
> +  FUNC (NUM (x, 7));	\
> +  FUNC (NUM (x, 8));	\
> +  FUNC (NUM (x, 9))
> +
> +#define FUNC100(x)	\
> +  FUNC10 (NUM (x, 0));	\
> +  FUNC10 (NUM (x, 1));	\
> +  FUNC10 (NUM (x, 2));	\
> +  FUNC10 (NUM (x, 3));	\
> +  FUNC10 (NUM (x, 4));	\
> +  FUNC10 (NUM (x, 5));	\
> +  FUNC10 (NUM (x, 6));	\
> +  FUNC10 (NUM (x, 7));	\
> +  FUNC10 (NUM (x, 8));	\
> +  FUNC10 (NUM (x, 9))
> +
> +#define FUNC1000(x)		\
> +  FUNC100 (NUM (x, 0));		\
> +  FUNC100 (NUM (x, 1));		\
> +  FUNC100 (NUM (x, 2));		\
> +  FUNC100 (NUM (x, 3));		\
> +  FUNC100 (NUM (x, 4));		\
> +  FUNC100 (NUM (x, 5));		\
> +  FUNC100 (NUM (x, 6));		\
> +  FUNC100 (NUM (x, 7));		\
> +  FUNC100 (NUM (x, 8));		\
> +  FUNC100 (NUM (x, 9))
> +
> +#define FUNC7000()	\
> +  FUNC1000 (1);		\
> +  FUNC1000 (2);		\
> +  FUNC1000 (3);		\
> +  FUNC1000 (4);		\
> +  FUNC1000 (5);		\
> +  FUNC1000 (6);		\
> +  FUNC1000 (7);
> +
> +#ifdef FUNC
> +# undef FUNC
> +#endif
> +
> +#ifdef externnum
> +# define FUNC(x) extern int CONCAT (retNum, x) (void)
> +#endif
> +
> +#ifdef definenum
> +# define FUNC(x) int CONCAT (retNum, x) (void) { return x; }
> +#endif
> +
> +#ifdef callnum
> +# define FUNC(x) CONCAT (retNum, x) (); sync_all (x)
> +#endif
> +
> +/* A value of 7000 functions is chosen as an arbitrarily large
> +   number of functions that will allow us enough attempts to
> +   verify lazy resolution operation.  */
> +FUNC7000 ();
> 

OK.
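
The helper header above builds its function list by token pasting:
each level appends one digit, so FUNC1000 (1) covers retNum1000 through
retNum1999.  A reduced sketch of the same idea with ten functions
follows.  FOR10, DEFINE, CALL and call_all are hypothetical names; the
sketch passes the per-number operation as a macro parameter rather than
redefining FUNC between inclusions as the patch does.

```c
#include <assert.h>

#define CONCAT(a, b) a ## b
#define NUM(x, y) CONCAT (x, y)

/* Apply F to the ten numbers x0 .. x9.  Unlike the header in the
   patch, the operation is passed as a parameter (hypothetical FOR10)
   instead of redefining FUNC between inclusions.  */
#define FOR10(F, x) \
  F (NUM (x, 0)) F (NUM (x, 1)) F (NUM (x, 2)) F (NUM (x, 3)) \
  F (NUM (x, 4)) F (NUM (x, 5)) F (NUM (x, 6)) F (NUM (x, 7)) \
  F (NUM (x, 8)) F (NUM (x, 9))

/* One function per number: retNum10 () returns 10, and so on.  */
#define DEFINE(n) static int CONCAT (retNum, n) (void) { return n; }
/* One call per number, accumulating the results.  */
#define CALL(n) sum += CONCAT (retNum, n) ();

FOR10 (DEFINE, 1)

/* Call retNum10 .. retNum19 in order and return the sum.  */
static int
call_all (void)
{
  int sum = 0;
  FOR10 (CALL, 1)
  return sum;
}
```

The extra NUM level matters: the argument is expanded to a plain number
before CONCAT pastes it onto retNum, which is why the nested macros in
the header produce retNum1000 rather than a pasted macro name.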


-- 
Cheers,
Carlos.


* Re: [PATCHv6] Fix _dl_profile_fixup data-dependency issue (Bug 23690)
  2018-11-30 19:56                         ` Carlos O'Donell
@ 2018-11-30 20:10                           ` Tulio Magno Quites Machado Filho
  0 siblings, 0 replies; 40+ messages in thread
From: Tulio Magno Quites Machado Filho @ 2018-11-30 20:10 UTC (permalink / raw)
  To: libc-alpha
  Cc: Florian Weimer, John David Anglin, Adhemerval Zanella,
	Joseph Myers, Florian Weimer, Carlos O'Donell

Carlos O'Donell <carlos@redhat.com> writes:

> On 11/30/18 11:22 AM, Tulio Magno Quites Machado Filho wrote:
>> There is a data-dependency between the fields of struct l_reloc_result
>> and the field used as the initialization guard. Users of the guard
>> expect writes to the structure to be observable when they also observe
>> the guard initialized. The solution for this problem is to use an acquire
>> and release load and store to ensure previous writes to the structure are
>> observable if the guard is initialized.
>> 
>> The previous implementation used DL_FIXUP_VALUE_ADDR (l_reloc_result->addr)
>> as the initialization guard, making it impossible for some architectures
>> to load and store it atomically, i.e. hppa and ia64, due to its larger size.
>> 
>> This commit adds an unsigned int to l_reloc_result to be used as the new
>> initialization guard of the struct, making it possible to load and store
>> it atomically in all architectures. The fix ensures that the values
>> observed in l_reloc_result are consistent and do not lead to crashes.
>> The algorithm is documented in the code in elf/dl-runtime.c
>> (_dl_profile_fixup). Not all data races have been eliminated.
>> 
>> Tested with build-many-glibcs and on powerpc, powerpc64, and powerpc64le.
>
> Perfect. Please commit to master :-)
>
> Reviewed-by: Carlos O'Donell <carlos@redhat.com>

Pushed as e5d262effe3a87164308a3f37e61b32d0348692a.

Thanks!

-- 
Tulio Magno


Thread overview: 40+ messages
2018-09-19 21:48 [PATCH] Protect _dl_profile_fixup data-dependency order [BZ #23690] Tulio Magno Quites Machado Filho
2018-09-20  0:04 ` Joseph Myers
2018-09-20 13:03   ` Tulio Magno Quites Machado Filho
2018-09-20  0:16 ` Carlos O'Donell
2018-09-20  1:59   ` John David Anglin
2018-09-20  2:01     ` Carlos O'Donell
2018-09-20 13:34       ` John David Anglin
2018-10-11 21:26         ` Carlos O'Donell
2018-09-20 16:42   ` Tulio Magno Quites Machado Filho
2018-09-20 17:04     ` Florian Weimer
2018-09-21 17:21       ` Tulio Magno Quites Machado Filho
2018-09-21 17:24         ` Florian Weimer
2018-09-21 17:37           ` Tulio Magno Quites Machado Filho
2018-09-21 17:36         ` Tulio Magno Quites Machado Filho
2018-10-08 19:28   ` [PATCHv2] " Tulio Magno Quites Machado Filho
2018-10-08 19:45     ` Florian Weimer
2018-10-11  6:15       ` [PATCHv3] " Tulio Magno Quites Machado Filho
2018-10-12  1:08         ` Carlos O'Donell
2018-10-15 13:01           ` Florian Weimer
2018-10-15 15:10             ` Carlos O'Donell
2018-10-17 21:25               ` Florian Weimer
2018-10-18  2:14                 ` Carlos O'Donell
2018-10-18  7:24                   ` Carlos O'Donell
2018-10-18 10:21                   ` Florian Weimer
2018-10-18 16:56                     ` Carlos O'Donell
2018-10-18 18:22                       ` Adhemerval Zanella
2018-10-18 19:25                         ` Carlos O'Donell
2018-10-18 20:01                           ` Adhemerval Zanella
2018-10-23  1:33                             ` Carlos O'Donell
2018-10-23 14:11                               ` Adhemerval Zanella
2018-10-23 15:56                                 ` Carlos O'Donell
2018-10-18  2:02           ` [PATCHv4] " Tulio Magno Quites Machado Filho
2018-10-18  2:17             ` Carlos O'Donell
2018-10-23 21:07               ` [PATCHv5] " Tulio Magno Quites Machado Filho
2018-11-07 21:54                 ` Carlos O'Donell
2018-11-22 18:21                   ` Tulio Magno Quites Machado Filho
2018-11-29 14:28                     ` Carlos O'Donell
2018-11-30 16:23                       ` [PATCHv6] Fix _dl_profile_fixup data-dependency issue (Bug 23690) Tulio Magno Quites Machado Filho
2018-11-30 19:56                         ` Carlos O'Donell
2018-11-30 20:10                           ` Tulio Magno Quites Machado Filho
