public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed
* BZ 20822 :powerpc: race condition in __lll_unlock_elision
@ 2016-11-15 16:13 Rajalakshmi Srinivasaraghavan
  2016-11-16 13:40 ` Adhemerval Zanella
  2016-11-17 14:47 ` Torvald Riegel
  0 siblings, 2 replies; 22+ messages in thread
From: Rajalakshmi Srinivasaraghavan @ 2016-11-15 16:13 UTC (permalink / raw)
  To: libc-alpha
  Cc: aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos,
	adhemerval.zanella, adconrad, wschmidt


The problem initially reported was memory corruption in MongoDB, seen only 
on Ubuntu 16.04 ppc64le (not on Ubuntu 15).
The problem was not easily reproducible; after debugging with many tools 
and creative ideas, it was narrowed down to lock elision in glibc.
Ubuntu 16.04 ships glibc 2.23 with lock elision enabled by default, which 
is why the testcase fails only on 16.04.

As stated in BZ 20822, the short description is:

"The update of *adapt_count after the release of the lock causes a race 
condition. Thread A unlocks, thread B continues and returns, destroying 
mutex on stack,
  then gets into another function, thread A writes to *adapt_count and 
corrupts stack.The window is very narrow and requires that the
machine be in smt mode, so likely the two threads have to be on the same 
core in order to hit this"

Looking back at the changes to 
sysdeps/unix/sysv/linux/powerpc/elision-unlock.c, the suspect is commit
https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=fadd2ad9cc36115440d50b0eae9299e65988917d,
which was introduced to improve performance.

I am considering the following fixes:

1) Update adapt_count before the unlock.  But I have not yet determined 
whether this will affect performance.

2) In elision-lock.c/elision-trylock.c, update adapt_count right after 
the lock is acquired, at the end of the function.
In either case we have to be careful not to introduce another race 
condition.

Any suggestions?

-- 
Thanks
Rajalakshmi S

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision
  2016-11-15 16:13 BZ 20822 :powerpc: race condition in __lll_unlock_elision Rajalakshmi Srinivasaraghavan
@ 2016-11-16 13:40 ` Adhemerval Zanella
  2016-11-16 14:08   ` Szabolcs Nagy
  2016-11-17 14:47 ` Torvald Riegel
  1 sibling, 1 reply; 22+ messages in thread
From: Adhemerval Zanella @ 2016-11-16 13:40 UTC (permalink / raw)
  To: Rajalakshmi Srinivasaraghavan, libc-alpha
  Cc: aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos, adconrad, wschmidt



On 15/11/2016 14:12, Rajalakshmi Srinivasaraghavan wrote:
> 
> The initial problem reported was memory corruption in MongoDB only on Ubuntu 16.04 ppc64le(not on Ubuntu 15).
> The problem was not easily recreatable and debugging with many tools and creative ideas, the problem is narrowed down to lock elision in glibc.
> Ubuntu 16.04 has glibc 2.23 with lock elision enabled by default which made the testcase fail only on 16.04.
> 
> As stated in BZ 20822, short description is
> 
> "The update of *adapt_count after the release of the lock causes a race condition. Thread A unlocks, thread B continues and returns, destroying mutex on stack,
>  then gets into another function, thread A writes to *adapt_count and corrupts stack.The window is very narrow and requires that the
> machine be in smt mode, so likely the two threads have to be on the same core in order to hit this"
> 
> Looking back the file changes in sysdeps/unix/sysv/linux/powerpc/elision-unlock.c, suspecting the commit
>  https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=fadd2ad9cc36115440d50b0eae9299e65988917d which is introduced to improve
> the performance.
> 
> Thinking of the following fixes.
> 
> 1) Update of adapt_count before the unlock. But I have still not identified if its going to affect the performance.
> 
> 2) In elision-lock.c/elision-trylock.c update adapt_count right after the lock acquire at the end of the function.
> In either case we have to be careful not to introduce another race condition.
> 
> Any suggestions?
> 

The fadd2ad9c commit indeed should have raised a flag for us, since
it updates a variable ('adapt_count', i.e.
pthread_mutex_t.__data.__elision) both inside a transaction, where it
is semantically handled as an atomic access, and later in a
non-atomic way.  This is wrong, and I think we should either revert
the patch or move the adapt_count update to 'inside' the transaction
(your first option).

However, I am not sure whether this is the culprit of the issue.  The
bug report states the faulty scenario is:

"[...]  Thread A unlocks, thread B continues and returns, destroying
mutex on stack, then gets into another function, thread A writes to
*adapt_count and corrupts stack. [...]"

It is not clear whether, *after* thread B destroys the mutex, the
program just calls 'pthread_mutex_destroy' or the thread also
deallocates the mutex on the stack (by returning or something else).
The first case should not trigger an invalid access, but the second is
definitively a program error.

Either way, I am also not sure whether the current behaviour leads to
a correctness issue rather than just a performance one: the racy
'adapt_count' usage would lead to TLE being avoided, not really to an
invalid memory access.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision
  2016-11-16 13:40 ` Adhemerval Zanella
@ 2016-11-16 14:08   ` Szabolcs Nagy
  2016-11-16 14:31     ` Adhemerval Zanella
  0 siblings, 1 reply; 22+ messages in thread
From: Szabolcs Nagy @ 2016-11-16 14:08 UTC (permalink / raw)
  To: Adhemerval Zanella, Rajalakshmi Srinivasaraghavan, libc-alpha
  Cc: nd, aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos, adconrad,
	wschmidt

On 16/11/16 13:40, Adhemerval Zanella wrote:
> "[...]  Thread A unlocks, thread B continues and returns, destroying
> mutex on stack, then gets into another function, thread A writes to
> *adapt_count and corrupts stack. [...]"
> 
> It seems *after* thread B destroy the mutex, it was not clear that
> the program just calls 'pthread_mutex_destroy' or if the thread 
> deallocates the mutex on the stack (by returning or something else).
> First option should not trigger an invalid access, but second is 
> definitively an program error.
> 

does not look like a program error to me:

int i=0;

void threadB()
{
	pthread_mutex_t m[1];
	//...
	pthread_mutex_lock(m);
	if (i==1) {
		pthread_mutex_unlock(m);
		pthread_mutex_destroy(m); // ok, A is done with m
		return; // drop stack frame
	}
	//...
}

void threadA()
{
	// get m from thread B
	pthread_mutex_lock(m);
	i=1;
	pthread_mutex_unlock(m);
	// no more use of m here
}

this code can corrupt the stack of threadB if pthread_mutex_unlock
in threadA accesses m->adapt_count after m is unlocked.

(same for rwlocks and whatever else may be used with elision.)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision
  2016-11-16 14:08   ` Szabolcs Nagy
@ 2016-11-16 14:31     ` Adhemerval Zanella
  2016-11-16 15:54       ` Rajalakshmi Srinivasaraghavan
  0 siblings, 1 reply; 22+ messages in thread
From: Adhemerval Zanella @ 2016-11-16 14:31 UTC (permalink / raw)
  To: Szabolcs Nagy, Rajalakshmi Srinivasaraghavan, libc-alpha
  Cc: nd, aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos, adconrad,
	wschmidt



On 16/11/2016 12:07, Szabolcs Nagy wrote:
> On 16/11/16 13:40, Adhemerval Zanella wrote:
>> "[...]  Thread A unlocks, thread B continues and returns, destroying
>> mutex on stack, then gets into another function, thread A writes to
>> *adapt_count and corrupts stack. [...]"
>>
>> It seems *after* thread B destroy the mutex, it was not clear that
>> the program just calls 'pthread_mutex_destroy' or if the thread 
>> deallocates the mutex on the stack (by returning or something else).
>> First option should not trigger an invalid access, but second is 
>> definitively an program error.
>>
> 
> does not look like a program error to me:
> 
> int i=0;
> 
> void threadB()
> {
> 	pthread_mutex_t m[1];
> 	//...
> 	pthread_mutex_lock(m);
> 	if (i==1) {
> 		pthread_mutex_unlock(m);
> 	pthread_mutex_destroy(m); // ok, A is done with m
> 		return; // drop stack frame
> 	}
> 	//...
> }
> 
> void threadA()
> {
> 	// get m from thread B
> 	pthread_mutex_lock(m);
> 	i=1;
> 	pthread_mutex_unlock(m);
> 	// no more use of m here
> }
> 
> this code can corrupt the stack of threadB if pthread_mutex_unlock
> in threadA accesses m->adapt_count after m is unlocked.
> 
> (same for rwlocks and whatever else may be used with elision.)

If this is the scenario described in the bug report, it could be the
cause of the issue.  I was thinking of another scenario, where the
mutex unlock was not synchronized through a global variable.

It could be a case where code similar to what you described unlocks
the mutex in threadA, threadB runs and finishes, and then threadA
runs again and accesses invalid memory.

Anyway, I think the correct fix is simply to move the adapt_count
update to inside the transaction.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision
  2016-11-16 14:31     ` Adhemerval Zanella
@ 2016-11-16 15:54       ` Rajalakshmi Srinivasaraghavan
  0 siblings, 0 replies; 22+ messages in thread
From: Rajalakshmi Srinivasaraghavan @ 2016-11-16 15:54 UTC (permalink / raw)
  To: Adhemerval Zanella, Szabolcs Nagy, libc-alpha
  Cc: nd, aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos, adconrad,
	wschmidt



On 11/16/2016 08:01 PM, Adhemerval Zanella wrote:
> Anyway, I think the correct fix is simply to move the adapt_count
> update to inside the transaction.
Thanks Adhemerval and Szabolcs for the inputs.
I posted a patch on libc-alpha for this.  While it moves forward, I
will try to analyze the performance impact, if any.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision
  2016-11-15 16:13 BZ 20822 :powerpc: race condition in __lll_unlock_elision Rajalakshmi Srinivasaraghavan
  2016-11-16 13:40 ` Adhemerval Zanella
@ 2016-11-17 14:47 ` Torvald Riegel
  2016-11-21 23:42   ` Steven Munroe
  2016-12-04 12:14   ` Torvald Riegel
  1 sibling, 2 replies; 22+ messages in thread
From: Torvald Riegel @ 2016-11-17 14:47 UTC (permalink / raw)
  To: Rajalakshmi Srinivasaraghavan
  Cc: libc-alpha, aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos,
	adhemerval.zanella, adconrad, wschmidt

On Tue, 2016-11-15 at 21:42 +0530, Rajalakshmi Srinivasaraghavan wrote:
> The initial problem reported was memory corruption in MongoDB only on 
> Ubuntu 16.04 ppc64le(not on Ubuntu 15).
> The problem was not easily recreatable and debugging with many tools and 
> creative ideas, the problem is narrowed down to lock elision in glibc.
> Ubuntu 16.04 has glibc 2.23 with lock elision enabled by default which 
> made the testcase fail only on 16.04.
> 
> As stated in BZ 20822, short description is
> 
> "The update of *adapt_count after the release of the lock causes a race 
> condition. Thread A unlocks, thread B continues and returns, destroying 
> mutex on stack,
>   then gets into another function, thread A writes to *adapt_count and 
> corrupts stack.The window is very narrow and requires that the
> machine be in smt mode, so likely the two threads have to be on the same 
> core in order to hit this"
> 
> Looking back the file changes in 
> sysdeps/unix/sysv/linux/powerpc/elision-unlock.c, suspecting the commit
>   https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=fadd2ad9cc36115440d50b0eae9299e65988917d which is introduced to improve
> the performance.
> 
> Thinking of the following fixes.
> 
> 1) Update of adapt_count before the unlock. But I have still not 
> identified if its going to affect the performance.

Because of the requirements on mutex destruction (and the same for
rwlocks etc.), a thread must not access any data associated with the
mutex after having unlocked it (this includes reads, because it is
also correct to unmap the memory holding the mutex after destroying
it).

> 2) In elision-lock.c/elision-trylock.c update adapt_count right after 
> the lock acquire at the end of the function.
> In either case we have to be careful not to introduce another race 
> condition.

Every access to mutex data outside of transactions must use atomic
operations if there would be data races otherwise (according to the C11
memory model).  I know that this isn't followed in the current mutex
code, but it really should be, and your change qualifies as new code so
please take care of this too.

I also think that if we can have concurrent accesses to a particular
variable both inside and outside of transactions, we should use atomic
accesses inside the transactions too, just for consistency.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision
  2016-11-17 14:47 ` Torvald Riegel
@ 2016-11-21 23:42   ` Steven Munroe
  2016-11-22  8:44     ` Torvald Riegel
  2016-12-04 12:14   ` Torvald Riegel
  1 sibling, 1 reply; 22+ messages in thread
From: Steven Munroe @ 2016-11-21 23:42 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Rajalakshmi Srinivasaraghavan, libc-alpha, aaron Sawdey,
	Ulrich Weigand, Steve Munroe, carlos, adhemerval.zanella,
	adconrad, wschmidt

On Thu, 2016-11-17 at 15:46 +0100, Torvald Riegel wrote:
> On Tue, 2016-11-15 at 21:42 +0530, Rajalakshmi Srinivasaraghavan wrote:
> > The initial problem reported was memory corruption in MongoDB only on 
> > Ubuntu 16.04 ppc64le(not on Ubuntu 15).
> > The problem was not easily recreatable and debugging with many tools and 
> > creative ideas, the problem is narrowed down to lock elision in glibc.
> > Ubuntu 16.04 has glibc 2.23 with lock elision enabled by default which 
> > made the testcase fail only on 16.04.
> > 
> > As stated in BZ 20822, short description is
> > 
> > "The update of *adapt_count after the release of the lock causes a race 
> > condition. Thread A unlocks, thread B continues and returns, destroying 
> > mutex on stack,
> >   then gets into another function, thread A writes to *adapt_count and 
> > corrupts stack.The window is very narrow and requires that the
> > machine be in smt mode, so likely the two threads have to be on the same 
> > core in order to hit this"
> > 
> > Looking back the file changes in 
> > sysdeps/unix/sysv/linux/powerpc/elision-unlock.c, suspecting the commit
> >   https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=fadd2ad9cc36115440d50b0eae9299e65988917d which is introduced to improve
> > the performance.
> > 
> > Thinking of the following fixes.
> > 
> > 1) Update of adapt_count before the unlock. But I have still not 
> > identified if its going to affect the performance.
> 
> Because of the requirements on mutex destruction (and same for rwlock
> etc.), a thread must not access any data associated with the mutex after
> having unlocked it (this includes reads because it's correct to also
> unmap the memory holding the mutex after destroying the mutex). 
> 
> > 2) In elision-lock.c/elision-trylock.c update adapt_count right after 
> > the lock acquire at the end of the function.
> > In either case we have to be careful not to introduce another race 
> > condition.
> 
> Every access to mutex data outside of transactions must use atomic
> operations if there would be data races otherwise (according to the C11
> memory model).  I know that this isn't followed in the current mutex
> code, but it really should be, and your change qualifies as new code so
> please take care of this too.
> 
In this case (with the proposed patch) the adapt_count update is within
the critical region, and adding C11 atomics does not add value; it would
generally make the logic harder to read.

__atomic_store_n (adapt_count, (__atomic_load_n(adapt_count,
__ATOMIC_RELAXED) - 1), __ATOMIC_RELAXED)

That is a lot of syntax without any real value.

> I also think that if we can have concurrent accesses to a particular
> variable both inside and outside of transactions, we should use atomic
> accesses inside the transactions too, just for consistency.
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision
  2016-11-21 23:42   ` Steven Munroe
@ 2016-11-22  8:44     ` Torvald Riegel
  2016-11-22  8:55       ` Florian Weimer
  2016-11-22 18:03       ` BZ 20822 :powerpc: race condition in __lll_unlock_elision Steven Munroe
  0 siblings, 2 replies; 22+ messages in thread
From: Torvald Riegel @ 2016-11-22  8:44 UTC (permalink / raw)
  To: munroesj
  Cc: Rajalakshmi Srinivasaraghavan, libc-alpha, aaron Sawdey,
	Ulrich Weigand, Steve Munroe, carlos, adhemerval.zanella,
	adconrad, wschmidt

On Mon, 2016-11-21 at 17:42 -0600, Steven Munroe wrote:
> On Thu, 2016-11-17 at 15:46 +0100, Torvald Riegel wrote:
> > On Tue, 2016-11-15 at 21:42 +0530, Rajalakshmi Srinivasaraghavan wrote:
> > > The initial problem reported was memory corruption in MongoDB only on 
> > > Ubuntu 16.04 ppc64le(not on Ubuntu 15).
> > > The problem was not easily recreatable and debugging with many tools and 
> > > creative ideas, the problem is narrowed down to lock elision in glibc.
> > > Ubuntu 16.04 has glibc 2.23 with lock elision enabled by default which 
> > > made the testcase fail only on 16.04.
> > > 
> > > As stated in BZ 20822, short description is
> > > 
> > > "The update of *adapt_count after the release of the lock causes a race 
> > > condition. Thread A unlocks, thread B continues and returns, destroying 
> > > mutex on stack,
> > >   then gets into another function, thread A writes to *adapt_count and 
> > > corrupts stack.The window is very narrow and requires that the
> > > machine be in smt mode, so likely the two threads have to be on the same 
> > > core in order to hit this"
> > > 
> > > Looking back the file changes in 
> > > sysdeps/unix/sysv/linux/powerpc/elision-unlock.c, suspecting the commit
> > >   https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=fadd2ad9cc36115440d50b0eae9299e65988917d which is introduced to improve
> > > the performance.
> > > 
> > > Thinking of the following fixes.
> > > 
> > > 1) Update of adapt_count before the unlock. But I have still not 
> > > identified if its going to affect the performance.
> > 
> > Because of the requirements on mutex destruction (and same for rwlock
> > etc.), a thread must not access any data associated with the mutex after
> > having unlocked it (this includes reads because it's correct to also
> > unmap the memory holding the mutex after destroying the mutex). 
> > 
> > > 2) In elision-lock.c/elision-trylock.c update adapt_count right after 
> > > the lock acquire at the end of the function.
> > > In either case we have to be careful not to introduce another race 
> > > condition.
> > 
> > Every access to mutex data outside of transactions must use atomic
> > operations if there would be data races otherwise (according to the C11
> > memory model).  I know that this isn't followed in the current mutex
> > code, but it really should be, and your change qualifies as new code so
> > please take care of this too.
> > 
> In this case (with the proposed patch) the adapt_count update is within
> the critical region and adding C11 atomics does not add value and would
> generally make the logical harder to read. 
> 
> __atomic_store_n (adapt_count, (__atomic_load_n(adapt_count,
> __ATOMIC_RELAXED) - 1), __ATOMIC_RELAXED)
> 
> Is a lot of syntax without any real value.

First, it would actually be this in glibc:
atomic_store_relaxed (&adapt_count,
  atomic_load_relaxed(&adapt_count) - 1);

Second, it does add value in that it makes it clear that there are
concurrent accesses elsewhere that are not in transactions.  We have
consensus to use the new C11-like atomics in new or changed code, and I
don't see a reason to make an exception here.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision
  2016-11-22  8:44     ` Torvald Riegel
@ 2016-11-22  8:55       ` Florian Weimer
  2016-11-22  9:41         ` Torvald Riegel
  2016-11-22 18:03       ` BZ 20822 :powerpc: race condition in __lll_unlock_elision Steven Munroe
  1 sibling, 1 reply; 22+ messages in thread
From: Florian Weimer @ 2016-11-22  8:55 UTC (permalink / raw)
  To: Torvald Riegel, munroesj
  Cc: Rajalakshmi Srinivasaraghavan, libc-alpha, aaron Sawdey,
	Ulrich Weigand, Steve Munroe, carlos, adhemerval.zanella,
	adconrad, wschmidt

On 11/22/2016 09:44 AM, Torvald Riegel wrote:

>> In this case (with the proposed patch) the adapt_count update is within
>> the critical region and adding C11 atomics does not add value and would
>> generally make the logical harder to read.
>>
>> __atomic_store_n (adapt_count, (__atomic_load_n(adapt_count,
>> __ATOMIC_RELAXED) - 1), __ATOMIC_RELAXED)
>>
>> Is a lot of syntax without any real value.
>
> First, it would actually be this in glibc:
> atomic_store_relaxed (&adapt_count,
>   atomic_load_relaxed(&adapt_count) - 1);
>
> Second, it does add value in that it makes it clear that there are
> concurrent accesses elsewhere that are not in transactions.  We have
> consensus to use the new C11-like atomics in new or changed code, and I
> don't see a reason to make an exception here.

A possible reason: The code does not actually follow the C11 memory 
model, so using C11-like atomics is rather misleading.  And unlike code 
using incorrect synchronization or legacy atomics, we don't have a path 
towards the C11 model, because what the code does is completely outside 
it, so there's no argument in favor of gradual/partial conversion.

Florian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision
  2016-11-22  8:55       ` Florian Weimer
@ 2016-11-22  9:41         ` Torvald Riegel
  2016-11-22 10:52           ` Florian Weimer
  0 siblings, 1 reply; 22+ messages in thread
From: Torvald Riegel @ 2016-11-22  9:41 UTC (permalink / raw)
  To: Florian Weimer
  Cc: munroesj, Rajalakshmi Srinivasaraghavan, libc-alpha,
	aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos,
	adhemerval.zanella, adconrad, wschmidt

On Tue, 2016-11-22 at 09:55 +0100, Florian Weimer wrote:
> On 11/22/2016 09:44 AM, Torvald Riegel wrote:
> 
> >> In this case (with the proposed patch) the adapt_count update is within
> >> the critical region and adding C11 atomics does not add value and would
> >> generally make the logical harder to read.
> >>
> >> __atomic_store_n (adapt_count, (__atomic_load_n(adapt_count,
> >> __ATOMIC_RELAXED) - 1), __ATOMIC_RELAXED)
> >>
> >> Is a lot of syntax without any real value.
> >
> > First, it would actually be this in glibc:
> > atomic_store_relaxed (&adapt_count,
> >   atomic_load_relaxed(&adapt_count) - 1);
> >
> > Second, it does add value in that it makes it clear that there are
> > concurrent accesses elsewhere that are not in transactions.  We have
> > consensus to use the new C11-like atomics in new or changed code, and I
> > don't see a reason to make an exception here.
> 
> A possible reason: The code does not actually follow the C11 memory 
> model, so using C11-like atomics is rather misleading.

While it is true that the C11 memory model does not model HW
transactions, it is not true to claim that this code is unrelated to the
C11 memory model.  First of all, there is code outside of transactions
that accesses adapt_count, so we obviously need atomic operations there.
Second, there needs to be *some* memory model, and it needs to
incorporate the compiler too.

> And unlike code 
> using incorrect synchronization or legacy atomics, we don't have a path 
> towards the C11 model because what the code does is completely outside 
> it, so there's no argument on favor of gradual/partial conversion.

The HW transaction has to synchronize with nontransactional code.  We're
obviously going to use the C11 model for nontransactional code.  Thus,
it cannot be unrelated.  It's true that we don't have a detailed
description of how HW transactions *extend* the C11 model, but it
clearly would be an extension because it simply has to fit in there and
it will use the same underlying mechanisms.

One can also think about it this way: There is nontransactional code
that has to use atomics (because there are concurrent accesses from
other nontransactional code as well as transactional code).  So we are
going to use atomics there.  Then, even if just for consistency, we're
going to use (relaxed MO) atomic accesses too for all other concurrent
accesses to the same data.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision
  2016-11-22  9:41         ` Torvald Riegel
@ 2016-11-22 10:52           ` Florian Weimer
  2016-11-22 13:45             ` Torvald Riegel
  0 siblings, 1 reply; 22+ messages in thread
From: Florian Weimer @ 2016-11-22 10:52 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: munroesj, Rajalakshmi Srinivasaraghavan, libc-alpha,
	aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos,
	adhemerval.zanella, adconrad, wschmidt

On 11/22/2016 10:41 AM, Torvald Riegel wrote:

>> And unlike code
>> using incorrect synchronization or legacy atomics, we don't have a path
>> towards the C11 model because what the code does is completely outside
>> it, so there's no argument on favor of gradual/partial conversion.
>
> The HW transaction has to synchronize with nontransactional code.  We're
> obviously going to use the C11 model for nontransactional code.  Thus,
> it cannot be unrelated.  It's true that we don't have a detailed
> description of how HW transactions *extend* the C11 model, but it
> clearly would be an extension because it simply has to fit in there and
> it will use the same underlying mechanisms.
>
> One can also think about it this way: There is nontransactional code
> that has to use atomics (because there are concurrent accesses from
> other nontransactional code as well as transactional code).  So we are
> going to use atomics there.  Then, even if just for consistency, we're
> going to use (relaxed MO) atomic accesses too for all other concurrent
> accesses to the same data.

The counter-argument to that is this: until we have a specification of 
how this kind of transactional memory works, we don't know what it means 
if an object is accessed from both a transaction and other 
non-transactional code, nor what external constraints are needed to make 
such access valid (if any).  There are some really scary examples from 
early transactional memory implementations.

But come to think of it, the core issue doesn't have much to do with 
atomics at all.  It's just about how the transactional memory 
implementation works.  So I don't really have any strong opinions here.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision
  2016-11-22 10:52           ` Florian Weimer
@ 2016-11-22 13:45             ` Torvald Riegel
  2016-11-22 15:02               ` Florian Weimer
  0 siblings, 1 reply; 22+ messages in thread
From: Torvald Riegel @ 2016-11-22 13:45 UTC (permalink / raw)
  To: Florian Weimer
  Cc: munroesj, Rajalakshmi Srinivasaraghavan, libc-alpha,
	aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos,
	adhemerval.zanella, adconrad, wschmidt

On Tue, 2016-11-22 at 11:52 +0100, Florian Weimer wrote:
> On 11/22/2016 10:41 AM, Torvald Riegel wrote:
> 
> >> And unlike code
> >> using incorrect synchronization or legacy atomics, we don't have a path
> >> towards the C11 model because what the code does is completely outside
> >> it, so there's no argument on favor of gradual/partial conversion.
> >
> > The HW transaction has to synchronize with nontransactional code.  We're
> > obviously going to use the C11 model for nontransactional code.  Thus,
> > it cannot be unrelated.  It's true that we don't have a detailed
> > description of how HW transactions *extend* the C11 model, but it
> > clearly would be an extension because it simply has to fit in there and
> > it will use the same underlying mechanisms.
> >
> > One can also think about it this way: There is nontransactional code
> > that has to use atomics (because there are concurrent accesses from
> > other nontransactional code as well as transactional code).  So we are
> > going to use atomics there.  Then, even if just for consistency, we're
> > going to use (relaxed MO) atomic accesses too for all other concurrent
> > accesses to the same data.
> 
> The counter-argument to that this is until we have a specification of 
> how this kind of transactional memory works, we don't know what it means 
> if an object is accessed from both a transaction and and other 
> non-transactional code, and what external constraints are needed to make 
> such access valid (if any).

Lock elision relies on this kind of mixed access.  While it would be
nice to have a common formal model for the various HTMs out there from
the perspective of a C11/C++11 memory model setting, I don't think it's
a big problem right now that we don't (AFAIK) have such a formal model.
Lock elision should be well understood, so I'm not worried about any
surprises regarding this use case.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision
  2016-11-22 13:45             ` Torvald Riegel
@ 2016-11-22 15:02               ` Florian Weimer
  2016-11-22 16:58                 ` Torvald Riegel
  0 siblings, 1 reply; 22+ messages in thread
From: Florian Weimer @ 2016-11-22 15:02 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: munroesj, Rajalakshmi Srinivasaraghavan, libc-alpha,
	aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos,
	adhemerval.zanella, adconrad, wschmidt

On 11/22/2016 02:45 PM, Torvald Riegel wrote:

> Lock elision relies on this kind of mixed access.  While it would be
> nice to have a common formal model for the various HTMs out there from
> the perspective of a C11/C++11 memory model setting, I don't think it's
> a big problem right now that we don't (AFAIK) have such a formal model.
> Lock elision should be well understood, so I'm not worried about any
> surprises regarding this use case.

You are the expert.  I'm just surprised we embrace a wild-west approach 
to P&C again, while trying to convince others to argue from the 
definitions rather than from gut feeling.

That being said, I don't have a strong opinion regarding the style to 
use in this case.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision
  2016-11-22 15:02               ` Florian Weimer
@ 2016-11-22 16:58                 ` Torvald Riegel
  2016-11-22 18:40                   ` Informal model for transactional memory (was: Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision) Florian Weimer
  0 siblings, 1 reply; 22+ messages in thread
From: Torvald Riegel @ 2016-11-22 16:58 UTC (permalink / raw)
  To: Florian Weimer
  Cc: munroesj, Rajalakshmi Srinivasaraghavan, libc-alpha,
	aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos,
	adhemerval.zanella, adconrad, wschmidt

On Tue, 2016-11-22 at 16:02 +0100, Florian Weimer wrote:
> On 11/22/2016 02:45 PM, Torvald Riegel wrote:
> 
> > Lock elision relies on this kind of mixed access.  While it would be
> > nice to have a common formal model for the various HTMs out there from
> > the perspective of a C11/C++11 memory model setting, I don't think it's
> > a big problem right now that we don't (AFAIK) have such a formal model.
> > Lock elision should be well understood, so I'm not worried about any
> > surprises regarding this use case.
> 
> You are the expert.  I'm just surprised we embrace a wild-west approach 
> to P&C again, while trying to convince others to argue from the 
> definitions instead of from gut feeling.

To avoid misunderstandings: I'm not arguing in favor of embracing any
wild-west approach.  It would be great if we had a formal model that can
cover more than one architecture's semantics.  Nonetheless, we don't
have it, and I don't see this as a big enough problem to prevent us from
trying to use lock elision.  Once someone comes up with a model, I
suppose we'd certainly try to use it if possible.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision
  2016-11-22  8:44     ` Torvald Riegel
  2016-11-22  8:55       ` Florian Weimer
@ 2016-11-22 18:03       ` Steven Munroe
  2016-11-23 11:30         ` Torvald Riegel
  1 sibling, 1 reply; 22+ messages in thread
From: Steven Munroe @ 2016-11-22 18:03 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Rajalakshmi Srinivasaraghavan, libc-alpha, aaron Sawdey,
	Ulrich Weigand, Steve Munroe, carlos, adhemerval.zanella,
	adconrad, wschmidt

On Tue, 2016-11-22 at 09:44 +0100, Torvald Riegel wrote:
> On Mon, 2016-11-21 at 17:42 -0600, Steven Munroe wrote:
> > On Thu, 2016-11-17 at 15:46 +0100, Torvald Riegel wrote:
> > > On Tue, 2016-11-15 at 21:42 +0530, Rajalakshmi Srinivasaraghavan wrote:
> > > > The initial problem reported was memory corruption in MongoDB only on 
> > > > Ubuntu 16.04 ppc64le(not on Ubuntu 15).
> > > > The problem was not easily recreatable and debugging with many tools and 
> > > > creative ideas, the problem is narrowed down to lock elision in glibc.
> > > > Ubuntu 16.04 has glibc 2.23 with lock elision enabled by default which 
> > > > made the testcase fail only on 16.04.
> > > > 
> > > > As stated in BZ 20822, short description is
> > > > 
> > > > "The update of *adapt_count after the release of the lock causes a race 
> > > > condition. Thread A unlocks, thread B continues and returns, destroying 
> > > > mutex on stack,
> > > >   then gets into another function, thread A writes to *adapt_count and 
> > > > corrupts stack.The window is very narrow and requires that the
> > > > machine be in smt mode, so likely the two threads have to be on the same 
> > > > core in order to hit this"
> > > > 
> > > > Looking back the file changes in 
> > > > sysdeps/unix/sysv/linux/powerpc/elision-unlock.c, suspecting the commit
> > > >   https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=fadd2ad9cc36115440d50b0eae9299e65988917d which is introduced to improve
> > > > the performance.
> > > > 
> > > > Thinking of the following fixes.
> > > > 
> > > > 1) Update of adapt_count before the unlock. But I have still not 
> > > > identified if its going to affect the performance.
> > > 
> > > Because of the requirements on mutex destruction (and same for rwlock
> > > etc.), a thread must not access any data associated with the mutex after
> > > having unlocked it (this includes reads because it's correct to also
> > > unmap the memory holding the mutex after destroying the mutex). 
> > > 
> > > > 2) In elision-lock.c/elision-trylock.c update adapt_count right after 
> > > > the lock acquire at the end of the function.
> > > > In either case we have to be careful not to introduce another race 
> > > > condition.
> > > 
> > > Every access to mutex data outside of transactions must use atomic
> > > operations if there would be data races otherwise (according to the C11
> > > memory model).  I know that this isn't followed in the current mutex
> > > code, but it really should be, and your change qualifies as new code so
> > > please take care of this too.
> > > 
> > In this case (with the proposed patch) the adapt_count update is within
> > the critical region and adding C11 atomics does not add value and would
> > generally make the logic harder to read.
> > 
> > __atomic_store_n (adapt_count, (__atomic_load_n(adapt_count,
> > __ATOMIC_RELAXED) - 1), __ATOMIC_RELAXED)
> > 
> > Is a lot of syntax without any real value.
> 
> First, it would actually be this in glibc:
> atomic_store_relaxed (&adapt_count,
>   atomic_load_relaxed(&adapt_count) - 1);
> 
This is not a good example for your case. The proposed patch:

-      lll_unlock ((*lock), pshared);
-
-      /* Update the adapt count AFTER completing the critical section.
-         Doing this here prevents unneeded stalling when entering
-         a critical section.  Saving about 8% runtime on P8.  */
+      /* Update the adapt count in the critical section to
+         prevent race condition as mentioned in BZ 20822.  */
       if (*adapt_count > 0)
 	(*adapt_count)--;
+      lll_unlock ((*lock), pshared);

makes it clear that this test and update of the adapt_count are contained
within the acquire/release semantics of the lll_lock and lll_unlock.

Also, if we add the explicit syntax you propose, someone who has not
studied the architecture might leap to the conclusion that add_fetch
would be even better.

Which would be absolutely wrong. This would generate a larx-add-stcx
sequence which has two very bad side-effects.

1) on the fallback leg, it would force a larx-hit-stcx flush hazard
with the following lll_unlock and cause bad performance.
2) on the transaction leg, it would be redundant and just slow things
down.

There are some other cases, like the if test at the start of
__lll_lock_elision where it might be appropriate.

> Second, it does add value in that it makes it clear that there are
> concurrent accesses elsewhere that are not in transactions.  We have
> consensus to use the new C11-like atomics in new or changed code, and I
> don't see a reason to make an exception here.
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Informal model for transactional memory (was: Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision)
  2016-11-22 16:58                 ` Torvald Riegel
@ 2016-11-22 18:40                   ` Florian Weimer
  2016-11-22 20:15                     ` Steven Munroe
  2016-11-23 14:29                     ` Torvald Riegel
  0 siblings, 2 replies; 22+ messages in thread
From: Florian Weimer @ 2016-11-22 18:40 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: munroesj, Rajalakshmi Srinivasaraghavan, libc-alpha,
	aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos,
	adhemerval.zanella, adconrad, wschmidt

Torvald,

I'm sorry for my earlier rather destructive comments.

Do you think those of us who do not work on the implementations of the 
libpthread concurrency primitives need to deal with transactional memory 
issues at all?

The libc-internal locks do not even implement elision, and if we use 
hardware transactional memory implementations for lock elision only and 
that elision is semantically transparent, then the details would not 
matter to the rest of glibc.  We just have to think about the locking.

Is this the right approach?

Obviously, we will have to consider this once we start using transactional 
memory for anything else besides lock elision, but this seems to be 
rather far away at this point.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Informal model for transactional memory (was: Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision)
  2016-11-22 18:40                   ` Informal model for transactional memory (was: Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision) Florian Weimer
@ 2016-11-22 20:15                     ` Steven Munroe
  2016-11-23 14:29                     ` Torvald Riegel
  1 sibling, 0 replies; 22+ messages in thread
From: Steven Munroe @ 2016-11-22 20:15 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Torvald Riegel, Rajalakshmi Srinivasaraghavan, libc-alpha,
	aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos,
	adhemerval.zanella, adconrad, wschmidt

On Tue, 2016-11-22 at 19:40 +0100, Florian Weimer wrote:
> Torvald,
> 
> I'm sorry for my earlier rather destructive comments.
> 
> Do you think those of us who do not work on the implementations of the 
> libpthread concurrency primitives need to deal with transactional memory 
> issues at all?
> 
Well ... transactional memory is directly available to applications via builtins provided by GCC.

https://gcc.gnu.org/onlinedocs/gcc-6.2.0/gcc/x86-transactional-memory-intrinsics.html#x86-transactional-memory-intrinsics
https://gcc.gnu.org/onlinedocs/gcc-6.2.0/gcc/PowerPC-Hardware-Transactional-Memory-Built-in-Functions.html#PowerPC-Hardware-Transactional-Memory-Built-in-Functions
https://gcc.gnu.org/onlinedocs/gcc-6.2.0/gcc/S_002f390-System-z-Built-in-Functions.html#S_002f390-System-z-Built-in-Functions

The PowerPC and S390 APIs are closely aligned. So anyone with access to Power8/OpenPOWER, Z12, or Haswell can use it.

So yes.

> 
> 
> The libc-internal locks do not even implement elision, and if we use 
> hardware transactional memory implementations for lock elision only and 
> that elision is semantically transparent, then the details would not 
> matter to the rest of glibc.  We just have to think about the locking.
> 
> Is this the right approach?
> 
> Obviously, we will have to consider once we start using transactional 
> memory for anything else beside lock elision, but this seems to be 
> rather far away at this point.

> Thanks,
> Florian
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision
  2016-11-22 18:03       ` BZ 20822 :powerpc: race condition in __lll_unlock_elision Steven Munroe
@ 2016-11-23 11:30         ` Torvald Riegel
  2016-11-23 17:02           ` Steven Munroe
  0 siblings, 1 reply; 22+ messages in thread
From: Torvald Riegel @ 2016-11-23 11:30 UTC (permalink / raw)
  To: munroesj
  Cc: Rajalakshmi Srinivasaraghavan, libc-alpha, aaron Sawdey,
	Ulrich Weigand, Steve Munroe, carlos, adhemerval.zanella,
	adconrad, wschmidt


On Tue, 2016-11-22 at 12:03 -0600, Steven Munroe wrote:
> On Tue, 2016-11-22 at 09:44 +0100, Torvald Riegel wrote:
> > On Mon, 2016-11-21 at 17:42 -0600, Steven Munroe wrote:
> > > On Thu, 2016-11-17 at 15:46 +0100, Torvald Riegel wrote:
> > > > On Tue, 2016-11-15 at 21:42 +0530, Rajalakshmi Srinivasaraghavan wrote:
> > > > > The initial problem reported was memory corruption in MongoDB only on 
> > > > > Ubuntu 16.04 ppc64le(not on Ubuntu 15).
> > > > > The problem was not easily recreatable and debugging with many tools and 
> > > > > creative ideas, the problem is narrowed down to lock elision in glibc.
> > > > > Ubuntu 16.04 has glibc 2.23 with lock elision enabled by default which 
> > > > > made the testcase fail only on 16.04.
> > > > > 
> > > > > As stated in BZ 20822, short description is
> > > > > 
> > > > > "The update of *adapt_count after the release of the lock causes a race 
> > > > > condition. Thread A unlocks, thread B continues and returns, destroying 
> > > > > mutex on stack,
> > > > >   then gets into another function, thread A writes to *adapt_count and 
> > > > > corrupts stack.The window is very narrow and requires that the
> > > > > machine be in smt mode, so likely the two threads have to be on the same 
> > > > > core in order to hit this"
> > > > > 
> > > > > Looking back the file changes in 
> > > > > sysdeps/unix/sysv/linux/powerpc/elision-unlock.c, suspecting the commit
> > > > >   https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=fadd2ad9cc36115440d50b0eae9299e65988917d which is introduced to improve
> > > > > the performance.
> > > > > 
> > > > > Thinking of the following fixes.
> > > > > 
> > > > > 1) Update of adapt_count before the unlock. But I have still not 
> > > > > identified if its going to affect the performance.
> > > > 
> > > > Because of the requirements on mutex destruction (and same for rwlock
> > > > etc.), a thread must not access any data associated with the mutex after
> > > > having unlocked it (this includes reads because it's correct to also
> > > > unmap the memory holding the mutex after destroying the mutex). 
> > > > 
> > > > > 2) In elision-lock.c/elision-trylock.c update adapt_count right after 
> > > > > the lock acquire at the end of the function.
> > > > > In either case we have to be careful not to introduce another race 
> > > > > condition.
> > > > 
> > > > Every access to mutex data outside of transactions must use atomic
> > > > operations if there would be data races otherwise (according to the C11
> > > > memory model).  I know that this isn't followed in the current mutex
> > > > code, but it really should be, and your change qualifies as new code so
> > > > please take care of this too.
> > > > 
> > > In this case (with the proposed patch) the adapt_count update is within
> > > the critical region and adding C11 atomics does not add value and would
> > > generally make the logic harder to read.
> > > 
> > > __atomic_store_n (adapt_count, (__atomic_load_n(adapt_count,
> > > __ATOMIC_RELAXED) - 1), __ATOMIC_RELAXED)
> > > 
> > > Is a lot of syntax without any real value.
> > 
> > First, it would actually be this in glibc:
> > atomic_store_relaxed (&adapt_count,
> >   atomic_load_relaxed(&adapt_count) - 1);
> > 
> This is not a good example for your case. The proposed patch:
> 
> -      lll_unlock ((*lock), pshared);
> -
> -      /* Update the adapt count AFTER completing the critical section.
> -         Doing this here prevents unneeded stalling when entering
> -         a critical section.  Saving about 8% runtime on P8.  */
> +      /* Update the adapt count in the critical section to
> +         prevent race condition as mentioned in BZ 20822.  */
>        if (*adapt_count > 0)
>  	(*adapt_count)--;
> +      lll_unlock ((*lock), pshared);
> 
> makes it clear that this test and update of the adapt_count are contained
> within the acquire/release semantics of the lll_lock and lll_unlock.

However, it does not make it clear that it should be accessed atomically
outside of transactions.

Another reason why we want to use atomic operations consistently is to
at least allow us to really use C11 atomics in the future (or some sort
of similar underlying type), so that we can more easily catch cases
where atomic operations should be used but aren't.  If we don't
specifically use relaxed-MO operations, the default will be SC.  I'm
sure you'd like to avoid that too.

This is a consistency argument wrt other code.  That the atomic
operation also happens to be inside a transaction is just another detail.
We don't want people to assume that plain memory accesses are
sufficient, for example because all accesses to adapt_count are within
transactions.

> Also if we add the explicit syntax you propose, someone, who has not
> studied the architecture, might leap to the conclusion that add_fetch
> would be even better.

I disagree.  If you look at the rules we've set up for concurrent code
in general (eg, see "Concurrency" on the wiki, or the concurrent code we
have converted), we require developers to document what the code does;
this means memory order and operations used, and the high-level intent.
If you do this properly, the misunderstandings you worry about should
not happen.

> Which would be absolutely wrong. This would generate a larx-add-stcx
> sequence which has two very bad side-effects.
> 
> 1) on the fallback leg, it would force a larx-hit-stcx flush hazard
> with the following lll_unlock and cause bad performance.
> 2) on the transaction leg, it would be redundant and just slow things
> down.

So why not add this as a comment?  The use of nonatomic accesses does
not imply these points, BTW.

Also, the same concerns for how to access adapt_count apply to other
accesses to it that may or may not be in a transaction (due to nesting).
You must use atomic accesses for these, so you'd need the explaining
comment anyway.

On a positive note, it seems that we agree that we want clarity of
the source code.  I'm arguing that more comments and a consistent use of
atomic operations (with proper memory orders and comments, of course)
will get us there.  That's why I suggested that the patch should clean
up and document all accesses to adapt_count.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Informal model for transactional memory (was: Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision)
  2016-11-22 18:40                   ` Informal model for transactional memory (was: Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision) Florian Weimer
  2016-11-22 20:15                     ` Steven Munroe
@ 2016-11-23 14:29                     ` Torvald Riegel
  1 sibling, 0 replies; 22+ messages in thread
From: Torvald Riegel @ 2016-11-23 14:29 UTC (permalink / raw)
  To: Florian Weimer
  Cc: munroesj, Rajalakshmi Srinivasaraghavan, libc-alpha,
	aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos,
	adhemerval.zanella, adconrad, wschmidt

On Tue, 2016-11-22 at 19:40 +0100, Florian Weimer wrote:
> I'm sorry for my earlier rather destructive comments.

No worries.  I guess I also wasn't clear enough in my statements (ie,
that I think the status quo is not too risky right now, instead of
thinking that the status quo will be fine forever).

> Do you think those of us who do not work on the implementations of the 
> libpthread concurrency primitives need to deal with transactional memory 
> issues at all?

Not in the near future.

> The libc-internal locks do not even implement elision, and if we use 
> hardware transactional memory implementations for lock elision only and 
> that elision is semantically transparent, then the details would not 
> matter to the rest of glibc.  We just have to think about the locking.
> 
> Is this the right approach?

Yes, exactly.  That's why I'm not concerned about the status quo wrt TM
semantics right now; I believe we know how to implement lock elision
with sufficient confidence, and we wouldn't expose TM outside of
libpthread or arch-maintainer code except through lock elision or
something that uses essentially the same approach.

Generally, current TMs don't guarantee transactions to succeed, so we
need a fallback mechanism to synchronize when transactions do not work,
and for current glibc this would mean locks (STMs or HyTMs probably
don't buy us much for glibc-internal use cases right now, at least
compared to lock elision if a decent locking scheme is possible).  Thus,
we'd use lock elision on internal locks, potentially with additional
performance hints aimed at the lock elision implementation.

(There are TM modes in some implementations that guarantee completion of
restricted transactions if there is no contention.  But these are really
restricted, and you need to tightly control the code that these
transactions execute, as well as avoid potential interference with other
threads (eg, false sharing).  The restrictions are tighter than what
AMD's ASF proposed a few years ago, for example.  Thus, these would be
special-case implementations in arch-specific code only.)

> Obviously, we will have to consider once we start using transactional 
> memory for anything else beside lock elision, but this seems to be 
> rather far away at this point.

Yes.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision
  2016-11-23 11:30         ` Torvald Riegel
@ 2016-11-23 17:02           ` Steven Munroe
  2016-11-23 18:35             ` Torvald Riegel
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Munroe @ 2016-11-23 17:02 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Rajalakshmi Srinivasaraghavan, libc-alpha, aaron Sawdey,
	Ulrich Weigand, Steve Munroe, carlos, adhemerval.zanella,
	adconrad, wschmidt

On Wed, 2016-11-23 at 12:25 +0100, Torvald Riegel wrote:
> On Tue, 2016-11-22 at 12:03 -0600, Steven Munroe wrote:
> > On Tue, 2016-11-22 at 09:44 +0100, Torvald Riegel wrote:
> > > On Mon, 2016-11-21 at 17:42 -0600, Steven Munroe wrote:
> > > > On Thu, 2016-11-17 at 15:46 +0100, Torvald Riegel wrote:
> > > > > On Tue, 2016-11-15 at 21:42 +0530, Rajalakshmi Srinivasaraghavan wrote:
> > > > > > The initial problem reported was memory corruption in MongoDB only on 
> > > > > > Ubuntu 16.04 ppc64le(not on Ubuntu 15).
> > > > > > The problem was not easily recreatable and debugging with many tools and 
> > > > > > creative ideas, the problem is narrowed down to lock elision in glibc.
> > > > > > Ubuntu 16.04 has glibc 2.23 with lock elision enabled by default which 
> > > > > > made the testcase fail only on 16.04.
> > > > > > 
> > > > > > As stated in BZ 20822, short description is
> > > > > > 
> > > > > > "The update of *adapt_count after the release of the lock causes a race 
> > > > > > condition. Thread A unlocks, thread B continues and returns, destroying 
> > > > > > mutex on stack,
> > > > > >   then gets into another function, thread A writes to *adapt_count and 
> > > > > > corrupts stack.The window is very narrow and requires that the
> > > > > > machine be in smt mode, so likely the two threads have to be on the same 
> > > > > > core in order to hit this"
> > > > > > 
> > > > > > Looking back the file changes in 
> > > > > > sysdeps/unix/sysv/linux/powerpc/elision-unlock.c, suspecting the commit
> > > > > >   https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=fadd2ad9cc36115440d50b0eae9299e65988917d which is introduced to improve
> > > > > > the performance.
> > > > > > 
> > > > > > Thinking of the following fixes.
> > > > > > 
> > > > > > 1) Update of adapt_count before the unlock. But I have still not 
> > > > > > identified if its going to affect the performance.
> > > > > 
> > > > > Because of the requirements on mutex destruction (and same for rwlock
> > > > > etc.), a thread must not access any data associated with the mutex after
> > > > > having unlocked it (this includes reads because it's correct to also
> > > > > unmap the memory holding the mutex after destroying the mutex). 
> > > > > 
> > > > > > 2) In elision-lock.c/elision-trylock.c update adapt_count right after 
> > > > > > the lock acquire at the end of the function.
> > > > > > In either case we have to be careful not to introduce another race 
> > > > > > condition.
> > > > > 
> > > > > Every access to mutex data outside of transactions must use atomic
> > > > > operations if there would be data races otherwise (according to the C11
> > > > > memory model).  I know that this isn't followed in the current mutex
> > > > > code, but it really should be, and your change qualifies as new code so
> > > > > please take care of this too.
> > > > > 
> > > > In this case (with the proposed patch) the adapt_count update is within
> > > > the critical region and adding C11 atomics does not add value and would
> > > > generally make the logic harder to read.
> > > > 
> > > > __atomic_store_n (adapt_count, (__atomic_load_n(adapt_count,
> > > > __ATOMIC_RELAXED) - 1), __ATOMIC_RELAXED)
> > > > 
> > > > Is a lot of syntax without any real value.
> > > 
> > > First, it would actually be this in glibc:
> > > atomic_store_relaxed (&adapt_count,
> > >   atomic_load_relaxed(&adapt_count) - 1);
> > > 
> > This is not a good example for your case. The proposed patch:
> > 
> > -      lll_unlock ((*lock), pshared);
> > -
> > -      /* Update the adapt count AFTER completing the critical section.
> > -         Doing this here prevents unneeded stalling when entering
> > -         a critical section.  Saving about 8% runtime on P8.  */
> > +      /* Update the adapt count in the critical section to
> > +         prevent race condition as mentioned in BZ 20822.  */
> >        if (*adapt_count > 0)
> >  	(*adapt_count)--;
> > +      lll_unlock ((*lock), pshared);
> > 
> > makes it clear that this test and update of the adapt_count are contained
> > within the acquire/release semantics of the lll_lock and lll_unlock.
> 
> However, it does not make it clear that it should be accessed atomically
> outside of transactions.
> 
> Another reason why we want to use atomic operations consistently is to
> at least allow us to really use C11 atomics in the future (or some sort
> of similar underlying type), so that we can more easily catch cases
> where atomic operations should be used but aren't.  If we don't
> specifically use relaxed-MO operations, the default will be SC.  I'm
> sure you'd like to avoid that too.
> 
> This is a consistency argument wrt other code.  The fact that the atomic
> operation is in a transaction too is just another fact.
> We don't want people to assume that plain memory accesses are
> sufficient, for example because all accesses to adapt_count are within
> transactions.
> 
> > Also if we add the explicit syntax you propose, someone, who has not
> > studied the architecture, might leap to the conclusion that add_fetch
> > would be even better.
> 
> I disagree.  If you look at the rules we've set up for concurrent code
> in general (eg, see "Concurrency" on the wiki, or the concurrent code we
> have converted), we require developers to document what the code does;
> this means memory order and operations used, and the high-level intent.
> If you do this properly, the misunderstandings you worry about should
> not happen.
> 
I disagree; your definition of consistency is in this case misleading
for my platform.

> > Which would be absolutely wrong. This would generate a larx-add-stcx
> > sequence which has two very bad side-effects.
> > 
> > 1) on the fallback leg, it would force a larx-hit-stcx flush hazard
> > with the following lll_unlock and cause bad performance.
> > 2) on the transaction leg, it would be redundant and just slow things
> > down.
> 
> So why not add this as a comment?  The use of nonatomic accesses does
> not imply these points, BTW.
> 
So a comment to the effect that explicit atomics are not required for
this case because it is within the critical region of the lll_lock /
lll_unlock.

That should satisfy your concern, and mine.

> Also, the same concerns for how to access adapt_count apply to other
> accesses to it that may or may not be in a transaction (due to nesting).
> You must use atomic accesses for these, so you'd need the explaining
> comment anyway.
> 
Yes, I agree that accesses to adapt_count where the context is not clear
should use atomic_load_relaxed / atomic_store_relaxed in the elision
code.

> On a positive note, it seems that we agree that we want clarity of
> the source code.  I'm arguing that more comments and a consistent use of
> atomic operations (with proper memory orders and comments, of course)
> will get us there.  That's why I suggested that the patch should clean
> up and document all accesses to adapt_count.
> 
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision
  2016-11-23 17:02           ` Steven Munroe
@ 2016-11-23 18:35             ` Torvald Riegel
  0 siblings, 0 replies; 22+ messages in thread
From: Torvald Riegel @ 2016-11-23 18:35 UTC (permalink / raw)
  To: munroesj
  Cc: Rajalakshmi Srinivasaraghavan, libc-alpha, aaron Sawdey,
	Ulrich Weigand, Steve Munroe, carlos, adhemerval.zanella,
	adconrad, wschmidt

On Wed, 2016-11-23 at 11:01 -0600, Steven Munroe wrote:
> On Wed, 2016-11-23 at 12:25 +0100, Torvald Riegel wrote:
> > On Tue, 2016-11-22 at 12:03 -0600, Steven Munroe wrote:
> > > On Tue, 2016-11-22 at 09:44 +0100, Torvald Riegel wrote:
> > > > On Mon, 2016-11-21 at 17:42 -0600, Steven Munroe wrote:
> > > > > On Thu, 2016-11-17 at 15:46 +0100, Torvald Riegel wrote:
> > > > > > On Tue, 2016-11-15 at 21:42 +0530, Rajalakshmi Srinivasaraghavan wrote:
> > > > > > > The initial problem reported was memory corruption in MongoDB only on 
> > > > > > > Ubuntu 16.04 ppc64le(not on Ubuntu 15).
> > > > > > > The problem was not easily recreatable and debugging with many tools and 
> > > > > > > creative ideas, the problem is narrowed down to lock elision in glibc.
> > > > > > > Ubuntu 16.04 has glibc 2.23 with lock elision enabled by default which 
> > > > > > > made the testcase fail only on 16.04.
> > > > > > > 
> > > > > > > As stated in BZ 20822, short description is
> > > > > > > 
> > > > > > > "The update of *adapt_count after the release of the lock causes a race 
> > > > > > > condition. Thread A unlocks, thread B continues and returns, destroying 
> > > > > > > mutex on stack,
> > > > > > >   then gets into another function, thread A writes to *adapt_count and 
> > > > > > > corrupts stack.The window is very narrow and requires that the
> > > > > > > machine be in smt mode, so likely the two threads have to be on the same 
> > > > > > > core in order to hit this"
> > > > > > > 
> > > > > > > Looking back the file changes in 
> > > > > > > sysdeps/unix/sysv/linux/powerpc/elision-unlock.c, suspecting the commit
> > > > > > >   https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=fadd2ad9cc36115440d50b0eae9299e65988917d which is introduced to improve
> > > > > > > the performance.
> > > > > > > 
> > > > > > > Thinking of the following fixes.
> > > > > > > 
> > > > > > > 1) Update of adapt_count before the unlock. But I have still not 
> > > > > > > identified if its going to affect the performance.
> > > > > > 
> > > > > > Because of the requirements on mutex destruction (and same for rwlock
> > > > > > etc.), a thread must not access any data associated with the mutex after
> > > > > > having unlocked it (this includes reads because it's correct to also
> > > > > > unmap the memory holding the mutex after destroying the mutex). 
> > > > > > 
> > > > > > > 2) In elision-lock.c/elision-trylock.c update adapt_count right after 
> > > > > > > the lock acquire at the end of the function.
> > > > > > > In either case we have to be careful not to introduce another race 
> > > > > > > condition.
> > > > > > 
> > > > > > Every access to mutex data outside of transactions must use atomic
> > > > > > operations if there would be data races otherwise (according to the C11
> > > > > > memory model).  I know that this isn't followed in the current mutex
> > > > > > code, but it really should be, and your change qualifies as new code so
> > > > > > please take care of this too.
> > > > > > 
> > > > > In this case (with the proposed patch) the adapt_count update is within
> > > > > the critical region and adding C11 atomics does not add value and would
> > > > > generally make the logic harder to read.
> > > > > 
> > > > > __atomic_store_n (adapt_count, (__atomic_load_n(adapt_count,
> > > > > __ATOMIC_RELAXED) - 1), __ATOMIC_RELAXED)
> > > > > 
> > > > > Is a lot of syntax without any real value.
> > > > 
> > > > First, it would actually be this in glibc:
> > > > atomic_store_relaxed (&adapt_count,
> > > >   atomic_load_relaxed(&adapt_count) - 1);
> > > > 
> > > This is not a good example for your case. The proposed patch:
> > > 
> > > -      lll_unlock ((*lock), pshared);
> > > -
> > > -      /* Update the adapt count AFTER completing the critical section.
> > > -         Doing this here prevents unneeded stalling when entering
> > > -         a critical section.  Saving about 8% runtime on P8.  */
> > > +      /* Update the adapt count in the critical section to
> > > +         prevent race condition as mentioned in BZ 20822.  */
> > >        if (*adapt_count > 0)
> > >  	(*adapt_count)--;
> > > +      lll_unlock ((*lock), pshared);
> > > 
> > > makes it clear that this test and update of the adapt_count are contained
> > > within the acquire/release semantics of the lll_lock and lll_unlock.
> > 
> > However, it does not make it clear that it should be accessed atomically
> > outside of transactions.
> > 
> > Another reason why we want to use atomic operations consistently is to
> > at least allow us to really use C11 atomics in the future (or some sort
> > of similar underlying type), so that we can more easily catch cases
> > where atomic operations should be used but aren't.  If we don't
> > specifically use relaxed-MO operations, the default will be SC.  I'm
> > sure you'd like to avoid that too.
> > 
> > This is a consistency argument wrt other code.  The fact that the atomic
> > operation is in a transaction too is just another fact.
> > We don't want people to assume that plain memory accesses are
> > sufficient, for example because all accesses to adapt_count are within
> > transactions.
> > 
> > > Also if we add the explicit syntax you propose, someone who has not
> > > studied the architecture might leap to the conclusion that add_fetch
> > > would be even better.
> > 
> > I disagree.  If you look at the rules we've set up for concurrent code
> > in general (eg, see "Concurrency" on the wiki, or the concurrent code we
> > have converted), we require developers to document what the code does;
> > this means memory order and operations used, and the high-level intent.
> > If you do this properly, the misunderstandings you worry about should
> > not happen.
> > 
> I disagree; your definition of consistency is in this case misleading
> for my platform.
> 
> > > Which would be absolutely wrong. This would generate a larx-add-stcx
> > > sequence which has two very bad side-effects.
> > > 
> > > 1) on the fall-back leg, it would force a larx-hit-stcx flush hazard
> > > with the following lll_unlock and cause bad performance.
> > > 2) on the transaction leg, it would be redundant and just slow things
> > > down.
> > 
> > So why not add this as a comment?  The use of nonatomic accesses does
> > not imply these points, BTW.
> > 
> So a comment to the effect that explicit atomics are not required for
> this case because it is within the critical region of the lll_lock /
> lll_unlock.

To be clear:

I want to see relaxed MO atomics being used.

I would appreciate comments with the note about why using
> read-modify-write would not be helpful (i.e., what you wrote above), and
why using atomic loads and stores would not be strictly necessary for
those particular accesses from *purely* an atomicity perspective, but
that we still use them for consistency with other accesses outside of
transactions to adapt_count.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision
  2016-11-17 14:47 ` Torvald Riegel
  2016-11-21 23:42   ` Steven Munroe
@ 2016-12-04 12:14   ` Torvald Riegel
  1 sibling, 0 replies; 22+ messages in thread
From: Torvald Riegel @ 2016-12-04 12:14 UTC (permalink / raw)
  To: Rajalakshmi Srinivasaraghavan
  Cc: libc-alpha, aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos,
	adhemerval.zanella, adconrad, wschmidt

On Thu, 2016-11-17 at 15:46 +0100, Torvald Riegel wrote:
> On Tue, 2016-11-15 at 21:42 +0530, Rajalakshmi Srinivasaraghavan wrote:
> > The initial problem reported was memory corruption in MongoDB only on
> > Ubuntu 16.04 ppc64le (not on Ubuntu 15).
> > The problem was not easy to reproduce; after debugging with many tools
> > and creative ideas, it was narrowed down to lock elision in glibc.
> > Ubuntu 16.04 has glibc 2.23 with lock elision enabled by default,
> > which made the testcase fail only on 16.04.
> > 
> > As stated in BZ 20822, the short description is
> > 
> > "The update of *adapt_count after the release of the lock causes a race
> > condition. Thread A unlocks; thread B continues and returns, destroying
> > the mutex on its stack, then gets into another function; thread A writes
> > to *adapt_count and corrupts the stack. The window is very narrow and
> > requires that the machine be in SMT mode, so likely the two threads
> > have to be on the same core in order to hit this"
> > 
> > Looking back at the file changes in
> > sysdeps/unix/sysv/linux/powerpc/elision-unlock.c, we suspect commit
> > https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=fadd2ad9cc36115440d50b0eae9299e65988917d,
> > which was introduced to improve performance.
> > 
> > Thinking of the following fixes.
> > 
> > 1) Update adapt_count before the unlock.  But I have not yet
> > determined whether this will affect performance.
> 
> Because of the requirements on mutex destruction (and same for rwlock
> etc.), a thread must not access any data associated with the mutex after
> having unlocked it (this includes reads because it's correct to also
> unmap the memory holding the mutex after destroying the mutex). 
> 
> > 2) In elision-lock.c/elision-trylock.c, update adapt_count right after
> > the lock is acquired, at the end of the function.
> > In either case we have to be careful not to introduce another race 
> > condition.
> 
> Every access to mutex data outside of transactions must use atomic
> operations if there would be data races otherwise (according to the C11
> memory model).  I know that this isn't followed in the current mutex
> code, but it really should be, and your change qualifies as new code so
> please take care of this too.
> 
> I also think that if we can have concurrent accesses to a particular
> variable both inside and outside of transactions, we should use atomic
> accesses inside the transactions too, just for consistency.

To illustrate, please have a look at this patch:
https://sourceware.org/ml/libc-alpha/2016-12/msg00041.html

It changes x86_64's lock elision implementation to use atomic accesses
for adapt_count.  (It adds shorter-sized atomic_load_* and atomic_store_*
functions, but you don't even need that on powerpc right now I think
because powerpc sets USE_ATOMIC_COMPILER_BUILTINS to 0 and thus does not
check sizeof on atomic loads and stores.)


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2016-12-04 12:14 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-15 16:13 BZ 20822 :powerpc: race condition in __lll_unlock_elision Rajalakshmi Srinivasaraghavan
2016-11-16 13:40 ` Adhemerval Zanella
2016-11-16 14:08   ` Szabolcs Nagy
2016-11-16 14:31     ` Adhemerval Zanella
2016-11-16 15:54       ` Rajalakshmi Srinivasaraghavan
2016-11-17 14:47 ` Torvald Riegel
2016-11-21 23:42   ` Steven Munroe
2016-11-22  8:44     ` Torvald Riegel
2016-11-22  8:55       ` Florian Weimer
2016-11-22  9:41         ` Torvald Riegel
2016-11-22 10:52           ` Florian Weimer
2016-11-22 13:45             ` Torvald Riegel
2016-11-22 15:02               ` Florian Weimer
2016-11-22 16:58                 ` Torvald Riegel
2016-11-22 18:40                   ` Informal model for transactional memory (was: Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision) Florian Weimer
2016-11-22 20:15                     ` Steven Munroe
2016-11-23 14:29                     ` Torvald Riegel
2016-11-22 18:03       ` BZ 20822 :powerpc: race condition in __lll_unlock_elision Steven Munroe
2016-11-23 11:30         ` Torvald Riegel
2016-11-23 17:02           ` Steven Munroe
2016-11-23 18:35             ` Torvald Riegel
2016-12-04 12:14   ` Torvald Riegel
