* BZ 20822 :powerpc: race condition in __lll_unlock_elision
From: Rajalakshmi Srinivasaraghavan @ 2016-11-15 16:13 UTC
To: libc-alpha
Cc: Aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos, adhemerval.zanella, adconrad, wschmidt

The initial problem reported was memory corruption in MongoDB only on Ubuntu 16.04 ppc64le (not on Ubuntu 15). The problem was not easily recreatable; after debugging with many tools and creative ideas, it was narrowed down to lock elision in glibc. Ubuntu 16.04 has glibc 2.23 with lock elision enabled by default, which made the testcase fail only on 16.04.

As stated in BZ 20822, the short description is:

"The update of *adapt_count after the release of the lock causes a race condition. Thread A unlocks, thread B continues and returns, destroying the mutex on the stack, then gets into another function; thread A writes to *adapt_count and corrupts the stack. The window is very narrow and requires that the machine be in SMT mode, so likely the two threads have to be on the same core in order to hit this."

Looking back at the file changes in sysdeps/unix/sysv/linux/powerpc/elision-unlock.c, the suspect is commit https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=fadd2ad9cc36115440d50b0eae9299e65988917d, which was introduced to improve performance.

Thinking of the following fixes:

1) Update adapt_count before the unlock. But I have still not identified whether this is going to affect performance.

2) In elision-lock.c/elision-trylock.c, update adapt_count right after the lock acquire, at the end of the function.

In either case we have to be careful not to introduce another race condition.

Any suggestions?

--
Thanks
Rajalakshmi S

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision
From: Adhemerval Zanella @ 2016-11-16 13:40 UTC
To: Rajalakshmi Srinivasaraghavan, libc-alpha
Cc: Aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos, adconrad, wschmidt

On 15/11/2016 14:12, Rajalakshmi Srinivasaraghavan wrote:
>
> The initial problem reported was memory corruption in MongoDB only on
> Ubuntu 16.04 ppc64le (not on Ubuntu 15). The problem was not easily
> recreatable; after debugging with many tools and creative ideas, it was
> narrowed down to lock elision in glibc. Ubuntu 16.04 has glibc 2.23 with
> lock elision enabled by default, which made the testcase fail only on 16.04.
>
> As stated in BZ 20822, the short description is:
>
> "The update of *adapt_count after the release of the lock causes a race
> condition. Thread A unlocks, thread B continues and returns, destroying
> the mutex on the stack, then gets into another function; thread A writes
> to *adapt_count and corrupts the stack. The window is very narrow and
> requires that the machine be in SMT mode, so likely the two threads have
> to be on the same core in order to hit this."
>
> Looking back at the file changes in
> sysdeps/unix/sysv/linux/powerpc/elision-unlock.c, the suspect is commit
> https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=fadd2ad9cc36115440d50b0eae9299e65988917d,
> which was introduced to improve performance.
>
> Thinking of the following fixes:
>
> 1) Update adapt_count before the unlock. But I have still not identified
> whether this is going to affect performance.
>
> 2) In elision-lock.c/elision-trylock.c, update adapt_count right after
> the lock acquire, at the end of the function.
>
> In either case we have to be careful not to introduce another race condition.
>
> Any suggestions?

The fadd2ad9c commit indeed should have raised a flag, since it tries to update a variable ('adapt_count', the pthread_mutex_t.__data.__elision field) first within a transaction, where it should be semantically handled as an atomic access, and later in a non-atomic way. This is wrong, and I think we should either revert the patch or move the adapt_count update to inside the transaction (your first option).

However, I am not sure this is the culprit of the issue. The bug report states the faulty scenario is:

"[...] Thread A unlocks, thread B continues and returns, destroying mutex on stack, then gets into another function, thread A writes to *adapt_count and corrupts stack. [...]"

It is not clear whether, *after* thread B destroys the mutex, the program just calls 'pthread_mutex_destroy' or the thread also deallocates the mutex on the stack (by returning or something else). The first case should not trigger an invalid access, but the second is definitely a program error.

Either way, I am also not sure whether the current behaviour leads to a correctness issue rather than just a performance one: the racy 'adapt_count' usage would lead to TLE being avoided, not really to an invalid memory access.

^ permalink raw reply
* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision
From: Szabolcs Nagy @ 2016-11-16 14:08 UTC
To: Adhemerval Zanella, Rajalakshmi Srinivasaraghavan, libc-alpha
Cc: nd, Aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos, adconrad, wschmidt

On 16/11/16 13:40, Adhemerval Zanella wrote:
> "[...] Thread A unlocks, thread B continues and returns, destroying
> mutex on stack, then gets into another function, thread A writes to
> *adapt_count and corrupts stack. [...]"
>
> It is not clear whether, *after* thread B destroys the mutex, the
> program just calls 'pthread_mutex_destroy' or the thread also
> deallocates the mutex on the stack (by returning or something else).
> The first case should not trigger an invalid access, but the second
> is definitely a program error.

does not look like a program error to me:

  int i = 0;

  void threadB (void)
  {
    pthread_mutex_t m[1];
    //...
    pthread_mutex_lock (m);
    if (i == 1) {
      pthread_mutex_unlock (m);
      pthread_mutex_destroy (m); // ok, A is done with m
      return;                    // drop stack frame
    }
    //...
  }

  void threadA (void)
  {
    // get m from thread B
    pthread_mutex_lock (m);
    i = 1;
    pthread_mutex_unlock (m);
    // no more use of m here
  }

this code can corrupt the stack of threadB if pthread_mutex_unlock in threadA accesses m->adapt_count after m is unlocked.

(same for rwlocks and whatever else may be used with elision.)

^ permalink raw reply
* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision
From: Adhemerval Zanella @ 2016-11-16 14:31 UTC
To: Szabolcs Nagy, Rajalakshmi Srinivasaraghavan, libc-alpha
Cc: nd, Aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos, adconrad, wschmidt

On 16/11/2016 12:07, Szabolcs Nagy wrote:
> does not look like a program error to me:
>
> [... example with a stack-allocated mutex elided ...]
>
> this code can corrupt the stack of threadB if pthread_mutex_unlock
> in threadA accesses m->adapt_count after m is unlocked.
>
> (same for rwlocks and whatever else may be used with elision.)

If this is the scenario described in the bug report, it could be the cause of the issue. I was thinking of another scenario, where the mutex unlock is not synchronized through a global variable: code similar to what you described unlocks the mutex in threadA, threadB runs and finishes, and then threadA runs again and accesses invalid memory.

Anyway, I see the correct fix as simply moving the adapt_count update to inside the transaction.

^ permalink raw reply
* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision
From: Rajalakshmi Srinivasaraghavan @ 2016-11-16 15:54 UTC
To: Adhemerval Zanella, Szabolcs Nagy, libc-alpha
Cc: nd, Aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos, adconrad, wschmidt

On 11/16/2016 08:01 PM, Adhemerval Zanella wrote:
> Anyway, I see the correct fix as simply moving the adapt_count
> update to inside the transaction.

Thanks Adhemerval and Szabolcs for the inputs. I posted a patch on libc-alpha for the same. While this moves forward, I will try to analyze the performance impact, if any.

^ permalink raw reply
* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision
From: Torvald Riegel @ 2016-11-17 14:47 UTC
To: Rajalakshmi Srinivasaraghavan
Cc: libc-alpha, Aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos, adhemerval.zanella, adconrad, wschmidt

On Tue, 2016-11-15 at 21:42 +0530, Rajalakshmi Srinivasaraghavan wrote:
> [...]
> Thinking of the following fixes:
>
> 1) Update adapt_count before the unlock. But I have still not
> identified whether this is going to affect performance.

Because of the requirements on mutex destruction (and the same for rwlocks etc.), a thread must not access any data associated with the mutex after having unlocked it. This includes reads, because it is also correct to unmap the memory holding the mutex after destroying it.

> 2) In elision-lock.c/elision-trylock.c, update adapt_count right after
> the lock acquire, at the end of the function.
> In either case we have to be careful not to introduce another race
> condition.

Every access to mutex data outside of transactions must use atomic operations if there would be data races otherwise (according to the C11 memory model). I know that this isn't followed in the current mutex code, but it really should be, and your change qualifies as new code, so please take care of this too.

I also think that if we can have concurrent accesses to a particular variable both inside and outside of transactions, we should use atomic accesses inside the transactions too, just for consistency.

^ permalink raw reply
* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision
From: Steven Munroe @ 2016-11-21 23:42 UTC
To: Torvald Riegel
Cc: Rajalakshmi Srinivasaraghavan, libc-alpha, Aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos, adhemerval.zanella, adconrad, wschmidt

On Thu, 2016-11-17 at 15:46 +0100, Torvald Riegel wrote:
> [...]
> Every access to mutex data outside of transactions must use atomic
> operations if there would be data races otherwise (according to the C11
> memory model). I know that this isn't followed in the current mutex
> code, but it really should be, and your change qualifies as new code,
> so please take care of this too.

In this case (with the proposed patch) the adapt_count update is within the critical region, and adding C11 atomics does not add value and would generally make the logic harder to read.

  __atomic_store_n (adapt_count,
                    __atomic_load_n (adapt_count, __ATOMIC_RELAXED) - 1,
                    __ATOMIC_RELAXED)

is a lot of syntax without any real value.

> I also think that if we can have concurrent accesses to a particular
> variable both inside and outside of transactions, we should use atomic
> accesses inside the transactions too, just for consistency.

^ permalink raw reply
* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision
From: Torvald Riegel @ 2016-11-22 8:44 UTC
To: munroesj
Cc: Rajalakshmi Srinivasaraghavan, libc-alpha, Aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos, adhemerval.zanella, adconrad, wschmidt

On Mon, 2016-11-21 at 17:42 -0600, Steven Munroe wrote:
> [...]
> In this case (with the proposed patch) the adapt_count update is within
> the critical region, and adding C11 atomics does not add value and would
> generally make the logic harder to read.
>
>   __atomic_store_n (adapt_count,
>                     __atomic_load_n (adapt_count, __ATOMIC_RELAXED) - 1,
>                     __ATOMIC_RELAXED)
>
> is a lot of syntax without any real value.

First, it would actually be this in glibc:

  atomic_store_relaxed (&adapt_count,
                        atomic_load_relaxed (&adapt_count) - 1);

Second, it does add value in that it makes it clear that there are concurrent accesses elsewhere that are not in transactions. We have consensus to use the new C11-like atomics in new or changed code, and I don't see a reason to make an exception here.

^ permalink raw reply
* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision
From: Florian Weimer @ 2016-11-22 8:55 UTC
To: Torvald Riegel, munroesj
Cc: Rajalakshmi Srinivasaraghavan, libc-alpha, Aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos, adhemerval.zanella, adconrad, wschmidt

On 11/22/2016 09:44 AM, Torvald Riegel wrote:
>> In this case (with the proposed patch) the adapt_count update is within
>> the critical region, and adding C11 atomics does not add value and would
>> generally make the logic harder to read.
>>
>>   __atomic_store_n (adapt_count,
>>                     __atomic_load_n (adapt_count, __ATOMIC_RELAXED) - 1,
>>                     __ATOMIC_RELAXED)
>>
>> is a lot of syntax without any real value.
>
> First, it would actually be this in glibc:
>   atomic_store_relaxed (&adapt_count,
>                         atomic_load_relaxed (&adapt_count) - 1);
>
> Second, it does add value in that it makes it clear that there are
> concurrent accesses elsewhere that are not in transactions. We have
> consensus to use the new C11-like atomics in new or changed code, and I
> don't see a reason to make an exception here.

A possible reason: the code does not actually follow the C11 memory model, so using C11-like atomics is rather misleading. And unlike code using incorrect synchronization or legacy atomics, we don't have a path towards the C11 model, because what the code does is completely outside it, so there's no argument in favor of gradual/partial conversion.

Florian

^ permalink raw reply
* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision
From: Torvald Riegel @ 2016-11-22 9:41 UTC
To: Florian Weimer
Cc: munroesj, Rajalakshmi Srinivasaraghavan, libc-alpha, Aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos, adhemerval.zanella, adconrad, wschmidt

On Tue, 2016-11-22 at 09:55 +0100, Florian Weimer wrote:
> A possible reason: the code does not actually follow the C11 memory
> model, so using C11-like atomics is rather misleading.

While it is true that the C11 memory model does not model HW transactions, it is not true that this code is unrelated to the C11 memory model. First of all, there is code outside of transactions that accesses adapt_count, so we obviously need atomic operations there. Second, there needs to be *some* memory model, and it needs to incorporate the compiler too.

> And unlike code using incorrect synchronization or legacy atomics, we
> don't have a path towards the C11 model, because what the code does is
> completely outside it, so there's no argument in favor of
> gradual/partial conversion.

The HW transaction has to synchronize with nontransactional code. We are obviously going to use the C11 model for nontransactional code, so it cannot be unrelated. It is true that we don't have a detailed description of how HW transactions *extend* the C11 model, but it clearly would be an extension, because it simply has to fit in there and it will use the same underlying mechanisms.

One can also think about it this way: there is nontransactional code that has to use atomics (because there are concurrent accesses from other nontransactional code as well as from transactional code). So we are going to use atomics there. Then, even if just for consistency, we are going to use (relaxed MO) atomic accesses for all other concurrent accesses to the same data too.

^ permalink raw reply
* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision
From: Florian Weimer @ 2016-11-22 10:52 UTC
To: Torvald Riegel
Cc: munroesj, Rajalakshmi Srinivasaraghavan, libc-alpha, Aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos, adhemerval.zanella, adconrad, wschmidt

On 11/22/2016 10:41 AM, Torvald Riegel wrote:
> The HW transaction has to synchronize with nontransactional code. We
> are obviously going to use the C11 model for nontransactional code, so
> it cannot be unrelated. It is true that we don't have a detailed
> description of how HW transactions *extend* the C11 model, but it
> clearly would be an extension, because it simply has to fit in there
> and it will use the same underlying mechanisms.

The counter-argument to that is: until we have a specification of how this kind of transactional memory works, we don't know what it means if an object is accessed from both a transaction and other non-transactional code, and what external constraints are needed to make such access valid (if any). There are some really scary examples from early transactional memory implementations.

But come to think of it, the core issue doesn't have much to do with atomics at all. It's just about how the transactional memory implementation works. So I don't really have any strong opinions here.

Thanks,
Florian

^ permalink raw reply
* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision
From: Torvald Riegel @ 2016-11-22 13:45 UTC
To: Florian Weimer
Cc: munroesj, Rajalakshmi Srinivasaraghavan, libc-alpha, Aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos, adhemerval.zanella, adconrad, wschmidt

On Tue, 2016-11-22 at 11:52 +0100, Florian Weimer wrote:
> The counter-argument to that is: until we have a specification of how
> this kind of transactional memory works, we don't know what it means
> if an object is accessed from both a transaction and other
> non-transactional code, and what external constraints are needed to
> make such access valid (if any).

Lock elision relies on this kind of mixed access. While it would be nice to have a common formal model for the various HTMs out there from the perspective of a C11/C++11 memory model setting, I don't think it's a big problem right now that we don't (AFAIK) have such a formal model. Lock elision should be well understood, so I'm not worried about any surprises regarding this use case.

^ permalink raw reply
* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision
From: Florian Weimer @ 2016-11-22 15:02 UTC
To: Torvald Riegel
Cc: munroesj, Rajalakshmi Srinivasaraghavan, libc-alpha, Aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos, adhemerval.zanella, adconrad, wschmidt

On 11/22/2016 02:45 PM, Torvald Riegel wrote:
> Lock elision relies on this kind of mixed access. While it would be
> nice to have a common formal model for the various HTMs out there from
> the perspective of a C11/C++11 memory model setting, I don't think it's
> a big problem right now that we don't (AFAIK) have such a formal model.
> Lock elision should be well understood, so I'm not worried about any
> surprises regarding this use case.

You are the expert. I'm just surprised we embrace a wild-west approach to P&C again, while trying to convince others to argue from the definitions instead of based on gut feeling.

That being said, I don't have a strong opinion regarding the style to use in this case.

Thanks,
Florian

^ permalink raw reply
* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision
From: Torvald Riegel @ 2016-11-22 16:58 UTC
To: Florian Weimer
Cc: munroesj, Rajalakshmi Srinivasaraghavan, libc-alpha, Aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos, adhemerval.zanella, adconrad, wschmidt

On Tue, 2016-11-22 at 16:02 +0100, Florian Weimer wrote:
> You are the expert. I'm just surprised we embrace a wild-west approach
> to P&C again, while trying to convince others to argue from the
> definitions instead of based on gut feeling.

To avoid misunderstandings: I'm not arguing in favor of embracing any wild-west approach. It would be great if we had a formal model that can cover more than one architecture's semantics. Nonetheless, we don't have one, and I don't see this as a big enough problem to prevent us from trying to use lock elision. Once someone comes up with a model, I suppose we'd certainly try to use it if possible.

^ permalink raw reply
* Informal model for transactional memory (was: Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision)
From: Florian Weimer @ 2016-11-22 18:40 UTC
To: Torvald Riegel
Cc: munroesj, Rajalakshmi Srinivasaraghavan, libc-alpha, Aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos, adhemerval.zanella, adconrad, wschmidt

Torvald,

I'm sorry for my earlier rather destructive comments.

Do you think those of us who do not work on the implementations of the libpthread concurrency primitives need to deal with transactional memory issues at all? The libc-internal locks do not even implement elision, and if we use hardware transactional memory implementations for lock elision only, and that elision is semantically transparent, then the details would not matter to the rest of glibc. We would just have to think about the locking.

Is this the right approach?

Obviously, we will have to reconsider once we start using transactional memory for anything else besides lock elision, but that seems to be rather far away at this point.

Thanks,
Florian

^ permalink raw reply
* Re: Informal model for transactional memory (was: Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision) 2016-11-22 18:40 ` Informal model for transactional memory (was: Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision) Florian Weimer @ 2016-11-22 20:15 ` Steven Munroe 2016-11-23 14:29 ` Torvald Riegel 1 sibling, 0 replies; 22+ messages in thread From: Steven Munroe @ 2016-11-22 20:15 UTC (permalink / raw) To: Florian Weimer Cc: Torvald Riegel, Rajalakshmi Srinivasaraghavan, libc-alpha, aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos, adhemerval.zanella, adconrad, wschmidt On Tue, 2016-11-22 at 19:40 +0100, Florian Weimer wrote: > Torvald, > > I'm sorry for my earlier rather destructive comments. > > Do you think those of us who do not work on the implementations of the > libothread concurrency primitives need to deal with transactional memory > issues at all? > Well ... transactional memory is directly available to applications via builtins provided by GCC. https://gcc.gnu.org/onlinedocs/gcc-6.2.0/gcc/x86-transactional-memory-intrinsics.html#x86-transactional-memory-intrinsics https://gcc.gnu.org/onlinedocs/gcc-6.2.0/gcc/PowerPC-Hardware-Transactional-Memory-Built-in-Functions.html#PowerPC-Hardware-Transactional-Memory-Built-in-Functions https://gcc.gnu.org/onlinedocs/gcc-6.2.0/gcc/S_002f390-System-z-Built-in-Functions.html#S_002f390-System-z-Built-in-Functions The PowerPC and S390 APIs are closely aligned. So anyone with access to Power8/OpenPOWER, Z12, or Haswell can use it. So yes. > > > The libc-internal locks do not even implement elision, and if we use > hardware transactional memory implementations for lock elision only and > that elision is semantically transparent, then the details would not > matter to the rest of glibc. We just have to think about the locking. > > Is this the right approach?
> > Obviously, we will have to consider once we start using transactional > memory for anything else beside lock elision, but this seems to be > rather far away at this point. > Thanks, > Florian > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Informal model for transactional memory (was: Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision) 2016-11-22 18:40 ` Informal model for transactional memory (was: Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision) Florian Weimer 2016-11-22 20:15 ` Steven Munroe @ 2016-11-23 14:29 ` Torvald Riegel 1 sibling, 0 replies; 22+ messages in thread From: Torvald Riegel @ 2016-11-23 14:29 UTC (permalink / raw) To: Florian Weimer Cc: munroesj, Rajalakshmi Srinivasaraghavan, libc-alpha, aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos, adhemerval.zanella, adconrad, wschmidt On Tue, 2016-11-22 at 19:40 +0100, Florian Weimer wrote: > I'm sorry for my earlier rather destructive comments. No worries. I guess I also wasn't clear enough in my statements (ie, that I think the status quo is not too risky right now, instead of thinking that the status quo will be fine forever). > Do you think those of us who do not work on the implementations of the > libothread concurrency primitives need to deal with transactional memory > issues at all? Not in the near future. > The libc-internal locks do not even implement elision, and if we use > hardware transactional memory implementations for lock elision only and > that elision is semantically transparent, then the details would not > matter to the rest of glibc. We just have to think about the locking. > > Is this the right approach? Yes, exactly. That's why I'm not concerned about the status quo wrt TM semantics right now; I believe we know how to implement lock elision with sufficient confidence, and we wouldn't expose TM to anything outside of libpthread or arch-maintainer code except through lock elision or something that uses essentially the same approach.
Generally, current TMs don't guarantee transactions to succeed, so we need a fallback mechanism to synchronize when transactions do not work, and for current glibc this would mean locks (STMs or HyTMs probably don't buy us much for glibc-internal use cases right now, at least compared to lock elision if a decent locking scheme is possible). Thus, we'd use lock elision on internal locks, potentially with additional performance hints aimed at the lock elision implementation. (There are TM modes in some implementations that guarantee completion of restricted transactions if there is no contention. But these are really restricted, and you need to tightly control the code that these transactions execute, as well as avoid potential interference with other threads (eg, false sharing). The restrictions are tighter than what AMD's ASF proposed a few years ago, for example. Thus, these would be special-case implementations in arch-specific code only.) > Obviously, we will have to consider once we start using transactional > memory for anything else beside lock elision, but this seems to be > rather far away at this point. Yes. ^ permalink raw reply [flat|nested] 22+ messages in thread
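The fallback scheme described above — attempt a transaction, and take the real lock when elision is skipped or the transaction fails — can be sketched in portable C. This is an illustrative sketch only, not glibc's code: try_transaction is a hypothetical stand-in for the HTM begin (here it always fails, so the fallback path is exercised), and the back-off value 3 is arbitrary.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for starting a hardware transaction; a real
   implementation would use the HTM builtins (e.g. __builtin_tbegin on
   POWER).  It always fails here, so the fallback path below runs.  */
static bool
try_transaction (void)
{
  return false;
}

/* Minimal spinlock standing in for the futex-based lll_lock/lll_unlock.  */
static void
lll_lock (atomic_int *lock)
{
  int expected = 0;
  while (!atomic_compare_exchange_weak_explicit (lock, &expected, 1,
                                                 memory_order_acquire,
                                                 memory_order_relaxed))
    expected = 0;
}

static void
lll_unlock (atomic_int *lock)
{
  atomic_store_explicit (lock, 0, memory_order_release);
}

/* Adaptive elision: while adapt_count is positive, skip elision and
   decrement it; otherwise try a transaction, and on failure back off
   from elision for a while before taking the fallback lock.  */
static void
elision_lock (atomic_int *lock, int *adapt_count)
{
  if (*adapt_count > 0)
    (*adapt_count)--;        /* Still backing off; skip elision.  */
  else if (try_transaction ())
    return;                  /* Lock elided; running transactionally.  */
  else
    *adapt_count = 3;        /* Transaction failed: back off.  */
  lll_lock (lock);           /* Fallback: acquire the real lock.  */
}
```

With a real HTM begin, the transactional leg would also need to read the lock word and abort if it is held, so eliding threads serialize correctly against lock holders; that detail is omitted here.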
* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision 2016-11-22 8:44 ` Torvald Riegel 2016-11-22 8:55 ` Florian Weimer @ 2016-11-22 18:03 ` Steven Munroe 2016-11-23 11:30 ` Torvald Riegel 1 sibling, 1 reply; 22+ messages in thread From: Steven Munroe @ 2016-11-22 18:03 UTC (permalink / raw) To: Torvald Riegel Cc: Rajalakshmi Srinivasaraghavan, libc-alpha, aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos, adhemerval.zanella, adconrad, wschmidt On Tue, 2016-11-22 at 09:44 +0100, Torvald Riegel wrote: > On Mon, 2016-11-21 at 17:42 -0600, Steven Munroe wrote: > > On Thu, 2016-11-17 at 15:46 +0100, Torvald Riegel wrote: > > > On Tue, 2016-11-15 at 21:42 +0530, Rajalakshmi Srinivasaraghavan wrote: > > > > The initial problem reported was memory corruption in MongoDB only on > > > > Ubuntu 16.04 ppc64le(not on Ubuntu 15). > > > > The problem was not easily recreatable and debugging with many tools and > > > > creative ideas, the problem is narrowed down to lock elision in glibc. > > > > Ubuntu 16.04 has glibc 2.23 with lock elision enabled by default which > > > > made the testcase fail only on 16.04. > > > > > > > > As stated in BZ 20822, short description is > > > > > > > > "The update of *adapt_count after the release of the lock causes a race > > > > condition. Thread A unlocks, thread B continues and returns, destroying > > > > mutex on stack, > > > > then gets into another function, thread A writes to *adapt_count and > > > > corrupts stack.The window is very narrow and requires that the > > > > machine be in smt mode, so likely the two threads have to be on the same > > > > core in order to hit this" > > > > > > > > Looking back the file changes in > > > > sysdeps/unix/sysv/linux/powerpc/elision-unlock.c, suspecting the commit > > > > https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=fadd2ad9cc36115440d50b0eae9299e65988917d which is introduced to improve > > > > the performance. > > > > > > > > Thinking of the following fixes. 
> > > > > > > > 1) Update of adapt_count before the unlock. But I have still not > > > > identified if its going to affect the performance. > > > > > > Because of the requirements on mutex destruction (and same for rwlock > > > etc.), a thread must not access any data associated with the mutex after > > > having unlocked it (this includes reads because it's correct to also > > > unmap the memory holding the mutex after destroying the mutex). > > > > > > > 2) In elision-lock.c/elision-trylock.c update adapt_count right after > > > > the lock acquire at the end of the function. > > > > In either case we have to be careful not to introduce another race > > > > condition. > > > > > > Every access to mutex data outside of transactions must use atomic > > > operations if there would be data races otherwise (according to the C11 > > > memory model). I know that this isn't followed in the current mutex > > > code, but it really should be, and your change qualifies as new code so > > > please take care of this too. > > > > > In this case (with the proposed patch) the adapt_count update is within > > the critical region and adding C11 atomics does not add value and would > > generally make the logical harder to read. > > > > __atomic_store_n (adapt_count, (__atomic_load_n(adapt_count, > > __ATOMIC_RELAXED) - 1), __ATOMIC_RELAXED) > > > > Is a lot of syntax without any real value. > > First, it would actually be this in glibc: > atomic_store_relaxed (&adapt_count, > atomic_load_relaxed(&adapt_count) - 1); > This is not a good example for your case. The proposed patch: - lll_unlock ((*lock), pshared); - - /* Update the adapt count AFTER completing the critical section. - Doing this here prevents unneeded stalling when entering - a critical section. Saving about 8% runtime on P8. */ + /* Update the adapt count in the critical section to + prevent race condition as mentioned in BZ 20822. 
*/ if (*adapt_count > 0) (*adapt_count)--; + lll_unlock ((*lock), pshared); makes it clear that this test and update of the adapt_count is contained within the acquire/release semantics of the lll_lock and lll_unlock. Also if we add the explicit syntax you propose, someone who has not studied the architecture might leap to the conclusion that add_fetch would be even better. Which would be absolutely wrong. This would generate a larx-add-stcx sequence which has two very bad side-effects. 1) on the fallback leg, it would force a larx-hit-stcx flush hazard with the following lll_unlock and cause bad performance. 2) on the transaction leg, it would be redundant and just slow things down. There are some other cases, like the if test at the start of __lll_lock_elision, where it might be appropriate. > Second, it does add value in that it makes it clear that there are > concurrent accesses elsewhere that are not in transactions. We have > consensus to use the new C11-like atomics in new or changed code, and I > don't see a reason to make an exception here. > ^ permalink raw reply [flat|nested] 22+ messages in thread
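The ordering the hunk above establishes can be restated as a self-contained sketch. A toy release store stands in for glibc's futex-based lll_unlock, and the names here are illustrative, not glibc's:

```c
#include <stdatomic.h>

/* Toy stand-in for glibc's futex-based low-level unlock; the release
   store is what publishes the unlock to other threads.  */
static void
toy_lll_unlock (atomic_int *lock)
{
  atomic_store_explicit (lock, 0, memory_order_release);
}

/* Ordering per the proposed fix: adapt_count is read and written while
   the lock is still held, so the release store in toy_lll_unlock is this
   thread's final access to the mutex's memory.  After that point another
   thread may legally destroy and even unmap the mutex (BZ 20822).  */
static void
elision_unlock_fixed (atomic_int *lock, short *adapt_count)
{
  if (*adapt_count > 0)
    (*adapt_count)--;
  toy_lll_unlock (lock);   /* Must be last; no access after this.  */
}
```

The old code performed the decrement after the unlock, which is exactly the window in which another thread can return from its own lock acquisition and destroy the mutex.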
* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision 2016-11-22 18:03 ` BZ 20822 :powerpc: race condition in __lll_unlock_elision Steven Munroe @ 2016-11-23 11:30 ` Torvald Riegel 2016-11-23 17:02 ` Steven Munroe 0 siblings, 1 reply; 22+ messages in thread From: Torvald Riegel @ 2016-11-23 11:30 UTC (permalink / raw) To: munroesj Cc: Rajalakshmi Srinivasaraghavan, libc-alpha, aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos, adhemerval.zanella, adconrad, wschmidt On Tue, 2016-11-22 at 12:03 -0600, Steven Munroe wrote: > On Tue, 2016-11-22 at 09:44 +0100, Torvald Riegel wrote: > > On Mon, 2016-11-21 at 17:42 -0600, Steven Munroe wrote: > > > On Thu, 2016-11-17 at 15:46 +0100, Torvald Riegel wrote: > > > > On Tue, 2016-11-15 at 21:42 +0530, Rajalakshmi Srinivasaraghavan wrote: > > > > > The initial problem reported was memory corruption in MongoDB only on > > > > > Ubuntu 16.04 ppc64le(not on Ubuntu 15). > > > > > The problem was not easily recreatable and debugging with many tools and > > > > > creative ideas, the problem is narrowed down to lock elision in glibc. > > > > > Ubuntu 16.04 has glibc 2.23 with lock elision enabled by default which > > > > > made the testcase fail only on 16.04. > > > > > > > > > > As stated in BZ 20822, short description is > > > > > > > > > > "The update of *adapt_count after the release of the lock causes a race > > > > > condition. 
Thread A unlocks, thread B continues and returns, destroying > > > > > mutex on stack, > > > > > then gets into another function, thread A writes to *adapt_count and > > > > > corrupts stack.The window is very narrow and requires that the > > > > > machine be in smt mode, so likely the two threads have to be on the same > > > > > core in order to hit this" > > > > > > > > > > Looking back the file changes in > > > > > sysdeps/unix/sysv/linux/powerpc/elision-unlock.c, suspecting the commit > > > > > https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=fadd2ad9cc36115440d50b0eae9299e65988917d which is introduced to improve > > > > > the performance. > > > > > > > > > > Thinking of the following fixes. > > > > > > > > > > 1) Update of adapt_count before the unlock. But I have still not > > > > > identified if its going to affect the performance. > > > > > > > > Because of the requirements on mutex destruction (and same for rwlock > > > > etc.), a thread must not access any data associated with the mutex after > > > > having unlocked it (this includes reads because it's correct to also > > > > unmap the memory holding the mutex after destroying the mutex). > > > > > > > > > 2) In elision-lock.c/elision-trylock.c update adapt_count right after > > > > > the lock acquire at the end of the function. > > > > > In either case we have to be careful not to introduce another race > > > > > condition. > > > > > > > > Every access to mutex data outside of transactions must use atomic > > > > operations if there would be data races otherwise (according to the C11 > > > > memory model). I know that this isn't followed in the current mutex > > > > code, but it really should be, and your change qualifies as new code so > > > > please take care of this too. > > > > > > > In this case (with the proposed patch) the adapt_count update is within > > > the critical region and adding C11 atomics does not add value and would > > > generally make the logical harder to read. 
> > > > > > __atomic_store_n (adapt_count, (__atomic_load_n(adapt_count, > > > __ATOMIC_RELAXED) - 1), __ATOMIC_RELAXED) > > > > > > Is a lot of syntax without any real value. > > > > First, it would actually be this in glibc: > > atomic_store_relaxed (&adapt_count, > > atomic_load_relaxed(&adapt_count) - 1); > > > This is not a good example for you case. The proposed patch: > > - lll_unlock ((*lock), pshared); > - > - /* Update the adapt count AFTER completing the critical section. > - Doing this here prevents unneeded stalling when entering > - a critical section. Saving about 8% runtime on P8. */ > + /* Update the adapt count in the critical section to > + prevent race condition as mentioned in BZ 20822. */ > if (*adapt_count > 0) > (*adapt_count)--; > + lll_unlock ((*lock), pshared); > > makes it clear that this test and update of the adapt_count is contained > with the acquire/release semantics of the lll_lock and lll_unlock. However, it does not make it clear that it should be accessed atomically outside of transactions. Another reason why we want to use atomic operations consistently is to at least allow us to really use C11 atomics in the future (or some sort of similar underlying type), so that we can more easily catch cases where atomic operations should be used but aren't. If we don't specifically use relaxed-MO operations, the default will be SC. I'm sure you'd like to avoid that too. This is a consistency argument wrt other code. The fact that the atomic operation is in a transaction too is just another fact. We don't want people to assume that plain memory accesses are sufficient, for example because all accesses to adapt_count are within transactions. > Also if we add the explicit syntax you propose, someone, who has not > studied the architecture, might leap to the conclusion that add_fetch > would be even better. I disagree. 
If you look at the rules we've set up for concurrent code in general (eg, see "Concurrency" on the wiki, or the concurrent code we have converted), we require developers to document what the code does; this means memory order and operations used, and the high-level intent. If you do this properly, the misunderstandings you worry about should not happen. > Which would be absolutely wrong. This would generate a larx-add-stcx > sequence which has two very bad side-effects. > > 1) on the fall back leg, it would a force a larx-hit-stcx flush hazard > with the following lll_unlock and cause bad performance. > 2) on the transaction leg, it would be redundant and just slow things > down. So why not add this as a comment? The use of nonatomic accesses does not imply these points, BTW. Also, the same concerns for how to access adapt_count apply to other accesses to it that may or may not be in a transaction (due to nesting). You must use atomic accesses for these, so you'd need the explanatory comment anyway. On a positive note, it seems that we agree that we want clarity of the source code. I'm arguing that more comments and a consistent use of atomic operations (with proper memory orders and comments, of course) will get us there. That's why I suggested that the patch should clean up and document all accesses to adapt_count. ^ permalink raw reply [flat|nested] 22+ messages in thread
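A sketch of what such consistently-atomic, documented accesses could look like in plain C11. The load_relaxed/store_relaxed shorthands below are local illustrations of glibc's atomic_load_relaxed/atomic_store_relaxed macros (the real ones live in glibc's include/atomic.h and may differ), and the short type for adapt_count is an assumption:

```c
#include <stdatomic.h>

/* Local shorthand in the style of glibc's relaxed-MO wrappers.  */
#define load_relaxed(p)     atomic_load_explicit ((p), memory_order_relaxed)
#define store_relaxed(p, v) atomic_store_explicit ((p), (v), memory_order_relaxed)

/* Decrement adapt_count using an individually atomic load and store.
   Deliberately NOT an atomic fetch_sub: atomicity of the whole
   read-modify-write is not needed, and on POWER a larx/stcx sequence
   would be harmful outside a transaction and redundant inside one.
   Relaxed MO suffices because no ordering is derived from adapt_count
   itself; it is only a performance heuristic.  */
static void
adapt_count_dec (atomic_short *adapt_count)
{
  short val = load_relaxed (adapt_count);
  if (val > 0)
    store_relaxed (adapt_count, val - 1);
}
```

The point of the comment block is exactly what is argued above: the atomic spelling documents that concurrent non-transactional accesses exist, while the comment prevents a reader from "upgrading" it to a read-modify-write.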
* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision 2016-11-23 11:30 ` Torvald Riegel @ 2016-11-23 17:02 ` Steven Munroe 2016-11-23 18:35 ` Torvald Riegel 0 siblings, 1 reply; 22+ messages in thread From: Steven Munroe @ 2016-11-23 17:02 UTC (permalink / raw) To: Torvald Riegel Cc: Rajalakshmi Srinivasaraghavan, libc-alpha, aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos, adhemerval.zanella, adconrad, wschmidt On Wed, 2016-11-23 at 12:25 +0100, Torvald Riegel wrote: > On Tue, 2016-11-22 at 12:03 -0600, Steven Munroe wrote: > > On Tue, 2016-11-22 at 09:44 +0100, Torvald Riegel wrote: > > > On Mon, 2016-11-21 at 17:42 -0600, Steven Munroe wrote: > > > > On Thu, 2016-11-17 at 15:46 +0100, Torvald Riegel wrote: > > > > > On Tue, 2016-11-15 at 21:42 +0530, Rajalakshmi Srinivasaraghavan wrote: > > > > > > The initial problem reported was memory corruption in MongoDB only on > > > > > > Ubuntu 16.04 ppc64le(not on Ubuntu 15). > > > > > > The problem was not easily recreatable and debugging with many tools and > > > > > > creative ideas, the problem is narrowed down to lock elision in glibc. > > > > > > Ubuntu 16.04 has glibc 2.23 with lock elision enabled by default which > > > > > > made the testcase fail only on 16.04. > > > > > > > > > > > > As stated in BZ 20822, short description is > > > > > > > > > > > > "The update of *adapt_count after the release of the lock causes a race > > > > > > condition. 
Thread A unlocks, thread B continues and returns, destroying > > > > > > mutex on stack, > > > > > > then gets into another function, thread A writes to *adapt_count and > > > > > > corrupts stack.The window is very narrow and requires that the > > > > > > machine be in smt mode, so likely the two threads have to be on the same > > > > > > core in order to hit this" > > > > > > > > > > > > Looking back the file changes in > > > > > > sysdeps/unix/sysv/linux/powerpc/elision-unlock.c, suspecting the commit > > > > > > https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=fadd2ad9cc36115440d50b0eae9299e65988917d which is introduced to improve > > > > > > the performance. > > > > > > > > > > > > Thinking of the following fixes. > > > > > > > > > > > > 1) Update of adapt_count before the unlock. But I have still not > > > > > > identified if its going to affect the performance. > > > > > > > > > > Because of the requirements on mutex destruction (and same for rwlock > > > > > etc.), a thread must not access any data associated with the mutex after > > > > > having unlocked it (this includes reads because it's correct to also > > > > > unmap the memory holding the mutex after destroying the mutex). > > > > > > > > > > > 2) In elision-lock.c/elision-trylock.c update adapt_count right after > > > > > > the lock acquire at the end of the function. > > > > > > In either case we have to be careful not to introduce another race > > > > > > condition. > > > > > > > > > > Every access to mutex data outside of transactions must use atomic > > > > > operations if there would be data races otherwise (according to the C11 > > > > > memory model). I know that this isn't followed in the current mutex > > > > > code, but it really should be, and your change qualifies as new code so > > > > > please take care of this too. 
> > > > > > > > > In this case (with the proposed patch) the adapt_count update is within > > > > the critical region and adding C11 atomics does not add value and would > > > > generally make the logical harder to read. > > > > > > > > __atomic_store_n (adapt_count, (__atomic_load_n(adapt_count, > > > > __ATOMIC_RELAXED) - 1), __ATOMIC_RELAXED) > > > > > > > > Is a lot of syntax without any real value. > > > > > > First, it would actually be this in glibc: > > > atomic_store_relaxed (&adapt_count, > > > atomic_load_relaxed(&adapt_count) - 1); > > > > > This is not a good example for you case. The proposed patch: > > > > - lll_unlock ((*lock), pshared); > > - > > - /* Update the adapt count AFTER completing the critical section. > > - Doing this here prevents unneeded stalling when entering > > - a critical section. Saving about 8% runtime on P8. */ > > + /* Update the adapt count in the critical section to > > + prevent race condition as mentioned in BZ 20822. */ > > if (*adapt_count > 0) > > (*adapt_count)--; > > + lll_unlock ((*lock), pshared); > > > > makes it clear that this test and update of the adapt_count is contained > > with the acquire/release semantics of the lll_lock and lll_unlock. > > However, it does not make it clear that it should be accessed atomically > outside of transactions. > > Another reason why we want to use atomic operations consistently is to > at least allow us to really use C11 atomics in the future (or some sort > of similar underlying type), so that we can more easily catch cases > where atomic operations should be used but aren't. If we don't > specifically use relaxed-MO operations, the default will be SC. I'm > sure you'd like to avoid that too. > > This is a consistency argument wrt other code. The fact that the atomic > operation is in a transaction too is just another fact. > We don't want people to assume that plain memory accesses are > sufficient, for example because all accesses to adapt_count are within > transactions. 
> > > Also if we add the explicit syntax you propose, someone, who has not > > studied the architecture, might leap to the conclusion that add_fetch > > would be even better. > > I disagree. If you look at the rules we've set up for concurrent code > in general (eg, see "Concurrency" on the wiki, or the concurrent code we > have converted), we require developers to document what the code does; > this means memory order and operations used, and the high-level intent. > If you do this properly, the misunderstandings you worry should not > happen. > I disagree, your definition of consistency is in this case misleading for my platform. > > Which would be absolutely wrong. This would generate a larx-add-stcx > > sequence which has two very bad side-effects. > > > > 1) on the fall back leg, it would a force a larx-hit-stcx flush hazard > > with the following lll_unlock and cause bad performance. > > 2) on the transaction leg, it would be redundant and just slow things > > down. > > So why not add this as a comment? The use of nonatomic accesses does > not imply these points, BTW. > So a comment to the effect that explicit atomics are not required for this case because it is within the critical region of the lll_lock / lll_unlock. That should satisfy your concern, and mine. > Also, the same concerns for how to access adapt_count apply to other > accesses to it that may not or may be in a transaction (due to nesting). > You must use atomic accesses for these, so you'd need the explaining > comment anyway. > Yes I agree that accesses to adapt_count where the context is not clear should use atomic_load_relaxed / atomic_store_relaxed in the elision code. > On a positive note, it seems that we agree that we want to clarity of > the source code. I'm arguing that more comments and a consistent use of > atomic operations (with proper memory orders and comments, of course) > will get us there. 
That's why I suggested that the patch should clean > up and document all accesses to adapt_count. > > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision 2016-11-23 17:02 ` Steven Munroe @ 2016-11-23 18:35 ` Torvald Riegel 0 siblings, 0 replies; 22+ messages in thread From: Torvald Riegel @ 2016-11-23 18:35 UTC (permalink / raw) To: munroesj Cc: Rajalakshmi Srinivasaraghavan, libc-alpha, aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos, adhemerval.zanella, adconrad, wschmidt On Wed, 2016-11-23 at 11:01 -0600, Steven Munroe wrote: > On Wed, 2016-11-23 at 12:25 +0100, Torvald Riegel wrote: > > On Tue, 2016-11-22 at 12:03 -0600, Steven Munroe wrote: > > > On Tue, 2016-11-22 at 09:44 +0100, Torvald Riegel wrote: > > > > On Mon, 2016-11-21 at 17:42 -0600, Steven Munroe wrote: > > > > > On Thu, 2016-11-17 at 15:46 +0100, Torvald Riegel wrote: > > > > > > On Tue, 2016-11-15 at 21:42 +0530, Rajalakshmi Srinivasaraghavan wrote: > > > > > > > The initial problem reported was memory corruption in MongoDB only on > > > > > > > Ubuntu 16.04 ppc64le(not on Ubuntu 15). > > > > > > > The problem was not easily recreatable and debugging with many tools and > > > > > > > creative ideas, the problem is narrowed down to lock elision in glibc. > > > > > > > Ubuntu 16.04 has glibc 2.23 with lock elision enabled by default which > > > > > > > made the testcase fail only on 16.04. > > > > > > > > > > > > > > As stated in BZ 20822, short description is > > > > > > > > > > > > > > "The update of *adapt_count after the release of the lock causes a race > > > > > > > condition. 
Thread A unlocks, thread B continues and returns, destroying > > > > > > > mutex on stack, > > > > > > > then gets into another function, thread A writes to *adapt_count and > > > > > > > corrupts stack.The window is very narrow and requires that the > > > > > > > machine be in smt mode, so likely the two threads have to be on the same > > > > > > > core in order to hit this" > > > > > > > > > > > > > > Looking back the file changes in > > > > > > > sysdeps/unix/sysv/linux/powerpc/elision-unlock.c, suspecting the commit > > > > > > > https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=fadd2ad9cc36115440d50b0eae9299e65988917d which is introduced to improve > > > > > > > the performance. > > > > > > > > > > > > > > Thinking of the following fixes. > > > > > > > > > > > > > > 1) Update of adapt_count before the unlock. But I have still not > > > > > > > identified if its going to affect the performance. > > > > > > > > > > > > Because of the requirements on mutex destruction (and same for rwlock > > > > > > etc.), a thread must not access any data associated with the mutex after > > > > > > having unlocked it (this includes reads because it's correct to also > > > > > > unmap the memory holding the mutex after destroying the mutex). > > > > > > > > > > > > > 2) In elision-lock.c/elision-trylock.c update adapt_count right after > > > > > > > the lock acquire at the end of the function. > > > > > > > In either case we have to be careful not to introduce another race > > > > > > > condition. > > > > > > > > > > > > Every access to mutex data outside of transactions must use atomic > > > > > > operations if there would be data races otherwise (according to the C11 > > > > > > memory model). I know that this isn't followed in the current mutex > > > > > > code, but it really should be, and your change qualifies as new code so > > > > > > please take care of this too. 
> > > > > > > > > > > In this case (with the proposed patch) the adapt_count update is within > > > > > the critical region and adding C11 atomics does not add value and would > > > > > generally make the logical harder to read. > > > > > > > > > > __atomic_store_n (adapt_count, (__atomic_load_n(adapt_count, > > > > > __ATOMIC_RELAXED) - 1), __ATOMIC_RELAXED) > > > > > > > > > > Is a lot of syntax without any real value. > > > > > > > > First, it would actually be this in glibc: > > > > atomic_store_relaxed (&adapt_count, > > > > atomic_load_relaxed(&adapt_count) - 1); > > > > > > > This is not a good example for you case. The proposed patch: > > > > > > - lll_unlock ((*lock), pshared); > > > - > > > - /* Update the adapt count AFTER completing the critical section. > > > - Doing this here prevents unneeded stalling when entering > > > - a critical section. Saving about 8% runtime on P8. */ > > > + /* Update the adapt count in the critical section to > > > + prevent race condition as mentioned in BZ 20822. */ > > > if (*adapt_count > 0) > > > (*adapt_count)--; > > > + lll_unlock ((*lock), pshared); > > > > > > makes it clear that this test and update of the adapt_count is contained > > > with the acquire/release semantics of the lll_lock and lll_unlock. > > > > However, it does not make it clear that it should be accessed atomically > > outside of transactions. > > > > Another reason why we want to use atomic operations consistently is to > > at least allow us to really use C11 atomics in the future (or some sort > > of similar underlying type), so that we can more easily catch cases > > where atomic operations should be used but aren't. If we don't > > specifically use relaxed-MO operations, the default will be SC. I'm > > sure you'd like to avoid that too. > > > > This is a consistency argument wrt other code. The fact that the atomic > > operation is in a transaction too is just another fact. 
> > We don't want people to assume that plain memory accesses are > > sufficient, for example because all accesses to adapt_count are within > > transactions. > > > > > Also if we add the explicit syntax you propose, someone, who has not > > > studied the architecture, might leap to the conclusion that add_fetch > > > would be even better. > > > > I disagree. If you look at the rules we've set up for concurrent code > > in general (eg, see "Concurrency" on the wiki, or the concurrent code we > > have converted), we require developers to document what the code does; > > this means memory order and operations used, and the high-level intent. > > If you do this properly, the misunderstandings you worry should not > > happen. > > > I disagree, your definition of consistency is in this case misleading > for my platform. > > > > Which would be absolutely wrong. This would generate a larx-add-stcx > > > sequence which has two very bad side-effects. > > > > > > 1) on the fall back leg, it would a force a larx-hit-stcx flush hazard > > > with the following lll_unlock and cause bad performance. > > > 2) on the transaction leg, it would be redundant and just slow things > > > down. > > > > So why not add this as a comment? The use of nonatomic accesses does > > not imply these points, BTW. > > > So a comment to the effect that explicit atomics are not required for > this case because it is within the critical region of the lll_lock / > lll_unlock. To be clear: I want to see relaxed MO atomics being used. I would appreciate comments with the note about why using read-modify-write would not be helpful (ie, what you wrote above), and why using atomic loads and stores would not be strictly necessary for those particular accesses from *purely* an atomicity perspective, but that we still use them for consistency with other accesses outside of transactions to adapt_count. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision 2016-11-17 14:47 ` Torvald Riegel 2016-11-21 23:42 ` Steven Munroe @ 2016-12-04 12:14 ` Torvald Riegel 1 sibling, 0 replies; 22+ messages in thread From: Torvald Riegel @ 2016-12-04 12:14 UTC (permalink / raw) To: Rajalakshmi Srinivasaraghavan Cc: libc-alpha, aaron Sawdey, Ulrich Weigand, Steve Munroe, carlos, adhemerval.zanella, adconrad, wschmidt On Thu, 2016-11-17 at 15:46 +0100, Torvald Riegel wrote: > On Tue, 2016-11-15 at 21:42 +0530, Rajalakshmi Srinivasaraghavan wrote: > > The initial problem reported was memory corruption in MongoDB only on > > Ubuntu 16.04 ppc64le(not on Ubuntu 15). > > The problem was not easily recreatable and debugging with many tools and > > creative ideas, the problem is narrowed down to lock elision in glibc. > > Ubuntu 16.04 has glibc 2.23 with lock elision enabled by default which > > made the testcase fail only on 16.04. > > > > As stated in BZ 20822, short description is > > > > "The update of *adapt_count after the release of the lock causes a race > > condition. Thread A unlocks, thread B continues and returns, destroying > > mutex on stack, > > then gets into another function, thread A writes to *adapt_count and > > corrupts stack.The window is very narrow and requires that the > > machine be in smt mode, so likely the two threads have to be on the same > > core in order to hit this" > > > > Looking back the file changes in > > sysdeps/unix/sysv/linux/powerpc/elision-unlock.c, suspecting the commit > > https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=fadd2ad9cc36115440d50b0eae9299e65988917d which is introduced to improve > > the performance. > > > > Thinking of the following fixes. > > > > 1) Update of adapt_count before the unlock. But I have still not > > identified if its going to affect the performance. 
> Because of the requirements on mutex destruction (and same for rwlock
> etc.), a thread must not access any data associated with the mutex
> after having unlocked it (this includes reads, because it's correct to
> also unmap the memory holding the mutex after destroying the mutex).
>
> > 2) In elision-lock.c/elision-trylock.c update adapt_count right after
> > the lock acquire at the end of the function.
> > In either case we have to be careful not to introduce another race
> > condition.
>
> Every access to mutex data outside of transactions must use atomic
> operations if there would be data races otherwise (according to the C11
> memory model).  I know that this isn't followed in the current mutex
> code, but it really should be, and your change qualifies as new code so
> please take care of this too.
>
> I also think that if we can have concurrent accesses to a particular
> variable both inside and outside of transactions, we should use atomic
> accesses inside the transactions too, just for consistency.

To illustrate, please have a look at this patch:
https://sourceware.org/ml/libc-alpha/2016-12/msg00041.html
It changes x86_64's lock elision implementation to use atomic accesses
for adapt_count.  (It adds shorter-sized atomic_load_* and
atomic_store_* functions, but you don't even need that on powerpc right
now, I think, because powerpc sets USE_ATOMIC_COMPILER_BUILTINS to 0
and thus does not check sizeof on atomic loads and stores.)

^ permalink raw reply	[flat|nested] 22+ messages in thread
end of thread, other threads:[~2016-12-04 12:14 UTC | newest]

Thread overview: 22+ messages
2016-11-15 16:13 BZ 20822 :powerpc: race condition in __lll_unlock_elision Rajalakshmi Srinivasaraghavan
2016-11-16 13:40 ` Adhemerval Zanella
2016-11-16 14:08 ` Szabolcs Nagy
2016-11-16 14:31 ` Adhemerval Zanella
2016-11-16 15:54 ` Rajalakshmi Srinivasaraghavan
2016-11-17 14:47 ` Torvald Riegel
2016-11-21 23:42 ` Steven Munroe
2016-11-22  8:44 ` Torvald Riegel
2016-11-22  8:55 ` Florian Weimer
2016-11-22  9:41 ` Torvald Riegel
2016-11-22 10:52 ` Florian Weimer
2016-11-22 13:45 ` Torvald Riegel
2016-11-22 15:02 ` Florian Weimer
2016-11-22 16:58 ` Torvald Riegel
2016-11-22 18:40 ` Informal model for transactional memory (was: Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision) Florian Weimer
2016-11-22 20:15 ` Steven Munroe
2016-11-23 14:29 ` Torvald Riegel
2016-11-22 18:03 ` BZ 20822 :powerpc: race condition in __lll_unlock_elision Steven Munroe
2016-11-23 11:30 ` Torvald Riegel
2016-11-23 17:02 ` Steven Munroe
2016-11-23 18:35 ` Torvald Riegel
2016-12-04 12:14 ` Torvald Riegel