* [PATCH] Avoid unnecessary busy loop in __lll_timedlock_wait on ARM.
@ 2013-01-31  8:42 katsuki.uwatoko
  2013-02-07 23:33 ` Joseph S. Myers
  0 siblings, 1 reply; 11+ messages in thread
From: katsuki.uwatoko @ 2013-01-31  8:42 UTC (permalink / raw)
To: libc-ports

Hi,

I have found an issue in __lll_timedlock_wait on ARM.
The following sequence causes an unnecessary busy loop:

"A thread" gets the lock.  (futex = 1)
"B thread" tries to get the lock, and has not called the futex syscall
yet.  (futex = 2)
"A thread" releases the lock.  (futex = 0)
"C thread" gets the lock, and does not call the futex syscall because
futex = 0.  (futex = 1)
"B thread" calls the futex syscall, which returns with an error,
because the futex syscall in the Linux kernel checks at
futex_wait_setup() whether the value has changed from 2 (the 3rd
argument of the syscall), but in this case the futex is 1.
"B thread" tries to get the lock in userspace but cannot get it because
the futex is not 0.
In the end the thread keeps calling the futex syscall until "C thread"
releases the lock (futex = 0) or the timeout expires.

Therefore I think that the value should be set to 2 on every loop
iteration, the same as in __lll_lock_wait_private, and I have attached
a patch for this issue.

-- 
UWATOKO Katsuki

---
 .../unix/sysv/linux/arm/nptl/lowlevellock.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/ports/sysdeps/unix/sysv/linux/arm/nptl/lowlevellock.c b/ports/sysdeps/unix/sysv/linux/arm/nptl/lowlevellock.c
index 756f39f..a495d85 100644
--- a/ports/sysdeps/unix/sysv/linux/arm/nptl/lowlevellock.c
+++ b/ports/sysdeps/unix/sysv/linux/arm/nptl/lowlevellock.c
@@ -65,6 +65,7 @@ __lll_timedlock_wait (int *futex, const struct timespec *abstime, int private)
   do
     {
       struct timeval tv;
+      int oldval;
 
       /* Get the current time.  */
       (void) __gettimeofday (&tv, NULL);
@@ -82,8 +83,10 @@ __lll_timedlock_wait (int *futex, const struct timespec *abstime, int private)
       if (rt.tv_sec < 0)
         return ETIMEDOUT;
 
+      oldval = atomic_compare_and_exchange_val_acq (futex, 2, 1);
       // XYZ: Lost the lock to check whether it was private.
-      lll_futex_timed_wait (futex, 2, &rt, private);
+      if (oldval != 0)
+        lll_futex_timed_wait (futex, 2, &rt, private);
     }
   while (atomic_compare_and_exchange_bool_acq (futex, 2, 0) != 0);
-- 
1.7.4.1

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH] Avoid unnecessary busy loop in __lll_timedlock_wait on ARM.
  2013-01-31  8:42 [PATCH] Avoid unnecessary busy loop in __lll_timedlock_wait on ARM katsuki.uwatoko
@ 2013-02-07 23:33 ` Joseph S. Myers
  2013-02-08  0:13   ` katsuki.uwatoko
  2013-02-09  4:18   ` Carlos O'Donell
  0 siblings, 2 replies; 11+ messages in thread
From: Joseph S. Myers @ 2013-02-07 23:33 UTC (permalink / raw)
To: katsuki.uwatoko; +Cc: libc-ports, Carlos O'Donell

On Thu, 31 Jan 2013, katsuki.uwatoko@toshiba.co.jp wrote:

> Hi,
>
> I have found an issue in __lll_timedlock_wait on ARM.

I think the busy loop should have a bug filed in Bugzilla, as a
user-visible bug - could you file that bug?

> The following sequence causes unnecessary busy loop.
>
> "A thread" gets the lock. (futex = 1)
> "B thread" tries to get the lock, and has not called futex syscall yet. (futex = 2)
> "A thread" releases the lock (futex = 0)
> "C thread" gets the lock, and does not call futex syscall because of futex=0 (futex = 1)
> "B thread" calls futex syscall, and returns with an error.
> Because futex syscall in Linux Kernel checks if the val is changed
> from 2, which is the 3rd arg of the syscall at futex_wait_setup(),
> but in this case futex is 1.
> "B thread" tries to get the lock in userspace but cannot get it
> because futex is not 0.
> After all the thread keeps calling futex syscall
> until "C thread" will release it (futex = 0) or timeout.
>
> Therefore I think that the value should be set 2 in every loop
> the same as __lll_lock_wait_private, and attached a patch for this issue.

Carlos, any comments on this patch
<http://sourceware.org/ml/libc-ports/2013-01/msg00084.html>?  It makes the
ARM version of __lll_timedlock_wait closer to the HPPA version, but I
don't know if many of the differences between different architecture
versions of this code are really deliberate....

Would you agree that the generic Linux version
(nptl/sysdeps/unix/sysv/linux/lowlevellock.c) doesn't need such a change
because the loop is using atomic_exchange_acq rather than
atomic_compare_and_exchange_bool_acq, so is already setting the futex to 2
in every loop iteration?

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH] Avoid unnecessary busy loop in __lll_timedlock_wait on ARM.
  2013-02-07 23:33 ` Joseph S. Myers
@ 2013-02-08  0:13   ` katsuki.uwatoko
  2013-02-09  4:18   ` Carlos O'Donell
  1 sibling, 0 replies; 11+ messages in thread
From: katsuki.uwatoko @ 2013-02-08  0:13 UTC (permalink / raw)
To: joseph; +Cc: libc-ports, carlos

> > I have found an issue in __lll_timedlock_wait on ARM.
>
> I think the busy loop should have a bug filed in Bugzilla, as a
> user-visible bug - could you file that bug?

Thank you for the reply.
I have filed this in Bugzilla.
http://sourceware.org/bugzilla/show_bug.cgi?id=15119

-- 
UWATOKO Katsuki

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH] Avoid unnecessary busy loop in __lll_timedlock_wait on ARM.
  2013-02-07 23:33 ` Joseph S. Myers
  2013-02-08  0:13 ` katsuki.uwatoko
@ 2013-02-09  4:18 ` Carlos O'Donell
  2013-02-09  6:49   ` David Miller
  1 sibling, 1 reply; 11+ messages in thread
From: Carlos O'Donell @ 2013-02-09  4:18 UTC (permalink / raw)
To: David Miller, Daniel Jacobowitz, Joseph S. Myers, Andreas Schwab,
    Thomas Schwinge, Kaz Kojima, Marcus Shawcroft
Cc: katsuki.uwatoko, libc-ports

Dan,

If you wouldn't mind I'd like your historical perspective on a glibc
issue.  You were the person who last touched
ports/sysdeps/unix/sysv/linux/arm/nptl/lowlevellock.c; do you have any
recollection why ARM needed a custom copy of this file?

Dave,

I see that sparc32 also has a unique copy of lowlevellock.c.  Why the
use of *_24_* atomic primitives?  Faster?

Andreas, Thomas, Kaz, Marcus,

I've added you to the TO: because machines you maintain are affected by
a bug in their lowlevellock.h implementations.  Please see section (c)
below and review my analysis.

> Carlos, any comments on this patch
> <http://sourceware.org/ml/libc-ports/2013-01/msg00084.html>?  It makes the
> ARM version of __lll_timedlock_wait closer to the HPPA version, but I
> don't know if many of the differences between different architecture
> versions of this code are really deliberate....

I do not believe HPPA needs a unique version of lowlevellock.c and
should instead be using the generic version.  I do not recollect why I
would have created a copy of lowlevellock.c other than I was mostly
cloning SPARC when doing binutils and glibc work (Dave does a good
job).  The HPPA kernel helper emulates compare-and-exchange and thus
everything that is required for a generic NPTL implementation.

> Would you agree that the generic Linux version
> (nptl/sysdeps/unix/sysv/linux/lowlevellock.c) doesn't need such a change
> because the loop is using atomic_exchange_acq rather than
> atomic_compare_and_exchange_bool_acq, so is already setting the futex to 2
> in every loop iteration?

(a) Review of the reporter's claim that a busy loop can occur in the
ARM version of __lll_timedlock_wait.

I have reviewed the ARM function __lll_timedlock_wait and I can confirm
that it does indeed seem like the current code can cause a busy-wait
loop with three threads given the conditions provided by Uwatoko-san.
I see nothing unusual about the given conditions and with enough cores
you should be able to trigger the busy-wait loop.

(b) Review of the generic version of __lll_timedlock_wait.

I can confirm that the generic version of __lll_timedlock_wait is *not*
susceptible to a busy-wait loop.  The use of `atomic_exchange_acq
(futex, 2)' in the while loop condition ensures that the code always
correctly *assumes* the lock is locked with more than one waiter.

Note that the expectation is that __lll_timedlock *must not* reset
futex to 1, otherwise new threads may prevent us from waiting on a
futex value of 2.  The expectation is that __lll_timedlock sets futex
to 1 only if it was zero.

(c) Solution to the busy-wait loop.

I do not like the proposed solution.  It is a band-aid on top of a
function that doesn't look like it has any architectural merits.  I
think ARM should be using the generic NPTL version of lowlevellock.c.

The real problem is that ARM's __lll_timedlock does an
"atomic_exchange_acq (__futex, 1)" while attempting to acquire the
futex (violating what I noted in (b)).  This means that any new thread
trying to acquire the futex will reset the futex to "locked, no
waiters", and this causes all other in-flight threads doing a futex
wait to fail with -EWOULDBLOCK.

It seems to me that this is an ARM bug in the __lll_timedlock macro in
lowlevellock.h that was partly worked around in __lll_timedlock_wait in
lowlevellock.c.

For example compare ARM's implementation:

#define __lll_timedlock(futex, abstime, private)                 \
  ({                                                             \
    int *__futex = (futex);                                      \
    int __val = 0;                                               \
                                                                 \
    if (__builtin_expect (atomic_exchange_acq (__futex, 1), 0))  \
      __val = __lll_timedlock_wait (__futex, abstime, private);  \
    __val;                                                       \
  })

To SPARC's implementation:

static inline int
__attribute__ ((always_inline))
__lll_timedlock (int *futex, const struct timespec *abstime, int private)
{
  int val = atomic_compare_and_exchange_val_24_acq (futex, 1, 0);
  int result = 0;

  if (__builtin_expect (val != 0, 0))
    result = __lll_timedlock_wait (futex, abstime, private);
  return result;
}

Note that the SPARC implementation sets the futex to 1 only if it was
zero, otherwise it calls into __lll_timedlock_wait, which will then set
futex to 2, because we can't know at that point that there aren't one
or more waiters (the "perhaps" scenario if you've read Ulrich's paper
"Futexes Are Tricky").

I have reviewed all of the lowlevellock.h implementations and find the
following results:

No chance of busy-wait:
* mips, ia64, alpha, tile, i386, powerpc, x86-64, s390, sparc, hppa
- All of these targets use compare-and-exchange in __lll_timedlock
  without resetting futex to 1.

Possible chance of busy-wait:
* m68k, aarch64, arm, sh/sh4
- All of these targets reset futex to 1 using atomic-exchange in
  __lll_timedlock.

m68k and sh/sh4 use the generic version of lowlevellock.c, and this
means the busy-wait loop bug is not present, but the use of
atomic-exchange causes a different bug.  On m68k and sh/sh4 you get at
most N spurious failed futex wait syscalls for each thread calling
__lll_timedlock_wait, where N is the number of in-flight threads
expecting futex == 2.

For example on m68k:
* T1 calls __lll_timedlock setting futex to 1 and taking the lock.
* T2 calls __lll_timedlock setting futex to 1 but does not take the
  lock.
* T2 calls __lll_timedlock_wait and sets the futex to 2 and does not
  gain the lock.
* T3 calls __lll_timedlock setting futex to 1 but does not take the
  lock.
* T2 calls futex wait but fails with -EWOULDBLOCK because T3 reset
  futex to 1.
  -> One in-flight thread (T2), and one spurious failed futex wait
     syscall.
* T2 again sets the futex to 2 and does not gain the lock.
* ... T2 and T3 go on to call the futex wait syscall and both sleep.

In the ARM case with the current implementation of __lll_timedlock_wait
we get a busy loop as outlined by Uwatoko-san.

The real fix is to use atomic-compare-and-swap in __lll_timedlock.

In fact the use of atomic-exchange in ARM's __lll_timedlock seems
dangerous given this unlock:

#define __lll_unlock(futex, private)                     \
  (void)                                                 \
    ({ int *__futex = (futex);                           \
       int __oldval = atomic_exchange_rel (__futex, 0);  \
       if (__builtin_expect (__oldval > 1, 0))           \
         lll_futex_wake (__futex, 1, private);           \
    })

Thus you *could* have a new thread calling __lll_timedlock, setting
futex to 1, and every thread calling __lll_unlock skips the wake.  This
could cause a *huge* delay between sleeping and waking as the wake is
deferred later and later.  Eventually one of the threads does an atomic
operation on the futex to set the value to 2 and will take the lock,
and eventually call wake.  However, the latency is increased and that's
a performance bug.

In summary:

- All machines listed above as `Possible chance of busy-wait' should be
  using a generic __lll_timedlock that looks like this:
~~~
  int result = 0;
  if (atomic_compare_and_exchange_bool_acq (futex, 1, 0) != 0)
    result = __lll_timedlock_wait (futex, abstime, private);
  return result;
~~~
(simplest example taken from alpha's lowlevellock.h)

Comments?

Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH] Avoid unnecessary busy loop in __lll_timedlock_wait on ARM.
  2013-02-09  4:18 ` Carlos O'Donell
@ 2013-02-09  6:49   ` David Miller
  2013-02-09 14:56     ` Carlos O'Donell
  0 siblings, 1 reply; 11+ messages in thread
From: David Miller @ 2013-02-09  6:49 UTC (permalink / raw)
To: carlos
Cc: drow, joseph, schwab, thomas_schwinge, kkojima, marcus.shawcroft,
    katsuki.uwatoko, libc-ports

From: "Carlos O'Donell" <carlos@systemhalted.org>
Date: Fri, 8 Feb 2013 23:18:42 -0500

> I see that sparc32 also has a unique copy of lowlevellock.c  Why the
> use of *_24_* atomic primitives?  Faster?

On pre-v9 32-bit sparc, we lack any usable atomic compare and swap.

All we have is an 8-bit spinlock.

So we implement things in a 32-bit word which is composed of a 24-bit
counter and an 8-bit lock.

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH] Avoid unnecessary busy loop in __lll_timedlock_wait on ARM.
  2013-02-09  6:49 ` David Miller
@ 2013-02-09 14:56   ` Carlos O'Donell
  2013-02-10  4:56     ` David Miller
  0 siblings, 1 reply; 11+ messages in thread
From: Carlos O'Donell @ 2013-02-09 14:56 UTC (permalink / raw)
To: David Miller
Cc: carlos, drow, joseph, schwab, thomas_schwinge, kkojima,
    marcus.shawcroft, katsuki.uwatoko, libc-ports

On 02/09/2013 01:49 AM, David Miller wrote:
> From: "Carlos O'Donell" <carlos@systemhalted.org>
> Date: Fri, 8 Feb 2013 23:18:42 -0500
>
>> I see that sparc32 also has a unique copy of lowlevellock.c  Why the
>> use of *_24_* atomic primitives?  Faster?
>
> On pre-v9 32-bit sparc, we lack any usable atomic compare and
> swap.
>
> All we have is an 8-bit spinlock.
>
> So we implement things in a 32-bit word which is composed of a 24-bit
> counter and an 8-bit lock.

Thus a futex on sparc looks like this?

struct futex
{
  union
  {
    int whole;
    struct
    {
      char lock;
      char value[3];
    } __split;
  } __futex;
};

With only 24 bits for the value?  I'll have to remember that pre-v9
sparc only has 24 bits there.

I'd seen the *other* sparc pre-v9 implementation that used 64 global
locks per-library and that seemed signal unsafe and prone to deadlocks.

How do you deal with the FUTEX_WAITERS/FUTEX_OWNER_DIED bits that are
set in the high bits of the word?  Or a tid that is 26-bits
(FUTEX_TID_MASK)?

Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH] Avoid unnecessary busy loop in __lll_timedlock_wait on ARM. 2013-02-09 14:56 ` Carlos O'Donell @ 2013-02-10 4:56 ` David Miller 2013-02-10 17:55 ` Carlos O'Donell 0 siblings, 1 reply; 11+ messages in thread From: David Miller @ 2013-02-10 4:56 UTC (permalink / raw) To: carlos Cc: carlos, drow, joseph, schwab, thomas_schwinge, kkojima, marcus.shawcroft, katsuki.uwatoko, libc-ports From: "Carlos O'Donell" <carlos@redhat.com> Date: Sat, 09 Feb 2013 09:55:43 -0500 > I'd seen the *other* sparc pre-v9 implementation that used 64 global > locks per-library and that seemed signal unsafe and prone to deadlocks. All of these pre-v9 things are signal unsafe and deadlock. I thought about doing the kernel atomic emulation other platforms have adopted, but frankly these cpus are so old and deprecated that they're not worth doing the work for. And by the time we'd propagate all of this infrastructure necessary to support this kind of scheme, those cpus would be even more outdated. Even debian does v9-only build on 32-bit. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] Avoid unnecessary busy loop in __lll_timedlock_wait on ARM. 2013-02-10 4:56 ` David Miller @ 2013-02-10 17:55 ` Carlos O'Donell 2013-02-12 21:41 ` Daniel Jacobowitz 0 siblings, 1 reply; 11+ messages in thread From: Carlos O'Donell @ 2013-02-10 17:55 UTC (permalink / raw) To: David Miller Cc: carlos, drow, joseph, schwab, thomas_schwinge, kkojima, marcus.shawcroft, katsuki.uwatoko, libc-ports On Sat, Feb 9, 2013 at 11:56 PM, David Miller <davem@davemloft.net> wrote: > From: "Carlos O'Donell" <carlos@redhat.com> > Date: Sat, 09 Feb 2013 09:55:43 -0500 > >> I'd seen the *other* sparc pre-v9 implementation that used 64 global >> locks per-library and that seemed signal unsafe and prone to deadlocks. > > All of these pre-v9 things are signal unsafe and deadlock. > > I thought about doing the kernel atomic emulation other platforms have > adopted, but frankly these cpus are so old and deprecated that they're > not worth doing the work for. > > And by the time we'd propagate all of this infrastructure necessary to > support this kind of scheme, those cpus would be even more outdated. > > Even debian does v9-only build on 32-bit. Eminently practical. Just curious. Thanks for verifying what I suspected. Cheers, Carlos. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] Avoid unnecessary busy loop in __lll_timedlock_wait on ARM. 2013-02-10 17:55 ` Carlos O'Donell @ 2013-02-12 21:41 ` Daniel Jacobowitz 2013-02-12 21:41 ` Daniel Jacobowitz 2013-02-12 21:57 ` Carlos O'Donell 0 siblings, 2 replies; 11+ messages in thread From: Daniel Jacobowitz @ 2013-02-12 21:41 UTC (permalink / raw) To: Carlos O'Donell Cc: David Miller, carlos, Joseph S. Myers, schwab, thomas_schwinge, kkojima, marcus.shawcroft, katsuki.uwatoko, libc-ports It's been too long, sorry. It may have been necessary solely to provide the separate EABI and NPTL versions in sysdeps; you'd have to look at e.g. the sysdeps selection order for the LinuxThreads version. It may also be related to the lack of usable atomic primitives, pre-EABI. On Sun, Feb 10, 2013 at 12:54 PM, Carlos O'Donell <carlos@systemhalted.org> wrote: > On Sat, Feb 9, 2013 at 11:56 PM, David Miller <davem@davemloft.net> wrote: >> From: "Carlos O'Donell" <carlos@redhat.com> >> Date: Sat, 09 Feb 2013 09:55:43 -0500 >> >>> I'd seen the *other* sparc pre-v9 implementation that used 64 global >>> locks per-library and that seemed signal unsafe and prone to deadlocks. >> >> All of these pre-v9 things are signal unsafe and deadlock. >> >> I thought about doing the kernel atomic emulation other platforms have >> adopted, but frankly these cpus are so old and deprecated that they're >> not worth doing the work for. >> >> And by the time we'd propagate all of this infrastructure necessary to >> support this kind of scheme, those cpus would be even more outdated. >> >> Even debian does v9-only build on 32-bit. > > Eminently practical. Just curious. Thanks for verifying what I suspected. > > Cheers, > Carlos. -- Thanks, Daniel ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] Avoid unnecessary busy loop in __lll_timedlock_wait on ARM. 2013-02-12 21:41 ` Daniel Jacobowitz @ 2013-02-12 21:41 ` Daniel Jacobowitz 2013-02-12 21:57 ` Carlos O'Donell 1 sibling, 0 replies; 11+ messages in thread From: Daniel Jacobowitz @ 2013-02-12 21:41 UTC (permalink / raw) To: Carlos O'Donell Cc: David Miller, carlos, Joseph S. Myers, schwab, thomas_schwinge, kkojima, marcus.shawcroft, katsuki.uwatoko, libc-ports You could try asking Richard Earnshaw... On Tue, Feb 12, 2013 at 4:41 PM, Daniel Jacobowitz <drow@false.org> wrote: > It's been too long, sorry. It may have been necessary solely to > provide the separate EABI and NPTL versions in sysdeps; you'd have to > look at e.g. the sysdeps selection order for the LinuxThreads version. > It may also be related to the lack of usable atomic primitives, > pre-EABI. > > On Sun, Feb 10, 2013 at 12:54 PM, Carlos O'Donell > <carlos@systemhalted.org> wrote: >> On Sat, Feb 9, 2013 at 11:56 PM, David Miller <davem@davemloft.net> wrote: >>> From: "Carlos O'Donell" <carlos@redhat.com> >>> Date: Sat, 09 Feb 2013 09:55:43 -0500 >>> >>>> I'd seen the *other* sparc pre-v9 implementation that used 64 global >>>> locks per-library and that seemed signal unsafe and prone to deadlocks. >>> >>> All of these pre-v9 things are signal unsafe and deadlock. >>> >>> I thought about doing the kernel atomic emulation other platforms have >>> adopted, but frankly these cpus are so old and deprecated that they're >>> not worth doing the work for. >>> >>> And by the time we'd propagate all of this infrastructure necessary to >>> support this kind of scheme, those cpus would be even more outdated. >>> >>> Even debian does v9-only build on 32-bit. >> >> Eminently practical. Just curious. Thanks for verifying what I suspected. >> >> Cheers, >> Carlos. > > > > -- > Thanks, > Daniel -- Thanks, Daniel ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] Avoid unnecessary busy loop in __lll_timedlock_wait on ARM. 2013-02-12 21:41 ` Daniel Jacobowitz 2013-02-12 21:41 ` Daniel Jacobowitz @ 2013-02-12 21:57 ` Carlos O'Donell 1 sibling, 0 replies; 11+ messages in thread From: Carlos O'Donell @ 2013-02-12 21:57 UTC (permalink / raw) To: Daniel Jacobowitz Cc: David Miller, carlos, Joseph S. Myers, schwab, thomas_schwinge, kkojima, marcus.shawcroft, katsuki.uwatoko, libc-ports On Tue, Feb 12, 2013 at 4:41 PM, Daniel Jacobowitz <drow@false.org> wrote: > It's been too long, sorry. It may have been necessary solely to > provide the separate EABI and NPTL versions in sysdeps; you'd have to > look at e.g. the sysdeps selection order for the LinuxThreads version. > It may also be related to the lack of usable atomic primitives, > pre-EABI. Thanks! I knew it was a long shot, but sometimes people remember the oddest things for the longest time :-) Cheers, Carlos. ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2013-02-12 21:57 UTC | newest] Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2013-01-31 8:42 [PATCH] Avoid unnecessary busy loop in __lll_timedlock_wait on ARM katsuki.uwatoko 2013-02-07 23:33 ` Joseph S. Myers 2013-02-08 0:13 ` katsuki.uwatoko 2013-02-09 4:18 ` Carlos O'Donell 2013-02-09 6:49 ` David Miller 2013-02-09 14:56 ` Carlos O'Donell 2013-02-10 4:56 ` David Miller 2013-02-10 17:55 ` Carlos O'Donell 2013-02-12 21:41 ` Daniel Jacobowitz 2013-02-12 21:41 ` Daniel Jacobowitz 2013-02-12 21:57 ` Carlos O'Donell
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).