Date: Sat, 09 Feb 2013 04:18:00 -0000
Subject: Re: [PATCH] Avoid unnecessary busy loop in __lll_timedlock_wait on ARM.
From: "Carlos O'Donell"
To: David Miller, Daniel Jacobowitz, "Joseph S. Myers", Andreas Schwab,
 Thomas Schwinge, Kaz Kojima, Marcus Shawcroft
Cc: katsuki.uwatoko@toshiba.co.jp, libc-ports@sourceware.org

Dan,

If you wouldn't mind I'd like your historical perspective on a glibc
issue. You were the person who last touched
ports/sysdeps/unix/sysv/linux/arm/nptl/lowlevellock.c; do you have any
recollection why ARM needed a custom copy of this file?

Dave,

I see that sparc32 also has a unique copy of lowlevellock.c. Why the
use of *_24_* atomic primitives? Faster?

Andreas, Thomas, Kaz, Marcus,

I've added you to the TO: because machines you maintain are affected by
a bug in their lowlevellock.h implementations. Please see section (c)
below and review my analysis.

> Carlos, any comments on this patch
> ? It makes the ARM version of __lll_timedlock_wait closer to the HPPA
> version, but I don't know if many of the differences between different
> architecture versions of this code are really deliberate....

I do not believe HPPA needs a unique version of lowlevellock.c and
should instead be using the generic version. I do not recollect why I
would have created a copy of lowlevellock.c other than that I was
mostly cloning SPARC when doing binutils and glibc work (Dave does a
good job). The HPPA kernel helper emulates compare-and-exchange and
thus provides everything that is required for a generic NPTL
implementation.
> Would you agree that the generic Linux version
> (nptl/sysdeps/unix/sysv/linux/lowlevellock.c) doesn't need such a change
> because the loop is using atomic_exchange_acq rather than
> atomic_compare_and_exchange_bool_acq, so is already setting the futex to 2
> in every loop iteration?

(a) Review of the reporter's claim that a busy loop can result in the
ARM version of __lll_timedlock_wait.

I have reviewed the ARM function __lll_timedlock_wait and I can confirm
that the current code can indeed cause a busy-wait loop with three
threads under the conditions provided by Uwatoko-san. I see nothing
unusual about the given conditions, and with enough cores you should be
able to trigger the busy-wait loop.

(b) Review of the generic version of __lll_timedlock_wait.

I can confirm that the generic version of __lll_timedlock_wait is *not*
susceptible to a busy-wait loop. The use of `atomic_exchange_acq
(futex, 2)' in the while loop condition ensures that the code always
conservatively *assumes* the lock is locked with more than one waiter.

Note that the expectation is that __lll_timedlock *must not* reset the
futex to 1, otherwise new threads may prevent us from waiting on a
futex value of 2. The expectation is that __lll_timedlock sets the
futex to 1 only if it was zero.

(c) Solution to the busy-wait loop.

I do not like the proposed solution. It is a band-aid on top of a
function that doesn't look like it has any architectural merits. I
think ARM should be using the generic NPTL version of lowlevellock.c.

The real problem is that ARM's __lll_timedlock does an
"atomic_exchange_acq (__futex, 1)" while attempting to acquire the
futex (violating what I noted in (b)). This means that any new thread
trying to acquire the futex will reset the futex to "locked, no
waiters", which causes all other in-flight threads doing a futex wait
to fail with -EWOULDBLOCK.
It seems to me that this is an ARM bug in the __lll_timedlock macro in
lowlevellock.h that was partly worked around in __lll_timedlock_wait in
lowlevellock.c.

For example compare ARM's implementation:

#define __lll_timedlock(futex, abstime, private)                 \
  ({                                                             \
     int *__futex = (futex);                                     \
     int __val = 0;                                              \
                                                                 \
     if (__builtin_expect (atomic_exchange_acq (__futex, 1), 0)) \
       __val = __lll_timedlock_wait (__futex, abstime, private); \
     __val;                                                      \
  })

To SPARC's implementation:

static inline int
__attribute__ ((always_inline))
__lll_timedlock (int *futex, const struct timespec *abstime, int private)
{
  int val = atomic_compare_and_exchange_val_24_acq (futex, 1, 0);
  int result = 0;

  if (__builtin_expect (val != 0, 0))
    result = __lll_timedlock_wait (futex, abstime, private);
  return result;
}

Note that the SPARC implementation sets the futex to 1 only if it was
zero; otherwise it calls into __lll_timedlock_wait, which will then set
the futex to 2, because we can't know at that point that there aren't
one or more waiters (the "perhaps" scenario if you've read Ulrich's
paper "Futexes Are Tricky").

I have reviewed all of the lowlevellock.h implementations and find the
following results:

No chance of busy-wait:
* mips, ia64, alpha, tile, i386, powerpc, x86-64, s390, sparc, hppa
  - All of these targets use compare-and-exchange in __lll_timedlock
    without resetting the futex to 1.

Possible chance of busy-wait:
* m68k, aarch64, arm, sh/sh4
  - All of these targets reset the futex to 1 using atomic-exchange in
    __lll_timedlock.

m68k and sh/sh4 use the generic version of lowlevellock.c, which means
the busy-wait loop bug is not present, but the use of atomic-exchange
causes a different bug. On m68k and sh/sh4 you get at most N spurious
failed futex wait syscalls for each thread calling
__lll_timedlock_wait, where N is the number of in-flight threads
expecting futex == 2.

For example on m68k:

* T1 calls __lll_timedlock setting futex to 1 and taking the lock.
* T2 calls __lll_timedlock setting futex to 1 but does not take the
  lock.
* T2 calls __lll_timedlock_wait and sets the futex to 2 but does not
  gain the lock.
* T3 calls __lll_timedlock setting futex to 1 but does not take the
  lock.
* T2 calls futex wait but fails with -EWOULDBLOCK because T3 reset the
  futex to 1.
  -> One in-flight thread (T2), and one spurious failed futex wait
     syscall.
* T2 again sets the futex to 2 and does not gain the lock.
* ... T2 and T3 go on to call the futex wait syscall and both sleep.

In the ARM case, with the current implementation of
__lll_timedlock_wait, we get a busy loop as outlined by Uwatoko-san.

The real fix is to use atomic-compare-and-swap in __lll_timedlock.

In fact the use of atomic-exchange in ARM's __lll_timedlock seems
dangerous given this unlock:

#define __lll_unlock(futex, private)                    \
  (void)                                                \
    ({ int *__futex = (futex);                          \
       int __oldval = atomic_exchange_rel (__futex, 0); \
       if (__builtin_expect (__oldval > 1, 0))          \
         lll_futex_wake (__futex, 1, private);          \
    })

Thus you *could* have a new thread calling __lll_timedlock, setting the
futex to 1, and then every thread calling __lll_unlock skips the wake.
This could cause a *huge* delay between sleeping and waking as the wake
is deferred later and later. Eventually one of the threads does an
atomic operation on the futex to set the value to 2, takes the lock,
and eventually calls wake. However, the latency is increased and that's
a performance bug.

In summary:

- All machines listed above as `Possible chance of busy-wait' should be
  using a generic __lll_timedlock that looks like this:
~~~
  int result = 0;
  if (atomic_compare_and_exchange_bool_acq (futex, 1, 0) != 0)
    result = __lll_timedlock_wait (futex, abstime, private);
  return result;
~~~
  (simplest example taken from alpha's lowlevellock.h)

Comments?

Cheers,
Carlos.