From: "H.J. Lu"
Date: Tue, 03 Jul 2018 12:46:00 -0000
Subject: Re: [PATCH v6 3/3] Mutex: Replace trylock by read only while spinning
To: Kemi Wang
Cc: Adhemerval Zanella, Florian Weimer, Rical Jason, Carlos Donell,
 Glibc alpha, Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu,
 Lu Aubrey

On Mon, Jul 2, 2018 at 1:27 AM, Kemi Wang wrote:
> The pthread adaptive spin mutex spins on the lock for a while before
> calling into the kernel to block.  But in the current implementation of
> spinning, the spinners go straight back to LLL_MUTEX_TRYLOCK (a cmpxchg)
> while the lock is contended.  This is not a good idea on many targets,
> because it forces expensive memory synchronization among processors and
> penalizes other running threads.  For example, it constantly floods the
> system with "read for ownership" requests, which are much more expensive
> to process than a single read.  Thus, we use only a relaxed-MO read until
> we observe that the lock is no longer held, as suggested by Andi Kleen.
>
> Performance impact:
> No significant mutex performance improvement is expected from this patch,
> though it probably brings some benefit in scenarios with severe lock
> contention on many architectures.  Whole-system performance can benefit
> from this modification, because it eliminates the unnecessary "read for
> ownership" requests that stress the cache system by broadcasting cache
> line invalidations during spinning.
> Meanwhile, it may cause a tiny performance regression in lock holder
> handover when the lock is acquired via spinning, because the lock state
> is checked before acquiring the lock via trylock.
>
> A similar mechanism has already been implemented for pthread spin locks.
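In other words, a minimal stand-alone sketch of the read-only spin idea, using C11 atomics in place of glibc's internal LLL_MUTEX_* and atomic_* macros; the function name, the spin budget, and the keep-spinning fallback below are invented for illustration and are not part of the patch:

#include <stdatomic.h>

/* Stand-in for the glibc.mutex.spin_count tunable.  */
#define MAX_SPIN_COUNT 100

static void
adaptive_spin_lock (atomic_int *lock)
{
  int cnt = 0;

  for (;;)
    {
      /* Pay for the read-for-ownership cmpxchg only on the first attempt
         or after the lock was last observed to be free.  */
      int expected = 0;
      if (atomic_compare_exchange_weak (lock, &expected, 1))
        return;                 /* Lock acquired.  */

      if (cnt >= MAX_SPIN_COUNT)
        cnt = 0;                /* glibc would block in the kernel here.  */

      /* Read-only spinning: relaxed loads keep the cache line in shared
         state instead of bouncing it between cores on every cmpxchg.  */
      while (atomic_load_explicit (lock, memory_order_relaxed) != 0
             && ++cnt < MAX_SPIN_COUNT)
        ;                       /* A pause hint (atomic_spin_nop) belongs here.  */
    }
}

Only the thread that actually observes the lock as free goes back to the cmpxchg, which is the cache-traffic reduction the quoted description is after.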
>
> Test machine:
> 2-socket Skylake platform, 112 cores with 62G RAM
>
> Test case: mutex-adaptive-thread (contended pthread adaptive spin mutex
> with global update)
> Usage: make bench BENCHSET=mutex-adaptive-thread
> Test result:
> +----------------+-----------------+-----------------+------------+
> | Configuration  |      Base       |      Head       |  % Change  |
> |                | Total iteration | Total iteration | base->head |
> +----------------+-----------------+-----------------+------------+
> |                |           Critical section size: 1x            |
> +----------------+------------------------------------------------+
> |1 thread        | 2.76681e+07     | 2.7965e+07      |  +1.1%     |
> |2 threads       | 3.29905e+07     | 3.55279e+07     |  +7.7%     |
> |3 threads       | 4.38102e+07     | 3.98567e+07     |  -9.0%     |
> |4 threads       | 1.72172e+07     | 2.09498e+07     |  +21.7%    |
> |28 threads      | 1.03732e+07     | 1.05133e+07     |  +1.4%     |
> |56 threads      | 1.06308e+07     | 5.06522e+07     |  +14.6%    |
> |112 threads     | 8.55177e+06     | 1.02954e+07     |  +20.4%    |
> +----------------+------------------------------------------------+
> |                |          Critical section size: 10x            |
> +----------------+------------------------------------------------+
> |1 thread        | 1.57006e+07     | 1.54727e+07     |  -1.5%     |
> |2 threads       | 1.8044e+07      | 1.75601e+07     |  -2.7%     |
> |3 threads       | 1.35634e+07     | 1.46384e+07     |  +7.9%     |
> |4 threads       | 1.21257e+07     | 1.32046e+07     |  +8.9%     |
> |28 threads      | 8.09593e+06     | 1.02713e+07     |  +26.9%    |
> |56 threads      | 9.09907e+06     | 4.16203e+07     |  +16.4%    |
> |112 threads     | 7.09731e+06     | 8.62406e+06     |  +21.5%    |
> +----------------+------------------------------------------------+
> |                |          Critical section size: 100x           |
> +----------------+------------------------------------------------+
> |1 thread        | 2.87116e+06     | 2.89188e+06     |  +0.7%     |
> |2 threads       | 2.23409e+06     | 2.24216e+06     |  +0.4%     |
> |3 threads       | 2.29888e+06     | 2.29964e+06     |  +0.0%     |
> |4 threads       | 2.26898e+06     | 2.21394e+06     |  -2.4%     |
> |28 threads      | 1.03228e+06     | 1.0051e+06      |  -2.6%     |
> |56 threads      | 1.02953e+06     | 1.6344e+07      |  -2.3%     |
> |112 threads     | 1.01615e+06     | 1.00134e+06     |  -1.5%     |
> +----------------+------------------------------------------------+
> |                |         Critical section size: 1000x           |
> +----------------+------------------------------------------------+
> |1 thread        | 316392          | 315635          |  -0.2%     |
> |2 threads       | 302806          | 303469          |  +0.2%     |
> |3 threads       | 298506          | 294281          |  -1.4%     |
> |4 threads       | 292037          | 289945          |  -0.7%     |
> |28 threads      | 155188          | 155250          |  +0.0%     |
> |56 threads      | 190657          | 183106          |  -4.0%     |
> |112 threads     | 210818          | 220342          |  +4.5%     |
> +----------------+-----------------+-----------------+------------+
>
> * nptl/pthread_mutex_lock.c: Optimize adaptive spin mutex
> * nptl/pthread_mutex_timedlock.c: Likewise
> * sysdeps/unix/sysv/linux/x86/pthread_mutex_lock.c: Enable read only
>   while spinning for the x86 architecture
> * sysdeps/unix/sysv/linux/x86/pthread_mutex_cond_lock.c: Likewise
> * sysdeps/unix/sysv/linux/x86/pthread_mutex_timedlock.c: Likewise
>
> ChangeLog:
> V5->V6:
>  no change
>
> V4->V5:
>  a) Make the optimization work for pthread_mutex_timedlock() on the x86
>     architecture.
>  b) Move the READ_ONLY_SPIN macro definition from this patch to the
>     first patch, which adds the glibc.mutex.spin_count tunable entry.
>
> V3->V4:
>  a) Make the optimization opt-in, and enable it for the x86 architecture
>     by default, as suggested by Florian Weimer.
>
> V2->V3:
>  a) Drop the idea of blocking spinners if they fail to acquire the lock,
>     since this idea would not be a universal winner.  E.g. when
>     several threads contend for a lock that protects a small critical
>     section, probably any thread can acquire the lock via spinning.
>  b) Fix the format issue AFAIC
>
> V1->V2: fix format issue
>
> Suggested-by: Andi Kleen
> Signed-off-by: Kemi Wang
> ---
>  nptl/pthread_mutex_lock.c                             | 15 +++++++++++++++
>  nptl/pthread_mutex_timedlock.c                        | 15 +++++++++++++++
>  sysdeps/unix/sysv/linux/x86/pthread_mutex_cond_lock.c |  1 +
>  sysdeps/unix/sysv/linux/x86/pthread_mutex_lock.c      |  1 +
>  sysdeps/unix/sysv/linux/x86/pthread_mutex_timedlock.c |  1 +
>  5 files changed, 33 insertions(+)
>
> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> index 1519c14..9a7b5c2 100644
> --- a/nptl/pthread_mutex_lock.c
> +++ b/nptl/pthread_mutex_lock.c
> @@ -133,7 +139,16 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
>                    LLL_MUTEX_LOCK (mutex);
>                    break;
>                  }
> +#ifdef READ_ONLY_SPIN
> +              do
> +                {
> +                  atomic_spin_nop ();
> +                  val = atomic_load_relaxed (&mutex->__data.__lock);
> +                }
> +              while (val != 0 && ++cnt < max_cnt);
> +#else
>                atomic_spin_nop ();
> +#endif
>              }
>            while (LLL_MUTEX_TRYLOCK (mutex) != 0);
>
> diff --git a/nptl/pthread_mutex_timedlock.c b/nptl/pthread_mutex_timedlock.c
> index 66efd39..dcaeca8 100644
> --- a/nptl/pthread_mutex_timedlock.c
> +++ b/nptl/pthread_mutex_timedlock.c
> @@ -126,7 +132,16 @@ __pthread_mutex_timedlock (pthread_mutex_t *mutex,
>                                        PTHREAD_MUTEX_PSHARED (mutex));
>                    break;
>                  }
> +#ifdef READ_ONLY_SPIN
> +              do
> +                {
> +                  atomic_spin_nop ();
> +                  val = atomic_load_relaxed (&mutex->__data.__lock);
> +                }
> +              while (val != 0 && ++cnt < max_cnt);
> +#else
>                atomic_spin_nop ();
> +#endif
>              }
>            while (lll_trylock (mutex->__data.__lock) != 0);

Please define a generic version:

static inline void __always_inline
atomic_spin_nop_with_limit (int cnt, int max_cnt, pthread_mutex_t *mutex)
{
  atomic_spin_nop ();
}

and an x86 version:

static inline void __always_inline
atomic_spin_nop_with_limit (int cnt, int max_cnt, pthread_mutex_t *mutex)
{
  int val;

  do
    {
      atomic_spin_nop ();
      val = atomic_load_relaxed (&mutex->__data.__lock);
    }
  while (val != 0 && ++cnt < max_cnt);
}

in a standalone patch.

-- 
H.J.
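For concreteness, a stand-alone sketch of that generic/x86 split, with C11 atomics standing in for glibc's internal macros. The helper here returns the updated spin count so the caller's counter advances across iterations (a small deviation from the signature sketched above, where cnt is passed by value); all names are invented for this example rather than taken from the patch:

#include <stdatomic.h>

/* Generic variant: a single pause per attempt (the pause itself, glibc's
   atomic_spin_nop, is elided in this sketch).  */
static inline int
spin_step_generic (int cnt, int max_cnt, atomic_int *lock)
{
  (void) max_cnt;
  (void) lock;
  return cnt + 1;
}

/* x86-style variant: re-read the lock word with relaxed loads until it
   looks free or the spin budget runs out.  */
static inline int
spin_step_read_only (int cnt, int max_cnt, atomic_int *lock)
{
  int val;

  do
    val = atomic_load_explicit (lock, memory_order_relaxed);
  while (val != 0 && ++cnt < max_cnt);
  return cnt;
}

A per-architecture header would then select which variant backs the helper, and the adaptive loops in __pthread_mutex_lock and __pthread_mutex_timedlock could call it unconditionally in place of the #ifdef READ_ONLY_SPIN block.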