From mboxrd@z Thu Jan  1 00:00:00 1970
From: kemi
Subject: Re: [PATCH 2/3] Mutex: Only read while spinning
To: Adhemerval Zanella, libc-alpha@sourceware.org
Cc: Andi Kleen, "Chen, Tim C", Dave, "ying.huang", aaron, aubrey
Date: Tue, 10 Apr 2018 01:49:00 -0000
Message-ID: <11425ecd-b51e-8396-e3d7-612f7767c870@intel.com>
In-Reply-To: <14c4945b-f9f8-51b6-32c9-ce3049493dab@linaro.org>
References: <1522394093-9835-1-git-send-email-kemi.wang@intel.com>
 <1522394093-9835-2-git-send-email-kemi.wang@intel.com>
 <14c4945b-f9f8-51b6-32c9-ce3049493dab@linaro.org>
X-SW-Source: 2018-04/txt/msg00184.txt.bz2

On 2018-04-10 04:52, Adhemerval Zanella wrote:
> 
> 
> On 08/04/2018 05:28, kemi wrote:
>>
>>
>> On 2018-04-06 04:55, Adhemerval Zanella wrote:
>>>
>>>
>>> On 30/03/2018 04:14, Kemi Wang wrote:
>>>> The pthread adaptive spin mutex spins on the lock for a while before going
>>>> to sleep. While the lock is contended and we need to wait, going straight
>>>> back to LLL_MUTEX_TRYLOCK (cmpxchg) is not a good idea on many targets, as
>>>> that forces expensive memory synchronization among processors and
>>>> penalizes other running threads. For example, it constantly floods the
>>>> system with "read for ownership" requests, which are much more expensive to
>>>> process than a single read. Thus, we only use an MO read until we observe
>>>> that the lock is no longer held, as suggested by Andi Kleen.
>>>>
>>>> Test machine:
>>>> 2-socket Skylake platform, 112 cores with 62G RAM
>>>>
>>>> Test case: contended pthread adaptive spin mutex with a global update.
>>>> Each thread of the workload does:
>>>> a) Lock the mutex (adaptive spin type)
>>>> b) Increment a global variable
>>>> c) Unlock the mutex
>>>> in a loop until timeout, and the main thread reports the total iteration
>>>> count of all the threads over one second.
>>>>
>>>> This test case is the same as Will-it-scale.pthread_mutex3 except that the
>>>> mutex type is changed to PTHREAD_MUTEX_ADAPTIVE_NP.
>>>> github: https://github.com/antonblanchard/will-it-scale.git
>>>>
>>>> nr_threads    base        head(SPIN_COUNT=10)    head(SPIN_COUNT=1000)
>>>> 1             51644585    51307573 (-0.7%)       51323778 (-0.6%)
>>>> 2             7914789     10011301 (+26.5%)      9867343 (+24.7%)
>>>> 7             1687620     4224135 (+150.3%)      3430504 (+103.3%)
>>>> 14            1026555     3784957 (+268.7%)      1843458 (+79.6%)
>>>> 28            962001      2886885 (+200.1%)      681965 (-29.1%)
>>>> 56            883770      2740755 (+210.1%)      364879 (-58.7%)
>>>> 112           1150589     2707089 (+135.3%)      415261 (-63.9%)
>>>
>>> In pthread_mutex3 it is basically more updates to a global variable synchronized
>>> with a mutex, so if I am reading the benchmark correctly, a higher value means
>>> less contention. I also assume you use the 'threads' value in this table.
>>>
>>
>> Yes. I used multi-threads mode for testing.
>>
>>> I checked on a 64-core aarch64 machine to see what kind of improvement, if
>>> any, one would get with this change:
>>>
>>> nr_threads    base        head(SPIN_COUNT=10)      head(SPIN_COUNT=1000)
>>> 1             27566206    28778254 (4.211680)      28778467 (4.212389)
>>> 2             8498813     7777589 (-9.273105)      7806043 (-8.874791)
>>> 7             5019434     2869629 (-74.915782)     3307812 (-51.744839)
>>> 14            4379155     2906255 (-50.680343)     2825041 (-55.012087)
>>> 28            4397464     3261094 (-34.846282)     3259486 (-34.912805)
>>> 56            4020956     3898386 (-3.144122)      4038505 (0.434542)
>>>
>>
>> Thanks for taking the time to test it on an aarch64 machine. That is a great
>> supplement for us.
>>
>>> So I think this change should be platform-specific.
>>>
>>
>> It may depend on the platform, especially on the cmpxchg, MO read and pause
>> instructions.
> 
> Yes, that's why for this patch in isolation it would be better to make the adaptive
> algorithm more platform-specific.
> 
>>
>> Well, I think the performance also depends on the spin count. A simple explanation
>> is that spinning always fails under severe lock contention with a large number of
>> threads (case b) below), so the overhead grows with the spin count: the larger the
>> spin count, the more overhead. It does not surprise me to see a performance
>> regression with SPIN_COUNT=1000, but I am surprised that performance does not
>> improve with the smaller SPIN_COUNT compared to the larger one.
>> Did you run "export LD_SPIN_COUNT=10" before the second round of tests?
> 
> I used the preferred way, GLIBC_TUNABLES="glibc.mutex.spin_count=N" (which for the
> current patch set version is the same). I redid the numbers to check for any
> mishandling on my part, but the numbers are still not good enough to set this as
> the default (even more so given that 1000 is the default value for spin_count).
> 
> nr_threads    base        head(SPIN_COUNT=10)      head(SPIN_COUNT=1000)
> 1             27563021    28775107 (4.212273)      28775127 (4.212339)
> 2             8261443     9044577 (8.658603)       7792347 (-6.019958)
> 7             4697529     4623052 (-1.610992)      3241736 (-44.907821)
> 14            4951300     5348550 (7.427247)       2854946 (-73.428849)
> 28            4296920     4551240 (5.587928)       3231684 (-32.962257)
> 56            4182013     3420937 (-22.247589)     3769944 (-10.930375)
> 
> 
>>
>> Analysis:
>> Let's assume the time spent in the critical section is *c*, and the time spent
>> spinning on the lock is *s*. Then s = k*c, where *k* is a factor.
>>
>> a) If *k* < 1, which means the spinning time is less than the time a task spends
>> in the critical section, it is easy to see that the overhead of acquiring the lock
>> equals spin+wait+wake, because spinning always fails due to timeout (the case of a
>> large critical section).
>>
>> b) If *k* > 1 && thread number *n* >= *k* (small critical section case).
>> In our case the threads do nothing but compete for the lock to enter the critical
>> section in a loop, so we can simply assume that the arrival rate at the critical
>> section for each thread is 1. And since all the threads start at the same time, we
>> can assume they all compete for the lock simultaneously at time frame T(0).
>>
>> T(0)   task(0) <-- task(1) <-- task(2) ... task(n)        // original status
>>        |
>> T(1)   task(1) <-- task(2) ... task(n) <-- task(0)        // after time c, task 0 releases the lock and competes for it again
>> .
>> .
>> .
>> T(k)   task(k) <-- task(k+1) ... task(n) <-- task(0) ... task(k-1)
>>
>> After k*c time (time T(k)), task(k) gets the lock and task(k+1)...task(n) call
>> futex_wait() to block due to timeout; task(0) has been spinning for time (k-1)*c,
>> task(1) for time (k-2)*c, ..., task(k-1) for time c.
>>
>> T(k+1) task(k+1) <-- task(k+2) ... task(n) <-- task(0) ... task(k-1)
>>                      sleep           sleep
>>
>> task(k) then takes some time to wake up task(k+1), which spends time c in the
>> critical section; this repeats again and again until the threads exit. Therefore,
>> the spin count affects system performance significantly!
>>
>> c) If *k* > 1 && thread number *n* < *k* (small critical section case).
>> Theoretically, each thread can get the lock by spinning without blocking. No
>> wait/wake is needed, so we only have to consider the spin overhead.
> 
> I do agree that the spin count plays a role here, but for the machine I am testing
> on I think what is actually happening is that this patch adds more branches to the
> critical loop section, and that hurts performance:
> 
> Current spin loop version:
> ---
>   do
>     {
>       if (cnt++ >= max_cnt)
>         {
>           LLL_MUTEX_LOCK (mutex);
>           break;
>         }
>       atomic_spin_nop ();
>     }
>   while (LLL_MUTEX_TRYLOCK (mutex) != 0);
> ---
> 
> This patch changes it to:
> ---
>   do
>     {
>       if (cnt >= max_cnt)
>         {
>           LLL_MUTEX_LOCK (mutex);
>           break;
>         }
>       do {
>         atomic_spin_nop ();
>       }
>       while (atomic_load_relaxed (&mutex->__data.__lock) != 0
>              && ++cnt < max_cnt);
>     }
>   while (LLL_MUTEX_TRYLOCK (mutex) != 0);
> ---
> 
> While patch 3 in the set changes it to:
> ---
>   do
>     {
>       atomic_spin_nop ();
>     }
>   while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
>          ++cnt < max_cnt);
> 
>   if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>     LLL_MUTEX_LOCK (mutex);
> ---
> 
> That's why I suggested squashing the second and third patches for the next version
> into a single one aimed at optimizing the adaptive spinning algorithm.
> 

Agree, I will combine the 2nd and 3rd patches in V2.
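
The spin path of the combined patch would then look roughly like this (just a sketch
assembled from the snippets above plus the max_cnt change in the diff below; the exact
form in V2 may differ):

---
  if (LLL_MUTEX_TRYLOCK (mutex) != 0)
    {
      int cnt = 0;
      int max_cnt = MIN (__mutex_aconf.spin_count,
                         mutex->__data.__spins * 2 + 100);

      /* Spin on a relaxed MO read only; no cmpxchg while waiting.  */
      do
        atomic_spin_nop ();
      while (atomic_load_relaxed (&mutex->__data.__lock) != 0
             && ++cnt < max_cnt);

      /* One try with cmpxchg; if it still fails, block in the kernel.  */
      if (LLL_MUTEX_TRYLOCK (mutex) != 0)
        LLL_MUTEX_LOCK (mutex);
    }
---

That keeps the MO-read spinning from this patch and the single-branch loop structure
from patch 3.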

> Also, if you could provide a benchmark to stress this point without resorting to
> an external one, it would be better.
> 

I will consider doing that. Preliminarily, we can provide several test cases, each
with a different size of critical section, that can be run in multi-threaded mode.
What do you think?
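
Something like the rough sketch below is what I have in mind (all the names and
parameters here are placeholders, not a final proposal): run N threads on an adaptive
mutex for a fixed time, with the size of the critical section as a knob, and report
the aggregate iteration count.

---
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static pthread_mutex_t lock;
static volatile int stop;
static volatile unsigned long dummy;
static unsigned long niter;     /* total iterations, protected by lock */
static int cs_size;             /* knob: relative size of the critical section */

static void *
worker (void *arg)
{
  while (!stop)
    {
      pthread_mutex_lock (&lock);
      niter++;
      /* Stretch the critical section according to cs_size.  */
      for (int i = 0; i < cs_size; i++)
        dummy++;
      pthread_mutex_unlock (&lock);
    }
  return NULL;
}

int
main (int argc, char *argv[])
{
  int nthreads = argc > 1 ? atoi (argv[1]) : 4;
  cs_size = argc > 2 ? atoi (argv[2]) : 1;

  pthread_mutexattr_t attr;
  pthread_mutexattr_init (&attr);
  pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
  pthread_mutex_init (&lock, &attr);

  pthread_t tids[nthreads];
  for (int i = 0; i < nthreads; i++)
    pthread_create (&tids[i], NULL, worker, NULL);

  sleep (1);                    /* fixed run time of one second */
  stop = 1;

  for (int i = 0; i < nthreads; i++)
    pthread_join (tids[i], NULL);

  printf ("threads=%d cs_size=%d iterations=%lu\n", nthreads, cs_size, niter);
  return 0;
}
---

Build with "gcc -O2 -pthread". Varying cs_size moves the test between the
large-critical-section case a) and the small-critical-section cases b)/c) above, and
varying the thread count covers the different contention levels.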

>> 
>>>>
>>>> Suggested-by: Andi Kleen
>>>> Signed-off-by: Kemi Wang
>>>> ---
>>>>  nptl/pthread_mutex_lock.c | 23 +++++++++++++++--------
>>>>  1 file changed, 15 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
>>>> index 1519c14..c3aca93 100644
>>>> --- a/nptl/pthread_mutex_lock.c
>>>> +++ b/nptl/pthread_mutex_lock.c
>>>> @@ -26,6 +26,7 @@
>>>>  #include
>>>>  #include
>>>>  #include
>>>> +#include
>>>>
>>>>  #ifndef lll_lock_elision
>>>>  #define lll_lock_elision(lock, try_lock, private) ({ \
>>>> @@ -124,16 +125,22 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
>>>>    if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>>>>      {
>>>>        int cnt = 0;
>>>> -      int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
>>>> -                         mutex->__data.__spins * 2 + 10);
>>>> +      int max_cnt = MIN (__mutex_aconf.spin_count,
>>>> +                         mutex->__data.__spins * 2 + 100);
>>>>        do
>>>>          {
>>>> -          if (cnt++ >= max_cnt)
>>>> -            {
>>>> -              LLL_MUTEX_LOCK (mutex);
>>>> -              break;
>>>> -            }
>>>> -          atomic_spin_nop ();
>>>> +          if (cnt >= max_cnt)
>>>> +            {
>>>> +              LLL_MUTEX_LOCK (mutex);
>>>> +              break;
>>>> +            }
>>>> +          /* MO read while spinning */
>>>> +          do
>>>> +            {
>>>> +              atomic_spin_nop ();
>>>> +            }
>>>> +          while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
>>>> +                 ++cnt < max_cnt);
>>>>          }
>>>>        while (LLL_MUTEX_TRYLOCK (mutex) != 0);
>>>>
>>>>
>>>
>>>