Subject: Re: [PATCH 2/3] Mutex: Only read while spinning
To: kemi, libc-alpha@sourceware.org
Cc: Andi Kleen, "Chen, Tim C", Dave, "ying.huang", aaron, aubrey
References: <1522394093-9835-1-git-send-email-kemi.wang@intel.com>
 <1522394093-9835-2-git-send-email-kemi.wang@intel.com>
From: Adhemerval Zanella
Message-ID: <14c4945b-f9f8-51b6-32c9-ce3049493dab@linaro.org>
Date: Mon, 09 Apr 2018 20:52:00 -0000

On 08/04/2018 05:28, kemi wrote:
>
>
> On 06/04/2018 04:55, Adhemerval Zanella wrote:
>>
>>
>> On 30/03/2018 04:14, Kemi Wang wrote:
>>> The pthread adaptive spin mutex spins on the lock for a while before
>>> going to sleep. While the lock is contended and we need to wait, going
>>> straight back to LLL_MUTEX_TRYLOCK (cmpxchg) is not a good idea on many
>>> targets, as that forces expensive memory synchronization among
>>> processors and penalizes other running threads. For example, it
>>> constantly floods the system with "read for ownership" requests, which
>>> are much more expensive to process than a single read. Thus, we only
>>> use an MO read until we observe that the lock is no longer acquired, as
>>> suggested by Andi Kleen.
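To make the point about RFO traffic concrete, the difference between the
two spin styles looks roughly like this (a sketch using generic C11
atomics, not the glibc internals; the function names are mine):

---
#include <stdatomic.h>

/* Write-style spin: every cmpxchg attempt issues a read-for-ownership,
   pulling the lock's cache line into exclusive state even while the
   lock is still held by another thread.  */
static void
spin_cmpxchg_style (atomic_int *lock)
{
  int expected;
  do
    expected = 0;
  while (!atomic_compare_exchange_weak (lock, &expected, 1));
}

/* Read-style spin: wait with plain relaxed loads (the cache line stays
   in shared state) and only attempt the expensive atomic operation once
   the lock has been observed free.  */
static void
spin_load_style (atomic_int *lock)
{
  for (;;)
    {
      while (atomic_load_explicit (lock, memory_order_relaxed) != 0)
        ;  /* A pause/spin hint would go here.  */
      int expected = 0;
      if (atomic_compare_exchange_weak (lock, &expected, 1))
        break;
    }
}
---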
>>> Test machine:
>>> 2-socket Skylake platform, 112 cores with 62G RAM
>>>
>>> Test case: contended pthread adaptive spin mutex with global update.
>>> Each thread of the workload does:
>>> a) Lock the mutex (adaptive spin type)
>>> b) Increment a global variable
>>> c) Unlock the mutex
>>> in a loop until timeout, and the main thread reports the total
>>> iteration count of all the threads in one second.
>>>
>>> This test case is the same as will-it-scale.pthread_mutex3 except that
>>> the mutex type is modified to PTHREAD_MUTEX_ADAPTIVE_NP.
>>> github: https://github.com/antonblanchard/will-it-scale.git
>>>
>>> nr_threads   base       head(SPIN_COUNT=10)   head(SPIN_COUNT=1000)
>>> 1            51644585   51307573 (-0.7%)      51323778 (-0.6%)
>>> 2            7914789    10011301 (+26.5%)     9867343 (+24.7%)
>>> 7            1687620    4224135 (+150.3%)     3430504 (+103.3%)
>>> 14           1026555    3784957 (+268.7%)     1843458 (+79.6%)
>>> 28           962001     2886885 (+200.1%)     681965 (-29.1%)
>>> 56           883770     2740755 (+210.1%)     364879 (-58.7%)
>>> 112          1150589    2707089 (+135.3%)     415261 (-63.9%)
>>
>> In pthread_mutex3 it is basically more updates of a global variable
>> synchronized with a mutex, so if I am reading the benchmark correctly, a
>> higher value means less contention. I also assume you used the 'threads'
>> value in this table.
>>
>
> Yes. I used multi-threads mode for testing.
>
>> I checked on a 64-core aarch64 machine to see what kind of improvement,
>> if any, one would get with this change:
>>
>> nr_threads   base       head(SPIN_COUNT=10)    head(SPIN_COUNT=1000)
>> 1            27566206   28778254 (4.211680)    28778467 (4.212389)
>> 2            8498813    7777589 (-9.273105)    7806043 (-8.874791)
>> 7            5019434    2869629 (-74.915782)   3307812 (-51.744839)
>> 14           4379155    2906255 (-50.680343)   2825041 (-55.012087)
>> 28           4397464    3261094 (-34.846282)   3259486 (-34.912805)
>> 56           4020956    3898386 (-3.144122)    4038505 (0.434542)
>>
>
> Thanks for testing it on an aarch64 machine; that is a great supplement
> for us.
>
>> So I think this change should be platform-specific.
>>
>
> It may depend on the platform, especially on the cmpxchg, MO read, and
> pause instructions.

Yes, that's why, for this patch in isolation, it would be better to make
the adaptive spinning algorithm more platform-specific.
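The busy-wait hint is the obvious per-architecture piece; a sketch of the
kind of knob I mean (illustrative only, not glibc's actual sysdeps code,
and cpu_relax is a made-up name):

---
/* Per-architecture spin-wait hint: the generic spinning algorithm could
   stay shared while each target picks the hint (and possibly the whole
   spin policy) that suits its memory system.  */
static inline void
cpu_relax (void)
{
#if defined __x86_64__ || defined __i386__
  __asm__ __volatile__ ("pause");          /* Cheap on x86, eases SMT.  */
#elif defined __aarch64__
  __asm__ __volatile__ ("yield");          /* ARMv8 spin-wait hint.  */
#else
  __asm__ __volatile__ ("" ::: "memory");  /* Plain compiler barrier.  */
#endif
}
---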
>
> Well, I think the performance also depends on the spin count. A simple
> explanation is that spinning always fails under severe lock contention
> with large thread counts (this fits case b) below); thus the overhead
> grows with the spin count: the larger the spin count, the more overhead.
> It does not surprise me to see a performance regression with
> SPIN_COUNT=1000, but I am surprised that performance does not increase
> with the smaller SPIN_COUNT compared to the larger one. Did you run
> "export LD_SPIN_COUNT=10" before the second round of tests?

I used the preferred way, GLIBC_TUNABLES="glibc.mutex.spin_count=N" (which
for the current patch set version is the same). I redid the numbers to
check for some mishandling on my part, but the numbers are still not good
enough to set this as the default (even more so given that 1000 is the
default value for spin_count).

nr_threads   base       head(SPIN_COUNT=10)     head(SPIN_COUNT=1000)
1            27563021   28775107 (4.212273)     28775127 (4.212339)
2            8261443    9044577 (8.658603)      7792347 (-6.019958)
7            4697529    4623052 (-1.610992)     3241736 (-44.907821)
14           4951300    5348550 (7.427247)      2854946 (-73.428849)
28           4296920    4551240 (5.587928)      3231684 (-32.962257)
56           4182013    3420937 (-22.247589)    3769944 (-10.930375)

>
> Analysis:
> Let's assume the time spent in the critical section is *c*, and the time
> spent spinning on the lock is *s*. Then s = k*c, where *k* is a factor.
>
> a) If *k* < 1, which means the spinning time is less than the time the
> task spends in the critical section, it is easy to see that the overhead
> of acquiring the lock equals spin+wait+wake, because the spin always
> fails due to timeout (the large critical section case).
>
> b) If *k* > 1 and the number of threads *n* >= *k* (small critical
> section case): in our case the threads do nothing but compete for the
> lock to enter the critical section in a loop, so we can simply assume
> that the arrival rate at the critical section for each thread is 1.
> And, since all the threads start at the same time, we can assume all the
> threads compete for the lock simultaneously at time frame T(0):
>
> T(0)   task(0) <-- task(1) <-- task(2) ... task(n)   // original status
> |
> T(1)   task(1) <-- task(2) ... task(n) <-- task(0)   // after time c, task(0) releases the lock and competes for it again
> .
> .
> .
> T(k)   task(k) <-- task(k+1) ... task(n) <-- task(0) ... task(k-1)
>
> After k*c time (time T(k)), task(k) gets the lock, and
> task(k+1)...task(n) call futex_wait() to block due to spin timeout; by
> then task(0) has been spinning for time (k-1)*c, task(1) for time
> (k-2)*c, ..., and task(k-1) for time c.
>
> T(k+1) task(k+1) <-- task(k+2) ... task(n) <-- task(0) ... task(k-1)
>                      sleep          sleep
>
> task(k) takes some time to wake up task(k+1), which spends time c in the
> critical section; this repeats again and again until the threads exit.
> Therefore, the spin count affects system performance significantly!
>
> c) If *k* > 1 and the number of threads *n* < *k* (small critical
> section case): theoretically, each thread can get the lock via spinning
> without blocking. No wait/wake is needed, so we only have to consider
> the spin overhead.

I do agree that the spin count does play a role here, but for the machine
I am testing on, I think what is actually happening is that this patch
adds more branches in the critical loop section, and that hurts
performance.

Current spin loop version:
---
  do
    {
      if (cnt++ >= max_cnt)
        {
          LLL_MUTEX_LOCK (mutex);
          break;
        }
      atomic_spin_nop ();
    }
  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
---

This patch changes it to:
---
  do
    {
      if (cnt >= max_cnt)
        {
          LLL_MUTEX_LOCK (mutex);
          break;
        }
      do
        {
          atomic_spin_nop ();
        }
      while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
             ++cnt < max_cnt);
    }
  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
---

While patch 3 in the set changes it to:
---
  do
    {
      atomic_spin_nop ();
    }
  while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
         ++cnt < max_cnt);

  if (LLL_MUTEX_TRYLOCK (mutex) != 0)
    LLL_MUTEX_LOCK (mutex);
---

That's why I suggested for the next version to squash the second and third
patches into a single one aimed at optimizing the adaptive spinning
algorithm. Also, it would be better if you could provide a benchmark to
stress this point without resorting to an external one (a rough sketch of
what I have in mind follows after the quoted patch below).

>
>>>
>>> Suggested-by: Andi Kleen
>>> Signed-off-by: Kemi Wang
>>> ---
>>>  nptl/pthread_mutex_lock.c | 23 +++++++++++++++--------
>>>  1 file changed, 15 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
>>> index 1519c14..c3aca93 100644
>>> --- a/nptl/pthread_mutex_lock.c
>>> +++ b/nptl/pthread_mutex_lock.c
>>> @@ -26,6 +26,7 @@
>>>  #include
>>>  #include
>>>  #include
>>> +#include
>>>
>>>  #ifndef lll_lock_elision
>>>  #define lll_lock_elision(lock, try_lock, private) ({ \
>>> @@ -124,16 +125,22 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
>>>    if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>>>      {
>>>        int cnt = 0;
>>> -      int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
>>> -                         mutex->__data.__spins * 2 + 10);
>>> +      int max_cnt = MIN (__mutex_aconf.spin_count,
>>> +                         mutex->__data.__spins * 2 + 100);
>>>        do
>>>          {
>>> -          if (cnt++ >= max_cnt)
>>> -            {
>>> -              LLL_MUTEX_LOCK (mutex);
>>> -              break;
>>> -            }
>>> -          atomic_spin_nop ();
>>> +          if (cnt >= max_cnt)
>>> +            {
>>> +              LLL_MUTEX_LOCK (mutex);
>>> +              break;
>>> +            }
>>> +          /* MO read while spinning */
>>> +          do
>>> +            {
>>> +              atomic_spin_nop ();
>>> +            }
>>> +          while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
>>> +                 ++cnt < max_cnt);
>>>          }
>>>        while (LLL_MUTEX_TRYLOCK (mutex) != 0);
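As for the in-tree benchmark mentioned above, something along these lines
would do; a rough, untested sketch (thread count, duration, and the
reporting format are placeholders, not an actual glibc benchtests
harness):

---
#define _GNU_SOURCE
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

#define NTHREADS 8   /* Placeholder; sweep this like nr_threads above.  */
#define DURATION 1   /* Seconds to run, as in the reported numbers.  */

static pthread_mutex_t lock;
static unsigned long counter;  /* The contended global update.  */
static atomic_int stop;

static void *
worker (void *arg)
{
  unsigned long *iters = arg;
  while (!atomic_load_explicit (&stop, memory_order_relaxed))
    {
      pthread_mutex_lock (&lock);    /* a) lock (adaptive spin type) */
      counter++;                     /* b) global variable increment */
      pthread_mutex_unlock (&lock);  /* c) unlock */
      (*iters)++;
    }
  return NULL;
}

int
main (void)
{
  pthread_t threads[NTHREADS];
  unsigned long iters[NTHREADS] = { 0 };
  pthread_mutexattr_t attr;

  pthread_mutexattr_init (&attr);
  pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
  pthread_mutex_init (&lock, &attr);

  for (int i = 0; i < NTHREADS; i++)
    pthread_create (&threads[i], NULL, worker, &iters[i]);

  sleep (DURATION);
  atomic_store (&stop, 1);

  unsigned long total = 0;
  for (int i = 0; i < NTHREADS; i++)
    {
      pthread_join (threads[i], NULL);
      total += iters[i];
    }

  printf ("total iterations per second: %lu\n", total / DURATION);
  return 0;
}
---

Run with GLIBC_TUNABLES="glibc.mutex.spin_count=N" to compare spin counts,
as in the tables above.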