From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH 2/3] Mutex: Only read while spinning
To: kemi , libc-alpha@sourceware.org
Cc: Andi Kleen , "Chen, Tim C" , Dave , "ying.huang" , aaron , aubrey
References: <1522394093-9835-1-git-send-email-kemi.wang@intel.com> <1522394093-9835-2-git-send-email-kemi.wang@intel.com> <14c4945b-f9f8-51b6-32c9-ce3049493dab@linaro.org>
 <11425ecd-b51e-8396-e3d7-612f7767c870@intel.com>
From: Adhemerval Zanella
Message-ID: <0e0645be-2d2b-5275-6f36-7069152bfc87@linaro.org>
Date: Wed, 11 Apr 2018 13:28:00 -0000
In-Reply-To: <11425ecd-b51e-8396-e3d7-612f7767c870@intel.com>
Content-Type: text/plain; charset=utf-8

On 09/04/2018 22:46, kemi wrote:
>
>
> On 10/04/2018 04:52, Adhemerval Zanella wrote:
>>
>>
>> On 08/04/2018 05:28, kemi wrote:
>>>
>>>
>>> On 06/04/2018 04:55, Adhemerval Zanella wrote:
>>>>
>>>>
>>>> On 30/03/2018 04:14, Kemi Wang wrote:
>>>>> The pthread adaptive spin mutex spins on the lock for a while before going to
>>>>> sleep. While the lock is contended and we need to wait, going straight
>>>>> back to LLL_MUTEX_TRYLOCK (cmpxchg) is not a good idea on many targets, as
>>>>> that forces expensive memory synchronization among processors and
>>>>> penalizes other running threads. For example, it constantly floods the
>>>>> system with "read for ownership" requests, which are much more expensive to
>>>>> process than a single read. Thus, we only use an MO read until we observe
>>>>> that the lock is no longer acquired, as suggested by Andi Kleen.
>>>>>
>>>>> Test machine:
>>>>> 2-socket Skylake platform, 112 cores with 62G RAM
>>>>>
>>>>> Test case: Contended pthread adaptive spin mutex with global update
>>>>> Each thread of the workload does:
>>>>> a) Lock the mutex (adaptive spin type)
>>>>> b) Increment a global variable
>>>>> c) Unlock the mutex
>>>>> in a loop until timeout, and the main thread reports the total iteration
>>>>> count of all the threads after one second.
>>>>>
>>>>> This test case is the same as will-it-scale.pthread_mutex3 except that the
>>>>> mutex type is changed to PTHREAD_MUTEX_ADAPTIVE_NP.
>>>>> github: https://github.com/antonblanchard/will-it-scale.git
>>>>>
>>>>> nr_threads   base       head(SPIN_COUNT=10)   head(SPIN_COUNT=1000)
>>>>> 1            51644585   51307573 (-0.7%)      51323778 (-0.6%)
>>>>> 2            7914789    10011301 (+26.5%)     9867343 (+24.7%)
>>>>> 7            1687620    4224135 (+150.3%)     3430504 (+103.3%)
>>>>> 14           1026555    3784957 (+268.7%)     1843458 (+79.6%)
>>>>> 28           962001     2886885 (+200.1%)     681965 (-29.1%)
>>>>> 56           883770     2740755 (+210.1%)     364879 (-58.7%)
>>>>> 112          1150589    2707089 (+135.3%)     415261 (-63.9%)
>>>>
>>>> In pthread_mutex3 it is basically more updates to a global variable
>>>> synchronized with a mutex, so if I am reading the benchmark correctly, a
>>>> higher value means less contention. I also assume you use the 'threads'
>>>> value in this table.
>>>>
>>>
>>> Yes. I used multi-threads mode for testing.
>>>
>>>> I checked on a 64-core aarch64 machine to see what kind of improvement, if
>>>> any, one would get with this change:
>>>>
>>>> nr_threads   base       head(SPIN_COUNT=10)    head(SPIN_COUNT=1000)
>>>> 1            27566206   28778254 (4.211680)    28778467 (4.212389)
>>>> 2            8498813    7777589 (-9.273105)    7806043 (-8.874791)
>>>> 7            5019434    2869629 (-74.915782)   3307812 (-51.744839)
>>>> 14           4379155    2906255 (-50.680343)   2825041 (-55.012087)
>>>> 28           4397464    3261094 (-34.846282)   3259486 (-34.912805)
>>>> 56           4020956    3898386 (-3.144122)    4038505 (0.434542)
>>>>
>>>
>>> Thanks for taking the time to test it on an aarch64 machine; that is a
>>> great supplement for us.
>>>
>>>> So I think this change should be platform-specific.
>>>>
>>>
>>> It may depend on the platform, especially on the cmpxchg, MO read, and
>>> pause instructions.
>>
>> Yes, that's why for this patch in isolation it would be better to make the
>> adaptive algorithm more platform-specific.
>>
>>>
>>> Well, I think the performance also depends on the spin count. A simple
>>> explanation is that spinning always fails under severe lock contention with
>>> large thread numbers (case b) below); thus, the overhead increases with the
>>> spin count: the larger the spin count, the more overhead. It does not
>>> surprise me to see a performance regression with SPIN_COUNT=1000, but I am
>>> surprised that performance does not improve with a smaller SPIN_COUNT
>>> compared to a larger one. Did you run "export LD_SPIN_COUNT=10" before the
>>> second round of tests?
>>
>> I used the preferred way, GLIBC_TUNABLES="glibc.mutex.spin_count=N" (which
>> for the current patch set version is the same). I redid the numbers to check
>> for some mishandling on my part, but the numbers are still not good enough
>> to set this as the default (even more so since 1000 is the default spin
>> count).
>>
>> nr_threads   base       head(SPIN_COUNT=10)    head(SPIN_COUNT=1000)
>> 1            27563021   28775107 (4.212273)    28775127 (4.212339)
>> 2            8261443    9044577 (8.658603)     7792347 (-6.019958)
>> 7            4697529    4623052 (-1.610992)    3241736 (-44.907821)
>> 14           4951300    5348550 (7.427247)     2854946 (-73.428849)
>> 28           4296920    4551240 (5.587928)     3231684 (-32.962257)
>> 56           4182013    3420937 (-22.247589)   3769944 (-10.930375)
>>
>>
>>>
>>> Analysis:
>>> Let's assume the time spent in the critical section is *c*, and the time
>>> spent spinning on the lock is *s*. Then s = k*c, where *k* is a factor.
>>>
>>> a) If *k* < 1, which means the spinning time is less than the time a task
>>> spends in the critical section, it is easy to see that the overhead of
>>> acquiring the lock equals spin+wait+wake, because spinning always fails by
>>> timing out (the large-critical-section case).
>>>
>>> b) If *k* > 1 and the number of threads *n* >= *k* (small-critical-section
>>> case): in our case, the threads do nothing but compete for the lock to
>>> enter the critical section in a loop, so we can simply assume the arrival
>>> rate at the critical section for each thread is 1. Since all the threads
>>> start at the same time, we can assume they all compete for the lock
>>> simultaneously at time frame T(0):
>>>
>>> T(0)    task(0) <-- task(1) <-- task(2) ... task(n)    // original status
>>>         |
>>> T(1)    task(1) <-- task(2) ... task(n) <-- task(0)    // after time c, task(0) releases the lock and competes for it again
>>>         .
>>>         .
>>>         .
>>> T(k)    task(k) <-- task(k+1) ... task(n) <-- task(0) ... task(k-1)
>>>
>>> After time k*c (at T(k)), task(k) gets the lock and task(k+1)...task(n)
>>> call futex_wait() to block due to timeout, while task(0) has been spinning
>>> for time (k-1)*c, task(1) for time (k-2)*c, ..., task(k-1) for time c.
>>>
>>> T(k+1)  task(k+1) <-- task(k+2) ... task(n) <-- task(0) ... 
task(k-1)
>>>                       sleep ... sleep
>>>
>>> task(k) takes some time to wake up task(k+1), which spends time c in the
>>> critical section; this repeats again and again until the threads exit.
>>> Therefore, the spin count affects system performance significantly!
>>>
>>> c) If *k* > 1 and the number of threads *n* < *k* (small-critical-section
>>> case): theoretically, each thread can get the lock by spinning, without
>>> blocking. No wait/wake is needed, so we only have to consider the spin
>>> overhead.
>>
>> I do agree that the spin count does play a role here, but on the machine I
>> am testing I think what is actually happening is that this patch adds more
>> branches to the critical loop section, and that hurts performance:
>>
>> Current spin loop version:
>> ---
>>   do
>>     {
>>       if (cnt++ >= max_cnt)
>>         {
>>           LLL_MUTEX_LOCK (mutex);
>>           break;
>>         }
>>       atomic_spin_nop ();
>>     }
>>   while (LLL_MUTEX_TRYLOCK (mutex) != 0);
>> ---
>>
>> This patch changes it to:
>> ---
>>   do
>>     {
>>       if (cnt >= max_cnt)
>>         {
>>           LLL_MUTEX_LOCK (mutex);
>>           break;
>>         }
>>       do {
>>         atomic_spin_nop ();
>>       }
>>       while (atomic_load_relaxed (&mutex->__data.__lock) != 0
>>              && ++cnt < max_cnt);
>>     }
>>   while (LLL_MUTEX_TRYLOCK (mutex) != 0);
>> ---
>>
>> While patch 3 of the set changes it to:
>> ---
>>   do
>>     {
>>       atomic_spin_nop ();
>>     }
>>   while (atomic_load_relaxed (&mutex->__data.__lock) != 0
>>          && ++cnt < max_cnt);
>>
>>   if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>>     LLL_MUTEX_LOCK (mutex);
>> ---
>>
>> That's why I suggested for the next version to squash the second and third
>> patches into a single one aimed at optimizing the adaptive spinning
>> algorithm.
>>
>
> Agree, will combine the 2nd and 3rd patches in V2.
>
>> Also, if you could provide a benchmark to stress this point without
>> resorting to an external one, it would be better.
>>
>
> I will consider doing that. Preliminarily, we can provide several test cases,
> each of which has a different critical section size and can be run in
> multi-threads mode.
> What do you think?
I would use pthread_mutex3/threads from the project you used as a base; it is
simple enough and gives us a directly comparable metric. I would run it with
thread counts of 1, nproc/2, and nproc.