From mboxrd@z Thu Jan  1 00:00:00 1970
From: kemi
Subject: Re: [PATCH 2/3] Mutex: Only read while spinning
To: Adhemerval Zanella, libc-alpha@sourceware.org
Cc: Andi Kleen, "Chen, Tim C", Dave, "ying.huang", aaron, aubrey
Date: Tue, 10 Apr 2018 01:49:00 -0000
Message-ID: <11425ecd-b51e-8396-e3d7-612f7767c870@intel.com>
In-Reply-To: <14c4945b-f9f8-51b6-32c9-ce3049493dab@linaro.org>
References: <1522394093-9835-1-git-send-email-kemi.wang@intel.com>
 <1522394093-9835-2-git-send-email-kemi.wang@intel.com>
 <14c4945b-f9f8-51b6-32c9-ce3049493dab@linaro.org>
X-SW-Source: 2018-04/txt/msg00184.txt.bz2

On 2018-04-10 04:52, Adhemerval Zanella wrote:
> 
> 
> On 08/04/2018 05:28, kemi wrote:
>>
>>
>> On 2018-04-06 04:55, Adhemerval Zanella wrote:
>>>
>>>
>>> On 30/03/2018 04:14, Kemi Wang wrote:
>>>> The pthread adaptive spin mutex spins on the lock for a while before going
>>>> to sleep. While the lock is contended and we need to wait, going straight
>>>> back to LLL_MUTEX_TRYLOCK (cmpxchg) is not a good idea on many targets, as
>>>> that forces expensive memory synchronization among processors and
>>>> penalizes other running threads. For example, it constantly floods the
>>>> system with "read for ownership" requests, which are much more expensive to
>>>> process than a single read. Thus, we only use an MO read until we observe
>>>> that the lock is no longer held, as suggested by Andi Kleen.
>>>>
>>>> Test machine:
>>>> 2-socket Skylake platform, 112 cores with 62G RAM
>>>>
>>>> Test case: contended pthread adaptive spin mutex with a global update.
>>>> Each thread of the workload does:
>>>> a) Lock the mutex (adaptive spin type)
>>>> b) Increment a global variable
>>>> c) Unlock the mutex
>>>> in a loop until timeout, and the main thread reports the total iteration
>>>> count of all the threads over one second.
>>>>
>>>> This test case is the same as Will-it-scale.pthread_mutex3 except that the
>>>> mutex type is changed to PTHREAD_MUTEX_ADAPTIVE_NP.
>>>> github: https://github.com/antonblanchard/will-it-scale.git
>>>>
>>>> nr_threads    base        head(SPIN_COUNT=10)    head(SPIN_COUNT=1000)
>>>> 1             51644585    51307573 (-0.7%)       51323778 (-0.6%)
>>>> 2             7914789     10011301 (+26.5%)      9867343 (+24.7%)
>>>> 7             1687620     4224135 (+150.3%)      3430504 (+103.3%)
>>>> 14            1026555     3784957 (+268.7%)      1843458 (+79.6%)
>>>> 28            962001      2886885 (+200.1%)      681965 (-29.1%)
>>>> 56            883770      2740755 (+210.1%)      364879 (-58.7%)
>>>> 112           1150589     2707089 (+135.3%)      415261 (-63.9%)
>>>
>>> In pthread_mutex3 it is basically more updates to a global variable synchronized
>>> with a mutex, so if I am reading the benchmark correctly, a higher value means
>>> less contention. I also assume you use the 'threads' value in this table.
>>>
>>
>> Yes. I used multi-threads mode for testing.
>>
>>> I checked on a 64-core aarch64 machine to see what kind of improvement, if
>>> any, one would get with this change:
>>>
>>> nr_threads    base        head(SPIN_COUNT=10)      head(SPIN_COUNT=1000)
>>> 1             27566206    28778254 (4.211680)      28778467 (4.212389)
>>> 2             8498813     7777589 (-9.273105)      7806043 (-8.874791)
>>> 7             5019434     2869629 (-74.915782)     3307812 (-51.744839)
>>> 14            4379155     2906255 (-50.680343)     2825041 (-55.012087)
>>> 28            4397464     3261094 (-34.846282)     3259486 (-34.912805)
>>> 56            4020956     3898386 (-3.144122)      4038505 (0.434542)
>>>
>>
>> Thanks for taking the time to test it on an aarch64 machine. That is a great
>> supplement for us.
>>
>>> So I think this change should be platform-specific.
>>>
>>
>> It may depend on the platform, especially on the cmpxchg, MO read and pause
>> instructions.
> 
> Yes, that's why for this patch in isolation it would be better to make the adaptive
> algorithm more platform-specific.
> 
>>
>> Well, I think the performance also depends on the spin count. A simple explanation
>> is that spinning always fails under severe lock contention with a large number of
>> threads (case b) below), so the overhead grows with the spin count: the larger the
>> spin count, the more overhead. It does not surprise me to see a performance
>> regression with SPIN_COUNT=1000, but I am surprised that performance does not
>> improve with the smaller SPIN_COUNT compared to the larger one.
>> Did you run "export LD_SPIN_COUNT=10" before the second round of tests?
> 
> I used the preferred way, GLIBC_TUNABLES="glibc.mutex.spin_count=N" (which for the
> current patch set version is the same). I redid the numbers to check for any
> mishandling on my part, but the numbers are still not good enough to set this as
> the default (even more so given that 1000 is the default value for spin_count).
> 
> nr_threads    base        head(SPIN_COUNT=10)      head(SPIN_COUNT=1000)
> 1             27563021    28775107 (4.212273)      28775127 (4.212339)
> 2             8261443     9044577 (8.658603)       7792347 (-6.019958)
> 7             4697529     4623052 (-1.610992)      3241736 (-44.907821)
> 14            4951300     5348550 (7.427247)       2854946 (-73.428849)
> 28            4296920     4551240 (5.587928)       3231684 (-32.962257)
> 56            4182013     3420937 (-22.247589)     3769944 (-10.930375)
> 
> 
>>
>> Analysis:
>> Let's assume the time spent in the critical section is *c*, and the time spent
>> spinning on the lock is *s*. Then s = k*c, where *k* is a factor.
>>
>> a) If *k* < 1, which means the spinning time is less than the time a task spends
>> in the critical section, it is easy to see that the overhead of acquiring the lock
>> equals spin+wait+wake, because spinning always fails due to timeout (the case of a
>> large critical section).
>>
>> b) If *k* > 1 && thread number *n* >= *k* (small critical section case).
>> In our case the threads do nothing but compete for the lock to enter the critical
>> section in a loop, so we can simply assume that the arrival rate at the critical
>> section for each thread is 1. And since all the threads start at the same time, we
>> can assume they all compete for the lock simultaneously at time frame T(0).
>>
>> T(0)   task(0) <-- task(1) <-- task(2) ... task(n)        // original status
>>        |
>> T(1)   task(1) <-- task(2) ... task(n) <-- task(0)        // after time c, task 0 releases the lock and competes for it again
>> .
>> .
>> .
>> T(k)   task(k) <-- task(k+1) ... task(n) <-- task(0) ... task(k-1)
>>
>> After k*c time (time T(k)), task(k) gets the lock and task(k+1)...task(n) call
>> futex_wait() to block due to timeout; task(0) has been spinning for time (k-1)*c,
>> task(1) for time (k-2)*c, ..., task(k-1) for time c.
>>
>> T(k+1) task(k+1) <-- task(k+2) ... task(n) <-- task(0) ... task(k-1)
>>                      sleep           sleep
>>
>> task(k) then takes some time to wake up task(k+1), which spends time c in the
>> critical section; this repeats again and again until the threads exit. Therefore,
>> the spin count affects system performance significantly!
>>
>> c) If *k* > 1 && thread number *n* < *k* (small critical section case).
>> Theoretically, each thread can get the lock by spinning without blocking. No
>> wait/wake is needed, so we only have to consider the spin overhead.
> 
> I do agree that the spin count plays a role here, but for the machine I am testing
> on I think what is actually happening is that this patch adds more branches to the
> critical loop section, and that hurts performance:
> 
> Current spin loop version:
> ---
>   do
>     {
>       if (cnt++ >= max_cnt)
>         {
>           LLL_MUTEX_LOCK (mutex);
>           break;
>         }
>       atomic_spin_nop ();
>     }
>   while (LLL_MUTEX_TRYLOCK (mutex) != 0);
> ---
> 
> This patch changes it to:
> ---
>   do
>     {
>       if (cnt >= max_cnt)
>         {
>           LLL_MUTEX_LOCK (mutex);
>           break;
>         }
>       do {
>         atomic_spin_nop ();
>       }
>       while (atomic_load_relaxed (&mutex->__data.__lock) != 0
>              && ++cnt < max_cnt);
>     }
>   while (LLL_MUTEX_TRYLOCK (mutex) != 0);
> ---
> 
> While patch 3 in the set changes it to:
> ---
>   do
>     {
>       atomic_spin_nop ();
>     }
>   while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
>          ++cnt < max_cnt);
> 
>   if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>     LLL_MUTEX_LOCK (mutex);
> ---
> 
> That's why I suggested squashing the second and third patches for the next version
> into a single one aimed at optimizing the adaptive spinning algorithm.
> 

Agree, I will combine the 2nd and 3rd patches in V2.
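
The spin path of the combined patch would then look roughly like this (just a sketch
assembled from the snippets above plus the max_cnt change in the diff below; the exact
form in V2 may differ):

---
  if (LLL_MUTEX_TRYLOCK (mutex) != 0)
    {
      int cnt = 0;
      int max_cnt = MIN (__mutex_aconf.spin_count,
                         mutex->__data.__spins * 2 + 100);

      /* Spin on a relaxed MO read only; no cmpxchg while waiting.  */
      do
        atomic_spin_nop ();
      while (atomic_load_relaxed (&mutex->__data.__lock) != 0
             && ++cnt < max_cnt);

      /* One try with cmpxchg; if it still fails, block in the kernel.  */
      if (LLL_MUTEX_TRYLOCK (mutex) != 0)
        LLL_MUTEX_LOCK (mutex);
    }
---

That keeps the MO-read spinning from this patch and the single-branch loop structure
from patch 3.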

> Also, if you could provide a benchmark to stress this point without resorting to
> an external one, it would be better.
> 

I will consider doing that. Preliminarily, we can provide several test cases, each
with a different size of critical section, that can be run in multi-threaded mode.
What do you think?
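
Something like the rough sketch below is what I have in mind (all the names and
parameters here are placeholders, not a final proposal): run N threads on an adaptive
mutex for a fixed time, with the size of the critical section as a knob, and report
the aggregate iteration count.

---
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static pthread_mutex_t lock;
static volatile int stop;
static volatile unsigned long dummy;
static unsigned long niter;     /* total iterations, protected by lock */
static int cs_size;             /* knob: relative size of the critical section */

static void *
worker (void *arg)
{
  while (!stop)
    {
      pthread_mutex_lock (&lock);
      niter++;
      /* Stretch the critical section according to cs_size.  */
      for (int i = 0; i < cs_size; i++)
        dummy++;
      pthread_mutex_unlock (&lock);
    }
  return NULL;
}

int
main (int argc, char *argv[])
{
  int nthreads = argc > 1 ? atoi (argv[1]) : 4;
  cs_size = argc > 2 ? atoi (argv[2]) : 1;

  pthread_mutexattr_t attr;
  pthread_mutexattr_init (&attr);
  pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
  pthread_mutex_init (&lock, &attr);

  pthread_t tids[nthreads];
  for (int i = 0; i < nthreads; i++)
    pthread_create (&tids[i], NULL, worker, NULL);

  sleep (1);                    /* fixed run time of one second */
  stop = 1;

  for (int i = 0; i < nthreads; i++)
    pthread_join (tids[i], NULL);

  printf ("threads=%d cs_size=%d iterations=%lu\n", nthreads, cs_size, niter);
  return 0;
}
---

Build with "gcc -O2 -pthread". Varying cs_size moves the test between the
large-critical-section case a) and the small-critical-section cases b)/c) above, and
varying the thread count covers the different contention levels.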

>> 
>>>>
>>>> Suggested-by: Andi Kleen
>>>> Signed-off-by: Kemi Wang
>>>> ---
>>>>  nptl/pthread_mutex_lock.c | 23 +++++++++++++++--------
>>>>  1 file changed, 15 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
>>>> index 1519c14..c3aca93 100644
>>>> --- a/nptl/pthread_mutex_lock.c
>>>> +++ b/nptl/pthread_mutex_lock.c
>>>> @@ -26,6 +26,7 @@
>>>>  #include
>>>>  #include
>>>>  #include
>>>> +#include
>>>>
>>>>  #ifndef lll_lock_elision
>>>>  #define lll_lock_elision(lock, try_lock, private) ({ \
>>>> @@ -124,16 +125,22 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
>>>>    if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>>>>      {
>>>>        int cnt = 0;
>>>> -      int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
>>>> -                         mutex->__data.__spins * 2 + 10);
>>>> +      int max_cnt = MIN (__mutex_aconf.spin_count,
>>>> +                         mutex->__data.__spins * 2 + 100);
>>>>        do
>>>>          {
>>>> -          if (cnt++ >= max_cnt)
>>>> -            {
>>>> -              LLL_MUTEX_LOCK (mutex);
>>>> -              break;
>>>> -            }
>>>> -          atomic_spin_nop ();
>>>> +          if (cnt >= max_cnt)
>>>> +            {
>>>> +              LLL_MUTEX_LOCK (mutex);
>>>> +              break;
>>>> +            }
>>>> +          /* MO read while spinning */
>>>> +          do
>>>> +            {
>>>> +              atomic_spin_nop ();
>>>> +            }
>>>> +          while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
>>>> +                 ++cnt < max_cnt);
>>>>          }
>>>>        while (LLL_MUTEX_TRYLOCK (mutex) != 0);
>>>>
>>>>
>>>
>>>