From: Adhemerval Zanella <adhemerval.zanella@linaro.org>
To: kemi <kemi.wang@intel.com>, libc-alpha@sourceware.org
Cc: Andi Kleen <andi.kleen@intel.com>,
"Chen, Tim C" <tim.c.chen@intel.com>,
Dave <dave.hansen@linux.intel.com>,
"ying.huang" <ying.huang@intel.com>, aaron <aaron.lu@intel.com>,
aubrey <aubrey.li@intel.com>
Subject: Re: [PATCH 2/3] Mutex: Only read while spinning
Date: Wed, 11 Apr 2018 13:28:00 -0000 [thread overview]
Message-ID: <0e0645be-2d2b-5275-6f36-7069152bfc87@linaro.org> (raw)
In-Reply-To: <11425ecd-b51e-8396-e3d7-612f7767c870@intel.com>
On 09/04/2018 22:46, kemi wrote:
>
>
> On 2018-04-10 04:52, Adhemerval Zanella wrote:
>>
>>
>> On 08/04/2018 05:28, kemi wrote:
>>>
>>>
>>> On 2018-04-06 04:55, Adhemerval Zanella wrote:
>>>>
>>>>
>>>> On 30/03/2018 04:14, Kemi Wang wrote:
>>>>> The pthread adaptive spin mutex spins on the lock for a while before going
>>>>> to sleep. While the lock is contended and we need to wait, going straight
>>>>> back to LLL_MUTEX_TRYLOCK (cmpxchg) is not a good idea on many targets, as
>>>>> that forces expensive memory synchronization among processors and
>>>>> penalizes other running threads. For example, it constantly floods the
>>>>> system with "read for ownership" requests, which are much more expensive to
>>>>> process than a single read. Thus, we use only an MO read until we observe
>>>>> the lock no longer being acquired, as suggested by Andi Kleen.
>>>>>
>>>>> Test machine:
>>>>> 2-socket Skylake platform, 112 cores with 62G RAM
>>>>>
>>>>> Test case: contended pthread adaptive spin mutex with a global update.
>>>>> Each thread of the workload does:
>>>>> a) Lock the mutex (adaptive spin type)
>>>>> b) Global variable increment
>>>>> c) Unlock the mutex
>>>>> in a loop until timeout, and the main thread reports the total iteration
>>>>> number of all the threads in one second.
>>>>>
>>>>> This test case is the same as Will-it-scale.pthread_mutex3 except that the mutex type is
>>>>> modified to PTHREAD_MUTEX_ADAPTIVE_NP.
>>>>> github: https://github.com/antonblanchard/will-it-scale.git
>>>>>
>>>>> nr_threads base head(SPIN_COUNT=10) head(SPIN_COUNT=1000)
>>>>> 1 51644585 51307573(-0.7%) 51323778(-0.6%)
>>>>> 2 7914789 10011301(+26.5%) 9867343(+24.7%)
>>>>> 7 1687620 4224135(+150.3%) 3430504(+103.3%)
>>>>> 14 1026555 3784957(+268.7%) 1843458(+79.6%)
>>>>> 28 962001 2886885(+200.1%) 681965(-29.1%)
>>>>> 56 883770 2740755(+210.1%) 364879(-58.7%)
>>>>> 112 1150589 2707089(+135.3%) 415261(-63.9%)
>>>>
>>>> In pthread_mutex3 it is basically more updates to a global variable synchronized
>>>> with a mutex, so if I am reading the benchmark correctly, a higher value means
>>>> less contention. I also assume you used the 'threads' value in this table.
>>>>
>>>
Yes. I used the multi-threaded mode for testing.
>>>
>>>> I checked on a 64-core aarch64 machine to see what kind of improvement,
>>>> if any, one would get with this change:
>>>>
>>>> nr_threads base head(SPIN_COUNT=10) head(SPIN_COUNT=1000)
>>>> 1 27566206 28778254 (4.211680) 28778467 (4.212389)
>>>> 2 8498813 7777589 (-9.273105) 7806043 (-8.874791)
>>>> 7 5019434 2869629 (-74.915782) 3307812 (-51.744839)
>>>> 14 4379155 2906255 (-50.680343) 2825041 (-55.012087)
>>>> 28 4397464 3261094 (-34.846282) 3259486 (-34.912805)
>>>> 56 4020956 3898386 (-3.144122) 4038505 (0.434542)
>>>>
>>>
>>> Thanks for taking the time to test it on an aarch64 machine. That's a great supplement for us.
>>>
>>>> So I think this change should be platform-specific.
>>>>
>>>
>>> It may depend on the platform, especially the cmpxchg, MO read, and pause instructions.
>>
>> Yes, that's why for this patch in isolation it would be better to make the adaptive algorithm
>> more platform-specific.
>>
>>>
>>> Well, I think the performance also depends on the spin count. A simple explanation: under
>>> severe lock contention with large thread numbers (case b) below), spinning always fails,
>>> so the overhead grows with the spin count: the larger the spin count, the more overhead.
>>> It does not surprise me to see a performance regression with SPIN_COUNT=1000, but I am surprised
>>> that the performance does not increase with the smaller SPIN_COUNT compared to the larger one.
>>> Did you run "export LD_SPIN_COUNT=10" before the second round of tests?
>>
>> I used the preferred way, GLIBC_TUNABLES="glibc.mutex.spin_count=N" (which for the
>> current patch set version is the same). I redid the numbers to check for any
>> mishandling on my part, but the numbers are still not good enough to set this as the
>> default (even more so given that 1000 is the default value for spin_count).
>>
>> nr_threads base head(SPIN_COUNT=10) head(SPIN_COUNT=1000)
>> 1 27563021 28775107 (4.212273) 28775127 (4.212339)
>> 2 8261443 9044577 (8.658603) 7792347 (-6.019958)
>> 7 4697529 4623052 (-1.610992) 3241736 (-44.907821)
>> 14 4951300 5348550 (7.427247) 2854946 (-73.428849)
>> 28 4296920 4551240 (5.587928) 3231684 (-32.962257)
>> 56 4182013 3420937 (-22.247589) 3769944 (-10.930375)
>>
>>
>>>
>>> Analysis:
>>> Let's assume the time spent in the critical section is *c*, and the time spent spinning on the lock is *s*.
>>> Then s = k*c, where *k* is a factor.
>>>
>>> a) If *k* < 1, which means the spinning time is less than the time the task spends
>>> in the critical section, it is easy to see that the overhead of acquiring the lock
>>> equals spin+wait+wake, because the spin always fails due to timeout (the large-critical-section case).
>>>
>>> b) If *k* > 1 && thread number *n* >= *k* (small critical section case):
>>> In our case, the threads do nothing but compete for the lock to enter the critical section in a loop, so we
>>> can simply assume that the critical-section arrival rate for each thread is 1. And since all the threads start
>>> at the same time, we can assume all the threads compete for the lock simultaneously at time frame T(0).
>>>
>>> T(0) task(0) <-- task(1) <-- task(2) ... task(n) // original status
>>> |
T(1) task(1) <-- task(2) ... task(n) <-- task(0) // after time c, task 0 releases the lock and competes for it again
>>> .
>>> .
>>> .
>>> T(k) task(k) <-- task(k+1) ... task(n) <-- task(0) ... task(k-1)
>>>
>>> After time k*c (time T(k)), task(k) gets the lock and task(k+1)...task(n) call
>>> futex_wait() to block due to timeout; task(0) has been spinning for time (k-1)*c, task(1)
>>> for time (k-2)*c, ... task(k-1) for time c.
>>>
>>> T(k+1) task(k+1) <-- task(k+2) ... task(n) <-- task(0) ... task(k-1)
>>>                      sleep          sleep
>>>
>>> task(k) would take some time to wake up task(k+1), which spends time c in the critical section; this repeats
>>> again and again until the threads exit. Therefore, the spin count affects system performance significantly!
>>>
>>> c) If *k* > 1 && thread number *n* < *k* (small critical section case):
>>> Theoretically, each thread can get the lock via spinning without blocking. No wait/wake is needed, so we
>>> need only consider the spin overhead.
>>
>> I do agree that the spin count plays a role here, but for the machine
>> I am testing on, I think what is actually happening is that this patch adds more
>> branches in the critical loop section, and that hurts performance:
>>
>> Current spin loop version:
>> ---
>>   do
>>     {
>>       if (cnt++ >= max_cnt)
>>         {
>>           LLL_MUTEX_LOCK (mutex);
>>           break;
>>         }
>>       atomic_spin_nop ();
>>     }
>>   while (LLL_MUTEX_TRYLOCK (mutex) != 0);
>> ---
>>
>> This patch changes it to:
>> ---
>>   do
>>     {
>>       if (cnt >= max_cnt)
>>         {
>>           LLL_MUTEX_LOCK (mutex);
>>           break;
>>         }
>>       do
>>         {
>>           atomic_spin_nop ();
>>         }
>>       while (atomic_load_relaxed (&mutex->__data.__lock) != 0
>>              && ++cnt < max_cnt);
>>     }
>>   while (LLL_MUTEX_TRYLOCK (mutex) != 0);
>> ---
>>
>> While patch 3 in the set changes it to:
>> ---
>>   do
>>     {
>>       atomic_spin_nop ();
>>     }
>>   while (atomic_load_relaxed (&mutex->__data.__lock) != 0
>>          && ++cnt < max_cnt);
>>
>>   if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>>     LLL_MUTEX_LOCK (mutex);
>> ---
>>
>> That's why I suggested squashing the second and third patches in the next version into a
>> single one aimed at optimizing the adaptive spinning algorithm.
>>
>
> Agreed, I will combine the 2nd and 3rd patches in V2.
>
>> Also, it would be better if you could provide a benchmark that stresses this point
>> without resorting to an external one.
>>
>
> I will consider doing that. Preliminarily, we can provide several test cases, each with a
> different critical-section size, that can be run in multi-threaded mode.
> What do you think?
I would use the pthread_mutex3/threads test from the project you used as a base; it is
simple enough and gives us a directly comparable metric. I would run it with
thread counts of 1, nproc/2, and nproc.
Thread overview: 14+ messages
2018-03-30 7:17 [PATCH 1/3] Tunables: Add tunables of spin count for adaptive spin mutex Kemi Wang
2018-03-30 7:17 ` [PATCH 3/3] Mutex: Avoid useless spinning Kemi Wang
2018-04-05 20:59 ` Adhemerval Zanella
2018-04-08 8:33 ` kemi
2018-03-30 7:17 ` [PATCH 2/3] Mutex: Only read while spinning Kemi Wang
2018-04-05 20:55 ` Adhemerval Zanella
2018-04-08 8:30 ` kemi
2018-04-09 20:52 ` Adhemerval Zanella
2018-04-10 1:49 ` kemi
2018-04-11 13:28 ` Adhemerval Zanella [this message]
2018-04-02 15:19 ` [PATCH 1/3] Tunables: Add tunables of spin count for adaptive spin mutex Adhemerval Zanella
2018-04-04 10:27 ` kemi
2018-04-04 17:17 ` Adhemerval Zanella
2018-04-05 1:11 ` Carlos O'Donell