Subject: Re: [PATCH v2 0/5] Add a new mutex type PTHREAD_MUTEX_QUEUESPINNER_NP
To: Adhemerval Zanella, Florian Weimer, Rical Jason, Carlos O'Donell, Glibc alpha
Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu, Lu Aubrey
References: <1531464752-18830-1-git-send-email-kemi.wang@intel.com>
From: kemi
Date: Thu, 02 Aug 2018 00:38:00 -0000
In-Reply-To: <1531464752-18830-1-git-send-email-kemi.wang@intel.com>

Hi, gentle maintainers,

Could we get this patch series reviewed? Thanks.

https://sourceware.org/ml/libc-alpha/2018-07/msg00354.html
https://sourceware.org/ml/libc-alpha/2018-07/msg00357.html
https://sourceware.org/ml/libc-alpha/2018-07/msg00355.html
https://sourceware.org/ml/libc-alpha/2018-07/msg00358.html
https://sourceware.org/ml/libc-alpha/2018-07/msg00356.html
https://sourceware.org/ml/libc-alpha/2018-07/msg00359.html

On 2018-07-13 14:52, Kemi Wang wrote:
> The pthread adaptive mutex is designed on the assumption that, in most
> cases, the lock will be released after only a few CPU cycles.
> This is especially true for well-behaved application code with a short,
> highly contended critical section. In that case, spinning on the lock for
> a while, instead of immediately calling into the kernel to block, helps
> improve system performance.
>
> But the current adaptive mutex has two main problems. The first is
> fairness: multiple spinners contend for the lock simultaneously, and
> there is no guarantee that a spinner will ever acquire the lock, no
> matter how long it has been waiting. The second is heavy cache line
> bouncing: since the cache line containing the lock is shared among all
> of the spinners, whenever the lock is released every spinner tries to
> acquire it with a cmpxchg instruction, constantly flooding the system
> with "read-for-ownership" requests. As a result, there is a lot of cache
> line bouncing on a large system with many CPUs.
>
> To address the problems mentioned above, this series queues mutex
> spinners with an MCS lock [1], following the kernel's mutex
> implementation [2], and introduces a new mutex type,
> PTHREAD_MUTEX_QUEUESPINNER_NP. Compared to the adaptive mutex (read-only
> while spinning), test results on a two-socket Intel Skylake platform
> show a significant performance improvement (see the first patch for
> details).
>
> Although queuing spinners with an MCS lock improves the performance of
> the adaptive mutex when multiple threads contend for a lock, people will
> probably want to know how the queue spinner mutex performs compared to
> other widely known lock disciplines, namely the pthread spin lock and
> the pthread mutex. The details are in the first patch; in summary:
> a) Under light lock contention, the spin lock performs best; the queue
> spinner mutex performs similarly to the adaptive mutex, and both perform
> a little better than the pthread mutex.
> b) Under severe lock contention with a large number of CPUs and a small
> critical section (less than 1000 ns), most lock acquisitions happen via
> spinning, and the queue spinner mutex performs much better than the spin
> lock and the adaptive mutex, because the overhead of heavy cache line
> bouncing plays a big role in lock performance.
> c) As the critical section grows, the advantage of the queue spinner
> mutex gradually shrinks. Cache line bouncing stops being the bottleneck
> of lock performance; instead, the overhead of futex_wait and futex_wake
> plays the big role. Once the critical section reaches 1 ms, even the
> latency of the futex syscall becomes negligible compared to the total
> time of lock acquisition.
>
> As we can see above, the queue spinner mutex performs well across these
> workloads, but using it carries a potential risk: when lock ownership is
> handed to the next spinner in the queue but that spinner is not running
> (its CPU has been scheduled to run another task), the remaining spinners
> have to keep waiting in the queue, which can collapse lock performance.
> To emulate this case, we run two identical processes simultaneously;
> each has 28 threads, and each thread sets its CPU affinity to an
> individual CPU according to its thread id, so CPUs 0-27 are each
> oversubscribed by two threads. In the worst case (s=1000 ns, t=6000 ns),
> lock throughput drops by 58.1% (2205245 -> 924263).
> Therefore, the queue spinner mutex should be used with care, for
> applications that pursue fairness and performance without
> oversubscribing CPU resources, e.g. containers in public cloud
> infrastructure.
>
> To overcome the lock performance collapse of the legacy MCS lock when
> CPUs are oversubscribed, the second patch proposes an enhanced MCS lock
> algorithm that allows a spinner to spin on the mutex lock without having
> to be woken up by its predecessor in the queue. With this change, the
> new mutex type can be used widely and safely in all kinds of usage
> scenarios.
>
> ChangeLog:
> V1->V2
> a) Propose an enhanced MCS lock algorithm to avoid the potential lock
> performance collapse.
> b) Add a link to the paper that introduces the original MCS lock.
> c) Add the commit id of the mutex implementation with MCS lock in the
> kernel.
>
> [1] Mellor-Crummey, John M., and Michael L. Scott. "Algorithms for
> scalable synchronization on shared-memory multiprocessors." ACM
> Transactions on Computer Systems (TOCS) 9.1 (1991): 21-65.
> https://www.cos.ufrj.br/~ines/courses/progpar01/papers/p21-mellor-crummey.pdf
> [2] Commit id for mutex with MCS lock in the kernel:
> 2bd2c92cf07cc4a373bf316c75b78ac465fefd35
>
> Kemi Wang (5):
>   Mutex: Queue spinners to reduce cache line bouncing and ensure
>     fairness
>   MCS lock: Enhance legacy MCS lock algorithm
>   Mutex: add unit tests for new type
>   BENCHMARK: add a benchmark for testing new type of mutex
>   Manual: Add manual for pthread mutex
>
>  benchtests/Makefile                          |  4 +-
>  benchtests/bench-mutex-adaptive-thread.c     |  8 ++-
>  benchtests/bench-mutex-queuespinner-thread.c | 21 +++++++
>  manual/Makefile                              |  2 +-
>  manual/mutex.texi                            | 68 +++++++++++++++++++++
>  nptl/Makefile                                |  8 +--
>  nptl/allocatestack.c                         |  2 +-
>  nptl/descr.h                                 | 26 ++++----
>  nptl/mcs_lock.c                              | 88 ++++++++++++++++++++++++++++
>  nptl/mcs_lock.h                              | 21 +++++++
>  nptl/nptl-init.c                             |  2 +-
>  nptl/pthreadP.h                              |  2 +-
>  nptl/pthread_mutex_init.c                    |  3 +-
>  nptl/pthread_mutex_lock.c                    | 40 ++++++++++++-
>  nptl/pthread_mutex_timedlock.c               | 35 ++++++++++-
>  nptl/pthread_mutex_trylock.c                 |  5 +-
>  nptl/pthread_mutex_unlock.c                  |  7 ++-
>  nptl/pthread_mutexattr_settype.c             |  2 +-
>  nptl/tst-initializers1.c                     | 11 ++--
>  nptl/tst-mutex5b.c                           |  2 +
>  nptl/tst-mutex7b.c                           |  2 +
>  sysdeps/nptl/bits/thread-shared-types.h      | 22 +++--
>  sysdeps/nptl/pthread.h                       | 15 +++--
>  sysdeps/unix/sysv/linux/hppa/pthread.h       |  4 ++
>  24 files changed, 350 insertions(+), 50 deletions(-)
>  create mode 100644 benchtests/bench-mutex-queuespinner-thread.c
>  create mode 100644 manual/mutex.texi
>  create mode 100644 nptl/mcs_lock.c
>  create mode 100644 nptl/mcs_lock.h
>  create mode 100644 nptl/tst-mutex5b.c
>  create mode 100644 nptl/tst-mutex7b.c
>