From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
References: <20220328084705.468207-1-wangyang.guo@intel.com>
From: Noah Goldstein
Date: Wed, 30 Mar 2022 14:39:40 -0500
Subject: Re: [PATCH v2] nptl: Add backoff mechanism to spinlock loop
To: "Guo, Wangyang"
Cc: GNU C Library
Content-Type: text/plain; charset="UTF-8"
List-Id: Libc-alpha mailing list
On Wed, Mar 30, 2022 at 6:45 AM Guo, Wangyang wrote:
>
> On 3/29/2022 12:41 AM, Noah Goldstein via Libc-alpha wrote:
> > On Mon, Mar 28, 2022 at 3:47 AM Wangyang Guo wrote:
> >>
> >> When multiple threads are waiting for the lock at the same time, once
> >> the lock owner releases the lock, the waiters all see the lock become
> >> available and all try to lock it, which may cause an expensive CAS
> >> storm.
> >>
> >> Binary exponential backoff with random jitter is introduced.  As the
> >> try-lock attempt count increases, it is more likely that a larger
> >> number of threads are competing for the adaptive mutex lock, so the
> >> wait time is increased exponentially.  A random jitter is also added
> >> to avoid synchronized try-lock attempts from other threads.
> >>
> >> v2: Remove read-check before try-lock for performance.
> >>
> >> Signed-off-by: Wangyang Guo
> >> ---
> >>  nptl/pthread_mutex_lock.c | 25 ++++++++++++++++---------
> >>  1 file changed, 16 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> >> index d2e652d151..7e75ec1cba 100644
> >> --- a/nptl/pthread_mutex_lock.c
> >> +++ b/nptl/pthread_mutex_lock.c
> >> @@ -26,6 +26,7 @@
> >>  #include
> >>  #include
> >>  #include
> >> +#include
> >>
> >>  /* Some of the following definitions differ when pthread_mutex_cond_lock.c
> >>     includes this file.
> >>  */
> >> @@ -64,11 +65,6 @@ lll_mutex_lock_optimized (pthread_mutex_t *mutex)
> >>  # define PTHREAD_MUTEX_VERSIONS 1
> >>  #endif
> >>
> >> -#ifndef LLL_MUTEX_READ_LOCK
> >> -# define LLL_MUTEX_READ_LOCK(mutex) \
> >> -  atomic_load_relaxed (&(mutex)->__data.__lock)
> >> -#endif
> >> -
> >>  static int __pthread_mutex_lock_full (pthread_mutex_t *mutex)
> >>      __attribute_noinline__;
> >>
> >> @@ -138,17 +134,28 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
> >>            int cnt = 0;
> >>            int max_cnt = MIN (max_adaptive_count (),
> >>                               mutex->__data.__spins * 2 + 10);
> >> +          int spin_count, exp_backoff = 1;
> >> +          unsigned int jitter = random_bits ();
> >>            do
> >>              {
> >> -              if (cnt++ >= max_cnt)
> >> +              /* In each loop, spin count is exponential backoff plus
> >> +                 random jitter, random range is [0, exp_backoff-1].  */
> >> +              spin_count = exp_backoff + (jitter & (exp_backoff - 1));
> >> +              cnt += spin_count;
> >> +              if (cnt >= max_cnt)
> >>                  {
> >> +                  /* If cnt exceeds max spin count, just go to wait
> >> +                     queue.  */
> >>                    LLL_MUTEX_LOCK (mutex);
> >>                    break;
> >>                  }
> >> -              atomic_spin_nop ();
> >> +              do
> >> +                atomic_spin_nop ();
> >> +              while (--spin_count > 0);
> >> +              /* Binary exponential backoff, prepare for next loop.  */
> >> +              exp_backoff <<= 1;
> >>              }
> >> -          while (LLL_MUTEX_READ_LOCK (mutex) != 0
> >> -                 || LLL_MUTEX_TRYLOCK (mutex) != 0);
> >> +          while (LLL_MUTEX_TRYLOCK (mutex) != 0);
> >
> > Does it perform better w/o the read?
> >
> > In general can you post some benchmarks varying the number of threads /
> > size of the critical section? Would also be nice if you could collect
> > some stats regarding the average number of failed CAS attempts
> > before/after.
>
> I worked out a mutex bench:
> https://gist.github.com/guowangy/459085e5be6bfeda101c591d2d2304c5
> It can test with different thread counts and critical section sizes
> (critical section length determined by the number of pause
> instructions).

Can you add a benchmark to glibc? Also, it looks like you are compiling
without optimizations on.
>
> Here is the result of v1 against upstream:
> (Tested on a Xeon 8380)
>
> npause     1     4     16    64    128  <== thread num
> --------------------------------------------
> 0       1.00  1.70  1.94  1.70  1.76
> 1       0.99  1.57  1.75  1.54  1.58
> 4       1.00  1.28  1.42  1.38  1.41
> 16      1.00  1.02  1.11  1.18  1.19
> 64      1.00  1.00  0.95  0.99  0.97
> 128     1.00  0.92  0.92  0.87  0.87

I was able to essentially reproduce this. Maybe in light of these
changes `max_cnt` needs a new threshold. Also maybe instead of
`cnt += spin_count` it should stay `cnt++` so we don't hit the futex so
eagerly.

> The first row header is the number of threads.
> The first column is the number of pause instructions, which represents
> different critical section sizes.
> Each value is the ratio of v1/upstream throughput (higher is better).
> RSD varies by thread num: thread(4, 16) < 2%, thread(1, 64, 128) < 1%.
>
> Thread=1 is the baseline and should be the same for both.
> Backoff works well when the critical section is small.
> But it causes some regression when the critical section is very large,
> because upstream attempts the lock frequently to reduce latency.
> Since adaptive mutexes are aimed at 'quick' locks, I think that's not a
> big problem here.
>
> The next is v2 against v1:
>
>            1     4     16    64    128
> --------------------------------------------
> 0       1.00  0.99  0.95  1.02  1.02
> 1       1.00  1.08  0.92  1.04  1.04
> 4       1.00  0.94  0.94  1.05  1.04
> 16      1.00  1.01  0.98  1.03  1.03
> 64      1.00  1.01  1.00  1.00  1.01
> 128     1.00  1.02  1.00  1.00  1.00
>
> v2 without the read-check can have a benefit at large thread counts,
> but has a drawback at small thread counts.

I am seeing a pretty significant benefit in keeping the read check on
my Tigerlake (only measuring up to 16 threads as it's an 8-core
machine).
W/ read check:
[pause: 0, thread:  1] -- Throughput: 76528393 ops/s  RSD: 4.55%
[pause: 0, thread:  4] -- Throughput: 28065221 ops/s  RSD: 1.58%
[pause: 0, thread:  8] -- Throughput: 14757408 ops/s  RSD: 1.86%
[pause: 0, thread: 16] -- Throughput: 15501579 ops/s  RSD: 1.06%

w/o read check:
[pause: 0, thread:  1] -- Throughput: 71303250 ops/s  RSD: 4.98%
[pause: 0, thread:  4] -- Throughput: 11308910 ops/s  RSD: 8.96%
[pause: 0, thread:  8] -- Throughput:  7380217 ops/s  RSD: 5.52%
[pause: 0, thread: 16] -- Throughput:  8253073 ops/s  RSD: 2.19%

> I also have a version with the random jitter removed (v3); here is the
> result of v3 against v1:
>
>            1     4     16    64    128
> --------------------------------------------
> 0       1.00  0.95  1.04  1.02  1.03
> 1       1.01  1.09  1.04  1.02  1.03
> 4       1.00  1.02  1.05  1.02  1.02
> 16      1.00  1.01  1.01  1.02  1.02
> 64      1.00  1.04  1.02  1.01  1.01
> 128     1.00  0.98  0.99  0.99  0.98
>
> The result shows that the version without random jitter is a better
> solution.
>
> >>
> >>            mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
> >>          }
> >> --
> >> 2.35.1
> >>
> >
>