Subject: Re: [PATCH v2] nptl: Add backoff mechanism to spinlock loop
Date: Wed, 30 Mar 2022 19:44:54 +0800
From: "Guo, Wangyang"
To: libc-alpha@sourceware.org
Cc: Noah Goldstein, wangyang.guo@intel.com

On 3/29/2022 12:41 AM, Noah Goldstein via Libc-alpha wrote:
> On Mon, Mar 28, 2022 at 3:47 AM Wangyang Guo wrote:
>>
>> When multiple threads wait for the lock at the same time, once the
>> lock owner releases the lock, the waiters all see the lock as
>> available and all try to lock it, which may cause an expensive CAS
>> storm.
>>
>> Binary exponential backoff with random jitter is introduced. As the
>> number of try-lock attempts increases, it becomes more likely that
>> many threads are competing for the adaptive mutex lock, so the wait
>> time is increased exponentially. Random jitter is also added to
>> avoid synchronized try-lock attempts from other threads.
>>
>> v2: Remove the read check before try-lock for performance.
>>
>> Signed-off-by: Wangyang Guo
>> ---
>>  nptl/pthread_mutex_lock.c | 25 ++++++++++++++++---------
>>  1 file changed, 16 insertions(+), 9 deletions(-)
>>
>> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
>> index d2e652d151..7e75ec1cba 100644
>> --- a/nptl/pthread_mutex_lock.c
>> +++ b/nptl/pthread_mutex_lock.c
>> @@ -26,6 +26,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>
>>  /* Some of the following definitions differ when pthread_mutex_cond_lock.c
>>     includes this file.  */
>> @@ -64,11 +65,6 @@ lll_mutex_lock_optimized (pthread_mutex_t *mutex)
>>  # define PTHREAD_MUTEX_VERSIONS 1
>>  #endif
>>
>> -#ifndef LLL_MUTEX_READ_LOCK
>> -# define LLL_MUTEX_READ_LOCK(mutex) \
>> -  atomic_load_relaxed (&(mutex)->__data.__lock)
>> -#endif
>> -
>>  static int __pthread_mutex_lock_full (pthread_mutex_t *mutex)
>>    __attribute_noinline__;
>>
>> @@ -138,17 +134,28 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
>>        int cnt = 0;
>>        int max_cnt = MIN (max_adaptive_count (),
>>                           mutex->__data.__spins * 2 + 10);
>> +      int spin_count, exp_backoff = 1;
>> +      unsigned int jitter = random_bits ();
>>        do
>>          {
>> -          if (cnt++ >= max_cnt)
>> +          /* In each loop, spin count is exponential backoff plus
>> +             random jitter, random range is [0, exp_backoff-1].  */
>> +          spin_count = exp_backoff + (jitter & (exp_backoff - 1));
>> +          cnt += spin_count;
>> +          if (cnt >= max_cnt)
>>              {
>> +              /* If cnt exceeds max spin count, just go to wait
>> +                 queue.  */
>>                LLL_MUTEX_LOCK (mutex);
>>                break;
>>              }
>> -          atomic_spin_nop ();
>> +          do
>> +            atomic_spin_nop ();
>> +          while (--spin_count > 0);
>> +          /* Binary exponential backoff, prepare for next loop.  */
>> +          exp_backoff <<= 1;
>>          }
>> -      while (LLL_MUTEX_READ_LOCK (mutex) != 0
>> -             || LLL_MUTEX_TRYLOCK (mutex) != 0);
>> +      while (LLL_MUTEX_TRYLOCK (mutex) != 0);
>
> Does it perform better w/o the read?
>
> In general can you post some benchmarks varying the number of threads /
> size of the critical section? Would also be nice if you could collect
> some stats regarding the average number of failed CAS attempts
> before/after.

I worked out a mutex benchmark:
https://gist.github.com/guowangy/459085e5be6bfeda101c591d2d2304c5
It can test different thread counts and critical-section sizes (the
critical-section length is determined by the number of pause
instructions).

Here is the result of v1 against upstream (tested on a Xeon 8380):

npause      1     4     16    64    128  <== thread num
--------------------------------------------
  0       1.00  1.70  1.94  1.70  1.76
  1       0.99  1.57  1.75  1.54  1.58
  4       1.00  1.28  1.42  1.38  1.41
 16       1.00  1.02  1.11  1.18  1.19
 64       1.00  1.00  0.95  0.99  0.97
128       1.00  0.92  0.92  0.87  0.87

The first row is the number of threads. The first column is the number
of pause instructions, which represents different critical-section
sizes. Each value is the ratio of v1 to upstream throughput (higher is
better). The relative standard deviation varies with thread count:
< 2% for 4 and 16 threads, < 1% for 1, 64, and 128 threads. Thread = 1
is the baseline and should be the same for both.

Backoff works well when the critical section is small, but it causes
some regression when the critical section is very large, because
upstream attempts the lock more frequently, which reduces latency.
Since the adaptive mutex is aimed at "quick" locks, I think that is
not a big problem here.

The next is v2 against v1:

npause      1     4     16    64    128
--------------------------------------------
  0       1.00  0.99  0.95  1.02  1.02
  1       1.00  1.08  0.92  1.04  1.04
  4       1.00  0.94  0.94  1.05  1.04
 16       1.00  1.01  0.98  1.03  1.03
 64       1.00  1.01  1.00  1.00  1.01
128       1.00  1.02  1.00  1.00  1.00

v2 (without the read check) benefits at large thread counts, but has a
drawback at small thread counts.

I also have a version with the random jitter removed (v3). Here is the
result of v3 against v1:

npause      1     4     16    64    128
--------------------------------------------
  0       1.00  0.95  1.04  1.02  1.03
  1       1.01  1.09  1.04  1.02  1.03
  4       1.00  1.02  1.05  1.02  1.02
 16       1.00  1.01  1.01  1.02  1.02
 64       1.00  1.04  1.02  1.01  1.01
128       1.00  0.98  0.99  0.99  0.98

The result shows that removing the random jitter is the better
solution.

>>
>>        mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
>>      }
>> --
>> 2.35.1
>>
>
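P.S. For anyone who wants to play with the backoff scheme outside
glibc, here is a minimal standalone sketch applied to a toy C11
atomic-flag spinlock. The names (backoff_lock, MAX_SPIN) and the
rand () / __builtin_ia32_pause () stand-ins are mine, not part of the
patch; the real code uses random_bits (), atomic_spin_nop (), and
falls back to the futex wait queue via LLL_MUTEX_LOCK.

    #include <stdatomic.h>
    #include <stdlib.h>

    #define MAX_SPIN 1024   /* illustrative cap; glibc uses
                               max_adaptive_count ().  */

    static atomic_int lock_word;   /* 0 = unlocked, 1 = locked */

    static void
    backoff_lock (void)
    {
      int cnt = 0;
      int exp_backoff = 1;
      unsigned int jitter = rand ();   /* stand-in for random_bits () */

      /* Try-lock via atomic exchange; on failure, back off and retry.  */
      while (atomic_exchange_explicit (&lock_word, 1,
                                       memory_order_acquire) != 0)
        {
          /* Spin count is the exponential backoff plus a random jitter
             in [0, exp_backoff - 1], as in the patch.  */
          int spin_count = exp_backoff + (jitter & (exp_backoff - 1));
          cnt += spin_count;
          if (cnt >= MAX_SPIN)
            {
              /* The real code stops spinning here and blocks on the
                 futex wait queue (LLL_MUTEX_LOCK); a toy lock can only
                 keep spinning.  */
              while (atomic_exchange_explicit (&lock_word, 1,
                                               memory_order_acquire) != 0)
                ;
              return;
            }
          do
            __builtin_ia32_pause ();   /* atomic_spin_nop () on x86 */
          while (--spin_count > 0);
          exp_backoff <<= 1;   /* binary exponential backoff */
        }
    }

    static void
    backoff_unlock (void)
    {
      atomic_store_explicit (&lock_word, 0, memory_order_release);
    }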
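And, roughly, the shape of the benchmark linked above, for readers who
do not follow the gist link: N threads hammering one
PTHREAD_MUTEX_ADAPTIVE_NP mutex, with the critical-section length set
by a number of pause instructions. This is a reconstruction of the
idea only; the names (run_thread, npause) and the option handling are
mine and need not match the actual gist. Run as, e.g., ./bench 16 4
for 16 threads and 4 pauses.

    #define _GNU_SOURCE   /* for PTHREAD_MUTEX_ADAPTIVE_NP */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define MAX_THREADS 128

    static pthread_mutex_t mutex;
    static int npause;                 /* critical-section size knob */
    static volatile int stop;
    static unsigned long counts[MAX_THREADS];

    static void *
    run_thread (void *arg)
    {
      long id = (long) arg;
      while (!stop)
        {
          pthread_mutex_lock (&mutex);
          for (int i = 0; i < npause; i++)
            __builtin_ia32_pause ();   /* simulated critical section */
          pthread_mutex_unlock (&mutex);
          counts[id]++;
        }
      return NULL;
    }

    int
    main (int argc, char **argv)
    {
      int nthreads = argc > 1 ? atoi (argv[1]) : 4;
      npause = argc > 2 ? atoi (argv[2]) : 0;
      if (nthreads < 1 || nthreads > MAX_THREADS)
        return 1;

      /* The patch changes the spin loop of adaptive mutexes, so that
         is the type to measure.  */
      pthread_mutexattr_t attr;
      pthread_mutexattr_init (&attr);
      pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
      pthread_mutex_init (&mutex, &attr);

      pthread_t threads[MAX_THREADS];
      for (long i = 0; i < nthreads; i++)
        pthread_create (&threads[i], NULL, run_thread, (void *) i);

      sleep (5);                       /* fixed measurement window */
      stop = 1;

      unsigned long total = 0;
      for (long i = 0; i < nthreads; i++)
        {
          pthread_join (threads[i], NULL);
          total += counts[i];
        }
      printf ("%d threads, %d pauses: %lu lock acquisitions\n",
              nthreads, npause, total);
      return 0;
    }

The throughput ratios in the tables above would then come from running
the same binary once per glibc build and dividing the totals.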