From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH v2 3/3] Mutex: Optimize adaptive spin algorithm
To: kemi, Adhemerval Zanella, Glibc alpha
Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu, Lu Aubrey
References: <1524624988-29141-1-git-send-email-kemi.wang@intel.com> <1524624988-29141-3-git-send-email-kemi.wang@intel.com> <591d1f86-21e8-0a01-721b-4adff26e839b@redhat.com>
From: Florian Weimer
Message-ID: <5873b82e-97f2-dd8d-ab55-353138264517@redhat.com>
Date: Tue, 08 May 2018 15:08:00 -0000

On 05/02/2018 01:04 PM, kemi wrote:
>
> On 02 May 2018 16:19, Florian Weimer wrote:
>> On 04/25/2018 04:56 AM, Kemi Wang wrote:
>>> @@ -124,21 +125,24 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
>>>         if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>>>       {
>>>         int cnt = 0;
>> …
>>> +      int max_cnt = MIN (__mutex_aconf.spin_count,
>>> +            mutex->__data.__spins * 2 + 100);
>>> +
>>> +      /* MO read while spinning */
>>> +      do
>>> +        {
>>> +         atomic_spin_nop ();
>>> +        }
>>> +      while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
>>> +            ++cnt < max_cnt);
>>> +        /* Try to acquire the lock if lock is available or the spin count
>>> +         * is run out, call into kernel to block if fails
>>> +         */
>>> +      if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>>> +        LLL_MUTEX_LOCK (mutex);
>>>
>> …
>>> +      mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
>>> +    }
>>
>> The indentation is off.  Comments should end with “.  ” (dot and two spaces).  Multi-line comments do not start with “*” on subsequent lines.  We don't use braces when we can avoid them.  Operators such as “&&” should be on the following line when breaking up lines.
>>
>
> Will fold these changes into the next version.
> I am not familiar with the glibc coding style; apologies for that.

No apology needed; it takes some time to get used to.

>> Why is the LLL_MUTEX_TRYLOCK call still needed?  Shouldn't an unconditional call to LLL_MUTEX_LOCK be sufficient?
>>
>
> The purpose of calling LLL_MUTEX_TRYLOCK here is to try to acquire the lock in user
> space, without blocking, once we have observed that the lock is available.  Thus, in case of multiple
> spinners contending for the lock, only one spinner can acquire the lock successfully
> and the others block.
>
> I am not sure that an unconditional call to LLL_MUTEX_LOCK, as you suggest, satisfies
> this purpose.

It's what we use for the default case.  It expands to lll_lock, so it should try atomic_compare_and_exchange_bool_acq first and only perform a futex syscall if there is contention.  So I do think that LLL_MUTEX_TRYLOCK is redundant here.
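
To make that concrete, here is a rough standalone model, written with C11 atomics, of the fast-path/slow-path split I would expect lll_lock to give us.  The names (model_lll_lock, slow_path_wait) are made up for illustration, and the slow path is only a stand-in for the real futex wait; this is not the glibc implementation.

#include <stdatomic.h>

/* Stand-in for the futex-based slow path: in glibc this would mark
   the futex word as contended and block in the kernel.  Here it just
   spins and retries, purely for illustration.  */
void
slow_path_wait (atomic_int *futex)
{
  int expected;
  do
    {
      while (atomic_load_explicit (futex, memory_order_relaxed) != 0)
        ;
      expected = 0;
    }
  while (!atomic_compare_exchange_weak_explicit (futex, &expected, 1,
                                                 memory_order_acquire,
                                                 memory_order_relaxed));
}

/* Model of the expected lll_lock behavior: one user-space CAS with
   acquire semantics, falling back to the slow path only on
   contention.  */
void
model_lll_lock (atomic_int *futex)
{
  int expected = 0;
  if (!atomic_compare_exchange_strong_explicit (futex, &expected, 1,
                                                memory_order_acquire,
                                                memory_order_relaxed))
    slow_path_wait (futex);
}

If lll_lock does start with that CAS, the separate LLL_MUTEX_TRYLOCK on the line before merely repeats the same user-space attempt.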
Perhaps manually review the disassembly to make sure?

>> But the real question is whether the old way of doing CAS in a loop is beneficial on other, non-Intel architectures.  You either need to get broad consensus from the large SMP architectures (various aarch64 implementations, IBM POWER and Z), or somehow make this opt-in at the source level.
>>
>
> That would be a platform-specific change, and it shows a clear performance improvement on the x86 architecture.
> According to Adhemerval, this change also shows some improvement on the aarch64 architecture.
> If you or someone else still have concerns about performance regressions on other architectures, making
> this opt-in could address those worries.
>
> "
> I checked the change on a 64-core aarch64 machine, but
> differently from the previous patch, this one seems to show improvements:
>
> nr_threads    base        head (SPIN_COUNT=10)    head (SPIN_COUNT=1000)
> 1             27566206    28776779 (4.206770)     28778073 (4.211078)
> 2              8498813     9129102 (6.904173)      7042975 (-20.670782)
> 7              5019434     5832195 (13.935765)     5098511 (1.550982)
> 14             4379155     6507212 (32.703053)     5200018 (15.785772)
> 28             4397464     4584480 (4.079329)      4456767 (1.330628)
> 56             4020956     3534899 (-13.750237)    4096197 (1.836850)
> "

Ah, nice, I had missed that.  I suppose this means we can risk enabling it by default.

Thanks,
Florian