From: "H.J. Lu"
Date: Wed, 10 Nov 2021 15:34:54 -0800
Subject: Re: [PATCH v4 0/3] Optimize CAS [BZ #28537]
To: "Paul A. Clarke"
Cc: Paul E Murphy, GNU C Library, Florian Weimer, Andreas Schwab, Arjan van de Ven
List-Id: Libc-alpha mailing list

On Wed, Nov 10, 2021 at 12:07 PM Paul A. Clarke wrote:
>
> On Wed, Nov 10, 2021 at 08:26:09AM -0600, Paul E Murphy via Libc-alpha wrote:
> > On 11/9/21 6:16 PM, H.J. Lu via Libc-alpha wrote:
> > > CAS instruction is expensive. From the x86 CPU's point of view, getting
> > > a cache line for writing is more expensive than reading. See Appendix
> > > A.2 Spinlock in:
> > >
> > > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> > >
> > > The full compare and swap will grab the cache line exclusive and cause
> > > excessive cache line bouncing.
> > >
> > > Optimize CAS in low level locks and pthread_mutex_lock.c:
> > >
> > > 1. Do an atomic load and skip CAS if compare may fail to reduce cache
> > > line bouncing on contended locks.
> > > 2. Replace atomic_compare_and_exchange_bool_acq with
> > > atomic_compare_and_exchange_val_acq to avoid the extra load.
> > > 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> > > don't know if it's actually rare; in the contended case it is clearly not
> > > rare.
> >
> > Are you able to share benchmarks of this change? I am curious what effects
> > this might have on other platforms.
>
> I'd like to see the expected performance results, too.
>
> For me, the results are not uniformly positive (Power10).
> From bench-pthread-locks:
>
>                           bench     bench-patched
> mutex-empty               4.73371   4.54792     3.9%
> mutex-filler              18.5395   18.3419     1.1%
> mutex_trylock-empty       10.46     2.46364     76.4%
> mutex_trylock-filler      16.2188   16.1758     0.3%
> rwlock_read-empty         16.5118   16.4681     0.3%
> rwlock_read-filler        20.68     20.4416     1.2%
> rwlock_tryread-empty      2.06572   2.17284     -5.2%
> rwlock_tryread-filler     16.082    16.1215     -0.2%
> rwlock_write-empty        31.3723   31.259      0.4%
> rwlock_write-filler       41.6492   69.313      -66.4%
> rwlock_trywrite-empty     2.20584   2.32178     -5.3%
> rwlock_trywrite-filler    15.7044   15.9088     -1.3%
> spin_lock-empty           16.7964   16.7731     0.1%
> spin_lock-filler          20.6118   20.4175     0.9%
> spin_trylock-empty        8.99989   8.98879     0.1%
> spin_trylock-filler       16.4732   15.9957     2.9%
> sem_wait-empty            15.805    15.7391     0.4%
> sem_wait-filler           19.2346   19.5098     -1.4%
> sem_trywait-empty         2.06405   2.03782     1.3%
> sem_trywait-filler        15.921    15.8408     0.5%
> condvar-empty             1385.84   1387.29     -0.1%
> condvar-filler            1419.82   1424.01     -0.3%
> consumer_producer-empty   2550.01   2395.29     6.1%
> consumer_producer-filler  2709.4    2558.28     5.6%
>
> PC

Here are the results on a machine with 112 cores:

                          bench     bench-patched
mutex-empty               16.0112   16.5728     -3.5%
mutex-filler              49.4354   48.7608     1.4%
mutex_trylock-empty       19.2854   8.56795     56%
mutex_trylock-filler      54.9643   41.5418     24%
rwlock_read-empty         39.8855   39.7448     0.35%
rwlock_read-filler        75.1334   75.1218     0.015%
rwlock_tryread-empty      5.29094   5.2917      -0.014%
rwlock_tryread-filler     39.6653   40.209      -1.4%
rwlock_write-empty        60.6445   60.6236     0.034%
rwlock_write-filler       91.431    92.9016     -1.6%
rwlock_trywrite-empty     5.28404   5.94623     -13%
rwlock_trywrite-filler    40.7044   40.7709     -0.16%
spin_lock-empty           19.1067   19.1068     -0.00052%
spin_lock-filler          51.643    51.2963     0.67%
spin_trylock-empty        16.4705   16.4707     -0.0012%
spin_trylock-filler       45.4647   50.5047     -11%
sem_wait-empty            42.169    42.1889     -0.047%
sem_wait-filler           74.4302   74.4577     -0.037%
sem_trywait-empty         5.27318   5.27172     0.028%
sem_trywait-filler        40.191    40.8506     -1.6%
condvar-empty             5404.27   5406.39     -0.039%
condvar-filler            5022.93   1566.82     69%
consumer_producer-empty   15899.2   16755.8     -5.4%
consumer_producer-filler  16076.9   16065.8     0.069%

rwlock_trywrite-empty has a 13% regression and spin_trylock-filler has an
11% regression. But condvar-filler, mutex_trylock-empty and
mutex_trylock-filler improve by 69%, 56% and 24% respectively.

-- 
H.J.
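
P.S. For anyone following the thread without the patches at hand, item 1 of
the quoted cover letter (do an atomic load and skip the CAS if the compare
may fail) can be sketched roughly as below. This is only an illustration
written with C11 <stdatomic.h>; it is not the glibc code and does not use the
glibc atomic macros. The names sketch_lock, sketch_trylock and sketch_unlock,
and the 0/1 encoding of the lock word, are assumptions made up for the
example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock word for illustration: 0 = unlocked, 1 = locked.  */
typedef struct
{
  atomic_int word;
} sketch_lock;

/* Try to take the lock once; return true on success.  */
static bool
sketch_trylock (sketch_lock *l)
{
  /* Plain load first.  If the lock is visibly held, the CAS would fail
     anyway, so return without asking for the cache line in exclusive
     state.  */
  if (atomic_load_explicit (&l->word, memory_order_relaxed) != 0)
    return false;

  /* The lock looked free, so attempt the real compare-and-swap with
     acquire ordering.  It can still fail if another thread won the
     race between the load and the CAS.  */
  int expected = 0;
  return atomic_compare_exchange_strong_explicit (&l->word, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed);
}

/* Release the lock; the caller must hold it.  */
static void
sketch_unlock (sketch_lock *l)
{
  assert (atomic_load_explicit (&l->word, memory_order_relaxed) == 1);
  atomic_store_explicit (&l->word, 0, memory_order_release);
}
```

On an uncontended lock this adds one load before the CAS; on a contended
lock the failing path stays read-only. Whether that trade is a net win on a
given machine is what the numbers above are trying to measure.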