From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <hjl.tools@gmail.com>
Received: from mail-pf1-x433.google.com (mail-pf1-x433.google.com
 [IPv6:2607:f8b0:4864:20::433])
 by sourceware.org (Postfix) with ESMTPS id 493323858D35
 for <libc-alpha@sourceware.org>; Wed, 10 Nov 2021 23:35:31 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 493323858D35
Received: by mail-pf1-x433.google.com with SMTP id m14so3973049pfc.9
 for <libc-alpha@sourceware.org>; Wed, 10 Nov 2021 15:35:31 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=x-gm-message-state:mime-version:references:in-reply-to:from:date
 :message-id:subject:to:cc;
 bh=Wym7aGzgYXglksj06NUAlAuz2Lc62BHz39sT4zbPhyc=;
 b=HqZqA9p4x/BCFn5Slms+tsKQV6T0vFw3vWsMqhjXvCcxE4WphU+GjgOa88YF/WWjiT
 Fjpxd8pJzlESx6drkvhRwnRCOpDY4jQ/RCPKRBbBSPk39brC5I6AJ3MVRVZhPsCzw5u6
 6UH4+DJzHVuUrykRg7ft/i0oRFT7/W6xtHkERiUG/Xbtp9gB1BFLlH8JMYiflACVJCGL
 uo6ToAy/P1CJBgivWo3E12ih6ADn3A8lwzw80UvyJmxwaoXGMhDkXWjKAHLh75N+Lymo
 fY7fgIdnmYaeJkqYz7Y4k2N63sJOMdHF3L+GEd0qvVfDAmyZXY6+nv9nzUb05jmVtGHj
 RFJg==
X-Gm-Message-State: AOAM533UjcVduT2SnZjcJEtXa6bFFlWt4j+QTCMz5QzKoXe+w/gaLdCb
 Xo9H/O6a/OBVbXVpCGGlLtMTCyIZ11JoIke+A9g=
X-Google-Smtp-Source: ABdhPJymtD1vW2S1YH6Qb7fijSj6bMj/aX5KSTKzvTCdM44T3JHW2KmFkmVt5A2QZjAKN2aEI3vZHH9gqrUXaUud6p8=
X-Received: by 2002:a05:6a00:b4c:b0:481:2a:f374 with SMTP id
 p12-20020a056a000b4c00b00481002af374mr2848021pfo.60.1636587330344; Wed, 10
 Nov 2021 15:35:30 -0800 (PST)
MIME-Version: 1.0
References: <20211110001614.2087610-1-hjl.tools@gmail.com>
 <d12b76f2-a810-d58d-4b7c-844a7b0a689b@linux.ibm.com>
 <20211110200722.GF4930@li-24c3614c-2adc-11b2-a85c-85f334518bdb.ibm.com>
In-Reply-To: <20211110200722.GF4930@li-24c3614c-2adc-11b2-a85c-85f334518bdb.ibm.com>
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Wed, 10 Nov 2021 15:34:54 -0800
Message-ID: <CAMe9rOpR4wNHOH07KY+JC8o_jqHb4Xspb-cP=Dyxn6+QycTN2Q@mail.gmail.com>
Subject: Re: [PATCH v4 0/3] Optimize CAS [BZ #28537]
To: "Paul A. Clarke" <pc@us.ibm.com>
Cc: Paul E Murphy <murphyp@linux.ibm.com>,
 GNU C Library <libc-alpha@sourceware.org>, 
 Florian Weimer <fweimer@redhat.com>, Andreas Schwab <schwab@linux-m68k.org>, 
 Arjan van de Ven <arjan@linux.intel.com>
Content-Type: text/plain; charset="UTF-8"
X-Spam-Status: No, score=-3023.2 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, RCVD_IN_DNSWL_NONE,
 SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Wed, 10 Nov 2021 23:35:33 -0000

On Wed, Nov 10, 2021 at 12:07 PM Paul A. Clarke <pc@us.ibm.com> wrote:
>
> On Wed, Nov 10, 2021 at 08:26:09AM -0600, Paul E Murphy via Libc-alpha wrote:
> > On 11/9/21 6:16 PM, H.J. Lu via Libc-alpha wrote:
> > > CAS instruction is expensive.  From the x86 CPU's point of view, getting
> > > a cache line for writing is more expensive than reading.  See Appendix
> > > A.2 Spinlock in:
> > >
> > > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> > >
> > > The full compare and swap will grab the cache line exclusive and cause
> > > excessive cache line bouncing.
> > >
> > > Optimize CAS in low level locks and pthread_mutex_lock.c:
> > >
> > > 1. Do an atomic load and skip CAS if compare may fail to reduce cache
> > > line bouncing on contended locks.
> > > 2. Replace atomic_compare_and_exchange_bool_acq with
> > > atomic_compare_and_exchange_val_acq to avoid the extra load.
> > > 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> > > don't know if it's actually rare; in the contended case it is clearly not
> > > rare.
> >
> > Are you able to share benchmarks of this change? I am curious what effects
> > this might have on other platforms.
>
> I'd like to see the expected performance results, too.
>
> For me, the results are not uniformly positive (Power10).
> From bench-pthread-locks:
>
>                          bench   bench-patched
> mutex-empty              4.73371 4.54792   3.9%
> mutex-filler             18.5395 18.3419   1.1%
> mutex_trylock-empty      10.46   2.46364  76.4%
> mutex_trylock-filler     16.2188 16.1758   0.3%
> rwlock_read-empty        16.5118 16.4681   0.3%
> rwlock_read-filler       20.68   20.4416   1.2%
> rwlock_tryread-empty     2.06572 2.17284  -5.2%
> rwlock_tryread-filler    16.082  16.1215  -0.2%
> rwlock_write-empty       31.3723 31.259    0.4%
> rwlock_write-filler      41.6492 69.313  -66.4%
> rwlock_trywrite-empty    2.20584 2.32178  -5.3%
> rwlock_trywrite-filler   15.7044 15.9088  -1.3%
> spin_lock-empty          16.7964 16.7731   0.1%
> spin_lock-filler         20.6118 20.4175   0.9%
> spin_trylock-empty       8.99989 8.98879   0.1%
> spin_trylock-filler      16.4732 15.9957   2.9%
> sem_wait-empty           15.805  15.7391   0.4%
> sem_wait-filler          19.2346 19.5098  -1.4%
> sem_trywait-empty        2.06405 2.03782   1.3%
> sem_trywait-filler       15.921  15.8408   0.5%
> condvar-empty            1385.84 1387.29  -0.1%
> condvar-filler           1419.82 1424.01  -0.3%
> consumer_producer-empty  2550.01 2395.29   6.1%
> consumer_producer-filler 2709.4  2558.28   5.6%
>
> PC

Here are the results on a machine with 112 cores:

                          bench    bench-patched
mutex-empty               16.0112  16.5728        -3.5%
mutex-filler              49.4354  48.7608        1.4%
mutex_trylock-empty       19.2854  8.56795        56%
mutex_trylock-filler      54.9643  41.5418        24%
rwlock_read-empty         39.8855  39.7448        0.35%
rwlock_read-filler        75.1334  75.1218        0.015%
rwlock_tryread-empty      5.29094  5.2917         -0.014%
rwlock_tryread-filler     39.6653  40.209         -1.4%
rwlock_write-empty        60.6445  60.6236        0.034%
rwlock_write-filler       91.431   92.9016        -1.6%
rwlock_trywrite-empty     5.28404  5.94623        -13%
rwlock_trywrite-filler    40.7044  40.7709        -0.16%
spin_lock-empty           19.1067  19.1068        -0.00052%
spin_lock-filler          51.643   51.2963        0.67%
spin_trylock-empty        16.4705  16.4707        -0.0012%
spin_trylock-filler       45.4647  50.5047        -11%
sem_wait-empty            42.169   42.1889        -0.047%
sem_wait-filler           74.4302  74.4577        -0.037%
sem_trywait-empty         5.27318  5.27172        0.028%
sem_trywait-filler        40.191   40.8506        -1.6%
condvar-empty             5404.27  5406.39        -0.039%
condvar-filler            5022.93  1566.82        69%
consumer_producer-empty   15899.2  16755.8        -5.4%
consumer_producer-filler  16076.9  16065.8        0.069%
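
The last column appears to be the relative change from the first
number to the second, with positive values meaning the second run
was faster.  A minimal sketch of that arithmetic, assuming this
formula (the helper name improvement_percent is made up for
illustration and is not part of the benchmark):

```c
/* Hypothetical helper, not part of bench-pthread-locks.
   Returns the relative change from BEFORE to AFTER in percent.
   Positive means AFTER is smaller (faster); negative means a
   regression.  BEFORE must be non-zero.  */
static double
improvement_percent (double before, double after)
{
  return (before - after) / before * 100.0;
}
```

With the mutex_trylock-empty numbers above (19.2854 and 8.56795)
this gives roughly 56, and with the spin_trylock-filler numbers
(45.4647 and 50.5047) roughly -11.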

rwlock_trywrite-empty regresses by 13% and spin_trylock-filler by
11%.  But condvar-filler, mutex_trylock-empty and mutex_trylock-filler
improve by 69%, 56% and 24%, respectively.
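
For readers who have not looked at the patches, item 1 of the cover
letter (do an atomic load and skip the CAS if the compare may fail)
can be sketched roughly as follows.  This is only an illustration
in C11 atomics, not the glibc code: the type sketch_lock and the
sketch_* functions are made up, and glibc uses its own atomic
macros instead of <stdatomic.h>.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock word for illustration: 0 = free, 1 = held.  */
typedef struct
{
  atomic_int word;
} sketch_lock;

/* CAS-only trylock: the CAS is always issued, so the cache line
   is requested for writing even when the lock is already held.  */
static bool
sketch_trylock_cas_only (sketch_lock *l)
{
  int expected = 0;
  return atomic_compare_exchange_strong_explicit (&l->word, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed);
}

/* Load-first trylock: a relaxed load only reads the cache line.
   If the lock looks held, the CAS would fail anyway, so skip it.  */
static bool
sketch_trylock_load_first (sketch_lock *l)
{
  if (atomic_load_explicit (&l->word, memory_order_relaxed) != 0)
    return false;

  int expected = 0;
  return atomic_compare_exchange_strong_explicit (&l->word, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed);
}

/* Release the lock with a release store.  */
static void
sketch_unlock (sketch_lock *l)
{
  atomic_store_explicit (&l->word, 0, memory_order_release);
}
```

Both variants return the same answer.  The difference is only in
which cache-line requests are made when the lock is contended, and
the uncontended path pays for one extra load.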


--
H.J.