From: Noah Goldstein
Date: Mon, 8 Nov 2021 09:32:16 -0600
Subject: Re: [PATCH v3] x86: Optimize atomic_compare_and_exchange_[val|bool]_acq [BZ #28537]
To: "H.J. Lu"
Cc: GNU C Library, Florian Weimer, Hongyu Wang, Andreas Schwab, liuhongt, Arjan van de Ven
In-Reply-To: <20211104161443.734681-1-hjl.tools@gmail.com>

On Thu, Nov 4, 2021 at 11:15 AM H.J. Lu via Libc-alpha wrote:
>
> From the CPU's point of view, acquiring a cache line for writing is more
> expensive than reading it.  See Appendix A.2 Spinlock in:
>
> https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
>
> A full compare-and-swap grabs the cache line in exclusive state and causes
> excessive cache line bouncing.  Instead, load the current memory value
> through a volatile pointer first (this load is atomic and will not be
> optimized away by the compiler), check it, and return immediately if the
> write would fail, to reduce cache line bouncing on contended locks.
>
> This fixes BZ #28537.
>
> A GCC bug has been opened:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103065
>
> A fixed compiler should define __HAVE_SYNC_COMPARE_AND_SWAP_LOAD_CHECK to
> indicate that the compiler will generate the check together with the
> volatile load.  glibc can then check __HAVE_SYNC_COMPARE_AND_SWAP_LOAD_CHECK
> to avoid the extra volatile load.
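In plain C, the load-check idea described above boils down to roughly the
following (a minimal illustrative sketch with made-up names, not the actual
glibc macro):

    /* Read through a volatile pointer first (a plain shared read, no bus
       lock), and only issue the locked cmpxchg, which needs the cache line
       in exclusive state, when the compare can still succeed.  */
    static inline int
    cas_val_acq_with_load_check (int *mem, int oldval, int newval)
    {
      int cur = *(volatile int *) mem;
      if (cur != oldval)
        return cur;    /* The CAS would fail anyway; skip the write.  */
      return __sync_val_compare_and_swap (mem, oldval, newval);
    }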
> ---
>  sysdeps/x86/atomic-machine.h | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/sysdeps/x86/atomic-machine.h b/sysdeps/x86/atomic-machine.h
> index 2692d94a92..597dc1cf92 100644
> --- a/sysdeps/x86/atomic-machine.h
> +++ b/sysdeps/x86/atomic-machine.h
> @@ -73,9 +73,20 @@ typedef uintmax_t uatomic_max_t;
>  #define ATOMIC_EXCHANGE_USES_CAS 0
>
>  #define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
> -  __sync_val_compare_and_swap (mem, oldval, newval)
> +  ({ volatile __typeof (*(mem)) *memp = (mem);                     \
> +     __typeof (*(mem)) oldmem = *memp, ret;                        \
> +     ret = (oldmem == (oldval)                                     \
> +            ? __sync_val_compare_and_swap (mem, oldval, newval)    \
> +            : oldmem);                                             \
> +     ret; })
>  #define atomic_compare_and_exchange_bool_acq(mem, newval, oldval) \
> -  (! __sync_bool_compare_and_swap (mem, oldval, newval))
> +  ({ volatile __typeof (*(mem)) *memp = (mem);                     \
> +     __typeof (*(mem)) oldmem = *memp;                             \
> +     int ret;                                                      \
> +     ret = (oldmem == (oldval)                                     \
> +            ? !__sync_bool_compare_and_swap (mem, oldval, newval)  \
> +            : 1);                                                  \
> +     ret; })
>
>
>  #define __arch_c_compare_and_exchange_val_8_acq(mem, newval, oldval) \
> --
> 2.33.1
>

Worth noting that on x86 any of the __atomic_fetch_* builtins aside from
add/sub are implemented with a CAS loop that may benefit from this:

https://godbolt.org/z/z87v9Kbcz
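For reference, such a CAS loop expressed with the __atomic builtins looks
roughly like this (illustrative names and memory orders; see the godbolt
link above for the actual codegen):

    /* Approximately how a fetch-or is done when no single instruction can
       return the old value: load, then retry a compare-exchange until the
       stored value has not changed underneath us.  */
    static inline int
    fetch_or_via_cas (int *mem, int mask)
    {
      int old = __atomic_load_n (mem, __ATOMIC_RELAXED);
      while (!__atomic_compare_exchange_n (mem, &old, old | mask,
                                           0 /* strong */,
                                           __ATOMIC_SEQ_CST, __ATOMIC_RELAXED))
        /* On failure, OLD is refreshed with the current value; retry.  */;
      return old;
    }

Each attempt ends in a lock cmpxchg that takes the cache line exclusive, so
the same cost argument applies under contention.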