[Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX

public inbox for gcc-bugs@sourceware.org

* [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX
@ 2022-02-25 14:22 xry111 at mengyan1223 dot wang
  2022-02-25 14:30 ` [Bug target/104688] " jakub at gcc dot gnu.org
                   ` (33 more replies)
  0 siblings, 34 replies; 35+ messages in thread
From: xry111 at mengyan1223 dot wang @ 2022-02-25 14:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

            Bug ID: 104688
           Summary: gcc and libatomic can use SSE for 128-bit atomic loads
                    on Intel CPUs with AVX
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: xry111 at mengyan1223 dot wang
  Target Milestone: ---

In Dec 2021, Intel updated the SDM and added the following content:

> Processors that enumerate support for Intel® AVX (by setting the feature flag CPUID.01H:ECX.AVX[bit 28]) guarantee that the 16-byte memory operations performed by the following instructions will always be carried out atomically:
> - MOVAPD, MOVAPS, and MOVDQA.
> - VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.
> - VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded with EVEX.128 and k0 (masking disabled).
> 
> (Note that these instructions require the linear addresses of their memory operands to be 16-byte aligned.)

(see Change 13, https://cdrdv2.intel.com/v1/dl/getContent/671294)

So we can use SSE for Intel CPUs with AVX, instead of a loop with LOCK
CMPXCHG16B.

AMD has no such guarantee (at least for now), so we still need LOCK CMPXCHG16B
on old Intel CPUs and (old or new) AMD CPUs.
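
A minimal sketch of the two load strategies mentioned here, assuming x86-64 and GCC or Clang inline assembly. The type `u128` and the functions `load16_sse`, `load16_cx16`, and `load16` are hypothetical names used for illustration, not libatomic's. The sketch checks only the AVX feature bit; a real implementation would also have to decide which CPU vendors it trusts to honor the guarantee.

```c
#include <assert.h>
#include <emmintrin.h>
#include <stdint.h>

/* Hypothetical 16-byte value type; the attribute keeps every object aligned. */
typedef struct { uint64_t lo, hi; } __attribute__((aligned(16))) u128;

/* One aligned MOVDQA load. It is tear-free only on CPUs that document the
   16-byte guarantee quoted above, and only for a 16-byte aligned address. */
static u128 load16_sse(const u128 *p)
{
    __m128i x;
    u128 out;
    __asm__ volatile("movdqa %1, %0" : "=x"(x) : "m"(*p) : "memory");
    _mm_store_si128((__m128i *)&out, x);
    return out;
}

/* Fallback: LOCK CMPXCHG16B with expected == desired == 0. If *p is zero it
   rewrites zero, otherwise it returns the current value in RDX:RAX.
   Note that this writes to *p, so it faults on read-only memory. */
static u128 load16_cx16(u128 *p)
{
    uint64_t lo = 0, hi = 0;
    __asm__ volatile("lock cmpxchg16b %2"
                     : "+a"(lo), "+d"(hi), "+m"(*p)
                     : "b"((uint64_t)0), "c"((uint64_t)0)
                     : "cc", "memory");
    return (u128){ lo, hi };
}

/* Pick a strategy at run time (assumption: the AVX bit alone is enough). */
static u128 load16(u128 *p)
{
    assert(((uintptr_t)p & 15) == 0);
    if (__builtin_cpu_supports("avx"))
        return load16_sse(p);
    return load16_cx16(p);
}
```

A caller would simply write `u128 v = load16(&shared);`. The CMPXCHG16B path is the reason the library fallback is expensive: it is a locked read-modify-write even though the caller only wants to read.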


* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
@ 2022-02-25 14:30 ` jakub at gcc dot gnu.org
  2022-02-25 14:33 ` xry111 at mengyan1223 dot wang
                   ` (32 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: jakub at gcc dot gnu.org @ 2022-02-25 14:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hjl.tools at gmail dot com,
                   |                            |jakub at gcc dot gnu.org,
                   |                            |rth at gcc dot gnu.org

--- Comment #1 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
So, shall we just handle it in libatomic by adding yet another ifunc-selected
version of __atomic_load_16?
Or do we want to expand it inline again when some new -m* option is set, one
that is enabled by default for -march= values of Intel CPUs with AVX?


* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
  2022-02-25 14:30 ` [Bug target/104688] " jakub at gcc dot gnu.org
@ 2022-02-25 14:33 ` xry111 at mengyan1223 dot wang
  2022-02-25 14:34 ` fw at gcc dot gnu.org
                   ` (31 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: xry111 at mengyan1223 dot wang @ 2022-02-25 14:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

Xi Ruoyao <xry111 at mengyan1223 dot wang> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                URL|                            |https://gcc.gnu.org/piperma
                   |                            |il/gcc-help/2022-February/1
                   |                            |41279.html

--- Comment #2 from Xi Ruoyao <xry111 at mengyan1223 dot wang> ---
See option 4 of https://gcc.gnu.org/legacy-ml/gcc/2017-01/msg00167.html.


* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
  2022-02-25 14:30 ` [Bug target/104688] " jakub at gcc dot gnu.org
  2022-02-25 14:33 ` xry111 at mengyan1223 dot wang
@ 2022-02-25 14:34 ` fw at gcc dot gnu.org
  2022-02-25 16:36 ` jakub at gcc dot gnu.org
                   ` (30 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: fw at gcc dot gnu.org @ 2022-02-25 14:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #3 from Florian Weimer <fw at gcc dot gnu.org> ---
I feel we should give AMD some time to comment here. If they can commit to
supporting it as Intel did, that alters the design space somewhat.


* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
                   ` (2 preceding siblings ...)
  2022-02-25 14:34 ` fw at gcc dot gnu.org
@ 2022-02-25 16:36 ` jakub at gcc dot gnu.org
  2022-02-25 16:37 ` jakub at gcc dot gnu.org
                   ` (29 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: jakub at gcc dot gnu.org @ 2022-02-25 16:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #4 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Created attachment 52517
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52517&action=edit
gcc12-pr104688.patch

Untested patch that so far handles it on the libatomic side only.
I'm not sure what exactly to use for the 16-byte __atomic_store_n. For 8-byte
stores we use a plain store for relaxed and xchg for seq_cst (I haven't checked
the other models). There is no 16-byte xchg, so I'm using vmovdqa + mfence, but
I'm not sure whether that is the fastest choice.
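
The store sequence described in this comment might look roughly like the following. It is a sketch assuming x86-64 with GCC or Clang; `u128` and `store16_sse` are hypothetical names. It uses the SSE-encoded MOVDQA rather than VMOVDQA so that it executes on any x86-64 CPU, but the store is only guaranteed not to tear on CPUs that document the 16-byte guarantee, which the caller is assumed to have checked.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Hypothetical 16-byte value type; the attribute keeps every object aligned. */
typedef struct { uint64_t lo, hi; } __attribute__((aligned(16))) u128;

/* Store 16 bytes with one aligned vector move. For a sequentially
   consistent store, follow it with MFENCE so that later loads cannot
   complete before the store is globally visible. */
static void store16_sse(u128 *p, u128 v, int memorder)
{
    /* _mm_set_epi64x takes the high half first, then the low half. */
    __m128i x = _mm_set_epi64x((long long)v.hi, (long long)v.lo);
    __asm__ volatile("movdqa %1, %0" : "=m"(*p) : "x"(x) : "memory");
    if (memorder == __ATOMIC_SEQ_CST)
        __asm__ volatile("mfence" ::: "memory");
}
```

Relaxed and release stores skip the fence, mirroring how the 8-byte case uses a plain store for the weaker orders. Whether MFENCE is the cheapest way to get seq_cst here is exactly the open question raised in the comment.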


* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
                   ` (3 preceding siblings ...)
  2022-02-25 16:36 ` jakub at gcc dot gnu.org
@ 2022-02-25 16:37 ` jakub at gcc dot gnu.org
  2022-03-17 17:50 ` cvs-commit at gcc dot gnu.org
                   ` (28 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: jakub at gcc dot gnu.org @ 2022-02-25 16:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #5 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Of course, if AMD confirms the same, we could just revert the
__libat_feat1_init change.


* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
                   ` (4 preceding siblings ...)
  2022-02-25 16:37 ` jakub at gcc dot gnu.org
@ 2022-03-17 17:50 ` cvs-commit at gcc dot gnu.org
  2022-03-29  5:54 ` cvs-commit at gcc dot gnu.org
                   ` (27 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-03-17 17:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #6 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:1d47c0512a265d4bb3ab9e56259fd1e4f4d42c75

commit r12-7689-g1d47c0512a265d4bb3ab9e56259fd1e4f4d42c75
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Thu Mar 17 18:49:00 2022 +0100

    libatomic: Improve 16-byte atomics on Intel AVX [PR104688]

    As mentioned in the PR, the latest Intel SDM has added:
    "Processors that enumerate support for Intel® AVX (by setting the feature
    flag CPUID.01H:ECX.AVX[bit 28]) guarantee that the 16-byte memory
    operations performed by the following instructions will always be
    carried out atomically:
    • MOVAPD, MOVAPS, and MOVDQA.
    • VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.
    • VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded with EVEX.128
      and k0 (masking disabled).
    (Note that these instructions require the linear addresses of their memory
    operands to be 16-byte aligned.)"

    The following patch deals with it just on the libatomic library side so far.
    Currently (since ~ 2017) we emit all the __atomic_* 16-byte builtins as
    library calls, and this is something that we can hopefully backport.

    The patch simply introduces yet another ifunc variant that takes priority
    over the pure CMPXCHG16B one, one that checks AVX and CMPXCHG16B bits and
    on non-Intel clears the AVX bit during detection for now (if AMD comes
    with the same guarantee, we could revert the config/x86/init.c hunk),
    which implements 16-byte atomic load as vmovdqa and 16-byte atomic store
    as vmovdqa followed by mfence.

    2022-03-17  Jakub Jelinek  <jakub@redhat.com>

            PR target/104688
            * Makefile.am (IFUNC_OPTIONS): Change on x86_64 to -mcx16 -mcx16.
            (libatomic_la_LIBADD): Add $(addsuffix _16_2_.lo,$(SIZEOBJS)) for
            x86_64.
            * Makefile.in: Regenerated.
            * config/x86/host-config.h (IFUNC_COND_1): For x86_64 define to
            both AVX and CMPXCHG16B bits.
            (IFUNC_COND_2): Define.
            (IFUNC_NCOND): For x86_64 define to 2 * (N == 16).
            (MAYBE_HAVE_ATOMIC_CAS_16, MAYBE_HAVE_ATOMIC_EXCHANGE_16,
            MAYBE_HAVE_ATOMIC_LDST_16): Define to IFUNC_COND_2 rather than
            IFUNC_COND_1.
            (HAVE_ATOMIC_CAS_16): Redefine to 1 whenever IFUNC_ALT != 0.
            (HAVE_ATOMIC_LDST_16): Redefine to 1 whenever IFUNC_ALT == 1.
            (atomic_compare_exchange_n): Define whenever IFUNC_ALT != 0
            on x86_64 for N == 16.
            (__atomic_load_n, __atomic_store_n): Redefine whenever IFUNC_ALT == 1
            on x86_64 for N == 16.
            (atomic_load_n, atomic_store_n): New functions.
            * config/x86/init.c (__libat_feat1_init): On x86_64 clear bit_AVX
            if CPU vendor is not Intel.
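
The selection logic described in this ChangeLog can be sketched as follows. This is not the libatomic source: the `FEAT_*` constants, the `VARIANT_*` numbering, and all function names are assumptions made for illustration. Only `__get_cpuid` from `<cpuid.h>` and the CPUID bit positions are real. x86 with GCC or Clang is assumed.

```c
#include <cpuid.h>
#include <string.h>

#define FEAT_CX16 (1u << 13)  /* CPUID.01H:ECX.CMPXCHG16B */
#define FEAT_AVX  (1u << 28)  /* CPUID.01H:ECX.AVX */

/* Hypothetical variant numbers, loosely modeled on IFUNC_ALT above. */
enum { VARIANT_GENERIC = 0, VARIANT_AVX_CX16 = 1, VARIANT_CX16 = 2 };

/* Hide the AVX bit on vendors without a documented 16-byte guarantee. */
static unsigned mask_features(unsigned ecx, int vendor_has_guarantee)
{
    return vendor_has_guarantee ? ecx : (ecx & ~FEAT_AVX);
}

/* The AVX+CX16 variant takes priority over the pure CX16 one. */
static int select_variant(unsigned ecx)
{
    if ((ecx & (FEAT_AVX | FEAT_CX16)) == (FEAT_AVX | FEAT_CX16))
        return VARIANT_AVX_CX16;
    if (ecx & FEAT_CX16)
        return VARIANT_CX16;
    return VARIANT_GENERIC;
}

/* CPUID leaf 0 returns the vendor string in EBX, EDX, ECX order. */
static int cpu_is_intel(void)
{
    unsigned eax, ebx, ecx, edx;
    char vendor[12];
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return 0;
    memcpy(vendor, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    return memcmp(vendor, "GenuineIntel", 12) == 0;
}

static int detect_variant(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return VARIANT_GENERIC;
    return select_variant(mask_features(ecx, cpu_is_intel()));
}
```

Masking the feature bit once during detection, instead of testing the vendor in every variant condition, keeps the per-variant conditions simple and makes it a one-line change to admit another vendor later.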


* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
                   ` (5 preceding siblings ...)
  2022-03-17 17:50 ` cvs-commit at gcc dot gnu.org
@ 2022-03-29  5:54 ` cvs-commit at gcc dot gnu.org
  2022-04-05 12:30 ` xry111 at mengyan1223 dot wang
                   ` (26 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-03-29  5:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #7 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The releases/gcc-11 branch has been updated by Jakub Jelinek
<jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:1861b9a9f13c64333306a2eb146b2da0a41d044f

commit r11-9729-g1861b9a9f13c64333306a2eb146b2da0a41d044f
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Thu Mar 17 18:49:00 2022 +0100

    libatomic: Improve 16-byte atomics on Intel AVX [PR104688]

    As mentioned in the PR, the latest Intel SDM has added:
    "Processors that enumerate support for Intel® AVX (by setting the feature
    flag CPUID.01H:ECX.AVX[bit 28]) guarantee that the 16-byte memory
    operations performed by the following instructions will always be
    carried out atomically:
    • MOVAPD, MOVAPS, and MOVDQA.
    • VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.
    • VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded with EVEX.128
      and k0 (masking disabled).
    (Note that these instructions require the linear addresses of their memory
    operands to be 16-byte aligned.)"

    The following patch deals with it just on the libatomic library side so far.
    Currently (since ~ 2017) we emit all the __atomic_* 16-byte builtins as
    library calls, and this is something that we can hopefully backport.

    The patch simply introduces yet another ifunc variant that takes priority
    over the pure CMPXCHG16B one, one that checks AVX and CMPXCHG16B bits and
    on non-Intel clears the AVX bit during detection for now (if AMD comes
    with the same guarantee, we could revert the config/x86/init.c hunk),
    which implements 16-byte atomic load as vmovdqa and 16-byte atomic store
    as vmovdqa followed by mfence.

    2022-03-17  Jakub Jelinek  <jakub@redhat.com>

            PR target/104688
            * Makefile.am (IFUNC_OPTIONS): Change on x86_64 to -mcx16 -mcx16.
            (libatomic_la_LIBADD): Add $(addsuffix _16_2_.lo,$(SIZEOBJS)) for
            x86_64.
            * Makefile.in: Regenerated.
            * config/x86/host-config.h (IFUNC_COND_1): For x86_64 define to
            both AVX and CMPXCHG16B bits.
            (IFUNC_COND_2): Define.
            (IFUNC_NCOND): For x86_64 define to 2 * (N == 16).
            (MAYBE_HAVE_ATOMIC_CAS_16, MAYBE_HAVE_ATOMIC_EXCHANGE_16,
            MAYBE_HAVE_ATOMIC_LDST_16): Define to IFUNC_COND_2 rather than
            IFUNC_COND_1.
            (HAVE_ATOMIC_CAS_16): Redefine to 1 whenever IFUNC_ALT != 0.
            (HAVE_ATOMIC_LDST_16): Redefine to 1 whenever IFUNC_ALT == 1.
            (atomic_compare_exchange_n): Define whenever IFUNC_ALT != 0
            on x86_64 for N == 16.
            (__atomic_load_n, __atomic_store_n): Redefine whenever IFUNC_ALT == 1
            on x86_64 for N == 16.
            (atomic_load_n, atomic_store_n): New functions.
            * config/x86/init.c (__libat_feat1_init): On x86_64 clear bit_AVX
            if CPU vendor is not Intel.

    (cherry picked from commit 1d47c0512a265d4bb3ab9e56259fd1e4f4d42c75)


* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
                   ` (6 preceding siblings ...)
  2022-03-29  5:54 ` cvs-commit at gcc dot gnu.org
@ 2022-04-05 12:30 ` xry111 at mengyan1223 dot wang
  2022-04-05 12:35 ` jakub at gcc dot gnu.org
                   ` (25 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: xry111 at mengyan1223 dot wang @ 2022-04-05 12:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #8 from Xi Ruoyao <xry111 at mengyan1223 dot wang> ---
Shall I close it as FIXED, or keep it open while waiting for a response from AMD?


* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
                   ` (7 preceding siblings ...)
  2022-04-05 12:30 ` xry111 at mengyan1223 dot wang
@ 2022-04-05 12:35 ` jakub at gcc dot gnu.org
  2022-11-14  3:31 ` Ganesh.Gopalasubramanian at amd dot com
                   ` (24 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: jakub at gcc dot gnu.org @ 2022-04-05 12:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #9 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Besides the missing AMD response, it isn't fully fixed, because the change is on
the libatomic side only.
So we still pay the cost of calling those functions (and often the PLT cost too)
and of returning from them.
For GCC 13, we should add some option that optionally reverts the change to use
library calls all the time, and default that option for -march= Intel CPUs with
AVX support or something similar (perhaps only if AVX is also enabled).


* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
                   ` (8 preceding siblings ...)
  2022-04-05 12:35 ` jakub at gcc dot gnu.org
@ 2022-11-14  3:31 ` Ganesh.Gopalasubramanian at amd dot com
  2022-11-14  4:58 ` sam at gentoo dot org
                   ` (23 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: Ganesh.Gopalasubramanian at amd dot com @ 2022-11-14  3:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

GGanesh <Ganesh.Gopalasubramanian at amd dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |Ganesh.Gopalasubramanian@am
                   |                            |d.com

--- Comment #10 from GGanesh <Ganesh.Gopalasubramanian at amd dot com> ---
Apologies for the late response!

We will update the AMD APM manuals in the next revision.

For all AMD architectures:

> Processors that support AVX extend the atomicity for cacheable,
> naturally-aligned single loads or stores from a quadword to a double quadword.

This means that all 128-bit instructions, even the *MOVDQU instructions, are
atomic if they end up being naturally aligned.

Can we extend this patch to AMD processors as well? If not, I plan to
submit the patch for stage-1!


* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
                   ` (9 preceding siblings ...)
  2022-11-14  3:31 ` Ganesh.Gopalasubramanian at amd dot com
@ 2022-11-14  4:58 ` sam at gentoo dot org
  2022-11-14  5:10 ` [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD " xry111 at gcc dot gnu.org
                   ` (22 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: sam at gentoo dot org @ 2022-11-14  4:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #11 from Sam James <sam at gentoo dot org> ---
(In reply to GGanesh from comment #10)
> Can we extend this patch to AMD processors as well. If not, I will plan to
> submit the patch for stage-1!

GCC 13 (as of today) is in stage 3 - see https://gcc.gnu.org/develop.html, but
it may or may not still be possible to submit it (not my call).


* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
                   ` (10 preceding siblings ...)
  2022-11-14  4:58 ` sam at gentoo dot org
@ 2022-11-14  5:10 ` xry111 at gcc dot gnu.org
  2022-11-14  7:54 ` jakub at gcc dot gnu.org
                   ` (21 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: xry111 at gcc dot gnu.org @ 2022-11-14  5:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

Xi Ruoyao <xry111 at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|gcc and libatomic can use   |gcc and libatomic can use
                   |SSE for 128-bit atomic      |SSE for 128-bit atomic
                   |loads on Intel CPUs with    |loads on Intel and AMD CPUs
                   |AVX                         |with AVX
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2022-11-14


* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
                   ` (11 preceding siblings ...)
  2022-11-14  5:10 ` [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD " xry111 at gcc dot gnu.org
@ 2022-11-14  7:54 ` jakub at gcc dot gnu.org
  2022-11-14  9:24 ` amonakov at gcc dot gnu.org
                   ` (20 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: jakub at gcc dot gnu.org @ 2022-11-14  7:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #12 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
I've posted the patches (so far only lightly tested):
https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606021.html
https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606022.html
It is still Sunday in AoE, so we still have stage1 there.


* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
                   ` (12 preceding siblings ...)
  2022-11-14  7:54 ` jakub at gcc dot gnu.org
@ 2022-11-14  9:24 ` amonakov at gcc dot gnu.org
  2022-11-14  9:29 ` jakub at gcc dot gnu.org
                   ` (19 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: amonakov at gcc dot gnu.org @ 2022-11-14  9:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #13 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Jakub, sorry if I misunderstood the patches from a brief glance, but what
ordering guarantees are you assuming for AVX accesses? It should not be
SEQ_CST. I think what the Intel manual is saying is that such accesses will not
tear, but that reordering follows the pre-existing x86 TSO rules (a load can
finish before an earlier store is globally visible).


* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
                   ` (13 preceding siblings ...)
  2022-11-14  9:24 ` amonakov at gcc dot gnu.org
@ 2022-11-14  9:29 ` jakub at gcc dot gnu.org
  2022-11-14  9:52 ` amonakov at gcc dot gnu.org
                   ` (18 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: jakub at gcc dot gnu.org @ 2022-11-14  9:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #14 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
For ordering guarantees I assume (already since the r12-7689 change) that
VMOVDQA behaves the same as MOVL/MOVQ.
This PR was about whether there is a guarantee that VMOVDQA will be an atomic
load or store when given a 128-bit aligned address.


* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
                   ` (14 preceding siblings ...)
  2022-11-14  9:29 ` jakub at gcc dot gnu.org
@ 2022-11-14  9:52 ` amonakov at gcc dot gnu.org
  2022-11-15  7:18 ` cvs-commit at gcc dot gnu.org
                   ` (17 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: amonakov at gcc dot gnu.org @ 2022-11-14  9:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #15 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Ah, there will be an mfence after the vmovdqa when necessary for an atomic
store, thanks (I missed that because the testcase doesn't scan for mfence).


* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
                   ` (15 preceding siblings ...)
  2022-11-14  9:52 ` amonakov at gcc dot gnu.org
@ 2022-11-15  7:18 ` cvs-commit at gcc dot gnu.org
  2022-11-15  7:20 ` jakub at gcc dot gnu.org
                   ` (16 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-11-15  7:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #16 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:4a7a846687e076eae58ad3ea959245b2bf7fdc07

commit r13-4048-g4a7a846687e076eae58ad3ea959245b2bf7fdc07
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Tue Nov 15 08:14:45 2022 +0100

    libatomic: Handle AVX+CX16 AMD like Intel for 16b atomics [PR104688]

    We got a response from AMD in
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688#c10
    so the following patch starts treating AMD with AVX and CMPXCHG16B
    ISAs like Intel by using vmovdqa for atomic load/store in libatomic.
    We still don't have confirmation from Zhaoxin and VIA (anything else
    with CPUs featuring AVX and CX16?).

    2022-11-15  Jakub Jelinek  <jakub@redhat.com>

            PR target/104688
            * config/x86/init.c (__libat_feat1_init): Don't clear
            bit_AVX on AMD CPUs.


* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
                   ` (16 preceding siblings ...)
  2022-11-15  7:18 ` cvs-commit at gcc dot gnu.org
@ 2022-11-15  7:20 ` jakub at gcc dot gnu.org
  2022-11-20 23:11 ` cvs-commit at gcc dot gnu.org
                   ` (15 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: jakub at gcc dot gnu.org @ 2022-11-15  7:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #17 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Fixed for AMD on the library side too.
We need a statement from Zhaoxin and VIA for their CPUs.


* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
                   ` (17 preceding siblings ...)
  2022-11-15  7:20 ` jakub at gcc dot gnu.org
@ 2022-11-20 23:11 ` cvs-commit at gcc dot gnu.org
  2022-11-21  9:23 ` cvs-commit at gcc dot gnu.org
                   ` (14 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-11-20 23:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #18 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The releases/gcc-12 branch has been updated by Jakub Jelinek
<jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:86dea99d8525bf49d51636332d6be440e51b931a

commit r12-8920-g86dea99d8525bf49d51636332d6be440e51b931a
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Tue Nov 15 08:14:45 2022 +0100

    libatomic: Handle AVX+CX16 AMD like Intel for 16b atomics [PR104688]

    We got a response from AMD in
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688#c10
    so the following patch starts treating AMD with AVX and CMPXCHG16B
    ISAs like Intel by using vmovdqa for atomic load/store in libatomic.
    We still don't have confirmation from Zhaoxin and VIA (anything else
    with CPUs featuring AVX and CX16?).

    2022-11-15  Jakub Jelinek  <jakub@redhat.com>

            PR target/104688
            * config/x86/init.c (__libat_feat1_init): Don't clear
            bit_AVX on AMD CPUs.

    (cherry picked from commit 4a7a846687e076eae58ad3ea959245b2bf7fdc07)


* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
                   ` (18 preceding siblings ...)
  2022-11-20 23:11 ` cvs-commit at gcc dot gnu.org
@ 2022-11-21  9:23 ` cvs-commit at gcc dot gnu.org
  2022-11-23  9:18 ` xry111 at gcc dot gnu.org
                   ` (13 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-11-21  9:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #19 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The releases/gcc-11 branch has been updated by Jakub Jelinek
<jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:60880f3afc82f55b834643e449883dd5b6ad057a

commit r11-10385-g60880f3afc82f55b834643e449883dd5b6ad057a
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Tue Nov 15 08:14:45 2022 +0100

    libatomic: Handle AVX+CX16 AMD like Intel for 16b atomics [PR104688]

    We got a response from AMD in
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688#c10
    so the following patch starts treating AMD with AVX and CMPXCHG16B
    ISAs like Intel by using vmovdqa for atomic load/store in libatomic.
    We still don't have confirmation from Zhaoxin and VIA (anything else
    with CPUs featuring AVX and CX16?).

    2022-11-15  Jakub Jelinek  <jakub@redhat.com>

            PR target/104688
            * config/x86/init.c (__libat_feat1_init): Don't clear
            bit_AVX on AMD CPUs.

    (cherry picked from commit 4a7a846687e076eae58ad3ea959245b2bf7fdc07)
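
As a rough illustration of the kind of feature test this commit message
describes, the sketch below asks CPUID leaf 1 whether both AVX and CMPXCHG16B
are reported.  It is not the libatomic code: the function name
cpu_has_avx_and_cx16 is made up for this example, and it assumes a compiler
that ships <cpuid.h> with __get_cpuid, bit_AVX and bit_CMPXCHG16B (GCC does).
It makes no attempt to distinguish vendors.

```c
#include <stdbool.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Hypothetical helper: true if CPUID.01H:ECX reports both AVX and
   CMPXCHG16B.  Always false on non-x86 targets. */
static bool cpu_has_avx_and_cx16(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid returns 0 if the requested leaf is not supported. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;

    return (ecx & bit_AVX) != 0 && (ecx & bit_CMPXCHG16B) != 0;
#else
    return false;
#endif
}
```

A caller would use the result only to pick between a 16-byte vector move and
a fallback path; which vendors may take the fast path is what the rest of
this thread discusses.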

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
                   ` (19 preceding siblings ...)
  2022-11-21  9:23 ` cvs-commit at gcc dot gnu.org
@ 2022-11-23  9:18 ` xry111 at gcc dot gnu.org
  2022-11-23  9:51 ` jakub at gcc dot gnu.org
                   ` (12 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: xry111 at gcc dot gnu.org @ 2022-11-23  9:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #20 from Xi Ruoyao <xry111 at gcc dot gnu.org> ---
From Mayshao (Zhaoxin engineer):

"On Zhaoxin CPUs with AVX, the VMOVDQA instruction is atomic if the accessed
memory is Write Back, but it's not guaranteed for other memory types."

Is it allowed to use VMOVDQA then?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
                   ` (20 preceding siblings ...)
  2022-11-23  9:18 ` xry111 at gcc dot gnu.org
@ 2022-11-23  9:51 ` jakub at gcc dot gnu.org
  2022-11-23 10:23 ` xry111 at gcc dot gnu.org
                   ` (11 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: jakub at gcc dot gnu.org @ 2022-11-23  9:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #21 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
What about loads?  That is even more important than the stores.  While atomic
store can be worst case done through cmpxchg16b, even when it is slower, we
can't use cmpxchg16b on atomic load because we don't know if the memory isn't
read-only.
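
To make the read-only problem concrete, here is a small sketch using a 64-bit
object as a stand-in for the 16-byte case (so it builds without -mcx16 or
libatomic).  The helper names are invented for this example.  The point is
that a load emulated with compare-and-swap needs a writable object, because
on x86 a locked cmpxchg performs a write to the destination even when the
comparison fails, while a plain atomic load accepts a const pointer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: emulate an atomic load with compare-and-swap.
   The pointer cannot be const, and the access would fault on a
   read-only mapping. */
static uint64_t load_via_cas(uint64_t *p)
{
    uint64_t expected = 0;

    /* If *p == 0 this stores 0 again; otherwise the CAS fails and the
       current value is copied into expected.  Either way, expected
       ends up holding the value that was in *p. */
    __atomic_compare_exchange_n(p, &expected, 0, false,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return expected;
}

/* A real atomic load only reads, so it works on const objects. */
static uint64_t load_plain(const uint64_t *p)
{
    return __atomic_load_n(p, __ATOMIC_SEQ_CST);
}
```
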
As for the Write Back only vs. other types, doesn't that match the
" for cacheable" in the AMD statement?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
                   ` (21 preceding siblings ...)
  2022-11-23  9:51 ` jakub at gcc dot gnu.org
@ 2022-11-23 10:23 ` xry111 at gcc dot gnu.org
  2022-11-28 18:35 ` peter at cordes dot ca
                   ` (10 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: xry111 at gcc dot gnu.org @ 2022-11-23 10:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #22 from Xi Ruoyao <xry111 at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #21)
> What about loads?  That is even more important than the stores.  While
> atomic store can be worst case done through cmpxchg16b, even when it is
> slower, we can't use cmpxchg16b on atomic load because we don't know if the
> memory isn't read-only.

Loads are also atomic for WB.

> As for the Write Back only vs. other types, doesn't that match the
> " for cacheable" in the AMD statement?

If I read the manual correctly, Write Back, Write Through, and Write Protected
are all "cacheable".  Mayshao told me VMOVDQA is atomic for WB, but not atomic
for UC and WC (those are not cacheable, so I think we don't need to worry about
them).  So how about WT and WP?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
                   ` (22 preceding siblings ...)
  2022-11-23 10:23 ` xry111 at gcc dot gnu.org
@ 2022-11-28 18:35 ` peter at cordes dot ca
  2022-11-28 18:46 ` amonakov at gcc dot gnu.org
                   ` (9 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: peter at cordes dot ca @ 2022-11-28 18:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

Peter Cordes <peter at cordes dot ca> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |peter at cordes dot ca

--- Comment #23 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Xi Ruoyao from comment #20)
> "On Zhaoxin CPUs with AVX, the VMOVDQA instruction is atomic if the accessed
> memory is Write Back, but it's not guaranteed for other memory types."

VMOVDQA is still fine.  I think WB is the only memory type that's relevant for
atomics, at least on the mainstream OSes we compile for.  It's not normally
possible for user-space to allocate memory of other types.  Kernels normally
use WB memory for their shared data, too.

You're correct that WT and WP are the other two cacheable memory types, and
Zhaoxin's statement doesn't explicitly guarantee atomicity for those, unlike
Intel and AMD.

But at least on Linux, I don't think there's a way for user-space to even ask
for a page of WT or WP memory (or UC or WC).  Only WB memory is easily
available without hacking the kernel.  As far as I know, this is true on other
existing OSes.

WT = write-through: read caching, no write-allocate.  Write hits update the
line and memory.
WP = write-protect: read caching, no write-allocate.  Writes go around the
cache, evicting even on hit.
(https://stackoverflow.com/questions/65953033/whats-the-usecase-of-write-protected-pat-memory-type
quotes the Intel definitions.)

Until recently, the main work on formalizing the x86 TSO memory model had only
looked at WB memory.
A 2022 paper looked at WT, UC, and WC memory types:
https://dl.acm.org/doi/pdf/10.1145/3498683 - "Extending Intel-x86 Consistency
and Persistency: Formalising the Semantics of Intel-x86 Memory Types and
Non-temporal Stores"
(The intro part describing memory types is quite readable, in plain English not
full of formal symbols.  They only mention WP once, but tested some litmus
tests with readers and writers using any combination of the other memory
types.)

Some commenters on my Stack Overflow answer about when WT is ever used or
useful confirmed that mainstream OSes don't give easy access to it.
https://stackoverflow.com/questions/61129142/when-use-write-through-cache-policy-for-pages/61130838#61130838
* Linux has never merged a patch to let user-space allocate WT pages.
* The Windows kernel reportedly doesn't have a mechanism to keep track of pages
that should be WT or WP, so you won't find any.

I don't know about *BSD making it plausible for user-space to point an _Atomic
int * at a page of WT or WP memory.  I'd guess not.

I don't know if there's anywhere we can document that _Atomic objects need to
be in memory that's allocated in a "normal" way.  Probably hard to word without
accidentally disallowing something that's fine.
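
The vendor statements reported in this thread can be summarised as a small
lookup.  This is only a sketch of what commenters have said here, not an
authoritative reading of any vendor manual, and the enum and function names
are invented for the example.  Intel's entry for UC follows the remark in
comment #25 of this thread; WC is treated conservatively because no statement
quoted in the thread covers it for Intel or AMD.

```c
#include <stdbool.h>

/* x86 memory types discussed in the thread. */
enum mem_type { MEM_WB, MEM_WT, MEM_WP, MEM_UC, MEM_WC };

/* Vendors for which the thread reports a statement. */
enum cpu_vendor { VENDOR_INTEL, VENDOR_AMD, VENDOR_ZHAOXIN };

/* WB, WT and WP are the cacheable types; UC and WC are not. */
static bool is_cacheable(enum mem_type t)
{
    return t == MEM_WB || t == MEM_WT || t == MEM_WP;
}

/* Hypothetical summary: does the thread report a 16-byte atomicity
   statement for aligned VMOVDQA on AVX-capable CPUs of this vendor
   for this memory type? */
static bool vmovdqa_atomicity_stated(enum cpu_vendor v, enum mem_type t)
{
    switch (v) {
    case VENDOR_INTEL:
        return is_cacheable(t) || t == MEM_UC;  /* UC per comment #25 */
    case VENDOR_AMD:
        return is_cacheable(t);                 /* "cacheable" only */
    case VENDOR_ZHAOXIN:
        return t == MEM_WB;                     /* Write Back only */
    }
    return false;
}
```

For the WB memory that user-space normally gets, all three entries agree.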

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
                   ` (23 preceding siblings ...)
  2022-11-28 18:35 ` peter at cordes dot ca
@ 2022-11-28 18:46 ` amonakov at gcc dot gnu.org
  2022-11-28 19:03 ` peter at cordes dot ca
                   ` (8 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: amonakov at gcc dot gnu.org @ 2022-11-28 18:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #24 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
(In reply to Peter Cordes from comment #23)
> But at least on Linux, I don't think there's a way for user-space to even
> ask for a page of WT or WP memory (or UC or WC).  Only WB memory is easily
> available without hacking the kernel.  As far as I know, this is true on
> other existing OSes.

I think it's possible to get UC/WC mappings via a graphics/compute API (e.g.
OpenGL, Vulkan, OpenCL, CUDA) on any OS if you get a mapping to device memory
(and then CPU vendor cannot guarantee that 128b access won't tear because it
might depend on downstream devices).

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
                   ` (24 preceding siblings ...)
  2022-11-28 18:46 ` amonakov at gcc dot gnu.org
@ 2022-11-28 19:03 ` peter at cordes dot ca
  2022-11-28 20:11 ` amonakov at gcc dot gnu.org
                   ` (7 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: peter at cordes dot ca @ 2022-11-28 19:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #25 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Alexander Monakov from comment #24)
> 
> I think it's possible to get UC/WC mappings via a graphics/compute API (e.g.
> OpenGL, Vulkan, OpenCL, CUDA) on any OS if you get a mapping to device
> memory (and then CPU vendor cannot guarantee that 128b access won't tear
> because it might depend on downstream devices).

Even atomic_int doesn't work properly if you deref a pointer to WC memory.  WC
doesn't have the same ordering guarantees, so it would break acquire/release
semantics.
So we already don't support WC for this.

We do at least de-facto support atomics on UC memory because the ordering
guarantees are a superset of cacheable memory, and 8-byte atomicity for aligned
load/store is guaranteed even for non-cacheable memory types since P5 Pentium
(and on AMD).  (And lock cmpxchg16b is always atomic even on UC memory.)

But you're right that only Intel guarantees that 16-byte VMOVDQA loads/stores
would be atomic on UC memory.  So this change could break that very unwise
corner-case on AMD which only guarantees that for cacheable loads/stores, and
Zhaoxin only for WB.

But was anyone previously using 16-byte atomics on UC device memory?  Do we
actually care about supporting that?  I'd guess no and no, so it's just a
matter of documenting that somewhere.

Since GCC7 we've reported 16-byte atomics as being non-lock-free, so I *hope*
people weren't using __atomic_store_n on device memory.  The underlying
implementation was never guaranteed.
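
What the compiler reports about lock-freedom can be checked at compile time.
A minimal sketch, assuming GCC or a compatible compiler that provides the
__atomic_always_lock_free builtin; the helper names are made up for this
example.  The 16-byte answer depends on the compiler version and flags, so
no expected value is given for it.

```c
#include <stdbool.h>

/* True if objects the size of int are always lock-free on this target. */
static bool int_is_always_lock_free(void)
{
    /* A null pointer argument means "assume typical alignment". */
    return __atomic_always_lock_free(sizeof(int), 0);
}

/* True if the compiler reports 16-byte objects as always lock-free. */
static bool sixteen_bytes_always_lock_free(void)
{
    return __atomic_always_lock_free(16, 0);
}
```

Code that must not silently fall back to a lock could test the second helper
(or the builtin directly in a static assertion) before using 16-byte atomics.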

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
                   ` (25 preceding siblings ...)
  2022-11-28 19:03 ` peter at cordes dot ca
@ 2022-11-28 20:11 ` amonakov at gcc dot gnu.org
  2022-11-28 20:47 ` peter at cordes dot ca
                   ` (6 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: amonakov at gcc dot gnu.org @ 2022-11-28 20:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #26 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Sure, the right course of action seems to be to simply document that atomic
types and built-ins are meant to be used on "common" (writeback) memory, and no
guarantees can be given otherwise, because it would involve platform specifics
(relaxed ordering of WC writes as you say; tearing by PCI bridges and device
interfaces seems like another possible caveat).

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
                   ` (26 preceding siblings ...)
  2022-11-28 20:11 ` amonakov at gcc dot gnu.org
@ 2022-11-28 20:47 ` peter at cordes dot ca
  2022-11-29  8:11 ` fw at gcc dot gnu.org
                   ` (5 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: peter at cordes dot ca @ 2022-11-28 20:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #27 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Alexander Monakov from comment #26)
> Sure, the right course of action seems to be to simply document that atomic
> types and built-ins are meant to be used on "common" (writeback) memory

Agreed.  Where in the manual should this go?  Maybe a new subsection of the
chapter about __atomic builtins where we document per-ISA requirements for them
to actually work?

e.g. x86 memory-type stuff, and that ARM assumes all cores are in the same
inner-shareable cache-coherency domain, thus barriers are "dmb ish", not
"dmb sy", and so on.
I guess we might want to avoid documenting the actual asm implementation
strategies in the main manual, because that would imply it's supported to make
assumptions based on that.

Putting it near the __atomic docs might make it easier for readers to notice
that the list of requirements exists, vs. scattering them into different pages
for different ISAs.  And we don't currently have any section in the manual
about per-ISA quirks or requirements, just about command-line options,
builtins, and attributes that are per-ISA, so there's no existing page where
this could get tacked on.

This would also be a place where we can document that __atomic ops are
address-free when they're lock-free, and thus usable on shared memory between
processes.  ISO C++ says that *should* be the case for std::atomic<T>, but
doesn't standardize the existence of multiple processes.

To avoid undue worry, documentation about this should probably start by saying
that normal programs (running under mainstream OSes) don't have to worry about
it or do anything special.
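
The address-free property can be illustrated with a counter that two
processes update through a shared mapping.  This is a sketch under
assumptions: a POSIX system with fork and MAP_ANONYMOUS, and a target where
atomic_long is lock-free.  The function name count_across_processes is
invented for the example.

```c
#define _DEFAULT_SOURCE
#include <stdatomic.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Parent and child each add 1 to a shared counter per_process times.
   Returns the final counter value, or -1 on error. */
static long count_across_processes(int per_process)
{
    /* Anonymous shared mapping: ordinary write-back memory that stays
       shared with the child after fork. */
    atomic_long *counter = mmap(NULL, sizeof *counter,
                                PROT_READ | PROT_WRITE,
                                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (counter == MAP_FAILED)
        return -1;
    atomic_init(counter, 0);

    pid_t pid = fork();
    if (pid < 0) {
        munmap(counter, sizeof *counter);
        return -1;
    }

    /* Both processes run this loop on the same physical memory. */
    for (int i = 0; i < per_process; i++)
        atomic_fetch_add(counter, 1);

    if (pid == 0)
        _exit(0);               /* child is done */

    int status;
    if (waitpid(pid, &status, 0) < 0) {
        munmap(counter, sizeof *counter);
        return -1;
    }

    long result = atomic_load(counter);
    munmap(counter, sizeof *counter);
    return result;
}
```

With a lock-based (non-address-free) implementation the two processes would
not share the lock, so increments could be lost; with lock-free operations
the final value is the sum of both loops.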

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
                   ` (27 preceding siblings ...)
  2022-11-28 20:47 ` peter at cordes dot ca
@ 2022-11-29  8:11 ` fw at gcc dot gnu.org
  2023-02-15 12:27 ` segher at gcc dot gnu.org
                   ` (4 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: fw at gcc dot gnu.org @ 2022-11-29  8:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #28 from Florian Weimer <fw at gcc dot gnu.org> ---
(In reply to Peter Cordes from comment #27)
> (In reply to Alexander Monakov from comment #26)
> > Sure, the right course of action seems to be to simply document that atomic
> > types and built-ins are meant to be used on "common" (writeback) memory
> 
> Agreed.  Where in the manual should this go?  Maybe a new subsection of the
> chapter about __atomic builtins where we document per-ISA requirements for
> them to actually work?

Maybe this belongs in the ABI manual? For example, the POWER ABI says that
memcpy needs to work on device memory. Documenting the required memory types
for atomics seems along the same lines.

The rules are also potentially different for different targets sharing the same
processor architecture.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
                   ` (28 preceding siblings ...)
  2022-11-29  8:11 ` fw at gcc dot gnu.org
@ 2023-02-15 12:27 ` segher at gcc dot gnu.org
  2023-02-15 12:46 ` fw at gcc dot gnu.org
                   ` (3 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: segher at gcc dot gnu.org @ 2023-02-15 12:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #29 from Segher Boessenkool <segher at gcc dot gnu.org> ---
(In reply to Florian Weimer from comment #28)
> Maybe this belongs in the ABI manual? For example, the POWER ABI says that
> memcpy needs to work on device memory.

Huh?!

Where do you see this?  The way you state it, it is trivially impossible to
implement, so if we really say that, it needs fixing asap.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
                   ` (29 preceding siblings ...)
  2023-02-15 12:27 ` segher at gcc dot gnu.org
@ 2023-02-15 12:46 ` fw at gcc dot gnu.org
  2023-02-15 16:03 ` segher at gcc dot gnu.org
                   ` (2 subsequent siblings)
  33 siblings, 0 replies; 35+ messages in thread
From: fw at gcc dot gnu.org @ 2023-02-15 12:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #30 from Florian Weimer <fw at gcc dot gnu.org> ---
(In reply to Segher Boessenkool from comment #29)
> (In reply to Florian Weimer from comment #28)
> > Maybe this belongs in the ABI manual? For example, the POWER ABI says that
> > memcpy needs to work on device memory.
> 
> Huh?!
> 
> Where do you see this?  The way you state it it is trivially impossible to
> implement, so if we really say that it needs fixing asap.

I thought I had an explicit documented reference somewhere, but for now, all we
have is an undocumented requirement (so not a good example in the context of
this bug at all):

[PATCH] powerpc: Use aligned stores in memset
<https://inbox.sourceware.org/libc-alpha/1503033107-20047-1-git-send-email-raji@linux.vnet.ibm.com/>

(There's also a CPU quirk in this area, but I think this wasn't about that.)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
                   ` (30 preceding siblings ...)
  2023-02-15 12:46 ` fw at gcc dot gnu.org
@ 2023-02-15 16:03 ` segher at gcc dot gnu.org
  2023-02-15 16:07 ` pinskia at gcc dot gnu.org
  2023-02-15 16:09 ` segher at gcc dot gnu.org
  33 siblings, 0 replies; 35+ messages in thread
From: segher at gcc dot gnu.org @ 2023-02-15 16:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #31 from Segher Boessenkool <segher at gcc dot gnu.org> ---
Yes, there was a user who incorrectly used memcpy on non-memory memory.

This is not valid, and never has been.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
                   ` (31 preceding siblings ...)
  2023-02-15 16:03 ` segher at gcc dot gnu.org
@ 2023-02-15 16:07 ` pinskia at gcc dot gnu.org
  2023-02-15 16:09 ` segher at gcc dot gnu.org
  33 siblings, 0 replies; 35+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-02-15 16:07 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #32 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Segher Boessenkool from comment #31)
> Yes, there was a user who incorrectly used memcpy on non-memory memory.
From what I remember (it was also reported about aarch64 at one point too), one
of the graphics libraries would call memcpy from normal memory to GPU Memory
(over PCIe) and memcpy will sometimes use unaligned accesses which causes a
fault to the GPU memory.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
  2022-02-25 14:22 [Bug target/104688] New: gcc and libatomic can use SSE for 128-bit atomic loads on Intel CPUs with AVX xry111 at mengyan1223 dot wang
                   ` (32 preceding siblings ...)
  2023-02-15 16:07 ` pinskia at gcc dot gnu.org
@ 2023-02-15 16:09 ` segher at gcc dot gnu.org
  33 siblings, 0 replies; 35+ messages in thread
From: segher at gcc dot gnu.org @ 2023-02-15 16:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #33 from Segher Boessenkool <segher at gcc dot gnu.org> ---
Yes, exactly.  It was the X server I think?  I try to forget such horrors :-)

^ permalink raw reply	[flat|nested] 35+ messages in thread
