[Bug target/80878] -mcx16 (enable 128 bit CAS) on x86

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug target/80878] -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0
       [not found] <bug-80878-4@http.gcc.gnu.org/bugzilla/>
@ 2020-04-18 17:14 ` avi@cloudius-systems.com
  2020-04-18 17:23 ` fw at gcc dot gnu.org
                   ` (21 subsequent siblings)
  22 siblings, 0 replies; 23+ messages in thread
From: avi@cloudius-systems.com @ 2020-04-18 17:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

--- Comment #20 from Avi Kivity <avi@cloudius-systems.com> ---
Note that clang generates cmpxchg16b when the conditions are ripe:

https://godbolt.org/z/j9Whgh

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug target/80878] -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0
       [not found] <bug-80878-4@http.gcc.gnu.org/bugzilla/>
  2020-04-18 17:14 ` [Bug target/80878] -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0 avi@cloudius-systems.com
@ 2020-04-18 17:23 ` fw at gcc dot gnu.org
  2020-04-18 17:31 ` avi@cloudius-systems.com
                   ` (20 subsequent siblings)
  22 siblings, 0 replies; 23+ messages in thread
From: fw at gcc dot gnu.org @ 2020-04-18 17:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

--- Comment #21 from Florian Weimer <fw at gcc dot gnu.org> ---
(In reply to Avi Kivity from comment #20)
> Note that clang generates cmpxchg16b when the conditions are ripe:
> 
> https://godbolt.org/z/j9Whgh

I believe this is a different, C++-specific issue. The C front end already
emits cmpxchgq in this situation.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug target/80878] -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0
       [not found] <bug-80878-4@http.gcc.gnu.org/bugzilla/>
  2020-04-18 17:14 ` [Bug target/80878] -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0 avi@cloudius-systems.com
  2020-04-18 17:23 ` fw at gcc dot gnu.org
@ 2020-04-18 17:31 ` avi@cloudius-systems.com
  2020-04-18 17:40 ` fw at gcc dot gnu.org
                   ` (19 subsequent siblings)
  22 siblings, 0 replies; 23+ messages in thread
From: avi@cloudius-systems.com @ 2020-04-18 17:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

--- Comment #22 from Avi Kivity <avi@cloudius-systems.com> ---
Perhaps PR 84522 should be reopened and unmarked as a duplicate. While the
reproducer there is a C API, it is the C equivalent of <atomic>
(<stdatomic.h>).

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug target/80878] -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0
       [not found] <bug-80878-4@http.gcc.gnu.org/bugzilla/>
                   ` (2 preceding siblings ...)
  2020-04-18 17:31 ` avi@cloudius-systems.com
@ 2020-04-18 17:40 ` fw at gcc dot gnu.org
  2020-04-18 18:13 ` avi@cloudius-systems.com
                   ` (18 subsequent siblings)
  22 siblings, 0 replies; 23+ messages in thread
From: fw at gcc dot gnu.org @ 2020-04-18 17:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

--- Comment #23 from Florian Weimer <fw at gcc dot gnu.org> ---
Ahh, I think this bug here is specific to __uint128 (with the C front end at
least)

The C translation of the C++ reproducer from comment 20:

struct a
{
  long  _Alignas(16) x;
  long y;
};

_Bool
cmpxchg (struct a *data, struct a expected, struct a newval)
{
  return __atomic_compare_exchange_n (&data, &expected, &newval, 1,
                                      __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

produces the atomic instruction.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug target/80878] -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0
       [not found] <bug-80878-4@http.gcc.gnu.org/bugzilla/>
                   ` (3 preceding siblings ...)
  2020-04-18 17:40 ` fw at gcc dot gnu.org
@ 2020-04-18 18:13 ` avi@cloudius-systems.com
  2020-04-18 18:18 ` avi@cloudius-systems.com
                   ` (17 subsequent siblings)
  22 siblings, 0 replies; 23+ messages in thread
From: avi@cloudius-systems.com @ 2020-04-18 18:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

--- Comment #24 from Avi Kivity <avi@cloudius-systems.com> ---
I'll file a new PR.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug target/80878] -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0
       [not found] <bug-80878-4@http.gcc.gnu.org/bugzilla/>
                   ` (4 preceding siblings ...)
  2020-04-18 18:13 ` avi@cloudius-systems.com
@ 2020-04-18 18:18 ` avi@cloudius-systems.com
  2020-04-18 18:41 ` fw at gcc dot gnu.org
                   ` (16 subsequent siblings)
  22 siblings, 0 replies; 23+ messages in thread
From: avi@cloudius-systems.com @ 2020-04-18 18:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

--- Comment #25 from Avi Kivity <avi@cloudius-systems.com> ---
PR 94649.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug target/80878] -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0
       [not found] <bug-80878-4@http.gcc.gnu.org/bugzilla/>
                   ` (5 preceding siblings ...)
  2020-04-18 18:18 ` avi@cloudius-systems.com
@ 2020-04-18 18:41 ` fw at gcc dot gnu.org
  2021-03-14  3:21 ` wuyongwei at gmail dot com
                   ` (15 subsequent siblings)
  22 siblings, 0 replies; 23+ messages in thread
From: fw at gcc dot gnu.org @ 2020-04-18 18:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

--- Comment #26 from Florian Weimer <fw at gcc dot gnu.org> ---
(In reply to Florian Weimer from comment #23)
> Ahh, I think this bug here is specific to __uint128 (with the C front end at
> least)
> 
> The C translation of the C++ reproducer from comment 20:
> 
> struct a
> {
>   long  _Alignas(16) x;
>   long y;
> };
> 
> _Bool
> cmpxchg (struct a *data, struct a expected, struct a newval)
> {
>   return __atomic_compare_exchange_n (&data, &expected, &newval, 1,
>                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
> }
> 
> produces the atomic instruction.

Eh, no, that code does something else.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug target/80878] -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0
       [not found] <bug-80878-4@http.gcc.gnu.org/bugzilla/>
                   ` (6 preceding siblings ...)
  2020-04-18 18:41 ` fw at gcc dot gnu.org
@ 2021-03-14  3:21 ` wuyongwei at gmail dot com
  2021-03-14  3:46 ` wuyongwei at gmail dot com
                   ` (14 subsequent siblings)
  22 siblings, 0 replies; 23+ messages in thread
From: wuyongwei at gmail dot com @ 2021-03-14  3:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

Yongwei Wu <wuyongwei at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wuyongwei at gmail dot com

--- Comment #27 from Yongwei Wu <wuyongwei at gmail dot com> ---
Anyone can show a valid use case for a non-lock-free version of 128-bit
atomic_compare_exchange?

I am trying to use it in a data structure intended to be lock-free. I am
surprised to find that the C++ std::atomic::compare_exchange_weak does not
result in lock-free code for a 128-bit struct intended for ABA-free CAS. As a
result, the GCC-generated code is MUCH slower than the mutex-based version in
my 8-thread contention test, defeating all its valid purposes. I am talking
about a 10x difference. And the Clang-generated code is more than 200x faster
in the same test.

Friends, being 200x worse in an important use case (lock-free, ABA-free data
structures like queues and lists) is not funny at all.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug target/80878] -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0
       [not found] <bug-80878-4@http.gcc.gnu.org/bugzilla/>
                   ` (7 preceding siblings ...)
  2021-03-14  3:21 ` wuyongwei at gmail dot com
@ 2021-03-14  3:46 ` wuyongwei at gmail dot com
  2021-03-16 16:02 ` wuyongwei at gmail dot com
                   ` (13 subsequent siblings)
  22 siblings, 0 replies; 23+ messages in thread
From: wuyongwei at gmail dot com @ 2021-03-14  3:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

--- Comment #28 from Yongwei Wu <wuyongwei at gmail dot com> ---
OK, somewhat answering myself. I was not aware of the fact that 128-bit atomic
read has to be implemented using cmpxchg16b as well, thus defeating some
non-CAS usage scenarios.

The natural question is: which usage scenario is more significant? Or is there
a way to support both?

I still think lock-free data structures are too import to ignore.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug target/80878] -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0
       [not found] <bug-80878-4@http.gcc.gnu.org/bugzilla/>
                   ` (8 preceding siblings ...)
  2021-03-14  3:46 ` wuyongwei at gmail dot com
@ 2021-03-16 16:02 ` wuyongwei at gmail dot com
  2021-05-06 17:50 ` s_gccbugzilla at nedprod dot com
                   ` (12 subsequent siblings)
  22 siblings, 0 replies; 23+ messages in thread
From: wuyongwei at gmail dot com @ 2021-03-16 16:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

--- Comment #29 from Yongwei Wu <wuyongwei at gmail dot com> ---
As usual, test results are always elusive. I have to add yet another important
piece of information. The very bad performance result does not occur on Linux,
but only on macOS (Homebrew versions of GCC and libatomic).

So far, it seems to indicate that this is more a libatomic issue on macOS,
which I traced to pthread_mutex_lock, instead of "lock cmpxchg16b" on Linux...

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug target/80878] -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0
       [not found] <bug-80878-4@http.gcc.gnu.org/bugzilla/>
                   ` (9 preceding siblings ...)
  2021-03-16 16:02 ` wuyongwei at gmail dot com
@ 2021-05-06 17:50 ` s_gccbugzilla at nedprod dot com
  2021-05-06 19:32 ` pinskia at gcc dot gnu.org
                   ` (11 subsequent siblings)
  22 siblings, 0 replies; 23+ messages in thread
From: s_gccbugzilla at nedprod dot com @ 2021-05-06 17:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

Niall Douglas <s_gccbugzilla at nedprod dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |s_gccbugzilla at nedprod dot com

--- Comment #30 from Niall Douglas <s_gccbugzilla at nedprod dot com> ---
I got bit by this GCC regression today at work. Consider
https://godbolt.org/z/M9fd7nhdh where std::atomic<__int128> is compare
exchanged with -march=sandybridge:

- On GCC 6.4 and earlier, this emits lock cmpxchg16b, as you would expect.

- From GCC 7 up to trunk (12?), this emits __atomic_compare_exchange_16.

- On clang, this emits lock cmpxchg16b, as you would expect.

This is clearly a regression. GCCs before 7 did the right thing. GCCs from 7
onwards do not. clangs with libstdc++ do do the right thing.

This isn't just an x64 thing, either. Consider https://godbolt.org/z/x6d5GE4o6
where GCC on ARM64 emits __atomic_compare_exchange_16, whereas clang on ARM64
emits ldaxp/stlxp, as you would expect.

Please mark this bug as a regression affecting all versions of GCC from 7 to
trunk, and affecting all 128 bit atomic capable architectures not just x64.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug target/80878] -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0
       [not found] <bug-80878-4@http.gcc.gnu.org/bugzilla/>
                   ` (10 preceding siblings ...)
  2021-05-06 17:50 ` s_gccbugzilla at nedprod dot com
@ 2021-05-06 19:32 ` pinskia at gcc dot gnu.org
  2021-05-06 20:53 ` liblfds_gccbz at winterflaw dot net
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 23+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-05-06 19:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

--- Comment #31 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Niall Douglas from comment #30)
> I got bit by this GCC regression today at work. Consider
> https://godbolt.org/z/M9fd7nhdh where std::atomic<__int128> is compare
> exchanged with -march=sandybridge:
> 
> - On GCC 6.4 and earlier, this emits lock cmpxchg16b, as you would expect.
> 
> - From GCC 7 up to trunk (12?), this emits __atomic_compare_exchange_16.
> 
> - On clang, this emits lock cmpxchg16b, as you would expect.
> 
> This is clearly a regression. GCCs before 7 did the right thing. GCCs from 7
> onwards do not. clangs with libstdc++ do do the right thing.
> 
> This isn't just an x64 thing, either. Consider
> https://godbolt.org/z/x6d5GE4o6 where GCC on ARM64 emits
> __atomic_compare_exchange_16, whereas clang on ARM64 emits ldaxp/stlxp, as
> you would expect.
> 
> Please mark this bug as a regression affecting all versions of GCC from 7 to
> trunk, and affecting all 128 bit atomic capable architectures not just x64.

Again the problem is stuff like:
static const _Atomic __int128_t t = 2000;

__int128_t g(void)
{
  return t;
}

DOES NOT WORK if you use CAS (or ldaxp/stlxp).

So clang is broken really ....

Also GCC for ARM64 emits calls for all compare and exchange because using the
LSE (from ARMv8.1-a) is useful.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug target/80878] -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0
       [not found] <bug-80878-4@http.gcc.gnu.org/bugzilla/>
                   ` (11 preceding siblings ...)
  2021-05-06 19:32 ` pinskia at gcc dot gnu.org
@ 2021-05-06 20:53 ` liblfds_gccbz at winterflaw dot net
  2021-05-07 12:27 ` s_gccbugzilla at nedprod dot com
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 23+ messages in thread
From: liblfds_gccbz at winterflaw dot net @ 2021-05-06 20:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

liblfds admin <liblfds_gccbz at winterflaw dot net> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |liblfds_gccbz at winterflaw dot ne
                   |                            |t

--- Comment #32 from liblfds admin <liblfds_gccbz at winterflaw dot net> ---
(In reply to Andrew Pinski from comment #31)
> Again the problem is stuff like:
> static const _Atomic __int128_t t = 2000;
> 
> __int128_t g(void)
> {
>   return t;
> }
> 
> DOES NOT WORK if you use CAS (or ldaxp/stlxp).
> 
> So clang is broken really ....
> 
> Also GCC for ARM64 emits calls for all compare and exchange because using
> the LSE (from ARMv8.1-a) is useful.

It may be a case of selecting the lesser of two evils.

The problem for me, as an author of a lock-free data structure library, is that
a mutex is not repeat NOT a replacement for a compare-exchange instruction.

This is because lock-free data structures possess the property of not sleeping.
 Such data structures are used in kernels, at times and in places where
sleeping is absolutely forbidden and will cause the kernel to panic. 
Accordingly, replacing an atomic exchange with a mutex does *not* provide
identical functionality - an atomic exchange works fine, a mutex makes the
kernel panic.

To reiterate : I write a library of lock-free data structures, and on the face
of it you would then think I would be a prime user of libatomic, and I
*specifically* MUST avoid libatomic, and indeed have basically implemented my
own, because of how libatomic behaves.

It's crazy that people writing lock-free data structure must specifically
ensure and guarantee they absolutely do not touch libatomic,

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug target/80878] -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0
       [not found] <bug-80878-4@http.gcc.gnu.org/bugzilla/>
                   ` (12 preceding siblings ...)
  2021-05-06 20:53 ` liblfds_gccbz at winterflaw dot net
@ 2021-05-07 12:27 ` s_gccbugzilla at nedprod dot com
  2021-05-07 14:07 ` redi at gcc dot gnu.org
                   ` (8 subsequent siblings)
  22 siblings, 0 replies; 23+ messages in thread
From: s_gccbugzilla at nedprod dot com @ 2021-05-07 12:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

--- Comment #33 from Niall Douglas <s_gccbugzilla at nedprod dot com> ---
(In reply to Andrew Pinski from comment #31)
> 
> Again the problem is stuff like:
> static const _Atomic __int128_t t = 2000;
> 
> __int128_t g(void)
> {
>   return t;
> }
> 
> DOES NOT WORK if you use CAS (or ldaxp/stlxp).

I think we are talking about different things here. You are talking about
calling convention. I'm talking about
std::atomic<__int128>::compare_exchange_weak() i.e. that the specific member
function compare_exchange_weak() is not generating cmpxchg16b if compiled with
GCC, but does with clang.

Re: your original point, I cannot say anything about _Atomic. However, for
std::atomic<__int128>, as __int128 is not an integral type it seems reasonable
to me that its specialisation tell the compiler to not store it in read only
memory. Mark it with attribute section, give it a non-trivial destructor, or
whatever it needs.

Perhaps I ought to open a separate issue here, as my specific problem is that
std::atomic<__int128>::compare_exchange_weak() is not using cmpxchg16b? Mr.
Wakely your thoughts?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug target/80878] -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0
       [not found] <bug-80878-4@http.gcc.gnu.org/bugzilla/>
                   ` (13 preceding siblings ...)
  2021-05-07 12:27 ` s_gccbugzilla at nedprod dot com
@ 2021-05-07 14:07 ` redi at gcc dot gnu.org
  2021-05-07 14:44 ` s_gccbugzilla at nedprod dot com
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 23+ messages in thread
From: redi at gcc dot gnu.org @ 2021-05-07 14:07 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

--- Comment #34 from Jonathan Wakely <redi at gcc dot gnu.org> ---
(In reply to Niall Douglas from comment #33)
> Re: your original point, I cannot say anything about _Atomic. However, for
> std::atomic<__int128>, as __int128 is not an integral type it seems

That depends whether you use -std=c++NN or -std=gnu++NN

> reasonable to me that its specialisation tell the compiler to not store it
> in read only memory. Mark it with attribute section, give it a non-trivial
> destructor, or whatever it needs.

std::atomic<T> requires T to have a trivial destructor, so the destructor is
always trivial.

> Perhaps I ought to open a separate issue here, as my specific problem is
> that std::atomic<__int128>::compare_exchange_weak() is not using cmpxchg16b?

Isn't that covered by PR 94649?

std::atomic just calls the relevant __atomic built-in for all operations. What
the built-in does is not up to libstdc++.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug target/80878] -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0
       [not found] <bug-80878-4@http.gcc.gnu.org/bugzilla/>
                   ` (14 preceding siblings ...)
  2021-05-07 14:07 ` redi at gcc dot gnu.org
@ 2021-05-07 14:44 ` s_gccbugzilla at nedprod dot com
  2022-11-03  9:55 ` lh_mouse at 126 dot com
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 23+ messages in thread
From: s_gccbugzilla at nedprod dot com @ 2021-05-07 14:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

--- Comment #35 from Niall Douglas <s_gccbugzilla at nedprod dot com> ---
(In reply to Jonathan Wakely from comment #34)

> > Perhaps I ought to open a separate issue here, as my specific problem is
> > that std::atomic<__int128>::compare_exchange_weak() is not using cmpxchg16b?
> 
> Isn't that covered by PR 94649?

That issue is definitely closer to mine, but still not the same. Still, I'll
relocate this report from here to there. Thanks for pointing me at it.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug target/80878] -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0
       [not found] <bug-80878-4@http.gcc.gnu.org/bugzilla/>
                   ` (15 preceding siblings ...)
  2021-05-07 14:44 ` s_gccbugzilla at nedprod dot com
@ 2022-11-03  9:55 ` lh_mouse at 126 dot com
  2022-11-03 10:04 ` jakub at gcc dot gnu.org
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 23+ messages in thread
From: lh_mouse at 126 dot com @ 2022-11-03  9:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

LIU Hao <lh_mouse at 126 dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |lh_mouse at 126 dot com

--- Comment #36 from LIU Hao <lh_mouse at 126 dot com> ---
(In reply to Andrew Pinski from comment #1)
> IIRC this was removed as the instruction cannot be used for read only memory.

That's not a valid argument. The first argument is a pointer to non-const type,
and whoever passes a read-only object bears the risk on their own.

As mention in previous posts, the double-word compare-and-swap operation is
invaluable for many algorithms. The fact that GCC does not generate it, even
when requested explicitly with `-mcx16`, is silly and unacceptable.

--- Comment #37 from LIU Hao <lh_mouse at 126 dot com> ---
(In reply to Andrew Pinski from comment #31)
> Again the problem is stuff like:
> static const _Atomic __int128_t t = 2000;
> 
> __int128_t g(void)
> {
>   return t;
> }
> 
> DOES NOT WORK if you use CAS (or ldaxp/stlxp).
> 

Can this be made using MOVDQA instead? I haven't tested this though, just out
of curiosity.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug target/80878] -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0
       [not found] <bug-80878-4@http.gcc.gnu.org/bugzilla/>
                   ` (16 preceding siblings ...)
  2022-11-03  9:55 ` lh_mouse at 126 dot com
@ 2022-11-03 10:04 ` jakub at gcc dot gnu.org
  2022-11-03 10:32 ` fw at gcc dot gnu.org
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 23+ messages in thread
From: jakub at gcc dot gnu.org @ 2022-11-03 10:04 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #38 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Please see PR104688 .  We got a response from Intel, where they guaranteed
atomicity of certain 16-byte load instructions for Intel CPUs with AVX support.
AFAIK we didn't get similar guarantee from AMD.
The current state is that on the libatomic side when ifuncs are possible we use
those atomic loads etc. on Intel with AVX, and do what we used to do before for
other CPUs.
We haven't changed what the compiler emits, I think we'd need to introduce some
new option for it (guarantee code will run only on Intel CPUs) and imply that
from -march= listing Intel CPUs (with AVX).  If AMD would give a similar
guarantee, it would be much easier, we could just emit that whenever -mavx.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug target/80878] -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0
       [not found] <bug-80878-4@http.gcc.gnu.org/bugzilla/>
                   ` (17 preceding siblings ...)
  2022-11-03 10:04 ` jakub at gcc dot gnu.org
@ 2022-11-03 10:32 ` fw at gcc dot gnu.org
  2022-11-03 11:16 ` admin_public at liblfds dot org
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 23+ messages in thread
From: fw at gcc dot gnu.org @ 2022-11-03 10:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

--- Comment #39 from Florian Weimer <fw at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #38)
> Please see PR104688 .  We got a response from Intel, where they guaranteed
> atomicity of certain 16-byte load instructions for Intel CPUs with AVX
> support.
> AFAIK we didn't get similar guarantee from AMD.

I'm trying to work with AMD to get an official statement that covers older CPUs
as well. I have a preliminary statement, but I hope to get to the point that we
can say the rule is the same as for Intel (AVX support can act as a proxy).

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug target/80878] -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0
       [not found] <bug-80878-4@http.gcc.gnu.org/bugzilla/>
                   ` (18 preceding siblings ...)
  2022-11-03 10:32 ` fw at gcc dot gnu.org
@ 2022-11-03 11:16 ` admin_public at liblfds dot org
  2023-11-16  1:33 ` lh_mouse at 126 dot com
                   ` (2 subsequent siblings)
  22 siblings, 0 replies; 23+ messages in thread
From: admin_public at liblfds dot org @ 2022-11-03 11:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

--- Comment #40 from admin_public at liblfds dot org ---
On 03/11/2022 12:04, jakub at gcc dot gnu.org wrote:
> --- Comment #38 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> Please see PR104688 .  We got a response from Intel, where they guaranteed
> atomicity of certain 16-byte load instructions for Intel CPUs with AVX support.

Now, it's been a quite a long time since I've delved into lock-free, and I have
reason to doubt my earlier understanding anyway 
- so I may be *completely* wrong - but, as I recall, and, as I understood it,
the "usual" atomic operations (i.e. non-AVX) are 
essential in that they force the honouring of any previously issued read/write
barriers, as they force a read from, and write 
to, memory (well, I say memory - I mean to say, at least out to the cache
coherency protocol).

Will AVX do the same?

> The current state is that on the libatomic side when ifuncs are possible we use
> those atomic loads etc. on Intel with AVX, and do what we used to do before for
> other CPUs.

Yes.  As I recall, this is the problem for me - if such lock-free is not
available, mutexes or some such as used instead, and 
this is absolutely *not* okay, because their properties are completely
different; if I have a lock-free data structure and I'm 
using it in the kernel and I'm not allowed to sleep, I *can't* use a
sleep-based locking mechanism.

Lock-free has unique properties and when those properties are needed, if they
are not available, the only option is to fail to 
compile/build/run.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug target/80878] -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0
       [not found] <bug-80878-4@http.gcc.gnu.org/bugzilla/>
                   ` (19 preceding siblings ...)
  2022-11-03 11:16 ` admin_public at liblfds dot org
@ 2023-11-16  1:33 ` lh_mouse at 126 dot com
  2023-12-10  9:54 ` lh_mouse at 126 dot com
  2023-12-10 10:32 ` admin_public at liblfds dot org
  22 siblings, 0 replies; 23+ messages in thread
From: lh_mouse at 126 dot com @ 2023-11-16  1:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

--- Comment #41 from LIU Hao <lh_mouse at 126 dot com> ---
There should have been an option, long ago since GCC 7, which may be called

 
-mcx16-just-emit-the-god-damn-cmpxchg16b-for-me-if-it-does-not-work-its-not-your-fault


`__sync_*` are not an option as 1) they do not pass the old value and the zero
flag in a single operation, and 2) they do not accept 16-byte structs, and 3)
they are not full barriers.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug target/80878] -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0
       [not found] <bug-80878-4@http.gcc.gnu.org/bugzilla/>
                   ` (20 preceding siblings ...)
  2023-11-16  1:33 ` lh_mouse at 126 dot com
@ 2023-12-10  9:54 ` lh_mouse at 126 dot com
  2023-12-10 10:32 ` admin_public at liblfds dot org
  22 siblings, 0 replies; 23+ messages in thread
From: lh_mouse at 126 dot com @ 2023-12-10  9:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

--- Comment #42 from LIU Hao <lh_mouse at 126 dot com> ---
(In reply to Yongwei Wu from comment #27)
> Anyone can show a valid use case for a non-lock-free version of 128-bit
> atomic_compare_exchange?
> 
> I am trying to use it in a data structure intended to be lock-free. I am
> surprised to find that the C++ std::atomic::compare_exchange_weak does not
> result in lock-free code for a 128-bit struct intended for ABA-free CAS. As
> a result, the GCC-generated code is MUCH slower than the mutex-based version
> in my 8-thread contention test, defeating all its valid purposes. I am
> talking about a 10x difference. And the Clang-generated code is more than
> 200x faster in the same test.

[I think this is off topic though.]

I tested CMPXCHG16B with inline assembly on an i7-1165G7 (Dell XPS 13 9305) and
it turned out to be much slower than CMPXCHG, even slower than a pair of calls
to `pthread_mutex_lock()` and unlock. Similar results were observed on a
desktop i7 11700 and a server Xeon Cascadelake. The performance degeneration
might be caused by more μops, more locking work for the extra width of
operands, and more cache synchronization, which makes some sense if we assume
the CPU should be optimized mostly for 8-byte access.

The conclusion is probably that 16-byte compare-and-swap isn't recommended.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug target/80878] -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0
       [not found] <bug-80878-4@http.gcc.gnu.org/bugzilla/>
                   ` (21 preceding siblings ...)
  2023-12-10  9:54 ` lh_mouse at 126 dot com
@ 2023-12-10 10:32 ` admin_public at liblfds dot org
  22 siblings, 0 replies; 23+ messages in thread
From: admin_public at liblfds dot org @ 2023-12-10 10:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

--- Comment #43 from admin_public at liblfds dot org ---
> I tested CMPXCHG16B with inline assembly on an i7-1165G7 (Dell XPS 13 9305) and it turned out to be much slower than CMPXCHG, even slower than a pair of calls to `pthread_mutex_lock()` and unlock.

Mutexes are faster when single threaded and there's no contention to the
locking object.  Compare-exchange (8 or 16) is much faster (orders of magnitude
faster) as contention rises.

Sometimes you need CAS 16, rather than CAS 8, due to the implementation
requirements of lock free data structures.

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2023-12-10 10:32 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <bug-80878-4@http.gcc.gnu.org/bugzilla/>
2020-04-18 17:14 ` [Bug target/80878] -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0 avi@cloudius-systems.com
2020-04-18 17:23 ` fw at gcc dot gnu.org
2020-04-18 17:31 ` avi@cloudius-systems.com
2020-04-18 17:40 ` fw at gcc dot gnu.org
2020-04-18 18:13 ` avi@cloudius-systems.com
2020-04-18 18:18 ` avi@cloudius-systems.com
2020-04-18 18:41 ` fw at gcc dot gnu.org
2021-03-14  3:21 ` wuyongwei at gmail dot com
2021-03-14  3:46 ` wuyongwei at gmail dot com
2021-03-16 16:02 ` wuyongwei at gmail dot com
2021-05-06 17:50 ` s_gccbugzilla at nedprod dot com
2021-05-06 19:32 ` pinskia at gcc dot gnu.org
2021-05-06 20:53 ` liblfds_gccbz at winterflaw dot net
2021-05-07 12:27 ` s_gccbugzilla at nedprod dot com
2021-05-07 14:07 ` redi at gcc dot gnu.org
2021-05-07 14:44 ` s_gccbugzilla at nedprod dot com
2022-11-03  9:55 ` lh_mouse at 126 dot com
2022-11-03 10:04 ` jakub at gcc dot gnu.org
2022-11-03 10:32 ` fw at gcc dot gnu.org
2022-11-03 11:16 ` admin_public at liblfds dot org
2023-11-16  1:33 ` lh_mouse at 126 dot com
2023-12-10  9:54 ` lh_mouse at 126 dot com
2023-12-10 10:32 ` admin_public at liblfds dot org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).