public inbox for gcc-bugs@sourceware.org
* [Bug target/103066] New: __sync_val_compare_and_swap/__sync_bool_compare_and_swap aren't optimized
@ 2021-11-03 17:35 hjl.tools at gmail dot com
From: hjl.tools at gmail dot com @ 2021-11-03 17:35 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103066
Bug ID: 103066
Summary: __sync_val_compare_and_swap/__sync_bool_compare_and_swap aren't optimized
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: hjl.tools at gmail dot com
CC: crazylht at gmail dot com, wwwhhhyyy333 at gmail dot com
Blocks: 103065
Target Milestone: ---
Target: i386,x86-64
From the CPU's point of view, getting a cache line for writing is more
expensive than getting it for reading. See Appendix A.2, Spinlock, in:
https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
A full compare-and-swap grabs the cache line for exclusive use and causes
excessive cache-line bouncing.
[hjl@gnu-cfl-1 tmp]$ cat x.c
extern int m;
int test(int oldv, int newv)
{
return __sync_val_compare_and_swap (&m, oldv, newv);
}
[hjl@gnu-cfl-1 tmp]$ gcc -S -O2 x.c
[hjl@gnu-cfl-1 tmp]$ cat x.s
.file "x.c"
.text
.p2align 4
.globl test
.type test, @function
test:
.LFB0:
.cfi_startproc
movl %edi, %eax
lock cmpxchgl %esi, m(%rip)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
GCC should first emit a normal load, check the value, and return immediately
if cmpxchgl would fail.
ret
.cfi_endproc
.LFE0:
.size test, .-test
.ident "GCC: (GNU) 11.2.1 20211019 (Red Hat 11.2.1-6)"
.section .note.GNU-stack,"",@progbits
[hjl@gnu-cfl-1 tmp]$
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103065
[Bug 103065] [meta] atomic operations aren't optimized
* [Bug target/103066] __sync_val_compare_and_swap/__sync_bool_compare_and_swap aren't optimized
From: wwwhhhyyy333 at gmail dot com @ 2021-11-05 5:57 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103066
--- Comment #1 from Hongyu Wang <wwwhhhyyy333 at gmail dot com> ---
__sync_val_compare_and_swap is expanded to atomic_compare_exchange_strong
by default; should we restrict the check-and-return to
atomic_compare_exchange_weak, which is allowed to fail spuriously?
* [Bug target/103066] __sync_val_compare_and_swap/__sync_bool_compare_and_swap aren't optimized
From: hjl.tools at gmail dot com @ 2021-11-05 11:57 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103066
H.J. Lu <hjl.tools at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |NEW
Ever confirmed|0 |1
Last reconfirmed| |2021-11-05
--- Comment #2 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to Hongyu Wang from comment #1)
> __sync_val_compare_and_swap will be expanded to
> atomic_compare_exchange_strong
> by default, should we restrict the check and return under
> atomic_compare_exchange_weak which is allowed to fail spuriously?
On x86, "lock cmpxchgl" can fail.
* [Bug target/103066] __sync_val_compare_and_swap/__sync_bool_compare_and_swap aren't optimized
From: jakub at gcc dot gnu.org @ 2021-11-05 12:17 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103066
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |jakub at gcc dot gnu.org
--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
If by fail you mean that it doesn't update the memory if the memory isn't equal
to expected, sure, but do you mean it can fail spuriously, not update the
memory even if the memory is equal to expected?
Neither __sync_{bool,val}_compare_and_swap nor __atomic_compare_exchange_n with
weak set to false can fail spuriously, __atomic_compare_exchange_n with weak
set to true can.
* [Bug target/103066] __sync_val_compare_and_swap/__sync_bool_compare_and_swap aren't optimized
From: hjl.tools at gmail dot com @ 2021-11-05 12:33 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103066
--- Comment #4 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to Jakub Jelinek from comment #3)
> If by fail you mean that it doesn't update the memory if the memory isn't
> equal to expected, sure, but do you mean it can fail spuriously, not update
> the memory even if the memory is equal to expected?
> Neither __sync_{bool,val}_compare_and_swap nor __atomic_compare_exchange_n
> with weak set to false can fail spuriously, __atomic_compare_exchange_n with
> weak set to true can.
If we generate
movl m(%rip), %eax
cmpl %edi, %eax
jne .L1
movl %edi, %eax
lock cmpxchgl %esi, m(%rip)
.L1:
ret
is it a valid implementation of atomic_compare_exchange_strong?
* [Bug target/103066] __sync_val_compare_and_swap/__sync_bool_compare_and_swap aren't optimized
From: jakub at gcc dot gnu.org @ 2021-11-05 12:44 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103066
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |redi at gcc dot gnu.org,
| |rodgertq at gcc dot gnu.org
--- Comment #5 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
I don't think so, but we'd need somebody more familiar with the C/C++ memory
model to say for sure.
In any case, even if it is, I'm not convinced it is a good idea, users should
be able to choose whether they do the separate atomic load first themselves
rather than compiler forcing them to do that unconditionally.
* [Bug target/103066] __sync_val_compare_and_swap/__sync_bool_compare_and_swap aren't optimized
From: jakub at gcc dot gnu.org @ 2021-11-05 12:55 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103066
--- Comment #6 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
E.g. the builtin is often used in a loop where the user does his own atomic
load first and decides what to do based on that.
Say for
float f;
void
foo ()
{
#pragma omp atomic
f += 3.0f;
}
with -O2 -fopenmp we emit:
D.2113 = &f;
D.2115 = __atomic_load_4 (D.2113, 0);
D.2114 = D.2115;
<bb 3> :
D.2112 = VIEW_CONVERT_EXPR<float>(D.2114);
_1 = D.2112 + 3.0e+0;
D.2116 = VIEW_CONVERT_EXPR<unsigned int>(_1);
D.2117 = .ATOMIC_COMPARE_EXCHANGE (D.2113, D.2114, D.2116, 4, 0, 0);
D.2118 = REALPART_EXPR <D.2117>;
D.2119 = D.2114;
D.2114 = D.2118;
if (D.2118 != D.2119)
goto <bb 3>; [0.00%]
else
goto <bb 4>; [100.00%]
<bb 4> :
return;
which is essentially
void
foo ()
{
int x = __atomic_load_4 ((int *) &f, __ATOMIC_RELAXED), y;
float g;
do
{
__builtin_memcpy (&g, &x, 4);
g += 3.0f;
__builtin_memcpy (&y, &g, 4);
}
while (!__atomic_compare_exchange_n ((int *) &f, &x, y, false,
__ATOMIC_RELAXED, __ATOMIC_RELAXED));
}
Can you explain how your proposed change would improve this? It would just
slow it down and make it larger.
* [Bug target/103066] __sync_val_compare_and_swap/__sync_bool_compare_and_swap aren't optimized
From: hjl.tools at gmail dot com @ 2021-11-05 13:07 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103066
--- Comment #7 from H.J. Lu <hjl.tools at gmail dot com> ---
Instead of generating:
movl f(%rip), %eax
.L2:
movd %eax, %xmm0
addss .LC0(%rip), %xmm0
movd %xmm0, %edx
lock cmpxchgl %edx, f(%rip)
jne .L2
ret
we want
movl f(%rip), %eax
.L2:
movd %eax, %xmm0
addss .LC0(%rip), %xmm0
movd %xmm0, %edx
cmpl f(%rip), %eax
jne .L2
lock cmpxchgl %edx, f(%rip)
jne .L2
ret
* [Bug target/103066] __sync_val_compare_and_swap/__sync_bool_compare_and_swap aren't optimized
From: jakub at gcc dot gnu.org @ 2021-11-05 13:25 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103066
--- Comment #8 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to H.J. Lu from comment #7)
> Instead of generating:
>
> movl f(%rip), %eax
> .L2:
> movd %eax, %xmm0
> addss .LC0(%rip), %xmm0
> movd %xmm0, %edx
> lock cmpxchgl %edx, f(%rip)
> jne .L2
> ret
>
> we want
>
> movl f(%rip), %eax
> .L2:
> movd %eax, %xmm0
> addss .LC0(%rip), %xmm0
> movd %xmm0, %edx
> cmpl f(%rip), %eax
> jne .L2
> lock cmpxchgl %edx, f(%rip)
> jne .L2
> ret
No, certainly not. The mov before or the remembered value from previous lock
cmpxchgl already has the right value unless the atomic memory is extremely
contended, so you don't want to add the non-atomic comparison in between. Not
to mention that the way you've written it totally breaks it, because if the
memory is not equal to the expected value, you should get the current value.
With the above code, if f is modified by another thread in between the initial
movl f(%rip), %eax and cmpl f(%rip), %eax and never after it, it will loop
forever.
I believe what the above paper is talking about should be addressed by users of
these intrinsics, if they care and if it is beneficial (e.g. depending on extra
information about how contended the lock is; in OpenMP one has omp_sync_hint_*
constants one can use in the hint clause to say whether the lock is contended,
uncontended, speculative, non-speculative, or unknown).
* [Bug target/103066] __sync_val_compare_and_swap/__sync_bool_compare_and_swap aren't optimized
From: redi at gcc dot gnu.org @ 2021-11-05 17:40 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103066
--- Comment #9 from Jonathan Wakely <redi at gcc dot gnu.org> ---
I don't know what the assembly code does, sorry, but it's not permitted to
store to the 'expected' value unless the atomic variable is not equal to it.
It looks like this would update the 'expected' value unconditionally?
I agree with Jakub that we don't want the additional instructions because they
will be redundant in some cases where the user has already done a load.
* [Bug target/103066] __sync_val_compare_and_swap/__sync_bool_compare_and_swap aren't optimized
From: thiago at kde dot org @ 2021-11-06 16:31 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103066
--- Comment #10 from Thiago Macieira <thiago at kde dot org> ---
You're right that emitting more code penalises those who have done their job
and written proper code.
The problem we're seeing is that such code appears to be the minority. Or, put
differently, the bad code is showing up a lot in our benchmarks, especially on
very big multi-core and multi-socket systems. "Fixing" the compiler would
spread the improvement across the industry, so long as the code is recompiled
with new compilers. Fixing the actual code would make it better even with old
ones.
Does anyone have a suggestion on how to get the best "bang for buck" (biggest
benefit for smallest effort)? This is a sincere question; I'm not trying to be
ironic or sarcastic. How can we help the most, the quickest, with the limited
amount of resources we can marshal?
Also, and I've been hammering on this point for a few years, how can we do
better at teaching people to use the tools at their disposal in the proper
way? A very good counter-example is C++11's std::atomic_flag: you MUST NEVER
use it (at least until C++20, where it gained a test() member).