public inbox for gcc-bugs@sourceware.org
* [Bug target/103066] New: __sync_val_compare_and_swap/__sync_bool_compare_and_swap aren't optimized
From: hjl.tools at gmail dot com @ 2021-11-03 17:35 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103066

            Bug ID: 103066
           Summary: __sync_val_compare_and_swap/__sync_bool_compare_and_swap
                    aren't optimized
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hjl.tools at gmail dot com
                CC: crazylht at gmail dot com, wwwhhhyyy333 at gmail dot com
            Blocks: 103065
  Target Milestone: ---
            Target: i386,x86-64

From the CPU's point of view, getting a cache line for writing is more
expensive than getting it for reading.  See Appendix A.2 Spinlock in:

https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf

A full compare-and-swap grabs the cache line exclusive and causes
excessive cache-line bouncing.
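
The pattern that discussion motivates is test-and-test-and-set: spin on a
plain load and attempt the exclusive operation only when it looks likely to
succeed.  A minimal sketch with these builtins (lock_word is an illustrative
name, not taken from the paper):

static int lock_word;

static void
spin_lock (void)
{
  for (;;)
    {
      /* Read-only spin: the cache line stays in the shared state.  */
      while (__atomic_load_n (&lock_word, __ATOMIC_RELAXED) != 0)
        ;
      /* Request the line exclusive only once the lock looks free.  */
      if (__sync_bool_compare_and_swap (&lock_word, 0, 1))
        return;
    }
}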

[hjl@gnu-cfl-1 tmp]$ cat x.c
extern int m;

int test(int oldv, int newv)
{
  return __sync_val_compare_and_swap (&m, oldv, newv);
}
[hjl@gnu-cfl-1 tmp]$ gcc -S -O2 x.c
[hjl@gnu-cfl-1 tmp]$ cat x.s
        .file   "x.c"
        .text
        .p2align 4
        .globl  test
        .type   test, @function
test:
.LFB0:
        .cfi_startproc
        movl    %edi, %eax
        lock cmpxchgl   %esi, m(%rip)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
GCC should first emit a normal load and compare, and return immediately
when the lock cmpxchgl would fail anyway.
        ret
        .cfi_endproc
.LFE0:
        .size   test, .-test
        .ident  "GCC: (GNU) 11.2.1 20211019 (Red Hat 11.2.1-6)"
        .section        .note.GNU-stack,"",@progbits
[hjl@gnu-cfl-1 tmp]$
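
A source-level sketch of the requested shape (the test_checked name is
illustrative; whether this is a valid expansion of the strong builtin, given
that the early-failure path also skips the builtin's full barrier, is
debated later in this thread):

extern int m;

int
test_checked (int oldv, int newv)
{
  int cur = __atomic_load_n (&m, __ATOMIC_RELAXED);
  if (cur != oldv)
    return cur;  /* fail without requesting the cache line exclusive */
  return __sync_val_compare_and_swap (&m, oldv, newv);
}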


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103065
[Bug 103065] [meta] atomic operations aren't optimized


* [Bug target/103066] __sync_val_compare_and_swap/__sync_bool_compare_and_swap aren't optimized
From: wwwhhhyyy333 at gmail dot com @ 2021-11-05  5:57 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103066

--- Comment #1 from Hongyu Wang <wwwhhhyyy333 at gmail dot com> ---
__sync_val_compare_and_swap is expanded to atomic_compare_exchange_strong
by default.  Should we restrict the check-and-return to
atomic_compare_exchange_weak, which is allowed to fail spuriously?


* [Bug target/103066] __sync_val_compare_and_swap/__sync_bool_compare_and_swap aren't optimized
From: hjl.tools at gmail dot com @ 2021-11-05 11:57 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103066

H.J. Lu <hjl.tools at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2021-11-05

--- Comment #2 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to Hongyu Wang from comment #1)
> __sync_val_compare_and_swap is expanded to
> atomic_compare_exchange_strong by default.  Should we restrict the
> check-and-return to atomic_compare_exchange_weak, which is allowed to
> fail spuriously?

On x86, "lock cmpxchgl" can fail.


* [Bug target/103066] __sync_val_compare_and_swap/__sync_bool_compare_and_swap aren't optimized
From: jakub at gcc dot gnu.org @ 2021-11-05 12:17 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103066

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
If by fail you mean that it doesn't update the memory when the memory isn't
equal to expected, sure.  But do you mean it can fail spuriously, i.e. not
update the memory even when the memory is equal to expected?
Neither __sync_{bool,val}_compare_and_swap nor __atomic_compare_exchange_n
with weak set to false can fail spuriously; __atomic_compare_exchange_n with
weak set to true can.
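
For reference, the usual shape of a weak-CAS loop, where a spurious failure
is harmless because the loop retries anyway (a sketch, not taken from the
report):

void
add_one (int *p)
{
  int old = __atomic_load_n (p, __ATOMIC_RELAXED);
  /* On failure, 'old' is refreshed with the current value of *p, so a
     spurious failure just costs one more iteration.  */
  while (!__atomic_compare_exchange_n (p, &old, old + 1, 1 /* weak */,
                                       __ATOMIC_RELAXED, __ATOMIC_RELAXED))
    ;
}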


* [Bug target/103066] __sync_val_compare_and_swap/__sync_bool_compare_and_swap aren't optimized
From: hjl.tools at gmail dot com @ 2021-11-05 12:33 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103066

--- Comment #4 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to Jakub Jelinek from comment #3)
> If by fail you mean that it doesn't update the memory when the memory
> isn't equal to expected, sure.  But do you mean it can fail spuriously,
> i.e. not update the memory even when the memory is equal to expected?
> Neither __sync_{bool,val}_compare_and_swap nor __atomic_compare_exchange_n
> with weak set to false can fail spuriously; __atomic_compare_exchange_n
> with weak set to true can.

If we generate

        movl m(%rip), %eax
        cmpl  %edi, %eax
        jne  .L1
        movl    %edi, %eax
        lock cmpxchgl   %esi, m(%rip)
.L1:
        ret

is it a valid implementation of atomic_compare_exchange_strong?


* [Bug target/103066] __sync_val_compare_and_swap/__sync_bool_compare_and_swap aren't optimized
From: jakub at gcc dot gnu.org @ 2021-11-05 12:44 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103066

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |redi at gcc dot gnu.org,
                   |                            |rodgertq at gcc dot gnu.org

--- Comment #5 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
I don't think so, but we'd need somebody more familiar with the C/C++ memory
model to say for sure.
In any case, even if it is valid, I'm not convinced it is a good idea: users
should be able to choose whether to do the separate atomic load first
themselves, rather than have the compiler force it on them unconditionally.


* [Bug target/103066] __sync_val_compare_and_swap/__sync_bool_compare_and_swap aren't optimized
From: jakub at gcc dot gnu.org @ 2021-11-05 12:55 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103066

--- Comment #6 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
E.g. the builtin is often used in a loop where the user does their own atomic
load first and decides what to do based on that.  Say, for
float f;

void
foo ()
{
  #pragma omp atomic
  f += 3.0f;
}
with -O2 -fopenmp we emit:
  D.2113 = &f;
  D.2115 = __atomic_load_4 (D.2113, 0);
  D.2114 = D.2115;

  <bb 3> :
  D.2112 = VIEW_CONVERT_EXPR<float>(D.2114);
  _1 = D.2112 + 3.0e+0;
  D.2116 = VIEW_CONVERT_EXPR<unsigned int>(_1);
  D.2117 = .ATOMIC_COMPARE_EXCHANGE (D.2113, D.2114, D.2116, 4, 0, 0);
  D.2118 = REALPART_EXPR <D.2117>;
  D.2119 = D.2114;
  D.2114 = D.2118;
  if (D.2118 != D.2119)
    goto <bb 3>; [0.00%]
  else
    goto <bb 4>; [100.00%]

  <bb 4> :
  return;
which is essentially
void
foo ()
{
  int x = __atomic_load_4 ((int *) &f, __ATOMIC_RELAXED), y;
  float g;
  do
    {
      __builtin_memcpy (&g, &x, 4);
      g += 3.0f;
      __builtin_memcpy (&y, &g, 4);
    }
  while (!__atomic_compare_exchange_n ((int *) &f, &x, y, false,
                                       __ATOMIC_RELAXED, __ATOMIC_RELAXED));
}
Can you explain how your proposed change would improve this?  It would just
slow it down and make it larger.


* [Bug target/103066] __sync_val_compare_and_swap/__sync_bool_compare_and_swap aren't optimized
From: hjl.tools at gmail dot com @ 2021-11-05 13:07 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103066

--- Comment #7 from H.J. Lu <hjl.tools at gmail dot com> ---
Instead of generating:

        movl    f(%rip), %eax
.L2:
        movd    %eax, %xmm0
        addss   .LC0(%rip), %xmm0
        movd    %xmm0, %edx
        lock cmpxchgl   %edx, f(%rip)
        jne     .L2
        ret

we want

        movl    f(%rip), %eax
.L2:
        movd    %eax, %xmm0
        addss   .LC0(%rip), %xmm0
        movd    %xmm0, %edx
        cmpl    f(%rip), %eax
        jne     .L2
        lock cmpxchgl   %edx, f(%rip)
        jne     .L2
        ret


* [Bug target/103066] __sync_val_compare_and_swap/__sync_bool_compare_and_swap aren't optimized
From: jakub at gcc dot gnu.org @ 2021-11-05 13:25 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103066

--- Comment #8 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to H.J. Lu from comment #7)
> Instead of generating:
> 
> 	movl	f(%rip), %eax
> .L2:
> 	movd	%eax, %xmm0
> 	addss	.LC0(%rip), %xmm0
> 	movd	%xmm0, %edx
> 	lock cmpxchgl	%edx, f(%rip)
> 	jne	.L2
> 	ret
> 
> we want
> 
> 	movl	f(%rip), %eax
> .L2:
> 	movd	%eax, %xmm0
> 	addss	.LC0(%rip), %xmm0
> 	movd	%xmm0, %edx
> 	cmpl	f(%rip), %eax
> 	jne	.L2
> 	lock cmpxchgl	%edx, f(%rip)
> 	jne	.L2
> 	ret

No, certainly not.  The mov before it, or the value remembered from the
previous lock cmpxchgl, already holds the right value unless the atomic
memory is extremely contended, so you don't want to add the non-atomic
comparison in between.  Not to mention that the way you've written it is
completely broken: when the memory is not equal to the expected value, the
caller should get back the current value.  With the above code, if f is
modified by another thread between the initial movl f(%rip), %eax and the
cmpl f(%rip), %eax and never afterwards, the loop never terminates.
I believe what the above paper is talking about should be addressed by the
users of these intrinsics, if they care and if it is beneficial (e.g.
depending on extra information about how contended the lock is; in OpenMP
one has the omp_sync_hint_* constants, usable in the hint clause, to say
whether a lock is contended, uncontended, speculative, non-speculative, or
unknown).
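
For example, with an explicit OpenMP lock the hint is supplied at
initialization time; a minimal sketch (omp_init_lock_with_hint is the
OpenMP 4.5/5.x API, and a runtime is free to ignore the hint):

#include <omp.h>

omp_lock_t l;

void
init_contended_lock (void)
{
  /* Tell the runtime this lock is expected to be contended.  */
  omp_init_lock_with_hint (&l, omp_sync_hint_contended);
}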


* [Bug target/103066] __sync_val_compare_and_swap/__sync_bool_compare_and_swap aren't optimized
From: redi at gcc dot gnu.org @ 2021-11-05 17:40 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103066

--- Comment #9 from Jonathan Wakely <redi at gcc dot gnu.org> ---
I don't know what the assembly code does, sorry, but it's not permitted to
store to the 'expected' value unless the atomic variable is not equal to it.
It looks like this would update the 'expected' value unconditionally?

I agree with Jakub that we don't want the additional instructions, because
they will be redundant in the cases where the user has already done a load.


* [Bug target/103066] __sync_val_compare_and_swap/__sync_bool_compare_and_swap aren't optimized
From: thiago at kde dot org @ 2021-11-06 16:31 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103066

--- Comment #10 from Thiago Macieira <thiago at kde dot org> ---
You're right that emitting the extra load penalises those who have done their
job and written proper code.

The problem we're seeing is that such code appears to be the minority.  Or,
to put it differently, the bad code shows up a lot in our benchmarks,
especially on very big multi-core and multi-socket systems.  "Fixing" the
compiler would deliver a broad improvement across the industry -- so long as
the code is recompiled with new compilers.  Fixing the actual code would make
it better even when built with old ones.

Does anyone have a suggestion on how to get the best "bang for buck" (the
biggest benefit for the smallest effort)?  This is a sincere question; I'm
not trying to be ironic or sarcastic.  How can we help the most, the
quickest, with the limited resources we can marshal?

Also, and I've been harping on this for a few years: how can we do better at
teaching people to use the tools at their disposal in the proper way?  A very
good counter-example is C++11's std::atomic_flag: you MUST NEVER use it (at
least until C++20, where it gained a test() member).
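
The same trap exists in C11, whose atomic_flag likewise lacks a read-only
test, so even a careful spin loop is forced to write on every iteration; a
sketch:

#include <stdatomic.h>

static atomic_flag flag = ATOMIC_FLAG_INIT;

static void
spin_lock_flag (void)
{
  /* test_and_set is the only way to observe the flag, so every spin
     iteration requests the cache line exclusive -- exactly the
     bouncing described at the top of this thread.  */
  while (atomic_flag_test_and_set_explicit (&flag, memory_order_acquire))
    ;
}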

