public inbox for gcc-bugs@sourceware.org
* [Bug target/105617] New: Regression in code generation for _addcarry_u64()
@ 2022-05-16 12:08 already5chosen at yahoo dot com
  2022-05-16 12:13 ` [Bug target/105617] [12/13 regressi] Slp is maybe too aggressive in some/many cases pinskia at gcc dot gnu.org
                   ` (20 more replies)
  0 siblings, 21 replies; 22+ messages in thread
From: already5chosen at yahoo dot com @ 2022-05-16 12:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

            Bug ID: 105617
           Summary: Regression in code generation for _addcarry_u64()
           Product: gcc
           Version: 12.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: already5chosen at yahoo dot com
  Target Milestone: ---

It took many years until gcc caught up with MSVC and LLVM/clang in generating
code for chains of Intel's _addcarry_u64() intrinsic calls, but you finally
managed to do it in gcc 11.
Unfortunately, the luck didn't last long.


#include <stdint.h>
#include <x86intrin.h>

void add4i(uint64_t dst[4], const uint64_t srcA[4], const uint64_t srcB[4])
{
  unsigned char c;
  unsigned long long r0; c = _addcarry_u64(0, srcA[0], srcB[0], &r0);
  unsigned long long r1; c = _addcarry_u64(c, srcA[1], srcB[1], &r1);
  unsigned long long r2; c = _addcarry_u64(c, srcA[2], srcB[2], &r2);
  unsigned long long r3; c = _addcarry_u64(c, srcA[3], srcB[3], &r3);
  dst[0] = r0;
  dst[1] = r1;
  dst[2] = r2;
  dst[3] = r3;
}

gcc 11.1 -O2

add4i:
        movq    (%rdx), %rax
        addq    (%rsi), %rax
        movq    8(%rsi), %rcx
        movq    %rax, %r8
        adcq    8(%rdx), %rcx
        movq    16(%rsi), %rax
        adcq    16(%rdx), %rax
        movq    24(%rdx), %rdx
        adcq    24(%rsi), %rdx
        movq    %r8, (%rdi)
        movq    %rcx, 8(%rdi)
        movq    %rax, 16(%rdi)
        movq    %rdx, 24(%rdi)
        ret


gcc 12.1 -O2

add4i:
        movq    (%rdx), %rax
        movq    8(%rsi), %rcx
        addq    (%rsi), %rax
        movq    16(%rsi), %r8
        adcq    8(%rdx), %rcx
        adcq    16(%rdx), %r8
        movq    %rax, %xmm1
        movq    24(%rdx), %rdx
        adcq    24(%rsi), %rdx
        movq    %r8, %xmm0
        movq    %rcx, %xmm3
        movq    %rdx, %xmm2
        punpcklqdq      %xmm3, %xmm1
        punpcklqdq      %xmm2, %xmm0
        movups  %xmm1, (%rdi)
        movups  %xmm0, 16(%rdi)
        ret

What ... ?!

BTW, gcc 12.1 -O1 is still o.k.


* [Bug target/105617] [12/13 regressi] Slp is maybe too aggressive in some/many cases
  2022-05-16 12:08 [Bug target/105617] New: Regression in code generation for _addcarry_u64() already5chosen at yahoo dot com
@ 2022-05-16 12:13 ` pinskia at gcc dot gnu.org
  2022-05-16 12:14 ` pinskia at gcc dot gnu.org
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: pinskia at gcc dot gnu.org @ 2022-05-16 12:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Regression in code          |[12/13 regressi] Slp is
                   |generation for              |maybe too aggressive in
                   |_addcarry_u64()             |some/many cases
           Keywords|                            |missed-optimization

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
This is just the vectorizer still being too aggressive right before a return.
It is a cost model issue, and it might not really be an issue in final code,
just in microbenchmarks.
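
Until the cost model catches up, a possible way to sidestep it is to turn off
basic-block SLP vectorization for the affected code; the sketch below assumes
the optimize attribute accepts this flag (the file-wide command-line equivalent
is -fno-tree-slp-vectorize):

/* Sketch of a possible per-function workaround.  Assumption: the optimize
   attribute accepts this flag; it only disables basic-block (SLP)
   vectorization for this function and does not otherwise change codegen. */
__attribute__((optimize("no-tree-slp-vectorize")))
void add4i(uint64_t dst[4], const uint64_t srcA[4], const uint64_t srcB[4]);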


* [Bug target/105617] [12/13 regressi] Slp is maybe too aggressive in some/many cases
  2022-05-16 12:08 [Bug target/105617] New: Regression in code generation for _addcarry_u64() already5chosen at yahoo dot com
  2022-05-16 12:13 ` [Bug target/105617] [12/13 regressi] Slp is maybe too aggressive in some/many cases pinskia at gcc dot gnu.org
@ 2022-05-16 12:14 ` pinskia at gcc dot gnu.org
  2022-05-16 12:15 ` rguenth at gcc dot gnu.org
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: pinskia at gcc dot gnu.org @ 2022-05-16 12:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |12.2

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
My bet is LLVM has a similar issue sometimes too.


* [Bug target/105617] [12/13 regressi] Slp is maybe too aggressive in some/many cases
  2022-05-16 12:08 [Bug target/105617] New: Regression in code generation for _addcarry_u64() already5chosen at yahoo dot com
  2022-05-16 12:13 ` [Bug target/105617] [12/13 regressi] Slp is maybe too aggressive in some/many cases pinskia at gcc dot gnu.org
  2022-05-16 12:14 ` pinskia at gcc dot gnu.org
@ 2022-05-16 12:15 ` rguenth at gcc dot gnu.org
  2022-05-16 12:38 ` [Bug target/105617] [12/13 Regression] " already5chosen at yahoo dot com
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-05-16 12:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
We are vectorizing the store to dst[] now at -O2 since that appears profitable:

t.c:10:10: note: Cost model analysis:
r0.0_12 1 times scalar_store costs 12 in body
r1.1_13 1 times scalar_store costs 12 in body
r2.2_14 1 times scalar_store costs 12 in body
r3.3_15 1 times scalar_store costs 12 in body
r0.0_12 2 times unaligned_store (misalign -1) costs 24 in body
node 0x4b2b1e0 1 times vec_construct costs 4 in prologue
node 0x4b2b1e0 1 times vec_construct costs 4 in prologue
t.c:10:10: note: Cost model analysis for part in loop 0:
  Vector cost: 32
  Scalar cost: 48
t.c:10:10: note: Basic block will be vectorized using SLP


* [Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases
  2022-05-16 12:08 [Bug target/105617] New: Regression in code generation for _addcarry_u64() already5chosen at yahoo dot com
                   ` (2 preceding siblings ...)
  2022-05-16 12:15 ` rguenth at gcc dot gnu.org
@ 2022-05-16 12:38 ` already5chosen at yahoo dot com
  2022-05-16 13:02 ` already5chosen at yahoo dot com
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: already5chosen at yahoo dot com @ 2022-05-16 12:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #4 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Andrew Pinski from comment #1)
> This is just the vectorizer still being too aggressive right before a return.
> It is a cost model issue and it might not really be an issue in the final
> code just microbenchmarks.

BTW, Andrew, in one of the older related threads you made a very wise comment:
"Maybe even generic builtins/internal functions for this case even as clang has
__builtin_addc, etc."

IMHO, that is not only necessary, but long overdue.
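
For reference, a chain written with clang's generic builtin looks like the
sketch below; it assumes clang's documented __builtin_addcll signature and is
not something gcc accepted at the time of this report:

/* Sketch only: clang's generic add-with-carry builtin, shown for comparison.
   Signature assumed: unsigned long long __builtin_addcll(x, y, carry_in,
   unsigned long long *carry_out).  A generic gcc counterpart would let the
   middle end see the carry chain without forcing results through memory. */
typedef unsigned long long u64;

u64 add4i_clang(u64 dst[4], const u64 a[4], const u64 b[4])
{
  u64 c0, c1, c2, c3;
  dst[0] = __builtin_addcll(a[0], b[0], 0,  &c0);
  dst[1] = __builtin_addcll(a[1], b[1], c0, &c1);
  dst[2] = __builtin_addcll(a[2], b[2], c1, &c2);
  dst[3] = __builtin_addcll(a[3], b[3], c2, &c3);
  return c3;  /* final carry out */
}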


* [Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases
  2022-05-16 12:08 [Bug target/105617] New: Regression in code generation for _addcarry_u64() already5chosen at yahoo dot com
                   ` (3 preceding siblings ...)
  2022-05-16 12:38 ` [Bug target/105617] [12/13 Regression] " already5chosen at yahoo dot com
@ 2022-05-16 13:02 ` already5chosen at yahoo dot com
  2022-05-16 14:08 ` already5chosen at yahoo dot com
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: already5chosen at yahoo dot com @ 2022-05-16 13:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #5 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Richard Biener from comment #3)
> We are vectorizing the store to dst[] now at -O2 since that appears
> profitable:
> 
> t.c:10:10: note: Cost model analysis:
> r0.0_12 1 times scalar_store costs 12 in body
> r1.1_13 1 times scalar_store costs 12 in body
> r2.2_14 1 times scalar_store costs 12 in body
> r3.3_15 1 times scalar_store costs 12 in body
> r0.0_12 2 times unaligned_store (misalign -1) costs 24 in body
> node 0x4b2b1e0 1 times vec_construct costs 4 in prologue
> node 0x4b2b1e0 1 times vec_construct costs 4 in prologue
> t.c:10:10: note: Cost model analysis for part in loop 0:
>   Vector cost: 32
>   Scalar cost: 48
> t.c:10:10: note: Basic block will be vectorized using SLP

That makes no sense.
4 scalar-to-vector moves + 2 vector shuffles + 2 vector stores are a LOT more
costly than 4 scalar stores. Even more so considering that the scalar stores go
to adjacent addresses, so on good CPUs they are likely combined at the level of
the store queue and the cache subsystem sees fewer operations.

Either your cost model is broken or there are bugs in the summation.
I'd guess the compiler somehow thinks that moves have zero cost. But
scalar-to-vector moves are certainly not of zero cost.
Even scalar-to-scalar or vector-to-vector moves that are eliminated at the
renamer do not have zero cost, because quite often the renamer itself is the
narrowest performance bottleneck. But those moves... I don't think they are
eliminated by the renamer.
Also, it's likely that the compiler thinks a scalar store costs the same as a
vector store. That's also generally incorrect, especially when you don't know
your target CPU and don't know whether the stores are aligned or not, as in
this case.


* [Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases
  2022-05-16 12:08 [Bug target/105617] New: Regression in code generation for _addcarry_u64() already5chosen at yahoo dot com
                   ` (4 preceding siblings ...)
  2022-05-16 13:02 ` already5chosen at yahoo dot com
@ 2022-05-16 14:08 ` already5chosen at yahoo dot com
  2022-05-17  2:47 ` crazylht at gmail dot com
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: already5chosen at yahoo dot com @ 2022-05-16 14:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #6 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Michael_S from comment #5)
> 
> Even scalar-to-scalar or vector-to-vector moves that are hoisted at renamer
> does not have a zero cost, because quite often renamer itself constitutes
> the narrowest performance bottleneck. But those moves... I don't think that
> they are hoisted by renamer.

I took a look at several Intel and AMD Optimization Reference Manuals and
instruction tables. None of the existing x86 microarchitectures, old or new,
eliminates scalar-to-SIMD moves at the renamer. That is sort of obvious for
the newer microarchitectures (Bulldozer or later for AMD, Sandy Bridge or
later for Intel), because on these microarchitectures scalar and SIMD
registers live in separate physical register files.
As for older microarchitectures, they may have kept them in a common physical
storage area, but they simply were not smart enough to eliminate the moves.
So, these moves have non-zero latency. On some cores, including some of the
newest, the latency is even higher than one clock. And the throughput tends to
be rather low, most typically one scalar-to-SIMD move per clock. For
comparison, scalar-to-scalar and SIMD-to-SIMD moves can be executed (or
eliminated at the renamer) at rates of 2, 3 or even 4 per clock.


* [Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases
  2022-05-16 12:08 [Bug target/105617] New: Regression in code generation for _addcarry_u64() already5chosen at yahoo dot com
                   ` (5 preceding siblings ...)
  2022-05-16 14:08 ` already5chosen at yahoo dot com
@ 2022-05-17  2:47 ` crazylht at gmail dot com
  2022-05-17  3:29 ` crazylht at gmail dot com
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: crazylht at gmail dot com @ 2022-05-17  2:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #7 from Hongtao.liu <crazylht at gmail dot com> ---
Hmm, we have specific code to add the scalar->vector (vmovq) cost to vector
construct, but it seems not to work here. I guess it's because of &r0, and it
was treated as a load, not a scalar?

r0.1_21 1 times scalar_store costs 12 in body
r1.3_23 1 times scalar_store costs 12 in body
r2.5_25 1 times scalar_store costs 12 in body
r3.7_27 1 times scalar_store costs 12 in body
node 0x866f238 1 times vec_construct costs 4 in prologue
node 0x866f238 1 times vec_construct costs 4 in prologue
r0.1_21 2 times unaligned_store (misalign -1) costs 24 in body
/app/example.c:10:10: note: Cost model analysis for part in loop 0:
  Vector cost: 32
  Scalar cost: 48


* [Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases
  2022-05-16 12:08 [Bug target/105617] New: Regression in code generation for _addcarry_u64() already5chosen at yahoo dot com
                   ` (6 preceding siblings ...)
  2022-05-17  2:47 ` crazylht at gmail dot com
@ 2022-05-17  3:29 ` crazylht at gmail dot com
  2022-05-17  3:37 ` crazylht at gmail dot com
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: crazylht at gmail dot com @ 2022-05-17  3:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #8 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #7)
> Hmm, we have specific code to add scalar->vector(vmovq) cost to vector
> construct, but it seems not to work here, guess it's because &r0,and thought
> it was load not scalar? 
Yes, it is true for gimple_assign_load_p:


(gdb) p debug_gimple_stmt (def)
# VUSE <.MEM_46>
r0.0_20 = r0;
(gdb) l
23246                move it to a vector register, otherwise we have to go
23247                via a GPR or via vpinsr which involves similar cost.
23248                Likewise with a BIT_FIELD_REF extracting from a vector
23249                register we can hope to avoid using a GPR.  */
23250             if (!is_gimple_assign (def)
23251                 || (!gimple_assign_load_p (def)
23252                     && (gimple_assign_rhs_code (def) != BIT_FIELD_REF
23253                         || !VECTOR_TYPE_P (TREE_TYPE
23254                                   (TREE_OPERAND (gimple_assign_rhs1 (def), 0))))))
23255               stmt_cost += ix86_cost->sse_to_integer;


* [Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases
  2022-05-16 12:08 [Bug target/105617] New: Regression in code generation for _addcarry_u64() already5chosen at yahoo dot com
                   ` (7 preceding siblings ...)
  2022-05-17  3:29 ` crazylht at gmail dot com
@ 2022-05-17  3:37 ` crazylht at gmail dot com
  2022-05-17  6:48 ` rguenth at gcc dot gnu.org
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: crazylht at gmail dot com @ 2022-05-17  3:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #9 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #8)
> (In reply to Hongtao.liu from comment #7)
> > Hmm, we have specific code to add scalar->vector(vmovq) cost to vector
> > construct, but it seems not to work here, guess it's because &r0,and thought
> > it was load not scalar? 
> Yes, true for as gimple_assign_load_p
> 
> 
> (gdb) p debug_gimple_stmt (def)
> # VUSE <.MEM_46>
> r0.0_20 = r0;
It's a load from the stack, which is finally eliminated in RTL dse1, but the
vectorizer doesn't know that here.

And SLP will not vectorize it when there's an extra scalar->vector cost:

typedef unsigned long long uint64_t;
void add4i(uint64_t r0, uint64_t r1, uint64_t r2, uint64_t r3, uint64_t *dst)
{

  dst[0] = r0;
  dst[1] = r1;
  dst[2] = r2;
  dst[3] = r3;
}

add4i:
        mov     QWORD PTR [r8], rdi
        mov     QWORD PTR [r8+8], rsi
        mov     QWORD PTR [r8+16], rdx
        mov     QWORD PTR [r8+24], rcx
        ret


* [Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases
  2022-05-16 12:08 [Bug target/105617] New: Regression in code generation for _addcarry_u64() already5chosen at yahoo dot com
                   ` (8 preceding siblings ...)
  2022-05-17  3:37 ` crazylht at gmail dot com
@ 2022-05-17  6:48 ` rguenth at gcc dot gnu.org
  2022-05-17  9:17 ` already5chosen at yahoo dot com
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-05-17  6:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2022-05-17
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW

--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #9)
> (In reply to Hongtao.liu from comment #8)
> > (In reply to Hongtao.liu from comment #7)
> > > Hmm, we have specific code to add scalar->vector(vmovq) cost to vector
> > > construct, but it seems not to work here, guess it's because &r0,and thought
> > > it was load not scalar? 
> > Yes, true for as gimple_assign_load_p
> > 
> > 
> > (gdb) p debug_gimple_stmt (def)
> > # VUSE <.MEM_46>
> > r0.0_20 = r0;
> It's a load from stack, and finally eliminated in rtl dse1, but here the
> vectorizer doesn't know.

Yes, it's difficult for the SLP vectorizer to guess whether rN will come
from memory or not.  Some friendlier middle-end representation for
add-with-carry might be nice - the x86 backend could for example fold
__builtin_ia32_addcarryx_u64 to use a _Complex unsigned long long for the
return, ferrying the carry in __imag.  Alternatively we could devise
some special GIMPLE_ASM kind ferrying RTL and not assembly so the
backend could fold it directly to RTL on GIMPLE with asm constraints
doing the plumbing ... (we'd need some match-scratch and RTL expansion
would still need to allocate the actual pseudos).

  <bb 2> [local count: 1073741824]:
  _1 = *srcB_17(D);
  _2 = *srcA_18(D);
  _30 = __builtin_ia32_addcarryx_u64 (0, _2, _1, &r0);
  _3 = MEM[(const uint64_t *)srcB_17(D) + 8B];
  _4 = MEM[(const uint64_t *)srcA_18(D) + 8B];
  _5 = (int) _30;
  _29 = __builtin_ia32_addcarryx_u64 (_5, _4, _3, &r1);
  _6 = MEM[(const uint64_t *)srcB_17(D) + 16B];
  _7 = MEM[(const uint64_t *)srcA_18(D) + 16B];
  _8 = (int) _29;
  _28 = __builtin_ia32_addcarryx_u64 (_8, _7, _6, &r2);
  _9 = MEM[(const uint64_t *)srcB_17(D) + 24B];
  _10 = MEM[(const uint64_t *)srcA_18(D) + 24B];
  _11 = (int) _28;
  __builtin_ia32_addcarryx_u64 (_11, _10, _9, &r3);
  r0.0_12 = r0;
  r1.1_13 = r1;
  _36 = {r0.0_12, r1.1_13};
  r2.2_14 = r2;
  r3.3_15 = r3;
  _37 = {r2.2_14, r3.3_15};
  vectp.9_35 = dst_19(D);
  MEM <vector(2) long unsigned int> [(uint64_t *)vectp.9_35] = _36;
  vectp.9_39 = vectp.9_35 + 16;
  MEM <vector(2) long unsigned int> [(uint64_t *)vectp.9_39] = _37;

so for the situation at hand I don't see any reasonable way out that
doesn't have the chance of regressing things in other places (like
treating loads from non-indexed auto variables specially or so).  The
only real solution is to find a GIMPLE representation for
__builtin_ia32_addcarryx_u64 that doesn't force the alternate output
to memory.
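
To illustrate the shape of such a representation, here is a source-level
mock-up of a two-output carry primitive that keeps both results in a single
value (carry in __imag, using GCC's complex-integer extension). It only
sketches the interface; it is not the proposed backend folding itself:

/* Sketch only: both outputs live in one _Complex value, so no address of a
   local is taken.  Relies on GCC extensions (complex integer types,
   __real__/__imag__ as lvalues, unsigned __int128). */
typedef _Complex unsigned long long addc_t;

static inline addc_t
addc_u64 (unsigned long long cin, unsigned long long a, unsigned long long b)
{
  unsigned __int128 wide = (unsigned __int128) a + b + cin;
  addc_t r;
  __real__ r = (unsigned long long) wide;          /* sum */
  __imag__ r = (unsigned long long) (wide >> 64);  /* carry out, 0 or 1 */
  return r;
}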


* [Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases
  2022-05-16 12:08 [Bug target/105617] New: Regression in code generation for _addcarry_u64() already5chosen at yahoo dot com
                   ` (9 preceding siblings ...)
  2022-05-17  6:48 ` rguenth at gcc dot gnu.org
@ 2022-05-17  9:17 ` already5chosen at yahoo dot com
  2022-05-17  9:25 ` already5chosen at yahoo dot com
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: already5chosen at yahoo dot com @ 2022-05-17  9:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #11 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Richard Biener from comment #10)
> (In reply to Hongtao.liu from comment #9)
> > (In reply to Hongtao.liu from comment #8)
> > > (In reply to Hongtao.liu from comment #7)
> > > > Hmm, we have specific code to add scalar->vector(vmovq) cost to vector
> > > > construct, but it seems not to work here, guess it's because &r0,and thought
> > > > it was load not scalar? 
> > > Yes, true for as gimple_assign_load_p
> > > 
> > > 
> > > (gdb) p debug_gimple_stmt (def)
> > > # VUSE <.MEM_46>
> > > r0.0_20 = r0;
> > It's a load from stack, and finally eliminated in rtl dse1, but here the
> > vectorizer doesn't know.
> 
> Yes, it's difficult for the SLP vectorizer to guess whether rN will come
> from memory or not.  Some friendlier middle-end representation for
> add-with-carry might be nice - the x86 backend could for example fold
> __builtin_ia32_addcarryx_u64 to use a _Complex unsigned long long for the
> return, ferrying the carry in __imag.  Alternatively we could devise
> some special GIMPLE_ASM kind ferrying RTL and not assembly so the
> backend could fold it directly to RTL on GIMPLE with asm constraints
> doing the plumbing ... (we'd need some match-scratch and RTL expansion
> would still need to allocate the actual pseudos).
> 
>   <bb 2> [local count: 1073741824]:
>   _1 = *srcB_17(D);
>   _2 = *srcA_18(D);
>   _30 = __builtin_ia32_addcarryx_u64 (0, _2, _1, &r0);
>   _3 = MEM[(const uint64_t *)srcB_17(D) + 8B];
>   _4 = MEM[(const uint64_t *)srcA_18(D) + 8B];
>   _5 = (int) _30;
>   _29 = __builtin_ia32_addcarryx_u64 (_5, _4, _3, &r1);
>   _6 = MEM[(const uint64_t *)srcB_17(D) + 16B];
>   _7 = MEM[(const uint64_t *)srcA_18(D) + 16B];
>   _8 = (int) _29;
>   _28 = __builtin_ia32_addcarryx_u64 (_8, _7, _6, &r2);
>   _9 = MEM[(const uint64_t *)srcB_17(D) + 24B];
>   _10 = MEM[(const uint64_t *)srcA_18(D) + 24B];
>   _11 = (int) _28;
>   __builtin_ia32_addcarryx_u64 (_11, _10, _9, &r3);
>   r0.0_12 = r0;
>   r1.1_13 = r1;
>   _36 = {r0.0_12, r1.1_13};
>   r2.2_14 = r2;
>   r3.3_15 = r3;
>   _37 = {r2.2_14, r3.3_15};
>   vectp.9_35 = dst_19(D);
>   MEM <vector(2) long unsigned int> [(uint64_t *)vectp.9_35] = _36;
>   vectp.9_39 = vectp.9_35 + 16;
>   MEM <vector(2) long unsigned int> [(uint64_t *)vectp.9_39] = _37;
> 
> so for the situation at hand I don't see any reasonable way out that
> doesn't have the chance of regressing things in other places (like
> treat loads from non-indexed auto variables specially or so).  The
> only real solution is to find a GIMPLE representation for
> __builtin_ia32_addcarryx_u64 that doesn't force the alternate output
> to memory.


How about a simple heuristic: "Never auto-vectorize integer code unless in an
inner loop"?
It would be optimal >80% of the time, and even in the cases where it's
sub-optimal the impact is likely in single-digit percents. On the other hand,
the impact of a mistake in the opposite direction (i.e. over-vectorization) is
often quite large.


* [Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases
  2022-05-16 12:08 [Bug target/105617] New: Regression in code generation for _addcarry_u64() already5chosen at yahoo dot com
                   ` (10 preceding siblings ...)
  2022-05-17  9:17 ` already5chosen at yahoo dot com
@ 2022-05-17  9:25 ` already5chosen at yahoo dot com
  2022-05-19  8:05 ` crazylht at gmail dot com
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: already5chosen at yahoo dot com @ 2022-05-17  9:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #12 from Michael_S <already5chosen at yahoo dot com> ---
On a related note...
One of the historically good features of gcc relative to other popular
compilers was the absence of auto-vectorization at -O2.
When did you decide to change that, and why?


* [Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases
  2022-05-16 12:08 [Bug target/105617] New: Regression in code generation for _addcarry_u64() already5chosen at yahoo dot com
                   ` (11 preceding siblings ...)
  2022-05-17  9:25 ` already5chosen at yahoo dot com
@ 2022-05-19  8:05 ` crazylht at gmail dot com
  2022-07-26 11:22 ` rguenth at gcc dot gnu.org
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: crazylht at gmail dot com @ 2022-05-19  8:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #13 from Hongtao.liu <crazylht at gmail dot com> ---

> so for the situation at hand I don't see any reasonable way out that
> doesn't have the chance of regressing things in other places (like
> treat loads from non-indexed auto variables specially or so).  The
> only real solution is to find a GIMPLE representation for
> __builtin_ia32_addcarryx_u64 that doesn't force the alternate output
> to memory.
Do we support a parameter with an "out" attribute (a parameter that will be
set by the builtin)? __builtin_ia32_addcarryx_u64 produces 2 outputs: one can
be the return value, the other must use a by-reference parameter, which forces
memory usage here.


* [Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases
  2022-05-16 12:08 [Bug target/105617] New: Regression in code generation for _addcarry_u64() already5chosen at yahoo dot com
                   ` (12 preceding siblings ...)
  2022-05-19  8:05 ` crazylht at gmail dot com
@ 2022-07-26 11:22 ` rguenth at gcc dot gnu.org
  2022-07-26 11:22 ` rguenth at gcc dot gnu.org
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-07-26 11:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Michael_S from comment #12)
> On related note...
> One of the historical good features of gcc relatively to other popular
> compilers was absence of auto-vectorization at -O2.
> When did you decide to change it and why?

Well, people requested it ... so GCC 12 now has it (with an extra
eye on code size, though that doesn't work for scalar code vectorization,
which OTOH is thought to almost always benefit code size ...).


* [Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases
  2022-05-16 12:08 [Bug target/105617] New: Regression in code generation for _addcarry_u64() already5chosen at yahoo dot com
                   ` (13 preceding siblings ...)
  2022-07-26 11:22 ` rguenth at gcc dot gnu.org
@ 2022-07-26 11:22 ` rguenth at gcc dot gnu.org
  2022-07-29 13:21 ` already5chosen at yahoo dot com
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-07-26 11:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2


* [Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases
  2022-05-16 12:08 [Bug target/105617] New: Regression in code generation for _addcarry_u64() already5chosen at yahoo dot com
                   ` (14 preceding siblings ...)
  2022-07-26 11:22 ` rguenth at gcc dot gnu.org
@ 2022-07-29 13:21 ` already5chosen at yahoo dot com
  2023-05-08 12:24 ` [Bug target/105617] [12/13/14 " rguenth at gcc dot gnu.org
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: already5chosen at yahoo dot com @ 2022-07-29 13:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #15 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Richard Biener from comment #14)
> (In reply to Michael_S from comment #12)
> > On related note...
> > One of the historical good features of gcc relatively to other popular
> > compilers was absence of auto-vectorization at -O2.
> > When did you decide to change it and why?
> 
> Well, people requested it ... so GCC 12 is now having it (with an extra
> eye on code size but this all doesn't work for scalar code vectorization
> which OTOH is thought to almost always benefit code size ...).

"People" is a vague term.
You didn't organize voting among all gcc users, I suppose.
So, count me as "people" who asks to remove it.
Somehow , I suspect that the other one who will vote like me has a middle name
Benedict.


* [Bug target/105617] [12/13/14 Regression] Slp is maybe too aggressive in some/many cases
  2022-05-16 12:08 [Bug target/105617] New: Regression in code generation for _addcarry_u64() already5chosen at yahoo dot com
                   ` (15 preceding siblings ...)
  2022-07-29 13:21 ` already5chosen at yahoo dot com
@ 2023-05-08 12:24 ` rguenth at gcc dot gnu.org
  2023-06-01  7:58 ` slash.tmp at free dot fr
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-05-08 12:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|12.3                        |12.4

--- Comment #17 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 12.3 is being released, retargeting bugs to GCC 12.4.


* [Bug target/105617] [12/13/14 Regression] Slp is maybe too aggressive in some/many cases
  2022-05-16 12:08 [Bug target/105617] New: Regression in code generation for _addcarry_u64() already5chosen at yahoo dot com
                   ` (16 preceding siblings ...)
  2023-05-08 12:24 ` [Bug target/105617] [12/13/14 " rguenth at gcc dot gnu.org
@ 2023-06-01  7:58 ` slash.tmp at free dot fr
  2023-06-07 23:16 ` already5chosen at yahoo dot com
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: slash.tmp at free dot fr @ 2023-06-01  7:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #18 from Mason <slash.tmp at free dot fr> ---
Hello Michael_S,

As far as I can see, massaging the source helps GCC generate optimal code
(in terms of instruction count, not convinced about scheduling).

#include <x86intrin.h>
typedef unsigned long long u64;
void add4i(u64 dst[4], const u64 A[4], const u64 B[4])
{
  unsigned char c = 0;
  c = _addcarry_u64(c, A[0], B[0], dst+0);
  c = _addcarry_u64(c, A[1], B[1], dst+1);
  c = _addcarry_u64(c, A[2], B[2], dst+2);
  c = _addcarry_u64(c, A[3], B[3], dst+3);
}


On godbolt, gcc-{11.4, 12.3, 13.1, trunk} -O3 -march=znver1 all generate
the expected:

add4i:
        movq    (%rdx), %rax
        addq    (%rsi), %rax
        movq    %rax, (%rdi)
        movq    8(%rsi), %rax
        adcq    8(%rdx), %rax
        movq    %rax, 8(%rdi)
        movq    16(%rsi), %rax
        adcq    16(%rdx), %rax
        movq    %rax, 16(%rdi)
        movq    24(%rdx), %rax
        adcq    24(%rsi), %rax
        movq    %rax, 24(%rdi)
        ret

I'll run a few benchmarks to test optimal scheduling.


* [Bug target/105617] [12/13/14 Regression] Slp is maybe too aggressive in some/many cases
  2022-05-16 12:08 [Bug target/105617] New: Regression in code generation for _addcarry_u64() already5chosen at yahoo dot com
                   ` (17 preceding siblings ...)
  2023-06-01  7:58 ` slash.tmp at free dot fr
@ 2023-06-07 23:16 ` already5chosen at yahoo dot com
  2023-06-13 14:54 ` slash.tmp at free dot fr
  2023-06-16 14:56 ` already5chosen at yahoo dot com
  20 siblings, 0 replies; 22+ messages in thread
From: already5chosen at yahoo dot com @ 2023-06-07 23:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #19 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Mason from comment #18)
> Hello Michael_S,
> 
> As far as I can see, massaging the source helps GCC generate optimal code
> (in terms of instruction count, not convinced about scheduling).
> 
> #include <x86intrin.h>
> typedef unsigned long long u64;
> void add4i(u64 dst[4], const u64 A[4], const u64 B[4])
> {
>   unsigned char c = 0;
>   c = _addcarry_u64(c, A[0], B[0], dst+0);
>   c = _addcarry_u64(c, A[1], B[1], dst+1);
>   c = _addcarry_u64(c, A[2], B[2], dst+2);
>   c = _addcarry_u64(c, A[3], B[3], dst+3);
> }
> 
> 
> On godbolt, gcc-{11.4, 12.3, 13.1, trunk} -O3 -march=znver1 all generate
> the expected:
> 
> add4i:
>         movq    (%rdx), %rax
>         addq    (%rsi), %rax
>         movq    %rax, (%rdi)
>         movq    8(%rsi), %rax
>         adcq    8(%rdx), %rax
>         movq    %rax, 8(%rdi)
>         movq    16(%rsi), %rax
>         adcq    16(%rdx), %rax
>         movq    %rax, 16(%rdi)
>         movq    24(%rdx), %rax
>         adcq    24(%rsi), %rax
>         movq    %rax, 24(%rdi)
>         ret
> 
> I'll run a few benchmarks to test optimal scheduling.

That's not merely "massaging the source"; that's changing the semantics.
Think about what happens when dst points into the middle of A or of B.
The change of semantics is what effectively prevented the vectorizer from
doing harm.

And yes, for the common non-aliasing case the scheduling is problematic, too.
It would probably not cause a slowdown on the latest and greatest cores, but
it could be slow on less great cores, including your default Zen1.
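
If callers can in fact guarantee that dst never overlaps A or B, that
assumption can at least be made explicit in the source; a sketch using
standard C restrict (the name add4i_noalias is only for illustration):

#include <x86intrin.h>
typedef unsigned long long u64;

/* Sketch: same store-as-you-go shape as comment #18, but the non-aliasing
   assumption is now part of the contract via restrict. */
void add4i_noalias(u64 *restrict dst, const u64 *restrict A, const u64 *restrict B)
{
  unsigned char c = 0;
  c = _addcarry_u64(c, A[0], B[0], dst + 0);
  c = _addcarry_u64(c, A[1], B[1], dst + 1);
  c = _addcarry_u64(c, A[2], B[2], dst + 2);
  c = _addcarry_u64(c, A[3], B[3], dst + 3);
  (void) c;
}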


* [Bug target/105617] [12/13/14 Regression] Slp is maybe too aggressive in some/many cases
  2022-05-16 12:08 [Bug target/105617] New: Regression in code generation for _addcarry_u64() already5chosen at yahoo dot com
                   ` (18 preceding siblings ...)
  2023-06-07 23:16 ` already5chosen at yahoo dot com
@ 2023-06-13 14:54 ` slash.tmp at free dot fr
  2023-06-16 14:56 ` already5chosen at yahoo dot com
  20 siblings, 0 replies; 22+ messages in thread
From: slash.tmp at free dot fr @ 2023-06-13 14:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #20 from Mason <slash.tmp at free dot fr> ---
Doh! You're right.
I come from a background where overlapping/aliasing inputs are heresy,
thus got blindsided :(

This would be the optimal code, right?

add4i:
# rdi = dst, rsi = a, rdx = b
        movq     0(%rdx), %r8
        movq     8(%rdx), %rax
        movq    16(%rdx), %rcx
        movq    24(%rdx), %rdx
        addq     0(%rsi), %r8
        adcq     8(%rsi), %rax
        adcq    16(%rsi), %rcx
        adcq    24(%rsi), %rdx
        movq    %r8,   0(%rdi)
        movq    %rax,  8(%rdi)
        movq    %rcx, 16(%rdi)
        movq    %rdx, 24(%rdi)
        ret


FWIW, trunk generates even nastier code for znver2:

add4i:
        movq    (%rdx), %rax
        addq    (%rsi), %rax
        movq    8(%rsi), %rcx
        adcq    8(%rdx), %rcx
        movq    16(%rsi), %r8
        adcq    16(%rdx), %r8
        movq    24(%rdx), %rdx
        adcq    24(%rsi), %rdx
        vmovq   %rax, %xmm2
        vpinsrq $1, %rcx, %xmm2, %xmm0
        vmovq   %r8, %xmm1
        vpinsrq $1, %rdx, %xmm1, %xmm1
        vinserti128     $0x1, %xmm1, %ymm0, %ymm0
        vmovdqu %ymm0, (%rdi)
        vzeroupper
        ret


* [Bug target/105617] [12/13/14 Regression] Slp is maybe too aggressive in some/many cases
  2022-05-16 12:08 [Bug target/105617] New: Regression in code generation for _addcarry_u64() already5chosen at yahoo dot com
                   ` (19 preceding siblings ...)
  2023-06-13 14:54 ` slash.tmp at free dot fr
@ 2023-06-16 14:56 ` already5chosen at yahoo dot com
  20 siblings, 0 replies; 22+ messages in thread
From: already5chosen at yahoo dot com @ 2023-06-16 14:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #21 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Mason from comment #20)
> Doh! You're right.
> I come from a background where overlapping/aliasing inputs are heresy,
> thus got blindsided :(
> 
> This would be the optimal code, right?
> 
> add4i:
> # rdi = dst, rsi = a, rdx = b
> 	movq	 0(%rdx), %r8
> 	movq	 8(%rdx), %rax
> 	movq	16(%rdx), %rcx
> 	movq	24(%rdx), %rdx
> 	addq	 0(%rsi), %r8
> 	adcq	 8(%rsi), %rax
> 	adcq	16(%rsi), %rcx
> 	adcq	24(%rsi), %rdx
> 	movq	%r8,   0(%rdi)
> 	movq	%rax,  8(%rdi)
> 	movq	%rcx, 16(%rdi)
> 	movq	%rdx, 24(%rdi)
> 	ret
> 

If one does not care deeply about latency (which is likely for a function that
stores its result into memory), then that looks good enough.
But if one does care deeply, then I'd expect interleaved loads, as in the
first 8 lines of the code generated by trunk, to produce slightly lower
latency on the majority of modern CPUs.

