public inbox for gcc-bugs@sourceware.org
* [Bug target/105617] New: Regression in code generation for _addcarry_u64()
From: already5chosen at yahoo dot com @ 2022-05-16 12:08 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
Bug ID: 105617
Summary: Regression in code generation for _addcarry_u64()
Product: gcc
Version: 12.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: already5chosen at yahoo dot com
Target Milestone: ---
It took many years until gcc caught up with MSVC and LLVM/clang in generating
code for chains of Intel's _addcarry_u64() intrinsic calls, but you finally
managed to do it in gcc 11.
Unfortunately, the luck didn't last long.
#include <stdint.h>
#include <x86intrin.h>

void add4i(uint64_t dst[4], const uint64_t srcA[4], const uint64_t srcB[4])
{
  unsigned char c;
  unsigned long long r0; c = _addcarry_u64(0, srcA[0], srcB[0], &r0);
  unsigned long long r1; c = _addcarry_u64(c, srcA[1], srcB[1], &r1);
  unsigned long long r2; c = _addcarry_u64(c, srcA[2], srcB[2], &r2);
  unsigned long long r3; c = _addcarry_u64(c, srcA[3], srcB[3], &r3);
  dst[0] = r0;
  dst[1] = r1;
  dst[2] = r2;
  dst[3] = r3;
}
gcc 11.1 -O2:

add4i:
        movq    (%rdx), %rax
        addq    (%rsi), %rax
        movq    8(%rsi), %rcx
        movq    %rax, %r8
        adcq    8(%rdx), %rcx
        movq    16(%rsi), %rax
        adcq    16(%rdx), %rax
        movq    24(%rdx), %rdx
        adcq    24(%rsi), %rdx
        movq    %r8, (%rdi)
        movq    %rcx, 8(%rdi)
        movq    %rax, 16(%rdi)
        movq    %rdx, 24(%rdi)
        ret
gcc 12.1 -O2:

add4i:
        movq    (%rdx), %rax
        movq    8(%rsi), %rcx
        addq    (%rsi), %rax
        movq    16(%rsi), %r8
        adcq    8(%rdx), %rcx
        adcq    16(%rdx), %r8
        movq    %rax, %xmm1
        movq    24(%rdx), %rdx
        adcq    24(%rsi), %rdx
        movq    %r8, %xmm0
        movq    %rcx, %xmm3
        movq    %rdx, %xmm2
        punpcklqdq      %xmm3, %xmm1
        punpcklqdq      %xmm2, %xmm0
        movups  %xmm1, (%rdi)
        movups  %xmm0, 16(%rdi)
        ret
What ... ?!
BTW, gcc 12.1 -O1 is still o.k.
* [Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases
From: pinskia at gcc dot gnu.org @ 2022-05-16 12:13 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Summary|Regression in code |[12/13 Regression] Slp is
|generation for |maybe too aggressive in
|_addcarry_u64() |some/many cases
Keywords| |missed-optimization
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
This is just the vectorizer still being too aggressive right before a return.
It is a cost-model issue, and it might not really be an issue in the final
code, just in microbenchmarks.
* [Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases
From: pinskia at gcc dot gnu.org @ 2022-05-16 12:14 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|--- |12.2
--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
My bet is that LLVM sometimes has a similar issue too.
* [Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases
From: rguenth at gcc dot gnu.org @ 2022-05-16 12:15 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
We are vectorizing the stores to dst[] now at -O2 since that appears profitable:
t.c:10:10: note: Cost model analysis:
r0.0_12 1 times scalar_store costs 12 in body
r1.1_13 1 times scalar_store costs 12 in body
r2.2_14 1 times scalar_store costs 12 in body
r3.3_15 1 times scalar_store costs 12 in body
r0.0_12 2 times unaligned_store (misalign -1) costs 24 in body
node 0x4b2b1e0 1 times vec_construct costs 4 in prologue
node 0x4b2b1e0 1 times vec_construct costs 4 in prologue
t.c:10:10: note: Cost model analysis for part in loop 0:
Vector cost: 32
Scalar cost: 48
t.c:10:10: note: Basic block will be vectorized using SLP
* [Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases
From: already5chosen at yahoo dot com @ 2022-05-16 12:38 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #4 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Andrew Pinski from comment #1)
> This is just the vectorizer still being too aggressive right before a return.
> It is a cost-model issue, and it might not really be an issue in the final
> code, just in microbenchmarks.
BTW, Andrew, in one of the older related threads you made a very wise comment:
"Maybe even generic builtins/internal functions for this case even as clang has
__builtin_addc, etc."
IMHO, that is not only necessary, but long overdue.
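For reference, the clang-style interface looks like this (a sketch using
clang's __builtin_addcll, per clang's documentation; GCC had no generic
equivalent at the time of this report). The sum is returned by value and the
carries are plain data, so nothing forces the intermediate results through
memory:

typedef unsigned long long u64;

/* The same 256-bit add as in the original report, written against
   clang's __builtin_addcll.  The carry locals c0..c3 are trivially
   promoted to registers. */
void add4i(u64 dst[4], const u64 A[4], const u64 B[4])
{
  u64 c0, c1, c2, c3;
  dst[0] = __builtin_addcll(A[0], B[0], 0,  &c0);
  dst[1] = __builtin_addcll(A[1], B[1], c0, &c1);
  dst[2] = __builtin_addcll(A[2], B[2], c1, &c2);
  dst[3] = __builtin_addcll(A[3], B[3], c2, &c3);
}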
* [Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases
From: already5chosen at yahoo dot com @ 2022-05-16 13:02 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #5 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Richard Biener from comment #3)
> We are vectorizing the stores to dst[] now at -O2 since that appears
> profitable:
>
> t.c:10:10: note: Cost model analysis:
> r0.0_12 1 times scalar_store costs 12 in body
> r1.1_13 1 times scalar_store costs 12 in body
> r2.2_14 1 times scalar_store costs 12 in body
> r3.3_15 1 times scalar_store costs 12 in body
> r0.0_12 2 times unaligned_store (misalign -1) costs 24 in body
> node 0x4b2b1e0 1 times vec_construct costs 4 in prologue
> node 0x4b2b1e0 1 times vec_construct costs 4 in prologue
> t.c:10:10: note: Cost model analysis for part in loop 0:
> Vector cost: 32
> Scalar cost: 48
> t.c:10:10: note: Basic block will be vectorized using SLP
That makes no sense.
4 scalar-to-vector moves + 2 vector shuffles + 2 vector stores are A LOT more
costly than 4 scalar stores.
Even more so considering that the scalar stores go to adjacent addresses, so
on good CPUs they are likely combined at the level of the store queue and the
cache subsystem sees fewer operations.
Either your cost model is broken or there are bugs in the summation.
I'd guess the compiler somehow thinks that the moves have zero cost. But
scalar-to-vector moves are certainly not of zero cost.
Even scalar-to-scalar or vector-to-vector moves that are eliminated at the
renamer do not have zero cost, because quite often the renamer itself
constitutes the narrowest performance bottleneck. And those moves... I don't
think they are eliminated by the renamer at all.
Also, it's likely that the compiler thinks that a scalar store costs the same
as a vector store. That's also generally incorrect, especially when you don't
know your target CPU and don't know whether the stores are aligned or not, as
in this case.
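To see the cost computation quoted in comment #3 for yourself, the
vectorizer's reasoning can be dumped directly, and basic-block SLP can be
disabled independently of loop vectorization. A sketch, assuming the testcase
is saved as add4i.c; all three flags are documented GCC options:

  gcc -O2 -fopt-info-vec-all -c add4i.c         # report vectorization decisions
  gcc -O2 -fdump-tree-slp2-details -c add4i.c   # full SLP dump incl. cost model
  gcc -O2 -fno-tree-slp-vectorize -c add4i.c    # workaround: disable BB SLP only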
* [Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases
From: already5chosen at yahoo dot com @ 2022-05-16 14:08 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #6 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Michael_S from comment #5)
>
> Even scalar-to-scalar or vector-to-vector moves that are eliminated at the
> renamer do not have zero cost, because quite often the renamer itself
> constitutes the narrowest performance bottleneck. And those moves... I don't
> think they are eliminated by the renamer at all.
I took a look at several Intel and AMD optimization reference manuals and
instruction tables. None of the existing x86 microarchitectures, either old
or new, eliminates scalar-to-SIMD moves at the renamer. That is sort of
obvious for the newer microarchitectures (Bulldozer or later for AMD, Sandy
Bridge or later for Intel), because on these microarchitectures scalar and
SIMD registers live in separate physical register files.
As for the older microarchitectures, they may have kept both in a common
physical storage area, but they simply were not smart enough to eliminate the
moves.
So, these moves have non-zero latency. On some cores, including some of the
newest, the latency is even higher than one clock. And the throughput tends
to be rather low: most typically, one scalar-to-SIMD move per clock. For
comparison, scalar-to-scalar and SIMD-to-SIMD moves can be executed (or
eliminated at the renamer) at rates of 2, 3, or even 4 per clock.
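A quick way to check this on a given core is a dependency-chain probe; a
rough sketch using GCC inline asm and __rdtsc (note that __rdtsc counts
reference cycles, so on cores with variable clocks the result is only
qualitative):

#include <stdio.h>
#include <x86intrin.h>

int main(void)
{
  unsigned long long x = 1;
  unsigned long long t0 = __rdtsc();
  for (int i = 0; i < 100000000; i++)
    /* GPR -> XMM -> GPR round trip forms a serial dependency chain,
       so the loop time approximates the combined move latency. */
    __asm__ volatile ("movq %0, %%xmm0\n\tmovq %%xmm0, %0"
                      : "+r"(x) : : "xmm0");
  unsigned long long t1 = __rdtsc();
  printf("~%.2f ref cycles per GPR<->XMM round trip\n",
         (double)(t1 - t0) / 1e8);
  return 0;
}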
* [Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases
From: crazylht at gmail dot com @ 2022-05-17 2:47 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #7 from Hongtao.liu <crazylht at gmail dot com> ---
Hmm, we have specific code to add the scalar->vector (vmovq) cost to vector
construction, but it seems not to work here; I guess that's because of &r0
the operand was treated as a load rather than a scalar?
r0.1_21 1 times scalar_store costs 12 in body
r1.3_23 1 times scalar_store costs 12 in body
r2.5_25 1 times scalar_store costs 12 in body
r3.7_27 1 times scalar_store costs 12 in body
node 0x866f238 1 times vec_construct costs 4 in prologue
node 0x866f238 1 times vec_construct costs 4 in prologue
r0.1_21 2 times unaligned_store (misalign -1) costs 24 in body
/app/example.c:10:10: note: Cost model analysis for part in loop 0:
Vector cost: 32
Scalar cost: 48
* [Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases
From: crazylht at gmail dot com @ 2022-05-17 3:29 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #8 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #7)
> Hmm, we have specific code to add the scalar->vector (vmovq) cost to vector
> construction, but it seems not to work here; I guess that's because of &r0
> the operand was treated as a load rather than a scalar?
Yes, it tests true for gimple_assign_load_p:
(gdb) p debug_gimple_stmt (def)
# VUSE <.MEM_46>
r0.0_20 = r0;
(gdb) l
23246              move it to a vector register, otherwise we have to go
23247              via a GPR or via vpinsr which involves similar cost.
23248              Likewise with a BIT_FIELD_REF extracting from a vector
23249              register we can hope to avoid using a GPR.  */
23250           if (!is_gimple_assign (def)
23251               || (!gimple_assign_load_p (def)
23252                   && (gimple_assign_rhs_code (def) != BIT_FIELD_REF
23253                       || !VECTOR_TYPE_P (TREE_TYPE
23254                             (TREE_OPERAND (gimple_assign_rhs1 (def), 0))))))
23255             stmt_cost += ix86_cost->sse_to_integer;
* [Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases
From: crazylht at gmail dot com @ 2022-05-17 3:37 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #9 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #8)
> (In reply to Hongtao.liu from comment #7)
> > Hmm, we have specific code to add the scalar->vector (vmovq) cost to
> > vector construction, but it seems not to work here; I guess that's because
> > of &r0 the operand was treated as a load rather than a scalar?
> Yes, it tests true for gimple_assign_load_p:
>
> (gdb) p debug_gimple_stmt (def)
> # VUSE <.MEM_46>
> r0.0_20 = r0;
It's a load from the stack, and it is finally eliminated in RTL dse1, but the
vectorizer doesn't know that here. And SLP will not vectorize the stores when
the extra scalar->vector cost is accounted for:
typedef unsigned long long uint64_t;
void add4i(uint64_t r0, uint64_t r1, uint64_t r2, uint64_t r3, uint64_t *dst)
{
  dst[0] = r0;
  dst[1] = r1;
  dst[2] = r2;
  dst[3] = r3;
}
add4i:
        mov     QWORD PTR [r8], rdi
        mov     QWORD PTR [r8+8], rsi
        mov     QWORD PTR [r8+16], rdx
        mov     QWORD PTR [r8+24], rcx
        ret
* [Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases
From: rguenth at gcc dot gnu.org @ 2022-05-17 6:48 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Last reconfirmed| |2022-05-17
Ever confirmed|0 |1
Status|UNCONFIRMED |NEW
--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #9)
> (In reply to Hongtao.liu from comment #8)
> > (In reply to Hongtao.liu from comment #7)
> > > Hmm, we have specific code to add the scalar->vector (vmovq) cost to
> > > vector construction, but it seems not to work here; I guess that's
> > > because of &r0 the operand was treated as a load rather than a scalar?
> > Yes, it tests true for gimple_assign_load_p:
> >
> > (gdb) p debug_gimple_stmt (def)
> > # VUSE <.MEM_46>
> > r0.0_20 = r0;
> It's a load from the stack, and it is finally eliminated in RTL dse1, but
> the vectorizer doesn't know that here.
Yes, it's difficult for the SLP vectorizer to guess whether rN will come
from memory or not. Some friendlier middle-end representation for
add-with-carry might be nice - the x86 backend could for example fold
__builtin_ia32_addcarryx_u64 to use a _Complex unsigned long long for the
return, ferrying the carry in __imag. Alternatively we could devise
some special GIMPLE_ASM kind ferrying RTL and not assembly so the
backend could fold it directly to RTL on GIMPLE with asm constraints
doing the plumbing ... (we'd need some match-scratch and RTL expansion
would still need to allocate the actual pseudos).
<bb 2> [local count: 1073741824]:
_1 = *srcB_17(D);
_2 = *srcA_18(D);
_30 = __builtin_ia32_addcarryx_u64 (0, _2, _1, &r0);
_3 = MEM[(const uint64_t *)srcB_17(D) + 8B];
_4 = MEM[(const uint64_t *)srcA_18(D) + 8B];
_5 = (int) _30;
_29 = __builtin_ia32_addcarryx_u64 (_5, _4, _3, &r1);
_6 = MEM[(const uint64_t *)srcB_17(D) + 16B];
_7 = MEM[(const uint64_t *)srcA_18(D) + 16B];
_8 = (int) _29;
_28 = __builtin_ia32_addcarryx_u64 (_8, _7, _6, &r2);
_9 = MEM[(const uint64_t *)srcB_17(D) + 24B];
_10 = MEM[(const uint64_t *)srcA_18(D) + 24B];
_11 = (int) _28;
__builtin_ia32_addcarryx_u64 (_11, _10, _9, &r3);
r0.0_12 = r0;
r1.1_13 = r1;
_36 = {r0.0_12, r1.1_13};
r2.2_14 = r2;
r3.3_15 = r3;
_37 = {r2.2_14, r3.3_15};
vectp.9_35 = dst_19(D);
MEM <vector(2) long unsigned int> [(uint64_t *)vectp.9_35] = _36;
vectp.9_39 = vectp.9_35 + 16;
MEM <vector(2) long unsigned int> [(uint64_t *)vectp.9_39] = _37;
so for the situation at hand I don't see any reasonable way out that
doesn't have the chance of regressing things in other places (like
treating loads from non-indexed auto variables specially or so). The
only real solution is to find a GIMPLE representation for
__builtin_ia32_addcarryx_u64 that doesn't force the alternate output
to memory.
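To make the suggestion concrete, here is a rough C-level sketch of the shape
such a representation could take. The addc_u64 helper is hypothetical (it is
not an existing GCC builtin); it only illustrates how returning the sum/carry
pair by value in a _Complex unsigned long long (a GNU extension for complex
integer types) keeps the whole chain out of memory:

typedef unsigned long long u64;

/* Hypothetical add-with-carry returning both outputs by value,
   ferrying the carry in __imag as suggested above. */
static inline _Complex unsigned long long
addc_u64 (u64 a, u64 b, u64 carry_in)
{
  u64 sum = a + b;
  u64 c = sum < a;        /* carry out of a + b */
  sum += carry_in;
  c |= sum < carry_in;    /* carry out of adding carry_in */
  _Complex unsigned long long res;
  __real__ res = sum;
  __imag__ res = c;
  return res;
}

void add4i (u64 dst[4], const u64 A[4], const u64 B[4])
{
  _Complex unsigned long long t;
  t = addc_u64 (A[0], B[0], 0);           dst[0] = __real__ t;
  t = addc_u64 (A[1], B[1], __imag__ t);  dst[1] = __real__ t;
  t = addc_u64 (A[2], B[2], __imag__ t);  dst[2] = __real__ t;
  t = addc_u64 (A[3], B[3], __imag__ t);  dst[3] = __real__ t;
}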
* [Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases
From: already5chosen at yahoo dot com @ 2022-05-17 9:17 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #11 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Richard Biener from comment #10)
> (In reply to Hongtao.liu from comment #9)
> > (In reply to Hongtao.liu from comment #8)
> > > (In reply to Hongtao.liu from comment #7)
> > > > Hmm, we have specific code to add the scalar->vector (vmovq) cost to
> > > > vector construction, but it seems not to work here; I guess that's
> > > > because of &r0 the operand was treated as a load rather than a scalar?
> > > Yes, it tests true for gimple_assign_load_p:
> > >
> > > (gdb) p debug_gimple_stmt (def)
> > > # VUSE <.MEM_46>
> > > r0.0_20 = r0;
> > It's a load from the stack, and it is finally eliminated in RTL dse1, but
> > the vectorizer doesn't know that here.
>
> Yes, it's difficult for the SLP vectorizer to guess whether rN will come
> from memory or not. Some friendlier middle-end representation for
> add-with-carry might be nice - the x86 backend could for example fold
> __builtin_ia32_addcarryx_u64 to use a _Complex unsigned long long for the
> return, ferrying the carry in __imag. Alternatively we could devise
> some special GIMPLE_ASM kind ferrying RTL and not assembly so the
> backend could fold it directly to RTL on GIMPLE with asm constraints
> doing the plumbing ... (we'd need some match-scratch and RTL expansion
> would still need to allocate the actual pseudos).
>
> <bb 2> [local count: 1073741824]:
> _1 = *srcB_17(D);
> _2 = *srcA_18(D);
> _30 = __builtin_ia32_addcarryx_u64 (0, _2, _1, &r0);
> _3 = MEM[(const uint64_t *)srcB_17(D) + 8B];
> _4 = MEM[(const uint64_t *)srcA_18(D) + 8B];
> _5 = (int) _30;
> _29 = __builtin_ia32_addcarryx_u64 (_5, _4, _3, &r1);
> _6 = MEM[(const uint64_t *)srcB_17(D) + 16B];
> _7 = MEM[(const uint64_t *)srcA_18(D) + 16B];
> _8 = (int) _29;
> _28 = __builtin_ia32_addcarryx_u64 (_8, _7, _6, &r2);
> _9 = MEM[(const uint64_t *)srcB_17(D) + 24B];
> _10 = MEM[(const uint64_t *)srcA_18(D) + 24B];
> _11 = (int) _28;
> __builtin_ia32_addcarryx_u64 (_11, _10, _9, &r3);
> r0.0_12 = r0;
> r1.1_13 = r1;
> _36 = {r0.0_12, r1.1_13};
> r2.2_14 = r2;
> r3.3_15 = r3;
> _37 = {r2.2_14, r3.3_15};
> vectp.9_35 = dst_19(D);
> MEM <vector(2) long unsigned int> [(uint64_t *)vectp.9_35] = _36;
> vectp.9_39 = vectp.9_35 + 16;
> MEM <vector(2) long unsigned int> [(uint64_t *)vectp.9_39] = _37;
>
> so for the situation at hand I don't see any reasonable way out that
> doesn't have the chance of regressing things in other places (like
> treating loads from non-indexed auto variables specially or so). The
> only real solution is to find a GIMPLE representation for
> __builtin_ia32_addcarryx_u64 that doesn't force the alternate output
> to memory.
How about a simple heuristic: "Never auto-vectorize integer code unless it is
in an inner loop"?
That would be optimal >80% of the time, and even in the cases where it's
sub-optimal the impact is likely a single-digit percentage. On the other
hand, the impact of a mistake in the opposite direction (i.e.
over-vectorization) is often quite large.
* [Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases
From: already5chosen at yahoo dot com @ 2022-05-17 9:25 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #12 from Michael_S <already5chosen at yahoo dot com> ---
On a related note...
One of the historically good features of gcc relative to other popular
compilers was the absence of auto-vectorization at -O2.
When did you decide to change that, and why?
* [Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases
From: crazylht at gmail dot com @ 2022-05-19 8:05 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #13 from Hongtao.liu <crazylht at gmail dot com> ---
> so for the situation at hand I don't see any reasonable way out that
> doesn't have the chance of regressing things in other places (like
> treating loads from non-indexed auto variables specially or so). The
> only real solution is to find a GIMPLE representation for
> __builtin_ia32_addcarryx_u64 that doesn't force the alternate output
> to memory.
Do we support parameters with an "out" attribute (a parameter that is set by
the builtin)? __builtin_ia32_addcarryx_u64 produces 2 outputs; one can be the
return value, but the other must use a by-reference parameter, which forces
the memory usage here.
* [Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases
From: rguenth at gcc dot gnu.org @ 2022-07-26 11:22 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Michael_S from comment #12)
> On a related note...
> One of the historically good features of gcc relative to other popular
> compilers was the absence of auto-vectorization at -O2.
> When did you decide to change that, and why?
Well, people requested it ... so GCC 12 now has it (with an extra eye on code
size - though that doesn't work for scalar-code vectorization, which OTOH is
thought to almost always benefit code size ...).
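(For context: at -O2, GCC 12 enables loop and SLP vectorization under the
very-cheap vector cost model. The historical -O2 behavior can be approximated
with explicit, documented flags - a sketch:)

  gcc -O2 file.c                                                    # GCC 12 default
  gcc -O2 -fno-tree-loop-vectorize -fno-tree-slp-vectorize file.c   # ~pre-12 -O2
  gcc -O2 -fvect-cost-model=cheap file.c                            # vectorize less eagerly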
* [Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases
From: rguenth at gcc dot gnu.org @ 2022-07-26 11:22 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Priority|P3 |P2
* [Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases
From: already5chosen at yahoo dot com @ 2022-07-29 13:21 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #15 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Richard Biener from comment #14)
> (In reply to Michael_S from comment #12)
> > On a related note...
> > One of the historically good features of gcc relative to other popular
> > compilers was the absence of auto-vectorization at -O2.
> > When did you decide to change that, and why?
>
> Well, people requested it ... so GCC 12 now has it (with an extra eye on
> code size - though that doesn't work for scalar-code vectorization, which
> OTOH is thought to almost always benefit code size ...).
"People" is a vague term.
You didn't organize a vote among all gcc users, I suppose.
So, count me as "people" who asks to remove it.
Somehow, I suspect that the other one who will vote like me has the middle
name Benedict.
* [Bug target/105617] [12/13/14 Regression] Slp is maybe too aggressive in some/many cases
From: rguenth at gcc dot gnu.org @ 2023-05-08 12:24 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|12.3 |12.4
--- Comment #17 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 12.3 is being released, retargeting bugs to GCC 12.4.
* [Bug target/105617] [12/13/14 Regression] Slp is maybe too aggressive in some/many cases
From: slash.tmp at free dot fr @ 2023-06-01 7:58 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #18 from Mason <slash.tmp at free dot fr> ---
Hello Michael_S,
As far as I can see, massaging the source helps GCC generate optimal code
(in terms of instruction count; I'm not convinced about the scheduling).
#include <x86intrin.h>
typedef unsigned long long u64;

void add4i(u64 dst[4], const u64 A[4], const u64 B[4])
{
  unsigned char c = 0;
  c = _addcarry_u64(c, A[0], B[0], dst+0);
  c = _addcarry_u64(c, A[1], B[1], dst+1);
  c = _addcarry_u64(c, A[2], B[2], dst+2);
  c = _addcarry_u64(c, A[3], B[3], dst+3);
}
On godbolt, gcc-{11.4, 12.3, 13.1, trunk} -O3 -march=znver1 all generate
the expected:
add4i:
        movq    (%rdx), %rax
        addq    (%rsi), %rax
        movq    %rax, (%rdi)
        movq    8(%rsi), %rax
        adcq    8(%rdx), %rax
        movq    %rax, 8(%rdi)
        movq    16(%rsi), %rax
        adcq    16(%rdx), %rax
        movq    %rax, 16(%rdi)
        movq    24(%rdx), %rax
        adcq    24(%rsi), %rax
        movq    %rax, 24(%rdi)
        ret
I'll run a few benchmarks to test optimal scheduling.
* [Bug target/105617] [12/13/14 Regression] Slp is maybe too aggressive in some/many cases
From: already5chosen at yahoo dot com @ 2023-06-07 23:16 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #19 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Mason from comment #18)
> Hello Michael_S,
>
> As far as I can see, massaging the source helps GCC generate optimal code
> (in terms of instruction count; I'm not convinced about the scheduling).
>
> #include <x86intrin.h>
> typedef unsigned long long u64;
> void add4i(u64 dst[4], const u64 A[4], const u64 B[4])
> {
> unsigned char c = 0;
> c = _addcarry_u64(c, A[0], B[0], dst+0);
> c = _addcarry_u64(c, A[1], B[1], dst+1);
> c = _addcarry_u64(c, A[2], B[2], dst+2);
> c = _addcarry_u64(c, A[3], B[3], dst+3);
> }
>
>
> On godbolt, gcc-{11.4, 12.3, 13.1, trunk} -O3 -march=znver1 all generate
> the expected:
>
> add4i:
> movq (%rdx), %rax
> addq (%rsi), %rax
> movq %rax, (%rdi)
> movq 8(%rsi), %rax
> adcq 8(%rdx), %rax
> movq %rax, 8(%rdi)
> movq 16(%rsi), %rax
> adcq 16(%rdx), %rax
> movq %rax, 16(%rdi)
> movq 24(%rdx), %rax
> adcq 24(%rsi), %rax
> movq %rax, 24(%rdi)
> ret
>
> I'll run a few benchmarks to test optimal scheduling.
That's not merely "massaging the source". That's changing the semantics.
Think about what happens when dst points into the middle of A or of B.
The change of semantics effectively prevented the vectorizer from doing harm.
And yes, for the common non-aliasing case the scheduling is problematic, too.
It would probably not cause a slowdown on the latest and greatest cores, but
it could be slow on less great cores, including your default Zen1.
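For what it's worth, when the caller really can guarantee no overlap, that
guarantee can be written into the source with C99 restrict instead of being
implied by the temporaries; a sketch (with restrict, calling this with dst
aliasing A or B becomes undefined behavior):

#include <x86intrin.h>
typedef unsigned long long u64;

/* With restrict, storing each limb straight to dst is legal even though
   the stores are interleaved with the carry chain. */
void add4i(u64 *restrict dst, const u64 *restrict A, const u64 *restrict B)
{
  unsigned char c = 0;
  c = _addcarry_u64(c, A[0], B[0], dst + 0);
  c = _addcarry_u64(c, A[1], B[1], dst + 1);
  c = _addcarry_u64(c, A[2], B[2], dst + 2);
  c = _addcarry_u64(c, A[3], B[3], dst + 3);
  (void) c;
}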
* [Bug target/105617] [12/13/14 Regression] Slp is maybe too aggressive in some/many cases
From: slash.tmp at free dot fr @ 2023-06-13 14:54 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #20 from Mason <slash.tmp at free dot fr> ---
Doh! You're right.
I come from a background where overlapping/aliasing inputs are heresy,
thus got blindsided :(
This would be the optimal code, right?
add4i:
        # rdi = dst, rsi = a, rdx = b
        movq    0(%rdx), %r8
        movq    8(%rdx), %rax
        movq    16(%rdx), %rcx
        movq    24(%rdx), %rdx
        addq    0(%rsi), %r8
        adcq    8(%rsi), %rax
        adcq    16(%rsi), %rcx
        adcq    24(%rsi), %rdx
        movq    %r8, 0(%rdi)
        movq    %rax, 8(%rdi)
        movq    %rcx, 16(%rdi)
        movq    %rdx, 24(%rdi)
        ret
FWIW, trunk generates even nastier code for znver2:
add4i:
        movq    (%rdx), %rax
        addq    (%rsi), %rax
        movq    8(%rsi), %rcx
        adcq    8(%rdx), %rcx
        movq    16(%rsi), %r8
        adcq    16(%rdx), %r8
        movq    24(%rdx), %rdx
        adcq    24(%rsi), %rdx
        vmovq   %rax, %xmm2
        vpinsrq $1, %rcx, %xmm2, %xmm0
        vmovq   %r8, %xmm1
        vpinsrq $1, %rdx, %xmm1, %xmm1
        vinserti128     $0x1, %xmm1, %ymm0, %ymm0
        vmovdqu %ymm0, (%rdi)
        vzeroupper
        ret
* [Bug target/105617] [12/13/14 Regression] Slp is maybe too aggressive in some/many cases
From: already5chosen at yahoo dot com @ 2023-06-16 14:56 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #21 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Mason from comment #20)
> Doh! You're right.
> I come from a background where overlapping/aliasing inputs are heresy,
> thus got blindsided :(
>
> This would be the optimal code, right?
>
> add4i:
> # rdi = dst, rsi = a, rdx = b
> movq 0(%rdx), %r8
> movq 8(%rdx), %rax
> movq 16(%rdx), %rcx
> movq 24(%rdx), %rdx
> addq 0(%rsi), %r8
> adcq 8(%rsi), %rax
> adcq 16(%rsi), %rcx
> adcq 24(%rsi), %rdx
> movq %r8, 0(%rdi)
> movq %rax, 8(%rdi)
> movq %rcx, 16(%rdi)
> movq %rdx, 24(%rdi)
> ret
>
If one does not care deeply about latency (which is likely for a function
that stores its result into memory), then that looks good enough.
But if one does care deeply, then I'd expect the interleaved loads, as in the
first 8 lines of the code generated by trunk, to produce slightly lower
latency on the majority of modern CPUs.