public inbox for gcc-bugs@sourceware.org
* [Bug middle-end/99395] New: s116 benchmark of TSVC is vectorized by clang and not by gcc
@ 2021-03-04 23:01 hubicka at gcc dot gnu.org
2021-03-04 23:24 ` [Bug middle-end/99395] " hubicka at gcc dot gnu.org
` (19 more replies)
0 siblings, 20 replies; 21+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-03-04 23:01 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
Bug ID: 99395
Summary: s116 benchmark of TSVC is vectorized by clang and not
by gcc
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: hubicka at gcc dot gnu.org
Target Milestone: ---
s116 loop is:
real_t s116(struct args_t * func_args)
{
// linear dependence testing
initialise_arrays(__func__);
gettimeofday(&func_args->t1, NULL);
for (int nl = 0; nl < iterations*10; nl++) {
for (int i = 0; i < LEN_1D - 5; i += 5) {
a[i] = a[i + 1] * a[i];
a[i + 1] = a[i + 2] * a[i + 1];
a[i + 2] = a[i + 3] * a[i + 2];
a[i + 3] = a[i + 4] * a[i + 3];
a[i + 4] = a[i + 5] * a[i + 4];
}
dummy(a, b, c, d, e, aa, bb, cc, 0.);
}
gettimeofday(&func_args->t2, NULL);
return calc_checksum(__func__);
}
and the vectorized code produced by clang 11 is about 2 times faster on a Zen 3 machine:
0000000000401d00 <s116>:
401d00: 41 56 push %r14
401d02: 53 push %rbx
401d03: 50 push %rax
401d04: 49 89 fe mov %rdi,%r14
401d07: bf 66 e1 42 00 mov $0x42e166,%edi
401d0c: e8 ff 58 01 00 call 417610 <initialise_arrays>
401d11: 31 db xor %ebx,%ebx
401d13: 4c 89 f7 mov %r14,%rdi
401d16: 31 f6 xor %esi,%esi
401d18: e8 43 f3 ff ff call 401060 <gettimeofday@plt>
401d1d: eb 47 jmp 401d66 <s116+0x66>
401d1f: 90 nop
401d20: bf 00 25 45 00 mov $0x452500,%edi
401d25: be 00 31 43 00 mov $0x433100,%esi
401d2a: ba 00 19 47 00 mov $0x471900,%edx
401d2f: b9 00 0d 49 00 mov $0x490d00,%ecx
401d34: 41 b8 00 01 4b 00 mov $0x4b0100,%r8d
401d3a: 41 b9 00 f5 4c 00 mov $0x4cf500,%r9d
401d40: c5 f8 57 c0 vxorps %xmm0,%xmm0,%xmm0
401d44: 68 00 f5 54 00 push $0x54f500
401d49: 68 00 f5 50 00 push $0x50f500
401d4e: e8 6d 3c 01 00 call 4159c0 <dummy>
401d53: 48 83 c4 10 add $0x10,%rsp
401d57: 83 c3 01 add $0x1,%ebx
401d5a: 81 fb 40 42 0f 00 cmp $0xf4240,%ebx
401d60: 0f 84 9a 00 00 00 je 401e00 <s116+0x100>
401d66: c5 fa 10 05 92 07 05 vmovss 0x50792(%rip),%xmm0 # 452500 <a>
401d6d: 00
401d6e: 31 c0 xor %eax,%eax
401d70: c5 fa 10 0c 85 04 25 vmovss 0x452504(,%rax,4),%xmm1
401d77: 45 00
401d79: c5 fa 59 c1 vmulss %xmm1,%xmm0,%xmm0
401d7d: c5 fa 11 04 85 00 25 vmovss %xmm0,0x452500(,%rax,4)
401d84: 45 00
401d86: c5 f8 10 04 85 08 25 vmovups 0x452508(,%rax,4),%xmm0
401d8d: 45 00
401d8f: c5 f0 c6 c8 00 vshufps $0x0,%xmm0,%xmm1,%xmm1
401d94: c5 f0 c6 c8 98 vshufps $0x98,%xmm0,%xmm1,%xmm1
401d99: c5 f8 59 c9 vmulps %xmm1,%xmm0,%xmm1
401d9d: c5 f8 11 0c 85 04 25 vmovups %xmm1,0x452504(,%rax,4)
401da4: 45 00
401da6: 48 3d f5 7c 00 00 cmp $0x7cf5,%rax
401dac: 0f 87 6e ff ff ff ja 401d20 <s116+0x20>
401db2: c4 e3 79 04 c0 e7 vpermilps $0xe7,%xmm0,%xmm0
401db8: c5 fa 10 0c 85 18 25 vmovss 0x452518(,%rax,4),%xmm1
401dbf: 45 00
401dc1: c5 fa 59 c1 vmulss %xmm1,%xmm0,%xmm0
401dc5: c5 fa 11 04 85 14 25 vmovss %xmm0,0x452514(,%rax,4)
401dcc: 45 00
401dce: c5 f8 10 04 85 1c 25 vmovups 0x45251c(,%rax,4),%xmm0
401dd5: 45 00
401dd7: c5 f0 c6 c8 00 vshufps $0x0,%xmm0,%xmm1,%xmm1
401ddc: c5 f0 c6 c8 98 vshufps $0x98,%xmm0,%xmm1,%xmm1
401de1: c5 f8 59 c9 vmulps %xmm1,%xmm0,%xmm1
401de5: c5 fa 10 04 85 28 25 vmovss 0x452528(,%rax,4),%xmm0
401dec: 45 00
401dee: c5 f8 11 0c 85 18 25 vmovups %xmm1,0x452518(,%rax,4)
401df5: 45 00
401df7: 48 83 c0 0a add $0xa,%rax
401dfb: e9 70 ff ff ff jmp 401d70 <s116+0x70>
401e00: 49 83 c6 10 add $0x10,%r14
401e04: 4c 89 f7 mov %r14,%rdi
401e07: 31 f6 xor %esi,%esi
401e09: e8 52 f2 ff ff call 401060 <gettimeofday@plt>
401e0e: bf 66 e1 42 00 mov $0x42e166,%edi
401e13: 48 83 c4 08 add $0x8,%rsp
401e17: 5b pop %rbx
401e18: 41 5e pop %r14
401e1a: e9 e1 51 02 00 jmp 427000 <calc_checksum>
401e1f: 90 nop
* [Bug middle-end/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
From: hubicka at gcc dot gnu.org @ 2021-03-04 23:24 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
--- Comment #1 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Loop is:
real_t s116 (struct args_t * func_args)
{
int i;
int nl;
static const char __func__[5] = "s116";
struct timeval * _1;
int _2;
float _3;
float _4;
float _5;
int _6;
float _7;
float _8;
float _9;
int _10;
float _11;
float _12;
float _13;
int _14;
float _15;
float _16;
float _17;
int _18;
float _19;
float _20;
float _21;
struct timeval * _22;
real_t _33;
unsigned int ivtmp_43;
unsigned int ivtmp_44;
unsigned int ivtmp_45;
unsigned int ivtmp_46;
<bb 2> [local count: 108459]:
initialise_arrays (&__func__);
_1 = &func_args_29(D)->t1;
gettimeofday (_1, 0B);
goto <bb 5>; [100.00%]
<bb 8> [local count: 1052266996]:
<bb 3> [local count: 1063004409]:
# i_48 = PHI <_18(8), 0(5)>
# ivtmp_46 = PHI <ivtmp_45(8), 6399(5)>
_2 = i_48 + 1;
_3 = a[_2];
_4 = a[i_48];
_5 = _3 * _4;
a[i_48] = _5;
_6 = i_48 + 2;
_7 = a[_6];
_8 = a[_2];
_9 = _7 * _8;
a[_2] = _9;
_10 = i_48 + 3;
_11 = a[_10];
_12 = a[_6];
_13 = _11 * _12;
a[_6] = _13;
_14 = i_48 + 4;
_15 = a[_14];
_16 = a[_10];
_17 = _15 * _16;
a[_10] = _17;
_18 = i_48 + 5;
_19 = a[_18];
_20 = a[_14];
_21 = _19 * _20;
a[_14] = _21;
ivtmp_45 = ivtmp_46 - 1;
if (ivtmp_45 != 0)
goto <bb 8>; [98.99%]
else
goto <bb 4>; [1.01%]
tsvc.c:275:18: missed: not vectorized, possible dependence between data-refs
a[i_48] and a[_18]
tsvc.c:274:27: missed: bad data dependence.
_18 = i_48 + 5 and stride is 5...
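For anyone who wants to look at this outside the TSVC harness, here is a minimal sketch of the kernel as a standalone function. The names s116_kernel and s116_trip_count are hypothetical, and taking LEN_1D to be 32000 is only an assumption that happens to match the 6399 iterations in the dump above; float matches the types shown in the dump.

```c
#include <assert.h>
#include <stddef.h>

/* One pass of the s116 inner loop over an array of n floats.
   n plays the role of LEN_1D (an assumption for this sketch).
   The loop reads a[i + 5], so the bound n - 5 keeps every
   access inside the array. */
void s116_kernel(float *a, int n)
{
    assert(a != NULL && n >= 6);
    for (int i = 0; i < n - 5; i += 5) {
        a[i]     = a[i + 1] * a[i];
        a[i + 1] = a[i + 2] * a[i + 1];
        a[i + 2] = a[i + 3] * a[i + 2];
        a[i + 3] = a[i + 4] * a[i + 3];
        a[i + 4] = a[i + 5] * a[i + 4];
    }
}

/* Number of iterations the loop above executes for a given n. */
int s116_trip_count(int n)
{
    int count = 0;
    for (int i = 0; i < n - 5; i += 5)
        count++;
    return count;
}
```

The read of a[i + 5] in one iteration touches the element that the next iteration writes as a[i], which is the pair of data-refs named in the "possible dependence" message.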
* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
From: rguenth at gcc dot gnu.org @ 2021-03-05 8:27 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Last reconfirmed| |2021-03-05
CC| |rguenth at gcc dot gnu.org,
| |rsandifo at gcc dot gnu.org
Keywords| |missed-optimization
Ever confirmed|0 |1
Status|UNCONFIRMED |NEW
Component|middle-end |tree-optimization
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
please provide compilable testcases ...
Reduced testcase:
double a[1024];
void foo ()
{
for (int i = 0; i < 1022; i += 2)
{
a[i] = a[i+1] * a[i];
a[i+1] = a[i+2] * a[i+1];
}
}
* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
From: hubicka at gcc dot gnu.org @ 2021-03-05 15:42 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
--- Comment #3 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
ICC version seems to run faster
000000000040a050 <s116>:
40a050: 55 push %rbp
40a051: 48 89 e5 mov %rsp,%rbp
40a054: 48 83 e4 e0 and $0xffffffffffffffe0,%rsp
40a058: 41 57 push %r15
40a05a: 53 push %rbx
40a05b: 48 83 ec 10 sub $0x10,%rsp
40a05f: 48 89 fb mov %rdi,%rbx
40a062: bf 74 f5 42 00 mov $0x42f574,%edi
40a067: e8 14 cc 00 00 call 416c80 <initialise_arrays>
40a06c: 48 89 df mov %rbx,%rdi
40a06f: 33 f6 xor %esi,%esi
40a071: e8 4a 70 ff ff call 4010c0 <gettimeofday@plt>
40a076: 33 c0 xor %eax,%eax
40a078: 41 89 c7 mov %eax,%r15d
40a07b: 33 d2 xor %edx,%edx
40a07d: 0f 1f 00 nopl (%rax)
40a080: c5 fc 10 04 95 04 9d vmovups 0x579d04(,%rdx,4),%ymm0
40a087: 57 00
40a089: c5 fc 10 14 95 24 9d vmovups 0x579d24(,%rdx,4),%ymm2
40a090: 57 00
40a092: c5 fc 10 24 95 44 9d vmovups 0x579d44(,%rdx,4),%ymm4
40a099: 57 00
40a09b: c5 fc 10 34 95 64 9d vmovups 0x579d64(,%rdx,4),%ymm6
40a0a2: 57 00
40a0a4: c5 fc 59 0c 95 00 9d vmulps 0x579d00(,%rdx,4),%ymm0,%ymm1
40a0ab: 57 00
40a0ad: c5 ec 59 1c 95 20 9d vmulps 0x579d20(,%rdx,4),%ymm2,%ymm3
40a0b4: 57 00
40a0b6: c5 dc 59 2c 95 40 9d vmulps 0x579d40(,%rdx,4),%ymm4,%ymm5
40a0bd: 57 00
40a0bf: c5 cc 59 3c 95 60 9d vmulps 0x579d60(,%rdx,4),%ymm6,%ymm7
40a0c6: 57 00
40a0c8: c5 fc 11 0c 95 00 9d vmovups %ymm1,0x579d00(,%rdx,4)
40a0cf: 57 00
40a0d1: c5 fc 11 1c 95 20 9d vmovups %ymm3,0x579d20(,%rdx,4)
40a0d8: 57 00
40a0da: c5 fc 11 2c 95 40 9d vmovups %ymm5,0x579d40(,%rdx,4)
40a0e1: 57 00
40a0e3: c5 fc 11 3c 95 60 9d vmovups %ymm7,0x579d60(,%rdx,4)
40a0ea: 57 00
40a0ec: 48 83 c2 20 add $0x20,%rdx
40a0f0: 48 81 fa e0 7c 00 00 cmp $0x7ce0,%rdx
40a0f7: 72 87 jb 40a080 <s116+0x30>
40a0f9: 33 c9 xor %ecx,%ecx
40a0fb: ba e1 7c 00 00 mov $0x7ce1,%edx
40a100: c5 fc 10 04 95 00 9d vmovups 0x579d00(,%rdx,4),%ymm0
40a107: 57 00
40a109: 48 83 c2 08 add $0x8,%rdx
40a10d: c5 fc 59 0c 8d 80 90 vmulps 0x599080(,%rcx,4),%ymm0,%ymm1
40a114: 59 00
40a116: c5 fc 11 0c 8d 80 90 vmovups %ymm1,0x599080(,%rcx,4)
40a11d: 59 00
40a11f: 48 83 c1 08 add $0x8,%rcx
40a123: 48 83 f9 18 cmp $0x18,%rcx
40a127: 72 d7 jb 40a100 <s116+0xb0>
40a129: c5 fa 10 0d b3 ef 18 vmovss 0x18efb3(%rip),%xmm1 # 5990e4 <a+0x1f3e4>
40a130: 00
40a131: bf 00 9d 57 00 mov $0x579d00,%edi
40a136: c5 fa 10 1d aa ef 18 vmovss 0x18efaa(%rip),%xmm3 # 5990e8 <a+0x1f3e8>
40a13d: 00
40a13e: be 80 d8 45 00 mov $0x45d880,%esi
40a143: c5 f2 59 05 95 ef 18 vmulss 0x18ef95(%rip),%xmm1,%xmm0 # 5990e0 <a+0x1f3e0>
40a14a: 00
40a14b: ba 00 a9 55 00 mov $0x55a900,%edx
40a150: c5 e2 59 25 94 ef 18 vmulss 0x18ef94(%rip),%xmm3,%xmm4 # 5990ec <a+0x1f3ec>
40a157: 00
40a158: c5 f2 59 d3 vmulss %xmm3,%xmm1,%xmm2
40a15c: c5 fa 11 05 7c ef 18 vmovss %xmm0,0x18ef7c(%rip) # 5990e0 <a+0x1f3e0>
40a163: 00
40a164: b9 80 e4 43 00 mov $0x43e480,%ecx
40a169: c5 fa 11 15 73 ef 18 vmovss %xmm2,0x18ef73(%rip) # 5990e4 <a+0x1f3e4>
40a170: 00
40a171: 41 b8 00 b5 53 00 mov $0x53b500,%r8d
40a177: c5 fa 11 25 69 ef 18 vmovss %xmm4,0x18ef69(%rip) # 5990e8 <a+0x1f3e8>
40a17e: 00
40a17f: 41 b9 c0 b4 4b 00 mov $0x4bb4c0,%r9d
40a185: 68 00 91 59 00 push $0x599100
40a18a: 68 00 b5 4f 00 push $0x4fb500
40a18f: c5 f8 77 vzeroupper
40a192: c5 f8 57 c0 vxorps %xmm0,%xmm0,%xmm0
40a196: e8 d5 92 00 00 call 413470 <dummy>
40a19b: 48 83 c4 10 add $0x10,%rsp
40a19f: 41 ff c7 inc %r15d
40a1a2: 41 81 ff 40 42 0f 00 cmp $0xf4240,%r15d
40a1a9: 0f 82 cc fe ff ff jb 40a07b <s116+0x2b>
40a1af: 48 83 c3 10 add $0x10,%rbx
40a1b3: 33 f6 xor %esi,%esi
40a1b5: 48 89 df mov %rbx,%rdi
40a1b8: e8 03 6f ff ff call 4010c0 <gettimeofday@plt>
40a1bd: bf 74 f5 42 00 mov $0x42f574,%edi
40a1c2: 48 83 c4 10 add $0x10,%rsp
40a1c6: 5b pop %rbx
40a1c7: 41 5f pop %r15
40a1c9: 48 89 ec mov %rbp,%rsp
40a1cc: 5d pop %rbp
40a1cd: e9 de 97 00 00 jmp 4139b0 <calc_checksum>
40a1d2: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
40a1d9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
From: rguenth at gcc dot gnu.org @ 2022-10-18 10:28 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
So in the BB SLP attempt from loop vectorization (or in the BB SLP pass with
-fno-predictive-commoning) we get confused during DR group building because
of a duplicate access and fixup splitting the candidates at odd points.
For the reduced testcase we see
<bb 3> [local count: 1063004409]:
# i_16 = PHI <_5(5), 0(2)>
# ivtmp_18 = PHI <ivtmp_15(5), 511(2)>
_1 = i_16 + 1;
_2 = a[_1];
_3 = a[i_16];
_4 = _2 * _3;
a[i_16] = _4;
_5 = i_16 + 2;
_6 = a[_5];
_7 = a[_1];
_8 = _6 * _7;
a[_1] = _8;
ivtmp_15 = ivtmp_18 - 1;
if (ivtmp_15 != 0)
goto <bb 5>; [99.00%]
else
goto <bb 4>; [1.00%]
so a[_1] is loaded twice because CSE doesn't figure that a[i_16] cannot alias
it. That causes us to split the load group.
* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
From: rguenth at gcc dot gnu.org @ 2022-10-18 10:37 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Fixing the CSE in the testcase by doing
double a[1024];
void foo ()
{
for (int i = 0; i < 1022; i += 2)
{
double tem = a[i+1];
a[i] = tem * a[i];
a[i+1] = a[i+2] * tem;
}
}
gets us
t.c:4:21: note: Detected interleaving load a[i_15] and a[_1]
t.c:4:21: note: Detected interleaving store a[i_15] and a[_1]
t.c:4:21: note: Detected interleaving load of size 2
t.c:4:21: note: _2 = a[i_15];
t.c:4:21: note: tem_10 = a[_1];
t.c:4:21: note: Detected single element interleaving a[_4] step 16
t.c:4:21: note: Detected interleaving store of size 2
t.c:4:21: note: a[i_15] = _3;
t.c:4:21: note: a[_1] = _6;
in the loop pass and failed dependence analysis and
with the SLP pass (no predcom):
t.c:10:1: note: Detected interleaving load a[i_15] and a[_1]
t.c:10:1: note: Detected interleaving load a[i_15] and a[_4]
t.c:10:1: note: Detected interleaving store a[i_15] and a[_1]
t.c:10:1: note: Detected interleaving load of size 3
t.c:10:1: note: _2 = a[i_15];
t.c:10:1: note: tem_10 = a[_1];
t.c:10:1: note: _5 = a[_4];
t.c:10:1: note: Detected interleaving store of size 2
t.c:10:1: note: a[i_15] = _3;
t.c:10:1: note: a[_1] = _6;
which then runs into gap vect issues for how we'd vectorize the three
element load.
The dependence analysis is done by analyzing the validity of the
vectorized load/store placement and the implied motion of the
scalar load/store statements. The missed optimization here would
be the missed alternate placement that would be correct. But I
think the way we form groups would need to be revisited first here.
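To illustrate what such an alternate placement could look like for the reduced testcase, here is a hand-written sketch. It is not what the vectorizer emits and not a proposed implementation: plain two-element arrays stand in for vector registers, and the names foo_scalar, foo_two_lane and placements_agree are hypothetical. All loads of an iteration are placed first, then one two-lane multiply, then the grouped store.

```c
#include <assert.h>

#define N 1024

/* Scalar reference: the reduced testcase loop, taking the array
   as a parameter so two copies can be compared. */
void foo_scalar(double *a)
{
    assert(a != 0);
    for (int i = 0; i < N - 2; i += 2) {
        a[i]     = a[i + 1] * a[i];
        a[i + 1] = a[i + 2] * a[i + 1];
    }
}

/* Two-lane form of one scalar iteration (VF == 1, no unrolling).
   Hoisting the loads above the store to a[i] is safe because that
   store touches neither a[i + 1] nor a[i + 2]. */
void foo_two_lane(double *a)
{
    assert(a != 0);
    for (int i = 0; i < N - 2; i += 2) {
        double lhs[2] = { a[i + 1], a[i + 2] };  /* contiguous load at i + 1 */
        double rhs[2] = { a[i],     a[i + 1] };  /* contiguous load at i */
        double res[2] = { lhs[0] * rhs[0], lhs[1] * rhs[1] };
        a[i]     = res[0];                       /* contiguous store at i */
        a[i + 1] = res[1];
    }
}

/* Returns 1 if both forms leave identical contents in the array. */
int placements_agree(void)
{
    static double x[N], y[N];
    for (int i = 0; i < N; i++)
        x[i] = y[i] = 1.0 + (double)(i % 7) * 0.25;
    foo_scalar(x);
    foo_two_lane(y);
    for (int i = 0; i < N; i++)
        if (x[i] != y[i])
            return 0;
    return 1;
}
```

The sketch only covers a single scalar iteration per vector operation. Widening it across iterations would have to respect the cross-iteration reuse of a[i + 2].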
* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
From: juzhe.zhong at rivai dot ai @ 2023-10-07 9:10 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
JuzheZhong <juzhe.zhong at rivai dot ai> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |juzhe.zhong at rivai dot ai
--- Comment #6 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
Hi, Richi.
Recently, I am evaluating TSVC performance of GCC:
I found both RISC-V and aarch64 can SLP vectorize it:
https://godbolt.org/z/ssvTxxjeT
Both GCC-13 and trunk GCC can SLP it like LLVM (GCC-12 failed) but with
-fno-vect-cost-model.
I suspect we should adjust the vector cost model (I don't think we should adjust the
cost model in the target backend, since LLVM vectorizes such cases by default).
* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
From: rguenth at gcc dot gnu.org @ 2023-10-09 6:30 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Last reconfirmed|2021-03-05 00:00:00 |2023-10-9
--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to JuzheZhong from comment #6)
> Hi, Richi.
>
> Recently, I am evaluating TSVC performance of GCC:
>
> I found both RISC-V and aarch64 can SLP vectorize it:
>
> https://godbolt.org/z/ssvTxxjeT
>
> Both GCC-13 and trunk GCC can SLP it like LLVM (GCC-12 failed) but with
> -fno-vect-cost-model.
>
> I suspect we should adjust the vector cost model (I don't think we should adjust the
> cost model in the target backend, since LLVM vectorizes such cases by default).
We are only vectorizing part of the scalar code. The CSE issue still exists,
as do the resulting loop analysis issues.
* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
From: juzhe.zhong at rivai dot ai @ 2024-01-30 11:37 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
--- Comment #8 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
Hi, Richard.
Now I have found the time to look into GCC vectorization optimization.
I find this case:
_2 = a[_1];
...
a[i_16] = _4;
...
_7 = a[_1]; ---> This load should be eliminated and re-use _2.
Am I right ?
Could you guide me which pass should do this CSE optimization ?
Thanks.
* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
From: rguenth at gcc dot gnu.org @ 2024-01-30 12:26 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to JuzheZhong from comment #8)
> Hi, Richard.
>
> Now I have found the time to look into GCC vectorization optimization.
>
> I find this case:
>
> _2 = a[_1];
> ...
> a[i_16] = _4;
> ...
> _7 = a[_1]; ---> This load should be eliminated and re-use _2.
>
> Am I right ?
>
> Could you guide me which pass should do this CSE optimization ?
>
> Thanks.
In principle it's value-numbering. The reason it doesn't do this is
compile-time cost of doing full data-ref analysis. In principle it's
as "easy" as hooking that up into vn_reference_lookup_3 as part of the
early work therein to disambiguate more defs.
Iff we chose to refrain from valueizing any of the SSA uses we could
cache both the data references and the dependence resolution.
One could also think of doing very simple recognition of these
single index expressions and / or integrating this with other cases.
IIRC there's some warranting SCEV processing / niter analysis as well
for example to figure that
for (int i = 0; i < 128; ++i)
a[i] = 1;
return a[5];
returns 1.
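A compilable form of that snippet, as a sketch only. The function names init_then_read and init_then_read_at are hypothetical, and the second function is just a parameterized variant added for experimenting with other indices.

```c
#include <assert.h>

/* The snippet above wrapped in a function.  Every element
   a[0..127] is stored before the load, so the result is always 1.
   Whether the compiler folds the load to the constant is the
   optimization question, not the semantics. */
int init_then_read(void)
{
    int a[128];
    for (int i = 0; i < 128; ++i)
        a[i] = 1;
    return a[5];
}

/* Same pattern with a caller-chosen index inside the initialized
   range. */
int init_then_read_at(int idx)
{
    assert(idx >= 0 && idx < 128);
    int a[128];
    for (int i = 0; i < 128; ++i)
        a[i] = 1;
    return a[idx];
}
```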
* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
From: juzhe.zhong at rivai dot ai @ 2024-01-31 1:32 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
--- Comment #10 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
I think the root cause is that we treat i_16 and _1 as aliasing due to scalar
evolution:
(get_scalar_evolution
(scalar = i_16)
(scalar_evolution = {0, +, 2}<nw>_1))
(get_scalar_evolution
(scalar = _1)
(scalar_evolution = {1, +, 2}<nw>_1))
(I do not fully understand what these mean, though.)
diff --git a/gcc/tree-scalar-evolution.cc b/gcc/tree-scalar-evolution.cc
index 25e3130e2f1..2df6de67043 100644
--- a/gcc/tree-scalar-evolution.cc
+++ b/gcc/tree-scalar-evolution.cc
@@ -553,7 +553,7 @@ get_scalar_evolution (basic_block instantiated_below, tree scalar)
if (SSA_NAME_IS_DEFAULT_DEF (scalar))
res = scalar;
else
- res = *find_var_scev_info (instantiated_below, scalar);
+ res = scalar;
break;
case REAL_CST:
Ah... I tried an ugly hack which is definitely wrong (just for experiment) in
scalar evolution.
Then, we can vectorize it:
foo:
lui a1,%hi(a)
addi a1,a1,%lo(a)
li a2,511
li a3,0
vsetivli zero,2,e64,m1,ta,ma
.L2:
addiw a5,a3,1
slli a5,a5,3
add a5,a1,a5
fld fa5,0(a5)
slli a4,a3,3
add a4,a1,a4
vlse64.v v2,0(a4),zero
vle64.v v1,0(a5)
vfslide1down.vf v2,v2,fa5
addiw a2,a2,-1
vfmul.vv v1,v1,v2
vse64.v v1,0(a4)
addiw a3,a3,2
bne a2,zero,.L2
ret
I think we can add some simple memory access index recognition, but I don't
know where to add this recognition.
Would you mind giving me some more hints ?
Thanks.
* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
From: juzhe.zhong at rivai dot ai @ 2024-01-31 3:50 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
--- Comment #11 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
It seems that we should first fix this case (which Richard gave), which I think is
not a SCEV or value-numbering issue:
double a[1024];
void foo ()
{
for (int i = 0; i < 1022; i += 2)
{
double tem = a[i+1];
a[i] = tem * a[i];
a[i+1] = a[i+2] * tem;
}
}
auto.c:13:21: missed: couldn't vectorize loop
auto.c:15:14: missed: not vectorized: no vectype for stmt: tem_10 = a[_1];
scalar_type: double
* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
From: juzhe.zhong at rivai dot ai @ 2024-01-31 7:00 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
--- Comment #12 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
OK. It seems it has a data dependency issue:
missed: not vectorized, possible dependence between data-refs a[i_15] and
a[_4]
a[i_15] = _3; STMT 1
_4 = i_15 + 2;
_5 = a[_4]; STMT 2
STMT2 should not depend on STMT1.
It's recognized as dependency in vect_analyze_data_ref_dependence.
Is it reasonable to fix it in vect_analyze_data_ref_dependence ?
* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
From: rguenth at gcc dot gnu.org @ 2024-01-31 7:59 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to JuzheZhong from comment #12)
> OK. It seems it has data dependency issue:
>
> missed: not vectorized, possible dependence between data-refs a[i_15] and
> a[_4]
>
> a[i_15] = _3; STMT 1
> _4 = i_15 + 2;
> _5 = a[_4]; STMT 2
>
> STMT2 should not depend on STMT1.
>
> It's recognized as dependency in vect_analyze_data_ref_dependence.
>
> Is it reasonable to fix it in vect_analyze_data_ref_dependence ?
t2.c:4:21: note: dependence distance = 1.
t2.c:7:12: missed: not vectorized, possible dependence between data-refs
a[i_15] and a[_4]
t2.c:4:21: missed: bad data dependence.
so there's a cross iteration dependence with distance 1 - that's
(compute_affine_dependence
ref_a: a[i_15], stmt_a: a[i_15] = _3;
ref_b: a[_4], stmt_b: _5 = a[_4];
(analyze_overlapping_iterations
(chrec_a = {0, +, 2}<nw>_1)
(chrec_b = {2, +, 2}<nw>_1)
(analyze_siv_subscript
(analyze_subscript_affine_affine
(overlaps_a = [1 + 1 * x_1])
(overlaps_b = [0 + 1 * x_1]))
)
(overlap_iterations_a = [1 + 1 * x_1])
(overlap_iterations_b = [0 + 1 * x_1]))
(build_classic_dist_vector
dist_vector = (1
)
)
)
a read-after-write of a[i+2] after storing to a[i+1] in program order.
This would be fine with a VF of 1 only, but we are not really considering
that (a pure SLP vectorization w/o unrolling). Instead we start with the
assumption of classical vectorization using interleaving which has a
minimal VF of the number of lanes of the vector type with the largest
number of lanes as determined by vect_analyze_data_refs.
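As a rough illustration of where the distance of 1 comes from, the two chrecs can be compared by brute force. This is only a sketch under stated assumptions: GCC solves the subscript equation symbolically (see analyze_subscript_affine_affine in the dump above), whereas the hypothetical helper dependence_distance below simply enumerates iteration pairs. The struct affine_access is likewise an illustrative stand-in for a chrec {base, +, step}.

```c
#include <assert.h>
#include <limits.h>

/* An affine access {base, +, step}: in iteration x it touches
   index base + step * x. */
struct affine_access { int base; int step; };

/* Search the first niters iterations for pairs (xa, xb) in which
   both accesses touch the same index.  Returns the distance
   xa - xb if every overlapping pair has the same distance, or
   INT_MIN if there is no overlap or the distance is not constant. */
int dependence_distance(struct affine_access a,
                        struct affine_access b, int niters)
{
    assert(niters > 0);
    int dist = INT_MIN;
    for (int xa = 0; xa < niters; xa++) {
        for (int xb = 0; xb < niters; xb++) {
            if (a.base + a.step * xa != b.base + b.step * xb)
                continue;
            int d = xa - xb;
            if (dist == INT_MIN)
                dist = d;
            else if (dist != d)
                return INT_MIN;
        }
    }
    return dist;
}
```

With a = {0, +, 2} for the store a[i_15] and b = {2, +, 2} for the load a[_4], the indices coincide when 2 * xa == 2 + 2 * xb, that is xa == xb + 1, which matches overlaps_a = [1 + 1 * x_1] and overlaps_b = [0 + 1 * x_1] in the dump.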
We can delay this all a bit but then the SLP build will fail anyway:
t2.c:4:21: missed: Build SLP failed: different interleaving chains in one
node _5 = a[_4];
which is because we do
t2.c:4:21: note: === vect_analyze_data_ref_accesses ===
t2.c:4:21: note: Detected interleaving load a[i_15] and a[_1]
t2.c:4:21: note: Detected interleaving store a[i_15] and a[_1]
t2.c:4:21: note: Detected interleaving load of size 2
t2.c:4:21: note: _2 = a[i_15];
t2.c:4:21: note: tem_10 = a[_1];
t2.c:4:21: note: Detected single element interleaving a[_4] step 16
that is, we are splitting the chain because of the intermediate store
(that's kind-of OK-ish, heuristically it works for more cases).
We'd usually handle the VF == 1 cases also during BB vectorization on
the loop body, but we're only doing that when there was if-conversion
and the later stand-alone BB vectorization is after predictive commoning
which wrecks the loop. We should move predcom after BB vect for that.
That said, this PR is quite elaborate and it will touch some key design
issues in the vectorizer. I'd rather finally finish getting us to
work on the SLP representation only before touching all these delicate
things. The following allows the analysis to proceed a bit longer
with VF == 1. Not adjusting min_vf early might have issues, but the
change might work as-is and possibly allow some cases to be loop vectorized
with SLP and a low VF that we now fail to.
diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
index f592aeb8028..b16b4664e7b 100644
--- a/gcc/tree-vect-data-refs.cc
+++ b/gcc/tree-vect-data-refs.cc
@@ -589,7 +589,7 @@ vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr,
}
unsigned int abs_dist = abs (dist);
- if (abs_dist >= 2 && abs_dist < *max_vf)
+ if (abs_dist >= 1 && abs_dist < *max_vf)
{
/* The dependence distance requires reduction of the maximal
vectorization factor. */
@@ -4946,7 +4955,7 @@ vect_analyze_data_refs (vec_info *vinfo, poly_uint64 *min_vf, bool *fatal)
/* Adjust the minimal vectorization factor according to the
vector type. */
vf = TYPE_VECTOR_SUBPARTS (vectype);
- *min_vf = upper_bound (*min_vf, vf);
+ //*min_vf = upper_bound (*min_vf, vf);
/* Leave the BB vectorizer to pick the vector type later, based on
the final dataref group size and SLP node size. */
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 30b90d99925..7eab3d4bebc 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -2719,7 +2719,7 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal,
opt_result ok = opt_result::success ();
int res;
unsigned int max_vf = MAX_VECTORIZATION_FACTOR;
- poly_uint64 min_vf = 2;
+ poly_uint64 min_vf = 1;
loop_vec_info orig_loop_vinfo = NULL;
/* If we are dealing with an epilogue then orig_loop_vinfo points to the
* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
From: juzhe.zhong at rivai dot ai @ 2024-01-31 8:09 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
--- Comment #14 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
Thanks Richard.
It seems that we can't fix this issue for now. Is that right ?
If I understand correctly, do you mean we should wait after SLP representations
are finished and then revisit this PR?
* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
From: rguenther at suse dot de @ 2024-01-31 8:18 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
--- Comment #15 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 31 Jan 2024, juzhe.zhong at rivai dot ai wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
>
> --- Comment #14 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
> Thanks Richard.
>
> It seems that we can't fix this issue for now. Is that right ?
>
> If I understand correctly, do you mean we should wait after SLP representations
> are finished and then revisit this PR?
Yes.
* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
2021-03-04 23:01 [Bug middle-end/99395] New: s116 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
` (14 preceding siblings ...)
2024-01-31 8:18 ` rguenther at suse dot de
@ 2024-01-31 8:27 ` juzhe.zhong at rivai dot ai
2024-01-31 9:08 ` rguenther at suse dot de
` (3 subsequent siblings)
19 siblings, 0 replies; 21+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-01-31 8:27 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
--- Comment #16 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to rguenther@suse.de from comment #15)
That seems to be a big refactoring effort.
Is there anything I can do to help with the SLP representation work?
* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
2021-03-04 23:01 [Bug middle-end/99395] New: s116 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
` (15 preceding siblings ...)
2024-01-31 8:27 ` juzhe.zhong at rivai dot ai
@ 2024-01-31 9:08 ` rguenther at suse dot de
2024-01-31 9:19 ` juzhe.zhong at rivai dot ai
` (2 subsequent siblings)
19 siblings, 0 replies; 21+ messages in thread
From: rguenther at suse dot de @ 2024-01-31 9:08 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
--- Comment #17 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 31 Jan 2024, juzhe.zhong at rivai dot ai wrote:
> It seems to be a big refactor work.
It wouldn't be too bad if people didn't keep adding features that don't
implement SLP ...
> I wonder I can do anything to help with SLP representations ?
I hope to get back to this before stage1 re-opens and will post
another request for testing. It's really mostly going to be making
sure all paths have coverage which means testing all the various
architectures - I can only easily test x86. There's a branch
I worked on last year, refs/users/rguenth/heads/vect-force-slp,
which I use to hunt down cases not supporting SLP (it's a bit
overeager to trigger, and it has known holes so it's not really
a good starting point yet for folks to try other archs).
* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
2021-03-04 23:01 [Bug middle-end/99395] New: s116 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
` (16 preceding siblings ...)
2024-01-31 9:08 ` rguenther at suse dot de
@ 2024-01-31 9:19 ` juzhe.zhong at rivai dot ai
2024-01-31 9:56 ` rguenther at suse dot de
2024-04-07 21:33 ` pinskia at gcc dot gnu.org
19 siblings, 0 replies; 21+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-01-31 9:19 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
--- Comment #18 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to rguenther@suse.de from comment #17)
OK. It seems that you are almost done with that, but it needs more testing on
various targets.
So, if I want to work on improving vectorization (starting with TSVC),
I should avoid the cases that fail to vectorize because of data reference/dependence
analysis (e.g. the case in this PR, s116),
and avoid adding new features to the loop vectorizer, e.g. min/max reduction with
index (s315),
so as not to make your SLP refactoring work heavier.
Am I right?
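For reference, the dependence pattern in s116 can be sketched in plain C. This is only an illustration under assumptions: LEN_1D is shrunk to a hypothetical 32, and the function names are made up for this note (they are not TSVC or GCC names). Each group of five statements reads a[i] .. a[i + 5] before any of those elements has been overwritten, so loading the six inputs first and then storing the five products should give the same result as the original statement order.

```c
#include <stddef.h>
#include <string.h>

#define LEN_1D 32  /* hypothetical small size; TSVC uses a much larger array */

/* Scalar form, same statement order as the s116 loop quoted in this PR. */
void s116_scalar(float *a)
{
    for (int i = 0; i < LEN_1D - 5; i += 5) {
        a[i]     = a[i + 1] * a[i];
        a[i + 1] = a[i + 2] * a[i + 1];
        a[i + 2] = a[i + 3] * a[i + 2];
        a[i + 3] = a[i + 4] * a[i + 3];
        a[i + 4] = a[i + 5] * a[i + 4];
    }
}

/* "Loads first" form: read all six inputs of the group, then store the
   five products.  Every element is read before it is written, both within
   one iteration and across iterations, so the results do not change. */
void s116_loads_first(float *a)
{
    for (int i = 0; i < LEN_1D - 5; i += 5) {
        float in[6];
        for (int k = 0; k < 6; k++)
            in[k] = a[i + k];
        for (int k = 0; k < 5; k++)
            a[i + k] = in[k + 1] * in[k];
    }
}

/* Returns 1 if both forms produce bit-identical arrays for the same input. */
int s116_forms_agree(void)
{
    float x[LEN_1D], y[LEN_1D];
    for (int i = 0; i < LEN_1D; i++)
        x[i] = y[i] = 1.0f + (float)i * 0.25f;
    s116_scalar(x);
    s116_loads_first(y);
    return memcmp(x, y, sizeof x) == 0;
}
```

Note that the group has five lanes, which is not a power of two; the sketch says nothing about how a vectorizer would map those lanes onto vector registers.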
* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
2021-03-04 23:01 [Bug middle-end/99395] New: s116 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
` (17 preceding siblings ...)
2024-01-31 9:19 ` juzhe.zhong at rivai dot ai
@ 2024-01-31 9:56 ` rguenther at suse dot de
2024-04-07 21:33 ` pinskia at gcc dot gnu.org
19 siblings, 0 replies; 21+ messages in thread
From: rguenther at suse dot de @ 2024-01-31 9:56 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
--- Comment #19 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 31 Jan 2024, juzhe.zhong at rivai dot ai wrote:
> Ok. It seems that you almost done with that but needs more testing in
> various targets.
>
> So, if I want to work on optimizing vectorization (start with TSVC),
> I should avoid touching the failed vectorized due to data reference/dependence
> analysis (e.g. this PR case, s116).
It depends on the actual case - the one in this bug at least looks like
half of it might be dealt with by the refactoring.
> and avoid adding new features into loop vectorizer, e.g. min/max reduction with
> index (s315).
It's fine to add features if they work with SLP as well ;) Note that
in the future SLP will also handle the "single lane" case, but it doesn't
do that on trunk. Some features are difficult with multi-lane SLP
and probably not important in practice for that case; still, handling
single-lane SLP will be important, as otherwise the feature is lost.
> To not to make your SLP refactoring work heavier.
>
> Am I right ?
Yes. I've got early break vectorization to chase now, I was "finished"
with the parts I could exercise on x86_64 in autumn ...
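For readers who do not have TSVC at hand, the "min/max reduction with index" pattern mentioned above (s315) looks roughly like the sketch below. It is written in the style of that kernel, not copied from it; the struct and function names are hypothetical. The point is that the loop carries two values, the running maximum and the position where it was found, and the second is only updated when the first changes.

```c
#include <stddef.h>

/* Result of a max reduction that also records where the maximum was found. */
struct max_with_index {
    float value;
    size_t index;
};

/* Hypothetical helper in the style of TSVC s315.  Requires n >= 1.
   The strict '>' means the first occurrence of the maximum wins. */
struct max_with_index max_reduction_with_index(const float *a, size_t n)
{
    struct max_with_index r = { a[0], 0 };
    for (size_t i = 1; i < n; i++) {
        if (a[i] > r.value) {
            r.value = a[i];
            r.index = i;
        }
    }
    return r;
}
```

Each iteration produces one result element, so this is a loop with a single statement group rather than several interleaved ones like s116.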
* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
2021-03-04 23:01 [Bug middle-end/99395] New: s116 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
` (18 preceding siblings ...)
2024-01-31 9:56 ` rguenther at suse dot de
@ 2024-04-07 21:33 ` pinskia at gcc dot gnu.org
19 siblings, 0 replies; 21+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-04-07 21:33 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Severity|normal |enhancement
Thread overview: 21+ messages
2021-03-04 23:01 [Bug middle-end/99395] New: s116 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
2021-03-04 23:24 ` [Bug middle-end/99395] " hubicka at gcc dot gnu.org
2021-03-05 8:27 ` [Bug tree-optimization/99395] " rguenth at gcc dot gnu.org
2021-03-05 15:42 ` hubicka at gcc dot gnu.org
2022-10-18 10:28 ` rguenth at gcc dot gnu.org
2022-10-18 10:37 ` rguenth at gcc dot gnu.org
2023-10-07 9:10 ` juzhe.zhong at rivai dot ai
2023-10-09 6:30 ` rguenth at gcc dot gnu.org
2024-01-30 11:37 ` juzhe.zhong at rivai dot ai
2024-01-30 12:26 ` rguenth at gcc dot gnu.org
2024-01-31 1:32 ` juzhe.zhong at rivai dot ai
2024-01-31 3:50 ` juzhe.zhong at rivai dot ai
2024-01-31 7:00 ` juzhe.zhong at rivai dot ai
2024-01-31 7:59 ` rguenth at gcc dot gnu.org
2024-01-31 8:09 ` juzhe.zhong at rivai dot ai
2024-01-31 8:18 ` rguenther at suse dot de
2024-01-31 8:27 ` juzhe.zhong at rivai dot ai
2024-01-31 9:08 ` rguenther at suse dot de
2024-01-31 9:19 ` juzhe.zhong at rivai dot ai
2024-01-31 9:56 ` rguenther at suse dot de
2024-04-07 21:33 ` pinskia at gcc dot gnu.org