public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/109690] New: bad SLP vectorization on zen
From: hubicka at gcc dot gnu.org @ 2023-05-01 21:31 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109690
Bug ID: 109690
Summary: bad SLP vectorization on zen
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: hubicka at gcc dot gnu.org
Target Milestone: ---
model name : AMD Ryzen 7 5800X 8-Core Processor
reproduces on my znver1 laptop too.
jh@ryzen3:~/gcc-kub/build/gcc> cat tt.c
int a[100];
[[gnu::noipa]]
void loop()
{
  for (int i = 0; i < 3; i++)
    a[i]+=a[i];
}
int
main()
{
  for (int j = 0; j < 1000000000; j++)
    loop ();
  return 0;
}
jh@ryzen3:~/gcc-kub/build/gcc> ./xgcc -B ./ -O2 -march=native tt.c ; perf stat ./a.out
 Performance counter stats for './a.out':
           2683.95 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
                52      page-faults:u             #   19.374 /sec
       13001141361      cycles:u                  #    4.844 GHz                      (83.31%)
            691180      stalled-cycles-frontend:u #    0.01% frontend cycles idle     (83.31%)
            101980      stalled-cycles-backend:u  #    0.00% backend cycles idle      (83.31%)
       12999928665      instructions:u            #    1.00  insn per cycle
                                                  #    0.00  stalled cycles per insn  (83.31%)
        3000013809      branches:u                #    1.118 G/sec                    (83.41%)
              1525      branch-misses:u           #    0.00% of all branches          (83.36%)
       2.684376360 seconds time elapsed
       2.684369000 seconds user
       0.000000000 seconds sys
jh@ryzen3:~/gcc-kub/build/gcc> ./xgcc -B ./ -O2 -march=native tt.c -fno-tree-vectorize ; perf stat ./a.out
 Performance counter stats for './a.out':
           1238.92 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
                52      page-faults:u             #   41.972 /sec
        6000338140      cycles:u                  #    4.843 GHz                      (83.21%)
            314660      stalled-cycles-frontend:u #    0.01% frontend cycles idle     (83.21%)
                 0      stalled-cycles-backend:u  #    0.00% backend cycles idle      (83.23%)
        7999796562      instructions:u            #    1.33  insn per cycle
                                                  #    0.00  stalled cycles per insn  (83.53%)
        2999887795      branches:u                #    2.421 G/sec                    (83.53%)
               698      branch-misses:u           #    0.00% of all branches          (83.28%)
       1.239116606 seconds time elapsed
       1.239121000 seconds user
       0.000000000 seconds sys
* [Bug target/109690] bad SLP vectorization on zen
From: pinskia at gcc dot gnu.org @ 2023-05-01 21:59 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109690
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Without -march=znver1, we get:
vect__10.6_9 = MEM <vector(2) int> [(int *)&a];
vect_patt_13.7_8 = VIEW_CONVERT_EXPR<vector(2) unsigned int>(vect__10.6_9);
vect_patt_19.8_1 = vect_patt_13.7_8 << 1;
vect_patt_25.9_2 = VIEW_CONVERT_EXPR<vector(2) int>(vect_patt_19.8_1);
MEM <vector(2) int> [(int *)&a] = vect_patt_25.9_2;
Which looks reasonable. But with -march=znver1 we get:
_10 = a[0];
_11 = _10 * 2;
_16 = a[1];
_17 = _16 * 2;
_13 = {_11, _17};
MEM <vector(2) int> [(int *)&a] = _13;
So this is definitely a cost model issue.
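For readers less used to GIMPLE, the two dumps above can be paraphrased in C with GCC's vector extensions. This is only a sketch of what each dump computes; the function names and the check routine are made up for illustration and do not come from the bug report.

```c
#include <assert.h>
#include <string.h>

/* GCC vector extensions: two 32-bit lanes, matching "vector(2) int".  */
typedef int v2si __attribute__ ((vector_size (8)));
typedef unsigned int v2su __attribute__ ((vector_size (8)));

int a[100];

/* First dump: one vector load, a shift done on the unsigned view,
   one vector store.  */
void
double_pair_vector (void)
{
  v2si v;
  memcpy (&v, a, sizeof v);     /* MEM <vector(2) int> [&a] */
  v2su u = (v2su) v;            /* VIEW_CONVERT_EXPR, bits unchanged */
  u = u << (v2su) { 1, 1 };     /* shift each lane left by one */
  v = (v2si) u;
  memcpy (a, &v, sizeof v);
}

/* Second dump: two scalar loads and multiplies, then a vector is
   built from the scalars only so that the store can be a vector store.  */
void
double_pair_mixed (void)
{
  int t0 = a[0] * 2;
  int t1 = a[1] * 2;
  v2si v = { t0, t1 };
  memcpy (a, &v, sizeof v);
}

/* Hypothetical self-check: both variants double a[0] and a[1] and
   leave a[2] alone.  */
void
check_pair_variants (void)
{
  a[0] = 3; a[1] = -5; a[2] = 7;
  double_pair_vector ();
  assert (a[0] == 6 && a[1] == -10 && a[2] == 7);
  double_pair_mixed ();
  assert (a[0] == 12 && a[1] == -20 && a[2] == 7);
}
```

Both functions produce the same values; the difference that matters here is the scalar-to-vector construction in the second one.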
* [Bug target/109690] bad SLP vectorization on zen
From: pinskia at gcc dot gnu.org @ 2023-05-01 22:01 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109690
--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Even more interesting is that:
  for (int i = 0; i < 3; i++)
    a[i] = ((unsigned)a[i]) << 1;
produces different code.
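To compare the two source forms side by side, they can be put in one file as below. This is a sketch only: the function names and the check routine are invented here, and the inputs are kept small because the signed addition overflows for large values while the unsigned shift wraps.

```c
#include <assert.h>

int a[100];

/* Form from the original report: signed self-addition.  */
void
loop_add (void)
{
  for (int i = 0; i < 3; i++)
    a[i] += a[i];
}

/* Form from this comment: explicit shift on the unsigned value.  */
void
loop_shift (void)
{
  for (int i = 0; i < 3; i++)
    a[i] = ((unsigned) a[i]) << 1;
}

/* Hypothetical self-check: for small inputs both loops double the
   first three elements.  */
void
check_loops (void)
{
  a[0] = 1; a[1] = -2; a[2] = 3; a[3] = 4;
  loop_add ();
  assert (a[0] == 2 && a[1] == -4 && a[2] == 6 && a[3] == 4);
  loop_shift ();
  assert (a[0] == 4 && a[1] == -8 && a[2] == 12 && a[3] == 4);
}
```

Compiling this with -O2 -S, with and without -march=znver1, and diffing the two function bodies is one way to see the difference.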
* [Bug target/109690] bad SLP vectorization on zen
From: pinskia at gcc dot gnu.org @ 2023-05-01 22:11 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109690
--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Without -march, we first get:
/app/example.cpp:14:24: note: Cost model analysis for part in loop 0:
  Vector cost: 28
  Scalar cost: 24
so we reject that, try again, this time with V8QI, and then it works.
With -march we get:
/app/example.cpp:14:24: note: Cost model analysis for part in loop 0:
  Vector cost: 32
  Scalar cost: 32
which we then accept, so there is no retry.
* [Bug target/109690] bad SLP vectorization on zen
From: rguenth at gcc dot gnu.org @ 2023-05-02 6:46 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109690
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2023-05-02
             Status|UNCONFIRMED                 |NEW
                 CC|                            |uros at gcc dot gnu.org
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
The x86 target chooses not to go the "compare costs" route but instead chooses
the first (usually biggest size) vectorization that is profitable.
So the interesting thing is that with -march=znver3 integer multiplication
in V2SImode is unsupported. Note that SLP chooses V2SImode for the base
V4SImode.
With the V8QImode (aka V2SImode) base mode, pattern recog works to produce
the desired shift.
I think the disconnect is that with V4SImode we have an integer multiplication
instruction pattern (so no vectorizer pattern is created) but with V2SImode we
have not (looks like the target chose not to implement that).
A solution would be to perform pattern recog in the vectorizable_* routines
or, at least in the cases where it is straightforward, to simply code-gen a
supported variant.
Thus, mulv2si3 is missing.
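The "first profitable" policy can be illustrated with a toy model. Everything below is hypothetical: the struct and function names are not GCC data structures, and the acceptance rule (vector cost not greater than scalar cost) is an assumption that merely agrees with the two outcomes reported in comment #3.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one vectorization attempt.  */
struct attempt
{
  const char *label;
  int vector_cost;
  int scalar_cost;
};

/* Assumed rule: 32 vs 32 was accepted and 28 vs 24 was rejected,
   which is consistent with vector_cost <= scalar_cost.  */
static int
profitable (const struct attempt *t)
{
  return t->vector_cost <= t->scalar_cost;
}

/* "First profitable" policy: stop at the first acceptable attempt;
   later attempts are never looked at.  Returns its index, or -1.  */
int
first_profitable (const struct attempt *t, size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (profitable (&t[i]))
      return (int) i;
  return -1;
}

/* Hypothetical self-check using the costs quoted in comment #3.  */
void
check_first_profitable (void)
{
  struct attempt generic = { "first attempt, no -march", 28, 24 };
  struct attempt znver = { "first attempt, -march", 32, 32 };
  assert (first_profitable (&generic, 1) == -1); /* rejected: retry follows */
  assert (first_profitable (&znver, 1) == 0);    /* accepted: no retry */
}
```

Under such a policy a break-even first attempt wins even when a later mode would have produced better code, which matches what the dumps show.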
* [Bug target/109690] bad SLP vectorization on zen
From: ubizjak at gmail dot com @ 2023-05-04 20:46 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109690
--- Comment #5 from Uroš Bizjak <ubizjak at gmail dot com> ---
Created attachment 55002
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55002&action=edit
Patch that introduces mulv2si3
The compiled code with -march=znver1 is now the same as without the flag:
loop:
        vmovq   a(%rip), %xmm0
        sall    a+8(%rip)
        vpslld  $1, %xmm0, %xmm0
        vmovq   %xmm0, a(%rip)
        ret
* [Bug target/109690] bad SLP vectorization on zen
From: ubizjak at gmail dot com @ 2023-05-05 12:16 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109690
--- Comment #6 from Uroš Bizjak <ubizjak at gmail dot com> ---
The missing pattern was committed as part of:
commit r14-493-g919642fa4b2bc4c32910336dd200d53766801c80
Author: Uros Bizjak <ubizjak@gmail.com>
Date: Fri May 5 14:10:18 2023 +0200
i386: Introduce mulv2si3 instruction
For SSE2 targets the expander unpacks input elements into the correct
position in the V4SI vector and emits PMULUDQ instruction. The output
elements are then shuffled back to their positions in the V2SI vector.
For SSE4 targets PMULLD instruction is emitted directly.
gcc/ChangeLog:
* config/i386/mmx.md (mulv2si3): New expander.
(*mulv2si3): New insn pattern.
gcc/testsuite/ChangeLog:
* gcc.target/i386/sse2-mmx-mult-vec.c: New test.
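The SSE2 sequence described in the commit message can be modelled in portable C. This is a scalar model written for illustration, not the expander from mmx.md: the function names are invented, and the exact unpack and shuffle instructions the real expander emits may differ. It relies on the low 32 bits of an unsigned 32x32 product being the same as those of the signed product.

```c
#include <assert.h>
#include <stdint.h>

/* Model of PMULUDQ: multiply the even 32-bit lanes (0 and 2) as
   unsigned values, giving two 64-bit products.  */
static void
pmuludq_model (const uint32_t x[4], const uint32_t y[4], uint64_t out[2])
{
  out[0] = (uint64_t) x[0] * y[0];
  out[1] = (uint64_t) x[2] * y[2];
}

/* Model of a V2SI multiply on an SSE2-only target.  */
void
mulv2si_sse2_model (const int32_t a[2], const int32_t b[2], int32_t r[2])
{
  /* Unpack: move the two input elements to lanes 0 and 2.  */
  uint32_t wa[4] = { (uint32_t) a[0], 0, (uint32_t) a[1], 0 };
  uint32_t wb[4] = { (uint32_t) b[0], 0, (uint32_t) b[1], 0 };
  uint64_t prod[2];

  pmuludq_model (wa, wb, prod);

  /* Shuffle back: keep the low 32 bits of each 64-bit product.  */
  r[0] = (int32_t) (uint32_t) prod[0];
  r[1] = (int32_t) (uint32_t) prod[1];
}

/* Hypothetical self-check with one negative operand.  */
void
check_mulv2si_model (void)
{
  int32_t a[2] = { 3, -5 };
  int32_t b[2] = { 7, 4 };
  int32_t r[2];
  mulv2si_sse2_model (a, b, r);
  assert (r[0] == 21 && r[1] == -20);
}
```

On SSE4 targets none of this shuffling is needed, since PMULLD multiplies 32-bit lanes directly.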
* [Bug target/109690] bad SLP vectorization on zen
From: hubicka at gcc dot gnu.org @ 2023-05-05 22:40 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109690
--- Comment #7 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Thanks a lot! However, there still seems to be a problem with vectorization.
On zen4 I now get:
jh@ryzen4:~/gcc/build/gcc> ./xgcc -B ./ -O2 -march=native slp.c ; perf stat ./a.out
 Performance counter stats for './a.out':
          1,835.21 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
                53      page-faults:u             #   28.880 /sec
    10,000,113,961      cycles:u                  #    5.449 GHz                      (83.22%)
            31,284      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (83.23%)
            64,771      stalled-cycles-backend:u  #    0.00% backend cycles idle      (83.43%)
     9,000,118,863      instructions:u            #    0.90  insn per cycle
                                                  #    0.00  stalled cycles per insn  (83.44%)
     2,999,980,507      branches:u                #    1.635 G/sec                    (83.44%)
             1,445      branch-misses:u           #    0.00% of all branches          (83.25%)
       1.835610338 seconds time elapsed
       1.835628000 seconds user
       0.000000000 seconds sys
jh@ryzen4:~/gcc/build/gcc> ./xgcc -B ./ -O2 -march=native -fno-tree-vectorize slp.c ; perf stat ./a.out
 Performance counter stats for './a.out':
          1,107.63 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
                53      page-faults:u             #   47.850 /sec
     6,000,774,547      cycles:u                  #    5.418 GHz                      (83.35%)
            32,208      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (83.39%)
            57,126      stalled-cycles-backend:u  #    0.00% backend cycles idle      (83.39%)
     7,999,763,446      instructions:u            #    1.33  insn per cycle
                                                  #    0.00  stalled cycles per insn  (83.39%)
     2,999,982,314      branches:u                #    2.708 G/sec                    (83.39%)
               911      branch-misses:u           #    0.00% of all branches          (83.09%)
       1.108032230 seconds time elapsed
       1.104079000 seconds user
       0.003985000 seconds sys
With -fno-tree-slp-vectorize I get:
loop:
.LFB0:
        .cfi_startproc
        sall    a(%rip)
        sall    a+4(%rip)
        sall    a+8(%rip)
        ret
which still seems to be faster. It is the same if I do a[i]++:
jh@ryzen4:~/gcc/build/gcc> ./xgcc -B ./ -O2 -march=native slp2.c ; perf stat ./a.out
 Performance counter stats for './a.out':
          1,832.63 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
                54      page-faults:u             #   29.466 /sec
    10,000,535,003      cycles:u                  #    5.457 GHz                      (83.19%)
            36,576      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (83.34%)
            75,320      stalled-cycles-backend:u  #    0.00% backend cycles idle      (83.41%)
     9,999,890,371      instructions:u            #    1.00  insn per cycle
                                                  #    0.00  stalled cycles per insn  (83.41%)
     2,999,935,610      branches:u                #    1.637 G/sec                    (83.41%)
             1,447      branch-misses:u           #    0.00% of all branches          (83.23%)
       1.833046939 seconds time elapsed
       1.833062000 seconds user
       0.000000000 seconds sys
jh@ryzen4:~/gcc/build/gcc> ./xgcc -B ./ -O2 -march=native slp2.c -fno-tree-vectorize ; perf stat ./a.out
 Performance counter stats for './a.out':
          1,110.15 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
                51      page-faults:u             #   45.940 /sec
     6,000,096,821      cycles:u                  #    5.405 GHz                      (83.17%)
            28,459      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (83.43%)
            48,165      stalled-cycles-backend:u  #    0.00% backend cycles idle      (83.43%)
     7,999,665,012      instructions:u            #    1.33  insn per cycle
                                                  #    0.00  stalled cycles per insn  (83.43%)
     2,999,941,619      branches:u                #    2.702 G/sec                    (83.43%)
               719      branch-misses:u           #    0.00% of all branches          (83.12%)
       1.110557635 seconds time elapsed
       1.110575000 seconds user
       0.000000000 seconds sys
jh@ryzen4:~/gcc/build/gcc> cat slp2.c
int a[100];
[[gnu::noipa]]
void loop()
{
  for (int i = 0; i < 3; i++)
    a[i]++;
}
int
main()
{
  for (int j = 0; j < 1000000000; j++)
    loop ();
  return 0;
}
* [Bug target/109690] bad SLP vectorization on zen
From: amonakov at gcc dot gnu.org @ 2023-05-06 8:45 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109690
Alexander Monakov <amonakov at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org
--- Comment #8 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Note that the vectorized variant is latency-bound: vector load in loop() waits
for the vector store into the same location done in the previous invocation of
'loop'. This makes the microbenchmark take 10 cycles per iteration (9 cycles as
the vector store forwarding latency, plus 1 cycle for the ALU op).
In contrast, the fully-scalar variant benefits from "memory renaming" in Zen 2
and Zen 4 (absent in Zen 3) where store-forwarding happens earlier in the
pipeline with zero-cycle latency. I think it bottlenecks on taken branches.
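The 10-cycle figure can be cross-checked against the perf numbers in comment #7 by dividing the cycles:u count by the number of loop() calls. The sketch below does only that arithmetic; the helper names are made up, and the call count of 1000000000 is taken from main() in the test programs.

```c
#include <assert.h>

/* Number of loop() calls made by main() in the test programs.  */
#define CALLS 1000000000.0

/* Average cycles spent per call of loop(), including call overhead.  */
double
cycles_per_call (double total_cycles)
{
  return total_cycles / CALLS;
}

/* Hypothetical self-check using the zen4 cycles:u values from
   comment #7 for slp.c.  */
void
check_cycles_per_call (void)
{
  double vec = cycles_per_call (10000113961.0);   /* -O2 -march=native */
  double scalar = cycles_per_call (6000774547.0); /* -fno-tree-vectorize */
  assert (vec > 9.99 && vec < 10.01);
  assert (scalar > 5.99 && scalar < 6.01);
}
```

So the vectorized build spends about 10 cycles per call, in line with the estimate above, against about 6 for the scalar build.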