public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/109690] New: bad SLP vectorization on zen
From: hubicka at gcc dot gnu.org @ 2023-05-01 21:31 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109690

            Bug ID: 109690
           Summary: bad SLP vectorization on zen
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

model name      : AMD Ryzen 7 5800X 8-Core Processor

reproduces on my znver1 laptop too.

jh@ryzen3:~/gcc-kub/build/gcc> cat tt.c
int a[100];

[[gnu::noipa]]
void loop()
{
        for (int i = 0; i < 3; i++)
                a[i]+=a[i];
}
int
main()
{
        for (int j = 0; j < 1000000000; j++)
                loop ();
        return 0;
}
jh@ryzen3:~/gcc-kub/build/gcc> ./xgcc -B ./ -O2 -march=native tt.c ; perf stat ./a.out

 Performance counter stats for './a.out':

           2683.95 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
                52      page-faults:u             #   19.374 /sec
       13001141361      cycles:u                  #    4.844 GHz                      (83.31%)
            691180      stalled-cycles-frontend:u #    0.01% frontend cycles idle     (83.31%)
            101980      stalled-cycles-backend:u  #    0.00% backend cycles idle      (83.31%)
       12999928665      instructions:u            #    1.00  insn per cycle
                                                  #    0.00  stalled cycles per insn  (83.31%)
        3000013809      branches:u                #    1.118 G/sec                    (83.41%)
              1525      branch-misses:u           #    0.00% of all branches          (83.36%)

       2.684376360 seconds time elapsed

       2.684369000 seconds user
       0.000000000 seconds sys

jh@ryzen3:~/gcc-kub/build/gcc> ./xgcc -B ./ -O2 -march=native tt.c -fno-tree-vectorize ; perf stat ./a.out

 Performance counter stats for './a.out':

           1238.92 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
                52      page-faults:u             #   41.972 /sec
        6000338140      cycles:u                  #    4.843 GHz                      (83.21%)
            314660      stalled-cycles-frontend:u #    0.01% frontend cycles idle     (83.21%)
                 0      stalled-cycles-backend:u  #    0.00% backend cycles idle      (83.23%)
        7999796562      instructions:u            #    1.33  insn per cycle
                                                  #    0.00  stalled cycles per insn  (83.53%)
        2999887795      branches:u                #    2.421 G/sec                    (83.53%)
               698      branch-misses:u           #    0.00% of all branches          (83.28%)

       1.239116606 seconds time elapsed

       1.239121000 seconds user
       0.000000000 seconds sys
* [Bug target/109690] bad SLP vectorization on zen
From: pinskia at gcc dot gnu.org @ 2023-05-01 21:59 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109690

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Without -march=znver1, we get:

  vect__10.6_9 = MEM <vector(2) int> [(int *)&a];
  vect_patt_13.7_8 = VIEW_CONVERT_EXPR<vector(2) unsigned int>(vect__10.6_9);
  vect_patt_19.8_1 = vect_patt_13.7_8 << 1;
  vect_patt_25.9_2 = VIEW_CONVERT_EXPR<vector(2) int>(vect_patt_19.8_1);
  MEM <vector(2) int> [(int *)&a] = vect_patt_25.9_2;

Which looks reasonable.

But with -march=znver1 we get:

  _10 = a[0];
  _11 = _10 * 2;
  _16 = a[1];
  _17 = _16 * 2;
  _13 = {_11, _17};
  MEM <vector(2) int> [(int *)&a] = _13;

So this is definitely a cost model issue.
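For readers less used to GIMPLE dumps, the two forms in comment #1 can be approximated in C with the GCC/Clang vector extensions. This is only an illustrative sketch: the type names `v2si`/`v2su` and both function names are made up here, and the code shows what the two dumps compute, not what the vectorizer emits.

```c
#include <string.h>

/* Two-lane, 8-byte vectors: rough analogues of "vector(2) int" and
   "vector(2) unsigned int" in the dump.  Names are hypothetical. */
typedef int v2si __attribute__((vector_size(8)));
typedef unsigned int v2su __attribute__((vector_size(8)));

/* First form: one vector load, a lane-wise shift by 1 done in unsigned
   arithmetic, one vector store. */
void double_pair_shift(int *p)
{
    const v2su one = { 1u, 1u };
    v2su v;
    memcpy(&v, p, sizeof v);   /* vector load, reinterpreted as unsigned */
    v = v << one;              /* lane-wise shift */
    memcpy(p, &v, sizeof v);   /* vector store */
}

/* Second form: two scalar loads and multiplies, a vector constructed
   from the scalar results, then one vector store. */
void double_pair_construct(int *p)
{
    int x0 = p[0] * 2;
    int x1 = p[1] * 2;
    v2si v = { x0, x1 };       /* corresponds to _13 = {_11, _17} */
    memcpy(p, &v, sizeof v);
}
```

Both functions store the same values; the difference is that the second keeps the arithmetic scalar and only vectorizes the store, which is the shape reported with -march=znver1.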
* [Bug target/109690] bad SLP vectorization on zen
From: pinskia at gcc dot gnu.org @ 2023-05-01 22:01 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109690

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Even more interesting is:

        for (int i = 0; i < 3; i++)
                a[i] = ((unsigned)a[i]) << 1;

Produces different code.
* [Bug target/109690] bad SLP vectorization on zen
From: pinskia at gcc dot gnu.org @ 2023-05-01 22:11 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109690

--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
So without -march, we first get:

/app/example.cpp:14:24: note: Cost model analysis for part in loop 0:
  Vector cost: 28
  Scalar cost: 24

so we reject that, then try again, this time for V8QI, and that works.

With -march we get:

/app/example.cpp:14:24: note: Cost model analysis for part in loop 0:
  Vector cost: 32
  Scalar cost: 32

which we then accept, so it is not retried ...
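The two decisions reported in comment #3 (28 versus 24 rejected, 32 versus 32 accepted) are consistent with a non-strict comparison, where a tie counts as profitable. The sketch below encodes only that inference; the function name is hypothetical and this is not a copy of the vectorizer's code.

```c
#include <stdbool.h>

/* Hypothetical helper: accept the vectorized part when its cost does not
   exceed the scalar cost.  Assumption inferred from the thread: a tie is
   accepted. */
bool slp_part_is_profitable(unsigned vector_cost, unsigned scalar_cost)
{
    return vector_cost <= scalar_cost;
}
```

Under this rule the tie at 32 is what lets the less favourable code through: the first attempt already counts as profitable, so the narrower retry that produced the shift in the other configuration never happens.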
* [Bug target/109690] bad SLP vectorization on zen
From: rguenth at gcc dot gnu.org @ 2023-05-02  6:46 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109690

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2023-05-02
             Status|UNCONFIRMED                 |NEW
                 CC|                            |uros at gcc dot gnu.org

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
The x86 target chooses not to go the "compare costs" route but chooses the
first (usually biggest size) vectorization that is profitable.

So the interesting thing is that with -march=znver3 we have the integer
multiplication in V2SImode unsupported. Note that SLP chooses V2SImode
for the base V4SImode. With V8QImode (aka V2SImode) as the base mode,
pattern recog works to produce the desired shift.

I think the disconnect is that with V4SImode we have an integer multiplication
pattern (so no pattern is created) but with V2SImode we have not (looks like
the target chose not to implement that).

A solution would be to perform pattern recog in the vectorizable_* routines,
or at least, in the cases where it is straightforward, simply code-gen a
supported variant.

Thus, mulv2si3 is missing.
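The difference between "first profitable" and "compare costs" selection described in comment #4 can be illustrated with a toy model. Everything in this sketch is hypothetical: the `candidate` struct, both function names, and the idea of ranking by saving (scalar cost minus vector cost) are simplifications for illustration, not GCC's actual data structures or policy.

```c
#include <stddef.h>

/* Hypothetical record for one vectorization attempt in a given mode. */
struct candidate {
    const char *mode;
    unsigned vector_cost;
    unsigned scalar_cost;
};

/* "First profitable": return the index of the first candidate whose
   vector cost does not exceed its scalar cost, or -1 if none qualifies. */
int pick_first_profitable(const struct candidate *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (c[i].vector_cost <= c[i].scalar_cost)
            return (int)i;
    return -1;
}

/* "Compare costs" (simplified): among profitable candidates, return the
   one with the largest saving; earlier candidates win ties. */
int pick_best_saving(const struct candidate *c, size_t n)
{
    int best = -1;
    unsigned best_saving = 0;
    for (size_t i = 0; i < n; i++) {
        if (c[i].vector_cost > c[i].scalar_cost)
            continue;                       /* not profitable */
        unsigned saving = c[i].scalar_cost - c[i].vector_cost;
        if (best < 0 || saving > best_saving) {
            best = (int)i;
            best_saving = saving;
        }
    }
    return best;
}
```

With a first candidate that merely ties (cost 32 against 32, as in comment #3) and a made-up second candidate that saves a few units, the first strategy stops at the tie while the second keeps looking. That is the behaviour the comment attributes to the x86 target's choice.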
* [Bug target/109690] bad SLP vectorization on zen
From: ubizjak at gmail dot com @ 2023-05-04 20:46 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109690

--- Comment #5 from Uroš Bizjak <ubizjak at gmail dot com> ---
Created attachment 55002
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55002&action=edit
Patch that introduces mulv2si3

The compiled code with -march=znver1 is now the same as without the flag:

loop:
        vmovq   a(%rip), %xmm0
        sall    a+8(%rip)
        vpslld  $1, %xmm0, %xmm0
        vmovq   %xmm0, a(%rip)
        ret
* [Bug target/109690] bad SLP vectorization on zen
From: ubizjak at gmail dot com @ 2023-05-05 12:16 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109690

--- Comment #6 from Uroš Bizjak <ubizjak at gmail dot com> ---
The missing pattern was committed as part of:

commit r14-493-g919642fa4b2bc4c32910336dd200d53766801c80
Author: Uros Bizjak <ubizjak@gmail.com>
Date:   Fri May 5 14:10:18 2023 +0200

    i386: Introduce mulv2si3 instruction

    For SSE2 targets the expander unpacks input elements into the correct
    position in the V4SI vector and emits PMULUDQ instruction.  The output
    elements are then shuffled back to their positions in the V2SI vector.

    For SSE4 targets PMULLD instruction is emitted directly.

    gcc/ChangeLog:

            * config/i386/mmx.md (mulv2si3): New expander.
            (*mulv2si3): New insn pattern.

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/sse2-mmx-mult-vec.c: New test.
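The SSE2 sequence described in the commit message (unpack into a V4SI, PMULUDQ, shuffle back) can be sketched with intrinsics. This is an illustration of the idea on an x86 target, not the expander from mmx.md; the function name `mul_v2si_sse2` and the particular unpack and shuffle choices are assumptions made here.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Lane-wise multiply of two pairs of 32-bit integers using only SSE2.
   Hypothetical helper, written to mirror the commit description. */
void mul_v2si_sse2(const int32_t a[2], const int32_t b[2], int32_t out[2])
{
    /* Load the two-element inputs into the low 64 bits: {x0, x1, 0, 0}. */
    __m128i va = _mm_loadl_epi64((const __m128i *)a);
    __m128i vb = _mm_loadl_epi64((const __m128i *)b);

    /* Unpack so each input element sits in an even lane: {x0, x0, x1, x1}. */
    va = _mm_unpacklo_epi32(va, va);
    vb = _mm_unpacklo_epi32(vb, vb);

    /* PMULUDQ: multiplies lanes 0 and 2, giving two 64-bit products.
       The low 32 bits of each product are the wrapped 32-bit result. */
    __m128i prod = _mm_mul_epu32(va, vb);

    /* Shuffle the low halves of both products back into lanes 0 and 1. */
    prod = _mm_shuffle_epi32(prod, _MM_SHUFFLE(3, 1, 2, 0));

    _mm_storel_epi64((__m128i *)out, prod);
}
```

PMULUDQ is an unsigned multiply, but only the low 32 bits of each product are kept, and those bits are the same for signed and unsigned operands. With SSE4.1 available, a single `_mm_mullo_epi32` (PMULLD) on the loaded vectors would replace the unpack, multiply, and shuffle steps.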
* [Bug target/109690] bad SLP vectorization on zen
From: hubicka at gcc dot gnu.org @ 2023-05-05 22:40 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109690

--- Comment #7 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Thanks a lot!
There still seems to be a problem with vectorization, however.
On zen4 I now get:

jh@ryzen4:~/gcc/build/gcc> ./xgcc -B ./ -O2 -march=native slp.c ; perf stat ./a.out

 Performance counter stats for './a.out':

          1,835.21 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
                53      page-faults:u             #   28.880 /sec
    10,000,113,961      cycles:u                  #    5.449 GHz                      (83.22%)
            31,284      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (83.23%)
            64,771      stalled-cycles-backend:u  #    0.00% backend cycles idle      (83.43%)
     9,000,118,863      instructions:u            #    0.90  insn per cycle
                                                  #    0.00  stalled cycles per insn  (83.44%)
     2,999,980,507      branches:u                #    1.635 G/sec                    (83.44%)
             1,445      branch-misses:u           #    0.00% of all branches          (83.25%)

       1.835610338 seconds time elapsed

       1.835628000 seconds user
       0.000000000 seconds sys

jh@ryzen4:~/gcc/build/gcc> ./xgcc -B ./ -O2 -march=native -fno-tree-vectorize slp.c ; perf stat ./a.out

 Performance counter stats for './a.out':

          1,107.63 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
                53      page-faults:u             #   47.850 /sec
     6,000,774,547      cycles:u                  #    5.418 GHz                      (83.35%)
            32,208      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (83.39%)
            57,126      stalled-cycles-backend:u  #    0.00% backend cycles idle      (83.39%)
     7,999,763,446      instructions:u            #    1.33  insn per cycle
                                                  #    0.00  stalled cycles per insn  (83.39%)
     2,999,982,314      branches:u                #    2.708 G/sec                    (83.39%)
               911      branch-misses:u           #    0.00% of all branches          (83.09%)

       1.108032230 seconds time elapsed

       1.104079000 seconds user
       0.003985000 seconds sys

With -fno-tree-slp-vectorize I get:

loop:
.LFB0:
        .cfi_startproc
        sall    a(%rip)
        sall    a+4(%rip)
        sall    a+8(%rip)
        ret

which still seems to be faster.

It is the same if I do a[i]++:

jh@ryzen4:~/gcc/build/gcc> ./xgcc -B ./ -O2 -march=native slp2.c ; perf stat ./a.out

 Performance counter stats for './a.out':

          1,832.63 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
                54      page-faults:u             #   29.466 /sec
    10,000,535,003      cycles:u                  #    5.457 GHz                      (83.19%)
            36,576      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (83.34%)
            75,320      stalled-cycles-backend:u  #    0.00% backend cycles idle      (83.41%)
     9,999,890,371      instructions:u            #    1.00  insn per cycle
                                                  #    0.00  stalled cycles per insn  (83.41%)
     2,999,935,610      branches:u                #    1.637 G/sec                    (83.41%)
             1,447      branch-misses:u           #    0.00% of all branches          (83.23%)

       1.833046939 seconds time elapsed

       1.833062000 seconds user
       0.000000000 seconds sys

jh@ryzen4:~/gcc/build/gcc> ./xgcc -B ./ -O2 -march=native slp2.c -fno-tree-vectorize ; perf stat ./a.out

 Performance counter stats for './a.out':

          1,110.15 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
                51      page-faults:u             #   45.940 /sec
     6,000,096,821      cycles:u                  #    5.405 GHz                      (83.17%)
            28,459      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (83.43%)
            48,165      stalled-cycles-backend:u  #    0.00% backend cycles idle      (83.43%)
     7,999,665,012      instructions:u            #    1.33  insn per cycle
                                                  #    0.00  stalled cycles per insn  (83.43%)
     2,999,941,619      branches:u                #    2.702 G/sec                    (83.43%)
               719      branch-misses:u           #    0.00% of all branches          (83.12%)

       1.110557635 seconds time elapsed

       1.110575000 seconds user
       0.000000000 seconds sys

jh@ryzen4:~/gcc/build/gcc> cat slp2.c
int a[100];

[[gnu::noipa]]
void loop()
{
        for (int i = 0; i < 3; i++)
                a[i]++;
}
int
main()
{
        for (int j = 0; j < 1000000000; j++)
                loop ();
        return 0;
}
* [Bug target/109690] bad SLP vectorization on zen
From: amonakov at gcc dot gnu.org @ 2023-05-06  8:45 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109690

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #8 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Note that the vectorized variant is latency-bound: vector load in loop() waits
for the vector store into the same location done in the previous invocation of
'loop'. This makes the microbenchmark take 10 cycles per iteration (9 cycles as
the vector store forwarding latency, plus 1 cycle for the ALU op).

In contrast, the fully-scalar variant benefits from "memory renaming" in Zen 2
and Zen 4 (absent in Zen 3) where store-forwarding happens earlier in the
pipeline with zero-cycle latency. I think it bottlenecks on taken branches.
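The "10 cycles per iteration" figure in comment #8 can be checked against the perf counts in comment #7 by dividing the cycle count by the 1000000000 calls made in main(). A minimal sketch, with a hypothetical helper name:

```c
/* Average cycles per call of loop(), given a perf cycle count and the
   number of calls.  Hypothetical helper for checking the arithmetic. */
double cycles_per_call(unsigned long long cycles, unsigned long long calls)
{
    return (double)cycles / (double)calls;
}
```

With the zen4 numbers from comment #7, `cycles_per_call(10000113961ULL, 1000000000ULL)` is about 10.0 for the vectorized build, matching the 9 + 1 cycle estimate, and `cycles_per_call(6000774547ULL, 1000000000ULL)` is about 6.0 for the -fno-tree-vectorize build.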