public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/94427] New: 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9
@ 2020-03-31 17:33 jamborm at gcc dot gnu.org
  2020-03-31 23:12 ` [Bug tree-optimization/94427] " jamborm at gcc dot gnu.org
                     ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: jamborm at gcc dot gnu.org @ 2020-03-31 17:33 UTC (permalink / raw)
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94427

            Bug ID: 94427
           Summary: 456.hmmer is 8-17% slower when compiled at -Ofast
                    than with GCC 9
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jamborm at gcc dot gnu.org
                CC: rguenth at gcc dot gnu.org
            Blocks: 26163
  Target Milestone: ---
              Host: x86_64-linux
            Target: x86_64-linux

The SPECINT 2006 benchmark 456.hmmer runs 18% slower on AMD Zen2 CPUs, 15%
slower on AMD Zen1 CPUs and 8% slower on Intel Cascade Lake server CPUs when
built with trunk (revision 26b3e568a60) and just -Ofast (so with generic
march/mtune) than when compiled with GCC 9.

Bisecting the regression leads to commit:

commit 14ec49a7537004633b7fff859178cbebd288ca1d
Author: Richard Biener <rguenther@suse.de>
Date:   Tue Jul 2 07:35:23 2019 +0000

    re PR tree-optimization/58483 (missing optimization opportunity for
    const std::vector compared to std::array)

    2019-07-02  Richard Biener  <rguenther@suse.de>

            PR tree-optimization/58483
            * tree-ssa-scopedtables.c (avail_expr_hash): Use OEP_ADDRESS_OF
            for MEM_REF base hashing.
            (equal_mem_array_ref_p): Likewise for base comparison.

            * gcc.dg/tree-ssa/ssa-dom-cse-8.c: New testcase.

    From-SVN: r272922

The collected profiles are weird - almost the other way round from what I
would expect, because the *slow* version spends less time in the cold section
- but both spend, IMHO, too much time there.  The following data were
collected on AMD Zen2, but the numbers from Intel are similar in this regard.
What is different is that on Intel, perf stat reports a doubling of branch
misses - and because it has an older perf, it does not report front-end and
back-end stalls.

Before the aforementioned revision:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

         163360.87 msec task-clock:u              #    0.992 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
              7639      page-faults:u             #    0.047 K/sec
      525635661818      cycles:u
         809847511      stalled-cycles-frontend:u #    0.15% frontend cycles idle     (83.35%)
      299331255326      stalled-cycles-backend:u  #   56.95% backend cycles idle      (83.30%)
     1757801907547      instructions:u            #    3.34  insn per cycle
                                                  #    0.17  stalled cycles per insn  (83.34%)
      133496985084      branches:u                #  817.191 M/sec                    (83.35%)
         682351923      branch-misses:u           #    0.51% of all branches          (83.31%)

     164.659685804 seconds time elapsed

     163.325420000 seconds user
       0.022183000 seconds sys

# Samples: 637K of event 'cycles:u'
# Event count (approx.): 527143782584
#
# Overhead       Samples  Shared Object            Symbol
# ........  ............  .......................  ....................
#
    58.43%        372284  hmmer_peak.mine-std-gen  [.] P7Viterbi
    35.12%        223887  hmmer_peak.mine-std-gen  [.] P7Viterbi.cold
     2.59%         16418  hmmer_peak.mine-std-gen  [.] FChoose
     2.51%         15906  hmmer_peak.mine-std-gen  [.] sre_random

At the aforementioned revision:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

         191483.84 msec task-clock:u              #    0.994 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
              7639      page-faults:u             #    0.040 K/sec
      622159384711      cycles:u
         817604010      stalled-cycles-frontend:u #    0.13% frontend cycles idle     (83.31%)
      439972264588      stalled-cycles-backend:u  #   70.72% backend cycles idle      (83.34%)
     1707838992202      instructions:u            #    2.75  insn per cycle
                                                  #    0.26  stalled cycles per insn  (83.35%)
       91309384910      branches:u                #  476.852 M/sec                    (83.32%)
         655463713      branch-misses:u           #    0.72% of all branches          (83.33%)

     192.564513355 seconds time elapsed

     191.443774000 seconds user
       0.023978000 seconds sys

# Samples: 752K of event 'cycles:u'
# Event count (approx.): 622947549968
#
# Overhead       Samples  Shared Object             Symbol
# ........  ............  ........................  ....................
#
    83.68%        629645  hmmer_peak.small-std-gen  [.] P7Viterbi
    10.84%         81591  hmmer_peak.small-std-gen  [.] P7Viterbi.cold
     2.21%         16546  hmmer_peak.small-std-gen  [.] FChoose
     2.11%         15793  hmmer_peak.small-std-gen  [.] sre_random

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)

^ permalink raw reply	[flat|nested] 6+ messages in thread
* [Bug tree-optimization/94427] 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9
From: jamborm at gcc dot gnu.org @ 2020-03-31 23:12 UTC (permalink / raw)
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94427

--- Comment #1 from Martin Jambor <jamborm at gcc dot gnu.org> ---
OK, so it turns out the identified commit only allows us to shoot ourselves in
the foot - and there are now too few branches, not too many.  The hottest
loop, consuming most of the time, is:

Percent
  0.03 │ fb0:┌─→add    -0x8(%r9,%rcx,4),%eax
  5.03 │     │  mov    %eax,-0x4(%r13,%rcx,4)
  2.48 │     │  mov    -0x8(%r8,%rcx,4),%esi
  0.02 │     │  add    -0x8(%rdx,%rcx,4),%esi
  0.06 │     │  cmp    %eax,%esi
  4.49 │     │  cmovge %esi,%eax
 17.17 │     │  mov    %ecx,%esi
  0.03 │     │  cmp    $0xc521974f,%eax
  3.50 │     │  cmovl  %ebx,%eax        <----------- this used to be a branch
 21.84 │     │  mov    %eax,-0x4(%r13,%rcx,4)
  3.88 │     │  add    $0x1,%rcx
  0.00 │     │  cmp    %rdi,%rcx
  0.04 │     └──jne    fb0

where the marked conditional move was a branch one revision before because,
after fwprop3, the IL looked like:

  <bb 16> [local count: 955630217]:
  # cstore_281 = PHI <[fast_algorithms.c:142:53] sc_223(14), [fast_algorithms.c:142:53] cstore_249(15)>
  [fast_algorithms.c:142:49] MEM <int> [(void *)_72] = cstore_281;
  [fast_algorithms.c:143:13] _78 = [fast_algorithms.c:143:13] *_72;
  [fast_algorithms.c:143:10] if (_78 < -987654321)
    goto <bb 18>; [50.00%]
  else
    goto <bb 17>; [50.00%]

  <bb 17> [local count: 477815109]:

  <bb 18> [local count: 955630217]:
  # cstore_250 = PHI <[fast_algorithms.c:143:33] -987654321(16), [fast_algorithms.c:143:33] cstore_281(17)>
  [fast_algorithms.c:143:29] MEM <int> [(void *)_72] = cstore_250;
The aforementioned revision turned this into more optimized code:

  <bb 16> [local count: 955630217]:
  # cstore_281 = PHI <[fast_algorithms.c:142:53] sc_223(14), [fast_algorithms.c:142:53] _73(15)>
  [fast_algorithms.c:143:10] if (cstore_281 < -987654321)
    goto <bb 18>; [50.00%]
  else
    goto <bb 17>; [50.00%]

  <bb 17> [local count: 477815109]:

  <bb 18> [local count: 955630217]:
  # cstore_250 = PHI <[fast_algorithms.c:143:33] -987654321(16), [fast_algorithms.c:143:33] cstore_281(17)>
  [fast_algorithms.c:143:29] MEM <int> [(void *)_72] = cstore_250;

which phiopt3 then changed to:

  cstore_248 = MAX_EXPR <cstore_249, -987654321>;
  [fast_algorithms.c:143:29] MEM <int> [(void *)_72] = cstore_248;

and the expander apparently always expands MAX_EXPR into a conditional move if
it can(?).  When I hacked phiopt not to do the transformation for - ehm - any
GIMPLE_COND statement originating from source line 143, I recovered the
original run-time of the benchmark, on both AMD and Intel.
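For reference, the source pattern behind these dumps can be sketched in plain C. This is a simplified model, not a quote of fast_algorithms.c: the array names follow the source lines quoted in the perf annotations later in the thread, while the loop bounds, initialization and helper names are my own. It shows the branchy form and the max form that the MAX_EXPR rewrite amounts to at source level:

```c
#include <assert.h>

/* hmmer's -INFTY sentinel; matches the -987654321 (0xc521974f)
   constant visible in the dumps above. */
#define NEG_INFTY (-987654321)

/* Branchy form, following the quoted source lines: a predictable
   branch guards the rarely-needed underflow clamp. */
static void dc_update_branchy(int *dc, const int *mc, const int *tpdd,
                              const int *tpmd, int M)
{
    int k, sc;
    for (k = 1; k <= M; k++) {
        dc[k] = dc[k-1] + tpdd[k-1];
        if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
        if (dc[k] < NEG_INFTY) dc[k] = NEG_INFTY;
    }
}

static int imax(int a, int b) { return a > b ? a : b; }

/* What the MAX_EXPR form amounts to at source level: both conditionals
   become max operations, which the expander lowers to cmov (or, once
   vectorized, vpmaxsd). */
static void dc_update_max(int *dc, const int *mc, const int *tpdd,
                          const int *tpmd, int M)
{
    int k;
    for (k = 1; k <= M; k++)
        dc[k] = imax(imax(dc[k-1] + tpdd[k-1], mc[k-1] + tpmd[k-1]),
                     NEG_INFTY);
}
```

Both functions compute identical results; the regression discussed here is purely about how the second conditional is lowered (predictable branch vs. cmov).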
* [Bug tree-optimization/94427] 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9
From: rguenth at gcc dot gnu.org @ 2020-04-01  6:48 UTC (permalink / raw)
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94427

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
The hot/cold thing smells like PR90911, btw.

The cmov thing you identified is PR91154, which was fixed for archs that can
do the required SSE min/max operation - SSE 4.1, IIRC.  It is then also a
"duplicate" of the (various?) bugs about to-cmov-or-not-to-cmov where the
microarchitectural details are unclear.  So I fear that with just SSE2 we
can't do anything here but apply yet another heuristic special to hmmer that
will likely hurt another case (there are corresponding bugs asking for _more_
cmov...).

One of my first patches was to notice the special constant for the max/min
operation and not do conditional move expansion (plus prevent later RTL ifcvt
from applying, with the same heuristics).
* [Bug tree-optimization/94427] 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9
From: hubicka at gcc dot gnu.org @ 2023-08-07  9:19 UTC (permalink / raw)
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94427

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hubicka at gcc dot gnu.org

--- Comment #3 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
With profile feedback on zen4 we now get the hottest loop as:

       │           dc[k] = dc[k-1] + tpdd[k-1];
       │16b0:┌─→vmovd   (%r14,%rdx,1),%xmm2
  0.15 │     │  vpaddd  %xmm2,%xmm0,%xmm0
  5.79 │     │  vmovd   %xmm0,0x4(%rax,%rdx,1)
       │     │           if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
  8.04 │     │  vmovd   (%r15,%rdx,1),%xmm7
  0.16 │     │  vmovd   (%rcx,%rdx,1),%xmm2
  0.41 │     │  vpaddd  %xmm7,%xmm2,%xmm2
       │     │           if (dc[k] < -INFTY) dc[k] = -INFTY;
  0.71 │     │  vmovdqa _IO_stdin_used+0x560,%xmm7
  1.07 │     │  vpmaxsd %xmm7,%xmm2,%xmm2
  0.73 │     │  vpmaxsd %xmm0,%xmm2,%xmm0
  5.83 │     │  vmovd   %xmm0,0x4(%rax,%rdx,1)
       │     │           for (k = 1; k <= M; k++) {
  5.86 │     │  add     $0x4,%rdx
  1.40 │     ├──cmp     %rdx,%r13
  0.00 │     └──jne     16b0

and no time is spent in the cold section.  Without profile feedback I get:

    88.80%  hmmer_peak.chn-  [.] P7Viterbi
     5.10%  hmmer_peak.chn-  [.] sre_random
     2.31%  hmmer_peak.chn-  [.] FChoose
     1.35%  hmmer_peak.chn-  [.] RandomSequence

so no time in the cold section either.
The internal loop is almost identical:

       │17e0:┌─→vmovd   (%r11,%rdi,4),%xmm3
  0.07 │     │  mov     %rdi,%r8
  0.09 │     │  vpaddd  %xmm3,%xmm0,%xmm0
  6.20 │     │  vmovd   %xmm0,0x4(%rdx,%rdi,4)
       │     │           if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
  7.00 │     │  vmovd   (%rax,%rdi,4),%xmm6
  0.19 │     │  vmovd   (%r10,%rdi,4),%xmm3
  0.16 │     │  vpaddd  %xmm3,%xmm6,%xmm3
       │     │           if (dc[k] < -INFTY) dc[k] = -INFTY;
  1.25 │     │  vmovdqa _IO_stdin_used+0x600,%xmm6
  0.89 │     │  vpmaxsd %xmm6,%xmm3,%xmm3
  0.46 │     │  vpmaxsd %xmm0,%xmm3,%xmm0
  5.85 │     │  vmovd   %xmm0,0x4(%rdx,%rdi,4)
       │     │           for (k = 1; k <= M; k++) {
  6.02 │     │  inc     %rdi
  2.48 │     ├──cmp     %r8,%r9
  0.00 │     └──jne     17e0

However, the hottest loop seems to be somewhere completely different from the
one you showed, since this is an FP loop and yours seems to be integer?
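As an aside, the vpmaxsd instructions in both annotated loops perform a lane-wise signed 32-bit maximum. A scalar model of what one such instruction computes (illustrative, my naming; not code from the thread):

```c
#include <assert.h>

/* Scalar model of vpmaxsd on one 128-bit register: four independent
   signed 32-bit max operations.  This is the packed counterpart of
   the MAX_EXPR discussed earlier in the thread; plain SSE2 has no
   packed signed 32-bit max - pmaxsd arrived with SSE 4.1, which is
   why the SSE2-only discussion above falls back to scalar cmov. */
static void pmaxsd_model(int dst[4], const int a[4], const int b[4])
{
    for (int i = 0; i < 4; i++)
        dst[i] = a[i] > b[i] ? a[i] : b[i];
}
```

In the annotated loop, one operand of each vpmaxsd is a constant vector of -INFTY values (loaded via vmovdqa from a read-only section), so a single instruction clamps four scores at once.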
* [Bug tree-optimization/94427] 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9
From: hubicka at gcc dot gnu.org @ 2023-08-07  9:38 UTC (permalink / raw)
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94427

--- Comment #4 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
It is the same loop - it was float only in my mind (since the function returns
a float value :)

With loop splitting we no longer have the last-iteration check, but we still
have the underflow checks, which are indeed likely to be well predicted and
which, in the unvectorized version, may make sense not to if-convert.

So I guess that in the unvectorized loop the 100% predictable conditional
should still be a win, but vectorization should likely outweigh that?
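The predictability argument can be made concrete with a small sketch (synthetic data and naming of my own, not from the thread): for realistic scores the underflow clamp almost never fires, which is exactly the case a branch predictor handles at full speed, while a cmov pays its data dependency on every iteration.

```c
#include <assert.h>

/* Count how often the underflow clamp would actually be taken.
   A taken rate near zero means the branchy form is essentially free
   on a predictor - Hubicka's point about the unvectorized loop. */
static int clamp_count(const int *v, int n)
{
    int taken = 0;
    for (int i = 0; i < n; i++)
        if (v[i] < -987654321)
            taken++;
    return taken;
}
```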
* [Bug tree-optimization/94427] 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9
From: rguenth at gcc dot gnu.org @ 2023-08-07  9:44 UTC (permalink / raw)
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94427

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Jan Hubicka from comment #4)
> It is the same loop - it was float only in my mind (since the function
> return float value :)
>
> With loop splitting we no longer have the last iteration check, but we still
> have the underflow checks that are indeed likely predictable well and in
> unvectorized version may make sense to be not if converted.
>
> So I guess in unvectorized loop the 100% predictable conditonal should be
> still a win but vectorization should likely outweight that?

I think so.
end of thread, other threads: [~2023-08-07  9:44 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-31 17:33 [Bug tree-optimization/94427] New: 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9 jamborm at gcc dot gnu.org
2020-03-31 23:12 ` [Bug tree-optimization/94427] " jamborm at gcc dot gnu.org
2020-04-01  6:48 ` rguenth at gcc dot gnu.org
2023-08-07  9:19 ` hubicka at gcc dot gnu.org
2023-08-07  9:38 ` hubicka at gcc dot gnu.org
2023-08-07  9:44 ` rguenth at gcc dot gnu.org