public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/94427] New: 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9
@ 2020-03-31 17:33 jamborm at gcc dot gnu.org
2020-03-31 23:12 ` [Bug tree-optimization/94427] " jamborm at gcc dot gnu.org
` (4 more replies)
0 siblings, 5 replies; 6+ messages in thread
From: jamborm at gcc dot gnu.org @ 2020-03-31 17:33 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94427
Bug ID: 94427
Summary: 456.hmmer is 8-17% slower when compiled at -Ofast than
with GCC 9
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: jamborm at gcc dot gnu.org
CC: rguenth at gcc dot gnu.org
Blocks: 26163
Target Milestone: ---
Host: x86_64-linux
Target: x86_64-linux
SPECINT 2006 benchmark 456.hmmer runs 18% slower on AMD Zen2 CPUs, 15%
on AMD Zen1 CPUs and 8% on Intel Cascade Lake server CPUs when built
with trunk (revision 26b3e568a60) and just -Ofast (so with generic
march/mtune) than when compiled with GCC 9.
Bisecting the regression leads to commit:
commit 14ec49a7537004633b7fff859178cbebd288ca1d
Author: Richard Biener <rguenther@suse.de>
Date: Tue Jul 2 07:35:23 2019 +0000
re PR tree-optimization/58483 (missing optimization opportunity for const
std::vector compared to std::array)
2019-07-02 Richard Biener <rguenther@suse.de>
PR tree-optimization/58483
* tree-ssa-scopedtables.c (avail_expr_hash): Use OEP_ADDRESS_OF
for MEM_REF base hashing.
(equal_mem_array_ref_p): Likewise for base comparison.
* gcc.dg/tree-ssa/ssa-dom-cse-8.c: New testcase.
From-SVN: r272922
Collected profiles are weird, almost the other way round from what I
would expect, because the *slow* version spends less time in the cold
section - but both spend IMHO too much time there. The following data
were collected on AMD Zen2, but those from Intel are similar in this
regard. What is different is that on Intel perf stat reports a doubling
of branch misses - and because it has an older perf, it does not report
front-end/back-end stalls.
Before the aforementioned revision:
Performance counter stats for 'numactl -C 0 -l specinvoke':
163360.87 msec task-clock:u # 0.992 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
7639 page-faults:u # 0.047 K/sec
525635661818 cycles:u #
809847511 stalled-cycles-frontend:u # 0.15% frontend cycles
idle (83.35%)
299331255326 stalled-cycles-backend:u # 56.95% backend cycles
idle (83.30%)
1757801907547 instructions:u # 3.34 insn per cycle
# 0.17 stalled cycles per
insn (83.34%)
133496985084 branches:u # 817.191 M/sec
(83.35%)
682351923 branch-misses:u # 0.51% of all branches
(83.31%)
164.659685804 seconds time elapsed
163.325420000 seconds user
0.022183000 seconds sys
# Samples: 637K of event 'cycles:u'
# Event count (approx.): 527143782584
#
# Overhead Samples Shared Object Symbol
# ........ ............ ....................... ....................
#
58.43% 372284 hmmer_peak.mine-std-gen [.] P7Viterbi
35.12% 223887 hmmer_peak.mine-std-gen [.] P7Viterbi.cold
2.59% 16418 hmmer_peak.mine-std-gen [.] FChoose
2.51% 15906 hmmer_peak.mine-std-gen [.] sre_random
At the aforementioned revision:
Performance counter stats for 'numactl -C 0 -l specinvoke':
191483.84 msec task-clock:u # 0.994 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
7639 page-faults:u # 0.040 K/sec
622159384711 cycles:u #
817604010 stalled-cycles-frontend:u # 0.13% frontend cycles
idle (83.31%)
439972264588 stalled-cycles-backend:u # 70.72% backend cycles
idle (83.34%)
1707838992202 instructions:u # 2.75 insn per cycle
# 0.26 stalled cycles per
insn (83.35%)
91309384910 branches:u # 476.852 M/sec
(83.32%)
655463713 branch-misses:u # 0.72% of all branches
(83.33%)
192.564513355 seconds time elapsed
191.443774000 seconds user
0.023978000 seconds sys
# Samples: 752K of event 'cycles:u'
# Event count (approx.): 622947549968
#
# Overhead Samples Shared Object Symbol
# ........ ............ ........................ ....................
#
83.68% 629645 hmmer_peak.small-std-gen [.] P7Viterbi
10.84% 81591 hmmer_peak.small-std-gen [.] P7Viterbi.cold
2.21% 16546 hmmer_peak.small-std-gen [.] FChoose
2.11% 15793 hmmer_peak.small-std-gen [.] sre_random
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
^ permalink raw reply [flat|nested] 6+ messages in thread
* [Bug tree-optimization/94427] 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9
2020-03-31 17:33 [Bug tree-optimization/94427] New: 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9 jamborm at gcc dot gnu.org
@ 2020-03-31 23:12 ` jamborm at gcc dot gnu.org
2020-04-01 6:48 ` rguenth at gcc dot gnu.org
` (3 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: jamborm at gcc dot gnu.org @ 2020-03-31 23:12 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94427
--- Comment #1 from Martin Jambor <jamborm at gcc dot gnu.org> ---
OK, so it turns out the identified commit only allows us to shoot
ourselves in the foot - and there is one branch too few, not too many.
The hottest loop, consuming most of the time, is:
Percent Instructions
------------------------------------------------
0.03 │ fb0:┌─+add -0x8(%r9,%rcx,4),%eax
5.03 │ │ mov %eax,-0x4(%r13,%rcx,4)
2.48 │ │ mov -0x8(%r8,%rcx,4),%esi
0.02 │ │ add -0x8(%rdx,%rcx,4),%esi
0.06 │ │ cmp %eax,%esi
4.49 │ │ cmovge %esi,%eax
17.17 │ │ mov %ecx,%esi
0.03 │ │ cmp $0xc521974f,%eax
3.50 │ │ cmovl %ebx,%eax <----------- this used to be a branch
21.84 │ │ mov %eax,-0x4(%r13,%rcx,4)
3.88 │ │ add $0x1,%rcx
0.00 │ │ cmp %rdi,%rcx
0.04 │ └──jne fb0
where the marked conditional move was a branch one revision earlier,
because after fwprop3 the IL looked like:
<bb 16> [local count: 955630217]:
# cstore_281 = PHI <[fast_algorithms.c:142:53] sc_223(14),
[fast_algorithms.c:142:53] cstore_249(15)>
[fast_algorithms.c:142:49] MEM <int> [(void *)_72] = cstore_281;
[fast_algorithms.c:143:13] _78 = [fast_algorithms.c:143:13] *_72;
[fast_algorithms.c:143:10] if (_78 < -987654321)
goto <bb 18>; [50.00%]
else
goto <bb 17>; [50.00%]
<bb 17> [local count: 477815109]:
<bb 18> [local count: 955630217]:
# cstore_250 = PHI <[fast_algorithms.c:143:33] -987654321(16),
[fast_algorithms.c:143:33] cstore_281(17)>
[fast_algorithms.c:143:29] MEM <int> [(void *)_72] = cstore_250;
The aforementioned revision turned this into more optimized code:
<bb 16> [local count: 955630217]:
# cstore_281 = PHI <[fast_algorithms.c:142:53] sc_223(14),
[fast_algorithms.c:142:53] _73(15)>
[fast_algorithms.c:143:10] if (cstore_281 < -987654321)
goto <bb 18>; [50.00%]
else
goto <bb 17>; [50.00%]
<bb 17> [local count: 477815109]:
<bb 18> [local count: 955630217]:
# cstore_250 = PHI <[fast_algorithms.c:143:33] -987654321(16),
[fast_algorithms.c:143:33] cstore_281(17)>
[fast_algorithms.c:143:29] MEM <int> [(void *)_72] = cstore_250;
Which then phiopt3 changed to:
cstore_248 = MAX_EXPR <cstore_249, -987654321>;
[fast_algorithms.c:143:29] MEM <int> [(void *)_72] = cstore_248;
and the expander apparently always expands MAX_EXPR into a conditional
move if it can(?).
When I hacked phiopt not to do the transformation for - ehm - any
GIMPLE_COND statement originating from source line 143, I recovered
the original run-time of the benchmark. On both AMD and Intel.
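For context, the construct being discussed (the dc[k] update and
underflow clamp around lines 142-143 of 456.hmmer's fast_algorithms.c,
per the annotations above) follows this pattern - a minimal standalone
sketch with illustrative names and data, not the benchmark's real code:

```c
/* Sketch of the hot P7Viterbi inner-loop pattern discussed above.  The
   clamp on the marked line is what phiopt3 turns into
   MAX_EXPR <cstore, -987654321> and the expander then emits as a
   conditional move (cmovl) instead of the branch GCC 9 produced.  */
#define NEG_INFTY (-987654321)

static void
dc_row (int *dc, const int *mc, const int *tpdd, const int *tpmd, int M)
{
  for (int k = 1; k <= M; k++)
    {
      int sc;
      dc[k] = dc[k - 1] + tpdd[k - 1];
      if ((sc = mc[k - 1] + tpmd[k - 1]) > dc[k])
        dc[k] = sc;
      if (dc[k] < NEG_INFTY)   /* <- becomes MAX_EXPR, then cmov */
        dc[k] = NEG_INFTY;
    }
}
```

The underflow branch is almost never taken on real inputs, which is why
the well-predicted branch beat the cmov here: the cmov puts its latency
on the loop-carried dc[k] dependency chain in every iteration.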
* [Bug tree-optimization/94427] 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9
2020-03-31 17:33 [Bug tree-optimization/94427] New: 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9 jamborm at gcc dot gnu.org
2020-03-31 23:12 ` [Bug tree-optimization/94427] " jamborm at gcc dot gnu.org
@ 2020-04-01 6:48 ` rguenth at gcc dot gnu.org
2023-08-07 9:19 ` hubicka at gcc dot gnu.org
` (2 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-04-01 6:48 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94427
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
The hot/cold thing smells like PR90911, btw.
The cmov thing you identified is PR91154, which was fixed for archs that
can do the required SSE min/max operation, which is SSE 4.1 IIRC.
It is then also a "duplicate" of the (various?) bugs about
to-cmov-or-not-to-cmov, where the microarchitectural details are
unclear.
So I fear that with just SSE2 we can't do anything here but apply yet
another heuristic, special to hmmer, that will likely hurt another case
(there are corresponding bugs asking for _more_ cmov...). One of my
first patches was to notice the special constant for the max/min
operation and not do conditional-move expansion (plus prevent later RTL
ifcvt from applying, with the same heuristics).
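The cmov-or-branch dilemma can be sketched as follows (illustrative
code, not GCC internals): both forms compute the same clamp but stress
the machine differently:

```c
#define NEG_INFTY (-987654321)

/* Branchy clamp: nearly free when the branch predictor is right, i.e.
   when underflow is rare (the hmmer case), but a misprediction flushes
   the pipeline.  This is the shape GCC 9 emitted.  */
static inline int
clamp_branchy (int x)
{
  if (x < NEG_INFTY)
    x = NEG_INFTY;
  return x;
}

/* Branchless clamp: what MAX_EXPR expands to (cmovl, or pmaxsd once
   vectorized).  No misprediction risk, but its latency always sits on
   the dependency chain.  */
static inline int
clamp_branchless (int x)
{
  return x > NEG_INFTY ? x : NEG_INFTY;
}
```

Which form wins depends on predictability and on whether the loop
vectorizes at all - pmaxsd has no branchy equivalent - which is exactly
why a single heuristic tends to hurt one workload or another.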
* [Bug tree-optimization/94427] 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9
2020-03-31 17:33 [Bug tree-optimization/94427] New: 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9 jamborm at gcc dot gnu.org
2020-03-31 23:12 ` [Bug tree-optimization/94427] " jamborm at gcc dot gnu.org
2020-04-01 6:48 ` rguenth at gcc dot gnu.org
@ 2023-08-07 9:19 ` hubicka at gcc dot gnu.org
2023-08-07 9:38 ` hubicka at gcc dot gnu.org
2023-08-07 9:44 ` rguenth at gcc dot gnu.org
4 siblings, 0 replies; 6+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-08-07 9:19 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94427
Jan Hubicka <hubicka at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |hubicka at gcc dot gnu.org
--- Comment #3 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
With profile feedback on zen4 we now get hottest loops as:
│ dc[k] = dc[k-1] + tpdd[k-1]; ▒
│16b0:┌─ vmovd (%r14,%rdx,1),%xmm2 ▒
0.15 │ │ vpaddd %xmm2,%xmm0,%xmm0 ▒
5.79 │ │ vmovd %xmm0,0x4(%rax,%rdx,1) ▒
│ │if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc; ▒
8.04 │ │ vmovd (%r15,%rdx,1),%xmm7 ▒
0.16 │ │ vmovd (%rcx,%rdx,1),%xmm2 ▒
0.41 │ │ vpaddd %xmm7,%xmm2,%xmm2 ▒
│ │if (dc[k] < -INFTY) dc[k] = -INFTY; ▒
0.71 │ │ vmovdqa _IO_stdin_used+0x560,%xmm7 ◆
1.07 │ │ vpmaxsd %xmm7,%xmm2,%xmm2 ▒
0.73 │ │ vpmaxsd %xmm0,%xmm2,%xmm0 ▒
5.83 │ │ vmovd %xmm0,0x4(%rax,%rdx,1) ▒
│ │for (k = 1; k <= M; k++) { ▒
5.86 │ │ add $0x4,%rdx ▒
1.40 │ ├──cmp %rdx,%r13 ▒
0.00 │ └──jne 16b0 ▒
No time is spent in the cold section.
Without profile feedback I get:
88.80% hmmer_peak.chn- [.] P7Viterbi ◆
5.10% hmmer_peak.chn- [.] sre_random ▒
2.31% hmmer_peak.chn- [.] FChoose ▒
1.35% hmmer_peak.chn- [.] RandomSequence ▒
so no time in the cold section either.
The internal loop is almost identical:
│17e0:┌─ vmovd (%r11,%rdi,4),%xmm3 ▒
0.07 │ │ mov %rdi,%r8 ▒
0.09 │ │ vpaddd %xmm3,%xmm0,%xmm0 ▒
6.20 │ │ vmovd %xmm0,0x4(%rdx,%rdi,4) ▒
│ │if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc; ▒
7.00 │ │ vmovd (%rax,%rdi,4),%xmm6 ▒
0.19 │ │ vmovd (%r10,%rdi,4),%xmm3 ▒
0.16 │ │ vpaddd %xmm3,%xmm6,%xmm3 ◆
│ │if (dc[k] < -INFTY) dc[k] = -INFTY; ▒
1.25 │ │ vmovdqa _IO_stdin_used+0x600,%xmm6 ▒
0.89 │ │ vpmaxsd %xmm6,%xmm3,%xmm3 ▒
0.46 │ │ vpmaxsd %xmm0,%xmm3,%xmm0 ▒
5.85 │ │ vmovd %xmm0,0x4(%rdx,%rdi,4) ▒
│ │for (k = 1; k <= M; k++) { ▒
6.02 │ │ inc %rdi ▒
2.48 │ ├──cmp %r8,%r9 ▒
0.00 │ └──jne 17e0 ▒
However, the hottest loop seems to be completely elsewhere than the one
you show, since it is an FP loop and yours seems to be integer?
* [Bug tree-optimization/94427] 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9
2020-03-31 17:33 [Bug tree-optimization/94427] New: 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9 jamborm at gcc dot gnu.org
` (2 preceding siblings ...)
2023-08-07 9:19 ` hubicka at gcc dot gnu.org
@ 2023-08-07 9:38 ` hubicka at gcc dot gnu.org
2023-08-07 9:44 ` rguenth at gcc dot gnu.org
4 siblings, 0 replies; 6+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-08-07 9:38 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94427
--- Comment #4 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
It is the same loop - it was float only in my mind (since the function
returns a float value :)
With loop splitting we no longer have the last-iteration check, but we
still have the underflow checks, which are indeed likely well
predictable and which in the unvectorized version may make sense not to
if-convert.
So I guess in the unvectorized loop the 100% predictable conditional
should still be a win, but vectorization should likely outweigh that?
* [Bug tree-optimization/94427] 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9
2020-03-31 17:33 [Bug tree-optimization/94427] New: 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9 jamborm at gcc dot gnu.org
` (3 preceding siblings ...)
2023-08-07 9:38 ` hubicka at gcc dot gnu.org
@ 2023-08-07 9:44 ` rguenth at gcc dot gnu.org
4 siblings, 0 replies; 6+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-08-07 9:44 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94427
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Jan Hubicka from comment #4)
> It is the same loop - it was float only in my mind (since the function
> returns a float value :)
>
> With loop splitting we no longer have the last-iteration check, but we
> still have the underflow checks, which are indeed likely well
> predictable and which in the unvectorized version may make sense not to
> if-convert.
>
> So I guess in the unvectorized loop the 100% predictable conditional
> should still be a win, but vectorization should likely outweigh that?
I think so.
end of thread, other threads:[~2023-08-07 9:44 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-31 17:33 [Bug tree-optimization/94427] New: 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9 jamborm at gcc dot gnu.org
2020-03-31 23:12 ` [Bug tree-optimization/94427] " jamborm at gcc dot gnu.org
2020-04-01 6:48 ` rguenth at gcc dot gnu.org
2023-08-07 9:19 ` hubicka at gcc dot gnu.org
2023-08-07 9:38 ` hubicka at gcc dot gnu.org
2023-08-07 9:44 ` rguenth at gcc dot gnu.org
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).