public inbox for gcc-bugs@sourceware.org
* [Bug target/94406] New: 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native
@ 2020-03-30 15:57 jamborm at gcc dot gnu.org
2020-03-30 16:01 ` [Bug target/94406] " jamborm at gcc dot gnu.org
` (6 more replies)
0 siblings, 7 replies; 8+ messages in thread
From: jamborm at gcc dot gnu.org @ 2020-03-30 15:57 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406
Bug ID: 94406
Summary: 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9
with -Ofast -march=native
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jamborm at gcc dot gnu.org
CC: andre.simoesdiasvieira at arm dot com
Blocks: 26163
Target Milestone: ---
Host: x86_64-linux
Target: x86_64-linux
SPEC 2017 FPrate benchmark 503.bwaves_r compiled with -Ofast
-march=native -mtune=native runs 11% slower on AMD Zen2 CPUs when
built with trunk (revision abe13e1847f) than when compiled with GCC
9.2.
Bisecting led to commit:
commit 1297712fb4af6c6bfd827e0f0a9695b14669f87d
Author: Andre Vieira <andre.simoesdiasvieira@arm.com>
Date: Thu Oct 31 09:49:47 2019 +0000
[vect]Make vect-epilogues-nomask=1 default
This patch turns epilogue vectorization on by default for all targets.
From-SVN: r277659
If we use current trunk but also build with --param
vect-epilogues-nomask=0, we get run-time on par with GCC 9.
This is also the reason why generic march/tuning or building with
-mprefer-vector-width=128 currently results in faster code than simple
-march=native.
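For concreteness, the comparison boils down to the following two compile
invocations (the file name below is hypothetical; the actual builds go
through the SPEC harness):

```shell
# trunk default: vectorized epilogues enabled (the slow configuration)
gfortran -Ofast -march=native -mtune=native -c block_solver.f

# workaround: same flags plus the opt-out, on par with GCC 9
gfortran -Ofast -march=native -mtune=native \
         --param vect-epilogues-nomask=0 -c block_solver.f
```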
Interestingly, I do not see this issue on an Intel Cascade Lake Server
CPU, even though the epilogue is created there too - judging by CFG of
the hottest function which looks the same.
I am not sure to what extent it tells us anything, but I accidentally
also perf'ed load-to-store-stall events; in the slow version, the
reported "samples" were 10% higher and the reported "event count" shot
up 2.8 times(!).
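A sketch of how such numbers can be collected with perf (ls_stlf is an
AMD Zen-specific PMU event, so the exact event name may differ between
perf versions; the binary name is taken from the profiles below):

```shell
perf record -e cycles:u -e ls_stlf:u ./bwaves_r_peak.experiment-m64
perf report --stdio
```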
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
^ permalink raw reply [flat|nested] 8+ messages in thread
* [Bug target/94406] 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native
2020-03-30 15:57 [Bug target/94406] New: 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native jamborm at gcc dot gnu.org
@ 2020-03-30 16:01 ` jamborm at gcc dot gnu.org
2020-03-30 16:02 ` jamborm at gcc dot gnu.org
` (5 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: jamborm at gcc dot gnu.org @ 2020-03-30 16:01 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406
--- Comment #1 from Martin Jambor <jamborm at gcc dot gnu.org> ---
For the record, the collected profiles both for the traditional
"cycles:u" event and (originally unintended) "ls_stlf:u" event are
below:
-Ofast -march=native -mtune=native

# Samples: 894K of event 'cycles:u'
# Event count (approx.): 735979402525
#
# Overhead  Samples  Command          Shared Object                 Symbol
# ........  .......  ...............  ............................  ..............................
#
  67.18%    599542  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] mat_times_vec_
  11.40%    102686  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] shell_
  11.37%    101388  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] bi_cgstab_block_
   6.95%     62694  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] jacobian_
   1.88%     16957  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] flux_
   1.01%      9023  bwaves_r_peak.e  libc-2.31.so                  [.] __memset_avx2_unaligned

# Samples: 769K of event 'ls_stlf:u'
# Event count (approx.): 154704730574
#
# Overhead  Samples  Command          Shared Object                 Symbol
# ........  .......  ...............  ............................  ..............................
#
  94.59%    612921  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] mat_times_vec_
   1.83%     88259  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] shell_
   1.12%     13615  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] flux_
   1.11%     43093  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] jacobian_
   1.05%      8746  bwaves_r_peak.e  libc-2.31.so                  [.] __memset_avx2_unaligned

-Ofast -march=native -mtune=native --param vect-epilogues-nomask=0

# Samples: 816K of event 'cycles:u'
# Event count (approx.): 671104061807
#
# Overhead  Samples  Command          Shared Object                 Symbol
# ........  .......  ...............  ............................  ..............................
#
  64.07%    521532  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] mat_times_vec_
  12.50%    102670  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] shell_
  12.39%    100777  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] bi_cgstab_block_
   7.60%     62641  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] jacobian_
   2.06%     16925  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] flux_
   1.17%      9531  bwaves_r_peak.e  libc-2.31.so                  [.] __memset_avx2_unaligned

# Samples: 705K of event 'ls_stlf:u'
# Event count (approx.): 55009340780
#
# Overhead  Samples  Command          Shared Object                 Symbol
# ........  .......  ...............  ............................  ..............................
#
  86.26%    532930  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] mat_times_vec_
   5.15%     88270  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] shell_
   3.17%     13696  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] flux_
   3.06%     57149  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] jacobian_
   1.59%      9226  bwaves_r_peak.e  libc-2.31.so                  [.] __memset_avx2_unaligned
* [Bug target/94406] 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native
2020-03-30 15:57 [Bug target/94406] New: 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native jamborm at gcc dot gnu.org
2020-03-30 16:01 ` [Bug target/94406] " jamborm at gcc dot gnu.org
@ 2020-03-30 16:02 ` jamborm at gcc dot gnu.org
2020-03-30 16:39 ` jamborm at gcc dot gnu.org
` (4 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: jamborm at gcc dot gnu.org @ 2020-03-30 16:02 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406
--- Comment #2 from Martin Jambor <jamborm at gcc dot gnu.org> ---
And for completeness, LNT sees this too and has just managed to catch the
regression:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=276.427.0&plot.1=295.427.0&
* [Bug target/94406] 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native
2020-03-30 15:57 [Bug target/94406] New: 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native jamborm at gcc dot gnu.org
2020-03-30 16:01 ` [Bug target/94406] " jamborm at gcc dot gnu.org
2020-03-30 16:02 ` jamborm at gcc dot gnu.org
@ 2020-03-30 16:39 ` jamborm at gcc dot gnu.org
2020-03-30 21:10 ` jamborm at gcc dot gnu.org
` (3 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: jamborm at gcc dot gnu.org @ 2020-03-30 16:39 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406
--- Comment #3 from Martin Jambor <jamborm at gcc dot gnu.org> ---
One more data point: a binary compiled for cascadelake does not run on
Zen2, but one built for znver2 runs on Cascade Lake, where it makes no
difference in run-time.
If disabling epilogues helps on Intel, the difference is less than 2%.
* [Bug target/94406] 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native
2020-03-30 15:57 [Bug target/94406] New: 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native jamborm at gcc dot gnu.org
` (2 preceding siblings ...)
2020-03-30 16:39 ` jamborm at gcc dot gnu.org
@ 2020-03-30 21:10 ` jamborm at gcc dot gnu.org
2020-03-31 7:05 ` rguenth at gcc dot gnu.org
` (2 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: jamborm at gcc dot gnu.org @ 2020-03-30 21:10 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406
--- Comment #4 from Martin Jambor <jamborm at gcc dot gnu.org> ---
For the record, on AMD Zen2 at least, SPEC 2006 410.bwaves also runs
about 12% faster with --param vect-epilogues-nomask=0 (and otherwise
with -Ofast -march=native -mtune=native).
* [Bug target/94406] 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native
2020-03-30 15:57 [Bug target/94406] New: 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native jamborm at gcc dot gnu.org
` (3 preceding siblings ...)
2020-03-30 21:10 ` jamborm at gcc dot gnu.org
@ 2020-03-31 7:05 ` rguenth at gcc dot gnu.org
2020-11-13 14:35 ` [Bug tree-optimization/94406] " cvs-commit at gcc dot gnu.org
2020-12-01 10:47 ` jamborm at gcc dot gnu.org
6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-03-31 7:05 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Hmm, we're invoking memset from libc which might use a different path on
CXL compared to Zen2?
Note that a vectorized epilogue should in no way cause additional
store-to-load forwarding penalties _but_ it might cause additional
(positive) store-to-load forwardings.
Code-generation-wise the loop leaves a lot to be desired, and given
that we know the number of iterations is 5, the vectorized epilogue
will never be entered, so its overhead will only hurt. Maybe CXL
branch prediction behaves better here.
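As a scalar sketch of the main-loop/epilogue shape under discussion
(hypothetical kernel, not the bwaves code): the epilogue handles the
leftover iterations right after the main loop has stored neighbouring
data, which is where extra, possibly harmless, store-to-load
forwardings can show up.

```c
/* Strip-mined loop: an 8-wide "main" body (stand-in for the vector
   loop) followed by a scalar epilogue for the remaining n % 8
   elements.  With n known to leave few leftovers, a *vectorized*
   epilogue here would never be entered and only adds dispatch
   overhead. */
void axpy(float *y, const float *x, float a, int n) {
    int i = 0;
    /* main loop, 8 elements per iteration */
    for (; i + 8 <= n; i += 8)
        for (int k = 0; k < 8; ++k)
            y[i + k] += a * x[i + k];
    /* scalar epilogue for the tail */
    for (; i < n; ++i)
        y[i] += a * x[i];
}
```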
Note there's room for improvement in the way we dispatch to the vectorized
epilogue. Exiting the main vectorized loop we do

  if (do_we_need_an_epilogue)
    {
      if (remaining-niters == 1)
        do scalar epilogue
      else
        do vector epilogue
    }
where the complication is due to the fact that we share the scalar
epilogue loops with the loop used when the runtime cost model check
fails.
Thus the CFG with a vectorized epilogue could be structured more
optimally, reducing the overhead to a single jump-around.
For bwaves the other improvement opportunity is to move the memset out
of the full loop nest rather than just covering the innermost two loops.
That probably improves register allocation.
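A sketch of the suggested memset hoisting (array shapes invented for
illustration; the real bwaves arrays differ):

```c
#include <string.h>

enum { NB = 3, NX = 4, NY = 4, NZ = 4 };

/* before: the memset covers only the inner dimensions and is
   re-issued on every outer-loop iteration */
void clear_per_block(double q[NB][NX][NY][NZ]) {
    for (int b = 0; b < NB; ++b)
        memset(q[b], 0, sizeof q[b]);
}

/* after: one memset hoisted out of the full loop nest does the same
   work in a single call */
void clear_hoisted(double q[NB][NX][NY][NZ]) {
    memset(q, 0, sizeof(double) * NB * NX * NY * NZ);
}
```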
* [Bug tree-optimization/94406] 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native
2020-03-30 15:57 [Bug target/94406] New: 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native jamborm at gcc dot gnu.org
` (4 preceding siblings ...)
2020-03-31 7:05 ` rguenth at gcc dot gnu.org
@ 2020-11-13 14:35 ` cvs-commit at gcc dot gnu.org
2020-12-01 10:47 ` jamborm at gcc dot gnu.org
6 siblings, 0 replies; 8+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2020-11-13 14:35 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406
--- Comment #6 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Martin Jambor <jamborm@gcc.gnu.org>:
https://gcc.gnu.org/g:ac91af71c93462cbc701bbd104fa21894bb15e86
commit r11-4983-gac91af71c93462cbc701bbd104fa21894bb15e86
Author: Martin Jambor <mjambor@suse.cz>
Date: Fri Nov 13 15:35:18 2020 +0100
loops: Invoke lim after successful loop interchange
This patch makes the entry point to loop invariant motion public, so
that it can be called after loop interchange when that pass has
swapped loops. This avoids the non-LTO -Ofast run-time regressions of
410.bwaves and 503.bwaves_r (which are 19% and 15% faster than current
master on an AMD zen2 machine) while not introducing a full LIM pass
into the pass pipeline.
The patch also adds a parameter which allows not to perform any store
motion so that it is not done after an interchange.
gcc/ChangeLog:
2020-11-12 Martin Jambor <mjambor@suse.cz>
PR tree-optimization/94406
* tree-ssa-loop-im.c (tree_ssa_lim): Renamed to
loop_invariant_motion_in_fun, added a parameter to control store
motion.
(pass_lim::execute): Adjust call to tree_ssa_lim, now
loop_invariant_motion_in_fun.
* tree-ssa-loop-manip.h (loop_invariant_motion_in_fun): Declare.
* gimple-loop-interchange.cc (pass_linterchange::execute): Call
loop_invariant_motion_in_fun if any interchange has been done.
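The interaction the commit message describes can be illustrated with a
toy C loop nest (hypothetical, not the bwaves kernel): once interchange
makes j the outer loop, the load of b[j] becomes invariant in the inner
loop, which is the kind of redundancy the newly invoked
invariant-motion pass removes.

```c
enum { N = 8 };

/* Loop nest as it looks after interchange; the bj load is what LIM
   hoists out of the inner i-loop, since it no longer depends on i. */
double after_interchange(double a[N][N], double b[N]) {
    double s = 0.0;
    for (int j = 0; j < N; ++j) {   /* interchange made j the outer loop */
        double bj = b[j];           /* invariant in i: hoisted by LIM */
        for (int i = 0; i < N; ++i)
            s += a[i][j] * bj;
    }
    return s;
}
```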
* [Bug tree-optimization/94406] 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native
2020-03-30 15:57 [Bug target/94406] New: 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native jamborm at gcc dot gnu.org
` (5 preceding siblings ...)
2020-11-13 14:35 ` [Bug tree-optimization/94406] " cvs-commit at gcc dot gnu.org
@ 2020-12-01 10:47 ` jamborm at gcc dot gnu.org
6 siblings, 0 replies; 8+ messages in thread
From: jamborm at gcc dot gnu.org @ 2020-12-01 10:47 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406
Martin Jambor <jamborm at gcc dot gnu.org> changed:
         What       |Removed      |Added
--------------------+-------------+----------
         Resolution |---          |FIXED
         Status     |UNCONFIRMED  |RESOLVED
--- Comment #7 from Martin Jambor <jamborm at gcc dot gnu.org> ---
Fixed, as can be seen at
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=276.427.0&plot.1=398.427.0&plot.2=295.427.0&
end of thread, other threads:[~2020-12-01 10:47 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-30 15:57 [Bug target/94406] New: 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native jamborm at gcc dot gnu.org
2020-03-30 16:01 ` [Bug target/94406] " jamborm at gcc dot gnu.org
2020-03-30 16:02 ` jamborm at gcc dot gnu.org
2020-03-30 16:39 ` jamborm at gcc dot gnu.org
2020-03-30 21:10 ` jamborm at gcc dot gnu.org
2020-03-31 7:05 ` rguenth at gcc dot gnu.org
2020-11-13 14:35 ` [Bug tree-optimization/94406] " cvs-commit at gcc dot gnu.org
2020-12-01 10:47 ` jamborm at gcc dot gnu.org