public inbox for gcc-bugs@sourceware.org
* [Bug target/94406] New: 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native
@ 2020-03-30 15:57 jamborm at gcc dot gnu.org
2020-03-30 16:01 ` [Bug target/94406] " jamborm at gcc dot gnu.org
` (6 more replies)
0 siblings, 7 replies; 8+ messages in thread
From: jamborm at gcc dot gnu.org @ 2020-03-30 15:57 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406
Bug ID: 94406
Summary: 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9
with -Ofast -march=native
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jamborm at gcc dot gnu.org
CC: andre.simoesdiasvieira at arm dot com
Blocks: 26163
Target Milestone: ---
Host: x86_64-linux
Target: x86_64-linux
SPEC 2017 FPrate benchmark 503.bwaves_r compiled with -Ofast
-march=native -mtune=native runs 11% slower on AMD Zen2 CPUs when
built with trunk (revision abe13e1847f) than when compiled with GCC
9.2.
Bisecting led to commit:
commit 1297712fb4af6c6bfd827e0f0a9695b14669f87d
Author: Andre Vieira <andre.simoesdiasvieira@arm.com>
Date: Thu Oct 31 09:49:47 2019 +0000
[vect]Make vect-epilogues-nomask=1 default
This patch turns epilogue vectorization on by default for all targets.
From-SVN: r277659
If we use current trunk but also build with --param
vect-epilogues-nomask=0, we get run-time on par with GCC 9.
This is also the reason why generic march/tuning or building with
-mprefer-vector-width=128 currently results in faster code than simple
-march=native.
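For concreteness, the comparison boils down to the following two compile
invocations (the file name below is hypothetical; the actual builds go
through the SPEC harness):

```shell
# trunk default: vectorized epilogues enabled (the slow configuration)
gfortran -Ofast -march=native -mtune=native -c block_solver.f

# workaround: same flags plus the opt-out, on par with GCC 9
gfortran -Ofast -march=native -mtune=native \
         --param vect-epilogues-nomask=0 -c block_solver.f
```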
Interestingly, I do not see this issue on an Intel Cascade Lake Server
CPU, even though the epilogue is created there too - judging by CFG of
the hottest function which looks the same.
I am not sure to what extent it tells us anything, but I accidentally
also perf'ed load-to-store-stall events; in the slow version, the
reported "samples" were 10% higher and the reported "event count" shot
up 2.8 times(!).
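A sketch of how such numbers can be collected with perf (ls_stlf is an
AMD Zen-specific PMU event, so the exact event name may differ between
perf versions; the binary name is taken from the profiles below):

```shell
perf record -e cycles:u -e ls_stlf:u ./bwaves_r_peak.experiment-m64
perf report --stdio
```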
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
^ permalink raw reply [flat|nested] 8+ messages in thread
* [Bug target/94406] 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native
2020-03-30 15:57 [Bug target/94406] New: 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native jamborm at gcc dot gnu.org
@ 2020-03-30 16:01 ` jamborm at gcc dot gnu.org
2020-03-30 16:02 ` jamborm at gcc dot gnu.org
` (5 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: jamborm at gcc dot gnu.org @ 2020-03-30 16:01 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406
--- Comment #1 from Martin Jambor <jamborm at gcc dot gnu.org> ---
For the record, the collected profiles both for the traditional
"cycles:u" event and (originally unintended) "ls_stlf:u" event are
below:
-Ofast -march=native -mtune=native

# Samples: 894K of event 'cycles:u'
# Event count (approx.): 735979402525
#
# Overhead  Samples  Command          Shared Object                 Symbol
# ........  .......  ...............  ............................  ..............................
#
  67.18%    599542  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] mat_times_vec_
  11.40%    102686  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] shell_
  11.37%    101388  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] bi_cgstab_block_
   6.95%     62694  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] jacobian_
   1.88%     16957  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] flux_
   1.01%      9023  bwaves_r_peak.e  libc-2.31.so                  [.] __memset_avx2_unaligned

# Samples: 769K of event 'ls_stlf:u'
# Event count (approx.): 154704730574
#
# Overhead  Samples  Command          Shared Object                 Symbol
# ........  .......  ...............  ............................  ..............................
#
  94.59%    612921  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] mat_times_vec_
   1.83%     88259  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] shell_
   1.12%     13615  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] flux_
   1.11%     43093  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] jacobian_
   1.05%      8746  bwaves_r_peak.e  libc-2.31.so                  [.] __memset_avx2_unaligned

-Ofast -march=native -mtune=native --param vect-epilogues-nomask=0

# Samples: 816K of event 'cycles:u'
# Event count (approx.): 671104061807
#
# Overhead  Samples  Command          Shared Object                 Symbol
# ........  .......  ...............  ............................  ..............................
#
  64.07%    521532  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] mat_times_vec_
  12.50%    102670  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] shell_
  12.39%    100777  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] bi_cgstab_block_
   7.60%     62641  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] jacobian_
   2.06%     16925  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] flux_
   1.17%      9531  bwaves_r_peak.e  libc-2.31.so                  [.] __memset_avx2_unaligned

# Samples: 705K of event 'ls_stlf:u'
# Event count (approx.): 55009340780
#
# Overhead  Samples  Command          Shared Object                 Symbol
# ........  .......  ...............  ............................  ..............................
#
  86.26%    532930  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] mat_times_vec_
   5.15%     88270  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] shell_
   3.17%     13696  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] flux_
   3.06%     57149  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] jacobian_
   1.59%      9226  bwaves_r_peak.e  libc-2.31.so                  [.] __memset_avx2_unaligned
* [Bug target/94406] 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native
2020-03-30 15:57 [Bug target/94406] New: 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native jamborm at gcc dot gnu.org
2020-03-30 16:01 ` [Bug target/94406] " jamborm at gcc dot gnu.org
@ 2020-03-30 16:02 ` jamborm at gcc dot gnu.org
2020-03-30 16:39 ` jamborm at gcc dot gnu.org
` (4 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: jamborm at gcc dot gnu.org @ 2020-03-30 16:02 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406
--- Comment #2 from Martin Jambor <jamborm at gcc dot gnu.org> ---
And for completeness, LNT sees this too and has just managed to catch the
regression:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=276.427.0&plot.1=295.427.0&
* [Bug target/94406] 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native
2020-03-30 15:57 [Bug target/94406] New: 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native jamborm at gcc dot gnu.org
2020-03-30 16:01 ` [Bug target/94406] " jamborm at gcc dot gnu.org
2020-03-30 16:02 ` jamborm at gcc dot gnu.org
@ 2020-03-30 16:39 ` jamborm at gcc dot gnu.org
2020-03-30 21:10 ` jamborm at gcc dot gnu.org
` (3 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: jamborm at gcc dot gnu.org @ 2020-03-30 16:39 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406
--- Comment #3 from Martin Jambor <jamborm at gcc dot gnu.org> ---
One more data point: a binary compiled for cascadelake does not run on
Zen2, but one built for znver2 runs on Cascade Lake, where it makes no
difference in run-time.
If disabling epilogues helps on Intel, the difference is less than 2%.
* [Bug target/94406] 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native
2020-03-30 15:57 [Bug target/94406] New: 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native jamborm at gcc dot gnu.org
` (2 preceding siblings ...)
2020-03-30 16:39 ` jamborm at gcc dot gnu.org
@ 2020-03-30 21:10 ` jamborm at gcc dot gnu.org
2020-03-31 7:05 ` rguenth at gcc dot gnu.org
` (2 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: jamborm at gcc dot gnu.org @ 2020-03-30 21:10 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406
--- Comment #4 from Martin Jambor <jamborm at gcc dot gnu.org> ---
For the record, on AMD Zen2 at least, SPEC 2006 410.bwaves also runs
about 12% faster with --param vect-epilogues-nomask=0 (and otherwise
with -Ofast -march=native -mtune=native).
* [Bug target/94406] 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native
2020-03-30 15:57 [Bug target/94406] New: 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native jamborm at gcc dot gnu.org
` (3 preceding siblings ...)
2020-03-30 21:10 ` jamborm at gcc dot gnu.org
@ 2020-03-31 7:05 ` rguenth at gcc dot gnu.org
2020-11-13 14:35 ` [Bug tree-optimization/94406] " cvs-commit at gcc dot gnu.org
2020-12-01 10:47 ` jamborm at gcc dot gnu.org
6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-03-31 7:05 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Hmm, we're invoking memset from libc which might use a different path on
CXL compared to Zen2?
Note that a vectorized epilogue should in no way cause additional
store-to-load forwarding penalties _but_ it might cause additional
(positive) store-to-load forwardings.
Code-generation-wise the loop leaves a lot to be desired, and given
that we know the number of iterations is 5, the vectorized epilogue
will never be entered, so its overhead will only hurt. Maybe CXL
branch prediction behaves better here.
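As a scalar sketch of the main-loop/epilogue shape under discussion
(hypothetical kernel, not the bwaves code): the epilogue handles the
leftover iterations right after the main loop has stored neighbouring
data, which is where extra, possibly harmless, store-to-load
forwardings can show up.

```c
/* Strip-mined loop: an 8-wide "main" body (stand-in for the vector
   loop) followed by a scalar epilogue for the remaining n % 8
   elements.  With n known to leave few leftovers, a *vectorized*
   epilogue here would never be entered and only adds dispatch
   overhead. */
void axpy(float *y, const float *x, float a, int n) {
    int i = 0;
    /* main loop, 8 elements per iteration */
    for (; i + 8 <= n; i += 8)
        for (int k = 0; k < 8; ++k)
            y[i + k] += a * x[i + k];
    /* scalar epilogue for the tail */
    for (; i < n; ++i)
        y[i] += a * x[i];
}
```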
Note there's room for improvement in the way we dispatch to the vectorized
epilogue. Exiting the main vectorized loop we do

  if (do_we_need_an_epilogue)
    {
      if (remaining-niters == 1)
        do scalar epilogue
      else
        do vector epilogue
    }
where the complication is due to the fact that we share the scalar
epilogue loops with the loop used when the runtime cost model check
fails.
Thus the CFG with a vectorized epilogue could be structured more
optimally, reducing the overhead to a single jump-around.
For bwaves the other improvement opportunity is to move the memset out
of the full loop nest rather than just covering the innermost two loops.
That probably improves register allocation.
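A sketch of the suggested memset hoisting (array shapes invented for
illustration; the real bwaves arrays differ):

```c
#include <string.h>

enum { NB = 3, NX = 4, NY = 4, NZ = 4 };

/* before: the memset covers only the inner dimensions and is
   re-issued on every outer-loop iteration */
void clear_per_block(double q[NB][NX][NY][NZ]) {
    for (int b = 0; b < NB; ++b)
        memset(q[b], 0, sizeof q[b]);
}

/* after: one memset hoisted out of the full loop nest does the same
   work in a single call */
void clear_hoisted(double q[NB][NX][NY][NZ]) {
    memset(q, 0, sizeof(double) * NB * NX * NY * NZ);
}
```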
* [Bug tree-optimization/94406] 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native
2020-03-30 15:57 [Bug target/94406] New: 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native jamborm at gcc dot gnu.org
` (4 preceding siblings ...)
2020-03-31 7:05 ` rguenth at gcc dot gnu.org
@ 2020-11-13 14:35 ` cvs-commit at gcc dot gnu.org
2020-12-01 10:47 ` jamborm at gcc dot gnu.org
6 siblings, 0 replies; 8+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2020-11-13 14:35 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406
--- Comment #6 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Martin Jambor <jamborm@gcc.gnu.org>:
https://gcc.gnu.org/g:ac91af71c93462cbc701bbd104fa21894bb15e86
commit r11-4983-gac91af71c93462cbc701bbd104fa21894bb15e86
Author: Martin Jambor <mjambor@suse.cz>
Date: Fri Nov 13 15:35:18 2020 +0100
loops: Invoke lim after successful loop interchange
This patch makes the entry point to loop invariant motion public, so
that it can be called after loop interchange when that pass has
swapped loops. This avoids the non-LTO -Ofast run-time regressions of
410.bwaves and 503.bwaves_r (which are 19% and 15% faster than current
master on an AMD zen2 machine) while not introducing a full LIM pass
into the pass pipeline.
The patch also adds a parameter which allows not to perform any store
motion so that it is not done after an interchange.
gcc/ChangeLog:
2020-11-12 Martin Jambor <mjambor@suse.cz>
PR tree-optimization/94406
* tree-ssa-loop-im.c (tree_ssa_lim): Renamed to
loop_invariant_motion_in_fun, added a parameter to control store
motion.
(pass_lim::execute): Adjust call to tree_ssa_lim, now
loop_invariant_motion_in_fun.
* tree-ssa-loop-manip.h (loop_invariant_motion_in_fun): Declare.
* gimple-loop-interchange.cc (pass_linterchange::execute): Call
loop_invariant_motion_in_fun if any interchange has been done.
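The interaction the commit message describes can be illustrated with a
toy C loop nest (hypothetical, not the bwaves kernel): once interchange
makes j the outer loop, the load of b[j] becomes invariant in the inner
loop, which is the kind of redundancy the newly invoked
invariant-motion pass removes.

```c
enum { N = 8 };

/* Loop nest as it looks after interchange; the bj load is what LIM
   hoists out of the inner i-loop, since it no longer depends on i. */
double after_interchange(double a[N][N], double b[N]) {
    double s = 0.0;
    for (int j = 0; j < N; ++j) {   /* interchange made j the outer loop */
        double bj = b[j];           /* invariant in i: hoisted by LIM */
        for (int i = 0; i < N; ++i)
            s += a[i][j] * bj;
    }
    return s;
}
```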
* [Bug tree-optimization/94406] 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native
2020-03-30 15:57 [Bug target/94406] New: 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native jamborm at gcc dot gnu.org
` (5 preceding siblings ...)
2020-11-13 14:35 ` [Bug tree-optimization/94406] " cvs-commit at gcc dot gnu.org
@ 2020-12-01 10:47 ` jamborm at gcc dot gnu.org
6 siblings, 0 replies; 8+ messages in thread
From: jamborm at gcc dot gnu.org @ 2020-12-01 10:47 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406
Martin Jambor <jamborm at gcc dot gnu.org> changed:
         What       |Removed      |Added
--------------------+-------------+----------
         Resolution |---          |FIXED
         Status     |UNCONFIRMED  |RESOLVED
--- Comment #7 from Martin Jambor <jamborm at gcc dot gnu.org> ---
Fixed, as can be seen at
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=276.427.0&plot.1=398.427.0&plot.2=295.427.0&
end of thread, other threads:[~2020-12-01 10:47 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-30 15:57 [Bug target/94406] New: 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native jamborm at gcc dot gnu.org
2020-03-30 16:01 ` [Bug target/94406] " jamborm at gcc dot gnu.org
2020-03-30 16:02 ` jamborm at gcc dot gnu.org
2020-03-30 16:39 ` jamborm at gcc dot gnu.org
2020-03-30 21:10 ` jamborm at gcc dot gnu.org
2020-03-31 7:05 ` rguenth at gcc dot gnu.org
2020-11-13 14:35 ` [Bug tree-optimization/94406] " cvs-commit at gcc dot gnu.org
2020-12-01 10:47 ` jamborm at gcc dot gnu.org