[Bug tree-optimization/103592] New: fatigue2 benchmarks on zen runs 43% faster with -fno-tree-vectorize -fno-tree-slp-vectorize


* [Bug tree-optimization/103592] New: fatigue2 benchmarks on zen runs 43% faster with -fno-tree-vectorize -fno-tree-slp-vectorize
@ 2021-12-06 21:02 hubicka at gcc dot gnu.org
  2021-12-06 21:48 ` [Bug tree-optimization/103592] " hubicka at kam dot mff.cuni.cz
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-12-06 21:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103592

            Bug ID: 103592
           Summary: fatigue2 benchmarks on zen runs 43% faster with
                    -fno-tree-vectorize -fno-tree-slp-vectorize
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

While looking into the -fno-inline-functions-called-once difference I noticed
that on zen hardware I get:
 - 0m33s runtime for the fatigue2 benchmark (from phoronix) when built with
   -Ofast -march=native -fno-tree-slp-vectorize -fno-tree-vectorize
 - 0m57s for the -Ofast -march=native binary


* [Bug tree-optimization/103592] fatigue2 benchmarks on zen runs 43% faster with -fno-tree-vectorize -fno-tree-slp-vectorize
  2021-12-06 21:02 [Bug tree-optimization/103592] New: fatigue2 benchmarks on zen runs 43% faster with -fno-tree-vectorize -fno-tree-slp-vectorize hubicka at gcc dot gnu.org
@ 2021-12-06 21:48 ` hubicka at kam dot mff.cuni.cz
  2021-12-07  8:23 ` marxin at gcc dot gnu.org
  2021-12-07  9:38 ` rguenth at gcc dot gnu.org
  2 siblings, 0 replies; 4+ messages in thread
From: hubicka at kam dot mff.cuni.cz @ 2021-12-06 21:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103592

--- Comment #1 from hubicka at kam dot mff.cuni.cz ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
> [Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
Note that fatigue2 is from Polyhedron, not SPEC...


* [Bug tree-optimization/103592] fatigue2 benchmarks on zen runs 43% faster with -fno-tree-vectorize -fno-tree-slp-vectorize
  2021-12-06 21:02 [Bug tree-optimization/103592] New: fatigue2 benchmarks on zen runs 43% faster with -fno-tree-vectorize -fno-tree-slp-vectorize hubicka at gcc dot gnu.org
  2021-12-06 21:48 ` [Bug tree-optimization/103592] " hubicka at kam dot mff.cuni.cz
@ 2021-12-07  8:23 ` marxin at gcc dot gnu.org
  2021-12-07  9:38 ` rguenth at gcc dot gnu.org
  2 siblings, 0 replies; 4+ messages in thread
From: marxin at gcc dot gnu.org @ 2021-12-07  8:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103592

Martin Liška <marxin at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |marxin at gcc dot gnu.org
   Last reconfirmed|                            |2021-12-07
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1


* [Bug tree-optimization/103592] fatigue2 benchmarks on zen runs 43% faster with -fno-tree-vectorize -fno-tree-slp-vectorize
  2021-12-06 21:02 [Bug tree-optimization/103592] New: fatigue2 benchmarks on zen runs 43% faster with -fno-tree-vectorize -fno-tree-slp-vectorize hubicka at gcc dot gnu.org
  2021-12-06 21:48 ` [Bug tree-optimization/103592] " hubicka at kam dot mff.cuni.cz
  2021-12-07  8:23 ` marxin at gcc dot gnu.org
@ 2021-12-07  9:38 ` rguenth at gcc dot gnu.org
  2 siblings, 0 replies; 4+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-12-07  9:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103592

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Overhead  Samples  Command       Shared Object  Symbol
  23.13%    44783  a.out.vect    a.out.vect     [.] __perdida_m_MOD_generalized_hookes_law.constprop.0.isra.0
   2.40%     4641  a.out.vect    a.out.vect     [.] __perdida_m_MOD_generalized_hookes_law.constprop.1.isra.0
   2.37%     4613  a.out.novect  a.out.novect   [.] __perdida_m_MOD_generalized_hookes_law.constprop.0.isra.0
   1.23%     2383  a.out.vect    libc-2.31.so   [.] __memset_avx2_unaligned_erms
   0.35%      676  a.out.vect    libc-2.31.so   [.] __memset_avx2_unaligned
   0.20%      394  a.out.novect  a.out.novect   [.] __perdida_m_MOD_generalized_hookes_law.constprop.1.isra.0

We end up doing loop vectorization with a lot of invariants built up from
scalars, even though only a single vector iteration is known to run.  We
also have a local array that's only elided after vectorization, causing the
final stores to require vector extracts.

I think this is the usual case of vectorization constraining OOO execution
in the face of the code being limited by load & store.

We also fail to elide generalized_constitutive_tensor - FRE can do this
in principle - there's a duplicate PR for this and the situation is like

  generalized_constitutive_tensor = {};
...
  generalized_constitutive_tensor[0] = _19;
  generalized_constitutive_tensor[1] = ISRA.833_76(D);
  generalized_constitutive_tensor[2] = ISRA.833_76(D);
...
  vect__14.843_125 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)&generalized_constitutive_tensor];

where FRE could create a { _19, ISRA.833_76(D), ISRA.833_76(D), 0. }
vector CTOR but that's only profitable if the stores go away.  I have
a patch to do that (w/o the costing).
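
For readers not used to GIMPLE dumps, the shape of the problem can be
sketched in C.  This is a hand-written illustration, not the fatigue2
source: the function names, the 4-element size and the weights are
assumptions made up for the example.  The first function mirrors the
zero-init / scalar stores / whole-array read pattern above; the second is
what remains once the array is elided and the values
{ a, b, b, 0. } are used directly.

```c
#include <string.h>

/* Hypothetical stand-in for the pattern in the dump: a local array is
   zero-initialized, a few elements are overwritten with scalars, and
   the array is then read back as a whole. */
double sum_via_array(double a, double b)
{
    double t[4];
    memset(t, 0, sizeof t);      /* like: generalized_constitutive_tensor = {}; */
    t[0] = a;
    t[1] = b;
    t[2] = b;

    double s = 0.0;
    for (int i = 0; i < 4; i++)  /* whole-array read, a vector-load candidate */
        s += t[i] * (double)(i + 1);
    return s;
}

/* The same computation with the array elided: the scalars are used
   directly, so neither the memset nor the stores are needed. */
double sum_direct(double a, double b)
{
    return a * 1.0 + b * 2.0 + b * 3.0 + 0.0 * 4.0;
}
```

Both functions return the same value for ordinary finite inputs; the point
is only that the second form needs no memory traffic for the array at all.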

Note in the non-vectorized case we are able to elide
generalized_constitutive_tensor and also CSE a lot of the computations
because the tensor only has 4 distinct values (and some are even zero).
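
As a rough C illustration of why few distinct values help (again a
made-up sketch, not the benchmark code: the 3x3 size, the matrix layout
and the names d and o are assumptions), compare a generic matrix-vector
product with the same product written against known scalar entries.
Dropping the zero terms is only valid under the relaxed floating-point
rules that -Ofast enables.

```c
/* Generic product: every entry is loaded from memory and multiplied. */
void matvec_generic(double c[3][3], const double x[3], double y[3])
{
    for (int i = 0; i < 3; i++) {
        y[i] = 0.0;
        for (int j = 0; j < 3; j++)
            y[i] += c[i][j] * x[j];
    }
}

/* Same product when the entries are known to be just d, o and zero:
     | d o 0 |
     | o d 0 |
     | 0 0 o |
   The zero terms vanish and the matrix never has to exist in memory. */
void matvec_known(double d, double o, const double x[3], double y[3])
{
    y[0] = d * x[0] + o * x[1];
    y[1] = o * x[0] + d * x[1];
    y[2] = o * x[2];
}
```

With the scalar form the compiler sees every operand directly, which is
what makes the elision and the CSE possible in the first place.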

So it's really a very special case ...

