public inbox for gcc-bugs@sourceware.org
* [Bug ipa/63671] New: [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize
@ 2014-10-28 19:28 trippels at gcc dot gnu.org
  2014-10-28 19:39 ` [Bug ipa/63671] " trippels at gcc dot gnu.org
                   ` (21 more replies)
  0 siblings, 22 replies; 23+ messages in thread
From: trippels at gcc dot gnu.org @ 2014-10-28 19:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63671

            Bug ID: 63671
           Summary: [5 Regression] 21% tramp3d-v4 performance hit due to
                    -fdevirtualize
           Product: gcc
           Version: 5.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: ipa
          Assignee: unassigned at gcc dot gnu.org
          Reporter: trippels at gcc dot gnu.org
                CC: hubicka at gcc dot gnu.org

On my AMD machine I get:

markus@x4 ~ % time g++ -Ofast tramp3d-v4.cpp
23.540 total
markus@x4 ~ % ./a.out --cartvis 1.0 0.0 --rhomin 1e-8 -n 20
...
Time spent in iteration: 3.79717

markus@x4 ~ % time g++ -Ofast -fno-devirtualize tramp3d-v4.cpp
22.163 total
markus@x4 ~ % ./a.out --cartvis 1.0 0.0 --rhomin 1e-8 -n 20
...
Time spent in iteration: 2.97514

For gcc-4.9 -fno-devirtualize makes no difference and I get in both cases:
Time spent in iteration: 3.02253


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug ipa/63671] [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize
  2014-10-28 19:28 [Bug ipa/63671] New: [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize trippels at gcc dot gnu.org
@ 2014-10-28 19:39 ` trippels at gcc dot gnu.org
  2014-10-29  9:42 ` rguenth at gcc dot gnu.org
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: trippels at gcc dot gnu.org @ 2014-10-28 19:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63671

--- Comment #1 from Markus Trippelsdorf <trippels at gcc dot gnu.org> ---
It gets worse with -flto:

markus@x4 ~ % g++ -w -Ofast -flto=4 tramp3d-v4.cpp
markus@x4 ~ % ./a.out --cartvis 1.0 0.0 --rhomin 1e-8 -n 20
...
Time spent in iteration: 4.6181

For "-fno-devirtualize -flto=4" the result is the same as without -flto.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug ipa/63671] [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize
  2014-10-28 19:28 [Bug ipa/63671] New: [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize trippels at gcc dot gnu.org
  2014-10-28 19:39 ` [Bug ipa/63671] " trippels at gcc dot gnu.org
@ 2014-10-29  9:42 ` rguenth at gcc dot gnu.org
  2014-10-29 10:26 ` trippels at gcc dot gnu.org
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: rguenth at gcc dot gnu.org @ 2014-10-29  9:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63671

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org
   Target Milestone|---                         |5.0

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
I suppose -flto is the same as -fwhole-program then?  Maybe too much
speculative inlining happens?  Does using FDO mitigate the -fdevirtualize hit?


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug ipa/63671] [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize
  2014-10-28 19:28 [Bug ipa/63671] New: [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize trippels at gcc dot gnu.org
  2014-10-28 19:39 ` [Bug ipa/63671] " trippels at gcc dot gnu.org
  2014-10-29  9:42 ` rguenth at gcc dot gnu.org
@ 2014-10-29 10:26 ` trippels at gcc dot gnu.org
  2014-11-01 18:37 ` hubicka at gcc dot gnu.org
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: trippels at gcc dot gnu.org @ 2014-10-29 10:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63671

--- Comment #3 from Markus Trippelsdorf <trippels at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #2)
> I suppose -flto is the same as -fwhole-program then?

Yes.

> Maybe too much  speculative inlining happens?

Hopefully Honza finds time to tune inlining before release.

> Does using FDO mitigate the -fdevirtualize hit?

Yes.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug ipa/63671] [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize
  2014-10-28 19:28 [Bug ipa/63671] New: [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize trippels at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2014-10-29 10:26 ` trippels at gcc dot gnu.org
@ 2014-11-01 18:37 ` hubicka at gcc dot gnu.org
  2014-11-11  5:33 ` hubicka at gcc dot gnu.org
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: hubicka at gcc dot gnu.org @ 2014-11-01 18:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63671

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2014-11-01
           Assignee|unassigned at gcc dot gnu.org      |hubicka at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #4 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
There are not many virtual functions in tramp3d's hot paths, right?
I suppose it may just be inlining into virtual functions that are removed
afterwards.  I will try to debug this.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug ipa/63671] [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize
  2014-10-28 19:28 [Bug ipa/63671] New: [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize trippels at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2014-11-01 18:37 ` hubicka at gcc dot gnu.org
@ 2014-11-11  5:33 ` hubicka at gcc dot gnu.org
  2014-11-12  0:04 ` hubicka at ucw dot cz
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: hubicka at gcc dot gnu.org @ 2014-11-11  5:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63671

--- Comment #5 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
There is also a code size difference.  I tried to track it down; to match
-fno-devirtualize I need to disable all the places where devirtualization
happens:
Index: gimple-fold.c
===================================================================
--- gimple-fold.c       (revision 217304)
+++ gimple-fold.c       (working copy)
@@ -331,7 +331,7 @@ fold_gimple_assign (gimple_stmt_iterator
            tree val = OBJ_TYPE_REF_EXPR (rhs);
            if (is_gimple_min_invariant (val))
              return val;
-           else if (flag_devirtualize && virtual_method_call_p (rhs))
+           else if (flag_devirtualize && virtual_method_call_p (rhs) && 0)
              {
                bool final;
                vec <cgraph_node *>targets
@@ -2633,7 +2633,7 @@ gimple_fold_call (gimple_stmt_iterator *
          gimple_call_set_fn (stmt, OBJ_TYPE_REF_EXPR (callee));
          changed = true;
        }
-      else if (flag_devirtualize && !inplace && virtual_method_call_p
(callee))
+      else if (flag_devirtualize && !inplace && virtual_method_call_p (callee)
&& 0)
        {
          bool final;
          vec <cgraph_node *>targets
Index: ipa.c
===================================================================
--- ipa.c       (revision 217304)
+++ ipa.c       (working copy)
@@ -198,7 +198,7 @@ walk_polymorphic_call_targets (hash_set<
      final or anonymous (so we know all its derivation)
      and there is only one possible virtual call target,
      make the edge direct.  */
-  if (final)
+  if (final && 0)
     {
       if (targets.length () <= 1 && dbg_cnt (devirt))
        {
Index: tree-ssa-pre.c
===================================================================
--- tree-ssa-pre.c      (revision 217304)
+++ tree-ssa-pre.c      (working copy)
@@ -4320,7 +4320,7 @@ eliminate_dom_walker::before_dom_childre
        {
          tree fn = gimple_call_fn (stmt);
          if (fn
-             && flag_devirtualize
+             && flag_devirtualize && 0
              && virtual_method_call_p (fn))
            {
              tree otr_type = obj_type_ref_class (fn);

So apparently it is neither the unit growth nor the reachability code.  It is
simply the presence of direct calls instead of polymorphic calls that seems to
make the inliner take bad decisions...
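
For readers less familiar with the code, a hypothetical example (invented
names, not taken from tramp3d) of the kind of call those spots turn direct:
when the type is final, there is only one possible target, so the polymorphic
call becomes an ordinary direct call that the inliner then costs like any
other candidate.

// Hypothetical example of a devirtualizable call; 'final' means there can
// be no further overriders of Square::area.
struct Shape
{
  virtual double area () const = 0;
  virtual ~Shape () {}
};

struct Square final : Shape
{
  double side;
  double area () const override { return side * side; }
};

double
total_area (const Square &s)
{
  const Shape &base = s;
  // The front end emits an OBJ_TYPE_REF here; because Square is final the
  // folding spots patched above can prove the only possible target is
  // Square::area and replace the indirect call with a direct one.
  return base.area ();
}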


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug ipa/63671] [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize
  2014-10-28 19:28 [Bug ipa/63671] New: [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize trippels at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2014-11-11  5:33 ` hubicka at gcc dot gnu.org
@ 2014-11-12  0:04 ` hubicka at ucw dot cz
  2014-11-12  0:07 ` hubicka at gcc dot gnu.org
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: hubicka at ucw dot cz @ 2014-11-12  0:04 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63671

--- Comment #6 from Jan Hubicka <hubicka at ucw dot cz> ---
I am attaching the changes caused by enabling tree-ssa-pre devirtualization
only.  This devirtualizes a couple of calls and does not affect the early
inliner (because it runs after it in the early-opt queue and does not seem to
propagate down), yet for some reason it causes the slowdown.

These devirtualizations are all locally a good idea, so it seems that the
global inliner heuristics just get lost.

Comparing inline decisions is going to be fun.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug ipa/63671] [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize
  2014-10-28 19:28 [Bug ipa/63671] New: [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize trippels at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2014-11-12  0:04 ` hubicka at ucw dot cz
@ 2014-11-12  0:07 ` hubicka at gcc dot gnu.org
  2014-11-12  1:21 ` hubicka at gcc dot gnu.org
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: hubicka at gcc dot gnu.org @ 2014-11-12  0:07 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63671

--- Comment #7 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Created attachment 33944
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33944&action=edit
Changes in release_ssa dump

These are the changes made by tree-ssa-pre devirtualization (which triggers
the regression on its own).  There are no changes in early inlining, just some
indirect calls replaced by direct ones.  That should always be a good idea -
it seems that the global inliner simply degenerates somehow.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug ipa/63671] [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize
  2014-10-28 19:28 [Bug ipa/63671] New: [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize trippels at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2014-11-12  0:07 ` hubicka at gcc dot gnu.org
@ 2014-11-12  1:21 ` hubicka at gcc dot gnu.org
  2014-11-13  9:03 ` rguenther at suse dot de
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: hubicka at gcc dot gnu.org @ 2014-11-12  1:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63671

--- Comment #8 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
According to perf, the culprit is the inlining of
_ZNK22UniformRectilinearMeshI10MeshTraitsILi3Ed21UniformRectilinearTag12CartesianTagLi3EEE12cellPositionERK3LocILi3EE
which only gets partially inlined with devirtualization on, because we hit
the inline unit growth limit.

Inline summary for UniformRectilinearMesh<MeshTraits>::PointType_t
UniformRectilinearMesh<MeshTraits>::cellPosition(const Loc_t&) const [with
MeshTraits = MeshTraits<3, double, UniformRec
  self time:       66
  global time:     0
  self size:       23
  global size:     0
  min size:       0
  self stack:      0
  global stack:    0
    size:17.500000, time:56.275000, predicate:(true)
    size:4.500000, time:6.275000, predicate:(not inlined)
    size:0.500000, time:1.500000, predicate:(op0[ref offset: 0] changed) &&
(not inlined)
    size:0.500000, time:1.500000, predicate:(op0[ref offset: 0] changed)
  calls:

so this is a leaf function with quite a small body.  It fails to be inlined:

 Estimated badness is -96240, frequency 79.20.
    Badness calculation for void Adv5::Z::MomentumfluxZ<Dim>::operator()(const
F1&, const F2&, const F3&, const Loc<Dim>&) const [with F1 =
Field<UniformRectilinearMesh<MeshTraits<3, doub
      size growth 12, time 58 inline hints: declared_inline
      -96240: guessed profile. frequency 79.200000, benefit 8.713335%, time w/o
inlining 20899, time w inlining 19078 overall growth 138 (current) 162
(original)

this is quite a small badness but still enough to make the inlining happen
late (the reported 8.713% benefit is apparently just the relative time saved:
(20899 - 19078) / 20899 ≈ 8.713%).  The function body is as follows:

{
  int i;
  int i;
  int _7;
  double _8;
  double _9;
  struct UniformRectilinearMeshData * _12;
  const struct Domain * _13;
  int _15;
  double _16;
  double _17;
  double _18;
  double _19;
  double * _22;
  const Element_t _23;

  <bb 2>:
  goto <bb 6>;

  <bb 3>:
  _22 = &MEM[(struct VectorEngine *)point_5(D)].x_m[i_3];
  if (_22 != 0B)
    goto <bb 4>;
  else
    goto <bb 5>;

  <bb 4>:
  MEM[(This_t *)point_5(D)].x_m[i_3] = 0.0;

  <bb 5>:
  i_14 = i_3 + 1;

  <bb 6>:
  # i_3 = PHI <0(2), i_14(5)>
  if (i_3 <= 2)
    goto <bb 3>;
  else
    goto <bb 7>;

  <bb 7>:
  # i_11 = PHI <0(6)>
  goto <bb 9>;

  <bb 8>:
  _12 = MEM[(const struct RefCountedPtr *)this_6(D)].ptr_m;
  _8 = MEM[(const double &)_12 + 104].x_m[i_1];
  _9 = MEM[(const double &)_12 + 128].x_m[i_1];
  _7 = MEM[(const struct Domain
*)loc_10(D)].D.78652.domain_m[i_1].D.77613.D.46542.D.46423.domain_m;
  _13 = &MEM[(const struct OneDomain_t *)_12 +
32B].D.120542.domain_m[i_1].D.115812.D.45237;
  _23 = MEM[(int *)_13];
  _15 = _7 - _23;
  _16 = (double) _15;
  _17 = _16 + 5.0e-1;
  _18 = _9 * _17;
  _19 = _8 + _18;
  MEM[(double &)point_5(D)].x_m[i_1] = _19;
  i_21 = i_1 + 1;

  <bb 9>:
  # i_1 = PHI <i_11(7), i_21(8)>
  if (i_1 <= 2)
    goto <bb 8>;
  else
    goto <bb 10>;

  <bb 10>:
  return point_5(D);

}

One obvious piece of nonsense is:

 <bb 3>:
  _22 = &MEM[(struct VectorEngine *)point_5(D)].x_m[i_3];
  if (_22 != 0B)
    goto <bb 4>;
  else
    goto <bb 5>;

Redundant non-zero checks seem very common in C++ code - I ran across many of
them, auto-generated by multiple inheritance or brought in by inlining.
Scheduling VRP early makes the function fully inlined again, but does not
fully solve the regression.
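
For reference, a hypothetical source-level reduction of that pattern (the
type and field names are invented, this is not the actual tramp3d code): the
address of a member array element can never be null, so the guard is
trivially redundant and early VRP can fold it away.

struct Vector3
{
  double x_m[3];
};

// p points into a member array and can never be null, yet inlining (or
// this-pointer adjustment in multiple-inheritance hierarchies) can leave
// such a guard behind.
void
zero_components (Vector3 &point)
{
  for (int i = 0; i < 3; ++i)
    {
      double *p = &point.x_m[i];   // _22 = &MEM[...].x_m[i_3]
      if (p != nullptr)            // if (_22 != 0B)  -- always true
        *p = 0.0;                  // MEM[...].x_m[i_3] = 0.0
    }
}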


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug ipa/63671] [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize
  2014-10-28 19:28 [Bug ipa/63671] New: [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize trippels at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2014-11-12  1:21 ` hubicka at gcc dot gnu.org
@ 2014-11-13  9:03 ` rguenther at suse dot de
  2014-11-14  0:47 ` hubicka at ucw dot cz
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: rguenther at suse dot de @ 2014-11-13  9:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63671

--- Comment #10 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 12 Nov 2014, hubicka at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63671
> 
> --- Comment #9 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
> With early VRP (but also without it) the inliner now seems to suffer from
> extreme round-off errors in the badness computation.  With VRP the first
> uninlined function still has
> badness 0:
> 
> Considering std::_Bit_reference& std::_Bit_reference::operator=(bool)/797 with
> 15 size
>  to be inlined into <built-in>/47767 in /aux/hubicka/tramp3d-v4.cpp:38764
>  Estimated badness is 0, frequency 0.71.
>     Badness calculation for <built-in>/47767 -> std::_Bit_reference&
> std::_Bit_reference::operator=(bool)/797
>       size growth 8, time 5 inline hints: declared_inline
>       0: guessed profile. frequency 0.709000, benefit 0.002000%, time w/o
> inlining 500006, time w inlining 499996 overall growth 612 (current) 398
> (original)
>   not inlinable: <built-in>/47767 -> std::_Bit_reference&
> std::_Bit_reference::operator=(bool)/797, --param inline-unit-growth limit
> reached
>    Estimating body: std::basic_ostream<char, _Traits>&
> std::operator<<(std::basic_ostream<char, _Traits>&, const char*) [with _Traits
> = std::char_traits<char>]/2076
>    Known to be false: not inlined, op1 == 0B, op1 changed
>    size:7 time:22
> 
> so inline decisions are basically random.  I tuned this a few times, but it
> is hard to balance the fixed-point arithmetic so that it does not collapse
> to 0.  The function in question is very small, but there are just too many
> of them.
> 
> I wonder if we can't switch the inliner to std::priority_queue and use sreal
> to drive the priority queue?  Or one could hold fractions and compare them
> in wide-int calculations.

I think using sreal is fine - I expect that to be faster than using
wide_int (and smaller).

Random inlining decisions are bad :/  (and hard to debug)

Richard.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug ipa/63671] [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize
  2014-10-28 19:28 [Bug ipa/63671] New: [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize trippels at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2014-11-13  9:03 ` rguenther at suse dot de
@ 2014-11-14  0:47 ` hubicka at ucw dot cz
  2014-11-20 12:42 ` rguenth at gcc dot gnu.org
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: hubicka at ucw dot cz @ 2014-11-14  0:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63671

--- Comment #11 from Jan Hubicka <hubicka at ucw dot cz> ---
> 
> I think using sreal is fine - I expect that to be faster than using
> wide_int (and smaller).
> 
> Random inlining decisions are bad :/  (and hard to debug)

Yep, this has been hitting us from time to time and I was usually able to just
tweak the badness calculation, but sreal would reduce this maintenance burden.
I will give it a try with Martin's fibheap/sreal patchset.
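
For illustration, a minimal sketch of the two alternatives mentioned above
(this is not GCC's actual ipa-inline code; struct and function names are
invented): a scaled-integer badness rounds many small candidates down to the
same value, while keeping numerator and denominator and comparing by a wide
cross-multiplication - or using sreal - preserves the ordering.

#include <cstdint>

// Illustrative only: a "badness" kept as an exact fraction (den > 0).
struct badness_fraction
{
  std::int64_t num;   // e.g. -(estimated time saved) * frequency
  std::int64_t den;   // e.g. a size/growth scale, always positive
};

// Scaled-integer variant: many small candidates round down to the same
// value (often 0), so their relative order in the heap becomes arbitrary.
inline std::int64_t
badness_fixed_point (const badness_fraction &b, std::int64_t scale = 1 << 16)
{
  return b.num * scale / b.den;
}

// Cross-multiplying with a wide product (__int128 is a GCC extension) keeps
// a stable total order without rounding; the thread instead goes the sreal
// route, which also avoids the collapse to zero.
inline bool
badness_less (const badness_fraction &a, const badness_fraction &b)
{
  return (__int128) a.num * b.den < (__int128) b.num * a.den;
}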


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug ipa/63671] [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize
  2014-10-28 19:28 [Bug ipa/63671] New: [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize trippels at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2014-11-14  0:47 ` hubicka at ucw dot cz
@ 2014-11-20 12:42 ` rguenth at gcc dot gnu.org
  2014-11-23 15:20 ` hubicka at gcc dot gnu.org
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: rguenth at gcc dot gnu.org @ 2014-11-20 12:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63671

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
           Priority|P3                          |P1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug ipa/63671] [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize
  2014-10-28 19:28 [Bug ipa/63671] New: [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize trippels at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2014-11-20 12:42 ` rguenth at gcc dot gnu.org
@ 2014-11-23 15:20 ` hubicka at gcc dot gnu.org
  2014-11-24 16:16 ` hubicka at gcc dot gnu.org
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: hubicka at gcc dot gnu.org @ 2014-11-23 15:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63671

--- Comment #12 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Created attachment 34076
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=34076&action=edit
Patch to fix aliases and dead code removal.

One of the problems was the abstract origin tracking, which is already solved.
The other problem is some implementation laziness WRT aliases, which are now
used more pervasively (by ICF and also by the visibility code).
The attached patch makes -fdevirtualize -O3 run fast again, but now
-fno-devirtualize -O3 regresses (even with Martin's heap on sreals). What
fun! :)
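
As an aside, a hedged illustration of the alias situation (invented
functions, not from tramp3d): -fipa-icf may fold structurally identical
functions by keeping one body and turning the other into an alias, and the
inliner's node-removal bookkeeping then has to treat that alias like a real
node.

// Two structurally identical functions; ICF may keep one body and make the
// other an alias to it.  Any pass that asks "can this node be removed now?"
// must then consider the alias as well.
int length_squared (int x, int y) { return x * x + y * y; }
int norm_squared   (int a, int b) { return a * a + b * b; }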


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug ipa/63671] [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize
  2014-10-28 19:28 [Bug ipa/63671] New: [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize trippels at gcc dot gnu.org
                   ` (11 preceding siblings ...)
  2014-11-23 15:20 ` hubicka at gcc dot gnu.org
@ 2014-11-24 16:16 ` hubicka at gcc dot gnu.org
  2014-11-24 16:36 ` trippels at gcc dot gnu.org
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: hubicka at gcc dot gnu.org @ 2014-11-24 16:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63671

--- Comment #13 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Author: hubicka
Date: Mon Nov 24 16:15:46 2014
New Revision: 218024

URL: https://gcc.gnu.org/viewcvs?rev=218024&root=gcc&view=rev
Log:
    PR ipa/63671
    * ipa-inline-transform.c (can_remove_node_now_p_1): Handle alises
    and -fno-devirtualize more carefully.
    (can_remove_node_now_p): Update.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ipa-inline-transform.c


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug ipa/63671] [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize
  2014-10-28 19:28 [Bug ipa/63671] New: [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize trippels at gcc dot gnu.org
                   ` (12 preceding siblings ...)
  2014-11-24 16:16 ` hubicka at gcc dot gnu.org
@ 2014-11-24 16:36 ` trippels at gcc dot gnu.org
  2014-11-24 16:39 ` hubicka at gcc dot gnu.org
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: trippels at gcc dot gnu.org @ 2014-11-24 16:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63671

--- Comment #14 from Markus Trippelsdorf <trippels at gcc dot gnu.org> ---
(In reply to Jan Hubicka from comment #13)
> Author: hubicka
> Date: Mon Nov 24 16:15:46 2014
> New Revision: 218024
> 
> URL: https://gcc.gnu.org/viewcvs?rev=218024&root=gcc&view=rev
> Log:
> 	PR ipa/63671
> 	* ipa-inline-transform.c (can_remove_node_now_p_1): Handle alises
> 	and -fno-devirtualize more carefully.
> 	(can_remove_node_now_p): Update.
> 
> Modified:
>     trunk/gcc/ChangeLog
>     trunk/gcc/ipa-inline-transform.c

Thanks for the patch.  The issue from comment 0 is now fixed,
but with -flto it is still slow:


markus@x4 ~ % g++ -Ofast -flto=4 -w tramp3d-v4.cpp
markus@x4 ~ % ./a.out --cartvis 1.0 0.0 --rhomin 1e-8 -n 20
...
Time spent in iteration: 4.35963


And for -flto -fno-devirtualize I get an ICE:

markus@x4 ~ % g++ -Ofast -flto=4 -w -fno-devirtualize tramp3d-v4.cpp
tramp3d-v4.cpp: In member function ‘RelationListItem::notifyPreRead() [clone
.part.111]’:
tramp3d-v4.cpp:64206:1: internal compiler error: Segmentation fault
 }
 ^
0xc7542f crash_signal
        ../../gcc/gcc/toplev.c:359
0xac92b8 tree_check
        ../../gcc/gcc/tree.h:2763
0xac92b8 ipa_polymorphic_call_context::get_dynamic_type(tree_node*, tree_node*,
tree_node*, gimple_statement_base*)
        ../../gcc/gcc/ipa-polymorphic-call.c:1593
0xae4c04 ipa_analyze_call_uses
        ../../gcc/gcc/ipa-prop.c:2173
0xae4c04 ipa_analyze_stmt_uses
        ../../gcc/gcc/ipa-prop.c:2192
0xae4c04 ipa_analyze_params_uses_in_bb
        ../../gcc/gcc/ipa-prop.c:2232
0xae4c04 analysis_dom_walker::before_dom_children(basic_block_def*)
        ../../gcc/gcc/ipa-prop.c:2316
0x12202d7 dom_walker::walk(basic_block_def*)
        ../../gcc/gcc/domwalk.c:188
0xaeb839 ipa_analyze_node(cgraph_node*)
        ../../gcc/gcc/ipa-prop.c:2373
0x125c77f ipcp_generate_summary
        ../../gcc/gcc/ipa-cp.c:4254
0xbba939 execute_ipa_summary_passes(ipa_opt_pass_d*)
        ../../gcc/gcc/passes.c:2137
0x8d39fe ipa_passes
        ../../gcc/gcc/cgraphunit.c:2074
0x8d39fe symbol_table::compile()
        ../../gcc/gcc/cgraphunit.c:2187
0x8d5177 symbol_table::finalize_compilation_unit()
        ../../gcc/gcc/cgraphunit.c:2340
0x6ac91b cp_write_global_declarations()
        ../../gcc/gcc/cp/decl2.c:4688
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug ipa/63671] [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize
  2014-10-28 19:28 [Bug ipa/63671] New: [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize trippels at gcc dot gnu.org
                   ` (13 preceding siblings ...)
  2014-11-24 16:36 ` trippels at gcc dot gnu.org
@ 2014-11-24 16:39 ` hubicka at gcc dot gnu.org
  2014-11-24 16:43 ` hubicka at gcc dot gnu.org
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: hubicka at gcc dot gnu.org @ 2014-11-24 16:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63671

--- Comment #15 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
The performance regression seems solved on my setup (Trevor, can you double
check?).

-Ofast -fdevirtualize:
Time spent in iteration: 4.11598

-Ofast -fno-devirtualize
Time spent in iteration: 4.17063


-Ofast -fno-devirtualize -fno-icf
Time spent in iteration: 3.34953

GCC 4.7:
Time spent in iteration: 5.53831

GCC 4.8:
Time spent in iteration: 3.4625

GCC 4.9:
Time spent in iteration: 2.79834

$ size 4.7 4.9 devirt nodevirt 
   text    data     bss     dec     hex filename
 785615    1204    2904  789723   c0cdb 4.7
 620822    1268    3320  625410   98b02 4.8
 615847    1180    3160  620187   9769b 4.9
 680881    1228    3064  685173   a7475 devirt
 650161    1228    3064  654453   9fc75 nodevirt
 681709    1228    3064  686001   a77b1 tramp-devirt-noicf


So 4.9 still outperforms mainline; ICF confuses the inliner, and moreover
devirtualization accounts for 5% of the code size with no visible benefit.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug ipa/63671] [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize
  2014-10-28 19:28 [Bug ipa/63671] New: [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize trippels at gcc dot gnu.org
                   ` (14 preceding siblings ...)
  2014-11-24 16:39 ` hubicka at gcc dot gnu.org
@ 2014-11-24 16:43 ` hubicka at gcc dot gnu.org
  2014-11-24 17:30 ` trippels at gcc dot gnu.org
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: hubicka at gcc dot gnu.org @ 2014-11-24 16:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63671

--- Comment #16 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
The ICE will probably go away with
Index: ipa-prop.c
===================================================================
--- ipa-prop.c  (revision 217980)
+++ ipa-prop.c  (working copy)
@@ -2155,7 +2155,7 @@ ipa_analyze_call_uses (struct func_body_
   if (cs && !cs->indirect_unknown_callee)
     return;

-  if (cs->indirect_info->polymorphic)
+  if (cs->indirect_info->polymorphic && flag_devirtualize)
     {
       tree instance;
       tree target = gimple_call_fn (call);

I am at a workshop; the patch is preapproved if it passes testing.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug ipa/63671] [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize
  2014-10-28 19:28 [Bug ipa/63671] New: [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize trippels at gcc dot gnu.org
                   ` (15 preceding siblings ...)
  2014-11-24 16:43 ` hubicka at gcc dot gnu.org
@ 2014-11-24 17:30 ` trippels at gcc dot gnu.org
  2014-11-26 16:05 ` hubicka at gcc dot gnu.org
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: trippels at gcc dot gnu.org @ 2014-11-24 17:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63671

--- Comment #17 from Markus Trippelsdorf <trippels at gcc dot gnu.org> ---
(In reply to Jan Hubicka from comment #16)
> The ICE will probably go away with
> Index: ipa-prop.c
> ===================================================================
> --- ipa-prop.c  (revision 217980)
> +++ ipa-prop.c  (working copy)
> @@ -2155,7 +2155,7 @@ ipa_analyze_call_uses (struct func_body_
>    if (cs && !cs->indirect_unknown_callee)
>      return;
>  
> -  if (cs->indirect_info->polymorphic)
> +  if (cs->indirect_info->polymorphic && flag_devirtualize)
>      {
>        tree instance;
>        tree target = gimple_call_fn (call);
> 
> I am at a workshop; the patch is preapproved if it passes testing.

The patch doesn't help unfortunately. I've opened PR64059 for this issue.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug ipa/63671] [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize
  2014-10-28 19:28 [Bug ipa/63671] New: [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize trippels at gcc dot gnu.org
                   ` (16 preceding siblings ...)
  2014-11-24 17:30 ` trippels at gcc dot gnu.org
@ 2014-11-26 16:05 ` hubicka at gcc dot gnu.org
  2014-11-26 18:39 ` hubicka at gcc dot gnu.org
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: hubicka at gcc dot gnu.org @ 2014-11-26 16:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63671

--- Comment #18 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
With Martin's fibheap long->sreal change some of the anomalies went away.
-ficf is no longer an issue, but mainline still produces a slower binary than
4.9, and -fwhole-program hurts code quality.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug ipa/63671] [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize
  2014-10-28 19:28 [Bug ipa/63671] New: [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize trippels at gcc dot gnu.org
                   ` (17 preceding siblings ...)
  2014-11-26 16:05 ` hubicka at gcc dot gnu.org
@ 2014-11-26 18:39 ` hubicka at gcc dot gnu.org
  2014-11-26 19:09 ` trippels at gcc dot gnu.org
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: hubicka at gcc dot gnu.org @ 2014-11-26 18:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63671

--- Comment #19 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Richard, I still get 2.7 per iteration for GCC 4.9.
For mainline, even with the inline limits bumped ad extremum, I cannot get
better than 3.3 (and that is what I get with both -O3 and -Ofast).  I am not
sure whether the inliner is to blame or it is the effect of some other change.
Perhaps, since you know the code, you could take a look at what is causing the
difference?

I plan to continue by looking into the -fwhole-program issue, which is still
slower (5.2) than normal compilation.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug ipa/63671] [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize
  2014-10-28 19:28 [Bug ipa/63671] New: [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize trippels at gcc dot gnu.org
                   ` (18 preceding siblings ...)
  2014-11-26 18:39 ` hubicka at gcc dot gnu.org
@ 2014-11-26 19:09 ` trippels at gcc dot gnu.org
  2014-11-27  3:03 ` hubicka at gcc dot gnu.org
  2014-11-27 10:47 ` rguenth at gcc dot gnu.org
  21 siblings, 0 replies; 23+ messages in thread
From: trippels at gcc dot gnu.org @ 2014-11-26 19:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63671

--- Comment #20 from Markus Trippelsdorf <trippels at gcc dot gnu.org> ---
At some point you start measuring processor characteristics.

On my old AMD box (Phenom II) gcc-5 now is the fastest:

-Ofast    : 2.9
-O3       : 3.6
PGO/-Ofast: 2.7

gcc-4.9:
-Ofast    : 3.0
-O3       : 3.7
PGO/-Ofast: 2.8

gcc-4.8:
-Ofast    : 3.4
-O3       : 4.0
PGO/-Ofast: 3.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug ipa/63671] [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize
  2014-10-28 19:28 [Bug ipa/63671] New: [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize trippels at gcc dot gnu.org
                   ` (19 preceding siblings ...)
  2014-11-26 19:09 ` trippels at gcc dot gnu.org
@ 2014-11-27  3:03 ` hubicka at gcc dot gnu.org
  2014-11-27 10:47 ` rguenth at gcc dot gnu.org
  21 siblings, 0 replies; 23+ messages in thread
From: hubicka at gcc dot gnu.org @ 2014-11-27  3:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63671

--- Comment #21 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
My CPU is an AMD Opteron(TM) Processor 6272 and 5.0 doesn't beat 4.9 generic
even with -march=native (the time is 2.9 vs. 2.7).  Maybe some tuning issue?


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug ipa/63671] [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize
  2014-10-28 19:28 [Bug ipa/63671] New: [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize trippels at gcc dot gnu.org
                   ` (20 preceding siblings ...)
  2014-11-27  3:03 ` hubicka at gcc dot gnu.org
@ 2014-11-27 10:47 ` rguenth at gcc dot gnu.org
  21 siblings, 0 replies; 23+ messages in thread
From: rguenth at gcc dot gnu.org @ 2014-11-27 10:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63671

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #22 from Richard Biener <rguenth at gcc dot gnu.org> ---
You also get to measure memory / cache behavior.  The default grid size was
(in the past ...) set to make sure we hit main memory; nowadays it's more like
on the order of the L2 (or L3) cache size.  Once you make the problem smaller
via command-line parameters you can see the optimization effects more clearly
(ideally tramp3d would run on a cluster with MPI parallelization and OpenMP
enabled on each node, with each thread running fully inside the L3 cache -
back in the day Itanic was really great here with its gigantic L3/L4 caches).

On an old iCore5 I see trunk outperforming 4.9 now with just -Ofast
(with generic tuning).

The most important thing to make sure of when optimizing is that no calls
survive in any of the hot triple-nested loops and that the innermost loop
"looks" fast ;)

The loops are in functions whose symbols match the pattern
*EvaluateLocLoop*runEv, for example
_ZN14MultiArgKernelI9MultiArg2I5FieldI22UniformRectilinearMeshI10MeshTraitsILi3Ed21UniformRectilinearTag12CartesianTagLi3EEEd10BrickViewUES9_E15EvaluateLocLoopIN4Adv51Z7DensupdILi3EEELi3EEE3runEv
(unfortunately the pattern matches very many unrelated functions as well...)

Note that we seem to vectorize the innermost loops now (yay!) but peel them
for alignment (ugh - the prologue won't make things better; the innermost
loops run only 64 iterations and thus 32 vector iterations by default).  And
then of course we have the epilogue for the remaining iterations.  Luckily we
fully peel both the epilogue and the prologue (each has at most 1 iteration
with V2DF vectors).
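
As a rough, purely illustrative stand-in (this is not the actual tramp3d
kernel), the loop shape being discussed is essentially a short fixed-count
loop over doubles:

// 64 double iterations; with V2DF (2 doubles per vector) that is 32 vector
// iterations.  The alignment prologue and the epilogue each execute at most
// one scalar iteration, but add the extra branchy setup code complained
// about above.
void
scale_add_64 (double *__restrict out, const double *__restrict in,
              double factor)
{
  for (int i = 0; i < 64; ++i)
    out[i] += factor * in[i];
}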

Code generated for trunk and 4.9 is almost the same for a few cases I looked
at.

And I think the performance to compare against is that of compiling with
-Dleafify=flatten (which makes sure all the desired inlining happens very
early).  On my machine it is even a little slower with flatten enabled.
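
A hedged sketch of what -Dleafify=flatten presumably amounts to (the exact
macro plumbing inside tramp3d-v4.cpp is an assumption here): attaching GCC's
flatten attribute to the hot kernel methods, which inlines every call inside
the annotated function as early and as completely as possible.

// Assumption: the leafify marker ends up expanding to something like this.
// The attribute itself is real GCC: all calls inside run() are inlined into
// it where possible, independent of the inliner's cost model.
struct EvaluateKernel
{
  __attribute__ ((flatten)) void run ();
};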

The graphs on gcc.opensuse.org show the regression is fixed as well (though
compile-time had quite a surge).

Thus I think we can close this as fixed.


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2014-11-27 10:47 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-10-28 19:28 [Bug ipa/63671] New: [5 Regression] 21% tramp3d-v4 performance hit due to -fdevirtualize trippels at gcc dot gnu.org
2014-10-28 19:39 ` [Bug ipa/63671] " trippels at gcc dot gnu.org
2014-10-29  9:42 ` rguenth at gcc dot gnu.org
2014-10-29 10:26 ` trippels at gcc dot gnu.org
2014-11-01 18:37 ` hubicka at gcc dot gnu.org
2014-11-11  5:33 ` hubicka at gcc dot gnu.org
2014-11-12  0:04 ` hubicka at ucw dot cz
2014-11-12  0:07 ` hubicka at gcc dot gnu.org
2014-11-12  1:21 ` hubicka at gcc dot gnu.org
2014-11-13  9:03 ` rguenther at suse dot de
2014-11-14  0:47 ` hubicka at ucw dot cz
2014-11-20 12:42 ` rguenth at gcc dot gnu.org
2014-11-23 15:20 ` hubicka at gcc dot gnu.org
2014-11-24 16:16 ` hubicka at gcc dot gnu.org
2014-11-24 16:36 ` trippels at gcc dot gnu.org
2014-11-24 16:39 ` hubicka at gcc dot gnu.org
2014-11-24 16:43 ` hubicka at gcc dot gnu.org
2014-11-24 17:30 ` trippels at gcc dot gnu.org
2014-11-26 16:05 ` hubicka at gcc dot gnu.org
2014-11-26 18:39 ` hubicka at gcc dot gnu.org
2014-11-26 19:09 ` trippels at gcc dot gnu.org
2014-11-27  3:03 ` hubicka at gcc dot gnu.org
2014-11-27 10:47 ` rguenth at gcc dot gnu.org
