[Bug tree-optimization/100089] New: [11 Performance regression ] 30% for denbench/mp2decoddata2 with -O3

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug tree-optimization/100089] New: [11 Performance regression ] 30% for denbench/mp2decoddata2 with -O3
@ 2021-04-15  6:35 crazylht at gmail dot com
  2021-04-15  7:17 ` [Bug tree-optimization/100089] [11 Regression] 30% performance regression " rguenth at gcc dot gnu.org
                   ` (17 more replies)
  0 siblings, 18 replies; 19+ messages in thread
From: crazylht at gmail dot com @ 2021-04-15  6:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089

            Bug ID: 100089
           Summary: [11 Performance regression ] 30% for
                    denbench/mp2decoddata2 with -O3
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: crazylht at gmail dot com
                CC: hjl.tools at gmail dot com
  Target Milestone: ---
              Host: x86_64-pc-linux-gnu
            Target: x86_64-*-* i?86-*-*

Created attachment 50597
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50597&action=edit
denbench_mp2decoddata2.cpp

https://godbolt.org/z/EGoz1zx61

cat test.cpp

static inline void idctrow(e_s16 *blk)
{
   e_s32 x0, x1, x2, x3, x4, x5, x6, x7, x8;


   if (!((x1 = blk[4]<<11) | (x2 = blk[6]) | (x3 = blk[2]) |
      (x4 = blk[1]) | (x5 = blk[7]) | (x6 = blk[5]) | (x7 = blk[3])))
   {
      blk[0]=blk[1]=blk[2]=blk[3]=blk[4]=blk[5]=blk[6]=blk[7]=(e_s16)blk[0]<<3;
      return;
   }

   x0 = (blk[0]<<11) + 128;


   x8 = 565*(x4+x5);
   x4 = x8 + (2841 -565)*x4;
   x5 = x8 - (2841 +565)*x5;
   x8 = 2408*(x6+x7);
   x6 = x8 - (2408 -1609)*x6;
   x7 = x8 - (2408 +1609)*x7;


   x8 = x0 + x1;
   x0 -= x1;
   x1 = 1108*(x3+x2);
   x2 = x1 - (2676 +1108)*x2;
   x3 = x1 + (2676 -1108)*x3;
   x1 = x4 + x6;
   x4 -= x6;
   x6 = x5 + x7;
   x5 -= x7;
   x7 = x8 + x3;
   x8 -= x3;
   x3 = x0 + x2;
   x0 -= x2;
   x2 = (181*(x4+x5)+128)>>8;
   x4 = (181*(x4-x5)+128)>>8;

   blk[0] = (e_s16)((x7+x1)>>8);
   blk[1] = (e_s16)((x3+x2)>>8);
   blk[2] = (e_s16)((x0+x4)>>8);
   blk[3] = (e_s16)((x8+x6)>>8);
   blk[4] = (e_s16)((x8-x6)>>8);
   blk[5] = (e_s16)((x0-x4)>>8);
   blk[6] = (e_s16)((x3-x2)>>8);
   blk[7] = (e_s16)((x7-x1)>>8);

}

int
__attribute__ ((noipa))
Fast_IDCT(e_s16 *block)
{
   e_s32 i;

   for (i=0; i<8; i++)
      idctrow(block+8*i);

   return 1;
}

 pass_ifcvt transforms the if branch in idctrow into an conditional move, and
then pass_vect finds that although there's no loop vectorization opportunity
but there are opportunities for SLP, but the cost model of SLP does not
consider the cost of these conditional movs, which eventually generates a large
number of redundant test and cmov in codegen.

test.cpp:76:11: note:   stmt 1 MEM[(e_s16 *)_3 + 2B] = _ifc__264;
test.cpp:76:11: note:   stmt 2 MEM[(e_s16 *)_3 + 4B] = _ifc__267;
test.cpp:76:11: note:   stmt 3 MEM[(e_s16 *)_3 + 6B] = _ifc__270;
test.cpp:76:11: note:   stmt 4 MEM[(e_s16 *)_3 + 8B] = _ifc__273;
test.cpp:76:11: note:   stmt 5 MEM[(e_s16 *)_3 + 10B] = _ifc__276;
test.cpp:76:11: note:   stmt 6 MEM[(e_s16 *)_3 + 12B] = _ifc__279;
test.cpp:76:11: note:   stmt 7 MEM[(e_s16 *)_3 + 14B] = _ifc__282;
test.cpp:76:11: note:   children 0x3ec9580
test.cpp:76:11: note: node (external) 0x3ec9580 (max_nunits=1, refcnt=1)
test.cpp:76:11: note:   { _ifc__261, _ifc__264, _ifc__267, _ifc__270,
_ifc__273, _ifc__276, _ifc__279, _ifc__282 }
test.cpp:76:11: note: Cost model analysis: 
0x3c1aee0 _ifc__261 1 times scalar_store costs 12 in body
0x3c1aee0 _ifc__264 1 times scalar_store costs 12 in body
0x3c1aee0 _ifc__267 1 times scalar_store costs 12 in body
0x3c1aee0 _ifc__270 1 times scalar_store costs 12 in body
0x3c1aee0 _ifc__273 1 times scalar_store costs 12 in body
0x3c1aee0 _ifc__276 1 times scalar_store costs 12 in body
0x3c1aee0 _ifc__279 1 times scalar_store costs 12 in body
0x3c1aee0 _ifc__282 1 times scalar_store costs 12 in body
0x3c1aee0 _ifc__261 1 times unaligned_store (misalign -1) costs 12 in body
0x3c1aee0 <unknown> 1 times vec_construct costs 32 in prologue
test.cpp:76:11: note: Cost model analysis for part in loop 1:
  Vector cost: 44
  Scalar cost: 96


int Fast_IDCT (e_s16 * block)
{
  vector(8) short int * vectp.78;
  vector(8) short int * vectp.77;
  e_s32 x0;
  e_s32 x1;
  e_s32 x2;
  e_s32 x3;
  e_s32 x4;
  e_s32 x5;
  e_s32 x6;
  e_s32 x7;
  e_s32 x8;
  e_s32 i;
  long unsigned int i.0_1;
  long unsigned int _2;
  e_s16 * _3;
  unsigned long ivtmp_4;
  unsigned long ivtmp_5;
  short int _10;
  int _11;
  int _12;
  short int _14;
  short int _16;
  short int _17;
  short int _19;
  short int _20;
  short int _22;
  short int _23;
  short int _25;
  short int _26;
  short int _28;
  short int _29;
  long int _31;
  int _34;
  short int _35;
  int _38;
  int _39;
  long int _41;
  long int _43;
  long int _45;
  long int _47;
  long int _49;
  long int _51;
  long int _55;
  long int _57;
  long int _59;
  long int _69;
  long int _70;
  long int _71;
  long int _73;
  long int _74;
  long int _75;
  long int _77;
  long int _78;
  short int _79;
  long int _80;
  long int _81;
  short int _82;
  long int _83;
  long int _84;
  short int _85;
  long int _86;
  long int _87;
  short int _88;
  long int _89;
  long int _90;
  short int _91;
  long int _92;
  long int _93;
  short int _94;
  long int _95;
  long int _96;
  short int _97;
  long int _98;
  long int _99;
  short int _100;
  long int _118;
  bool _121;
  short int pretmp_159;
  int _160;
  short int _ifc__237;
  short int _ifc__240;
  short int _ifc__243;
  short int _ifc__246;
  short int _ifc__249;
  short int _ifc__252;
  short int _ifc__255;
  short int _ifc__258;
  short int _ifc__261;
  short int _ifc__264;
  short int _ifc__267;
  short int _ifc__270;
  short int _ifc__273;
  short int _ifc__276;
  short int _ifc__279;
  short int _ifc__282;
  vector(8) short int _1154;

  <bb 2> [local count: 119292720]:
  _121 = 1;

  <bb 3> [local count: 954449105]:
  # i_122 = PHI <i_9(8), 0(2)>
  # ivtmp_5 = PHI <ivtmp_4(8), 8(2)>
  i.0_1 = (long unsigned int) i_122;
  _2 = i.0_1 * 16;
  _3 = block_7(D) + _2;
  _10 = MEM[(e_s16 *)_3 + 8B];
  _11 = (int) _10;
  _12 = _11 << 11;
  x1_13 = (e_s32) _12;
  _14 = MEM[(e_s16 *)_3 + 12B];
  _17 = MEM[(e_s16 *)_3 + 4B];
  _16 = _14 | _17;
  _20 = MEM[(e_s16 *)_3 + 2B];
  _19 = _16 | _20;
  _23 = MEM[(e_s16 *)_3 + 14B];
  _22 = _19 | _23;
  _26 = MEM[(e_s16 *)_3 + 10B];
  _25 = _22 | _26;
  _29 = MEM[(e_s16 *)_3 + 6B];
  _28 = _25 | _29;
  _118 = (long int) _28;
  _31 = x1_13 | _118;
  pretmp_159 = *_3;
  _160 = (int) pretmp_159;
  _34 = _160 << 3;
  _35 = (short int) _34;
  _ifc__237 = _31 != 0 ? _23 : _35;
  _ifc__240 = _31 != 0 ? _14 : _35;
  _ifc__243 = _31 != 0 ? _26 : _35;
  _ifc__246 = _31 != 0 ? _10 : _35;
  _ifc__249 = _31 != 0 ? _29 : _35;
  _ifc__252 = _31 != 0 ? _17 : _35;
  _ifc__255 = _31 != 0 ? _20 : _35;
  _ifc__258 = _31 == 0 ? _35 : pretmp_159;
  x2_15 = (e_s32) _14;
  x3_18 = (e_s32) _17;
  x4_21 = (e_s32) _20;
  x5_24 = (e_s32) _23;
  x6_27 = (e_s32) _26;
  x7_30 = (e_s32) _29;
  _38 = _160 << 11;
  _39 = _38 + 128;
  x0_40 = (e_s32) _39;
  _41 = x4_21 + x5_24;
  x8_42 = _41 * 565;
  _43 = x4_21 * 2276;
  x4_44 = x8_42 + _43;
  _45 = x5_24 * -3406;
  x5_46 = x8_42 + _45;
  _47 = x6_27 + x7_30;
  x8_48 = _47 * 2408;
  _49 = x6_27 * -799;
  x6_50 = x8_48 + _49;
  _51 = x7_30 * -4017;
  x7_52 = x8_48 + _51;
  x8_53 = x1_13 + x0_40;
  x0_54 = x0_40 - x1_13;
  _55 = x2_15 + x3_18;
  x1_56 = _55 * 1108;
  _57 = x2_15 * -3784;
  x2_58 = x1_56 + _57;
  _59 = x3_18 * 1568;
  x3_60 = x1_56 + _59;
  x1_61 = x4_44 + x6_50;
  x4_62 = x4_44 - x6_50;
  x6_63 = x5_46 + x7_52;
  x5_64 = x5_46 - x7_52;
  x7_65 = x8_53 + x3_60;
  x8_66 = x8_53 - x3_60;
  x3_67 = x0_54 + x2_58;
  x0_68 = x0_54 - x2_58;
  _69 = x4_62 + x5_64;
  _70 = _69 * 181;
  _71 = _70 + 128;
  x2_72 = _71 >> 8;
  _73 = x4_62 - x5_64;
  _74 = _73 * 181;
  _75 = _74 + 128;
  x4_76 = _75 >> 8;
  _77 = x1_61 + x7_65;
  _78 = _77 >> 8;
  _79 = (short int) _78;
  _ifc__261 = _31 != 0 ? _79 : _ifc__258;
  _80 = x3_67 + x2_72;
  _81 = _80 >> 8;
  _82 = (short int) _81;
  _ifc__264 = _31 != 0 ? _82 : _ifc__255;
  _83 = x0_68 + x4_76;
  _84 = _83 >> 8;
  _85 = (short int) _84;
  _ifc__267 = _31 != 0 ? _85 : _ifc__252;
  _86 = x6_63 + x8_66;
  _87 = _86 >> 8;
  _88 = (short int) _87;
  _ifc__270 = _31 != 0 ? _88 : _ifc__249;
  _89 = x8_66 - x6_63;
  _90 = _89 >> 8;
  _91 = (short int) _90;
  _ifc__273 = _31 != 0 ? _91 : _ifc__246;
  _92 = x0_68 - x4_76;
  _93 = _92 >> 8;
  _94 = (short int) _93;
  _ifc__276 = _31 != 0 ? _94 : _ifc__243;
  _95 = x3_67 - x2_72;
  _96 = _95 >> 8;
  _97 = (short int) _96;
  _ifc__279 = _31 != 0 ? _97 : _ifc__240;
  _98 = x7_65 - x1_61;
  _99 = _98 >> 8;
  _100 = (short int) _99;
  _ifc__282 = _31 != 0 ? _100 : _ifc__237;
  _1154 = {_ifc__261, _ifc__264, _ifc__267, _ifc__270, _ifc__273, _ifc__276,
_ifc__279, _ifc__282};
  vectp.78_1155 = _3;
  MEM <vector(8) short int> [(e_s16 *)vectp.78_1155] = _1154;
  i_9 = i_122 + 1;
  ivtmp_4 = ivtmp_5 - 1;
  if (ivtmp_4 != 0)
    goto <bb 8>; [87.50%]
  else
    goto <bb 7>; [12.50%]

  <bb 8> [local count: 835156386]:
  goto <bb 3>; [100.00%]

  <bb 7> [local count: 119292720]:
  return 1;

}

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug tree-optimization/100089] [11 Regression] 30% performance regression for denbench/mp2decoddata2 with -O3
  2021-04-15  6:35 [Bug tree-optimization/100089] New: [11 Performance regression ] 30% for denbench/mp2decoddata2 with -O3 crazylht at gmail dot com
@ 2021-04-15  7:17 ` rguenth at gcc dot gnu.org
  2021-04-27 11:40 ` [Bug tree-optimization/100089] [11/12 " jakub at gcc dot gnu.org
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-04-15  7:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2021-04-15
             Status|UNCONFIRMED                 |NEW
            Summary|[11 Performance regression  |[11 Regression] 30%
                   |] 30% for                   |performance regression for
                   |denbench/mp2decoddata2 with |denbench/mp2decoddata2 with
                   |-O3                         |-O3
   Target Milestone|---                         |11.0

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Indeed loop vectorization throws if-converted bodies at the BB vectorizer as a
last resort (because BB vectorization doesn't do if-conversion itself).  But
the BB vectorizer then uses the if-converted scalar code as the thing to
cost against (costing against the not if-converted loop body isn't really
possible).  To quote

      /* If we applied if-conversion then try to vectorize the
         BB of innermost loops.
         ???  Ideally BB vectorization would learn to vectorize
         control flow by applying if-conversion on-the-fly, the
         following retains the if-converted loop body even when
         only non-if-converted parts took part in BB vectorization.  */
      if (flag_tree_slp_vectorize != 0
          && loop_vectorized_call
          && ! loop->inner)
        {

as a "hack" we could see to scalar cost the always executed part of
the not if-converted loop body and apply the full bias of this cost
vs. the scalar cost of the if-converted body to the scalar cost of the
BB vectorization.  But that's really apples-to-oranges in the end
(as it is now).

Maybe we can cost the whole partly vectorized loop body in this mode
and compare it against the scalar cost of the original loop.  But even
the loop vectorizer costs the if-converted scalar loop, so it is off as well.

Long-term if-conversion needs to be integrated with vectorization so we
can at least keep track of what stmts were originally executed conditional
and what not.

Short-term I'm not sure we can do much.  Doing SLP on the if-converted
body does help in quite some cases.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug tree-optimization/100089] [11/12 Regression] 30% performance regression for denbench/mp2decoddata2 with -O3
  2021-04-15  6:35 [Bug tree-optimization/100089] New: [11 Performance regression ] 30% for denbench/mp2decoddata2 with -O3 crazylht at gmail dot com
  2021-04-15  7:17 ` [Bug tree-optimization/100089] [11 Regression] 30% performance regression " rguenth at gcc dot gnu.org
@ 2021-04-27 11:40 ` jakub at gcc dot gnu.org
  2021-05-12  8:19 ` rsandifo at gcc dot gnu.org
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-04-27 11:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|11.0                        |11.2

--- Comment #2 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 11.1 has been released, retargeting bugs to GCC 11.2.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug tree-optimization/100089] [11/12 Regression] 30% performance regression for denbench/mp2decoddata2 with -O3
  2021-04-15  6:35 [Bug tree-optimization/100089] New: [11 Performance regression ] 30% for denbench/mp2decoddata2 with -O3 crazylht at gmail dot com
  2021-04-15  7:17 ` [Bug tree-optimization/100089] [11 Regression] 30% performance regression " rguenth at gcc dot gnu.org
  2021-04-27 11:40 ` [Bug tree-optimization/100089] [11/12 " jakub at gcc dot gnu.org
@ 2021-05-12  8:19 ` rsandifo at gcc dot gnu.org
  2021-05-12  8:27 ` rguenth at gcc dot gnu.org
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2021-05-12  8:19 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089

--- Comment #3 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
Is this really a costing issue, or should we instead reject the
BB fallback if it leaves any scalar COND_EXPRs around?  This would
be similar to the way that we reject IFN_MASK_LOAD/STORE calls,
except that the COND_EXPR tests would only apply to unvectorised
statements and so would need to be tested after SLP discovery
rather than before it.  (Ideally IFN_MASK_LOAD/STORE would work
like that too.)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug tree-optimization/100089] [11/12 Regression] 30% performance regression for denbench/mp2decoddata2 with -O3
  2021-04-15  6:35 [Bug tree-optimization/100089] New: [11 Performance regression ] 30% for denbench/mp2decoddata2 with -O3 crazylht at gmail dot com
                   ` (2 preceding siblings ...)
  2021-05-12  8:19 ` rsandifo at gcc dot gnu.org
@ 2021-05-12  8:27 ` rguenth at gcc dot gnu.org
  2021-07-28  7:06 ` rguenth at gcc dot gnu.org
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-05-12  8:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to rsandifo@gcc.gnu.org from comment #3)
> Is this really a costing issue, or should we instead reject the
> BB fallback if it leaves any scalar COND_EXPRs around?  This would
> be similar to the way that we reject IFN_MASK_LOAD/STORE calls,
> except that the COND_EXPR tests would only apply to unvectorised
> statements and so would need to be tested after SLP discovery
> rather than before it.  (Ideally IFN_MASK_LOAD/STORE would work
> like that too.)

I suppose we could do that, but then I'm not sure how exactly we'd do it ;)

Good idea anyway.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug tree-optimization/100089] [11/12 Regression] 30% performance regression for denbench/mp2decoddata2 with -O3
  2021-04-15  6:35 [Bug tree-optimization/100089] New: [11 Performance regression ] 30% for denbench/mp2decoddata2 with -O3 crazylht at gmail dot com
                   ` (3 preceding siblings ...)
  2021-05-12  8:27 ` rguenth at gcc dot gnu.org
@ 2021-07-28  7:06 ` rguenth at gcc dot gnu.org
  2021-08-23 12:47 ` rguenth at gcc dot gnu.org
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-07-28  7:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|11.2                        |11.3

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 11.2 is being released, retargeting bugs to GCC 11.3

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug tree-optimization/100089] [11/12 Regression] 30% performance regression for denbench/mp2decoddata2 with -O3
  2021-04-15  6:35 [Bug tree-optimization/100089] New: [11 Performance regression ] 30% for denbench/mp2decoddata2 with -O3 crazylht at gmail dot com
                   ` (4 preceding siblings ...)
  2021-07-28  7:06 ` rguenth at gcc dot gnu.org
@ 2021-08-23 12:47 ` rguenth at gcc dot gnu.org
  2021-08-24  1:28 ` cvs-commit at gcc dot gnu.org
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-08-23 12:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org
             Status|NEW                         |ASSIGNED

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
Working on this now.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug tree-optimization/100089] [11/12 Regression] 30% performance regression for denbench/mp2decoddata2 with -O3
  2021-04-15  6:35 [Bug tree-optimization/100089] New: [11 Performance regression ] 30% for denbench/mp2decoddata2 with -O3 crazylht at gmail dot com
                   ` (5 preceding siblings ...)
  2021-08-23 12:47 ` rguenth at gcc dot gnu.org
@ 2021-08-24  1:28 ` cvs-commit at gcc dot gnu.org
  2021-08-24  9:36 ` rguenth at gcc dot gnu.org
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-08-24  1:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089

--- Comment #7 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:819b7c3a339e3bdaf85cd55954c5536bd98aae09

commit r12-3103-g819b7c3a339e3bdaf85cd55954c5536bd98aae09
Author: liuhongt <hongtao.liu@intel.com>
Date:   Wed Aug 4 16:39:31 2021 +0800

    Disable slp in loop vectorizer when cost model is very-cheap.

    Performance impact for the commit with option:
    -march=x86-64 -O2 -ftree-vectorize -fvect-cost-model=very-cheap

    SPEC2017 fprate
    503.bwaves_r        BuildSame
    507.cactuBSSN_r         -0.04
    508.namd_r               0.14
    510.parest_r            -0.54
    511.povray_r             0.10
    519.lbm_r           BuildSame
    521.wrf_r                0.64
    526.blender_r           -0.32
    527.cam4_r               0.17
    538.imagick_r            0.09
    544.nab_r           BuildSame
    549.fotonik3d_r     BuildSame
    554.roms_r          BuildSame
    997.specrand_fr         -0.09
    Geometric mean:  0.02

    SPEC2017 intrate
    500.perlbench_r          0.26
    502.gcc_r                0.21
    505.mcf_r               -0.09
    520.omnetpp_r       BuildSame
    523.xalancbmk_r     BuildSame
    525.x264_r              -0.41
    531.deepsjeng_r     BuildSame
    541.leela_r              0.13
    548.exchange2_r     BuildSame
    557.xz_r            BuildSame
    999.specrand_ir     BuildSame
    Geometric mean:  0.02

    EEMBC: no regression, only improvement or build the same, the below is
    improved benchmarks.

    mp2decoddata1       7.59
    mp2decoddata2       31.80
    mp2decoddata3       12.15
    mp2decoddata4       11.16
    mp2decoddata5       11.19
    mp2decoddata1       7.06
    mp2decoddata2       24.12
    mp2decoddata3       10.83
    mp2decoddata4       10.04
    mp2decoddata5       10.07

    gcc/ChangeLog:

            PR tree-optimization/100089
            * tree-vectorizer.c (try_vectorize_loop_1): Disable slp in
            loop vectorizer when cost model is very-cheap.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug tree-optimization/100089] [11/12 Regression] 30% performance regression for denbench/mp2decoddata2 with -O3
  2021-04-15  6:35 [Bug tree-optimization/100089] New: [11 Performance regression ] 30% for denbench/mp2decoddata2 with -O3 crazylht at gmail dot com
                   ` (6 preceding siblings ...)
  2021-08-24  1:28 ` cvs-commit at gcc dot gnu.org
@ 2021-08-24  9:36 ` rguenth at gcc dot gnu.org
  2021-08-24 10:34 ` rguenth at gcc dot gnu.org
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-08-24  9:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
So if we agree to a sane way to cost branchy code on the scalar side then it
should be possible to compare the scalar cost of the not if-converted inner
loop body against the full partially vectorized and if-converted inner loop
body.

vect_bb_vectorization_profitable_p would have to add the cost of the scalar
stmts not covered by vectorization - this set is conveniently available as
the set of stmts not having the visited flag set before we clear it here:

vect_bb_vectorization_profitable_p (bb_vec_info bb_vinfo,
                                    vec<slp_instance> slp_instances)
{
...
  /* Unset visited flag.  */
  stmt_info_for_cost *cost;
  FOR_EACH_VEC_ELT (scalar_costs, i, cost)
    gimple_set_visited  (cost->stmt_info->stmt, false);

so we'd need to walk over all stmts in the BB and add the cost of the
not marked stmts to the vector cost.  We'd want to force a single
SLP "subgraph" in this mode to avoid going over the whole block
multiple times.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug tree-optimization/100089] [11/12 Regression] 30% performance regression for denbench/mp2decoddata2 with -O3
  2021-04-15  6:35 [Bug tree-optimization/100089] New: [11 Performance regression ] 30% for denbench/mp2decoddata2 with -O3 crazylht at gmail dot com
                   ` (7 preceding siblings ...)
  2021-08-24  9:36 ` rguenth at gcc dot gnu.org
@ 2021-08-24 10:34 ` rguenth at gcc dot gnu.org
  2021-08-24 12:23 ` cvs-commit at gcc dot gnu.org
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-08-24 10:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089

--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 51350
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51350&action=edit
patch

This implements scanning for not vectorized COND_EXPRs with
-fvect-cost-model=very-cheap when vectorizing if-converted loop bodies.  It
also gives
vect_bb_vectorization_profitable_p enough info to eventually do something
more fancy (namely the original not if-converted loop body).

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug tree-optimization/100089] [11/12 Regression] 30% performance regression for denbench/mp2decoddata2 with -O3
  2021-04-15  6:35 [Bug tree-optimization/100089] New: [11 Performance regression ] 30% for denbench/mp2decoddata2 with -O3 crazylht at gmail dot com
                   ` (8 preceding siblings ...)
  2021-08-24 10:34 ` rguenth at gcc dot gnu.org
@ 2021-08-24 12:23 ` cvs-commit at gcc dot gnu.org
  2021-08-24 12:27 ` rguenth at gcc dot gnu.org
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-08-24 12:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089

--- Comment #10 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:9216ee6d1195d48388f825cf1b072e570129cbbe

commit r12-3116-g9216ee6d1195d48388f825cf1b072e570129cbbe
Author: Richard Biener <rguenther@suse.de>
Date:   Tue Aug 24 12:25:25 2021 +0200

    tree-optimization/100089 - avoid leaving scalar if-converted code around

    This avoids leaving scalar if-converted code around for the case
    of BB vectorizing an if-converted loop body when using the very-cheap
    cost model.  In this case we scan not vectorized scalar stmts in
    the basic-block vectorized for COND_EXPRs and force the vectorization
    to be marked as not profitable.

    The patch also makes sure to always consider all BB vectorization
    subgraphs together for costing purposes when vectorizing an
    if-converted loop body.

    2021-08-24  Richard Biener  <rguenther@suse.de>

            PR tree-optimization/100089
            * tree-vectorizer.h (vect_slp_bb): Rename to ...
            (vect_slp_if_converted_bb): ... this and get the original
            loop as new argument.
            * tree-vectorizer.c (try_vectorize_loop_1): Revert previous fix,
            pass original loop to vect_slp_if_converted_bb.
            * tree-vect-slp.c (vect_bb_vectorization_profitable_p):
            If orig_loop was passed scan the not vectorized stmts
            for COND_EXPRs and force not profitable if found.
            (vect_slp_region): Pass down all SLP instances to costing
            if orig_loop was specified.
            (vect_slp_bbs): Pass through orig_loop.
            (vect_slp_bb): Rename to ...
            (vect_slp_if_converted_bb): ... this and get the original
            loop as new argument.
            (vect_slp_function): Adjust.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug tree-optimization/100089] [11/12 Regression] 30% performance regression for denbench/mp2decoddata2 with -O3
  2021-04-15  6:35 [Bug tree-optimization/100089] New: [11 Performance regression ] 30% for denbench/mp2decoddata2 with -O3 crazylht at gmail dot com
                   ` (9 preceding siblings ...)
  2021-08-24 12:23 ` cvs-commit at gcc dot gnu.org
@ 2021-08-24 12:27 ` rguenth at gcc dot gnu.org
  2021-08-30 12:05 ` rguenth at gcc dot gnu.org
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-08-24 12:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089

--- Comment #11 from Richard Biener <rguenth at gcc dot gnu.org> ---
So this fixes it for -fvect-cost-model=very-cheap.  One could argue that we
should enable the code for all cost models, fixing the -O3 regression (and
backportable to the branch).

I'll see to experiment with "fancy" costing of the stray vectorizations.  I'll
also note that the scanning for load/store (and other ifns not supported)
could be handled in the costing as well (but the costing would need to run
also for -fno-vect-cost-model then, just the result ignored if not forced).
I'm talking about

          bool require_loop_vectorize = false;
          for (gimple_stmt_iterator gsi = gsi_start_bb (bb);
               !gsi_end_p (gsi); gsi_next (&gsi))
            {
              gimple *stmt = gsi_stmt (gsi);
              gcall *call = dyn_cast <gcall *> (stmt);
              if (call && gimple_call_internal_p (call))
                {
                  internal_fn ifn = gimple_call_internal_fn (call);
                  if (ifn == IFN_MASK_LOAD || ifn == IFN_MASK_STORE
                      /* Don't keep the if-converted parts when the ifn with
                         specifc type is not supported by the backend.  */
                      || (direct_internal_fn_p (ifn)
                          && !direct_internal_fn_supported_p
                          (call, OPTIMIZE_FOR_SPEED)))
                    {
                      require_loop_vectorize = true;
                      break;
                    }
                }

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug tree-optimization/100089] [11/12 Regression] 30% performance regression for denbench/mp2decoddata2 with -O3
  2021-04-15  6:35 [Bug tree-optimization/100089] New: [11 Performance regression ] 30% for denbench/mp2decoddata2 with -O3 crazylht at gmail dot com
                   ` (10 preceding siblings ...)
  2021-08-24 12:27 ` rguenth at gcc dot gnu.org
@ 2021-08-30 12:05 ` rguenth at gcc dot gnu.org
  2021-08-31 10:27 ` rguenth at gcc dot gnu.org
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-08-30 12:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089
Bug 100089 depends on bug 102128, which changed state.

Bug 102128 Summary: [12 Regression] Huge performance drop for 519.lbm_r since r12-3116-g9216ee6d1195d48388f825cf1b072e570129cbbe
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102128

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug tree-optimization/100089] [11/12 Regression] 30% performance regression for denbench/mp2decoddata2 with -O3
  2021-04-15  6:35 [Bug tree-optimization/100089] New: [11 Performance regression ] 30% for denbench/mp2decoddata2 with -O3 crazylht at gmail dot com
                   ` (11 preceding siblings ...)
  2021-08-30 12:05 ` rguenth at gcc dot gnu.org
@ 2021-08-31 10:27 ` rguenth at gcc dot gnu.org
  2022-01-21 12:34 ` rguenth at gcc dot gnu.org
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-08-31 10:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089
Bug 100089 depends on bug 102142, which changed state.

Bug 102142 Summary: [12 Regression] ICE Segmentation fault since r12-3222-g89f33f44addbf985
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102142

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug tree-optimization/100089] [11/12 Regression] 30% performance regression for denbench/mp2decoddata2 with -O3
  2021-04-15  6:35 [Bug tree-optimization/100089] New: [11 Performance regression ] 30% for denbench/mp2decoddata2 with -O3 crazylht at gmail dot com
                   ` (12 preceding siblings ...)
  2021-08-31 10:27 ` rguenth at gcc dot gnu.org
@ 2022-01-21 12:34 ` rguenth at gcc dot gnu.org
  2022-01-21 13:23 ` cvs-commit at gcc dot gnu.org
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-01-21 12:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug tree-optimization/100089] [11/12 Regression] 30% performance regression for denbench/mp2decoddata2 with -O3
  2021-04-15  6:35 [Bug tree-optimization/100089] New: [11 Performance regression ] 30% for denbench/mp2decoddata2 with -O3 crazylht at gmail dot com
                   ` (13 preceding siblings ...)
  2022-01-21 12:34 ` rguenth at gcc dot gnu.org
@ 2022-01-21 13:23 ` cvs-commit at gcc dot gnu.org
  2022-01-21 13:24 ` [Bug tree-optimization/100089] [11 " rguenth at gcc dot gnu.org
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-01-21 13:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089

--- Comment #12 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:199cd0e0f8744ca1e61a95987b2d020a592a46d9

commit r12-6795-g199cd0e0f8744ca1e61a95987b2d020a592a46d9
Author: Richard Biener <rguenther@suse.de>
Date:   Fri Jan 21 13:29:06 2022 +0100

    tree-optimization/100089 - BB vectorization of if-converted loop bodies

    The PR complains that when we only partially BB vectorize an
    if-converted loop body that this can leave unvectorized code
    unconditionally executed and thus effectively slow down code.
    For -O2 we already mitigated the issue by not doing BB vectorization
    when not all if-converted stmts were covered but the issue is
    present with -O3 as well.  Thus the following simply extends the
    fix to cover all but the unlimited cost models.  It is after all
    very likely that we vectorize some stmts, if only a single
    paired store.

    2022-01-21  Richard Biener  <rguenther@suse.de>

            PR tree-optimization/100089
            * tree-vect-slp.cc (vect_slp_region): Reject BB vectorization
            of if-converted loops with unvectorized COND_EXPRs for
            all but the unlimited cost models.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug tree-optimization/100089] [11 Regression] 30% performance regression for denbench/mp2decoddata2 with -O3
  2021-04-15  6:35 [Bug tree-optimization/100089] New: [11 Performance regression ] 30% for denbench/mp2decoddata2 with -O3 crazylht at gmail dot com
                   ` (14 preceding siblings ...)
  2022-01-21 13:23 ` cvs-commit at gcc dot gnu.org
@ 2022-01-21 13:24 ` rguenth at gcc dot gnu.org
  2022-04-21  7:49 ` rguenth at gcc dot gnu.org
  2023-05-29 10:04 ` jakub at gcc dot gnu.org
  17 siblings, 0 replies; 19+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-01-21 13:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[11/12 Regression] 30%      |[11 Regression] 30%
                   |performance regression for  |performance regression for
                   |denbench/mp2decoddata2 with |denbench/mp2decoddata2 with
                   |-O3                         |-O3
      Known to fail|                            |11.2.1
      Known to work|                            |12.0

--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
Now fixed for GCC 12 even with -O3.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug tree-optimization/100089] [11 Regression] 30% performance regression for denbench/mp2decoddata2 with -O3
  2021-04-15  6:35 [Bug tree-optimization/100089] New: [11 Performance regression ] 30% for denbench/mp2decoddata2 with -O3 crazylht at gmail dot com
                   ` (15 preceding siblings ...)
  2022-01-21 13:24 ` [Bug tree-optimization/100089] [11 " rguenth at gcc dot gnu.org
@ 2022-04-21  7:49 ` rguenth at gcc dot gnu.org
  2023-05-29 10:04 ` jakub at gcc dot gnu.org
  17 siblings, 0 replies; 19+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-04-21  7:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|11.3                        |11.4

--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 11.3 is being released, retargeting bugs to GCC 11.4.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Bug tree-optimization/100089] [11 Regression] 30% performance regression for denbench/mp2decoddata2 with -O3
  2021-04-15  6:35 [Bug tree-optimization/100089] New: [11 Performance regression ] 30% for denbench/mp2decoddata2 with -O3 crazylht at gmail dot com
                   ` (16 preceding siblings ...)
  2022-04-21  7:49 ` rguenth at gcc dot gnu.org
@ 2023-05-29 10:04 ` jakub at gcc dot gnu.org
  17 siblings, 0 replies; 19+ messages in thread
From: jakub at gcc dot gnu.org @ 2023-05-29 10:04 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100089

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|11.4                        |11.5

--- Comment #15 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 11.4 is being released, retargeting bugs to GCC 11.5.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2023-05-29 10:04 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-15  6:35 [Bug tree-optimization/100089] New: [11 Performance regression ] 30% for denbench/mp2decoddata2 with -O3 crazylht at gmail dot com
2021-04-15  7:17 ` [Bug tree-optimization/100089] [11 Regression] 30% performance regression " rguenth at gcc dot gnu.org
2021-04-27 11:40 ` [Bug tree-optimization/100089] [11/12 " jakub at gcc dot gnu.org
2021-05-12  8:19 ` rsandifo at gcc dot gnu.org
2021-05-12  8:27 ` rguenth at gcc dot gnu.org
2021-07-28  7:06 ` rguenth at gcc dot gnu.org
2021-08-23 12:47 ` rguenth at gcc dot gnu.org
2021-08-24  1:28 ` cvs-commit at gcc dot gnu.org
2021-08-24  9:36 ` rguenth at gcc dot gnu.org
2021-08-24 10:34 ` rguenth at gcc dot gnu.org
2021-08-24 12:23 ` cvs-commit at gcc dot gnu.org
2021-08-24 12:27 ` rguenth at gcc dot gnu.org
2021-08-30 12:05 ` rguenth at gcc dot gnu.org
2021-08-31 10:27 ` rguenth at gcc dot gnu.org
2022-01-21 12:34 ` rguenth at gcc dot gnu.org
2022-01-21 13:23 ` cvs-commit at gcc dot gnu.org
2022-01-21 13:24 ` [Bug tree-optimization/100089] [11 " rguenth at gcc dot gnu.org
2022-04-21  7:49 ` rguenth at gcc dot gnu.org
2023-05-29 10:04 ` jakub at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).