public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails
@ 2021-05-27  7:48 linkw at gcc dot gnu.org
  2021-05-28  6:39 ` [Bug tree-optimization/100794] " rguenth at gcc dot gnu.org
                   ` (11 more replies)
  0 siblings, 12 replies; 13+ messages in thread
From: linkw at gcc dot gnu.org @ 2021-05-27  7:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794

            Bug ID: 100794
           Summary: suboptimal code due to missing pre2 when vectorization
                    fails
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: linkw at gcc dot gnu.org
  Target Milestone: ---

I was investigating one degradation from SPEC2017 554.roms_r on Power9, the
baseline is -O2 -mcpu=power9 -ffast-math while the test line is -O2
-mcpu=power9 -ffast-math -ftree-vectorize -fvect-cost-model=very-cheap.

One reduced C test case is as below:

#include <math.h>

#define MIN fmin
#define MAX fmax

#define N1 400
#define N2 600
#define N3 800

extern int j_0, j_n, i_0, i_n;
extern double diff2[N1][N2];
extern double dZdx[N1][N2][N3];
extern double dTdz[N1][N2][N3];
extern double dTdx[N1][N2][N3];
extern double FS[N1][N2][N3];

void
test (int k1, int k2)
{
  for (int j = j_0; j < j_n; j++)
    for (int i = i_0; i < i_n; i++)
      {
        double cff = 0.5 * diff2[j][i];
        double cff1 = MIN (dZdx[k1][j][i], 0.0);
        double cff2 = MIN (dZdx[k2][j][i + 1], 0.0);
        double cff3 = MAX (dZdx[k2][j][i], 0.0);
        double cff4 = MAX (dZdx[k1][j][i + 1], 0.0);

        FS[k2][j][i]
          = cff
            * (cff1 * (cff1 * dTdz[k2][j][i] - dTdx[k1][j][i])
               + cff2 * (cff2 * dTdz[k2][j][i] - dTdx[k2][j][i + 1])
               + cff3 * (cff3 * dTdz[k2][j][i] - dTdx[k2][j][i])
               + cff4 * (cff4 * dTdz[k2][j][i] - dTdx[k1][j][i + 1]));
      }
}

O2 fast:

  <bb 8> [local count: 955630225]:
  # prephitmp_107 = PHI <_6(8), pretmp_106(7)>
  # prephitmp_109 = PHI <_4(8), pretmp_108(7)>
  # prephitmp_111 = PHI <_23(8), pretmp_110(7)>
  # prephitmp_113 = PHI <_13(8), pretmp_112(7)>
  # doloop.9_55 = PHI <doloop.9_57(8), doloop.9_105(7)>
  # ivtmp.33_102 = PHI <ivtmp.33_101(8), ivtmp.44_70(7)>
  _87 = (double[400][600] *) ivtmp.45_60;
  _1 = MEM[(double *)_87 + ivtmp.33_102 * 1];
  cff_38 = _1 * 5.0e-1;
  cff1_40 = MIN_EXPR <prephitmp_107, 0.0>;
  _4 = MEM[(double *)&dZdx + 8B + ivtmp.33_102 * 1];
  cff2_42 = MIN_EXPR <_4, 0.0>;
  cff3_43 = MAX_EXPR <prephitmp_109, 0.0>;
  _6 = MEM[(double *)_79 + ivtmp.33_102 * 1];
  cff4_44 = MAX_EXPR <_6, 0.0>;


O2 fast vect (very-cheap)
  <bb 6> [local count: 955630225]:
  # doloop.9_55 = PHI <doloop.9_57(6), doloop.9_105(5)>
  # ivtmp.37_102 = PHI <ivtmp.37_101(6), ivtmp.46_72(5)>
  # ivtmp.38_92 = PHI <ivtmp.38_91(6), ivtmp.38_90(5)>
  _77 = (double[400][600] *) ivtmp.48_62;
  _1 = MEM[(double *)_77 + ivtmp.37_102 * 1];
  cff_38 = _1 * 5.0e-1;
  _2 = MEM[(double *)&dZdx + ivtmp.38_92 * 1];   // redundant load
  cff1_40 = MIN_EXPR <_2, 0.0>;
  _4 = MEM[(double *)&dZdx + 8B + ivtmp.37_102 * 1];
  cff2_42 = MIN_EXPR <_4, 0.0>;
  _5 = MEM[(double *)&dZdx + ivtmp.37_102 * 1];  // redundant load 
  cff3_43 = MAX_EXPR <_5, 0.0>;
  _6 = MEM[(double *)&dZdx + 8B + ivtmp.38_92 * 1];
  cff4_44 = MAX_EXPR <_6, 0.0>;


I found the root cause is that: in the baseline version, PRE makes it to reuse
some load result from previous iterations, it saves some loads. while in the
test line version, with the check below:

      /* Inhibit the use of an inserted PHI on a loop header when
         the address of the memory reference is a simple induction
         variable.  In other cases the vectorizer won't do anything
         anyway (either it's loop invariant or a complicated
         expression).  */
      if (sprime
          && TREE_CODE (sprime) == SSA_NAME
          && do_pre
          && (flag_tree_loop_vectorize || flag_tree_parallelize_loops > 1)

PRE doesn't optimize it to avoid introducing loop carried dependence. It makes
sense. But unfortunately the expected downstream loop vectorization isn't
performed on the given loop since with "very-cheap" cost model, it doesn't
allow vectorizer to peel for niters. Later there seems no downstream pass which
is trying to optimize it, it eventually results in sub-optimal code.

To rerun pre once after loop vectorization did fix the degradation, but not
sure it's practical, since iterating pre seems much time-consuming. Or tagging
this kind of loop and later just run pre on the tagged one? It seems also not
practical to predict one loop whether can be loop-vectorized later. Also not
sure whether there are some passes which can be taught for this.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails
  2021-05-27  7:48 [Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails linkw at gcc dot gnu.org
@ 2021-05-28  6:39 ` rguenth at gcc dot gnu.org
  2021-05-28  7:29 ` linkw at gcc dot gnu.org
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-05-28  6:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
There's predictive commoning which can do similar transforms and runs after
vectorization.  It might be it doesn't handle these "simple" cases or that
loop dependence info is not up to the task there.

Another option is to avoid the PRE guard with the (very) cheap cost model
at the expense of not vectorizing affected loops.

So it's just one of those classical phase ordering issues that you can
try to squash with running passes over and over again ...

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails
  2021-05-27  7:48 [Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails linkw at gcc dot gnu.org
  2021-05-28  6:39 ` [Bug tree-optimization/100794] " rguenth at gcc dot gnu.org
@ 2021-05-28  7:29 ` linkw at gcc dot gnu.org
  2021-05-28  7:36 ` rguenther at suse dot de
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: linkw at gcc dot gnu.org @ 2021-05-28  7:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794

--- Comment #2 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #1)

Thanks for the comments!

> There's predictive commoning which can do similar transforms and runs after
> vectorization.  It might be it doesn't handle these "simple" cases or that
> loop dependence info is not up to the task there.
> 

pcom does fix this problem, but it's enabled by default at -O3. Could it be
considered to be run at O2? Or enabled at O2 at some conditions such as: only
for one loop which skips loop carried optimization and isn't vectorized
further?

> Another option is to avoid the PRE guard with the (very) cheap cost model
> at the expense of not vectorizing affected loops.
> 

OK, I will benchmark this to see its impact. For the particular loops in
554.roms_r, they can be vectorized at cheap cost model, this bmk got improved
at cheap cost model on both Power8 and Power9 (a bit though). So I will just
test the impact on very cheap cost model.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails
  2021-05-27  7:48 [Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails linkw at gcc dot gnu.org
  2021-05-28  6:39 ` [Bug tree-optimization/100794] " rguenth at gcc dot gnu.org
  2021-05-28  7:29 ` linkw at gcc dot gnu.org
@ 2021-05-28  7:36 ` rguenther at suse dot de
  2021-05-28 11:30 ` linkw at gcc dot gnu.org
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenther at suse dot de @ 2021-05-28  7:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794

--- Comment #3 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 28 May 2021, linkw at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
> 
> --- Comment #2 from Kewen Lin <linkw at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #1)
> 
> Thanks for the comments!
> 
> > There's predictive commoning which can do similar transforms and runs after
> > vectorization.  It might be it doesn't handle these "simple" cases or that
> > loop dependence info is not up to the task there.
> > 
> 
> pcom does fix this problem, but it's enabled by default at -O3. Could it be
> considered to be run at O2? Or enabled at O2 at some conditions such as: only
> for one loop which skips loop carried optimization and isn't vectorized
> further?

I think pcom should be enabled when vectorization is due to the 
interaction with PRE.  It could be tamed down (it can do peeling/unrolling 
which is why it is -O3) based on the vectorizer cost model active
if only implicitely enabled ...   Things will get a bit messy I guess.

> > Another option is to avoid the PRE guard with the (very) cheap cost model
> > at the expense of not vectorizing affected loops.
> > 
> 
> OK, I will benchmark this to see its impact. For the particular loops in
> 554.roms_r, they can be vectorized at cheap cost model, this bmk got improved
> at cheap cost model on both Power8 and Power9 (a bit though). So I will just
> test the impact on very cheap cost model.

So another thing to benchmark would be to enable pcom but make sure

  /* Determine the unroll factor, and if the loop should be unrolled, 
ensure
     that its number of iterations is divisible by the factor.  */
  unroll_factor = determine_unroll_factor (chains);
  scev_reset ();
  unroll = (unroll_factor > 1
            && can_unroll_loop_p (loop, unroll_factor, &desc));

is false for the cheap and very-cheap cost models unless
flag_predictive_commoning is active.

It's probably also a good idea to investigate whether the
update_ssa calls in pcom can be delayed to until after all transforms
have been done.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails
  2021-05-27  7:48 [Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails linkw at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2021-05-28  7:36 ` rguenther at suse dot de
@ 2021-05-28 11:30 ` linkw at gcc dot gnu.org
  2021-05-28 12:23 ` rguenther at suse dot de
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: linkw at gcc dot gnu.org @ 2021-05-28 11:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794

--- Comment #4 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to rguenther@suse.de from comment #3)
> On Fri, 28 May 2021, linkw at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
> > 
> > --- Comment #2 from Kewen Lin <linkw at gcc dot gnu.org> ---
> > (In reply to Richard Biener from comment #1)
> > 
> > Thanks for the comments!
> > 
> > > There's predictive commoning which can do similar transforms and runs after
> > > vectorization.  It might be it doesn't handle these "simple" cases or that
> > > loop dependence info is not up to the task there.
> > > 
> > 
> > pcom does fix this problem, but it's enabled by default at -O3. Could it be
> > considered to be run at O2? Or enabled at O2 at some conditions such as: only
> > for one loop which skips loop carried optimization and isn't vectorized
> > further?
> 
> I think pcom should be enabled when vectorization is due to the 
> interaction with PRE.  It could be tamed down (it can do peeling/unrolling 
> which is why it is -O3) based on the vectorizer cost model active
> if only implicitely enabled ...   Things will get a bit messy I guess.
> 

Good point! I prefer this idea to the one guarding cost model in sccvn code. 

> > > Another option is to avoid the PRE guard with the (very) cheap cost model
> > > at the expense of not vectorizing affected loops.
> > > 
> > 
> > OK, I will benchmark this to see its impact. For the particular loops in
> > 554.roms_r, they can be vectorized at cheap cost model, this bmk got improved
> > at cheap cost model on both Power8 and Power9 (a bit though). So I will just
> > test the impact on very cheap cost model.
> 
> So another thing to benchmark would be to enable pcom but make sure
> 
>   /* Determine the unroll factor, and if the loop should be unrolled, 
> ensure
>      that its number of iterations is divisible by the factor.  */
>   unroll_factor = determine_unroll_factor (chains);
>   scev_reset ();
>   unroll = (unroll_factor > 1
>             && can_unroll_loop_p (loop, unroll_factor, &desc));
> 
> is false for the cheap and very-cheap cost models unless
> flag_predictive_commoning is active.
> 

Thanks for the hints! One question: could we just enable non-unroll version of
pcom if it's enabled by flag_tree_loop_vectorize implicitly without considering
vect cost model? Although the very-cheap and cheap cost model are very likely
associated to O2, users still can try dynamic or unlimited cost model at O2, or
very-cheap/cheap cost model at O3, it seems not good to map cost model onto
unroll decision here directly. Or maybe we check the optimization level? such
as:

  virtual bool
  gate (function *)
  {
    if (flag_predictive_commoning != 0)
      return true;
    if (flag_tree_loop_vectorize)
      {
        allow_unroll_p = optimize > 2;
        return true;
      }
    return false;
  }


> It's probably also a good idea to investigate whether the
> update_ssa calls in pcom can be delayed to until after all transforms
> have been done.

OK, will check this later.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails
  2021-05-27  7:48 [Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails linkw at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2021-05-28 11:30 ` linkw at gcc dot gnu.org
@ 2021-05-28 12:23 ` rguenther at suse dot de
  2021-05-31  6:05 ` linkw at gcc dot gnu.org
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenther at suse dot de @ 2021-05-28 12:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794

--- Comment #5 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 28 May 2021, linkw at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
> 
> --- Comment #4 from Kewen Lin <linkw at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #3)
> > On Fri, 28 May 2021, linkw at gcc dot gnu.org wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
> > > 
> > > --- Comment #2 from Kewen Lin <linkw at gcc dot gnu.org> ---
> > > (In reply to Richard Biener from comment #1)
> > > 
> > > Thanks for the comments!
> > > 
> > > > There's predictive commoning which can do similar transforms and runs after
> > > > vectorization.  It might be it doesn't handle these "simple" cases or that
> > > > loop dependence info is not up to the task there.
> > > > 
> > > 
> > > pcom does fix this problem, but it's enabled by default at -O3. Could it be
> > > considered to be run at O2? Or enabled at O2 at some conditions such as: only
> > > for one loop which skips loop carried optimization and isn't vectorized
> > > further?
> > 
> > I think pcom should be enabled when vectorization is due to the 
> > interaction with PRE.  It could be tamed down (it can do peeling/unrolling 
> > which is why it is -O3) based on the vectorizer cost model active
> > if only implicitely enabled ...   Things will get a bit messy I guess.
> > 
> 
> Good point! I prefer this idea to the one guarding cost model in sccvn code. 
> 
> > > > Another option is to avoid the PRE guard with the (very) cheap cost model
> > > > at the expense of not vectorizing affected loops.
> > > > 
> > > 
> > > OK, I will benchmark this to see its impact. For the particular loops in
> > > 554.roms_r, they can be vectorized at cheap cost model, this bmk got improved
> > > at cheap cost model on both Power8 and Power9 (a bit though). So I will just
> > > test the impact on very cheap cost model.
> > 
> > So another thing to benchmark would be to enable pcom but make sure
> > 
> >   /* Determine the unroll factor, and if the loop should be unrolled, 
> > ensure
> >      that its number of iterations is divisible by the factor.  */
> >   unroll_factor = determine_unroll_factor (chains);
> >   scev_reset ();
> >   unroll = (unroll_factor > 1
> >             && can_unroll_loop_p (loop, unroll_factor, &desc));
> > 
> > is false for the cheap and very-cheap cost models unless
> > flag_predictive_commoning is active.
> > 
> 
> Thanks for the hints! One question: could we just enable non-unroll 
> version of pcom if it's enabled by flag_tree_loop_vectorize implicitly 
> without considering vect cost model? Although the very-cheap and cheap 
> cost model are very likely associated to O2, users still can try dynamic 
> or unlimited cost model at O2, or very-cheap/cheap cost model at O3, it 
> seems not good to map cost model onto unroll decision here directly. Or 
> maybe we check the optimization level? such as:
> 
>   virtual bool
>   gate (function *)
>   {
>     if (flag_predictive_commoning != 0)
>       return true;
>     if (flag_tree_loop_vectorize)
>       {
>         allow_unroll_p = optimize > 2;

But what about -O2 -ftree-vectorize -fno-predictive-commoning?  IMHO
we want to check global_options_set and for "implicit" pcom do not
allow unrolling and for explicitely disabled pcom do not do any
pcom at all.

Richard.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails
  2021-05-27  7:48 [Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails linkw at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2021-05-28 12:23 ` rguenther at suse dot de
@ 2021-05-31  6:05 ` linkw at gcc dot gnu.org
  2021-05-31  6:06 ` linkw at gcc dot gnu.org
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: linkw at gcc dot gnu.org @ 2021-05-31  6:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794

--- Comment #6 from Kewen Lin <linkw at gcc dot gnu.org> ---
Created attachment 50894
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50894&action=edit
Method 1, implicitly enable pcom without unrolling once loop vectorization is
enabled but pcom isn't set explicitly.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails
  2021-05-27  7:48 [Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails linkw at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2021-05-31  6:05 ` linkw at gcc dot gnu.org
@ 2021-05-31  6:06 ` linkw at gcc dot gnu.org
  2021-05-31  6:08 ` linkw at gcc dot gnu.org
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: linkw at gcc dot gnu.org @ 2021-05-31  6:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794

--- Comment #7 from Kewen Lin <linkw at gcc dot gnu.org> ---
Created attachment 50895
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50895&action=edit
Method 2, let pre generate loop carried dependence for very cheap and cheap
cost model.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails
  2021-05-27  7:48 [Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails linkw at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2021-05-31  6:06 ` linkw at gcc dot gnu.org
@ 2021-05-31  6:08 ` linkw at gcc dot gnu.org
  2021-05-31  6:18 ` linkw at gcc dot gnu.org
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: linkw at gcc dot gnu.org @ 2021-05-31  6:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794

--- Comment #8 from Kewen Lin <linkw at gcc dot gnu.org> ---
Created attachment 50896
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50896&action=edit
M1 M2 SPEC2017 P9 eval result

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails
  2021-05-27  7:48 [Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails linkw at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2021-05-31  6:08 ` linkw at gcc dot gnu.org
@ 2021-05-31  6:18 ` linkw at gcc dot gnu.org
  2021-05-31  6:19 ` linkw at gcc dot gnu.org
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: linkw at gcc dot gnu.org @ 2021-05-31  6:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794

--- Comment #9 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to rguenther@suse.de from comment #5)
> On Fri, 28 May 2021, linkw at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
> > 
> > --- Comment #4 from Kewen Lin <linkw at gcc dot gnu.org> ---
> > (In reply to rguenther@suse.de from comment #3)
> > > On Fri, 28 May 2021, linkw at gcc dot gnu.org wrote:
> > > 
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
> > > > 
> > > > --- Comment #2 from Kewen Lin <linkw at gcc dot gnu.org> ---
> > > > (In reply to Richard Biener from comment #1)
> > > > 
> > > > Thanks for the comments!
> > > > 
> > > > > There's predictive commoning which can do similar transforms and runs after
> > > > > vectorization.  It might be it doesn't handle these "simple" cases or that
> > > > > loop dependence info is not up to the task there.
> > > > > 
> > > > 
> > > > pcom does fix this problem, but it's enabled by default at -O3. Could it be
> > > > considered to be run at O2? Or enabled at O2 at some conditions such as: only
> > > > for one loop which skips loop carried optimization and isn't vectorized
> > > > further?
> > > 
> > > I think pcom should be enabled when vectorization is due to the 
> > > interaction with PRE.  It could be tamed down (it can do peeling/unrolling 
> > > which is why it is -O3) based on the vectorizer cost model active
> > > if only implicitely enabled ...   Things will get a bit messy I guess.
> > > 
> > 
> > Good point! I prefer this idea to the one guarding cost model in sccvn code. 
> > 
> > > > > Another option is to avoid the PRE guard with the (very) cheap cost model
> > > > > at the expense of not vectorizing affected loops.
> > > > > 
> > > > 
> > > > OK, I will benchmark this to see its impact. For the particular loops in
> > > > 554.roms_r, they can be vectorized at cheap cost model, this bmk got improved
> > > > at cheap cost model on both Power8 and Power9 (a bit though). So I will just
> > > > test the impact on very cheap cost model.
> > > 
> > > So another thing to benchmark would be to enable pcom but make sure
> > > 
> > >   /* Determine the unroll factor, and if the loop should be unrolled, 
> > > ensure
> > >      that its number of iterations is divisible by the factor.  */
> > >   unroll_factor = determine_unroll_factor (chains);
> > >   scev_reset ();
> > >   unroll = (unroll_factor > 1
> > >             && can_unroll_loop_p (loop, unroll_factor, &desc));
> > > 
> > > is false for the cheap and very-cheap cost models unless
> > > flag_predictive_commoning is active.
> > > 
> > 
> > Thanks for the hints! One question: could we just enable non-unroll 
> > version of pcom if it's enabled by flag_tree_loop_vectorize implicitly 
> > without considering vect cost model? Although the very-cheap and cheap 
> > cost model are very likely associated to O2, users still can try dynamic 
> > or unlimited cost model at O2, or very-cheap/cheap cost model at O3, it 
> > seems not good to map cost model onto unroll decision here directly. Or 
> > maybe we check the optimization level? such as:
> > 
> >   virtual bool
> >   gate (function *)
> >   {
> >     if (flag_predictive_commoning != 0)
> >       return true;
> >     if (flag_tree_loop_vectorize)
> >       {
> >         allow_unroll_p = optimize > 2;
> 
> But what about -O2 -ftree-vectorize -fno-predictive-commoning?  IMHO
> we want to check global_options_set and for "implicit" pcom do not
> allow unrolling and for explicitely disabled pcom do not do any
> pcom at all.
> 
> Richard.

OK, the M1 patch followed the suggestion here, not allow unrolling if the pcom
is enabled implicitly by loop vect.

As the evaluation result shows, the way with implied pcom (M1) looks better.
For  very cheap cost model, it can make wrf a bit better. For cheap cost model,
the loop carried dependence (M2) did affect some bmks (especially wrf -9%). The
build time increment with M1 looks fine too.

Btw, commenting out two places with "update_ssa
(TODO_update_ssa_only_virtuals)", bootstrapped with the default way and
--with-build-config=bootstrap-O3 way. Will check with regression testing and
SPEC2017 build, but I guess it's another independent patch.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails
  2021-05-27  7:48 [Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails linkw at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2021-05-31  6:18 ` linkw at gcc dot gnu.org
@ 2021-05-31  6:19 ` linkw at gcc dot gnu.org
  2021-06-08  5:30 ` cvs-commit at gcc dot gnu.org
  2021-06-09  2:20 ` linkw at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: linkw at gcc dot gnu.org @ 2021-05-31  6:19 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794

Kewen Lin <linkw at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org      |linkw at gcc dot gnu.org
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2021-05-31

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails
  2021-05-27  7:48 [Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails linkw at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2021-05-31  6:19 ` linkw at gcc dot gnu.org
@ 2021-06-08  5:30 ` cvs-commit at gcc dot gnu.org
  2021-06-09  2:20 ` linkw at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-06-08  5:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794

--- Comment #10 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Kewen Lin <linkw@gcc.gnu.org>:

https://gcc.gnu.org/g:4db34072d5336d13b66f7185ec6454aa7d36f3c7

commit r12-1275-g4db34072d5336d13b66f7185ec6454aa7d36f3c7
Author: Kewen Lin <linkw@linux.ibm.com>
Date:   Mon Jun 7 22:10:33 2021 -0500

    predcom: Enabled by loop vect at O2 [PR100794]

    As PR100794 shows, in the current implementation PRE bypasses
    some optimization to avoid introducing loop carried dependence
    which stops loop vectorizer to vectorize the loop.  At -O2,
    there is no downstream pass to re-catch this kind of opportunity
    if loop vectorizer fails to vectorize that loop.

    This patch follows Richi's suggestion in the PR, if predcom flag
    isn't set and loop vectorization will enable predcom without any
    unrolling implicitly.  The Power9 SPEC2017 evaluation showed it
    can speed up 521.wrf_r 3.30% and 554.roms_r 1.08% at very-cheap
    cost model, no remarkable impact at cheap cost model, the build
    time and size impact is fine (see the PR for the details).

    By the way, I tested another proposal to guard PRE not skip the
    optimization for cheap and very-cheap vect cost models, the
    evaluation results showed it's fine with very cheap cost model,
    but it can degrade some bmks like 521.wrf_r -9.17% and
    549.fotonik3d_r -2.07% etc.

    Bootstrapped/regtested on powerpc64le-linux-gnu P9,
    x86_64-redhat-linux and aarch64-linux-gnu.

    gcc/ChangeLog:

            PR tree-optimization/100794
            * tree-predcom.c (tree_predictive_commoning_loop): Add parameter
            allow_unroll_p and only allow unrolling when it's true.
            (tree_predictive_commoning): Add parameter allow_unroll_p and
            adjust for it.
            (run_tree_predictive_commoning): Likewise.
            (pass_predcom::gate): Check flag_tree_loop_vectorize and
            global_options_set.x_flag_predictive_commoning.
            (pass_predcom::execute): Adjust for allow_unroll_p.

    gcc/testsuite/ChangeLog:

            PR tree-optimization/100794
            * gcc.dg/tree-ssa/pr100794.c: New test.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails
  2021-05-27  7:48 [Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails linkw at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2021-06-08  5:30 ` cvs-commit at gcc dot gnu.org
@ 2021-06-09  2:20 ` linkw at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: linkw at gcc dot gnu.org @ 2021-06-09  2:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794

Kewen Lin <linkw at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #11 from Kewen Lin <linkw at gcc dot gnu.org> ---
Fixed on trunk, will continue to refactor the tree_predictive_commoning_loop
and its callees into class and member functions as suggested.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-06-09  2:20 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-27  7:48 [Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails linkw at gcc dot gnu.org
2021-05-28  6:39 ` [Bug tree-optimization/100794] " rguenth at gcc dot gnu.org
2021-05-28  7:29 ` linkw at gcc dot gnu.org
2021-05-28  7:36 ` rguenther at suse dot de
2021-05-28 11:30 ` linkw at gcc dot gnu.org
2021-05-28 12:23 ` rguenther at suse dot de
2021-05-31  6:05 ` linkw at gcc dot gnu.org
2021-05-31  6:06 ` linkw at gcc dot gnu.org
2021-05-31  6:08 ` linkw at gcc dot gnu.org
2021-05-31  6:18 ` linkw at gcc dot gnu.org
2021-05-31  6:19 ` linkw at gcc dot gnu.org
2021-06-08  5:30 ` cvs-commit at gcc dot gnu.org
2021-06-09  2:20 ` linkw at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).