public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails
@ 2021-05-27 7:48 linkw at gcc dot gnu.org
2021-05-28 6:39 ` [Bug tree-optimization/100794] " rguenth at gcc dot gnu.org
` (11 more replies)
0 siblings, 12 replies; 13+ messages in thread
From: linkw at gcc dot gnu.org @ 2021-05-27 7:48 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
Bug ID: 100794
Summary: suboptimal code due to missing pre2 when vectorization
fails
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: linkw at gcc dot gnu.org
Target Milestone: ---
I was investigating one degradation from SPEC2017 554.roms_r on Power9, the
baseline is -O2 -mcpu=power9 -ffast-math while the test line is -O2
-mcpu=power9 -ffast-math -ftree-vectorize -fvect-cost-model=very-cheap.
One reduced C test case is as below:
#include <math.h>
#define MIN fmin
#define MAX fmax
#define N1 400
#define N2 600
#define N3 800
extern int j_0, j_n, i_0, i_n;
extern double diff2[N1][N2];
extern double dZdx[N1][N2][N3];
extern double dTdz[N1][N2][N3];
extern double dTdx[N1][N2][N3];
extern double FS[N1][N2][N3];
void
test (int k1, int k2)
{
for (int j = j_0; j < j_n; j++)
for (int i = i_0; i < i_n; i++)
{
double cff = 0.5 * diff2[j][i];
double cff1 = MIN (dZdx[k1][j][i], 0.0);
double cff2 = MIN (dZdx[k2][j][i + 1], 0.0);
double cff3 = MAX (dZdx[k2][j][i], 0.0);
double cff4 = MAX (dZdx[k1][j][i + 1], 0.0);
FS[k2][j][i]
= cff
* (cff1 * (cff1 * dTdz[k2][j][i] - dTdx[k1][j][i])
+ cff2 * (cff2 * dTdz[k2][j][i] - dTdx[k2][j][i + 1])
+ cff3 * (cff3 * dTdz[k2][j][i] - dTdx[k2][j][i])
+ cff4 * (cff4 * dTdz[k2][j][i] - dTdx[k1][j][i + 1]));
}
}
O2 fast:
<bb 8> [local count: 955630225]:
# prephitmp_107 = PHI <_6(8), pretmp_106(7)>
# prephitmp_109 = PHI <_4(8), pretmp_108(7)>
# prephitmp_111 = PHI <_23(8), pretmp_110(7)>
# prephitmp_113 = PHI <_13(8), pretmp_112(7)>
# doloop.9_55 = PHI <doloop.9_57(8), doloop.9_105(7)>
# ivtmp.33_102 = PHI <ivtmp.33_101(8), ivtmp.44_70(7)>
_87 = (double[400][600] *) ivtmp.45_60;
_1 = MEM[(double *)_87 + ivtmp.33_102 * 1];
cff_38 = _1 * 5.0e-1;
cff1_40 = MIN_EXPR <prephitmp_107, 0.0>;
_4 = MEM[(double *)&dZdx + 8B + ivtmp.33_102 * 1];
cff2_42 = MIN_EXPR <_4, 0.0>;
cff3_43 = MAX_EXPR <prephitmp_109, 0.0>;
_6 = MEM[(double *)_79 + ivtmp.33_102 * 1];
cff4_44 = MAX_EXPR <_6, 0.0>;
O2 fast vect (very-cheap)
<bb 6> [local count: 955630225]:
# doloop.9_55 = PHI <doloop.9_57(6), doloop.9_105(5)>
# ivtmp.37_102 = PHI <ivtmp.37_101(6), ivtmp.46_72(5)>
# ivtmp.38_92 = PHI <ivtmp.38_91(6), ivtmp.38_90(5)>
_77 = (double[400][600] *) ivtmp.48_62;
_1 = MEM[(double *)_77 + ivtmp.37_102 * 1];
cff_38 = _1 * 5.0e-1;
_2 = MEM[(double *)&dZdx + ivtmp.38_92 * 1]; // redundant load
cff1_40 = MIN_EXPR <_2, 0.0>;
_4 = MEM[(double *)&dZdx + 8B + ivtmp.37_102 * 1];
cff2_42 = MIN_EXPR <_4, 0.0>;
_5 = MEM[(double *)&dZdx + ivtmp.37_102 * 1]; // redundant load
cff3_43 = MAX_EXPR <_5, 0.0>;
_6 = MEM[(double *)&dZdx + 8B + ivtmp.38_92 * 1];
cff4_44 = MAX_EXPR <_6, 0.0>;
I found the root cause is that: in the baseline version, PRE makes it to reuse
some load result from previous iterations, it saves some loads. while in the
test line version, with the check below:
/* Inhibit the use of an inserted PHI on a loop header when
the address of the memory reference is a simple induction
variable. In other cases the vectorizer won't do anything
anyway (either it's loop invariant or a complicated
expression). */
if (sprime
&& TREE_CODE (sprime) == SSA_NAME
&& do_pre
&& (flag_tree_loop_vectorize || flag_tree_parallelize_loops > 1)
PRE doesn't optimize it to avoid introducing loop carried dependence. It makes
sense. But unfortunately the expected downstream loop vectorization isn't
performed on the given loop since with "very-cheap" cost model, it doesn't
allow vectorizer to peel for niters. Later there seems no downstream pass which
is trying to optimize it, it eventually results in sub-optimal code.
To rerun pre once after loop vectorization did fix the degradation, but not
sure it's practical, since iterating pre seems much time-consuming. Or tagging
this kind of loop and later just run pre on the tagged one? It seems also not
practical to predict one loop whether can be loop-vectorized later. Also not
sure whether there are some passes which can be taught for this.
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails
2021-05-27 7:48 [Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails linkw at gcc dot gnu.org
@ 2021-05-28 6:39 ` rguenth at gcc dot gnu.org
2021-05-28 7:29 ` linkw at gcc dot gnu.org
` (10 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-05-28 6:39 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
There's predictive commoning which can do similar transforms and runs after
vectorization. It might be it doesn't handle these "simple" cases or that
loop dependence info is not up to the task there.
Another option is to avoid the PRE guard with the (very) cheap cost model
at the expense of not vectorizing affected loops.
So it's just one of those classical phase ordering issues that you can
try to squash with running passes over and over again ...
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails
2021-05-27 7:48 [Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails linkw at gcc dot gnu.org
2021-05-28 6:39 ` [Bug tree-optimization/100794] " rguenth at gcc dot gnu.org
@ 2021-05-28 7:29 ` linkw at gcc dot gnu.org
2021-05-28 7:36 ` rguenther at suse dot de
` (9 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: linkw at gcc dot gnu.org @ 2021-05-28 7:29 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
--- Comment #2 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #1)
Thanks for the comments!
> There's predictive commoning which can do similar transforms and runs after
> vectorization. It might be it doesn't handle these "simple" cases or that
> loop dependence info is not up to the task there.
>
pcom does fix this problem, but it's enabled by default at -O3. Could it be
considered to be run at O2? Or enabled at O2 at some conditions such as: only
for one loop which skips loop carried optimization and isn't vectorized
further?
> Another option is to avoid the PRE guard with the (very) cheap cost model
> at the expense of not vectorizing affected loops.
>
OK, I will benchmark this to see its impact. For the particular loops in
554.roms_r, they can be vectorized at cheap cost model, this bmk got improved
at cheap cost model on both Power8 and Power9 (a bit though). So I will just
test the impact on very cheap cost model.
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails
2021-05-27 7:48 [Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails linkw at gcc dot gnu.org
2021-05-28 6:39 ` [Bug tree-optimization/100794] " rguenth at gcc dot gnu.org
2021-05-28 7:29 ` linkw at gcc dot gnu.org
@ 2021-05-28 7:36 ` rguenther at suse dot de
2021-05-28 11:30 ` linkw at gcc dot gnu.org
` (8 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: rguenther at suse dot de @ 2021-05-28 7:36 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
--- Comment #3 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 28 May 2021, linkw at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
>
> --- Comment #2 from Kewen Lin <linkw at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #1)
>
> Thanks for the comments!
>
> > There's predictive commoning which can do similar transforms and runs after
> > vectorization. It might be it doesn't handle these "simple" cases or that
> > loop dependence info is not up to the task there.
> >
>
> pcom does fix this problem, but it's enabled by default at -O3. Could it be
> considered to be run at O2? Or enabled at O2 at some conditions such as: only
> for one loop which skips loop carried optimization and isn't vectorized
> further?
I think pcom should be enabled when vectorization is due to the
interaction with PRE. It could be tamed down (it can do peeling/unrolling
which is why it is -O3) based on the vectorizer cost model active
if only implicitely enabled ... Things will get a bit messy I guess.
> > Another option is to avoid the PRE guard with the (very) cheap cost model
> > at the expense of not vectorizing affected loops.
> >
>
> OK, I will benchmark this to see its impact. For the particular loops in
> 554.roms_r, they can be vectorized at cheap cost model, this bmk got improved
> at cheap cost model on both Power8 and Power9 (a bit though). So I will just
> test the impact on very cheap cost model.
So another thing to benchmark would be to enable pcom but make sure
/* Determine the unroll factor, and if the loop should be unrolled,
ensure
that its number of iterations is divisible by the factor. */
unroll_factor = determine_unroll_factor (chains);
scev_reset ();
unroll = (unroll_factor > 1
&& can_unroll_loop_p (loop, unroll_factor, &desc));
is false for the cheap and very-cheap cost models unless
flag_predictive_commoning is active.
It's probably also a good idea to investigate whether the
update_ssa calls in pcom can be delayed to until after all transforms
have been done.
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails
2021-05-27 7:48 [Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails linkw at gcc dot gnu.org
` (2 preceding siblings ...)
2021-05-28 7:36 ` rguenther at suse dot de
@ 2021-05-28 11:30 ` linkw at gcc dot gnu.org
2021-05-28 12:23 ` rguenther at suse dot de
` (7 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: linkw at gcc dot gnu.org @ 2021-05-28 11:30 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
--- Comment #4 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to rguenther@suse.de from comment #3)
> On Fri, 28 May 2021, linkw at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
> >
> > --- Comment #2 from Kewen Lin <linkw at gcc dot gnu.org> ---
> > (In reply to Richard Biener from comment #1)
> >
> > Thanks for the comments!
> >
> > > There's predictive commoning which can do similar transforms and runs after
> > > vectorization. It might be it doesn't handle these "simple" cases or that
> > > loop dependence info is not up to the task there.
> > >
> >
> > pcom does fix this problem, but it's enabled by default at -O3. Could it be
> > considered to be run at O2? Or enabled at O2 at some conditions such as: only
> > for one loop which skips loop carried optimization and isn't vectorized
> > further?
>
> I think pcom should be enabled when vectorization is due to the
> interaction with PRE. It could be tamed down (it can do peeling/unrolling
> which is why it is -O3) based on the vectorizer cost model active
> if only implicitely enabled ... Things will get a bit messy I guess.
>
Good point! I prefer this idea to the one guarding cost model in sccvn code.
> > > Another option is to avoid the PRE guard with the (very) cheap cost model
> > > at the expense of not vectorizing affected loops.
> > >
> >
> > OK, I will benchmark this to see its impact. For the particular loops in
> > 554.roms_r, they can be vectorized at cheap cost model, this bmk got improved
> > at cheap cost model on both Power8 and Power9 (a bit though). So I will just
> > test the impact on very cheap cost model.
>
> So another thing to benchmark would be to enable pcom but make sure
>
> /* Determine the unroll factor, and if the loop should be unrolled,
> ensure
> that its number of iterations is divisible by the factor. */
> unroll_factor = determine_unroll_factor (chains);
> scev_reset ();
> unroll = (unroll_factor > 1
> && can_unroll_loop_p (loop, unroll_factor, &desc));
>
> is false for the cheap and very-cheap cost models unless
> flag_predictive_commoning is active.
>
Thanks for the hints! One question: could we just enable non-unroll version of
pcom if it's enabled by flag_tree_loop_vectorize implicitly without considering
vect cost model? Although the very-cheap and cheap cost model are very likely
associated to O2, users still can try dynamic or unlimited cost model at O2, or
very-cheap/cheap cost model at O3, it seems not good to map cost model onto
unroll decision here directly. Or maybe we check the optimization level? such
as:
virtual bool
gate (function *)
{
if (flag_predictive_commoning != 0)
return true;
if (flag_tree_loop_vectorize)
{
allow_unroll_p = optimize > 2;
return true;
}
return false;
}
> It's probably also a good idea to investigate whether the
> update_ssa calls in pcom can be delayed to until after all transforms
> have been done.
OK, will check this later.
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails
2021-05-27 7:48 [Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails linkw at gcc dot gnu.org
` (3 preceding siblings ...)
2021-05-28 11:30 ` linkw at gcc dot gnu.org
@ 2021-05-28 12:23 ` rguenther at suse dot de
2021-05-31 6:05 ` linkw at gcc dot gnu.org
` (6 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: rguenther at suse dot de @ 2021-05-28 12:23 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
--- Comment #5 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 28 May 2021, linkw at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
>
> --- Comment #4 from Kewen Lin <linkw at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #3)
> > On Fri, 28 May 2021, linkw at gcc dot gnu.org wrote:
> >
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
> > >
> > > --- Comment #2 from Kewen Lin <linkw at gcc dot gnu.org> ---
> > > (In reply to Richard Biener from comment #1)
> > >
> > > Thanks for the comments!
> > >
> > > > There's predictive commoning which can do similar transforms and runs after
> > > > vectorization. It might be it doesn't handle these "simple" cases or that
> > > > loop dependence info is not up to the task there.
> > > >
> > >
> > > pcom does fix this problem, but it's enabled by default at -O3. Could it be
> > > considered to be run at O2? Or enabled at O2 at some conditions such as: only
> > > for one loop which skips loop carried optimization and isn't vectorized
> > > further?
> >
> > I think pcom should be enabled when vectorization is due to the
> > interaction with PRE. It could be tamed down (it can do peeling/unrolling
> > which is why it is -O3) based on the vectorizer cost model active
> > if only implicitely enabled ... Things will get a bit messy I guess.
> >
>
> Good point! I prefer this idea to the one guarding cost model in sccvn code.
>
> > > > Another option is to avoid the PRE guard with the (very) cheap cost model
> > > > at the expense of not vectorizing affected loops.
> > > >
> > >
> > > OK, I will benchmark this to see its impact. For the particular loops in
> > > 554.roms_r, they can be vectorized at cheap cost model, this bmk got improved
> > > at cheap cost model on both Power8 and Power9 (a bit though). So I will just
> > > test the impact on very cheap cost model.
> >
> > So another thing to benchmark would be to enable pcom but make sure
> >
> > /* Determine the unroll factor, and if the loop should be unrolled,
> > ensure
> > that its number of iterations is divisible by the factor. */
> > unroll_factor = determine_unroll_factor (chains);
> > scev_reset ();
> > unroll = (unroll_factor > 1
> > && can_unroll_loop_p (loop, unroll_factor, &desc));
> >
> > is false for the cheap and very-cheap cost models unless
> > flag_predictive_commoning is active.
> >
>
> Thanks for the hints! One question: could we just enable non-unroll
> version of pcom if it's enabled by flag_tree_loop_vectorize implicitly
> without considering vect cost model? Although the very-cheap and cheap
> cost model are very likely associated to O2, users still can try dynamic
> or unlimited cost model at O2, or very-cheap/cheap cost model at O3, it
> seems not good to map cost model onto unroll decision here directly. Or
> maybe we check the optimization level? such as:
>
> virtual bool
> gate (function *)
> {
> if (flag_predictive_commoning != 0)
> return true;
> if (flag_tree_loop_vectorize)
> {
> allow_unroll_p = optimize > 2;
But what about -O2 -ftree-vectorize -fno-predictive-commoning? IMHO
we want to check global_options_set and for "implicit" pcom do not
allow unrolling and for explicitely disabled pcom do not do any
pcom at all.
Richard.
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails
2021-05-27 7:48 [Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails linkw at gcc dot gnu.org
` (4 preceding siblings ...)
2021-05-28 12:23 ` rguenther at suse dot de
@ 2021-05-31 6:05 ` linkw at gcc dot gnu.org
2021-05-31 6:06 ` linkw at gcc dot gnu.org
` (5 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: linkw at gcc dot gnu.org @ 2021-05-31 6:05 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
--- Comment #6 from Kewen Lin <linkw at gcc dot gnu.org> ---
Created attachment 50894
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50894&action=edit
Method 1, implicitly enable pcom without unrolling once loop vectorization is
enabled but pcom isn't set explicitly.
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails
2021-05-27 7:48 [Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails linkw at gcc dot gnu.org
` (5 preceding siblings ...)
2021-05-31 6:05 ` linkw at gcc dot gnu.org
@ 2021-05-31 6:06 ` linkw at gcc dot gnu.org
2021-05-31 6:08 ` linkw at gcc dot gnu.org
` (4 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: linkw at gcc dot gnu.org @ 2021-05-31 6:06 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
--- Comment #7 from Kewen Lin <linkw at gcc dot gnu.org> ---
Created attachment 50895
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50895&action=edit
Method 2, let pre generate loop carried dependence for very cheap and cheap
cost model.
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails
2021-05-27 7:48 [Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails linkw at gcc dot gnu.org
` (6 preceding siblings ...)
2021-05-31 6:06 ` linkw at gcc dot gnu.org
@ 2021-05-31 6:08 ` linkw at gcc dot gnu.org
2021-05-31 6:18 ` linkw at gcc dot gnu.org
` (3 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: linkw at gcc dot gnu.org @ 2021-05-31 6:08 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
--- Comment #8 from Kewen Lin <linkw at gcc dot gnu.org> ---
Created attachment 50896
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50896&action=edit
M1 M2 SPEC2017 P9 eval result
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails
2021-05-27 7:48 [Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails linkw at gcc dot gnu.org
` (7 preceding siblings ...)
2021-05-31 6:08 ` linkw at gcc dot gnu.org
@ 2021-05-31 6:18 ` linkw at gcc dot gnu.org
2021-05-31 6:19 ` linkw at gcc dot gnu.org
` (2 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: linkw at gcc dot gnu.org @ 2021-05-31 6:18 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
--- Comment #9 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to rguenther@suse.de from comment #5)
> On Fri, 28 May 2021, linkw at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
> >
> > --- Comment #4 from Kewen Lin <linkw at gcc dot gnu.org> ---
> > (In reply to rguenther@suse.de from comment #3)
> > > On Fri, 28 May 2021, linkw at gcc dot gnu.org wrote:
> > >
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
> > > >
> > > > --- Comment #2 from Kewen Lin <linkw at gcc dot gnu.org> ---
> > > > (In reply to Richard Biener from comment #1)
> > > >
> > > > Thanks for the comments!
> > > >
> > > > > There's predictive commoning which can do similar transforms and runs after
> > > > > vectorization. It might be it doesn't handle these "simple" cases or that
> > > > > loop dependence info is not up to the task there.
> > > > >
> > > >
> > > > pcom does fix this problem, but it's enabled by default at -O3. Could it be
> > > > considered to be run at O2? Or enabled at O2 at some conditions such as: only
> > > > for one loop which skips loop carried optimization and isn't vectorized
> > > > further?
> > >
> > > I think pcom should be enabled when vectorization is due to the
> > > interaction with PRE. It could be tamed down (it can do peeling/unrolling
> > > which is why it is -O3) based on the vectorizer cost model active
> > > if only implicitely enabled ... Things will get a bit messy I guess.
> > >
> >
> > Good point! I prefer this idea to the one guarding cost model in sccvn code.
> >
> > > > > Another option is to avoid the PRE guard with the (very) cheap cost model
> > > > > at the expense of not vectorizing affected loops.
> > > > >
> > > >
> > > > OK, I will benchmark this to see its impact. For the particular loops in
> > > > 554.roms_r, they can be vectorized at cheap cost model, this bmk got improved
> > > > at cheap cost model on both Power8 and Power9 (a bit though). So I will just
> > > > test the impact on very cheap cost model.
> > >
> > > So another thing to benchmark would be to enable pcom but make sure
> > >
> > > /* Determine the unroll factor, and if the loop should be unrolled,
> > > ensure
> > > that its number of iterations is divisible by the factor. */
> > > unroll_factor = determine_unroll_factor (chains);
> > > scev_reset ();
> > > unroll = (unroll_factor > 1
> > > && can_unroll_loop_p (loop, unroll_factor, &desc));
> > >
> > > is false for the cheap and very-cheap cost models unless
> > > flag_predictive_commoning is active.
> > >
> >
> > Thanks for the hints! One question: could we just enable non-unroll
> > version of pcom if it's enabled by flag_tree_loop_vectorize implicitly
> > without considering vect cost model? Although the very-cheap and cheap
> > cost model are very likely associated to O2, users still can try dynamic
> > or unlimited cost model at O2, or very-cheap/cheap cost model at O3, it
> > seems not good to map cost model onto unroll decision here directly. Or
> > maybe we check the optimization level? such as:
> >
> > virtual bool
> > gate (function *)
> > {
> > if (flag_predictive_commoning != 0)
> > return true;
> > if (flag_tree_loop_vectorize)
> > {
> > allow_unroll_p = optimize > 2;
>
> But what about -O2 -ftree-vectorize -fno-predictive-commoning? IMHO
> we want to check global_options_set and for "implicit" pcom do not
> allow unrolling and for explicitely disabled pcom do not do any
> pcom at all.
>
> Richard.
OK, the M1 patch followed the suggestion here, not allow unrolling if the pcom
is enabled implicitly by loop vect.
As the evaluation result shows, the way with implied pcom (M1) looks better.
For very cheap cost model, it can make wrf a bit better. For cheap cost model,
the loop carried dependence (M2) did affect some bmks (especially wrf -9%). The
build time increment with M1 looks fine too.
Btw, commenting out two places with "update_ssa
(TODO_update_ssa_only_virtuals)", bootstrapped with the default way and
--with-build-config=bootstrap-O3 way. Will check with regression testing and
SPEC2017 build, but I guess it's another independent patch.
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails
2021-05-27 7:48 [Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails linkw at gcc dot gnu.org
` (8 preceding siblings ...)
2021-05-31 6:18 ` linkw at gcc dot gnu.org
@ 2021-05-31 6:19 ` linkw at gcc dot gnu.org
2021-06-08 5:30 ` cvs-commit at gcc dot gnu.org
2021-06-09 2:20 ` linkw at gcc dot gnu.org
11 siblings, 0 replies; 13+ messages in thread
From: linkw at gcc dot gnu.org @ 2021-05-31 6:19 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
Kewen Lin <linkw at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Assignee|unassigned at gcc dot gnu.org |linkw at gcc dot gnu.org
Ever confirmed|0 |1
Status|UNCONFIRMED |ASSIGNED
Last reconfirmed| |2021-05-31
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails
2021-05-27 7:48 [Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails linkw at gcc dot gnu.org
` (9 preceding siblings ...)
2021-05-31 6:19 ` linkw at gcc dot gnu.org
@ 2021-06-08 5:30 ` cvs-commit at gcc dot gnu.org
2021-06-09 2:20 ` linkw at gcc dot gnu.org
11 siblings, 0 replies; 13+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-06-08 5:30 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
--- Comment #10 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Kewen Lin <linkw@gcc.gnu.org>:
https://gcc.gnu.org/g:4db34072d5336d13b66f7185ec6454aa7d36f3c7
commit r12-1275-g4db34072d5336d13b66f7185ec6454aa7d36f3c7
Author: Kewen Lin <linkw@linux.ibm.com>
Date: Mon Jun 7 22:10:33 2021 -0500
predcom: Enabled by loop vect at O2 [PR100794]
As PR100794 shows, in the current implementation PRE bypasses
some optimization to avoid introducing loop carried dependence
which stops loop vectorizer to vectorize the loop. At -O2,
there is no downstream pass to re-catch this kind of opportunity
if loop vectorizer fails to vectorize that loop.
This patch follows Richi's suggestion in the PR, if predcom flag
isn't set and loop vectorization will enable predcom without any
unrolling implicitly. The Power9 SPEC2017 evaluation showed it
can speed up 521.wrf_r 3.30% and 554.roms_r 1.08% at very-cheap
cost model, no remarkable impact at cheap cost model, the build
time and size impact is fine (see the PR for the details).
By the way, I tested another proposal to guard PRE not skip the
optimization for cheap and very-cheap vect cost models, the
evaluation results showed it's fine with very cheap cost model,
but it can degrade some bmks like 521.wrf_r -9.17% and
549.fotonik3d_r -2.07% etc.
Bootstrapped/regtested on powerpc64le-linux-gnu P9,
x86_64-redhat-linux and aarch64-linux-gnu.
gcc/ChangeLog:
PR tree-optimization/100794
* tree-predcom.c (tree_predictive_commoning_loop): Add parameter
allow_unroll_p and only allow unrolling when it's true.
(tree_predictive_commoning): Add parameter allow_unroll_p and
adjust for it.
(run_tree_predictive_commoning): Likewise.
(pass_predcom::gate): Check flag_tree_loop_vectorize and
global_options_set.x_flag_predictive_commoning.
(pass_predcom::execute): Adjust for allow_unroll_p.
gcc/testsuite/ChangeLog:
PR tree-optimization/100794
* gcc.dg/tree-ssa/pr100794.c: New test.
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails
2021-05-27 7:48 [Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails linkw at gcc dot gnu.org
` (10 preceding siblings ...)
2021-06-08 5:30 ` cvs-commit at gcc dot gnu.org
@ 2021-06-09 2:20 ` linkw at gcc dot gnu.org
11 siblings, 0 replies; 13+ messages in thread
From: linkw at gcc dot gnu.org @ 2021-06-09 2:20 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
Kewen Lin <linkw at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|ASSIGNED |RESOLVED
Resolution|--- |FIXED
--- Comment #11 from Kewen Lin <linkw at gcc dot gnu.org> ---
Fixed on trunk, will continue to refactor the tree_predictive_commoning_loop
and its callees into class and member functions as suggested.
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2021-06-09 2:20 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-27 7:48 [Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails linkw at gcc dot gnu.org
2021-05-28 6:39 ` [Bug tree-optimization/100794] " rguenth at gcc dot gnu.org
2021-05-28 7:29 ` linkw at gcc dot gnu.org
2021-05-28 7:36 ` rguenther at suse dot de
2021-05-28 11:30 ` linkw at gcc dot gnu.org
2021-05-28 12:23 ` rguenther at suse dot de
2021-05-31 6:05 ` linkw at gcc dot gnu.org
2021-05-31 6:06 ` linkw at gcc dot gnu.org
2021-05-31 6:08 ` linkw at gcc dot gnu.org
2021-05-31 6:18 ` linkw at gcc dot gnu.org
2021-05-31 6:19 ` linkw at gcc dot gnu.org
2021-06-08 5:30 ` cvs-commit at gcc dot gnu.org
2021-06-09 2:20 ` linkw at gcc dot gnu.org
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).