From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
 id E12F83854801; Thu, 27 May 2021 07:48:11 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org E12F83854801
From: "linkw at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/100794] New: suboptimal code due to missing
 pre2 when vectorization fails
Date: Thu, 27 May 2021 07:48:11 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 12.0
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: normal
X-Bugzilla-Who: linkw at gcc dot gnu.org
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status
 bug_severity priority component assigned_to reporter target_milestone
Message-ID: <bug-100794-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: gcc-bugs@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-bugs mailing list <gcc-bugs.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Thu, 27 May 2021 07:48:12 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794

            Bug ID: 100794
           Summary: suboptimal code due to missing pre2 when vectorization
                    fails
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: linkw at gcc dot gnu.org
  Target Milestone: ---

I was investigating a degradation in SPEC2017 554.roms_r on Power9. The
baseline is -O2 -mcpu=power9 -ffast-math, while the test line is -O2
-mcpu=power9 -ffast-math -ftree-vectorize -fvect-cost-model=very-cheap.

A reduced C test case follows:

#include <math.h>

#define MIN fmin
#define MAX fmax

#define N1 400
#define N2 600
#define N3 800

extern int j_0, j_n, i_0, i_n;
extern double diff2[N1][N2];
extern double dZdx[N1][N2][N3];
extern double dTdz[N1][N2][N3];
extern double dTdx[N1][N2][N3];
extern double FS[N1][N2][N3];

void
test (int k1, int k2)
{
  for (int j = j_0; j < j_n; j++)
    for (int i = i_0; i < i_n; i++)
      {
        double cff = 0.5 * diff2[j][i];
        double cff1 = MIN (dZdx[k1][j][i], 0.0);
        double cff2 = MIN (dZdx[k2][j][i + 1], 0.0);
        double cff3 = MAX (dZdx[k2][j][i], 0.0);
        double cff4 = MAX (dZdx[k1][j][i + 1], 0.0);

        FS[k2][j][i]
          = cff
            * (cff1 * (cff1 * dTdz[k2][j][i] - dTdx[k1][j][i])
               + cff2 * (cff2 * dTdz[k2][j][i] - dTdx[k2][j][i + 1])
               + cff3 * (cff3 * dTdz[k2][j][i] - dTdx[k2][j][i])
               + cff4 * (cff4 * dTdz[k2][j][i] - dTdx[k1][j][i + 1]));
      }
}

O2 fast:

  <bb 8> [local count: 955630225]:
  # prephitmp_107 = PHI <_6(8), pretmp_106(7)>
  # prephitmp_109 = PHI <_4(8), pretmp_108(7)>
  # prephitmp_111 = PHI <_23(8), pretmp_110(7)>
  # prephitmp_113 = PHI <_13(8), pretmp_112(7)>
  # doloop.9_55 = PHI <doloop.9_57(8), doloop.9_105(7)>
  # ivtmp.33_102 = PHI <ivtmp.33_101(8), ivtmp.44_70(7)>
  _87 = (double[400][600] *) ivtmp.45_60;
  _1 = MEM[(double *)_87 + ivtmp.33_102 * 1];
  cff_38 = _1 * 5.0e-1;
  cff1_40 = MIN_EXPR <prephitmp_107, 0.0>;
  _4 = MEM[(double *)&dZdx + 8B + ivtmp.33_102 * 1];
  cff2_42 = MIN_EXPR <_4, 0.0>;
  cff3_43 = MAX_EXPR <prephitmp_109, 0.0>;
  _6 = MEM[(double *)_79 + ivtmp.33_102 * 1];
  cff4_44 = MAX_EXPR <_6, 0.0>;


O2 fast vect (very-cheap):

  <bb 6> [local count: 955630225]:
  # doloop.9_55 = PHI <doloop.9_57(6), doloop.9_105(5)>
  # ivtmp.37_102 = PHI <ivtmp.37_101(6), ivtmp.46_72(5)>
  # ivtmp.38_92 = PHI <ivtmp.38_91(6), ivtmp.38_90(5)>
  _77 = (double[400][600] *) ivtmp.48_62;
  _1 = MEM[(double *)_77 + ivtmp.37_102 * 1];
  cff_38 = _1 * 5.0e-1;
  _2 = MEM[(double *)&dZdx + ivtmp.38_92 * 1];   // redundant load
  cff1_40 = MIN_EXPR <_2, 0.0>;
  _4 = MEM[(double *)&dZdx + 8B + ivtmp.37_102 * 1];
  cff2_42 = MIN_EXPR <_4, 0.0>;
  _5 = MEM[(double *)&dZdx + ivtmp.37_102 * 1];  // redundant load
  cff3_43 = MAX_EXPR <_5, 0.0>;
  _6 = MEM[(double *)&dZdx + 8B + ivtmp.38_92 * 1];
  cff4_44 = MAX_EXPR <_6, 0.0>;


I found the root cause: in the baseline version, PRE reuses some load
results from the previous iteration, which saves some loads. In the test
line version, however, the check below applies:

      /* Inhibit the use of an inserted PHI on a loop header when
         the address of the memory reference is a simple induction
         variable.  In other cases the vectorizer won't do anything
         anyway (either it's loop invariant or a complicated
         expression).  */
      if (sprime
          && TREE_CODE (sprime) == SSA_NAME
          && do_pre
          && (flag_tree_loop_vectorize || flag_tree_parallelize_loops > 1)

With this check, PRE doesn't perform the optimization, to avoid introducing
a loop-carried dependence. That makes sense. Unfortunately, the expected
downstream loop vectorization isn't performed on the given loop, since the
"very-cheap" cost model doesn't allow the vectorizer to peel for niters.
No later pass seems to optimize these loads either, so the result is
sub-optimal code.
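
For illustration only (this is not GCC source code), here is a minimal
source-level sketch of the load reuse that PRE achieves in the baseline,
on a simplified 1-D loop. All names are hypothetical: min0/max0 stand in
for the MIN/MAX of the test case, and the kernels are a cut-down model of
the cff1/cff4 pair, not the real loop.

```c
#include <stddef.h>

/* Hypothetical helpers standing in for MIN (x, 0.0) and MAX (x, 0.0).  */
static double min0 (double x) { return x < 0.0 ? x : 0.0; }
static double max0 (double x) { return x > 0.0 ? x : 0.0; }

/* Two loads per iteration: a[i] and a[i + 1].  This mirrors the
   "redundant load" shape seen in the test line dump.
   Reads a[0] .. a[n].  */
void
load_twice (const double *a, double *out, size_t n)
{
  for (size_t i = 0; i < n; i++)
    out[i] = min0 (a[i]) + max0 (a[i + 1]);
}

/* One load per iteration: the value loaded as a[i + 1] is carried over
   and used as a[i] in the next iteration, as the prephitmp PHIs do in
   the baseline dump.  The carried value is a loop-carried dependence.
   Reads a[0] .. a[n].  */
void
load_reuse (const double *a, double *out, size_t n)
{
  if (n == 0)
    return;

  double cur = a[0];            /* load hoisted out of the loop */
  for (size_t i = 0; i < n; i++)
    {
      double next = a[i + 1];   /* the only load in the loop body */
      out[i] = min0 (cur) + max0 (next);
      cur = next;               /* carried to the next iteration */
    }
}

/* Returns 1 if both kernels produce identical results on sample data.  */
int
kernels_agree (void)
{
  enum { N = 5 };
  const double a[N + 1] = { -1.5, 2.0, -0.25, 3.0, -4.0, 0.5 };
  double x[N], y[N];

  load_twice (a, x, N);
  load_reuse (a, y, N);
  for (int i = 0; i < N; i++)
    if (x[i] != y[i])
      return 0;
  return 1;
}
```

The second form saves a load per iteration but makes each iteration
depend on the previous one, which is the trade-off the check above is
guarding against.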

Rerunning PRE once after loop vectorization did fix the degradation, but
I'm not sure it's practical, since iterating PRE seems quite
time-consuming. Alternatively, could we tag this kind of loop and later run
PRE only on the tagged ones? It also seems impractical to predict whether a
loop can be vectorized later. I'm also not sure whether some existing pass
could be taught to handle this.