From: "linkw at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails
Date: Thu, 27 May 2021 07:48:11 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794

            Bug ID: 100794
           Summary: suboptimal code due to missing pre2 when vectorization
                    fails
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: linkw at gcc dot gnu.org
  Target Milestone: ---

I was investigating a degradation in SPEC2017 554.roms_r on Power9. The
baseline is

  -O2 -mcpu=power9 -ffast-math

while the test line is

  -O2 -mcpu=power9 -ffast-math -ftree-vectorize -fvect-cost-model=very-cheap
One reduced C test case is as below:

  #include <math.h>

  #define MIN fmin
  #define MAX fmax

  #define N1 400
  #define N2 600
  #define N3 800

  extern int j_0, j_n, i_0, i_n;

  extern double diff2[N1][N2];
  extern double dZdx[N1][N2][N3];
  extern double dTdz[N1][N2][N3];
  extern double dTdx[N1][N2][N3];
  extern double FS[N1][N2][N3];

  void
  test (int k1, int k2)
  {
    for (int j = j_0; j < j_n; j++)
      for (int i = i_0; i < i_n; i++)
        {
          double cff = 0.5 * diff2[j][i];
          double cff1 = MIN (dZdx[k1][j][i], 0.0);
          double cff2 = MIN (dZdx[k2][j][i + 1], 0.0);
          double cff3 = MAX (dZdx[k2][j][i], 0.0);
          double cff4 = MAX (dZdx[k1][j][i + 1], 0.0);
          FS[k2][j][i]
            = cff * (cff1 * (cff1 * dTdz[k2][j][i] - dTdx[k1][j][i])
                     + cff2 * (cff2 * dTdz[k2][j][i] - dTdx[k2][j][i + 1])
                     + cff3 * (cff3 * dTdz[k2][j][i] - dTdx[k2][j][i])
                     + cff4 * (cff4 * dTdz[k2][j][i] - dTdx[k1][j][i + 1]));
        }
  }

O2 fast:

  [local count: 955630225]:
  # prephitmp_107 = PHI <_6(8), pretmp_106(7)>
  # prephitmp_109 = PHI <_4(8), pretmp_108(7)>
  # prephitmp_111 = PHI <_23(8), pretmp_110(7)>
  # prephitmp_113 = PHI <_13(8), pretmp_112(7)>
  # doloop.9_55 = PHI
  # ivtmp.33_102 = PHI
  _87 = (double[400][600] *) ivtmp.45_60;
  _1 = MEM[(double *)_87 + ivtmp.33_102 * 1];
  cff_38 = _1 * 5.0e-1;
  cff1_40 = MIN_EXPR ;
  _4 = MEM[(double *)&dZdx + 8B + ivtmp.33_102 * 1];
  cff2_42 = MIN_EXPR <_4, 0.0>;
  cff3_43 = MAX_EXPR ;
  _6 = MEM[(double *)_79 + ivtmp.33_102 * 1];
  cff4_44 = MAX_EXPR <_6, 0.0>;

O2 fast vect (very-cheap):

  [local count: 955630225]:
  # doloop.9_55 = PHI
  # ivtmp.37_102 = PHI
  # ivtmp.38_92 = PHI
  _77 = (double[400][600] *) ivtmp.48_62;
  _1 = MEM[(double *)_77 + ivtmp.37_102 * 1];
  cff_38 = _1 * 5.0e-1;
  _2 = MEM[(double *)&dZdx + ivtmp.38_92 * 1];  // redundant load
  cff1_40 = MIN_EXPR <_2, 0.0>;
  _4 = MEM[(double *)&dZdx + 8B + ivtmp.37_102 * 1];
  cff2_42 = MIN_EXPR <_4, 0.0>;
  _5 = MEM[(double *)&dZdx + ivtmp.37_102 * 1];  // redundant load
  cff3_43 = MAX_EXPR <_5, 0.0>;
  _6 = MEM[(double *)&dZdx + 8B + ivtmp.38_92 * 1];
  cff4_44 = MAX_EXPR <_6, 0.0>;

The root cause is this: in the baseline version, PRE reuses load results
from the previous iteration, which saves some loads. In the test line
version, PRE hits the check below:

      /* Inhibit the use of an inserted PHI on a loop header when
         the address of the memory reference is a simple induction
         variable.  In other cases the vectorizer won't do anything
         anyway (either it's loop invariant or a complicated
         expression).  */
      if (sprime
          && TREE_CODE (sprime) == SSA_NAME
          && do_pre
          && (flag_tree_loop_vectorize || flag_tree_parallelize_loops > 1)

so PRE skips the optimization to avoid introducing a loop-carried
dependence. That makes sense. Unfortunately, the expected downstream loop
vectorization isn't performed on this loop, because the "very-cheap" cost
model doesn't allow the vectorizer to peel for niters. No later pass seems
to try to optimize it, so it eventually results in sub-optimal code.

Rerunning pre once after loop vectorization did fix the degradation, but
I'm not sure it's practical, since iterating pre seems quite
time-consuming. Or could we tag this kind of loop and later run pre only on
the tagged ones? It also doesn't seem practical to predict whether a loop
can be vectorized later. I'm also not sure whether some existing pass could
be taught to handle this.
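
For illustration only, the load reuse that PRE performs in the baseline can be sketched by hand on a much simpler loop. This is a minimal sketch, assuming a one-dimensional array instead of the reduced test case above; the function names `sum_adjacent_reload` and `sum_adjacent_carried` are hypothetical and do not appear in GCC or in 554.roms_r.

```c
#include <stddef.h>

/* Hypothetical example: every iteration loads both a[i] and a[i + 1],
   so the value loaded as a[i + 1] is loaded again as a[i] in the next
   iteration.  This mirrors the "redundant load" case in the dump.  */
double
sum_adjacent_reload (const double *a, size_t n)
{
  double s = 0.0;
  for (size_t i = 0; i + 1 < n; i++)
    s += a[i] * a[i + 1];
  return s;
}

/* Hypothetical example: same result, but the value loaded as a[i + 1]
   is carried into the next iteration in `prev`, so each iteration does
   one load instead of two.  `prev` plays the role of a PHI on the loop
   header and introduces a loop-carried dependence.  */
double
sum_adjacent_carried (const double *a, size_t n)
{
  if (n < 2)
    return 0.0;

  double s = 0.0;
  double prev = a[0];
  for (size_t i = 0; i + 1 < n; i++)
    {
      double next = a[i + 1];
      s += prev * next;
      prev = next;
    }
  return s;
}
```

The second form saves a load per iteration at the cost of a value carried across iterations, which is the trade-off the quoted check weighs: it keeps the first form so the loop stays friendly to the vectorizer, and that only pays off if vectorization actually happens.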