public inbox for gcc-bugs@sourceware.org
From: "linkw at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails
Date: Thu, 27 May 2021 07:48:11 +0000
Message-ID: <bug-100794-4@http.gcc.gnu.org/bugzilla/> (raw)

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794

            Bug ID: 100794
           Summary: suboptimal code due to missing pre2 when vectorization
                    fails
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: linkw at gcc dot gnu.org
  Target Milestone: ---

I was investigating a degradation in SPEC2017 554.roms_r on Power9.  The
baseline options are -O2 -mcpu=power9 -ffast-math, while the test line is
-O2 -mcpu=power9 -ffast-math -ftree-vectorize -fvect-cost-model=very-cheap.
A reduced C test case is below:

#include <math.h>

#define MIN fmin
#define MAX fmax

#define N1 400
#define N2 600
#define N3 800

extern int j_0, j_n, i_0, i_n;
extern double diff2[N1][N2];
extern double dZdx[N1][N2][N3];
extern double dTdz[N1][N2][N3];
extern double dTdx[N1][N2][N3];
extern double FS[N1][N2][N3];

void
test (int k1, int k2)
{
  for (int j = j_0; j < j_n; j++)
    for (int i = i_0; i < i_n; i++)
      {
        double cff = 0.5 * diff2[j][i];
        double cff1 = MIN (dZdx[k1][j][i], 0.0);
        double cff2 = MIN (dZdx[k2][j][i + 1], 0.0);
        double cff3 = MAX (dZdx[k2][j][i], 0.0);
        double cff4 = MAX (dZdx[k1][j][i + 1], 0.0);
        FS[k2][j][i]
          = cff
            * (cff1 * (cff1 * dTdz[k2][j][i] - dTdx[k1][j][i])
               + cff2 * (cff2 * dTdz[k2][j][i] - dTdx[k2][j][i + 1])
               + cff3 * (cff3 * dTdz[k2][j][i] - dTdx[k2][j][i])
               + cff4 * (cff4 * dTdz[k2][j][i] - dTdx[k1][j][i + 1]));
      }
}

O2 fast:

  <bb 8> [local count: 955630225]:
  # prephitmp_107 = PHI <_6(8), pretmp_106(7)>
  # prephitmp_109 = PHI <_4(8), pretmp_108(7)>
  # prephitmp_111 = PHI <_23(8), pretmp_110(7)>
  # prephitmp_113 = PHI <_13(8), pretmp_112(7)>
  # doloop.9_55 = PHI <doloop.9_57(8), doloop.9_105(7)>
  # ivtmp.33_102 = PHI <ivtmp.33_101(8), ivtmp.44_70(7)>
  _87 = (double[400][600] *) ivtmp.45_60;
  _1 = MEM[(double *)_87 + ivtmp.33_102 * 1];
  cff_38 = _1 * 5.0e-1;
  cff1_40 = MIN_EXPR <prephitmp_107, 0.0>;
  _4 = MEM[(double *)&dZdx + 8B + ivtmp.33_102 * 1];
  cff2_42 = MIN_EXPR <_4, 0.0>;
  cff3_43 = MAX_EXPR <prephitmp_109, 0.0>;
  _6 = MEM[(double *)_79 + ivtmp.33_102 * 1];
  cff4_44 = MAX_EXPR <_6, 0.0>;

O2 fast vect (very-cheap):

  <bb 6> [local count: 955630225]:
  # doloop.9_55 = PHI <doloop.9_57(6), doloop.9_105(5)>
  # ivtmp.37_102 = PHI <ivtmp.37_101(6), ivtmp.46_72(5)>
  # ivtmp.38_92 = PHI <ivtmp.38_91(6), ivtmp.38_90(5)>
  _77 = (double[400][600] *) ivtmp.48_62;
  _1 = MEM[(double *)_77 + ivtmp.37_102 * 1];
  cff_38 = _1 * 5.0e-1;
  _2 = MEM[(double *)&dZdx + ivtmp.38_92 * 1];       // redundant load
  cff1_40 = MIN_EXPR <_2, 0.0>;
  _4 = MEM[(double *)&dZdx + 8B + ivtmp.37_102 * 1];
  cff2_42 = MIN_EXPR <_4, 0.0>;
  _5 = MEM[(double *)&dZdx + ivtmp.37_102 * 1];      // redundant load
  cff3_43 = MAX_EXPR <_5, 0.0>;
  _6 = MEM[(double *)&dZdx + 8B + ivtmp.38_92 * 1];
  cff4_44 = MAX_EXPR <_6, 0.0>;

The root cause is this: in the baseline version, PRE manages to reuse load
results from previous iterations, which saves several loads.  In the test
line version, because of the check below:

      /* Inhibit the use of an inserted PHI on a loop header when
         the address of the memory reference is a simple induction
         variable.  In other cases the vectorizer won't do anything
         anyway (either it's loop invariant or a complicated
         expression).  */
      if (sprime
          && TREE_CODE (sprime) == SSA_NAME
          && do_pre
          && (flag_tree_loop_vectorize || flag_tree_parallelize_loops > 1)

PRE does not perform this optimization, to avoid introducing a loop-carried
dependence.  That makes sense.  But unfortunately the expected downstream
loop vectorization is not performed on the given loop, because the
"very-cheap" cost model does not allow the vectorizer to peel for niters.
No later pass tries to optimize this, so it eventually results in suboptimal
code.  Rerunning PRE once after loop vectorization does fix the degradation,
but I am not sure that is practical, since iterating PRE seems quite
time-consuming.  Alternatively, we could tag this kind of loop and later
rerun PRE only on the tagged ones, but it also seems impractical to predict
whether a given loop can be loop-vectorized later.  I am also not sure
whether some existing pass could be taught to handle this.
Thread overview: 13+ messages

2021-05-27  7:48 linkw at gcc dot gnu.org [this message]
2021-05-28  6:39 ` [Bug tree-optimization/100794] " rguenth at gcc dot gnu.org
2021-05-28  7:29 ` linkw at gcc dot gnu.org
2021-05-28  7:36 ` rguenther at suse dot de
2021-05-28 11:30 ` linkw at gcc dot gnu.org
2021-05-28 12:23 ` rguenther at suse dot de
2021-05-31  6:05 ` linkw at gcc dot gnu.org
2021-05-31  6:06 ` linkw at gcc dot gnu.org
2021-05-31  6:08 ` linkw at gcc dot gnu.org
2021-05-31  6:18 ` linkw at gcc dot gnu.org
2021-05-31  6:19 ` linkw at gcc dot gnu.org
2021-06-08  5:30 ` cvs-commit at gcc dot gnu.org
2021-06-09  2:20 ` linkw at gcc dot gnu.org