[Bug tree-optimization/102054] New: slightly worse code as PRE on some code got disabled for loop vectorization


From: linkw at gcc dot gnu.org @ 2021-08-25  7:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102054

            Bug ID: 102054
           Summary: slightly worse code as PRE on some code got disabled
                    for loop vectorization
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: linkw at gcc dot gnu.org
  Target Milestone: ---

This test case is reduced from FastBoard.cpp in the SPEC2017 benchmark
541.leela_r. I found it while investigating the -O2 vectorization degradation
on SPEC2017 runs. The issue is similar to PR100794, but that one applies only
at -O2 and was fixed by re-running pcom at -O2, whereas this one affects -O3
vectorization as well.

TEST CASE:

class FastBoard {
public:
    static const int NBR_SHIFT = 4;
    static const int MAXBOARDSIZE = 19;
    static const int MAXSQ = ((MAXBOARDSIZE + 2) * (MAXBOARDSIZE + 2));
    enum square_t {
        BLACK = 0, WHITE = 1, EMPTY = 2, INVAL = 3
    };

    bool self_atari(int color, int vertex);

protected:
    int m_dirs[4];
    square_t m_square[MAXSQ];
    int nbr_libs[20];
};

bool FastBoard::self_atari(int color, int vertex) {
  int nbr_libs_cnt = 0;
  nbr_libs[nbr_libs_cnt++] = vertex;

  for (int k = 0; k < 20; k++) {
    int ai = vertex + m_dirs[k];

    if (m_square[ai] == FastBoard::EMPTY) {
      bool found = false;

      for (int i = 0; i < nbr_libs_cnt; i++) {
        if (nbr_libs[i] == ai) {
          found = true;
          break;
        }
      }

      if (!found) {
        if (nbr_libs_cnt > 1)
          return false;
        nbr_libs[nbr_libs_cnt++] = ai;
      }
    }
  }

  return true;
}

Options: -mcpu=power9 -Ofast (or -O2 -ftree-vectorize) etc.

With -fno-tree-loop-vectorize, PRE passes vertex_11 down as the value of
nbr_libs[0], so the first comparison against ai needs no load:

  <bb 3> [local count: 1014686026]:
  # prephitmp_26 = PHI <pretmp_28(5), vertex_11(D)(10)>
  # ivtmp.17_27 = PHI <ivtmp.17_3(5), ivtmp.17_8(10)>
  if (ai_15 == prephitmp_26)
    goto <bb 8>; [5.50%]
  else
    goto <bb 4>; [94.50%]

  <bb 4> [local count: 958878295]:
  if (ivtmp.17_27 != _31)
    goto <bb 5>; [93.84%]
  else
    goto <bb 11>; [6.16%]

  <bb 5> [local count: 899822494]:
  ivtmp.17_3 = ivtmp.17_27 + 4;
  _21 = (void *) ivtmp.17_3;
  pretmp_28 = MEM[(int *)_21];
  goto <bb 3>; [100.00%]


Without -fno-tree-loop-vectorize, the IR below is generated instead, and it
always performs the load before the comparison with ai:

  <bb 4> [local count: 1014686026]:
  # ivtmp.12_27 = PHI <ivtmp.12_28(5), ivtmp.12_26(3)>
  ivtmp.12_28 = ivtmp.12_27 + 4;
  _22 = (void *) ivtmp.12_28;
  _3 = MEM[(int *)_22];
  if (_3 == ai_15)
    goto <bb 8>; [5.50%]
  else
    goto <bb 5>; [94.50%]


  <bb 5> [local count: 958878295]:
  if (ivtmp.12_28 != _30)
    goto <bb 4>; [93.84%]
  else
    goto <bb 10>; [6.16%]
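
As a rough source-level reading of the two dumps above, the inner search loop
can be sketched as follows. This is only an approximation for illustration:
the function names found_pre_shape and found_load_first are hypothetical, and
the parameter `first` stands for the value PRE already has in a register
(vertex_11 on loop entry in the test case).

```cpp
#include <cassert>

// Shape seen with -fno-tree-loop-vectorize (PRE applied). The value for the
// first comparison arrives in a register; later values are loaded at the
// bottom of the loop, after the exit test. Requires cnt >= 1.
bool found_pre_shape(const int *libs, int cnt, int first, int ai) {
  assert(cnt >= 1);
  int cur = first;               // plays the role of prephitmp_26
  for (int i = 0;;) {
    if (cur == ai)               // <bb 3>: compare without a fresh load
      return true;
    if (++i == cnt)              // <bb 4>: exit test
      return false;
    cur = libs[i];               // <bb 5>: load for the next iteration
  }
}

// Shape seen with loop vectorization enabled: every iteration, including
// the first, loads from memory before comparing. Requires cnt >= 1.
bool found_load_first(const int *libs, int cnt, int ai) {
  assert(cnt >= 1);
  int i = 0;
  do {
    if (libs[i] == ai)           // <bb 4>: load, then compare
      return true;
  } while (++i != cnt);          // <bb 5>: exit test
  return false;
}
```

Both functions return the same result when `first == libs[0]`. The difference
is only where the load sits relative to the first comparison.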


* [Bug tree-optimization/102054] slightly worse code as PRE on some code got disabled for loop vectorization
From: linkw at gcc dot gnu.org @ 2021-08-25  7:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102054

Kewen Lin <linkw at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |crazylht at gmail dot com,
                   |                            |rguenth at gcc dot gnu.org,
                   |                            |rsandifo at gcc dot gnu.org,
                   |                            |segher at gcc dot gnu.org,
                   |                            |wschmidt at gcc dot gnu.org
           Keywords|                            |missed-optimization

--- Comment #1 from Kewen Lin <linkw at gcc dot gnu.org> ---
I forgot to mention that this costs only 0.3% on 541.leela_r, so I guess it
is low priority.


* [Bug tree-optimization/102054] slightly worse code as PRE on some code got disabled for loop vectorization
From: linkw at gcc dot gnu.org @ 2021-09-13  6:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102054

--- Comment #2 from Kewen Lin <linkw at gcc dot gnu.org> ---
Here is another reduced test case, this time from 526.blender_r.

#include <math.h>

typedef struct QMCSampler {
  struct QMCSampler *next, *prev;
  int type;
  int tot;
  int used;
  double *samp2d;
  double offs[1][2];
} QMCSampler;

float BLI_thread_frand(int thread);

static void halton_sample(double *ht_invprimes, double *ht_nums, double *v) {
  unsigned int i;

  for (i = 0; i < 2; i++) {
    double r = fabs((1.0 - ht_nums[i]) - 1e-10);

    if (ht_invprimes[i] >= r) {
      double lasth;
      double h = ht_invprimes[i];

      do {
        lasth = h;
        h *= ht_invprimes[i];
      } while (h >= r);

      ht_nums[i] += ((lasth + h) - 1.0);
    } else
      ht_nums[i] += ht_invprimes[i];

    v[i] = (float)ht_nums[i];
  }
}

void QMC_initPixel(QMCSampler *qsa, int thread) {
  if (qsa->type == 2) {
    qsa->offs[thread][0] = 0.5f * BLI_thread_frand(thread);
    qsa->offs[thread][1] = 0.5f * BLI_thread_frand(thread);
  } else {
    double ht_invprimes[2], ht_nums[2];
    double r[2];
    int i;

    ht_nums[0] = BLI_thread_frand(thread);
    ht_nums[1] = BLI_thread_frand(thread);
    ht_invprimes[0] = 0.5;
    ht_invprimes[1] = 1.0 / 3.0;

    for (i = 0; i < qsa->tot; i++) {
      halton_sample(ht_invprimes, ht_nums, r);
      qsa->samp2d[2 * i + 0] = r[0];
      qsa->samp2d[2 * i + 1] = r[1];
    }
  }
}

Without loop vectorization, unrestricted PRE leaves the loop in a form that
cunroll can handle, and the loop is completely unrolled. The performance
impact is also small, about 0.7%.
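
To make "completely unrolled" concrete, here is a minimal source-level sketch.
It assumes that the loop in question is the two-iteration loop in
halton_sample; the comment above does not say which loop it is. The names
halton_step, halton_rolled and halton_unrolled are hypothetical helpers, not
part of the test case.

```cpp
#include <cmath>

// One iteration of the loop body from halton_sample, factored out so the
// rolled and unrolled forms can share it.
static inline void halton_step(const double *ht_invprimes, double *ht_nums,
                               double *v, unsigned int i) {
  double r = std::fabs((1.0 - ht_nums[i]) - 1e-10);

  if (ht_invprimes[i] >= r) {
    double lasth;
    double h = ht_invprimes[i];

    do {
      lasth = h;
      h *= ht_invprimes[i];
    } while (h >= r);

    ht_nums[i] += ((lasth + h) - 1.0);
  } else {
    ht_nums[i] += ht_invprimes[i];
  }

  v[i] = (float)ht_nums[i];
}

// Rolled form, as written in the test case.
void halton_rolled(const double *ht_invprimes, double *ht_nums, double *v) {
  for (unsigned int i = 0; i < 2; i++)
    halton_step(ht_invprimes, ht_nums, v, i);
}

// Completely unrolled form: no loop counter and no back edge remain.
void halton_unrolled(const double *ht_invprimes, double *ht_nums, double *v) {
  halton_step(ht_invprimes, ht_nums, v, 0);
  halton_step(ht_invprimes, ht_nums, v, 1);
}
```

Both forms perform the same operations in the same order, so they produce
identical results. The unrolled form only drops the counter and the branch.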

