public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/113583] New: Main loop in 519.lbm not vectorized.
@ 2024-01-24 14:21 rdapp at gcc dot gnu.org
From: rdapp at gcc dot gnu.org @ 2024-01-24 14:21 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

            Bug ID: 113583
           Summary: Main loop in 519.lbm not vectorized.
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rdapp at gcc dot gnu.org
  Target Milestone: ---
            Target: x86_64-*-* riscv*-*-*

This might be a known issue, but a Bugzilla search for lbm didn't turn up
anything related.

With GCC, the main loop in SPEC2017 519.lbm is not vectorized on riscv, while
clang does vectorize it.  On x86, neither clang nor GCC seems to vectorize it.

The following example is not entirely minimal, but let's start somewhere.
Unlike the full loop, it is vectorized by clang-17 on x86, but not by GCC trunk
on x86 or on the other targets I checked.

#define CST1 (1.0 / 3.0)

typedef enum
{
  C = 0,
  N, S, E, W, T, B, NW,
  NE, A, BB, CC, D, EE, FF, GG,
  HH, II, JJ, FLAGS, NN
} CELL_ENTRIES;

#define SX 100
#define SY 100
#define SZ 130

#define CALC_INDEX(x, y, z, e) ((e) + NN * ((x) + (y) * SX + (z) * SX * SY))

#define GRID_ENTRY_SWEEP(g, dx, dy, dz, e) \
  ((g)[CALC_INDEX (dx, dy, dz, e) + (i)])

#define LOCAL(g, e) (GRID_ENTRY_SWEEP (g, 0, 0, 0, e))
#define NEIGHBOR_C(g, e) (GRID_ENTRY_SWEEP (g, 0, 0, 0, e))
#define NEIGHBOR_S(g, e) (GRID_ENTRY_SWEEP (g, 0, -1, 0, e))
#define NEIGHBOR_N(g, e) (GRID_ENTRY_SWEEP (g, 0, +1, 0, e))
#define NEIGHBOR_E(g, e) (GRID_ENTRY_SWEEP (g, +1, 0, 0, e))

#define SRC_C(g) (LOCAL (g, C))
#define SRC_N(g) (LOCAL (g, N))
#define SRC_S(g) (LOCAL (g, S))
#define SRC_E(g) (LOCAL (g, E))
#define SRC_W(g) (LOCAL (g, W))

#define DST_C(g) (NEIGHBOR_C (g, C))
#define DST_N(g) (NEIGHBOR_N (g, N))
#define DST_S(g) (NEIGHBOR_S (g, S))
#define DST_E(g) (NEIGHBOR_E (g, E))

typedef double arr[SX * SY * SZ * NN];

#define OMEGA 0.123

void
foo (arr src, arr dst)
{
  double ux, uy, u2;
  const double lambda0 = 1.0 / (0.5 + 3.0 / (16.0 * (1.0 / OMEGA - 0.5)));
  double fs[NN], fa[NN], feqs[NN], feqa[NN];

  for (int i = 0; i < SX * SY * SZ * NN; i += NN)
    {
      ux = 1.0;
      uy = 1.0;

      feqs[C] = CST1 * (1.0);
      feqs[N] = feqs[S] = CST1 * (1.0 + 4.5 * (+uy) * (+uy));

      feqa[C] = 0.0;
      feqa[N] = 0.2;

      fs[C] = SRC_C (src);
      fs[N] = fs[S] = 0.5 * (SRC_N (src) + SRC_S (src));

      fa[C] = 0.0;
      fa[N] = 0.1;

      DST_C (dst) = SRC_C (src) - OMEGA * (fs[C] - feqs[C]);
      DST_N (dst)
        = SRC_N (src) - OMEGA * (fs[N] - feqs[N])
          - lambda0 * (fa[N] - feqa[N]);
    }
}
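
To make the access pattern easier to see, here is a hand-expanded sketch of the
loop above with the macros and the constant temporaries folded in.  The names
foo_expanded, n_cells and DST_N_OFFSET are not part of the test case; they are
assumptions made for illustration, and the loop bound is a parameter so that
the sketch can be run on small buffers.  Each iteration advances by NN = 20
doubles but only reads entries C, N and S of src, which is presumably why the
loads are treated as a group of 20 with gaps.

```c
#include <stddef.h>

/* Same values as the first entries of CELL_ENTRIES in the test case.  */
enum { C = 0, N = 1, S = 2, NN = 20 };
enum { SX = 100 };

#define OMEGA 0.123
#define CST1 (1.0 / 3.0)

/* Hypothetical name: CALC_INDEX (0, +1, 0, N) expands to N + NN * SX.  */
enum { DST_N_OFFSET = N + NN * SX };

/* Hand-expanded variant of foo.  Assumes dst has room for at least
   (n_cells - 1) * NN + DST_N_OFFSET + 1 doubles.  */
void
foo_expanded (const double *src, double *dst, size_t n_cells)
{
  const double lambda0 = 1.0 / (0.5 + 3.0 / (16.0 * (1.0 / OMEGA - 0.5)));
  const double feqs_c = CST1;               /* feqs[C] */
  const double feqs_n = CST1 * (1.0 + 4.5); /* feqs[N] with uy = 1.0 */

  for (size_t i = 0; i < n_cells * NN; i += NN)
    {
      double fs_c = src[i + C];
      double fs_n = 0.5 * (src[i + N] + src[i + S]);

      /* Two stores per cell, both with a stride of NN doubles.  */
      dst[i + C] = src[i + C] - OMEGA * (fs_c - feqs_c);
      dst[i + DST_N_OFFSET]
        = src[i + N] - OMEGA * (fs_n - feqs_n) - lambda0 * (0.1 - 0.2);
    }
}
```

The last term uses fa[N] - feqa[N] = 0.1 - 0.2 from the original loop body.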

GCC's vectorizer reports:

missed.c:19:2: note:   ==> examining statement: _4 = *_3;
missed.c:19:2: missed:   no array mode for V8DF[20]
missed.c:19:2: missed:   no array mode for V8DF[20]
missed.c:19:2: missed:   the size of the group of accesses is not a power of 2 or not equal to 3
missed.c:19:2: missed:   not falling back to elementwise accesses
missed.c:43:11: missed:   not vectorized: relevant stmt not supported: _4 = *_3;
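
The "group of accesses" message can be read as a simple predicate on the group
size.  The sketch below is a hypothetical helper, not GCC's implementation: it
encodes only the condition as worded in the diagnostic, and it assumes that the
group size in question is NN = 20, as the V8DF[20] array mode suggests.

```c
#include <stdbool.h>

/* Hypothetical helper, not GCC code: true if a group of COUNT accesses
   satisfies the condition named in the diagnostic, i.e. COUNT is 3 or a
   power of 2.  */
bool
group_size_ok (unsigned count)
{
  bool is_pow2 = count != 0 && (count & (count - 1)) == 0;
  return count == 3 || is_pow2;
}
```

Under that reading, group_size_ok (20) is false, while a group of 16 or 3
would pass the check.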


See also https://godbolt.org/z/P517qc3Yf for riscv and
https://godbolt.org/z/M134KvEEo for aarch64.  For aarch64 it seems clang could
vectorize the snippet but does not consider it profitable to do so.

For riscv and the full lbm workload, the clang build executes roughly one third
as many dynamic instructions under qemu as the GCC build: about 340 billion
vs. 1200 billion.


Thread overview: 21+ messages
2024-01-24 14:21 [Bug tree-optimization/113583] New: Main loop in 519.lbm not vectorized rdapp at gcc dot gnu.org
2024-01-24 14:42 ` [Bug tree-optimization/113583] " juzhe.zhong at rivai dot ai
2024-01-24 14:44 ` rdapp at gcc dot gnu.org
2024-01-24 15:00 ` juzhe.zhong at rivai dot ai
2024-01-25  3:06 ` juzhe.zhong at rivai dot ai
2024-01-25  3:13 ` juzhe.zhong at rivai dot ai
2024-01-25  5:41 ` pinskia at gcc dot gnu.org
2024-01-25  9:05 ` rguenther at suse dot de
2024-01-25  9:16 ` juzhe.zhong at rivai dot ai
2024-01-25  9:34 ` rguenth at gcc dot gnu.org
2024-01-26  9:50 ` rdapp at gcc dot gnu.org
2024-01-26 10:21 ` rguenther at suse dot de
2024-02-05  6:59 ` juzhe.zhong at rivai dot ai
2024-02-07  3:39 ` juzhe.zhong at rivai dot ai
2024-02-07  7:48 ` juzhe.zhong at rivai dot ai
2024-02-07  8:04 ` rguenther at suse dot de
2024-02-07  8:08 ` juzhe.zhong at rivai dot ai
2024-02-07  8:13 ` juzhe.zhong at rivai dot ai
2024-02-07 10:24 ` rguenther at suse dot de
2024-05-13 14:17 ` rdapp at gcc dot gnu.org
2024-05-16 12:41 ` rguenth at gcc dot gnu.org