public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/113583] New: Main loop in 519.lbm not vectorized.
@ 2024-01-24 14:21 rdapp at gcc dot gnu.org
  2024-01-24 14:42 ` [Bug tree-optimization/113583] " juzhe.zhong at rivai dot ai
                   ` (19 more replies)
  0 siblings, 20 replies; 21+ messages in thread
From: rdapp at gcc dot gnu.org @ 2024-01-24 14:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

            Bug ID: 113583
           Summary: Main loop in 519.lbm not vectorized.
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rdapp at gcc dot gnu.org
  Target Milestone: ---
            Target: x86_64-*-* riscv*-*-*

This might be a known issue but a bugzilla search regarding lbm didn't show
anything related.

The main loop in SPEC2017 519.lbm is not vectorized by GCC for riscv while clang
does vectorize it.  For x86 neither clang nor GCC seems to vectorize it.

A (not entirely minimal, but let's start somewhere) example is the following.
This one, however, is vectorized by clang-17 on x86 but not by GCC trunk on x86
or the other targets I checked.

#define CST1 (1.0 / 3.0)

typedef enum
{
  C = 0,
  N, S, E, W, T, B, NW,
  NE, A, BB, CC, D, EE, FF, GG,
  HH, II, JJ, FLAGS, NN
} CELL_ENTRIES;

#define SX 100
#define SY 100
#define SZ 130

#define CALC_INDEX(x, y, z, e) ((e) + NN * ((x) + (y) * SX + (z) * SX * SY))

#define GRID_ENTRY_SWEEP(g, dx, dy, dz, e) ((g)[CALC_INDEX (dx, dy, dz, e) + (i)])

#define LOCAL(g, e) (GRID_ENTRY_SWEEP (g, 0, 0, 0, e))
#define NEIGHBOR_C(g, e) (GRID_ENTRY_SWEEP (g, 0, 0, 0, e))
#define NEIGHBOR_S(g, e) (GRID_ENTRY_SWEEP (g, 0, -1, 0, e))
#define NEIGHBOR_N(g, e) (GRID_ENTRY_SWEEP (g, 0, +1, 0, e))
#define NEIGHBOR_E(g, e) (GRID_ENTRY_SWEEP (g, +1, 0, 0, e))

#define SRC_C(g) (LOCAL (g, C))
#define SRC_N(g) (LOCAL (g, N))
#define SRC_S(g) (LOCAL (g, S))
#define SRC_E(g) (LOCAL (g, E))
#define SRC_W(g) (LOCAL (g, W))

#define DST_C(g) (NEIGHBOR_C (g, C))
#define DST_N(g) (NEIGHBOR_N (g, N))
#define DST_S(g) (NEIGHBOR_S (g, S))
#define DST_E(g) (NEIGHBOR_E (g, E))

typedef double arr[SX * SY * SZ * NN];

#define OMEGA 0.123

void
foo (arr src, arr dst)
{
  double ux, uy, u2;
  const double lambda0 = 1.0 / (0.5 + 3.0 / (16.0 * (1.0 / OMEGA - 0.5)));
  double fs[NN], fa[NN], feqs[NN], feqa[NN];

  for (int i = 0; i < SX * SY * SZ * NN; i += NN)
    {
      ux = 1.0;
      uy = 1.0;

      feqs[C] = CST1 * (1.0);
      feqs[N] = feqs[S] = CST1 * (1.0 + 4.5 * (+uy) * (+uy));

      feqa[C] = 0.0;
      feqa[N] = 0.2;

      fs[C] = SRC_C (src);
      fs[N] = fs[S] = 0.5 * (SRC_N (src) + SRC_S (src));

      fa[C] = 0.0;
      fa[N] = 0.1;

      DST_C (dst) = SRC_C (src) - OMEGA * (fs[C] - feqs[C]);
      DST_N (dst)
        = SRC_N (src) - OMEGA * (fs[N] - feqs[N]) - lambda0 * (fa[N] - feqa[N]);
    }
}



missed.c:19:2: note:   ==> examining statement: _4 = *_3;
missed.c:19:2: missed:   no array mode for V8DF[20]
missed.c:19:2: missed:   no array mode for V8DF[20]
missed.c:19:2: missed:   the size of the group of accesses is not a power of 2 or not equal to 3
missed.c:19:2: missed:   not falling back to elementwise accesses
missed.c:43:11: missed:   not vectorized: relevant stmt not supported: _4 = *_3;


Also refer to https://godbolt.org/z/P517qc3Yf for riscv and
https://godbolt.org/z/M134KvEEo for aarch64.  For aarch64 it seems clang would
vectorize the snippet but does not consider it profitable to do so.

For riscv and the full lbm workload the clang build executes roughly one third
the number of dynamic qemu instructions of the GCC build, 340 billion vs 1200
billion.
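
For reference, the snippet boils down to constant-stride accesses.  Below is a
hypothetical stripped-down sketch (reusing the enum from the snippet above; the
helper name is mine, not part of lbm) of the access pattern the vectorizer sees:

/* Within one iteration the loads and stores form a group of NN (= 20)
   doubles; across iterations each group member advances by NN doubles
   (160 bytes), i.e. every access is a constant-stride stream.  */
void
access_pattern (const double *src, double *dst, int n)
{
  for (int i = 0; i < n; i += NN)
    {
      double c = src[i + C];        /* stride of NN doubles per iteration */
      double n_val = src[i + N];
      double s_val = src[i + S];
      dst[i + C] = c;               /* the stores form the same kind of group */
      dst[i + N] = 0.5 * (n_val + s_val);
    }
}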


* [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
  2024-01-24 14:21 [Bug tree-optimization/113583] New: Main loop in 519.lbm not vectorized rdapp at gcc dot gnu.org
@ 2024-01-24 14:42 ` juzhe.zhong at rivai dot ai
  2024-01-24 14:44 ` rdapp at gcc dot gnu.org
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-01-24 14:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #1 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
Interestingly, with Clang only RISC-V vectorizes it.

I think there are 2 topics:

1. Support vectorization of this code in the loop vectorizer.
2. Transform gather/scatter into strided load/store for RISC-V.

For the 2nd topic: LLVM does it with a RISC-V target-specific lowering pass:

RISC-V gather/scatter lowering (riscv-gather-scatter-lowering)

This is the RISC-V LLVM backend code:

  if (II->getIntrinsicID() == Intrinsic::masked_gather)
    Call = Builder.CreateIntrinsic(
        Intrinsic::riscv_masked_strided_load,
        {DataType, BasePtr->getType(), Stride->getType()},
        {II->getArgOperand(3), BasePtr, Stride, II->getArgOperand(2)});
  else
    Call = Builder.CreateIntrinsic(
        Intrinsic::riscv_masked_strided_store,
        {DataType, BasePtr->getType(), Stride->getType()},
        {II->getArgOperand(0), BasePtr, Stride, II->getArgOperand(3)});

I have previously tried to support strided load/store in the GCC loop
vectorizer, but that approach did not seem acceptable.  Maybe we can support
strided load/stores by leveraging the LLVM approach?

Btw, the LLVM RISC-V gather/scatter lowering doesn't do a perfect job here:

        vid.v   v8
        vmul.vx v8, v8, a3
....

        vsoxei64.v      v10, (s2), v14

This is an ordered (in-order) indexed store, which is very costly in hardware.
It should be an unordered indexed store or a strided store.
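
For illustration, here is a minimal sketch of what such a strided access looks
like at the C level, assuming the standard RVV intrinsics from
<riscv_vector.h> (the function name and parameters are made up, not part of
LLVM or GCC); it should map to vlse64.v/vsse64.v rather than indexed
gathers/scatters:

#include <riscv_vector.h>
#include <stddef.h>

/* Copy every (stride_bytes / 8)-th double from src to dst using strided
   vector load/store (vlse64.v/vsse64.v) instead of an indexed
   gather/scatter.  */
void
copy_strided (const double *src, double *dst, size_t n, ptrdiff_t stride_bytes)
{
  size_t step = stride_bytes / sizeof (double);
  for (size_t i = 0; i < n; )
    {
      size_t vl = __riscv_vsetvl_e64m1 (n - i);
      vfloat64m1_t v = __riscv_vlse64_v_f64m1 (src, stride_bytes, vl);
      __riscv_vsse64_v_f64m1 (dst, stride_bytes, v, vl);
      src += vl * step;
      dst += vl * step;
      i += vl;
    }
}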

Anyway, I think we should first investigate how to support vectorization of lbm
in the loop vectorizer.


* [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
  2024-01-24 14:21 [Bug tree-optimization/113583] New: Main loop in 519.lbm not vectorized rdapp at gcc dot gnu.org
  2024-01-24 14:42 ` [Bug tree-optimization/113583] " juzhe.zhong at rivai dot ai
@ 2024-01-24 14:44 ` rdapp at gcc dot gnu.org
  2024-01-24 15:00 ` juzhe.zhong at rivai dot ai
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: rdapp at gcc dot gnu.org @ 2024-01-24 14:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #2 from Robin Dapp <rdapp at gcc dot gnu.org> ---
> It's interesting, for Clang only RISC-V can vectorize it.

The full loop can be vectorized by clang on x86 as well when I remove the first
conditional (which is not in the snippet I posted above).  So that's likely a
different issue than the loop itself.


* [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
  2024-01-24 14:21 [Bug tree-optimization/113583] New: Main loop in 519.lbm not vectorized rdapp at gcc dot gnu.org
  2024-01-24 14:42 ` [Bug tree-optimization/113583] " juzhe.zhong at rivai dot ai
  2024-01-24 14:44 ` rdapp at gcc dot gnu.org
@ 2024-01-24 15:00 ` juzhe.zhong at rivai dot ai
  2024-01-25  3:06 ` juzhe.zhong at rivai dot ai
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-01-24 15:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #3 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
OK, I see.

If we change NN to 8, then we can vectorize it with load_lanes/store_lanes
with group size = 8:

https://godbolt.org/z/doe9c3hfo

We will use vlseg8e64, which uses RVVM1DF[8] == RVVM1x8DF mode.

Here is the report:

/app/example.c:47:21: missed:   no array mode for RVVM1DF[20]
/app/example.c:47:21: missed:   no array mode for RVVM1DF[20]

I believe if we enabled vec_load_lanes/vec_store_lanes for RVVM1DF[20], which is
RVVM1x20DF mode, then we could vectorize it.

But that is neither reasonable nor a general way to do it.  This code requires
an array size of 20; other code may require an array size of 21, 22, 23, etc.
The array size can be any number, so we can't extend this approach to arbitrary
array sizes.

So, the idea is to first check whether vec_load_lanes/vec_store_lanes supports
lane vectorization for the specific array size.

If not, we should be able to lower the accesses into multiple gathers/scatters
or strided load/stores.
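
A plain-C sketch of that fallback (f() is just a placeholder for the
per-element arithmetic, not an lbm function): even without a lanes mode, every
member of the group of 20 is a constant-stride stream that could be lowered to
a strided load/store or a constant-stride gather/scatter.

static double f (double x) { return x; }    /* placeholder per-lane math */

void
sweep_elementwise (const double *restrict src, double *restrict dst, int n)
{
  for (int i = 0; i < n; i += 20)
    for (int e = 0; e < 20; e++)            /* 20 independent stride-20 streams */
      dst[i + e] = f (src[i + e]);
}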


* [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
  2024-01-24 14:21 [Bug tree-optimization/113583] New: Main loop in 519.lbm not vectorized rdapp at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2024-01-24 15:00 ` juzhe.zhong at rivai dot ai
@ 2024-01-25  3:06 ` juzhe.zhong at rivai dot ai
  2024-01-25  3:13 ` juzhe.zhong at rivai dot ai
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-01-25  3:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #4 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
OK.  Confirmed that on x86 GCC fails to vectorize it, whereas Clang on x86 can
vectorize it.

https://godbolt.org/z/EaTjGbPGW

The x86 Clang and RISC-V Clang IR are the same:

  %12 = tail call <8 x double> @llvm.masked.gather.v8f64.v8p0(<8 x ptr> %11,
i32 8, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true,
i1 true>, <8 x double> poison), !dbg !62
  %13 = or disjoint <8 x i64> %10, <i64 1, i64 1, i64 1, i64 1, i64 1, i64 1,
i64 1, i64 1>, !dbg !72
  %14 = getelementptr inbounds double, ptr %0, <8 x i64> %13, !dbg !72
  %15 = tail call <8 x double> @llvm.masked.gather.v8f64.v8p0(<8 x ptr> %14,
i32 8, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true,
i1 true>, <8 x double> poison), !dbg !72
  %16 = or disjoint <8 x i64> %10, <i64 2, i64 2, i64 2, i64 2, i64 2, i64 2,
i64 2, i64 2>, !dbg !73
  %17 = getelementptr inbounds double, ptr %0, <8 x i64> %16, !dbg !73
  %18 = tail call <8 x double> @llvm.masked.gather.v8f64.v8p0(<8 x ptr> %17,
i32 8, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true,
i1 true>, <8 x double> poison), !dbg !73
  %19 = fadd <8 x double> %15, %18, !dbg !74
  %20 = fmul <8 x double> %19, <double 5.000000e-01, double 5.000000e-01,
double 5.000000e-01, double 5.000000e-01, double 5.000000e-01, double
5.000000e-01, double 5.000000e-01, double 5.000000e-01>, !dbg !75
  %21 = fadd <8 x double> %12, <double 0xBFD5555555555555, double
0xBFD5555555555555, double 0xBFD5555555555555, double 0xBFD5555555555555,
double 0xBFD5555555555555, double 0xBFD5555555555555, double
0xBFD5555555555555, double 0xBFD5555555555555>, !dbg !76
  %22 = tail call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %21, <8 x
double> <double -1.230000e-01, double -1.230000e-01, double -1.230000e-01,
double -1.230000e-01, double -1.230000e-01, double -1.230000e-01, double
-1.230000e-01, double -1.230000e-01>, <8 x double> %12), !dbg !77
  %23 = getelementptr inbounds double, ptr %1, <8 x i64> %10, !dbg !77
  tail call void @llvm.masked.scatter.v8f64.v8p0(<8 x double> %22, <8 x ptr>
%23, i32 8, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1
true, i1 true>), !dbg !78
  %24 = tail call <8 x double> @llvm.masked.gather.v8f64.v8p0(<8 x ptr> %14,
i32 8, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true,
i1 true>, <8 x double> poison), !dbg !81
  %25 = fadd <8 x double> %20, <double 0xBFFD555555555555, double
0xBFFD555555555555, double 0xBFFD555555555555, double 0xBFFD555555555555,
double 0xBFFD555555555555, double 0xBFFD555555555555, double
0xBFFD555555555555, double 0xBFFD555555555555>, !dbg !82
  %26 = tail call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %25, <8 x
double> <double -1.230000e-01, double -1.230000e-01, double -1.230000e-01,
double -1.230000e-01, double -1.230000e-01, double -1.230000e-01, double
-1.230000e-01, double -1.230000e-01>, <8 x double> %24), !dbg !83
  %27 = fadd <8 x double> %26, <double 0x3FC8669851CB9250, double
0x3FC8669851CB9250, double 0x3FC8669851CB9250, double 0x3FC8669851CB9250,
double 0x3FC8669851CB9250, double 0x3FC8669851CB9250, double
0x3FC8669851CB9250, double 0x3FC8669851CB9250>, !dbg !84
  %28 = getelementptr double, <8 x ptr> %23, i64 2001, !dbg !84
  tail call void @llvm.masked.scatter.v8f64.v8p0(<8 x double> %27, <8 x ptr>
%28, i32 8, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1
true, i1 true>), !dbg !85

Hi Richard, do you have any suggestions about this issue?
Thanks.


* [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
  2024-01-24 14:21 [Bug tree-optimization/113583] New: Main loop in 519.lbm not vectorized rdapp at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2024-01-25  3:06 ` juzhe.zhong at rivai dot ai
@ 2024-01-25  3:13 ` juzhe.zhong at rivai dot ai
  2024-01-25  5:41 ` pinskia at gcc dot gnu.org
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-01-25  3:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #5 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
Both ICC and Clang X86 can vectorize SPEC 2017 lbm:

https://godbolt.org/z/MjbTbYf1G

But I am not sure whether x86 ICC or x86 Clang is better.


* [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
  2024-01-24 14:21 [Bug tree-optimization/113583] New: Main loop in 519.lbm not vectorized rdapp at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2024-01-25  3:13 ` juzhe.zhong at rivai dot ai
@ 2024-01-25  5:41 ` pinskia at gcc dot gnu.org
  2024-01-25  9:05 ` rguenther at suse dot de
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-01-25  5:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement


* [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
  2024-01-24 14:21 [Bug tree-optimization/113583] New: Main loop in 519.lbm not vectorized rdapp at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2024-01-25  5:41 ` pinskia at gcc dot gnu.org
@ 2024-01-25  9:05 ` rguenther at suse dot de
  2024-01-25  9:16 ` juzhe.zhong at rivai dot ai
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: rguenther at suse dot de @ 2024-01-25  9:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #6 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 25 Jan 2024, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583
> 
> --- Comment #5 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
> Both ICC and Clang X86 can vectorize SPEC 2017 lbm:
> 
> https://godbolt.org/z/MjbTbYf1G
> 
> But I am not sure X86 ICC is better or X86 Clang is better.

gather/scatter are possibly slow (and gather now has that Intel
security issue).  The reason is a "cost" one:

t.c:47:21: note:   ==> examining statement: _4 = *_3;
t.c:47:21: missed:   no array mode for V8DF[20]
t.c:47:21: missed:   no array mode for V8DF[20]
t.c:47:21: missed:   the size of the group of accesses is not a power of 2 or not equal to 3
t.c:47:21: missed:   not falling back to elementwise accesses
t.c:58:15: missed:   not vectorized: relevant stmt not supported: _4 = *_3;
t.c:47:21: missed:  bad operation or unsupported loop bound.

where we don't consider using gather because we have a known constant
stride (20).  Since the stores are really scatters we don't attempt
to SLP either.

Disabling the above heuristic we get this vectorized as well, avoiding
gather/scatter by manually implementing them and using a quite high
VF of 8 (with -mprefer-vector-width=256 you get VF 4 and likely
faster code in the end).  But yes, I doubt that any of ICC or clang
vectorized codes are faster anywhere (but without specifying an
uarch you get some generic cost modelling applied).  Maybe SPR doesn't
have the gather bug and it does have reasonable gather and scatter
(zen4 scatter sucks).

.L3:
        vmovsd  952(%rax), %xmm0
        vmovsd  -8(%rax), %xmm2
        addq    $1280, %rsi
        addq    $1280, %rax
        vmovhpd -168(%rax), %xmm0, %xmm1
        vmovhpd -1128(%rax), %xmm2, %xmm2
        vmovsd  -648(%rax), %xmm0
        vmovhpd -488(%rax), %xmm0, %xmm0
        vinsertf32x4    $0x1, %xmm1, %ymm0, %ymm0
        vmovsd  -968(%rax), %xmm1
        vmovhpd -808(%rax), %xmm1, %xmm1
        vinsertf32x4    $0x1, %xmm1, %ymm2, %ymm2
        vinsertf64x4    $0x1, %ymm0, %zmm2, %zmm2
        vmovsd  -320(%rax), %xmm0
        vmovhpd -160(%rax), %xmm0, %xmm1
        vmovsd  -640(%rax), %xmm0
        vmovhpd -480(%rax), %xmm0, %xmm0
        vinsertf32x4    $0x1, %xmm1, %ymm0, %ymm1
        vmovsd  -960(%rax), %xmm0
        vmovhpd -800(%rax), %xmm0, %xmm8
        vmovsd  -1280(%rax), %xmm0
        vmovhpd -1120(%rax), %xmm0, %xmm0
        vinsertf32x4    $0x1, %xmm8, %ymm0, %ymm0
        vinsertf64x4    $0x1, %ymm1, %zmm0, %zmm0
        vmovsd  -312(%rax), %xmm1
        vmovhpd -152(%rax), %xmm1, %xmm8
        vmovsd  -632(%rax), %xmm1
        vmovhpd -472(%rax), %xmm1, %xmm1
        vinsertf32x4    $0x1, %xmm8, %ymm1, %ymm8
        vmovsd  -952(%rax), %xmm1
        vmovhpd -792(%rax), %xmm1, %xmm9
        vmovsd  -1272(%rax), %xmm1
        vmovhpd -1112(%rax), %xmm1, %xmm1
        vinsertf32x4    $0x1, %xmm9, %ymm1, %ymm1
        vinsertf64x4    $0x1, %ymm8, %zmm1, %zmm1
        vaddpd  %zmm1, %zmm0, %zmm0
        vaddpd  %zmm7, %zmm2, %zmm1
        vfnmadd132pd    %zmm3, %zmm2, %zmm1
        vfmadd132pd     %zmm6, %zmm5, %zmm0
        valignq $3, %ymm1, %ymm1, %ymm2
        vmovlpd %xmm1, -1280(%rsi)
        vextractf64x2   $1, %ymm1, %xmm8
        vmovhpd %xmm1, -1120(%rsi)
        vextractf64x4   $0x1, %zmm1, %ymm1
        vmovlpd %xmm1, -640(%rsi)
        vmovhpd %xmm1, -480(%rsi)
        vmovsd  %xmm2, -800(%rsi)
        vextractf64x2   $1, %ymm1, %xmm2
        vmovsd  %xmm8, -960(%rsi)
        valignq $3, %ymm1, %ymm1, %ymm1
        vmovsd  %xmm2, -320(%rsi)
        vmovsd  %xmm1, -160(%rsi)
        vmovsd  -320(%rax), %xmm1
        vmovhpd -160(%rax), %xmm1, %xmm2
        vmovsd  -640(%rax), %xmm1
        vmovhpd -480(%rax), %xmm1, %xmm1
        vinsertf32x4    $0x1, %xmm2, %ymm1, %ymm2
        vmovsd  -960(%rax), %xmm1
        vmovhpd -800(%rax), %xmm1, %xmm8
        vmovsd  -1280(%rax), %xmm1
        vmovhpd -1120(%rax), %xmm1, %xmm1
        vinsertf32x4    $0x1, %xmm8, %ymm1, %ymm1
        vinsertf64x4    $0x1, %ymm2, %zmm1, %zmm1
        vfnmadd132pd    %zmm3, %zmm1, %zmm0
        vaddpd  %zmm4, %zmm0, %zmm0
        valignq $3, %ymm0, %ymm0, %ymm1
        vmovlpd %xmm0, 14728(%rsi)
        vextractf64x2   $1, %ymm0, %xmm2
        vmovhpd %xmm0, 14888(%rsi)
        vextractf64x4   $0x1, %zmm0, %ymm0
        vmovlpd %xmm0, 15368(%rsi)
        vmovhpd %xmm0, 15528(%rsi)
        vmovsd  %xmm1, 15208(%rsi)
        vextractf64x2   $1, %ymm0, %xmm1
        vmovsd  %xmm2, 15048(%rsi)
        valignq $3, %ymm0, %ymm0, %ymm0
        vmovsd  %xmm1, 15688(%rsi)
        vmovsd  %xmm0, 15848(%rsi)
        cmpq    %rdx, %rsi
        jne     .L3


* [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
  2024-01-24 14:21 [Bug tree-optimization/113583] New: Main loop in 519.lbm not vectorized rdapp at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2024-01-25  9:05 ` rguenther at suse dot de
@ 2024-01-25  9:16 ` juzhe.zhong at rivai dot ai
  2024-01-25  9:34 ` rguenth at gcc dot gnu.org
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-01-25  9:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #7 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to rguenther@suse.de from comment #6)
> On Thu, 25 Jan 2024, juzhe.zhong at rivai dot ai wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583
> > 
> > --- Comment #5 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
> > Both ICC and Clang X86 can vectorize SPEC 2017 lbm:
> > 
> > https://godbolt.org/z/MjbTbYf1G
> > 
> > But I am not sure X86 ICC is better or X86 Clang is better.
> 
> gather/scatter are possibly slow (and gather now has that Intel
> security issue).  The reason is a "cost" one:
> 
> t.c:47:21: note:   ==> examining statement: _4 = *_3;
> t.c:47:21: missed:   no array mode for V8DF[20]
> t.c:47:21: missed:   no array mode for V8DF[20]
> t.c:47:21: missed:   the size of the group of accesses is not a power of 2 
> or not equal to 3
> t.c:47:21: missed:   not falling back to elementwise accesses
> t.c:58:15: missed:   not vectorized: relevant stmt not supported: _4 = 
> *_3;
> t.c:47:21: missed:  bad operation or unsupported loop bound.
> 
> where we don't consider using gather because we have a known constant
> stride (20).  Since the stores are really scatters we don't attempt
> to SLP either.
> 
> Disabling the above heuristic we get this vectorized as well, avoiding
> gather/scatter by manually implementing them and using a quite high
> VF of 8 (with -mprefer-vector-width=256 you get VF 4 and likely
> faster code in the end).  But yes, I doubt that any of ICC or clang
> vectorized codes are faster anywhere (but without specifying an
> uarch you get some generic cost modelling applied).  Maybe SPR doesn't
> have the gather bug and it does have reasonable gather and scatter
> (zen4 scatter sucks).
> 
> .L3:
>         vmovsd  952(%rax), %xmm0
>         vmovsd  -8(%rax), %xmm2
>         addq    $1280, %rsi
>         addq    $1280, %rax
>         vmovhpd -168(%rax), %xmm0, %xmm1
>         vmovhpd -1128(%rax), %xmm2, %xmm2
>         vmovsd  -648(%rax), %xmm0
>         vmovhpd -488(%rax), %xmm0, %xmm0
>         vinsertf32x4    $0x1, %xmm1, %ymm0, %ymm0
>         vmovsd  -968(%rax), %xmm1
>         vmovhpd -808(%rax), %xmm1, %xmm1
>         vinsertf32x4    $0x1, %xmm1, %ymm2, %ymm2
>         vinsertf64x4    $0x1, %ymm0, %zmm2, %zmm2
>         vmovsd  -320(%rax), %xmm0
>         vmovhpd -160(%rax), %xmm0, %xmm1
>         vmovsd  -640(%rax), %xmm0
>         vmovhpd -480(%rax), %xmm0, %xmm0
>         vinsertf32x4    $0x1, %xmm1, %ymm0, %ymm1
>         vmovsd  -960(%rax), %xmm0
>         vmovhpd -800(%rax), %xmm0, %xmm8
>         vmovsd  -1280(%rax), %xmm0
>         vmovhpd -1120(%rax), %xmm0, %xmm0
>         vinsertf32x4    $0x1, %xmm8, %ymm0, %ymm0
>         vinsertf64x4    $0x1, %ymm1, %zmm0, %zmm0
>         vmovsd  -312(%rax), %xmm1
>         vmovhpd -152(%rax), %xmm1, %xmm8
>         vmovsd  -632(%rax), %xmm1
>         vmovhpd -472(%rax), %xmm1, %xmm1
>         vinsertf32x4    $0x1, %xmm8, %ymm1, %ymm8
>         vmovsd  -952(%rax), %xmm1
>         vmovhpd -792(%rax), %xmm1, %xmm9
>         vmovsd  -1272(%rax), %xmm1
>         vmovhpd -1112(%rax), %xmm1, %xmm1
>         vinsertf32x4    $0x1, %xmm9, %ymm1, %ymm1
>         vinsertf64x4    $0x1, %ymm8, %zmm1, %zmm1
>         vaddpd  %zmm1, %zmm0, %zmm0
>         vaddpd  %zmm7, %zmm2, %zmm1
>         vfnmadd132pd    %zmm3, %zmm2, %zmm1
>         vfmadd132pd     %zmm6, %zmm5, %zmm0
>         valignq $3, %ymm1, %ymm1, %ymm2
>         vmovlpd %xmm1, -1280(%rsi)
>         vextractf64x2   $1, %ymm1, %xmm8
>         vmovhpd %xmm1, -1120(%rsi)
>         vextractf64x4   $0x1, %zmm1, %ymm1
>         vmovlpd %xmm1, -640(%rsi)
>         vmovhpd %xmm1, -480(%rsi)
>         vmovsd  %xmm2, -800(%rsi)
>         vextractf64x2   $1, %ymm1, %xmm2
>         vmovsd  %xmm8, -960(%rsi)
>         valignq $3, %ymm1, %ymm1, %ymm1
>         vmovsd  %xmm2, -320(%rsi)
>         vmovsd  %xmm1, -160(%rsi)
>         vmovsd  -320(%rax), %xmm1
>         vmovhpd -160(%rax), %xmm1, %xmm2
>         vmovsd  -640(%rax), %xmm1
>         vmovhpd -480(%rax), %xmm1, %xmm1
>         vinsertf32x4    $0x1, %xmm2, %ymm1, %ymm2
>         vmovsd  -960(%rax), %xmm1
>         vmovhpd -800(%rax), %xmm1, %xmm8
>         vmovsd  -1280(%rax), %xmm1
>         vmovhpd -1120(%rax), %xmm1, %xmm1
>         vinsertf32x4    $0x1, %xmm8, %ymm1, %ymm1
>         vinsertf64x4    $0x1, %ymm2, %zmm1, %zmm1
>         vfnmadd132pd    %zmm3, %zmm1, %zmm0
>         vaddpd  %zmm4, %zmm0, %zmm0
>         valignq $3, %ymm0, %ymm0, %ymm1
>         vmovlpd %xmm0, 14728(%rsi)
>         vextractf64x2   $1, %ymm0, %xmm2
>         vmovhpd %xmm0, 14888(%rsi)
>         vextractf64x4   $0x1, %zmm0, %ymm0
>         vmovlpd %xmm0, 15368(%rsi)
>         vmovhpd %xmm0, 15528(%rsi)
>         vmovsd  %xmm1, 15208(%rsi)
>         vextractf64x2   $1, %ymm0, %xmm1
>         vmovsd  %xmm2, 15048(%rsi)
>         valignq $3, %ymm0, %ymm0, %ymm0
>         vmovsd  %xmm1, 15688(%rsi)
>         vmovsd  %xmm0, 15848(%rsi)
>         cmpq    %rdx, %rsi
>         jne     .L3

Thanks Richard.

>> Disabling the above heuristic we get this vectorized as well, avoiding
>> gather/scatter by manually implementing them and using a quite high
>> VF of 8 (with -mprefer-vector-width=256 you get VF 4 and likely
>> faster code in the end). 

OK.  It seems that enabling vectorization for this case doesn't necessarily make
lbm faster; it depends on the hardware.
I think we can test SPEC 2017 lbm on a RISC-V board to see whether vectorization
is beneficial.

But if we do see that it is beneficial on some boards, could you teach us how to
enable vectorization for such cases depending on the uarch?


* [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
  2024-01-24 14:21 [Bug tree-optimization/113583] New: Main loop in 519.lbm not vectorized rdapp at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2024-01-25  9:16 ` juzhe.zhong at rivai dot ai
@ 2024-01-25  9:34 ` rguenth at gcc dot gnu.org
  2024-01-26  9:50 ` rdapp at gcc dot gnu.org
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-01-25  9:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to JuzheZhong from comment #7)
>
> But I wonder if we see it is beneficial on some boards, could you teach us
> how we can enable vectorization for such case according to uarchs ?

If you figure out how to optimally vectorize this for a given uarch I can
definitely guide you.


* [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
  2024-01-24 14:21 [Bug tree-optimization/113583] New: Main loop in 519.lbm not vectorized rdapp at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2024-01-25  9:34 ` rguenth at gcc dot gnu.org
@ 2024-01-26  9:50 ` rdapp at gcc dot gnu.org
  2024-01-26 10:21 ` rguenther at suse dot de
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: rdapp at gcc dot gnu.org @ 2024-01-26  9:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #9 from Robin Dapp <rdapp at gcc dot gnu.org> ---
(In reply to rguenther@suse.de from comment #6)

> t.c:47:21: missed:   the size of the group of accesses is not a power of 2 
> or not equal to 3
> t.c:47:21: missed:   not falling back to elementwise accesses
> t.c:58:15: missed:   not vectorized: relevant stmt not supported: _4 = 
> *_3;
> t.c:47:21: missed:  bad operation or unsupported loop bound.
> 
> where we don't consider using gather because we have a known constant
> stride (20).  Since the stores are really scatters we don't attempt
> to SLP either.
> 
> Disabling the above heuristic we get this vectorized as well, avoiding
> gather/scatter by manually implementing them and using a quite high
> VF of 8 (with -mprefer-vector-width=256 you get VF 4 and likely
> faster code in the end).

I suppose you're referring to this?

  /* FIXME: At the moment the cost model seems to underestimate the
     cost of using elementwise accesses.  This check preserves the
     traditional behavior until that can be fixed.  */
  stmt_vec_info first_stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info);
  if (!first_stmt_info)
    first_stmt_info = stmt_info;
  if (*memory_access_type == VMAT_ELEMENTWISE
      && !STMT_VINFO_STRIDED_P (first_stmt_info)
      && !(stmt_info == DR_GROUP_FIRST_ELEMENT (stmt_info)
           && !DR_GROUP_NEXT_ELEMENT (stmt_info)
           && !pow2p_hwi (DR_GROUP_SIZE (stmt_info))))
    {
      if (dump_enabled_p ())
        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                         "not falling back to elementwise accesses\n");
      return false;
    }


I did some more tests on my laptop.  As said above, the whole loop in lbm is
larger and contains two ifs.  The first one prevents clang and GCC from
vectorizing the loop; the second one

                if( TEST_FLAG_SWEEP( srcGrid, ACCEL )) {
                        ux = 0.005;
                        uy = 0.002;
                        uz = 0.000;
                }

seems to be if-converted? by clang or at least doesn't inhibit vectorization.

Now, if I comment out the first, larger if, clang does vectorize the loop.  With
the return false commented out in the above GCC snippet, GCC also vectorizes,
but only when both ifs are commented out.

Results (with both ifs commented out), -march=native (resulting in avx2), best
of 3 as lbm is notoriously fickle:

gcc trunk vanilla: 156.04s
gcc trunk with elementwise: 132.10s
clang 17: 143.06s

Of course, even the comment already said that costing is difficult and the
change will surely cause regressions elsewhere.  However, the 15% improvement
with vectorization (or the 9% improvement of clang) IMHO shows that it's surely
useful to look into this further.  On top of that, riscv clang seems not to care
about the first if either and still vectorizes.  I haven't looked closer at what
happens there, though.


* [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
  2024-01-24 14:21 [Bug tree-optimization/113583] New: Main loop in 519.lbm not vectorized rdapp at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2024-01-26  9:50 ` rdapp at gcc dot gnu.org
@ 2024-01-26 10:21 ` rguenther at suse dot de
  2024-02-05  6:59 ` juzhe.zhong at rivai dot ai
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: rguenther at suse dot de @ 2024-01-26 10:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #10 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 26 Jan 2024, rdapp at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583
> 
> --- Comment #9 from Robin Dapp <rdapp at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #6)
> 
> > t.c:47:21: missed:   the size of the group of accesses is not a power of 2 
> > or not equal to 3
> > t.c:47:21: missed:   not falling back to elementwise accesses
> > t.c:58:15: missed:   not vectorized: relevant stmt not supported: _4 = 
> > *_3;
> > t.c:47:21: missed:  bad operation or unsupported loop bound.
> > 
> > where we don't consider using gather because we have a known constant
> > stride (20).  Since the stores are really scatters we don't attempt
> > to SLP either.
> > 
> > Disabling the above heuristic we get this vectorized as well, avoiding
> > gather/scatter by manually implementing them and using a quite high
> > VF of 8 (with -mprefer-vector-width=256 you get VF 4 and likely
> > faster code in the end).
> 
> I suppose you're referring to this?
> 
>   /* FIXME: At the moment the cost model seems to underestimate the
>      cost of using elementwise accesses.  This check preserves the
>      traditional behavior until that can be fixed.  */
>   stmt_vec_info first_stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info);
>   if (!first_stmt_info)
>     first_stmt_info = stmt_info;
>   if (*memory_access_type == VMAT_ELEMENTWISE
>       && !STMT_VINFO_STRIDED_P (first_stmt_info)
>       && !(stmt_info == DR_GROUP_FIRST_ELEMENT (stmt_info)
>            && !DR_GROUP_NEXT_ELEMENT (stmt_info)
>            && !pow2p_hwi (DR_GROUP_SIZE (stmt_info))))
>     {
>       if (dump_enabled_p ())
>         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>                          "not falling back to elementwise accesses\n");
>       return false;
>     }
> 
> 
> I did some more tests on my laptop.  As said above the whole loop in lbm is
> larger and contains two ifs.  The first one prevents clang and GCC from
> vectorizing the loop, the second one
> 
>                 if( TEST_FLAG_SWEEP( srcGrid, ACCEL )) {
>                         ux = 0.005;
>                         uy = 0.002;
>                         uz = 0.000;
>                 }
> 
> seems to be if-converted? by clang or at least doesn't inhibit vectorization.
> 
> Now if I comment out the first, larger if clang does vectorize the loop.  With
> the return false commented out in the above GCC snippet GCC also vectorizes,
> but only when both ifs are commented out.
> 
> Results (with both ifs commented out), -march=native (resulting in avx2), best
> of 3 as lbm is notoriously fickle:
> 
> gcc trunk vanilla: 156.04s
> gcc trunk with elementwise: 132.10s
> clang 17: 143.06s
> 
> Of course even the comment already said that costing is difficult and the
> change will surely cause regressions elsewhere.  However the 15% improvement
> with vectorization (or the 9% improvement of clang) IMHO show that it's surely
> useful to look into this further.  On top, the riscv clang seems to not care
> about the first if either and still vectorize.  I haven't looked closer what
> happens there, though.

Yes.  I think this shows we should remove the above hack and instead
try to fix the costing next stage1.


* [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
  2024-01-24 14:21 [Bug tree-optimization/113583] New: Main loop in 519.lbm not vectorized rdapp at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2024-01-26 10:21 ` rguenther at suse dot de
@ 2024-02-05  6:59 ` juzhe.zhong at rivai dot ai
  2024-02-07  3:39 ` juzhe.zhong at rivai dot ai
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-02-05  6:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #11 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
Hi, I think this RVV codegen is the optimal codegen we want for RVV:

https://repo.hca.bsc.es/epic/z/P6QXCc

.LBB0_5:                                # %vector.body
        sub     a4, t0, a3
        vsetvli t1, a4, e64, m1, ta, mu
        mul     a2, a3, t2
        add     a5, t3, a2
        vlse64.v        v8, (a5), t2
        add     a4, a6, a2
        vlse64.v        v9, (a4), t2
        add     a4, a0, a2
        vlse64.v        v10, (a4), t2
        vfadd.vv        v8, v8, v9
        vfmul.vf        v8, v8, fa5
        vfadd.vf        v9, v10, fa4
        vfmadd.vf       v9, fa3, v10
        vlse64.v        v10, (a5), t2
        add     a4, a1, a2
        vsse64.v        v9, (a4), t2
        vfadd.vf        v8, v8, fa2
        vfmadd.vf       v8, fa3, v10
        vfadd.vf        v8, v8, fa1
        add     a2, a2, a7
        add     a3, a3, t1
        vsse64.v        v8, (a2), t2
        bne     a3, t0, .LBB0_5


* [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
  2024-01-24 14:21 [Bug tree-optimization/113583] New: Main loop in 519.lbm not vectorized rdapp at gcc dot gnu.org
                   ` (11 preceding siblings ...)
  2024-02-05  6:59 ` juzhe.zhong at rivai dot ai
@ 2024-02-07  3:39 ` juzhe.zhong at rivai dot ai
  2024-02-07  7:48 ` juzhe.zhong at rivai dot ai
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-02-07  3:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #12 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
OK.  I found that even without vectorization,

GCC is worse than Clang:

https://godbolt.org/z/addr54Gc6

GCC (14 instructions inside the loop):

        fld     fa3,0(a0)
        fld     fa5,8(a0)
        fld     fa1,16(a0)
        fsub.d  fa4,ft2,fa3
        addi    a0,a0,160
        fadd.d  fa5,fa5,fa1
        addi    a1,a1,160
        addi    a5,a5,160
        fmadd.d fa4,fa4,fa2,fa3
        fnmsub.d        fa5,fa5,ft1,ft0
        fsd     fa4,-160(a1)
        fld     fa4,-152(a0)
        fadd.d  fa4,fa4,fa0
        fmadd.d fa5,fa5,fa2,fa4
        fsd     fa5,-160(a5)

Clang (12 instructions inside the loop):

        fld     fa1, -8(a0)
        fld     fa0, 0(a0)
        fld     ft0, 8(a0)
        fmadd.d fa1, fa1, fa4, fa5
        fsd     fa1, 0(a1)
        fld     fa1, 0(a0)
        fadd.d  fa0, ft0, fa0
        fmadd.d fa0, fa0, fa2, fa3
        fadd.d  fa1, fa0, fa1
        add     a4, a1, a3
        fsd     fa1, -376(a4)
        addi    a1, a1, 160
        addi    a0, a0, 160

The critical difference is that:

GCC has 

        fsub.d  fa4,ft2,fa3
        fadd.d  fa5,fa5,fa1
        fmadd.d fa4,fa4,fa2,fa3
        fnmsub.d        fa5,fa5,ft1,ft0
        fadd.d  fa4,fa4,fa0
        fmadd.d fa5,fa5,fa2,fa4

6 floating-point operations.

Clang has:

        fmadd.d fa1, fa1, fa4, fa5
        fadd.d  fa0, ft0, fa0
        fmadd.d fa0, fa0, fa2, fa3
        fadd.d  fa1, fa0, fa1

Clang has 4.

Those 2 extra floating-point operations are critical for performance, I think,
since double-precision floating-point operations are usually costly in real
hardware.


* [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
  2024-01-24 14:21 [Bug tree-optimization/113583] New: Main loop in 519.lbm not vectorized rdapp at gcc dot gnu.org
                   ` (12 preceding siblings ...)
  2024-02-07  3:39 ` juzhe.zhong at rivai dot ai
@ 2024-02-07  7:48 ` juzhe.zhong at rivai dot ai
  2024-02-07  8:04 ` rguenther at suse dot de
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-02-07  7:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #13 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
Ok. I found the optimized tree:


  _5 = 3.33333333333333314829616256247390992939472198486328125e-1 - _4;
  _8 = .FMA (_5, 1.229999999999999982236431605997495353221893310546875e-1, _4);

Let CST0 = 3.33333333333333314829616256247390992939472198486328125e-1,
CST1 = 1.229999999999999982236431605997495353221893310546875e-1

The expression is equivalent to the following:

_5 = CST0 - _4;
_8 = _5 * CST1 + _4;

That is:

_8 = (CST0 - _4) * CST1 + _4;

So, we should be able to re-associate it like Clang does:

_8 = CST0 * CST1 - _4 * CST1 + _4; ---> _8 = CST0 * CST1 + _4 * (1 - CST1);

Both CST0 * CST1 and 1 - CST1 can be pre-computed at compile time.

Let's say CST2 = CST0 * CST1 and CST3 = 1 - CST1; then we can re-associate as Clang does:

_8 = FMA (_4, CST3, CST2).
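
A quick standalone sanity check (hypothetical test, not from the bug; it
assumes fast-math-style re-association is acceptable) that the re-associated
form agrees with the original up to rounding:

#include <math.h>
#include <stdio.h>

int
main (void)
{
  const double CST0 = 1.0 / 3.0;
  const double CST1 = 0.123;
  const double CST2 = CST0 * CST1;   /* foldable at compile time */
  const double CST3 = 1.0 - CST1;    /* foldable at compile time */

  for (double x = -2.0; x <= 2.0; x += 0.25)
    {
      double a = (CST0 - x) * CST1 + x;  /* original: sub + mul + add (or FMA) */
      double b = x * CST3 + CST2;        /* re-associated: single FMA */
      if (fabs (a - b) > 1e-15)
        printf ("x=%g diff=%g\n", x, a - b);
    }
  return 0;
}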

Any suggestions for this re-association?  Is match.pd the right place to do
it?

Thanks.


* [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
  2024-01-24 14:21 [Bug tree-optimization/113583] New: Main loop in 519.lbm not vectorized rdapp at gcc dot gnu.org
                   ` (13 preceding siblings ...)
  2024-02-07  7:48 ` juzhe.zhong at rivai dot ai
@ 2024-02-07  8:04 ` rguenther at suse dot de
  2024-02-07  8:08 ` juzhe.zhong at rivai dot ai
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: rguenther at suse dot de @ 2024-02-07  8:04 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #14 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 7 Feb 2024, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583
> 
> --- Comment #13 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
> Ok. I found the optimized tree:
> 
> 
>   _5 = 3.33333333333333314829616256247390992939472198486328125e-1 - _4;
>   _8 = .FMA (_5, 1.229999999999999982236431605997495353221893310546875e-1, _4);
> 
> Let CST0 = 3.33333333333333314829616256247390992939472198486328125e-1,
> CST1 = 1.229999999999999982236431605997495353221893310546875e-1
> 
> The expression is equivalent to the following:
> 
> _5 = CST0 - _4;
> _8 = _5 * CST1 + _4;
> 
> That is:
> 
> _8 = (CST0 - _4) * CST1 + _4;
> 
> So, We should be able to re-associate it like Clang:
> 
> _8 = CST0 * CST1 - _4 * CST1 + _4; ---> _8 = CST0 * CST1 + _4 * (1 - CST1);
> 
> Since both CST0 * CST1 and 1 - CST1 can be pre-computed during compilation
> time.
> 
> Let say CST2 = CST0 * CST1, CST3 = 1 - CST1, then we can re-associate as Clang:
> 
> _8 = FMA (_4, CST3, CST2).
> 
> Any suggestions for this re-association ?  Is match.pd the right place to do it
> ?

You need to look at the IL before we do .FMA forming, specifically 
before/after the late reassoc pass.  The pass applying match.pd
patterns everywhere is forwprop.

I also wonder which compilation flags you are using (note clang
has different defaults, for example for -ftrapping-math).


* [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
  2024-01-24 14:21 [Bug tree-optimization/113583] New: Main loop in 519.lbm not vectorized rdapp at gcc dot gnu.org
                   ` (14 preceding siblings ...)
  2024-02-07  8:04 ` rguenther at suse dot de
@ 2024-02-07  8:08 ` juzhe.zhong at rivai dot ai
  2024-02-07  8:13 ` juzhe.zhong at rivai dot ai
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-02-07  8:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #15 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to rguenther@suse.de from comment #14)
> On Wed, 7 Feb 2024, juzhe.zhong at rivai dot ai wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583
> > 
> > --- Comment #13 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
> > Ok. I found the optimized tree:
> > 
> > 
> >   _5 = 3.33333333333333314829616256247390992939472198486328125e-1 - _4;
> >   _8 = .FMA (_5, 1.229999999999999982236431605997495353221893310546875e-1, _4);
> > 
> > Let CST0 = 3.33333333333333314829616256247390992939472198486328125e-1,
> > CST1 = 1.229999999999999982236431605997495353221893310546875e-1
> > 
> > The expression is equivalent to the following:
> > 
> > _5 = CST0 - _4;
> > _8 = _5 * CST1 + _4;
> > 
> > That is:
> > 
> > _8 = (CST0 - _4) * CST1 + _4;
> > 
> > So, We should be able to re-associate it like Clang:
> > 
> > _8 = CST0 * CST1 - _4 * CST1 + _4; ---> _8 = CST0 * CST1 + _4 * (1 - CST1);
> > 
> > Since both CST0 * CST1 and 1 - CST1 can be pre-computed during compilation
> > time.
> > 
> > Let say CST2 = CST0 * CST1, CST3 = 1 - CST1, then we can re-associate as Clang:
> > 
> > _8 = FMA (_4, CST3, CST2).
> > 
> > Any suggestions for this re-association ?  Is match.pd the right place to do it
> > ?
> 
> You need to look at the IL before we do .FMA forming, specifically 
> before/after the late reassoc pass.  There pass applying match.pd
> patterns everywhere is forwprop.
> 
> I also wonder which compilation flags you are using (note clang
> has different defaults for example for -ftrapping-math)

Both GCC and Clang are using   -Ofast -ffast-math.


* [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
  2024-01-24 14:21 [Bug tree-optimization/113583] New: Main loop in 519.lbm not vectorized rdapp at gcc dot gnu.org
                   ` (15 preceding siblings ...)
  2024-02-07  8:08 ` juzhe.zhong at rivai dot ai
@ 2024-02-07  8:13 ` juzhe.zhong at rivai dot ai
  2024-02-07 10:24 ` rguenther at suse dot de
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-02-07  8:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #16 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
The FMA is generated in widening_mul PASS:

Before widening_mul (fab1):

  _5 = 3.33333333333333314829616256247390992939472198486328125e-1 - _4;
  _6 = _5 * 1.229999999999999982236431605997495353221893310546875e-1;
  _8 = _4 + _6;

After widening_mul:

  _5 = 3.33333333333333314829616256247390992939472198486328125e-1 - _4;
  _8 = .FMA (_5, 1.229999999999999982236431605997495353221893310546875e-1, _4);

I think it's obvious: widening_mul chooses to transform the latter 2 statements:

  _6 = _5 * 1.229999999999999982236431605997495353221893310546875e-1;
  _8 = _4 + _6;

into:

 _8 = .FMA (_5, 1.229999999999999982236431605997495353221893310546875e-1, _4);

without any re-association.


* [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
  2024-01-24 14:21 [Bug tree-optimization/113583] New: Main loop in 519.lbm not vectorized rdapp at gcc dot gnu.org
                   ` (16 preceding siblings ...)
  2024-02-07  8:13 ` juzhe.zhong at rivai dot ai
@ 2024-02-07 10:24 ` rguenther at suse dot de
  2024-05-13 14:17 ` rdapp at gcc dot gnu.org
  2024-05-16 12:41 ` rguenth at gcc dot gnu.org
  19 siblings, 0 replies; 21+ messages in thread
From: rguenther at suse dot de @ 2024-02-07 10:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #17 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 7 Feb 2024, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583
> 
> --- Comment #16 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
> The FMA is generated in widening_mul PASS:
> 
> Before widening_mul (fab1):
> 
>   _5 = 3.33333333333333314829616256247390992939472198486328125e-1 - _4;
>   _6 = _5 * 1.229999999999999982236431605997495353221893310546875e-1;
>   _8 = _4 + _6;

So this is x + (CST1 - x) * CST2 which we might fold/associate to
x * (1. - CST2) + CST1 * CST2

This looks like something for reassociation (it knows some rules,
like what it does in undistribute_ops_list; I'm not sure if that
comes into play here already, and this would be doing the reverse).
A match.pd pattern would also work, but it wouldn't be
general enough to handle more complicated but similar cases.
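
For reference, a reduced C test of the shape in question (hypothetical; just
something to inspect in the reassoc/forwprop/widening_mul dumps).  With
re-association and constant folding the goal would be a single FMA with
pre-folded constants:

/* x + (CST0 - x) * CST1, the shape from the IL above.  */
double
reassoc_fma (double x)
{
  return x + (1.0 / 3.0 - x) * 0.123;
}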


* [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
  2024-01-24 14:21 [Bug tree-optimization/113583] New: Main loop in 519.lbm not vectorized rdapp at gcc dot gnu.org
                   ` (17 preceding siblings ...)
  2024-02-07 10:24 ` rguenther at suse dot de
@ 2024-05-13 14:17 ` rdapp at gcc dot gnu.org
  2024-05-16 12:41 ` rguenth at gcc dot gnu.org
  19 siblings, 0 replies; 21+ messages in thread
From: rdapp at gcc dot gnu.org @ 2024-05-13 14:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #18 from Robin Dapp <rdapp at gcc dot gnu.org> ---
A bit of a follow-up:  I'm working on a patch for reassociation that can handle
the mentioned cases and some more, but it will still require a bit of time to
get everything regression-free and correct.  What it does is allow reassoc to
look through constant multiplications and negates to provide more freedom in
the optimization process.

Regarding the mentioned element-wise costing, how should we proceed here?  I'm
going to remove the hunk in question, run SPEC2017 on x86, and post a patch in
order to get some data and a basis for discussion.


* [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
  2024-01-24 14:21 [Bug tree-optimization/113583] New: Main loop in 519.lbm not vectorized rdapp at gcc dot gnu.org
                   ` (18 preceding siblings ...)
  2024-05-13 14:17 ` rdapp at gcc dot gnu.org
@ 2024-05-16 12:41 ` rguenth at gcc dot gnu.org
  19 siblings, 0 replies; 21+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-05-16 12:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #19 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Robin Dapp from comment #18)
[...]
> Regarding the mentioned element-wise costing how should we proceed here? 
> I'm going to remove the hunk in question, run SPEC2017 on x86 and post a
> patch in order to get some data and basis for discussion.

Yeah, I think this hunk was put in as a stopgap solution.
