From: "rdapp at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/113583] New: Main loop in 519.lbm not vectorized.
Date: Wed, 24 Jan 2024 14:21:14 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

            Bug ID: 113583
           Summary: Main loop in 519.lbm not vectorized.
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rdapp at gcc dot gnu.org
  Target Milestone: ---
            Target: x86_64-*-* riscv*-*-*

This might be a known issue but a bugzilla search regarding lbm didn't show
anything related.

The main loop in SPEC2017 519.lbm GCC riscv does not vectorize while clang
does.  For x86 neither clang nor GCC seem to vectorize it.

A (not entirely minimal but let's start somewhere) example is the following.
This one is, however, vectorized by clang-17 x86 and not by GCC trunk x86 or
other targets I checked.

#define CST1 (1.0 / 3.0)

typedef enum
{
  C = 0,
  N, S, E, W, T, B, NW, NE,
  A, BB, CC, D, EE, FF, GG, HH, II, JJ,
  FLAGS,
  NN
} CELL_ENTRIES;

#define SX 100
#define SY 100
#define SZ 130

#define CALC_INDEX(x, y, z, e) ((e) + NN * ((x) + (y) * SX + (z) * SX * SY))

#define GRID_ENTRY_SWEEP(g, dx, dy, dz, e) \
  ((g)[CALC_INDEX (dx, dy, dz, e) + (i)])

#define LOCAL(g, e) (GRID_ENTRY_SWEEP (g, 0, 0, 0, e))
#define NEIGHBOR_C(g, e) (GRID_ENTRY_SWEEP (g, 0, 0, 0, e))
#define NEIGHBOR_S(g, e) (GRID_ENTRY_SWEEP (g, 0, -1, 0, e))
#define NEIGHBOR_N(g, e) (GRID_ENTRY_SWEEP (g, 0, +1, 0, e))
#define NEIGHBOR_E(g, e) (GRID_ENTRY_SWEEP (g, +1, 0, 0, e))

#define SRC_C(g) (LOCAL (g, C))
#define SRC_N(g) (LOCAL (g, N))
#define SRC_S(g) (LOCAL (g, S))
#define SRC_E(g) (LOCAL (g, E))
#define SRC_W(g) (LOCAL (g, W))

#define DST_C(g) (NEIGHBOR_C (g, C))
#define DST_N(g) (NEIGHBOR_N (g, N))
#define DST_S(g) (NEIGHBOR_S (g, S))
#define DST_E(g) (NEIGHBOR_E (g, E))

typedef double arr[SX * SY * SZ * NN];

#define OMEGA 0.123

void
foo (arr src, arr dst)
{
  double ux, uy, u2;
  const double lambda0 = 1.0 / (0.5 + 3.0 / (16.0 * (1.0 / OMEGA - 0.5)));
  double fs[NN], fa[NN], feqs[NN], feqa[NN];

  for (int i = 0; i < SX * SY * SZ * NN; i += NN)
    {
      ux = 1.0;
      uy = 1.0;

      feqs[C] = CST1 * (1.0);
      feqs[N] = feqs[S] = CST1 * (1.0 + 4.5 * (+uy) * (+uy));

      feqa[C] = 0.0;
      feqa[N] = 0.2;

      fs[C] = SRC_C (src);
      fs[N] = fs[S] = 0.5 * (SRC_N (src) + SRC_S (src));

      fa[C] = 0.0;
      fa[N] = 0.1;

      DST_C (dst) = SRC_C (src) - OMEGA * (fs[C] - feqs[C]);
      DST_N (dst) = SRC_N (src) - OMEGA * (fs[N] - feqs[N])
		    - lambda0 * (fa[N] - feqa[N]);
    }
}

missed.c:19:2: note:   ==> examining statement: _4 = *_3;
missed.c:19:2: missed:   no array mode for V8DF[20]
missed.c:19:2: missed:   no array mode for V8DF[20]
missed.c:19:2: missed:   the size of the group of accesses is not a power of 2 or not equal to 3
missed.c:19:2: missed:   not falling back to elementwise accesses
missed.c:43:11: missed:   not vectorized: relevant stmt not supported: _4 = *_3;

Also refer to https://godbolt.org/z/P517qc3Yf for riscv and
https://godbolt.org/z/M134KvEEo for aarch64.  For aarch64 it seems clang would
vectorize the snippet but does not consider it profitable to do so.

For riscv and the full lbm workload I roughly see one third the number of
dynamically executed qemu instructions with the clang build vs GCC build,
340 billion vs 1200 billion.