public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug target/106265] New: RISC-V SPEC2017 507.cactu code bloat due to address generation
@ 2022-07-11 23:13 vineet.gupta at linux dot dev
  2022-07-11 23:23 ` [Bug target/106265] " vineet.gupta at linux dot dev
                   ` (11 more replies)
  0 siblings, 12 replies; 14+ messages in thread
From: vineet.gupta at linux dot dev @ 2022-07-11 23:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265

            Bug ID: 106265
           Summary: RISC-V SPEC2017 507.cactu code bloat due to address
                    generation
           Product: gcc
           Version: 12.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vineet.gupta at linux dot dev
  Target Milestone: ---

SPEC2017 FP benchmark 507.cactu: ML_BSSN_RHS.cc:ML_BSSN_RHS_Body() has terrible
codegen. A recurring pattern which shows up in hotspot analysis is

  380a1a:       lui     a6,0x1     # LI 4096 before linker relaxation
  380a1c:       addi    t2,sp,32
  380a20:       addi    a6,a6,-1192 # b58
  380a24:       add     a6,a6,t2
  380a26:       sd      t4,0(a6)

The first 4 instructions help calculate the destination of the last SD
instruction. There were 27 such distinct instances in the hottest top block.

Part of this is the ISA not having a single instruction to do set a reg with
32-bit const and/or the limited addressing modes. However the compiler is not
helping either. All those 27 instances have the first instruction to set
register with 4096, sometimes with the register still being live with the exact
same value.

Using creduce I was able to create a small'ish (not ideal small) test case
which shows 14 instances of li <reg>,4096

Built as riscv64-unknown-linux-gnu-g++ -O3 -static -ffast-math -funroll-loops
-march=rv64gc_zba_zbb_zbc_zbs -c  -std=c++03 

This is with trunk as of June 14 (commit 6abe341558ab)

--->8-----
void *a();
int b, c, d, l, o, q;
double *e, *f, *p, *r, *s = 0;
long g, h, k;
double m, n, aa;
void y() {
  double ag, ai, aj, ap, ar, at, aw(01.0 / 0);
  double *am((double *)a);
  double *an;
  double *ao((double *)a());
  long av = sizeof(double) * g;
  for (; l; ++l)
    for (int j = d; j < q; ++j)
      for (int i = 1; i < o; i++) {
        long az = i * j + l;
        double ba, bb, bc, bd, be, bf, bg, bh, bi, bj, bk, bl, bm, bn, bo, bp,
            bq, br, bs, bv(-ar);
        switch (b)
        case 4: {
          *(double *)((char *)0)[2] = ((char *)0)[2];
          ba = ((char *)0)[k + av] + ((char *)0)[2] + ((char *)0)[k] +
               *(double *)((char *)0)[av] +
               ((char *)0)[av * 2] * ((char *)0)[k * av];
          bb = (&am[az])[h] + ((char *)0)[k * h] +
               ((char *)&am[az])[1] *
                   (((char *)&am[az])[h] + *(&am)[h] + (&am[az])[k]) +
               (&am[az])[2] - ((char *)0)[h] + (&am[az])[k * av];
          bc = 4 * 8 + *(double *)&((char *)0)[2];
          bd = 4 * ((char *)0)[k + 1] +
               (&an[az])[k * h] * (((char *)0)[-1] + (&an[az])[k + h]) +
               8 * ((&an[az])[k * 2] + (&an[az])[av + h * -2] + (&an[az])[av] +
                    (&an[az])[av * 2 + h]) -
               8 * ((&an[az])[av * 2] + (&an[az])[k] + (&an[az])[av * 21] +
                    (&an[az])[av + h]) +
               *(&an)[h] - (&an[az])[av * 22] * (&an[az])[h] +
               (&an[az])[av * 2 * 2];
          bf = (&az)[0] * (((char *)&ao[az])[0] + ao[av] + (&ao[az])[av * 2] +
                           *(double *)((char *)ao)[k]);
          bg = (&ao[az])[0] * *(&ao)[av] + *(&ao)[2];
          bh = *((char **)0)[av] * (&ao[az])[av - h] +
               ((char *)0)[1] * *(double *)((char *)0)[1] *
                   (*(double *)((char *)&ao)[k] + *(double *)((char *)0)[12] +
                    ((char *)&ao[az])[av]) +
               *(&ao)[av * 2] - *(&ao)[k] - (&ao[az])[2] +
               (&ao[az])[k * av * 2];
          bi = *(&ao)[av * h] *
                   (ao[h] + ((char *)0)[k * 2] + (&ao[az])[k * av] +
                    ((char *)0)[21]) *
                   (((char *)0)[h] + *(double *)((char *)0)[h * 2] +
                    *(double *)((char *)&ao[az])[k] + *(&ao)[h]) +
               *(&ao)[av * 22] - (&ao[az])[2 * h * 2] + ((char *)0)[22];
          bj = 4 * ((&ao[az])[av] + (&ao[az])[k * av * h] * (&ao[az])[av * -1]
+
                    (&ao[az])[av * h]) +
               ((&ao[az])[av * 12] + (&ao[az])[k] + ((char *)&ao)[h] +
                (&ao[az])[av * 2 + h * 0]) *
                   (((char *)0)[2] + *(double *)((char *)&ao[az])[av * 2] +
                    (&ao[az])[h] + (&ao[az])[av + h]) +
               (&ao[az])[av * 22] - ((char *)&ao[az])[h] - (&ao[az])[av * -2] +
               (&ao[az])[av + h * 2];
          bk = 8 * (&ap)[1];
          bm = ((char *)0)[1] * (&e[az])[2];
          bn = -(&az)[0];
          bo = (&e[az])[h] - (&e[az])[h * 2] * aw;
          bq = (&at)[h];
          br = 8 * 8 * ((&az)[1] + (&f[az])[-2] - (&f[az])[2]);
          bs = (&f[az])[h * -2] - (&f[az])[h * 2];
          n = ((char *)&s)[h * 2];
        }
          double bt(e[az] * f[az] - at * at);
        double bu(p[az] * at * az);
        double bw(f[az] - p[az]);
        double bx = 01.0 / 0;
        double by = 01.0 / 0 * 0;
        double ca;
        double cb = -bl;
        double cd;
        double cf = 0.5 * bn;
        double ch;
        double cj = bp - 0.5 * bo;
        double cm = ch * bv;
        double cn = bk * cd * ch;
        double co = bk + cd * by;
        double cq = bm + bx;
        double ct = ca + bx;
        double cu = cb * bt + cf * bu + cj * bv;
        double cz = bt + bq * bv;
        double db = br * bq * by;
        double dc = cm * cu * bw + cz * by;
        double dd = cn * bt + cq * ct + bs * by;
        double df(c == 1 ? s[az] : 0);
        double dg = df * n;
        double dj = bu + ai * bv;
        double dk = ag * ai * bu + aj * bv;
        double dl = bw + aj * bx;
        double dm;
        double dn = bu * bv;
        double t = bu + dj;
        double u = dk * by;
        double v = ag * bv + dl * by;
        double w = bx + dm;
        double x = bf + 0.333333333333333333333333333333 * bc + bi + ba + bj +
                   by * (bb + bd +
                         bg * bh *
                             (6 * w * dg + dn * co * t * bv + u * aa * cj * v +
                              w * db)) -
                   m * dc * dd + (be + 0.666666666666666666666666666667) * co;
        r[az] = x;
      }
}
--->8-----

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Bug target/106265] RISC-V SPEC2017 507.cactu code bloat due to address generation
  2022-07-11 23:13 [Bug target/106265] New: RISC-V SPEC2017 507.cactu code bloat due to address generation vineet.gupta at linux dot dev
@ 2022-07-11 23:23 ` vineet.gupta at linux dot dev
  2022-07-11 23:25 ` vineet.gupta at linux dot dev
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 14+ messages in thread
From: vineet.gupta at linux dot dev @ 2022-07-11 23:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265

--- Comment #1 from Vineet Gupta <vineet.gupta at linux dot dev> ---
Analyzed a section of -dP dump where reg a2 is setup with exact same value
while being live.

rhs-cred.cc:42: (*(double *)((char *)&ao)[k] + *(double *)((char *)0)[12] +
#(insn 2662 1711 76 (set (reg:DI 12 a2 [2816])
#        (const_int 4096 [0x1000])) "rhs-cred.cc":42:47 139 {*movdi_64bit}
#     (nil))
        li      a2,4096         # tmp2816,      # 2662  [c=4 l=4] 
*movdi_64bit/1


rhs-cred.cc:39: bg = (&ao[az])[0] * *(&ao)[av] + *(&ao)[2];
#(insn 76 2662 2663 (set (reg/f:DI 20 s4 [orig:193 _239 ] [193])
#        (plus:DI (reg:DI 13 a3 [1865])
#            (reg:DI 20 s4 [orig:181 _224 ] [181]))) "rhs-cred.cc":39:40 4
{adddi3}
#     (expr_list:REG_DEAD (reg:DI 13 a3 [1865])
#        (nil)))
        add     s4,a3,s4        # _224, _239, tmp1865   # 76    [c=4 l=4] 
adddi3/0


rhs-cred.cc:42: (*(double *)((char *)&ao)[k] + *(double *)((char *)0)[12] +
#(insn 2663 76 2660 (set (reg:DI 13 a3 [2815])
#        (plus:DI (reg:DI 12 a2 [2816])
#            (const_int -2024 [0xfffffffffffff818]))) "rhs-cred.cc":42:47 4
{adddi3}
#     (expr_list:REG_DEAD (reg:DI 12 a2 [2816])
#        (expr_list:REG_EQUAL (const_int 2072 [0x818])
#            (nil))))
        addi    a3,a2,-2024     #, tmp2815, tmp2816     # 2663  [c=4 l=4] 
adddi3/1

rhs-cred.cc:44: *(&ao)[av * 2] - *(&ao)[k] - (&ao[az])[2] +
#(insn 2660 2663 1710 (set (reg:DI 12 a2 [2814])
#        (const_int 4096 [0x1000])) "rhs-cred.cc":44:29 139 {*movdi_64bit}
#     (nil))
        li      a2,4096         # tmp2814,      # 2660  [c=4 l=4] 
*movdi_64bit/1

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Bug target/106265] RISC-V SPEC2017 507.cactu code bloat due to address generation
  2022-07-11 23:13 [Bug target/106265] New: RISC-V SPEC2017 507.cactu code bloat due to address generation vineet.gupta at linux dot dev
  2022-07-11 23:23 ` [Bug target/106265] " vineet.gupta at linux dot dev
@ 2022-07-11 23:25 ` vineet.gupta at linux dot dev
  2022-07-11 23:27 ` vineet.gupta at linux dot dev
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 14+ messages in thread
From: vineet.gupta at linux dot dev @ 2022-07-11 23:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265

--- Comment #2 from Vineet Gupta <vineet.gupta at linux dot dev> ---
I've experimented with riscv_rtx_costs() setting cost of const to 1 as
discussed in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98596. This does
reduce the number of li 4096 instances to 10 (from 14), but the fragment pasted
above generates the same code.

# rhs-cred.cc:42: (*(double *)((char *)&ao)[k] + *(double *)((char *)0)[12] +
#(insn 2660 1703 76 (set (reg:DI 28 t3 [2817])
#        (const_int 4096 [0x1000])) "rhs-cred.cc":42:47 139 {*movdi_64bit}
#     (nil))
        li      t3,4096         # tmp2817,      # 2660  [c=1 l=4] 
*movdi_64bit/1


# rhs-cred.cc:39: bg = (&ao[az])[0] * *(&ao)[av] + *(&ao)[2];
#(insn 76 2660 2661 (set (reg/f:DI 16 a6 [orig:193 _239 ] [193])
#        (plus:DI (reg:DI 12 a2 [1860])
#            (reg:DI 16 a6 [orig:181 _224 ] [181]))) "rhs-cred.cc":39:40 4
{adddi3}
#     (expr_list:REG_DEAD (reg:DI 12 a2 [1860])
#        (nil)))
        add     a6,a2,a6        # _224, _239, tmp1860   # 76    [c=4 l=4] 
adddi3/0


# rhs-cred.cc:42: (*(double *)((char *)&ao)[k] + *(double *)((char *)0)[12] +
#(insn 2661 76 2658 (set (reg:DI 12 a2 [2816])
#        (plus:DI (reg:DI 28 t3 [2817])
#            (const_int -2040 [0xfffffffffffff808]))) "rhs-cred.cc":42:47 4
{adddi3}
#     (expr_list:REG_DEAD (reg:DI 28 t3 [2817])
#        (expr_list:REG_EQUAL (const_int 2056 [0x808])
#            (nil))))
        addi    a2,t3,-2040     #, tmp2816, tmp2817     # 2661  [c=4 l=4] 
adddi3/1

# rhs-cred.cc:44: *(&ao)[av * 2] - *(&ao)[k] - (&ao[az])[2] +
#(insn 2658 2661 1702 (set (reg:DI 28 t3 [2815])
#        (const_int 4096 [0x1000])) "rhs-cred.cc":44:29 139 {*movdi_64bit}
#     (nil))

        li      t3,4096         # tmp2815,      # 2658  [c=1 l=4] 
*movdi_64bit/1

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Bug target/106265] RISC-V SPEC2017 507.cactu code bloat due to address generation
  2022-07-11 23:13 [Bug target/106265] New: RISC-V SPEC2017 507.cactu code bloat due to address generation vineet.gupta at linux dot dev
  2022-07-11 23:23 ` [Bug target/106265] " vineet.gupta at linux dot dev
  2022-07-11 23:25 ` vineet.gupta at linux dot dev
@ 2022-07-11 23:27 ` vineet.gupta at linux dot dev
  2022-07-11 23:37 ` vineet.gupta at linux dot dev
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 14+ messages in thread
From: vineet.gupta at linux dot dev @ 2022-07-11 23:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265

--- Comment #3 from Vineet Gupta <vineet.gupta at linux dot dev> ---
Digging into RTL dumps, the li instructions are introduced by 300r reload.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Bug target/106265] RISC-V SPEC2017 507.cactu code bloat due to address generation
  2022-07-11 23:13 [Bug target/106265] New: RISC-V SPEC2017 507.cactu code bloat due to address generation vineet.gupta at linux dot dev
                   ` (2 preceding siblings ...)
  2022-07-11 23:27 ` vineet.gupta at linux dot dev
@ 2022-07-11 23:37 ` vineet.gupta at linux dot dev
  2022-07-12  6:44 ` rguenth at gcc dot gnu.org
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 14+ messages in thread
From: vineet.gupta at linux dot dev @ 2022-07-11 23:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265

--- Comment #4 from Vineet Gupta <vineet.gupta at linux dot dev> ---
Going back to first dump (upstream 6abe341558a w/o riscv_rtx_costs() adj): the
3rd instruction addi is marking a2 REG_DEAD at 315 cprop.hardreg

--->8---- 314r.rnreg

(insn 2663 2662 1714 3 (set (reg:DI 13 a3 [2815])
        (plus:DI (reg:DI 12 a2 [2816])
            (const_int -2024 [0xfffffffffffff818]))) "rhs-cred.cc":42:47 4
{adddi3}
     (expr_list:REG_EQUAL (const_int 2072 [0x818])
        (nil)))


--->8---- 315r.cprop_hardreg

(insn 2663 2662 1714 3 (set (reg:DI 13 a3 [2815])
        (plus:DI (reg:DI 12 a2 [2816])
            (const_int -2024 [0xfffffffffffff818]))) "rhs-cred.cc":42:47 4
{adddi3}
     (expr_list:REG_DEAD (reg:DI 12 a2 [2816])
        (expr_list:REG_EQUAL (const_int 2072 [0x818])
            (nil))))

Not sure if that is the issue or the compiler just deducing that a2 is dead
given that the next instruction sets it up again was introduced much earlier in
300r.reload

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Bug target/106265] RISC-V SPEC2017 507.cactu code bloat due to address generation
  2022-07-11 23:13 [Bug target/106265] New: RISC-V SPEC2017 507.cactu code bloat due to address generation vineet.gupta at linux dot dev
                   ` (3 preceding siblings ...)
  2022-07-11 23:37 ` vineet.gupta at linux dot dev
@ 2022-07-12  6:44 ` rguenth at gcc dot gnu.org
  2022-07-12  7:51   ` Andrew Waterman
  2022-07-12  7:52 ` andrew at sifive dot com
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 14+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-07-12  6:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
So why do we even emit unsupported 'li 4096' and leave it to the linker to
"optimize(?)"?  At least the cost of this should be reflected - IIRC powerpc
recently got improvements for similar cases by changing the targets rtx_cost
hook to properly const SET from CONST_INT so that CSE doesn't leave so many
sets from constants around.

OTOH LRA rematerialization also could be the culprit, thinking rematerializing
the constant is cheaper than spilling a register holding it.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Bug target/106265] RISC-V SPEC2017 507.cactu code bloat due to address generation
  2022-07-12  6:44 ` rguenth at gcc dot gnu.org
@ 2022-07-12  7:51   ` Andrew Waterman
  0 siblings, 0 replies; 14+ messages in thread
From: Andrew Waterman @ 2022-07-12  7:51 UTC (permalink / raw)
  To: rguenth at gcc dot gnu.org; +Cc: gcc-bugs

To be clear, `li rx, 4096' isn't unsupported: it's a
very-much-supported idiom for `lui rx, 1`.


On Mon, Jul 11, 2022 at 11:45 PM rguenth at gcc dot gnu.org via
Gcc-bugs <gcc-bugs@gcc.gnu.org> wrote:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265
>
> --- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
> So why do we even emit unsupported 'li 4096' and leave it to the linker to
> "optimize(?)"?  At least the cost of this should be reflected - IIRC powerpc
> recently got improvements for similar cases by changing the targets rtx_cost
> hook to properly const SET from CONST_INT so that CSE doesn't leave so many
> sets from constants around.
>
> OTOH LRA rematerialization also could be the culprit, thinking rematerializing
> the constant is cheaper than spilling a register holding it.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Bug target/106265] RISC-V SPEC2017 507.cactu code bloat due to address generation
  2022-07-11 23:13 [Bug target/106265] New: RISC-V SPEC2017 507.cactu code bloat due to address generation vineet.gupta at linux dot dev
                   ` (4 preceding siblings ...)
  2022-07-12  6:44 ` rguenth at gcc dot gnu.org
@ 2022-07-12  7:52 ` andrew at sifive dot com
  2022-07-12 20:18 ` vineet.gupta at linux dot dev
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 14+ messages in thread
From: andrew at sifive dot com @ 2022-07-12  7:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265

--- Comment #6 from Andrew Waterman <andrew at sifive dot com> ---
To be clear, `li rx, 4096' isn't unsupported: it's a
very-much-supported idiom for `lui rx, 1`.


On Mon, Jul 11, 2022 at 11:45 PM rguenth at gcc dot gnu.org via
Gcc-bugs <gcc-bugs@gcc.gnu.org> wrote:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265
>
> --- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
> So why do we even emit unsupported 'li 4096' and leave it to the linker to
> "optimize(?)"?  At least the cost of this should be reflected - IIRC powerpc
> recently got improvements for similar cases by changing the targets rtx_cost
> hook to properly const SET from CONST_INT so that CSE doesn't leave so many
> sets from constants around.
>
> OTOH LRA rematerialization also could be the culprit, thinking rematerializing
> the constant is cheaper than spilling a register holding it.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Bug target/106265] RISC-V SPEC2017 507.cactu code bloat due to address generation
  2022-07-11 23:13 [Bug target/106265] New: RISC-V SPEC2017 507.cactu code bloat due to address generation vineet.gupta at linux dot dev
                   ` (5 preceding siblings ...)
  2022-07-12  7:52 ` andrew at sifive dot com
@ 2022-07-12 20:18 ` vineet.gupta at linux dot dev
  2022-07-21 21:09 ` vineet.gupta at linux dot dev
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 14+ messages in thread
From: vineet.gupta at linux dot dev @ 2022-07-12 20:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265

--- Comment #7 from Vineet Gupta <vineet.gupta at linux dot dev> ---
(In reply to Richard Biener from comment #5)
> So why do we even emit unsupported 'li 4096' and leave it to the linker to
> "optimize(?)"? 

li 4096 is really a pseudo-op - LUI is used to build 32-bit constants. For this
problem whether LI or LUI is used is just a detail.

> OTOH LRA rematerialization also could be the culprit, thinking rematerializing
> the constant is cheaper than spilling a register holding it.

Thx for the pointer. I tried disabling it -fno-lra-remat and it generates
exactly same code.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Bug target/106265] RISC-V SPEC2017 507.cactu code bloat due to address generation
  2022-07-11 23:13 [Bug target/106265] New: RISC-V SPEC2017 507.cactu code bloat due to address generation vineet.gupta at linux dot dev
                   ` (6 preceding siblings ...)
  2022-07-12 20:18 ` vineet.gupta at linux dot dev
@ 2022-07-21 21:09 ` vineet.gupta at linux dot dev
  2022-07-21 21:21 ` vineet.gupta at linux dot dev
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 14+ messages in thread
From: vineet.gupta at linux dot dev @ 2022-07-21 21:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265

--- Comment #8 from Vineet Gupta <vineet.gupta at linux dot dev> ---
Created attachment 53332
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53332&action=edit
Full reload output

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Bug target/106265] RISC-V SPEC2017 507.cactu code bloat due to address generation
  2022-07-11 23:13 [Bug target/106265] New: RISC-V SPEC2017 507.cactu code bloat due to address generation vineet.gupta at linux dot dev
                   ` (7 preceding siblings ...)
  2022-07-21 21:09 ` vineet.gupta at linux dot dev
@ 2022-07-21 21:21 ` vineet.gupta at linux dot dev
  2022-07-22 21:39 ` vineet.gupta at linux dot dev
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 14+ messages in thread
From: vineet.gupta at linux dot dev @ 2022-07-21 21:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265

--- Comment #9 from Vineet Gupta <vineet.gupta at linux dot dev> ---
The redundant Insn 2660 is reload inserted for Insn 1717

 1717: r1871:DI=frame:DI+r2813:DI
    Inserting insn reload before:
 2660: r2814:DI=0x1000
 2661: r2813:DI=r2814:DI-0x7e8
      REG_EQUAL 0x818


Insn 1717 in turn is reload inserted for Insn 82

   82: r1870:DI=r1871:DI+r222:DI
      REG_DEAD r222:DI
    Inserting insn reload before:
 1717: r1871:DI=frame:DI+0x28
    Inserting insn reload after:
 1716: r223:DI=r1870:DI


Insn 82 comes from 253r.expand

(insn 80 79 81 5 (set (reg:DI 222 [ _282 ])
        (ashift:DI (reg:DI 74 [ g.1_3 ])
            (const_int 7 [0x7]))) "rhs-cred.cc":44:29 -1
     (nil))
(insn 81 80 82 5 (set (reg:DI 884)
        (plus:DI (reg/f:DI 67 virtual-stack-vars)
            (const_int -8 [0xfffffffffffffff8]))) "rhs-cred.cc":44:29 -1
     (nil))
(insn 82 81 83 5 (set (reg/f:DI 223 [ _283 ])
        (plus:DI (reg:DI 884)
            (reg:DI 222 [ _282 ]))) "rhs-cred.cc":44:29 -1
     (nil))


Stepping thru lra-constraints.cc, curr_insn_transform () is called twice for
Insn 1717 and in the 2nd call (lra_constraints_iter == 4), operand is required
to have a reload (goal_alt_win[2]==false) which leads to lra_emit_move () ->
riscv_move_integer ()

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Bug target/106265] RISC-V SPEC2017 507.cactu code bloat due to address generation
  2022-07-11 23:13 [Bug target/106265] New: RISC-V SPEC2017 507.cactu code bloat due to address generation vineet.gupta at linux dot dev
                   ` (8 preceding siblings ...)
  2022-07-21 21:21 ` vineet.gupta at linux dot dev
@ 2022-07-22 21:39 ` vineet.gupta at linux dot dev
  2023-08-04 22:30 ` vineetg at gcc dot gnu.org
  2024-05-28 22:40 ` vineetg at gcc dot gnu.org
  11 siblings, 0 replies; 14+ messages in thread
From: vineet.gupta at linux dot dev @ 2022-07-22 21:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265

--- Comment #10 from Vineet Gupta <vineet.gupta at linux dot dev> ---
Created a small test case which emulates generation of 2 split consts. 

void foo(void)
{
        bar(2072, 2096);
}

253r.expand has 4 instructions: Pair of LI 4096 + ADDI for each const.
260r.fwprop1 prunes out the redundant insn 7. However it runs before
300r.reload. The original issue is reload inserting the insns so no way to
prune these out.

---253r.expand----
(insn 5 2 6 2 (set (reg:DI 72)
        (const_int 4096 [0x1000])) "const-load.c":5:2 -1
     (nil))
(insn 6 5 7 2 (set (reg:DI 11 a1)
        (plus:DI (reg:DI 72)
            (const_int -2000 [0xfffffffffffff830]))) "const-load.c":5:2 -1
     (expr_list:REG_EQUAL (const_int 2096 [0x830])
        (nil)))
(insn 7 6 8 2 (set (reg:DI 73)
        (const_int 4096 [0x1000])) "const-load.c":5:2 -1
     (nil))
(insn 8 7 9 2 (set (reg:DI 10 a0)
        (plus:DI (reg:DI 73)
            (const_int -2024 [0xfffffffffffff818]))) "const-load.c":5:2 -1
     (expr_list:REG_EQUAL (const_int 2072 [0x818])
        (nil)))

---300r.reload----
(insn 5 2 6 2 (set (reg:DI 10 a0 [72])
        (const_int 4096 [0x1000])) "const-load.c":5:2 139 {*movdi_64bit}
     (expr_list:REG_EQUIV (const_int 4096 [0x1000])
        (nil)))
(insn 6 5 8 2 (set (reg:DI 11 a1)
        (plus:DI (reg:DI 10 a0 [72])
            (const_int -2000 [0xfffffffffffff830]))) "const-load.c":5:2 4
{adddi3}
     (expr_list:REG_EQUAL (const_int 2096 [0x830])
        (nil)))
(insn 8 6 9 2 (set (reg:DI 10 a0)
        (plus:DI (reg:DI 10 a0 [72])
            (const_int -2024 [0xfffffffffffff818]))) "const-load.c":5:2 4
{adddi3}
     (expr_list:REG_EQUAL (const_int 2072 [0x818])
        (nil)))

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Bug target/106265] RISC-V SPEC2017 507.cactu code bloat due to address generation
  2022-07-11 23:13 [Bug target/106265] New: RISC-V SPEC2017 507.cactu code bloat due to address generation vineet.gupta at linux dot dev
                   ` (9 preceding siblings ...)
  2022-07-22 21:39 ` vineet.gupta at linux dot dev
@ 2023-08-04 22:30 ` vineetg at gcc dot gnu.org
  2024-05-28 22:40 ` vineetg at gcc dot gnu.org
  11 siblings, 0 replies; 14+ messages in thread
From: vineetg at gcc dot gnu.org @ 2023-08-04 22:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265

Vineet Gupta <vineetg at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |vineetg at gcc dot gnu.org

--- Comment #11 from Vineet Gupta <vineetg at gcc dot gnu.org> ---
Revisited this with gcc-13.

The reduced test case no longer shows the extraneous LI 4096 (although the full
test still does). 

The key here is -funroll-loops which is needed for original issue to show as
well.

The was with middle-end update:

commit 19295e8607da2f743368fe6f5708146616aafa91
Author: Richard Biener <rguenther@suse.de>
Date:   Mon Oct 24 09:51:32 2022 +0200

    tree-optimization/100756 - niter analysis and folding

    niter analysis, specifically the part trying to simplify the computed
    maybe_zero condition against the loop header copying condition, is
    confused by us now simplifying

      _15 = n_8(D) * 4;
      if (_15 > 0)

    to

      _15 = n_8(D) * 4;
      if (n_8(D) > 0)

    which is perfectly sound at the point we do this transform.  One
    solution might be to involve ranger in this simplification, another
    is to be more aggressive when expanding expressions - the condition
    we try to simplify is _15 > 0, so all we need is expanding that
    to n_8(D) * 4 > 0.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Bug target/106265] RISC-V SPEC2017 507.cactu code bloat due to address generation
  2022-07-11 23:13 [Bug target/106265] New: RISC-V SPEC2017 507.cactu code bloat due to address generation vineet.gupta at linux dot dev
                   ` (10 preceding siblings ...)
  2023-08-04 22:30 ` vineetg at gcc dot gnu.org
@ 2024-05-28 22:40 ` vineetg at gcc dot gnu.org
  11 siblings, 0 replies; 14+ messages in thread
From: vineetg at gcc dot gnu.org @ 2024-05-28 22:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265

Vineet Gupta <vineetg at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |FIXED

--- Comment #12 from Vineet Gupta <vineetg at gcc dot gnu.org> ---
Two years hence and we are a little wiser.

The root-cause of spills is sched1
[PR/114729](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114729).

The recent sum of two s12 patch does make the spill codegen better by having 1
less insn to materialize the stack access. And that shaves off 10% of cactu
dynamic icounts which should be enough to close this PR.

commit 4bfc4585c9935fbde75ccf04e44a15d24f42cde9
Author: Vineet Gupta <vineetg@rivosinc.com>
Date:   Mon May 13 11:45:55 2024 -0700

    RISC-V: avoid LUI based const materialization ... [part of PR/106265]

    ... if the constant can be represented as sum of two S12 values.
    The two S12 values could instead be fused with subsequent ADD insn.
    The helps
     - avoid an additional LUI insn
     - side benefits of not clobbering a reg

    e.g.
                                w/o patch             w/ patch
    long                  |                     |
    plus(unsigned long i) | li      a5,4096     |
    {                     | addi    a5,a5,-2032 | addi a0, a0, 2047
       return i + 2064;   | add     a0,a0,a5    | addi a0, a0, 17
    }                     |         ret         | ret

    NOTE: In theory not having const in a standalone reg might seem less
          CSE friendly, but for workloads in consideration these mat are
          from very late LRA reloads and follow on GCSE is not doing much
          currently.

    The real benefit however is seen in base+offset computation for array
    accesses and especially for stack accesses which are finalized late in
    optim pipeline, during LRA register allocation. Often the finalized
    offsets trigger LRA reloads resulting in mind boggling repetition of
    exact same insn sequence including LUI based constant materialization.

    This shaves off 290 billion dynamic instrustions (QEMU icounts) in
    SPEC 2017 Cactu benchmark which is over 10% of workload. In the rest of
    suite, there additional 10 billion shaved, with both gains and losses
    in indiv workloads as is usual with compiler changes.

     500.perlbench_r-0 |  1,214,534,029,025 | 1,212,887,959,387 |
     500.perlbench_r-1 |    740,383,419,739 |   739,280,308,163 |
     500.perlbench_r-2 |    692,074,638,817 |   691,118,734,547 |
     502.gcc_r-0       |    190,820,141,435 |   190,857,065,988 |
     502.gcc_r-1       |    225,747,660,839 |   225,809,444,357 | <- -0.02%
     502.gcc_r-2       |    220,370,089,641 |   220,406,367,876 | <- -0.03%
     502.gcc_r-3       |    179,111,460,458 |   179,135,609,723 | <- -0.02%
     502.gcc_r-4       |    219,301,546,340 |   219,320,416,956 | <- -0.01%
     503.bwaves_r-0    |    278,733,324,691 |   278,733,323,575 | <- -0.01%
     503.bwaves_r-1    |    442,397,521,282 |   442,397,519,616 |
     503.bwaves_r-2    |    344,112,218,206 |   344,112,216,760 |
     503.bwaves_r-3    |    417,561,469,153 |   417,561,467,597 |
     505.mcf_r         |    669,319,257,525 |   669,318,763,084 |
     507.cactuBSSN_r   |  2,852,767,394,456 | 2,564,736,063,742 | <+ 10.10%
     508.namd_r        |  1,855,884,342,110 | 1,855,881,110,934 |
     510.parest_r      |  1,654,525,521,053 | 1,654,402,859,174 |
     511.povray_r      |  2,990,146,655,619 | 2,990,060,324,589 |
     519.lbm_r         |  1,158,337,294,525 | 1,158,337,294,529 |
     520.omnetpp_r     |  1,021,765,791,283 | 1,026,165,661,394 |
     521.wrf_r         |  1,715,955,652,503 | 1,714,352,737,385 |
     523.xalancbmk_r   |    849,846,008,075 |   849,836,851,752 |
     525.x264_r-0      |    277,801,762,763 |   277,488,776,427 |
     525.x264_r-1      |    927,281,789,540 |   926,751,516,742 |
     525.x264_r-2      |    915,352,631,375 |   914,667,785,953 |
     526.blender_r     |  1,652,839,180,887 | 1,653,260,825,512 |
     527.cam4_r        |  1,487,053,494,925 | 1,484,526,670,770 |
     531.deepsjeng_r   |  1,641,969,526,837 | 1,642,126,598,866 |
     538.imagick_r     |  2,098,016,546,691 | 2,097,997,929,125 |
     541.leela_r       |  1,983,557,323,877 | 1,983,531,314,526 |
     544.nab_r         |  1,516,061,611,233 | 1,516,061,407,715 |
     548.exchange2_r   |  2,072,594,330,215 | 2,072,591,648,318 |
     549.fotonik3d_r   |  1,001,499,307,366 | 1,001,478,944,189 |
     554.roms_r        |  1,028,799,739,111 | 1,028,780,904,061 |
     557.xz_r-0        |    363,827,039,684 |   363,057,014,260 |
     557.xz_r-1        |    906,649,112,601 |   905,928,888,732 |
     557.xz_r-2        |    509,023,898,187 |   508,140,356,932 |
     997.specrand_fr   |        402,535,577 |       403,052,561 |
     999.specrand_ir   |        402,535,577 |       403,052,561 |

    This should still be considered damage control as the real/deeper fix
    would be to reduce number of LRA reloads or CSE/anchor those during
    LRA constraint sub-pass (re)runs (thats a different PR/114729.

    Implementation Details (for posterity)
    --------------------------------------
     - basic idea is to have a splitter selected via a new predicate for
constant
       being possible sum of two S12 and provide the transform.
       This is however a 2 -> 2 transform which combine can't handle.
       So we specify it using a define_insn_and_split.

     - the initial loose "i" constraint caused LRA to accept invalid insns thus
       needing a tighter new constraint as well.

     - An additional fallback alternate with catch-all "r" register
       constraint also needed to allow any "reloads" that LRA might
       require for ADDI with const larger than S12.

    Testing
    --------
    This is testsuite clean (rv64 only).
    I'll rely on post-commit CI multlib run for any possible fallout for
    other setups such as rv32.

    |                                               |         gcc |         
g++ |     gfortran |
    | rv64imafdc_zba_zbb_zbs_zicond/  lp64d/ medlow |  41 /    17 |    8 /    
3 |    7 /     2 |
    | rv64imafdc_zba_zbb_zbs_zicond/  lp64d/ medlow |  41 /    17 |    8 /    
3 |    7 /     2 |

    I also threw this into a buildroot run, it obviously boots Linux to
    userspace. bloat-o-meter on glibc and kernel show overall decrease in
    staic instruction counts with some minor spot increases.
    These are generally in the category of

     - LUI + ADDI are 2 byte each vs. two ADD being 4 byte each.
     - Knock on effects due to inlining changes.
     - Sometimes the slightly shorter 2-insn seq in a mult-exit function
       can cause in-place epilogue duplication (vs. a jump back).
       This is slightly larger but more efficient in execution.
    In summary nothing to fret about.

    | linux/scripts/bloat-o-meter build-gcc-240131/target/lib/libc.so.6 \
             build-gcc-240131-new-splitter-1-variant/target/lib/libc.so.6
    |
    | add/remove: 0/0 grow/shrink: 21/49 up/down: 520/-3056 (-2536)
    | Function                                     old     new   delta
    | getnameinfo                                 2756    2892    +136
    ...
    | tempnam                                      136     144      +8
    | padzero                                      276     284      +8
    ...
    | __GI___register_printf_specifier             284     280      -4
    | __EI_xdr_array                               468     464      -4
    | try_file_lock                                268     260      -8
    | pthread_create@GLIBC_2                      3520    3508     -12
    | __pthread_create_2_1                        3520    3508     -12
    ...
    | _nss_files_setnetgrent                       932     904     -28
    | _nss_dns_gethostbyaddr2_r                   1524    1480     -44
    | build_trtable                               3312    3208    -104
    | printf_positional                          25000   22580   -2420
    | Total: Before=2107999, After=2105463, chg -0.12%

    Caveat:
    ------
    Jeff noted during v2 review that the operand0 constraint
!riscv_reg_frame_related
    could potentially cause issues with hard reg cprop in future. If that
    trips things up we will have to loosen the constraint while dialing down
    the const range to (-2048 to 2032) as opposed to fll S12 range of
    (-2048 to 2047) to keep stack regs aligned.

    gcc/ChangeLog:
            * config/riscv/riscv.h: New macros to check for sum of two S12
            range.
            * config/riscv/constraints.md: New constraint.
            * config/riscv/predicates.md: New Predicate.
            * config/riscv/riscv.md: New splitter.
            * config/riscv/riscv.cc (riscv_reg_frame_related): New helper.
            * config/riscv/riscv-protos.h: New helper prototype.

    gcc/testsuite/ChangeLog:
            * gcc.target/riscv/sum-of-two-s12-const-1.c: New test: checks
            for new patterns output.
            * gcc.target/riscv/sum-of-two-s12-const-2.c: Ditto.
            * gcc.target/riscv/sum-of-two-s12-const-3.c: New test: should not
            ICE.

    Tested-by: Edwin Lu <ewlu@rivosinc.com> # pre-commit-CI #1520
    Signed-off-by: Vineet Gupta <vineetg@rivosinc.com>

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2024-05-28 22:40 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-11 23:13 [Bug target/106265] New: RISC-V SPEC2017 507.cactu code bloat due to address generation vineet.gupta at linux dot dev
2022-07-11 23:23 ` [Bug target/106265] " vineet.gupta at linux dot dev
2022-07-11 23:25 ` vineet.gupta at linux dot dev
2022-07-11 23:27 ` vineet.gupta at linux dot dev
2022-07-11 23:37 ` vineet.gupta at linux dot dev
2022-07-12  6:44 ` rguenth at gcc dot gnu.org
2022-07-12  7:51   ` Andrew Waterman
2022-07-12  7:52 ` andrew at sifive dot com
2022-07-12 20:18 ` vineet.gupta at linux dot dev
2022-07-21 21:09 ` vineet.gupta at linux dot dev
2022-07-21 21:21 ` vineet.gupta at linux dot dev
2022-07-22 21:39 ` vineet.gupta at linux dot dev
2023-08-04 22:30 ` vineetg at gcc dot gnu.org
2024-05-28 22:40 ` vineetg at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).