public inbox for gcc-bugs@sourceware.org
* [Bug target/114729] New: RISC-V SPEC2017 507.cactu excessive spills with -fschedule-insns
@ 2024-04-15 22:42 vineetg at gcc dot gnu.org
  2024-04-15 22:43 ` [Bug rtl-optimization/114729] " pinskia at gcc dot gnu.org
                   ` (10 more replies)
  0 siblings, 11 replies; 12+ messages in thread
From: vineetg at gcc dot gnu.org @ 2024-04-15 22:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114729

            Bug ID: 114729
           Summary: RISC-V SPEC2017 507.cactu excessive spills with
                    -fschedule-insns
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vineetg at gcc dot gnu.org
                CC: jeffreyalaw at gmail dot com, kito.cheng at gmail dot com,
                    rdapp at gcc dot gnu.org
  Target Milestone: ---

Created attachment 57953
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57953&action=edit
spec cactu reduced

In RISC-V SPEC runs, Cactu's dynamic icounts are the worst of all benchmarks
(compared to aarch64 built with similar toggles: -Ofast).

As of upstream commit 3fed1609f610 of 2024-01-31:
   aarch64: 1,363,212,534,747  vs.
   risc-v : 2,852,277,890,338  (~2.1x aarch64)

There's an existing issue, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265,
which captures ongoing work to improve the stack/array accesses. However, that
is more damage control than a fix. The root cause happens to be excessive stack
spills on RISC-V. Robin noticed these were somehow triggered by the first
scheduling pass. Disabling sched1 with -fno-schedule-insns brings the total
icount down to 1,295,520,619,523, roughly half, which is even slightly better
than aarch64, all things considered.

I ran a reducer (tracking the token "sfp" in -fverbose-asm output) and was able
to get a test which shows a single stack spill (store+load) with the default
-fschedule-insns and none with -fno-schedule-insns.

It seems sched1 is moving insns around, but the actual spills are generated by
IRA, so this is an interplay of sched1 and IRA.

```
ira

New iteration of spill/restore move
      Changing RTL for loop 2 (header bb6)
      Changing RTL for loop 1 (header bb4)
  26 vs parent 26:Creating newreg=246 from oldreg=137
  25 vs parent 25:Creating newreg=247 from oldreg=143
  11 vs parent 11:Creating newreg=248 from oldreg=223
  16 vs parent 16:Creating newreg=249 from oldreg=237

      Changing RTL for loop 3 (header bb3)
  26 vs parent 26:Creating newreg=250 from oldreg=246
  25 vs parent 25:Creating newreg=251 from oldreg=247
  -1 vs parent 11:Creating newreg=253 from oldreg=248
  16 vs parent 16:Creating newreg=254 from oldreg=249

...

scanning new insn with uid = 181.
scanning new insn with uid = 182.
scanning new insn with uid = 183.
scanning new insn with uid = 184.
changing bb of uid 194
  unscanned insn
scanning new insn with uid = 185.
scanning new insn with uid = 186.
scanning new insn with uid = 187.
scanning new insn with uid = 188.
changing bb of uid 195
  unscanned insn

...
+++Costs: overall 11650, reg 10680, mem 970, ld 485, st 485, move 1366
+++       move loops 0, new jumps 2
...

(insn 9 104 11 2 (set (reg/f:DI 137 [ r.4_4 ])
  (mem/f/c:DI (lo_sum:DI (reg/f:DI 155)
     (symbol_ref:DI ("r") [flags 0x86]  
       <var_decl 0x7a69fcdb1630 r>)) [4 r+0 S8 A64]))
              {*movdi_64bit}
    (expr_list:REG_DEAD (reg/f:DI 155)
    (expr_list:REG_EQUAL (mem/f/c:DI 
     (symbol_ref:DI ("r") [flags 0x86]  
        <var_decl 0x7a69fcdb1630 r>) [4 r+0 S8 A64])

(insn 115 165 181 2 (set (reg:DI 245)
   (const_int 1 [0x1])) {*movdi_64bit}
     (expr_list:REG_EQUIV (const_int 1 [0x1])

          ---- spill code start -----

(insn 181 115 182 2 (set (reg/f:DI 246 
    [orig:137 r.4_4 ] [137])
        (reg/f:DI 137 [ r.4_4 ])) {*movdi_64bit}
     (expr_list:REG_DEAD (reg/f:DI 137 [ r.4_4 ])

(insn 182 181 183 2 (set (reg/f:DI 247 
    [orig:143 w.9_10 ] [143])
        (reg/f:DI 143 [ w.9_10 ])) {*movdi_64bit}
     (expr_list:REG_DEAD (reg/f:DI 143 [ w.9_10 ])

(insn 183 182 184 2 (set (reg:DI 248 
    [orig:223 MEM[(int *)j.15_19 + 4B] ] [223])
        (reg:DI 223 [ MEM[(int *)j.15_19 + 4B] ])) 
            {*movdi_64bit}
     (expr_list:REG_DEAD (reg:DI 223 
                     [ MEM[(int *)j.15_19 + 4B] ])

(insn 184 183 174 2 (set (reg:DI 249 
    [orig:237 _38 ] [237])
        (reg:DI 237 [ _38 ])) {*movdi_64bit}
     (expr_list:REG_DEAD (reg:DI 237 [ _38 ])

          ---- spill code end -----

(jump_insn 174 184 175 2 (set (pc)
        (label_ref 100)) 350 {jump}
     (nil)
 -> 100)

(barrier 175 174 196)

          ---- spill code start -----

(code_label 196 175 195 10 10 (nil) [1 uses])
(note 195 196 189 10 [bb 10] NOTE_INSN_BASIC_BLOCK)

(insn 189 195 190 10 (set (reg/f:DI 250 
    [orig:137 r.4_4 ] [137])
        (reg/f:DI 246 [orig:137 r.4_4 ] [137])) 
             {*movdi_64bit}
     (expr_list:REG_DEAD (reg/f:DI 246 
          [orig:137 r.4_4 ] [137])

(insn 190 189 191 10 (set (reg/f:DI 251 
    [orig:143 w.9_10 ] [143])
        (reg/f:DI 247 [orig:143 w.9_10 ] [143])) 
         {*movdi_64bit}
     (expr_list:REG_DEAD (reg/f:DI 247 
           [orig:143 w.9_10 ] [143])

(insn 191 190 192 10 (set (reg/v:DI 252 
     [orig:152 i ] [152])
        (reg/v:DI 152 [ i ])) 208 {*movdi_64bit}
     (expr_list:REG_DEAD (reg/v:DI 152 [ i ])

(insn 192 191 193 10 (set (reg:DI 253 
    [orig:223 MEM[(int *)j.15_19 + 4B] ] [223])
        (reg:DI 248 [orig:223 
    MEM[(int *)j.15_19 + 4B] ] [223])) {*movdi_64bit}
     (expr_list:REG_DEAD (reg:DI 248 
         [orig:223 MEM[(int *)j.15_19 + 4B] ] [223])

(insn 193 192 97 10 (set (reg:DI 254 
    [orig:237 _38 ] [237])
        (reg:DI 249 [orig:237 _38 ] [237])) 
              {*movdi_64bit}
     (expr_list:REG_DEAD (reg:DI 249 
              [orig:237 _38 ] [237])

          ---- spill code end -----

(code_label 97 193 14 3 3 (nil) [1 uses])
(note 14 97 33 3 [bb 3] NOTE_INSN_BASIC_BLOCK)

(insn 17 56 21 3 (set (reg:DF 159 [ l ])
        (mem/c:DF (lo_sum:DI (reg/f:DI 234)
```


* [Bug rtl-optimization/114729] RISC-V SPEC2017 507.cactu excessive spills with -fschedule-insns
  2024-04-15 22:42 [Bug target/114729] New: RISC-V SPEC2017 507.cactu excessive spills with -fschedule-insns vineetg at gcc dot gnu.org
@ 2024-04-15 22:43 ` pinskia at gcc dot gnu.org
  2024-04-15 22:49 ` vineetg at gcc dot gnu.org
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-04-15 22:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114729

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization, ra
          Component|target                      |rtl-optimization

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I suspect there are some other duplicates of this issue (including ones that
affect aarch64).


* [Bug rtl-optimization/114729] RISC-V SPEC2017 507.cactu excessive spills with -fschedule-insns
  2024-04-15 22:42 [Bug target/114729] New: RISC-V SPEC2017 507.cactu excessive spills with -fschedule-insns vineetg at gcc dot gnu.org
  2024-04-15 22:43 ` [Bug rtl-optimization/114729] " pinskia at gcc dot gnu.org
@ 2024-04-15 22:49 ` vineetg at gcc dot gnu.org
  2024-04-15 23:14 ` law at gcc dot gnu.org
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: vineetg at gcc dot gnu.org @ 2024-04-15 22:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114729

--- Comment #2 from Vineet Gupta <vineetg at gcc dot gnu.org> ---
FWIW -fsched-pressure is already default enabled for RISC-V.


* [Bug rtl-optimization/114729] RISC-V SPEC2017 507.cactu excessive spills with -fschedule-insns
  2024-04-15 22:42 [Bug target/114729] New: RISC-V SPEC2017 507.cactu excessive spills with -fschedule-insns vineetg at gcc dot gnu.org
  2024-04-15 22:43 ` [Bug rtl-optimization/114729] " pinskia at gcc dot gnu.org
  2024-04-15 22:49 ` vineetg at gcc dot gnu.org
@ 2024-04-15 23:14 ` law at gcc dot gnu.org
  2024-04-15 23:28 ` vineetg at gcc dot gnu.org
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: law at gcc dot gnu.org @ 2024-04-15 23:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114729

Jeffrey A. Law <law at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2024-04-15
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
                 CC|                            |law at gcc dot gnu.org

--- Comment #3 from Jeffrey A. Law <law at gcc dot gnu.org> ---
Right.  So what I'm most interested in are the scheduler decisions, as most
likely IRA/LRA are simply stuck dealing with a pathological conflict graph
given the number of registers available.  I.e., sched1 is the root cause.

Given the size of the problem and the fact that we have register-pressure-sensitive
scheduling enabled, I don't really think it's related to the issues Andrew
linked or the others I've looked at in the past.  But we're going to have to
dive into those sched1 dumps to know for sure.

Vineet, do we have this isolated enough that we know what function is most
affected and presumably the most impacted blocks?  If so we can probably start
to debug scheduler dumps.

There's a flag -fsched-verbose=N that gives a lot more low-level information
about the scheduler's decisions.  I usually use N=99.  It makes for a huge
dump, but gives extremely detailed information about the scheduler's view of
the world.  It's going to be big enough that bugzilla might balk at attaching
the file, even after compression.

I'm going to go ahead and confirm given Robin's seen the same behavior on this
benchmark.


* [Bug rtl-optimization/114729] RISC-V SPEC2017 507.cactu excessive spills with -fschedule-insns
  2024-04-15 22:42 [Bug target/114729] New: RISC-V SPEC2017 507.cactu excessive spills with -fschedule-insns vineetg at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2024-04-15 23:14 ` law at gcc dot gnu.org
@ 2024-04-15 23:28 ` vineetg at gcc dot gnu.org
  2024-04-16  2:39 ` juzhe.zhong at rivai dot ai
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: vineetg at gcc dot gnu.org @ 2024-04-15 23:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114729

--- Comment #4 from Vineet Gupta <vineetg at gcc dot gnu.org> ---
(In reply to Jeffrey A. Law from comment #3)

> Vineet, do we have this isolated enough that we know what function is most
> affected and presumably the most impacted blocks?  If so we can probably
> start to debug scheduler dumps.

I think so :-) But this is all anecdotal.

The attached test was reduced from the original/full ML_BSSN_RHS.ii (which,
granted, is only the 2nd-worst spiller; the worst is ML_BSSN_Advect.ii, which I
have also reduced now). Anyhow, pretty much all of the file is one function,
and my reduction methodology was to look for 1 spill with sched1 enabled and
none otherwise. I hope that is representative of the pathology seen in the
original/full ML_BSSN_RHS.ii.

> There's a flag -fsched-verbose=N that gives a lot more low level information
> about the scheduler's decisions.  I usually use N=99.  It makes for a huge
> dump, but gives extremely detailed information about the scheduler's view of
> the world.

I'll start diving into sched1 dumps as you suggest.


* [Bug rtl-optimization/114729] RISC-V SPEC2017 507.cactu excessive spills with -fschedule-insns
  2024-04-15 22:42 [Bug target/114729] New: RISC-V SPEC2017 507.cactu excessive spills with -fschedule-insns vineetg at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2024-04-15 23:28 ` vineetg at gcc dot gnu.org
@ 2024-04-16  2:39 ` juzhe.zhong at rivai dot ai
  2024-04-16  7:40 ` rguenth at gcc dot gnu.org
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-04-16  2:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114729

JuzheZhong <juzhe.zhong at rivai dot ai> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |juzhe.zhong at rivai dot ai

--- Comment #5 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
Did you try another scheduler, -fselective-scheduling, to see whether the
spill issues still exist?


* [Bug rtl-optimization/114729] RISC-V SPEC2017 507.cactu excessive spills with -fschedule-insns
  2024-04-15 22:42 [Bug target/114729] New: RISC-V SPEC2017 507.cactu excessive spills with -fschedule-insns vineetg at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2024-04-16  2:39 ` juzhe.zhong at rivai dot ai
@ 2024-04-16  7:40 ` rguenth at gcc dot gnu.org
  2024-04-16 13:49 ` law at gcc dot gnu.org
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-04-16  7:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114729

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |riscv

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
There are also different sched-pressure algorithms via --param
sched-pressure-algorithm, as well as -flive-range-shrinkage.


* [Bug rtl-optimization/114729] RISC-V SPEC2017 507.cactu excessive spills with -fschedule-insns
  2024-04-15 22:42 [Bug target/114729] New: RISC-V SPEC2017 507.cactu excessive spills with -fschedule-insns vineetg at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2024-04-16  7:40 ` rguenth at gcc dot gnu.org
@ 2024-04-16 13:49 ` law at gcc dot gnu.org
  2024-04-16 13:58 ` law at gcc dot gnu.org
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: law at gcc dot gnu.org @ 2024-04-16 13:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114729

--- Comment #7 from Jeffrey A. Law <law at gcc dot gnu.org> ---
Yes, there are different algorithms.  I looked at them a while back when we
first noticed the spilling problems with x264.  There was very little
difference for specint when we varied the algorithms.  IIRC I didn't look at
specfp at the time.


* [Bug rtl-optimization/114729] RISC-V SPEC2017 507.cactu excessive spills with -fschedule-insns
  2024-04-15 22:42 [Bug target/114729] New: RISC-V SPEC2017 507.cactu excessive spills with -fschedule-insns vineetg at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2024-04-16 13:49 ` law at gcc dot gnu.org
@ 2024-04-16 13:58 ` law at gcc dot gnu.org
  2024-04-16 23:02 ` vineetg at gcc dot gnu.org
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: law at gcc dot gnu.org @ 2024-04-16 13:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114729

--- Comment #8 from Jeffrey A. Law <law at gcc dot gnu.org> ---
I didn't even notice you had that testcase attached!

I haven't done a deep dive, but the first thing that jumps out is the number of
instructions in the ready queue, most likely because of the addressing of
objects in static storage.  The highparts alone are going to take ~18 GPRs for
the loop.


* [Bug rtl-optimization/114729] RISC-V SPEC2017 507.cactu excessive spills with -fschedule-insns
  2024-04-15 22:42 [Bug target/114729] New: RISC-V SPEC2017 507.cactu excessive spills with -fschedule-insns vineetg at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2024-04-16 13:58 ` law at gcc dot gnu.org
@ 2024-04-16 23:02 ` vineetg at gcc dot gnu.org
  2024-04-18  0:45 ` vineetg at gcc dot gnu.org
  2024-04-18 14:30 ` law at gcc dot gnu.org
  10 siblings, 0 replies; 12+ messages in thread
From: vineetg at gcc dot gnu.org @ 2024-04-16 23:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114729

Vineet Gupta <vineetg at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2024-04-15 00:00:00         |2024-04-16

--- Comment #9 from Vineet Gupta <vineetg at gcc dot gnu.org> ---
So I started with the reg being spilled (a1):

.L2:
        beq     a1,zero,.L5    # if j[1] == 0
        li      a2,1
        ble     a6,s11,.L2    # if j[0] < 1
        sd      a1,8(sp)                # spill (save)


.L3:                       # inner loop start
       ...

        blt  a2,a6,.L3    # inner loop end

        ld      a1,8(sp)                # spill (restore)
        j       .L2

Next I zoomed into the inner loop, where a1 is used/clobbered with sched1 but
not without it, using my rudimentary define/use/dead annotation.

------------------------------------------------------------------------------
        -fschedule-insns (NOK)       |         -fno-schedule-insns (OK)
------------------------------------------------------------------------------
1-def      ld    a5,%lo(u)(s0) #u, u | 1-def       ld    a5,%lo(u)(t6)  # u, u
2-def      srliw a0,a5,16            | 2-def       srliw s10,a5,16
3-def      srli  a1,a5,32            | 1-use       sh    a5,%lo(_Z1sv)(a4)
1-use      sh    a5,%lo(_Z1sv)(a3)   | 2-dead      sh    s10,%lo(_Z1sv+2)(a4)
              ---insn1---            | 3-def       srli  s10,a5,32
1-use      srli  a5,a5,48            | 1-use       srli  a5,a5,48
              ---insn2---            | 1-dead      sh    a5,%lo(_Z1sv+6)(a4)
2-dead     sh    a0,%lo(_Z1sv+2)(a3) |              ---insn1---
3-dead     sh    a1,%lo(_Z1sv+4)(a3) |              ---insn2---
1-dead     sh    a5,%lo(_Z1sv+6)(a3) | 3-dead      sh    s10,%lo(_Z1sv+4)(a4)

The problem seems to be the longer live range of 2-def (on the left side). If
it were used/dead right after, 3-def wouldn't need a new register.

With that insight, I can now start looking into the sched1 dumps of the
corresponding BB.

;;       10--> b  0: i  35 r170#0=[r242+low(`u')]                  :alu:@GR_REGS+1(1)@FP_REGS+0(0)
;;       11--> b  0: i  79 r209=[r229+low(`f')]                    :alu:GR_REGS+0(0)FP_REGS+1(1)
;;       12--> b  0: i  76 r141=fix(r206)                          :alu:@GR_REGS+1(1)@FP_REGS+0(-1)
;;       13--> b  0: i  46 r180=zxt(r170,0x10,0x10)                :alu:@GR_REGS+1(1)@FP_REGS+0(0)
;;       14--> b  0: i  55 r188=r170 0>>0x20                       :alu:GR_REGS+1(1)FP_REGS+0(0)
;;       15--> b  0: i  81 r210=r141<<0x3                          :alu:GR_REGS+1(0)FP_REGS+0(0)
;;       16--> b  0: i  82 r211=r143+r210                          :alu:GR_REGS+1(0)FP_REGS+0(0)
;;       17--> b  0: i  44 [r230+low(`_Z1sv')]=r170#0              :alu:@GR_REGS+0(0)@FP_REGS+0(0)
;;       18--> b  0: i  65 r197=r170 0>>0x30                       :alu:GR_REGS+1(0)FP_REGS+0(0)
;;       19--> b  0: i  54 [r230+low(const(`_Z1sv'+0x2))]=r180#0   :alu:@GR_REGS+0(-1)@FP_REGS+0(0)
;;       20--> b  0: i  64 [r230+low(const(`_Z1sv'+0x4))]=r188#0   :alu:GR_REGS+0(-1)FP_REGS+0(0)
;;       21--> b  0: i  73 [r230+low(const(`_Z1sv'+0x6))]=r197#0   :alu:GR_REGS+0(-1)FP_REGS+0(0)


* [Bug rtl-optimization/114729] RISC-V SPEC2017 507.cactu excessive spills with -fschedule-insns
  2024-04-15 22:42 [Bug target/114729] New: RISC-V SPEC2017 507.cactu excessive spills with -fschedule-insns vineetg at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2024-04-16 23:02 ` vineetg at gcc dot gnu.org
@ 2024-04-18  0:45 ` vineetg at gcc dot gnu.org
  2024-04-18 14:30 ` law at gcc dot gnu.org
  10 siblings, 0 replies; 12+ messages in thread
From: vineetg at gcc dot gnu.org @ 2024-04-18  0:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114729

--- Comment #10 from Vineet Gupta <vineetg at gcc dot gnu.org> ---
Debug update: -fsched-verbose=99 dumps (they are reaaaaalllly verbose).

For the insns/regs under consideration, the canonical pre-scheduled sequence
with ideal live ranges (but a non-ideal load-to-use delay) is the following:

  ;;   ======================================================
  ;;   -- basic block 3 from 17 to 98 -- before reload
  ;;   ======================================================

  ;;    |   35 |   10 | r170#0=[r242+low(`u')]         alu
  ;;    |   44 |    6 | [r230+low(`_Z1sv')]=r170#0     alu

  ;;    |   46 |    7 | r180=zxt(r170,0x10,0x10)       alu
  ;;    |   54 |    6 | [r230+low(const(`_Z1sv'+0x2))]=r180#0 alu

  ;;    |   55 |    7 | r188=r170 0>>0x20              alu
  ;;    |   64 |    6 | [r230+low(const(`_Z1sv'+0x4))]=r188#0 alu

  ;;    |   65 |    7 | r197=r170 0>>0x30              alu
  ;;    |   73 |    6 | [r230+low(const(`_Z1sv'+0x6))]=r197#0 alu

r170 (insn 35) is the central character whose live range has to be the longest
because of dependencies.

 - {46, 55, 65} USE r170, and sources which create new pseudos
 - {54, 64, 73} are where these new pseudos sink.

How these 2 sets are interleaved determines the register pressure (see the
sketch below):
 - If interleaved as src1:sink1:src2:sink2:src3:sink3, 1 reg suffices
 - If grouped as    src1:src2:src3:sink1:sink2:sink3,  3 regs are needed
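
For reference, here's a hand-reconstructed C sketch of that sequence (a
hypothetical model; `s` stands in for `_Z1sv` and its layout is assumed). Each
shift is a "src" creating a new pseudo, and each halfword store is its "sink";
if sched1 keeps the pairs together, one temporary suffices, while hoisting all
the shifts ahead of the stores (left column of comment #9) keeps three
temporaries live at once.

```
#include <stdint.h>

extern uint64_t u;        /* `u' in the dump, loaded once by insn 35 */
extern uint16_t s[4];     /* stand-in for `_Z1sv' (layout assumed)   */

void store_halves(void)
{
    uint64_t v = u;              /* insn 35: r170, the long-lived value */
    s[0] = (uint16_t)v;          /* insn 44: low halfword               */
    s[1] = (uint16_t)(v >> 16);  /* src insn 46 -> sink insn 54         */
    s[2] = (uint16_t)(v >> 32);  /* src insn 55 -> sink insn 64         */
    s[3] = (uint16_t)(v >> 48);  /* src insn 65 -> sink insn 73         */
}
```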

Per sched1 dumps, the "source" set gets inducted into the ready queue together:

  ;;    dependencies resolved: insn 65
  ;;    tick updated: insn 65 into ready
  ;;    dependencies resolved: insn 55
  ;;    tick updated: insn 55 into ready
  ;;    dependencies resolved: insn 46
  ;;    tick updated: insn 46 into ready
  ;;    dependencies resolved: insn 44
  ;;    tick updated: insn 44 into ready
  ;;    +------------------------------------------------------
  ;;    | Pressure costs for ready queue
  ;;    |  pressure points GR_REGS:[26->28 at 17:54] FP_REGS:[1->1 at 0:94]
  ;;    +------------------------------------------------------
  ;;    |  15   44 |    6  +3 | GR_REGS:[0 base cost 0] FP_REGS:[0 base cost 0]
  ;;    |  16   46 |    7  +3 | GR_REGS:[1 base cost 0] FP_REGS:[0 base cost 0]
               ^^^^
  ;;    |  18   55 |    7  +3 | GR_REGS:[1 base cost 1] FP_REGS:[0 base cost 0]
               ^^^^
  ;;    |  20   65 |    7  +3 | GR_REGS:[1 base cost 1] FP_REGS:[0 base cost 0]
               ^^^^
  ;;    |  11   76 |   10  +2 | GR_REGS:[1 base cost 0] FP_REGS:[-1 base cost 0]
  ;;    |   0   94 |    2  +1 | GR_REGS:[0 base cost 0] FP_REGS:[0 base cost 0]
  ;;    |  28   92 |    5  +1 | GR_REGS:[0 base cost 0] FP_REGS:[1 base cost 0]
  ;;    |  26   88 |    5  +1 | GR_REGS:[0 base cost 0] FP_REGS:[1 base cost 0]
  ;;    |  22   79 |    9  +1 | GR_REGS:[0 base cost 0] FP_REGS:[1 base cost 0]
  ;;    +------------------------------------------------------
  ;;      RFS_PRESSURE_DELAY: 7: 44 46 76 94
  ;;            RFS_PRIORITY: 6: 92 88 79
  ;;      RFS_PRESSURE_INDEX: 2: 55
  ;;    Ready list (t =  10):    65:44(cost=1:prio=7:delay=3:idx=20)  55:42(cost=1:prio=7:delay=3:idx=18)  44:39(cost=0:prio=6:delay=3:idx=15)  46:40(cost=0:prio=7:delay=3:idx=16)  76:47(cost=0:prio=10:delay=2:idx=11)  94:58(cost=0:prio=2:delay=1:idx=0)  92:56(cost=0:prio=5:delay=1:idx=28)  88:54(cost=0:prio=5:delay=1:idx=26)  79:48(cost=0:prio=9:delay=1:idx=22)

As the algorithm converges, they move around a bit, but src and sink are
rarely considered in the same iteration, and if at all, only one pair is.

  ;;    +------------------------------------------------------
  ;;    | Pressure costs for ready queue
  ;;    |  pressure points GR_REGS:[29->29 at 0:94] FP_REGS:[1->1 at 0:94]
  ;;    +------------------------------------------------------

...

  ;;    |  19   64 |    6  +0 | GR_REGS:[-1 base cost -1] FP_REGS:[0 base cost 0]
  ;;    |  17   54 |    6  +0 | GR_REGS:[-1 base cost -1] FP_REGS:[0 base cost 0]
  ;;    |  20   65 |    7  +0 | GR_REGS:[0 base cost 0] FP_REGS:[0 base cost 0]


All of this leads to the pessimistic schedule emitted in the end.

I'm still trying to wrap my head around the humungous dump info.


* [Bug rtl-optimization/114729] RISC-V SPEC2017 507.cactu excessive spills with -fschedule-insns
  2024-04-15 22:42 [Bug target/114729] New: RISC-V SPEC2017 507.cactu excessive spills with -fschedule-insns vineetg at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2024-04-18  0:45 ` vineetg at gcc dot gnu.org
@ 2024-04-18 14:30 ` law at gcc dot gnu.org
  10 siblings, 0 replies; 12+ messages in thread
From: law at gcc dot gnu.org @ 2024-04-18 14:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114729

--- Comment #11 from Jeffrey A. Law <law at gcc dot gnu.org> ---
Yup.  -fsched-verbose=99 is *very* verbose.  But that's the point, to see all
the gory details.  It can be dialed down, but I've never done so myself.

What stands out to me is this:

  ;;    | Pressure costs for ready queue
  ;;    |  pressure points GR_REGS:[29->29 at 0:94] FP_REGS:[1->1 at 0:94]

I haven't had to debug pressure stuff, so I'm not as familiar with its dump
format.  But I'd hazard a guess the "29->29" means the insn is neutral WRT
register pressure with the estimate being we need 29 GPRs before/after this
insn.  

If we think about our GPR file, at 29 we're likely already spilling: 32 minus
(sp, fp, ra, x0, gp, perhaps tp as well) leaves ~26 allocatable.  So maybe that
points at the first two things to verify.

1. What does the "29" actually mean?  I'm guessing it means the number of GPRs
estimated live at this point.  But we should make sure.

2. How does the heuristic determine when to start applying pressure
sensitivity?  Presumably it's based on the number of registers in a particular
class.  But given we can't allocate sp, ra, x0, fp, gp, are we properly
accounting for those limitations?
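
As a back-of-the-envelope check on that arithmetic (a trivial sketch; the
exact reserved set is the assumption to verify):

```
#include <stdio.h>

int main(void)
{
    int gprs        = 32;               /* RISC-V integer register file     */
    int reserved    = 6;                /* sp, fp, ra, x0, gp, tp (assumed) */
    int allocatable = gprs - reserved;  /* ~26 usable by the allocator      */
    int pressure    = 29;               /* the "29" from the sched1 dump    */

    printf("pressure %d vs %d allocatable -> over by %d\n",
           pressure, allocatable, pressure - allocatable);
    return 0;
}
```

If the heuristic compares against 32 rather than ~26, it would start reacting
to pressure six registers too late.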


end of thread, other threads:[~2024-04-18 14:30 UTC | newest]

Thread overview: 12+ messages
-- links below jump to the message on this page --
2024-04-15 22:42 [Bug target/114729] New: RISC-V SPEC2017 507.cactu excessive spills with -fschedule-insns vineetg at gcc dot gnu.org
2024-04-15 22:43 ` [Bug rtl-optimization/114729] " pinskia at gcc dot gnu.org
2024-04-15 22:49 ` vineetg at gcc dot gnu.org
2024-04-15 23:14 ` law at gcc dot gnu.org
2024-04-15 23:28 ` vineetg at gcc dot gnu.org
2024-04-16  2:39 ` juzhe.zhong at rivai dot ai
2024-04-16  7:40 ` rguenth at gcc dot gnu.org
2024-04-16 13:49 ` law at gcc dot gnu.org
2024-04-16 13:58 ` law at gcc dot gnu.org
2024-04-16 23:02 ` vineetg at gcc dot gnu.org
2024-04-18  0:45 ` vineetg at gcc dot gnu.org
2024-04-18 14:30 ` law at gcc dot gnu.org
