public inbox for gcc-patches@gcc.gnu.org
* ICE for interim fix for PR/110748
@ 2023-08-01 19:17 Vineet Gupta
  2023-08-11 23:32 ` IRA update_equiv_regs for (was Re: ICE for interim fix for PR/110748) Vineet Gupta
  0 siblings, 1 reply; 7+ messages in thread
From: Vineet Gupta @ 2023-08-01 19:17 UTC (permalink / raw)
  To: Jeff Law; +Cc: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 936 bytes --]

Hi Jeff,

As discussed this morning, I'm sending over dumps for the optimization
of the DF const -0.0 (PR/110748) [1].
For the rv64gc_zbs build, IRA is undoing the split, which eventually
leads to an ICE in the final pass.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110748#c15

void znd(double *d) {  *d = -0.0;   }
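
(For context: DF -0.0 is nothing but the sign bit of an IEEE-754
double, i.e. the 0x8000000000000000 that the split below materializes
in a GPR. A quick standalone illustration, nothing RISC-V specific:)

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        double d = -0.0;
        unsigned long long bits;
        memcpy(&bits, &d, sizeof bits);   /* well-defined type pun */
        printf("%#llx\n", bits);          /* prints 0x8000000000000000 */
        return 0;
    }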


*split1*

(insn 10 3 11 2 (set (reg:DI 136)
         (const_int [0x8000000000000000])) "neg.c":4:5 -1

(insn 11 10 0 2 (set (mem:DF (reg:DI 135) [1 *d_2(D)+0 S8 A64])
         (subreg:DF (reg:DI 136) 0)) "neg.c":4:5 -1

*ira*

(insn 11 9 12 2 (set (mem:DF (reg:DI 135) [1 *d_2(D)+0 S8 A64])
         (const_double:DF -0.0 [-0x0.0p+0])) "neg.c":4:5 190 
{*movdf_hardfloat_rv64}
      (expr_list:REG_DEAD (reg:DI 135)


For the working case, the large const is not involved and not subject to 
IRA playing foul.

Attached are split1 and IRA dumps for OK (rv64gc) and NOK (rv64gc_zbs) 
cases.

Thx,
-Vineet

[-- Attachment #2: neg-NOK-rv64gc_zbs.302r.ira --]
[-- Type: text/plain, Size: 4487 bytes --]


;; Function znd (znd, funcdef_no=0, decl_uid=2278, cgraph_uid=1, symbol_order=0)

Starting decreasing number of live ranges...
starting the processing of deferred insns
ending the processing of deferred insns
df_analyze called
;; 1 loops found
;;
;; Loop 0
;;  header 0, latch 1
;;  depth 0, outer -1
;;  nodes: 0 1 2
;; 2 succs { 1 }
rescanning insn with uid = 11.
deleting insn with uid = 10.
starting the processing of deferred insns
ending the processing of deferred insns
df_analyze called
df_worklist_dataflow_doublequeue: n_basic_blocks 3 n_edges 2 count 3 (    1)
Reg 135 uninteresting
;; 1 loops found
;;
;; Loop 0
;;  header 0, latch 1
;;  depth 0, outer -1
;;  nodes: 0 1 2
;; 2 succs { 1 }
Building IRA IR
starting the processing of deferred insns
ending the processing of deferred insns
df_analyze called

Pass 0 for finding pseudo/allocno costs

    a0 (r135,l0) best GR_REGS, allocno GR_REGS

  a0(r135,l0) costs: SIBCALL_REGS:2000,2000 JALR_REGS:2000,2000 GR_REGS:2000,2000 MEM:10000,10000


Pass 1 for finding pseudo/allocno costs

    r135: preferred GR_REGS, alternative NO_REGS, allocno GR_REGS

  a0(r135,l0) costs: GR_REGS:2000,2000 MEM:10000,10000

   Insn 11(l0): point = 0
   Insn 9(l0): point = 2
 a0(r135): [1..2]
Compressing live ranges: from 5 to 2 - 40%
Ranges after the compression:
 a0(r135): [0..1]
+++Allocating 0 bytes for conflict table (uncompressed size 8)
;; a0(r135,l0) conflicts:
;;     total conflict hard regs:
;;     conflict hard regs:


  pref0:a0(r135)<-hr10@2000
  regions=1, blocks=3, points=2
    allocnos=1 (big 0), copies=0, conflicts=0, ranges=1

**** Allocnos coloring:


  Loop 0 (parent -1, header bb2, depth 0)
    bbs: 2
    all: 0r135
    modified regnos: 135
    border:
    Pressure: GR_REGS=2
    Hard reg set forest:
      0:( 1 5-63)@0
        1:( 5-31)@24000
      Allocno a0r135 of GR_REGS(28) has 27 avail. regs  5-31, node:  5-31 (confl regs =  0-4 32-127)
      Forming thread from colorable bucket:
      Pushing a0(r135,l0)(cost 0)
      Popping a0(r135,l0)  --         assign reg 10
Disposition:
    0:r135 l0    10
New iteration of spill/restore move
+++Costs: overall -2000, reg -2000, mem 0, ld 0, st 0, move 0
+++       move loops 0, new jumps 0


znd

Dataflow summary:
;;  fully invalidated by EH 	 0 [zero] 3 [gp] 4 [tp] 5 [t0] 6 [t1] 7 [t2] 10 [a0] 11 [a1] 12 [a2] 13 [a3] 14 [a4] 15 [a5] 16 [a6] 17 [a7] 28 [t3] 29 [t4] 30 [t5] 31 [t6] 32 [ft0] 33 [ft1] 34 [ft2] 35 [ft3] 36 [ft4] 37 [ft5] 38 [ft6] 39 [ft7] 42 [fa0] 43 [fa1] 44 [fa2] 45 [fa3] 46 [fa4] 47 [fa5] 48 [fa6] 49 [fa7] 60 [ft8] 61 [ft9] 62 [ft10] 63 [ft11] 66 [vl] 67 [vtype] 68 [vxrm] 69 [frm] 70 [N/A] 71 [N/A] 72 [N/A] 73 [N/A] 74 [N/A] 75 [N/A] 76 [N/A] 77 [N/A] 78 [N/A] 79 [N/A] 80 [N/A] 81 [N/A] 82 [N/A] 83 [N/A] 84 [N/A] 85 [N/A] 86 [N/A] 87 [N/A] 88 [N/A] 89 [N/A] 90 [N/A] 91 [N/A] 92 [N/A] 93 [N/A] 94 [N/A] 95 [N/A] 96 [v0] 97 [v1] 98 [v2] 99 [v3] 100 [v4] 101 [v5] 102 [v6] 103 [v7] 104 [v8] 105 [v9] 106 [v10] 107 [v11] 108 [v12] 109 [v13] 110 [v14] 111 [v15] 112 [v16] 113 [v17] 114 [v18] 115 [v19] 116 [v20] 117 [v21] 118 [v22] 119 [v23] 120 [v24] 121 [v25] 122 [v26] 123 [v27] 124 [v28] 125 [v29] 126 [v30] 127 [v31]
;;  hardware regs used 	 2 [sp] 64 [arg] 65 [frame]
;;  regular block artificial uses 	 2 [sp] 8 [s0] 64 [arg] 65 [frame]
;;  eh block artificial uses 	 2 [sp] 8 [s0] 64 [arg] 65 [frame]
;;  entry block defs 	 1 [ra] 2 [sp] 8 [s0] 10 [a0] 11 [a1] 12 [a2] 13 [a3] 14 [a4] 15 [a5] 16 [a6] 17 [a7] 42 [fa0] 43 [fa1] 44 [fa2] 45 [fa3] 46 [fa4] 47 [fa5] 48 [fa6] 49 [fa7] 64 [arg] 65 [frame]
;;  exit block uses 	 1 [ra] 2 [sp] 8 [s0] 65 [frame]
;;  regs ever live 	 10 [a0]
;;  ref usage 	r1={1d,1u} r2={1d,2u} r8={1d,2u} r10={1d,1u} r11={1d} r12={1d} r13={1d} r14={1d} r15={1d} r16={1d} r17={1d} r42={1d} r43={1d} r44={1d} r45={1d} r46={1d} r47={1d} r48={1d} r49={1d} r64={1d,1u} r65={1d,2u} r135={1d,1u} 
;;    total ref usage 32{22d,10u,0e} in 2{2 regular + 0 call} insns.
(note 1 0 4 NOTE_INSN_DELETED)
(note 4 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(note 2 4 3 2 NOTE_INSN_DELETED)
(note 3 2 9 2 NOTE_INSN_FUNCTION_BEG)
(insn 9 3 11 2 (set (reg:DI 135)
        (reg:DI 10 a0 [ d ])) "neg.c":3:1 179 {*movdi_64bit}
     (expr_list:REG_DEAD (reg:DI 10 a0 [ d ])
        (nil)))
(insn 11 9 12 2 (set (mem:DF (reg:DI 135) [1 *d_2(D)+0 S8 A64])
        (const_double:DF -0.0 [-0x0.0p+0])) "neg.c":4:5 190 {*movdf_hardfloat_rv64}
     (expr_list:REG_DEAD (reg:DI 135)
        (nil)))
(note 12 11 0 NOTE_INSN_DELETED)

[-- Attachment #3: neg-NOK-rv64gc_zbs.293r.split1 --]
[-- Type: text/plain, Size: 2526 bytes --]


;; Function znd (znd, funcdef_no=0, decl_uid=2278, cgraph_uid=1, symbol_order=0)

Splitting with gen_split_19 (riscv.md:2144)
scanning new insn with uid = 10.
scanning new insn with uid = 11.
deleting insn with uid = 6.
deleting insn with uid = 6.


znd

Dataflow summary:
;;  fully invalidated by EH 	 0 [zero] 3 [gp] 4 [tp] 5 [t0] 6 [t1] 7 [t2] 10 [a0] 11 [a1] 12 [a2] 13 [a3] 14 [a4] 15 [a5] 16 [a6] 17 [a7] 28 [t3] 29 [t4] 30 [t5] 31 [t6] 32 [ft0] 33 [ft1] 34 [ft2] 35 [ft3] 36 [ft4] 37 [ft5] 38 [ft6] 39 [ft7] 42 [fa0] 43 [fa1] 44 [fa2] 45 [fa3] 46 [fa4] 47 [fa5] 48 [fa6] 49 [fa7] 60 [ft8] 61 [ft9] 62 [ft10] 63 [ft11] 66 [vl] 67 [vtype] 68 [vxrm] 69 [frm] 70 [N/A] 71 [N/A] 72 [N/A] 73 [N/A] 74 [N/A] 75 [N/A] 76 [N/A] 77 [N/A] 78 [N/A] 79 [N/A] 80 [N/A] 81 [N/A] 82 [N/A] 83 [N/A] 84 [N/A] 85 [N/A] 86 [N/A] 87 [N/A] 88 [N/A] 89 [N/A] 90 [N/A] 91 [N/A] 92 [N/A] 93 [N/A] 94 [N/A] 95 [N/A] 96 [v0] 97 [v1] 98 [v2] 99 [v3] 100 [v4] 101 [v5] 102 [v6] 103 [v7] 104 [v8] 105 [v9] 106 [v10] 107 [v11] 108 [v12] 109 [v13] 110 [v14] 111 [v15] 112 [v16] 113 [v17] 114 [v18] 115 [v19] 116 [v20] 117 [v21] 118 [v22] 119 [v23] 120 [v24] 121 [v25] 122 [v26] 123 [v27] 124 [v28] 125 [v29] 126 [v30] 127 [v31]
;;  hardware regs used 	 2 [sp] 64 [arg] 65 [frame]
;;  regular block artificial uses 	 2 [sp] 8 [s0] 64 [arg] 65 [frame]
;;  eh block artificial uses 	 2 [sp] 8 [s0] 64 [arg] 65 [frame]
;;  entry block defs 	 1 [ra] 2 [sp] 8 [s0] 10 [a0] 11 [a1] 12 [a2] 13 [a3] 14 [a4] 15 [a5] 16 [a6] 17 [a7] 42 [fa0] 43 [fa1] 44 [fa2] 45 [fa3] 46 [fa4] 47 [fa5] 48 [fa6] 49 [fa7] 64 [arg] 65 [frame]
;;  exit block uses 	 1 [ra] 2 [sp] 8 [s0] 65 [frame]
;;  regs ever live 	 10 [a0]
;;  ref usage 	r1={1d,1u} r2={1d,2u} r8={1d,2u} r10={1d,1u} r11={1d} r12={1d} r13={1d} r14={1d} r15={1d} r16={1d} r17={1d} r42={1d} r43={1d} r44={1d} r45={1d} r46={1d} r47={1d} r48={1d} r49={1d} r64={1d,1u} r65={1d,2u} r135={1d,1u} r136={1d,1u} 
;;    total ref usage 34{23d,11u,0e} in 3{3 regular + 0 call} insns.
(note 1 0 4 NOTE_INSN_DELETED)
(note 4 1 9 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn 9 4 2 2 (set (reg:DI 135)
        (reg:DI 10 a0 [ d ])) "neg.c":3:1 -1
     (expr_list:REG_DEAD (reg:DI 10 a0 [ d ])
        (nil)))
(note 2 9 3 2 NOTE_INSN_DELETED)
(note 3 2 10 2 NOTE_INSN_FUNCTION_BEG)
(insn 10 3 11 2 (set (reg:DI 136)
        (const_int -9223372036854775808 [0x8000000000000000])) "neg.c":4:5 -1
     (nil))
(insn 11 10 0 2 (set (mem:DF (reg:DI 135) [1 *d_2(D)+0 S8 A64])
        (subreg:DF (reg:DI 136) 0)) "neg.c":4:5 -1
     (nil))

[-- Attachment #4: neg-OK-rv64gc.302r.ira --]
[-- Type: text/plain, Size: 6638 bytes --]


;; Function znd (znd, funcdef_no=0, decl_uid=2278, cgraph_uid=1, symbol_order=0)

Starting decreasing number of live ranges...
starting the processing of deferred insns
ending the processing of deferred insns
df_analyze called
;; 1 loops found
;;
;; Loop 0
;;  header 0, latch 1
;;  depth 0, outer -1
;;  nodes: 0 1 2
;; 2 succs { 1 }
scanning new insn with uid = 14.
verify found no changes in insn with uid = 14.
deleting insn with uid = 10.
starting the processing of deferred insns
ending the processing of deferred insns
df_analyze called
df_worklist_dataflow_doublequeue: n_basic_blocks 3 n_edges 2 count 3 (    1)
Reg 135: local to bb 2 def dominates all uses has unique first use
Reg 137 uninteresting
Reg 136 uninteresting
Found def insn 9 for 135 to be not moveable
;; 1 loops found
;;
;; Loop 0
;;  header 0, latch 1
;;  depth 0, outer -1
;;  nodes: 0 1 2
;; 2 succs { 1 }
Building IRA IR
starting the processing of deferred insns
ending the processing of deferred insns
df_analyze called
init_insns for 137: (insn_list:REG_DEP_TRUE 14 (nil))

Pass 0 for finding pseudo/allocno costs

    a2 (r137,l0) best GR_REGS, allocno GR_REGS
    a1 (r136,l0) best JALR_REGS, allocno JALR_REGS
    a0 (r135,l0) best GR_REGS, allocno GR_REGS

  a0(r135,l0) costs: SIBCALL_REGS:2000,2000 JALR_REGS:2000,2000 GR_REGS:2000,2000 MEM:10000,10000
  a1(r136,l0) costs: SIBCALL_REGS:2000,2000 JALR_REGS:2000,2000 GR_REGS:8000,8000 MEM:10000,10000
  a2(r137,l0) costs: SIBCALL_REGS:0,0 JALR_REGS:0,0 GR_REGS:0,0 MEM:10000,10000


Pass 1 for finding pseudo/allocno costs

    r137: preferred GR_REGS, alternative NO_REGS, allocno GR_REGS
    r136: preferred JALR_REGS, alternative GR_REGS, allocno GR_REGS
    r135: preferred GR_REGS, alternative NO_REGS, allocno GR_REGS

  a0(r135,l0) costs: GR_REGS:2000,2000 MEM:10000,10000
  a1(r136,l0) costs: JALR_REGS:4000,4000 GR_REGS:10000,10000 MEM:12000,12000
  a2(r137,l0) costs: GR_REGS:0,0 MEM:10000,10000

   Insn 12(l0): point = 0
   Insn 11(l0): point = 2
   Insn 14(l0): point = 4
   Insn 9(l0): point = 6
 a0(r135): [1..6]
 a1(r136): [1..2]
 a2(r137): [3..4]
Compressing live ranges: from 9 to 4 - 44%
Ranges after the compression:
 a0(r135): [0..3]
 a1(r136): [0..1]
 a2(r137): [2..3]
+++Allocating 24 bytes for conflict table (uncompressed size 24)
;; a0(r135,l0) conflicts: a1(r136,l0) a2(r137,l0)
;;     total conflict hard regs: 1
;;     conflict hard regs: 1

;; a1(r136,l0) conflicts: a0(r135,l0)
;;     total conflict hard regs: 1
;;     conflict hard regs: 1

;; a2(r137,l0) conflicts: a0(r135,l0)
;;     total conflict hard regs: 1
;;     conflict hard regs: 1


  cp0:a1(r136)<->a2(r137)@125:shuffle
  pref0:a0(r135)<-hr10@2000
  regions=1, blocks=3, points=4
    allocnos=3 (big 0), copies=1, conflicts=0, ranges=3

**** Allocnos coloring:


  Loop 0 (parent -1, header bb2, depth 0)
    bbs: 2
    all: 0r135 1r136 2r137
    modified regnos: 135 136 137
    border:
    Pressure: GR_REGS=3
    Hard reg set forest:
      0:( 1 5-63)@0
        1:( 5-31)@60000
      Allocno a0r135 of GR_REGS(28) has 27 avail. regs  5-31, node:  5-31 (confl regs =  0-4 32-127)
      Allocno a1r136 of GR_REGS(28) has 27 avail. regs  5-31, node:  5-31 (confl regs =  0-4 32-127)
      Allocno a2r137 of GR_REGS(28) has 27 avail. regs  5-31, node:  5-31 (confl regs =  0-4 32-127)
      Forming thread from colorable bucket:
        Forming thread by copy 0:a1r136-a2r137 (freq=125):
          Result (freq=4000): a1r136(2000) a2r137(2000)
      Pushing a0(r135,l0)(cost 0)
      Pushing a2(r137,l0)(cost 0)
      Pushing a1(r136,l0)(cost 0)
      Popping a1(r136,l0)  --         assign reg 15
      Popping a2(r137,l0)  --         assign reg 15
      Popping a0(r135,l0)  --         assign reg 10
Disposition:
    0:r135 l0    10    1:r136 l0    15    2:r137 l0    15
New iteration of spill/restore move
+++Costs: overall 2000, reg 2000, mem 0, ld 0, st 0, move 0
+++       move loops 0, new jumps 0


znd

Dataflow summary:
;;  fully invalidated by EH 	 0 [zero] 3 [gp] 4 [tp] 5 [t0] 6 [t1] 7 [t2] 10 [a0] 11 [a1] 12 [a2] 13 [a3] 14 [a4] 15 [a5] 16 [a6] 17 [a7] 28 [t3] 29 [t4] 30 [t5] 31 [t6] 32 [ft0] 33 [ft1] 34 [ft2] 35 [ft3] 36 [ft4] 37 [ft5] 38 [ft6] 39 [ft7] 42 [fa0] 43 [fa1] 44 [fa2] 45 [fa3] 46 [fa4] 47 [fa5] 48 [fa6] 49 [fa7] 60 [ft8] 61 [ft9] 62 [ft10] 63 [ft11] 66 [vl] 67 [vtype] 68 [vxrm] 69 [frm] 70 [N/A] 71 [N/A] 72 [N/A] 73 [N/A] 74 [N/A] 75 [N/A] 76 [N/A] 77 [N/A] 78 [N/A] 79 [N/A] 80 [N/A] 81 [N/A] 82 [N/A] 83 [N/A] 84 [N/A] 85 [N/A] 86 [N/A] 87 [N/A] 88 [N/A] 89 [N/A] 90 [N/A] 91 [N/A] 92 [N/A] 93 [N/A] 94 [N/A] 95 [N/A] 96 [v0] 97 [v1] 98 [v2] 99 [v3] 100 [v4] 101 [v5] 102 [v6] 103 [v7] 104 [v8] 105 [v9] 106 [v10] 107 [v11] 108 [v12] 109 [v13] 110 [v14] 111 [v15] 112 [v16] 113 [v17] 114 [v18] 115 [v19] 116 [v20] 117 [v21] 118 [v22] 119 [v23] 120 [v24] 121 [v25] 122 [v26] 123 [v27] 124 [v28] 125 [v29] 126 [v30] 127 [v31]
;;  hardware regs used 	 2 [sp] 64 [arg] 65 [frame]
;;  regular block artificial uses 	 2 [sp] 8 [s0] 64 [arg] 65 [frame]
;;  eh block artificial uses 	 2 [sp] 8 [s0] 64 [arg] 65 [frame]
;;  entry block defs 	 1 [ra] 2 [sp] 8 [s0] 10 [a0] 11 [a1] 12 [a2] 13 [a3] 14 [a4] 15 [a5] 16 [a6] 17 [a7] 42 [fa0] 43 [fa1] 44 [fa2] 45 [fa3] 46 [fa4] 47 [fa5] 48 [fa6] 49 [fa7] 64 [arg] 65 [frame]
;;  exit block uses 	 1 [ra] 2 [sp] 8 [s0] 65 [frame]
;;  regs ever live 	 10 [a0]
;;  ref usage 	r1={1d,1u} r2={1d,2u} r8={1d,2u} r10={1d,1u} r11={1d} r12={1d} r13={1d} r14={1d} r15={1d} r16={1d} r17={1d} r42={1d} r43={1d} r44={1d} r45={1d} r46={1d} r47={1d} r48={1d} r49={1d} r64={1d,1u} r65={1d,2u} r135={1d,1u} r136={1d,1u} r137={1d,1u} 
;;    total ref usage 36{24d,12u,0e} in 4{4 regular + 0 call} insns.
(note 1 0 4 NOTE_INSN_DELETED)
(note 4 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(note 2 4 3 2 NOTE_INSN_DELETED)
(note 3 2 9 2 NOTE_INSN_FUNCTION_BEG)
(insn 9 3 14 2 (set (reg:DI 135)
        (reg:DI 10 a0 [ d ])) "neg.c":3:1 179 {*movdi_64bit}
     (expr_list:REG_DEAD (reg:DI 10 a0 [ d ])
        (nil)))
(insn 14 9 11 2 (set (reg:DI 137)
        (const_int -1 [0xffffffffffffffff])) "neg.c":4:5 179 {*movdi_64bit}
     (expr_list:REG_EQUIV (const_int -1 [0xffffffffffffffff])
        (nil)))
(insn 11 14 12 2 (set (reg:DI 136)
        (ashift:DI (reg:DI 137)
            (const_int 63 [0x3f]))) "neg.c":4:5 198 {ashldi3}
     (expr_list:REG_DEAD (reg:DI 137)
        (nil)))
(insn 12 11 13 2 (set (mem:DF (reg:DI 135) [1 *d_2(D)+0 S8 A64])
        (subreg:DF (reg:DI 136) 0)) "neg.c":4:5 190 {*movdf_hardfloat_rv64}
     (expr_list:REG_DEAD (reg:DI 136)
        (expr_list:REG_DEAD (reg:DI 135)
            (nil))))
(note 13 12 0 NOTE_INSN_DELETED)

[-- Attachment #5: neg-OK-rv64gc.293r.split1 --]
[-- Type: text/plain, Size: 2684 bytes --]


;; Function znd (znd, funcdef_no=0, decl_uid=2278, cgraph_uid=1, symbol_order=0)

Splitting with gen_split_19 (riscv.md:2144)
scanning new insn with uid = 10.
scanning new insn with uid = 11.
scanning new insn with uid = 12.
deleting insn with uid = 6.
deleting insn with uid = 6.


znd

Dataflow summary:
;;  fully invalidated by EH 	 0 [zero] 3 [gp] 4 [tp] 5 [t0] 6 [t1] 7 [t2] 10 [a0] 11 [a1] 12 [a2] 13 [a3] 14 [a4] 15 [a5] 16 [a6] 17 [a7] 28 [t3] 29 [t4] 30 [t5] 31 [t6] 32 [ft0] 33 [ft1] 34 [ft2] 35 [ft3] 36 [ft4] 37 [ft5] 38 [ft6] 39 [ft7] 42 [fa0] 43 [fa1] 44 [fa2] 45 [fa3] 46 [fa4] 47 [fa5] 48 [fa6] 49 [fa7] 60 [ft8] 61 [ft9] 62 [ft10] 63 [ft11] 66 [vl] 67 [vtype] 68 [vxrm] 69 [frm] 70 [N/A] 71 [N/A] 72 [N/A] 73 [N/A] 74 [N/A] 75 [N/A] 76 [N/A] 77 [N/A] 78 [N/A] 79 [N/A] 80 [N/A] 81 [N/A] 82 [N/A] 83 [N/A] 84 [N/A] 85 [N/A] 86 [N/A] 87 [N/A] 88 [N/A] 89 [N/A] 90 [N/A] 91 [N/A] 92 [N/A] 93 [N/A] 94 [N/A] 95 [N/A] 96 [v0] 97 [v1] 98 [v2] 99 [v3] 100 [v4] 101 [v5] 102 [v6] 103 [v7] 104 [v8] 105 [v9] 106 [v10] 107 [v11] 108 [v12] 109 [v13] 110 [v14] 111 [v15] 112 [v16] 113 [v17] 114 [v18] 115 [v19] 116 [v20] 117 [v21] 118 [v22] 119 [v23] 120 [v24] 121 [v25] 122 [v26] 123 [v27] 124 [v28] 125 [v29] 126 [v30] 127 [v31]
;;  hardware regs used 	 2 [sp] 64 [arg] 65 [frame]
;;  regular block artificial uses 	 2 [sp] 8 [s0] 64 [arg] 65 [frame]
;;  eh block artificial uses 	 2 [sp] 8 [s0] 64 [arg] 65 [frame]
;;  entry block defs 	 1 [ra] 2 [sp] 8 [s0] 10 [a0] 11 [a1] 12 [a2] 13 [a3] 14 [a4] 15 [a5] 16 [a6] 17 [a7] 42 [fa0] 43 [fa1] 44 [fa2] 45 [fa3] 46 [fa4] 47 [fa5] 48 [fa6] 49 [fa7] 64 [arg] 65 [frame]
;;  exit block uses 	 1 [ra] 2 [sp] 8 [s0] 65 [frame]
;;  regs ever live 	 10 [a0]
;;  ref usage 	r1={1d,1u} r2={1d,2u} r8={1d,2u} r10={1d,1u} r11={1d} r12={1d} r13={1d} r14={1d} r15={1d} r16={1d} r17={1d} r42={1d} r43={1d} r44={1d} r45={1d} r46={1d} r47={1d} r48={1d} r49={1d} r64={1d,1u} r65={1d,2u} r135={1d,1u} r136={1d,1u} r137={1d,1u} 
;;    total ref usage 36{24d,12u,0e} in 4{4 regular + 0 call} insns.
(note 1 0 4 NOTE_INSN_DELETED)
(note 4 1 9 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn 9 4 2 2 (set (reg:DI 135)
        (reg:DI 10 a0 [ d ])) "neg.c":3:1 -1
     (expr_list:REG_DEAD (reg:DI 10 a0 [ d ])
        (nil)))
(note 2 9 3 2 NOTE_INSN_DELETED)
(note 3 2 10 2 NOTE_INSN_FUNCTION_BEG)
(insn 10 3 11 2 (set (reg:DI 137)
        (const_int -1 [0xffffffffffffffff])) "neg.c":4:5 -1
     (nil))
(insn 11 10 12 2 (set (reg:DI 136)
        (ashift:DI (reg:DI 137)
            (const_int 63 [0x3f]))) "neg.c":4:5 -1
     (nil))
(insn 12 11 0 2 (set (mem:DF (reg:DI 135) [1 *d_2(D)+0 S8 A64])
        (subreg:DF (reg:DI 136) 0)) "neg.c":4:5 -1
     (nil))


* IRA update_equiv_regs for (was Re: ICE for interim fix for PR/110748)
  2023-08-01 19:17 ICE for interim fix for PR/110748 Vineet Gupta
@ 2023-08-11 23:32 ` Vineet Gupta
  2023-08-12  0:04   ` Jeff Law
  0 siblings, 1 reply; 7+ messages in thread
From: Vineet Gupta @ 2023-08-11 23:32 UTC (permalink / raw)
  To: Jeff Law; +Cc: GCC Patches, gnu-toolchain


On 8/1/23 12:17, Vineet Gupta wrote:
> Hi Jeff,
>
> As discussed this morning, I'm sending over dumps for the optimization
> of the DF const -0.0 (PR/110748) [1].
> For the rv64gc_zbs build, IRA is undoing the split, which eventually
> leads to an ICE in the final pass.
>
> [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110748#c15
>
> void znd(double *d) {  *d = -0.0;   }
>
>
> *split1*
>
> (insn 10 3 11 2 (set (reg:DI 136)
>         (const_int [0x8000000000000000])) "neg.c":4:5 -1
>
> (insn 11 10 0 2 (set (mem:DF (reg:DI 135) [1 *d_2(D)+0 S8 A64])
>         (subreg:DF (reg:DI 136) 0)) "neg.c":4:5 -1
>
> *ira*
>
> (insn 11 9 12 2 (set (mem:DF (reg:DI 135) [1 *d_2(D)+0 S8 A64])
>         (const_double:DF -0.0 [-0x0.0p+0])) "neg.c":4:5 190 
> {*movdf_hardfloat_rv64}
>      (expr_list:REG_DEAD (reg:DI 135)
>
>
> For the working case, the large const is not involved and not subject 
> to IRA playing foul.

I investigated this some more. IRA's update_equiv_regs () has code
identifying potential replacements: any reg that is referenced exactly
twice, set once and used once.

           if (REG_N_REFS (regno) == 2
               && (rtx_equal_p (replacement, src)
                   || ! equiv_init_varies_p (src))
               && NONJUMP_INSN_P (insn)
               && equiv_init_movable_p (PATTERN (insn), regno))
             reg_equiv[regno].replace = 1;
         }

And combine_and_move_insns () does the replacement, undoing the split1 
above.

As a hack, if I disable this code, the ICE above goes away and we get
the right output.

     bseti    a5,zero,63
     sd    a5,0(a0)
     ret

In fact this is the reason many more split1 results are being undone. See
the suboptimal codegen for a large const in Andrew Pinski's test case [1]:

   long long f(void)
   {
     unsigned t = 0x1010101;
     long long t1 = t;
     long long t2 = ((unsigned long long) t) << 32;
     asm("" : "+r"(t1));
     return t1 | t2;
   }

[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-June/620413.html

Again, if I use the hacked compiler, the redundant immediate goes away.

       upstream           |       ira hack         |
     li   a0,0x1010000    |    li   a5,0x1010000   |
     addi a0,a0,0x101     |    addi a5,a5,0x101    |
     li   a5,0x1010000    |    mv   a0,a5          |
     slli a0,a0,32        |    slli a5,a5,32       |
     addi a5,a5,0x101     |    or   a0,a0,a5       |
     or   a0,a5,a0        |    ret                 |
     ret                  |                        |



This code has been there since 2009, when ira.c was created, so it is
obviously by design / expected behavior.

I'm wondering (naively) if there is some way to tune this for a given
backend. In general it would make sense to do the replacement, but not
if the cost changes (e.g. consts can be embedded in an x86 insn freely,
but not on RISC-V where this is costly), and if something was split, it
might have been intentional.
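
Something along these lines, a rough untested sketch against the
snippet above, is what I have in mind (set_src_cost () being the
generic rtx-cost query, and dest the SET_DEST from the surrounding
ira.cc code):

           if (REG_N_REFS (regno) == 2
               && (rtx_equal_p (replacement, src)
                   || ! equiv_init_varies_p (src))
               && NONJUMP_INSN_P (insn)
               && equiv_init_movable_p (PATTERN (insn), regno)
               /* Sketch: skip the replacement when rematerializing SRC
                  costs more than one insn on this target, e.g. a large
                  constant the backend deliberately split.  */
               && set_src_cost (src, GET_MODE (dest), true)
                    <= COSTS_N_INSNS (1))
             reg_equiv[regno].replace = 1;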

Thx,
-Vineet


* Re: IRA update_equiv_regs for (was Re: ICE for interim fix for PR/110748)
  2023-08-11 23:32 ` IRA update_equiv_regs for (was Re: ICE for interim fix for PR/110748) Vineet Gupta
@ 2023-08-12  0:04   ` Jeff Law
  2023-08-12 16:44     ` Jivan Hakobyan
  2023-08-15  0:35     ` Vineet Gupta
  0 siblings, 2 replies; 7+ messages in thread
From: Jeff Law @ 2023-08-12  0:04 UTC (permalink / raw)
  To: Vineet Gupta; +Cc: GCC Patches, gnu-toolchain



On 8/11/23 17:32, Vineet Gupta wrote:
> 
> On 8/1/23 12:17, Vineet Gupta wrote:
>> Hi Jeff,
>>
>> As discussed this morning, I'm sending over dumps for the optimization
>> of the DF const -0.0 (PR/110748) [1].
>> For the rv64gc_zbs build, IRA is undoing the split, which eventually
>> leads to an ICE in the final pass.
>>
>> [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110748#c15
>>
>> void znd(double *d) {  *d = -0.0;   }
>>
>>
>> *split1*
>>
>> (insn 10 3 11 2 (set (reg:DI 136)
>>         (const_int [0x8000000000000000])) "neg.c":4:5 -1
>>
>> (insn 11 10 0 2 (set (mem:DF (reg:DI 135) [1 *d_2(D)+0 S8 A64])
>>         (subreg:DF (reg:DI 136) 0)) "neg.c":4:5 -1
>>
>> *ira*
>>
>> (insn 11 9 12 2 (set (mem:DF (reg:DI 135) [1 *d_2(D)+0 S8 A64])
>>         (const_double:DF -0.0 [-0x0.0p+0])) "neg.c":4:5 190 
>> {*movdf_hardfloat_rv64}
>>      (expr_list:REG_DEAD (reg:DI 135)
>>
>>
>> For the working case, the large const is not involved and not subject 
>> to IRA playing foul.
> 
> I investigated this some more. IRA's update_equiv_regs () has code
> identifying potential replacements: any reg that is referenced exactly
> twice, set once and used once.
> 
>            if (REG_N_REFS (regno) == 2
>                && (rtx_equal_p (replacement, src)
>                    || ! equiv_init_varies_p (src))
>                && NONJUMP_INSN_P (insn)
>                && equiv_init_movable_p (PATTERN (insn), regno))
>              reg_equiv[regno].replace = 1;
>          }
> 
> And combine_and_move_insns () does the replacement, undoing the split1 
> above.
Right.  This is as expected.  There was actually similar code that goes 
back even before the introduction of IRA -- like to the 80s and 90s.

Conceptually the idea is that a value with an equivalence that has a
single set and single use isn't a good use of a hard register.  Better to
narrow the live range to a single pair of instructions.

It's not always a good tradeoff.  Consider if the equivalence was also a 
loop invariant and hoisted out of the loop and register pressure is low.


> 
> In fact this is the reason for many more split1 being undone. See the 
> suboptimal codegen for large const for Andrew Pinski's test case [1]
No doubt.  I think it's also a problem with some of Jivan's work.


> 
> I'm wondering (naively) if there is some way to tune this for a given
> backend. In general it would make sense to do the replacement, but not
> if the cost changes (e.g. consts can be embedded in an x86 insn freely,
> but not on RISC-V where this is costly), and if something was split, it
> might have been intentional.
I'm not immediately aware of a way to tune.

When it comes to tuning, the toplevel question is whether we have any of
the info we need to tune at the point where the transformation occurs.
The two most obvious pieces here would be loop info and register pressure.

ie, do we have enough loop structure to know if the def is at a 
shallower loop nest than the use.  There's a reasonable chance we have 
this information as my recollection is this analysis is done fairly 
early in IRA.

But that means we likely don't have any sense of register pressure at 
the points between the def and use.   So the most useful metric for 
tuning isn't really available.

The one thing that stands out is we don't do this transformation at all 
when register pressure sensitive scheduling is enabled.  And we really 
should be turning that on by default.  Our data shows register pressure 
sensitive scheduling is about a 6-7% cycle improvement on x264 as it 
avoids spilling in those key satd loops.

>  /* Don't move insns if live range shrinkage or register
>      pressure-sensitive scheduling were done because it will not
>      improve allocation but likely worsen insn scheduling.  */
>   if (optimize
>       && !flag_live_range_shrinkage
>       && !(flag_sched_pressure && flag_schedule_insns))
>     combine_and_move_insns ();


So you might want to look at register pressure sensitive scheduling 
first.  If you go into x264_r from specint and look at 
x264_pixel_satd_8x4.  First verify the loops are fully unrolled.  If 
they are, then look for 32bit loads/stores into the stack.  If you have 
them, then you're spilling and getting crappy performance.  Using 
register pressure sensitive scheduling should help significantly.

We've certainly seen that internally.  The plan was to submit a patch to 
make register pressure sensitive scheduling the default when the 
scheduler is enabled.  We just haven't pushed on it.  If you can verify 
that you're seeing spilling as well, then it'd certainly bolster the 
argument that register-pressure-sensitive-scheduling is desirable.

Jeff









* Re: IRA update_equiv_regs for (was Re: ICE for interim fix for PR/110748)
  2023-08-12  0:04   ` Jeff Law
@ 2023-08-12 16:44     ` Jivan Hakobyan
  2023-08-15  3:57       ` Jeff Law
  2023-08-15  0:35     ` Vineet Gupta
  1 sibling, 1 reply; 7+ messages in thread
From: Jivan Hakobyan @ 2023-08-12 16:44 UTC (permalink / raw)
  To: Jeff Law; +Cc: Vineet Gupta, GCC Patches, gnu-toolchain

[-- Attachment #1: Type: text/plain, Size: 6653 bytes --]

Yes, as Jeff mentioned, I have some work in that scope.

The first is related to address computation when it has a large constant
part.
Suppose we have this code:

    int  consume (void *);
    int foo (void) {
       int x[1000];
       return consume (x);
    }

before IRA we have the following sequence
    19: r140:DI=0xfffffffffffff000
    20: r136:DI=r140:DI+0x60
      REG_EQUAL 0xfffffffffffff060
    8: a0:DI=frame:DI+r136:DI
      REG_DEAD r136:DI

but during IRA (eliminate_regs_in_insn) insn 8 transforms to
   8: a0:DI=r136:DI+0xfa0+frame:DI
        REG_DEAD r136:DI

and in the end, we get this suboptimal sequence:
   21: r136:DI=0xfffffffffffff060
      REG_EQUIV 0xfffffffffffff060
   25: r143:DI=0x1000
   26: r142:DI=r143:DI-0x60
      REG_DEAD r143:DI
      REG_EQUAL 0xfa0
   27: r142:DI=r142:DI+r136:DI
      REG_DEAD r136:DI
   8: a0:DI=r142:DI+frame:DI
      REG_DEAD r142:DI

My changes prevent that transformation.
I have tested on SPEC and did not get regressions.
Besides, it executed 40B fewer instructions.

The second work is related to hoisting out loop-invariant code.
I have a test case where SP + const can be hoisted out.
......
.L3:
      call foo
      addi a5,sp,16
      sh3add a0,a0,a5
.......

Before IRA that code is already out of the loop, but IRA moves it back.
My approach was done in update_equiv_regs():
it prevents the move if the reg has a single def and a single use, and
the use is inside a loop.
Currently, that improvement is under evaluation.
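
The guard is roughly of this shape (a simplified sketch of the idea,
not the actual patch; it leans on df and the loop info IRA already
has):

    /* Sketch: if the lone use sits in a deeper loop than the def,
       moving the def down would drag it back into the loop and undo
       the earlier invariant motion, so keep the equivalence from
       being replaced.  */
    rtx_insn *use_insn = DF_REF_INSN (DF_REG_USE_CHAIN (regno));
    basic_block def_bb = BLOCK_FOR_INSN (insn);
    basic_block use_bb = BLOCK_FOR_INSN (use_insn);

    if (loop_depth (use_bb->loop_father)
        > loop_depth (def_bb->loop_father))
      reg_equiv[regno].replace = 0;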


On Sat, Aug 12, 2023 at 4:05 AM Jeff Law via Gcc-patches <
gcc-patches@gcc.gnu.org> wrote:

>
>
> On 8/11/23 17:32, Vineet Gupta wrote:
> >
> > On 8/1/23 12:17, Vineet Gupta wrote:
> >> Hi Jeff,
> >>
> >> As discussed this morning, I'm sending over dumps for the optimization
> >> of the DF const -0.0 (PR/110748) [1].
> >> For the rv64gc_zbs build, IRA is undoing the split, which eventually
> >> leads to an ICE in the final pass.
> >>
> >> [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110748#c15
> >>
> >> void znd(double *d) {  *d = -0.0;   }
> >>
> >>
> >> *split1*
> >>
> >> (insn 10 3 11 2 (set (reg:DI 136)
> >>         (const_int [0x8000000000000000])) "neg.c":4:5 -1
> >>
> >> (insn 11 10 0 2 (set (mem:DF (reg:DI 135) [1 *d_2(D)+0 S8 A64])
> >>         (subreg:DF (reg:DI 136) 0)) "neg.c":4:5 -1
> >>
> >> *ira*
> >>
> >> (insn 11 9 12 2 (set (mem:DF (reg:DI 135) [1 *d_2(D)+0 S8 A64])
> >>         (const_double:DF -0.0 [-0x0.0p+0])) "neg.c":4:5 190
> >> {*movdf_hardfloat_rv64}
> >>      (expr_list:REG_DEAD (reg:DI 135)
> >>
> >>
> >> For the working case, the large const is not involved and not subject
> >> to IRA playing foul.
> >
> > I investigated this some more. IRA's update_equiv_regs () has code
> > identifying potential replacements: any reg that is referenced exactly
> > twice, set once and used once.
> >
> >            if (REG_N_REFS (regno) == 2
> >                && (rtx_equal_p (replacement, src)
> >                    || ! equiv_init_varies_p (src))
> >                && NONJUMP_INSN_P (insn)
> >                && equiv_init_movable_p (PATTERN (insn), regno))
> >              reg_equiv[regno].replace = 1;
> >          }
> >
> > And combine_and_move_insns () does the replacement, undoing the split1
> > above.
> Right.  This is as expected.  There was actually similar code that goes
> back even before the introduction of IRA -- like to the 80s and 90s.
>
> Conceptually the idea is that a value with an equivalence that has a
> single set and single use isn't a good use of a hard register.  Better to
> narrow the live range to a single pair of instructions.
>
> It's not always a good tradeoff.  Consider if the equivalence was also a
> loop invariant and hoisted out of the loop and register pressure is low.
>
>
> >
> > In fact this is the reason for many more split1 being undone. See the
> > suboptimal codegen for large const for Andrew Pinski's test case [1]
> No doubt.  I think it's also a problem with some of Jivan's work.
>
>
> >
> > I'm wondering (naively) if there is some way to tune this for a given
> > backend. In general it would make sense to do the replacement, but not
> > if the cost changes (e.g. consts can be embedded in an x86 insn freely,
> > but not on RISC-V where this is costly), and if something was split, it
> > might have been intentional.
> I'm not immediately aware of a way to tune.
>
> When it comes to tuning, the toplevel question is whether we have any of
> the info we need to tune at the point where the transformation occurs.
> The two most obvious pieces here would be loop info and register pressure.
>
> ie, do we have enough loop structure to know if the def is at a
> shallower loop nest than the use.  There's a reasonable chance we have
> this information as my recollection is this analysis is done fairly
> early in IRA.
>
> But that means we likely don't have any sense of register pressure at
> the points between the def and use.   So the most useful metric for
> tuning isn't really available.
>
> The one thing that stands out is we don't do this transformation at all
> when register pressure sensitive scheduling is enabled.  And we really
> should be turning that on by default.  Our data shows register pressure
> sensitive scheduling is about a 6-7% cycle improvement on x264 as it
> avoids spilling in those key satd loops.
>
> >  /* Don't move insns if live range shrinkage or register
> >      pressure-sensitive scheduling were done because it will not
> >      improve allocation but likely worsen insn scheduling.  */
> >   if (optimize
> >       && !flag_live_range_shrinkage
> >       && !(flag_sched_pressure && flag_schedule_insns))
> >     combine_and_move_insns ();
>
>
> So you might want to look at register pressure sensitive scheduling
> first.  If you go into x264_r from specint and look at
> x264_pixel_satd_8x4.  First verify the loops are fully unrolled.  If
> they are, then look for 32bit loads/stores into the stack.  If you have
> them, then you're spilling and getting crappy performance.  Using
> register pressure sensitive scheduling should help significantly.
>
> We've certainly seen that internally.  The plan was to submit a patch to
> make register pressure sensitive scheduling the default when the
> scheduler is enabled.  We just haven't pushed on it.  If you can verify
> that you're seeing spilling as well, then it'd certainly bolster the
> argument that register-pressure-sensitive-scheduling is desirable.
>
> Jeff
>
>
>
>
>
>
>
>

-- 
With the best regards
Jivan Hakobyan


* Re: IRA update_equiv_regs for (was Re: ICE for interim fix for PR/110748)
  2023-08-12  0:04   ` Jeff Law
  2023-08-12 16:44     ` Jivan Hakobyan
@ 2023-08-15  0:35     ` Vineet Gupta
  2023-08-15  4:02       ` Jeff Law
  1 sibling, 1 reply; 7+ messages in thread
From: Vineet Gupta @ 2023-08-15  0:35 UTC (permalink / raw)
  To: Jeff Law; +Cc: GCC Patches, gnu-toolchain


On 8/11/23 17:04, Jeff Law wrote:
>>
>>> I'm wondering (naively) if there is some way to tune this for a
>>> given backend. In general it would make sense to do the replacement,
>>> but not if the cost changes (e.g. consts can be embedded in an x86
>>> insn freely, but not on RISC-V where this is costly), and if something
>>> was split, it might have been intentional.
> I'm not immediately aware of a way to tune.
>
> When it comes to tuning, the toplevel question is whether we have any of
> the info we need to tune at the point where the transformation occurs.
> The two most obvious pieces here would be loop info and register pressure.
>
> ie, do we have enough loop structure to know if the def is at a 
> shallower loop nest than the use.  There's a reasonable chance we have 
> this information as my recollection is this analysis is done fairly 
> early in IRA.
>
> But that means we likely don't have any sense of register pressure at 
> the points between the def and use.   So the most useful metric for 
> tuning isn't really available.

I'd argue that even if the register pressure were high, in some cases
there's just no way around it and RA needs to honor what the backend did
a priori (split, in this case), otherwise we end up with something which
literally doesn't compute and leads to an ICE. I'm puzzled that in this
case an intentional implementation detail is getting in the way. So
while I don't care about the -0.0 case in itself, it seems with the
current framework we just can't achieve the result, other than via the
roundabout peephole2 approach you alluded to.

>
> The one thing that stands out is we don't do this transformation at 
> all when register pressure sensitive scheduling is enabled. And we 
> really should be turning that on by default.  Our data shows register 
> pressure sensitive scheduling is about a 6-7% cycle improvement on 
> x264 as it avoids spilling in those key satd loops.
>
>>  /* Don't move insns if live range shrinkage or register
>>      pressure-sensitive scheduling were done because it will not
>>      improve allocation but likely worsen insn scheduling.  */
>>   if (optimize
>>       && !flag_live_range_shrinkage
>>       && !(flag_sched_pressure && flag_schedule_insns))
>>     combine_and_move_insns ();
>
>
> So you might want to look at register pressure sensitive scheduling 
> first.  If you go into x264_r from specint and look at 
> x264_pixel_satd_8x4.  First verify the loops are fully unrolled. If 
> they are, then look for 32bit loads/stores into the stack.  If you 
> have them, then you're spilling and getting crappy performance.  Using 
> register pressure sensitive scheduling should help significantly.

Is that -fira-loop-pressure?
>
> We've certainly seen that internally.  The plan was to submit a patch 
> to make register pressure sensitive scheduling the default when the 
> scheduler is enabled.  We just haven't pushed on it.  If you can 
> verify that you're seeing spilling as well, then it'd certainly 
> bolster the argument that register-pressure-sensitive-scheduling is 
> desirable.

I can confirm that the loop is fully unrolled and there are a zillion
stack spills there for intermediate computes (-Ofast
-march=rv64gc_zba_zbb_zbs, no V in that build).

Thx,
-Vineet


* Re: IRA update_equiv_regs for (was Re: ICE for interim fix for PR/110748)
  2023-08-12 16:44     ` Jivan Hakobyan
@ 2023-08-15  3:57       ` Jeff Law
  0 siblings, 0 replies; 7+ messages in thread
From: Jeff Law @ 2023-08-15  3:57 UTC (permalink / raw)
  To: Jivan Hakobyan; +Cc: Vineet Gupta, GCC Patches, gnu-toolchain



On 8/12/23 10:44, Jivan Hakobyan wrote:
> Yes, as Jeff mentioned, I have some work in that scope.
> 
> The first is related to address computation when it has a large constant 
> part.
> Suppose we have this code:
> 
>      int  consume (void *);
>      int foo (void) {
>         int x[1000];
>         return consume (x);
>      }
> 
> before IRA we have the following sequence
>      19: r140:DI=0xfffffffffffff000
>      20: r136:DI=r140:DI+0x60
>        REG_EQUAL 0xfffffffffffff060
>      8: a0:DI=frame:DI+r136:DI
>        REG_DEAD r136:DI
> 
> but during IRA (eliminate_regs_in_insn) insn 8 transforms to
>     8: a0:DI=r136:DI+0xfa0+frame:DI
>          REG_DEAD r136:DI
> 
> and in the end, we get this suboptimal sequence:
>     21: r136:DI=0xfffffffffffff060
>        REG_EQUIV 0xfffffffffffff060
>     25: r143:DI=0x1000
>     26: r142:DI=r143:DI-0x60
>        REG_DEAD r143:DI
>        REG_EQUAL 0xfa0
>     27: r142:DI=r142:DI+r136:DI
>        REG_DEAD r136:DI
>     8: a0:DI=r142:DI+frame:DI
>        REG_DEAD r142:DI
> 
> My changes prevent that transformation.
> I have tested on SPEC and did not get regressions.
> Besides, it executed 40B fewer instructions.
Right.  And this looks like a generic failing of the register 
elimination code to simplify after eliminating fp/ap to sp.  It's a bit 
of a surprise as I thought that code had some simplification 
capabilities.   But clearly if it has that ability it isn't working 
well.  Part of me wondered if it's falling down due to constants not 
fitting in a 12-bit signed immediate.   I've got a TODO to look at your 
patch in this space.  Maybe tonight if I can keep moving things off my 
TODO list ;-)
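
Spelling that hunch out: RISC-V I-type immediates (addi et al.) are
12-bit signed, i.e. [-2048, 2047], and the elimination offset in insn 8
above is 0xfa0 == 4000, so it indeed can't be folded into a single
addi. A trivial standalone check (the same range test SMALL_OPERAND in
riscv.h encodes, if memory serves):

    #include <stdio.h>

    /* RISC-V 12-bit signed immediate range check.  */
    static int small_operand_p (long v) { return v >= -2048 && v <= 2047; }

    int main (void)
    {
      long off = 0xfa0;   /* elimination offset from insn 8 */
      printf ("%ld fits in addi: %d\n", off, small_operand_p (off));
      /* prints: 4000 fits in addi: 0 */
      return 0;
    }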

> 
> The second work is related to hoisting out loop-invariant code.
> I have a test case where SP + const can be hoisted out.
> ......
> .L3:
>        call foo
>        addi a5,sp,16
>        sh3add a0,a0,a5
> .......
> 
> Before IRA that code is already out of the loop, but IRA moves it back.
> My approach was done in update_equiv_regs():
> it prevents the move if the reg has a single def and a single use, and
> the use is inside a loop.
> Currently, that improvement is under evaluation.
Yea, we're going to need to sit down with this.  IRA is working per 
design and we may be able to avoid these problems with -fsched-pressure, 
but it feels a bit hackish.


Jeff


* Re: IRA update_equiv_regs for (was Re: ICE for interim fix for PR/110748)
  2023-08-15  0:35     ` Vineet Gupta
@ 2023-08-15  4:02       ` Jeff Law
  0 siblings, 0 replies; 7+ messages in thread
From: Jeff Law @ 2023-08-15  4:02 UTC (permalink / raw)
  To: Vineet Gupta; +Cc: GCC Patches, gnu-toolchain



On 8/14/23 18:35, Vineet Gupta wrote:
> 
> On 8/11/23 17:04, Jeff Law wrote:
>>>
>>> I'm wondering (naively) if there is some way to tune this for a
>>> given backend. In general it would make sense to do the replacement,
>>> but not if the cost changes (e.g. consts can be embedded in an x86
>>> insn freely, but not on RISC-V where this is costly), and if something
>>> was split, it might have been intentional.
>> I'm not immediately aware of a way to tune.
>>
>> When it comes to tuning, the toplevel question is whether we have any of
>> the info we need to tune at the point where the transformation occurs.
>> The two most obvious pieces here would be loop info and register pressure.
>>
>> ie, do we have enough loop structure to know if the def is at a 
>> shallower loop nest than the use.  There's a reasonable chance we have 
>> this information as my recollection is this analysis is done fairly 
>> early in IRA.
>>
>> But that means we likely don't have any sense of register pressure at 
>> the points between the def and use.   So the most useful metric for 
>> tuning isn't really available.
> 
> I'd argue that even if the register pressure were high, in some cases
> there's just no way around it and RA needs to honor what the backend did
> a priori (split, in this case), otherwise we end up with something which
> literally doesn't compute and leads to an ICE. I'm puzzled that in this
> case an intentional implementation detail is getting in the way. So
> while I don't care about the -0.0 case in itself, it seems with the
> current framework we just can't achieve the result, other than via the
> roundabout peephole2 approach you alluded to.
I think you'll run into a lot of resistance with that approach.   The 
fact is we're being a bit sneaky and telling a bit of a fib in the 
backend (claiming support for certain capabilities that don't actually 
exist).

As many have said, lie to GCC and ultimately it will get its revenge.  This 
is but one example.

When we lie to some parts of gcc, we may well trigger undesirable 
behavior later in the pipeline.  It's a tradeoff and sometimes we have 
to back out those little lies.





> 
>>
>> The one thing that stands out is we don't do this transformation at 
>> all when register pressure sensitive scheduling is enabled. And we 
>> really should be turning that on by default.  Our data shows register 
>> pressure sensitive scheduling is about a 6-7% cycle improvement on 
>> x264 as it avoids spilling in those key satd loops.
>>
>>>  /* Don't move insns if live range shrinkage or register
>>>      pressure-sensitive scheduling were done because it will not
>>>      improve allocation but likely worsen insn scheduling.  */
>>>   if (optimize
>>>       && !flag_live_range_shrinkage
>>>       && !(flag_sched_pressure && flag_schedule_insns))
>>>     combine_and_move_insns ();
>>
>>
>> So you might want to look at register pressure sensitive scheduling 
>> first.  If you go into x264_r from specint and look at 
>> x264_pixel_satd_8x4.  First verify the loops are fully unrolled. If 
>> they are, then look for 32bit loads/stores into the stack.  If you 
>> have them, then you're spilling and getting crappy performance.  Using 
>> register pressure sensitive scheduling should help significantly.
> 
> Is that -fira-loop-pressure?
-fsched-pressure I think.


>>
>> We've certainly seen that internally.  The plan was to submit a patch 
>> to make register pressure sensitive scheduling the default when the 
>> scheduler is enabled.  We just haven't pushed on it.  If you can 
>> verify that you're seeing spilling as well, then it'd certainly 
>> bolster the argument that register-pressure-sensitive-scheduling is 
>> desirable.
> 
> I can confirm that the loop is fully unrolled and there are a zillion
> stack spills there for intermediate computes (-Ofast
> -march=rv64gc_zba_zbb_zbs, no V in that build).
Yea, you'll take a big hit from those spills.  Good to get a 
confirmation that you're seeing it too.

The fix should be pretty simple.  We just turn on -fsched-pressure in 
the RV backend.
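
Something like this in riscv_option_override would do it (untested
sketch; the OPTION_SET_P guard lets an explicit user setting still
win):

      /* Default -fsched-pressure on; it only kicks in when the
         scheduler itself is enabled.  */
      if (!OPTION_SET_P (flag_sched_pressure))
        flag_sched_pressure = 1;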


Jeff

