* Followup on PR/109279: large constants on RISCV
@ 2023-06-02 2:38 Vineet Gupta
2023-06-07 23:44 ` Jeff Law
0 siblings, 1 reply; 4+ messages in thread
From: Vineet Gupta @ 2023-06-02 2:38 UTC (permalink / raw)
To: Jeff Law
Cc: gcc, Kito Cheng, Palmer Dabbelt, gnu-toolchain, pinskia, GCC Patches
Hi Jeff,
I finally got around to collecting various observations on PR/109279 -
more importantly the state of large constants in RV backend, apologies
in advance for the long email.
It seems the various commits in area have improved the original test
case of 0x1010101_01010101
Before 2e886eef7f2b | With 2e886eef7f2b | With
0530254413f8 | With c104ef4b5eb1
(const pool) | define_insn_and_split | "splitter relaxed
new |
| "*mvconst_internal" | pseudos" |
lui a5,%hi(.LANCHOR0) | li a0,0x01010000 | li a5,0x01010000
| li a5,0x01010000
ld a0,%lo(.LANCHOR0)(a5) | addi a0,0x0101 | addi
a5,a5,0x0101 | addi a5,a5,0x0101
ret | slli a0,a0,16 | mv a0,a5
| slli a0,a5,32
| addi a0,a0,0x0101 | slli
a5,a5,32 | add a0,a0,a5
| slli a0,a0,16 | add a0,a5,a0 |
| addi a0,a0,0x0101 |
ret |
| ret |
But same commits seem to have regressed Andrew's test from same PR
(which is the theme of this email).
The seemingly contrived test turned out to be much more than I'd hoped for.
long long f(void)
{
unsigned t = 0x101_0101;
long long t1 = t;
long long t2 = ((unsigned long long )t) << 32;
asm("":"+r"(t1));
return t1 | t2;
}
Before 2e886eef7f2b | With 2e886eef7f2b | With 0530254413f8
(ideal code) | define_insn_and_split | "splitter relaxed new
| | pseudos"
li a0,0x1010000 | li a5,0x1010000 | li a0,0x101_0000
addi a0,a0,0x101 | addi a5,a5,0x101 | addi a0,a0,0x101
slli a5,a0,32 | mv a0,a5 | li a5,0x101_0000
or a0,a0,a5 | slli a5,a5,32 | slli a0,a0,32
ret | or a0,a0,a5 | addi a5,a5,0x101
| ret | or a0,a5,a0
| ret
As a baseline, RTL just before cse1 (in 260r.dfinit) in all of above is:
# lower word
(insn 6 2 7 2 (set (reg:DI 138)
(const_int [0x1010000])) {*movdi_64bit}
(insn 7 6 8 2 (set (reg:DI 137)
(plus:DI (reg:DI 138)
(const_int [0x101]))) {adddi3}
(expr_list:REG_EQUAL (const_int [0x1010101]) )
(insn 5 8 9 2 (set (reg/v:DI 134 [ t1 ])
(reg:DI 136 [ t1 ])) {*movdi_64bit}
# upper word (created independently)
(insn 9 5 10 2 (set (reg:DI 141)
(const_int [0x1010000])) {*movdi_64bit}
(insn 10 9 11 2 (set (reg:DI 142)
(plus:DI (reg:DI 141)
(const_int [0x101]))) {adddi3}
(insn 11 10 12 2 (set (reg:DI 140)
(ashift:DI (reg:DI 142)
(const_int 32 [0x20]))) {ashldi3}
(expr_list:REG_EQUAL (const_int [0x101010100000000]))
# stitch them
(insn 12 11 13 2 (set (reg:DI 139)
(ior:DI (reg/v:DI 134 [ t1 ])
(reg:DI 140))) "const2.c":7:13 99 {iordi3}
Prior to 2e886eef7f2b, cse1 could do its job: finding oldest equivalent
registers for the fragments of const and reusing the reg.
(insn 7 6 8 2 (set (reg:DI 137)
(plus:DI (reg:DI 138)
(const_int [0x101]))) {adddi3}
(expr_list:REG_EQUAL (const_int [0x1010101])))
[...]
(insn 11 10 12 2 (set (reg:DI 140)
(ashift:DI (reg:DI 137)
^^^^^ OLD EQUIV REG
(const_int 32 [0x20]))) {ashldi3}
(expr_list:REG_EQUAL (const_int [0x1010101_00000000])))
With 2e886eef7f2b, define_insn_and_split "*mvconst_internal" recog()
kicks in during cse1, eliding insns for a const_int.
(insn 7 6 8 2 (set (reg:DI 137)
(const_int [0x1010101])) {*mvconst_internal}
(expr_list:REG_EQUAL (const_int [0x1010101])))
[...]
(insn 11 10 12 2 (set (reg:DI 140)
(const_int [0x1010101_00000000])) {*mvconst_internal}
(expr_list:REG_EQUAL (const_int [0x1010101_00000000]) ))
Eventually split1 breaks it up using same mvconst_internal splitter, but
the cse opportunity has been lost.
*This is a now a baseline for large consts handling for RV backend which
we all need to be aware of*.
(2) Now on to the nuances as to why things get progressively worse after
commit 0530254413f8.
It all seems to get down to register allocation passes:
sched1 before 0530254413f8
;; 0--> b 0: i 22 r140=0x1010000 :alu
;; 1--> b 0: i 20 r137=0x1010000 :alu
;; 2--> b 0: i 23 r140=r140+0x101 :alu
;; 3--> b 0: i 21 r137=r137+0x101 :alu
;; 4--> b 0: i 24 r140=r140<<0x20 :alu
;; 5--> b 0: i 25 r136=r137 :alu
;; 6--> b 0: i 8 r136=asm_operands :nothing
;; 7--> b 0: i 17 a0=r136|r140 :alu
;; 8--> b 0: i 18 use a0 :nothing
sched1 with 0530254413f8
;; 0--> b 0: i 22 r144=0x1010000 :alu
;; 1--> b 0: i 20 r143=0x1010000 :alu
;; 2--> b 0: i 23 r145=r144+0x101 :alu
;; 3--> b 0: i 21 r137=r143+0x101 :alu
;; 4--> b 0: i 24 r140=r145<<0x20 :alu
;; 5--> b 0: i 25 r136=r137 :alu
;; 6--> b 0: i 8 r136=asm_operands :nothing
;; 7--> b 0: i 17 a0=r136|r140 :alu
;; 8--> b 0: i 18 use a0 :nothing
The insn stream is same, only differences being registers reuse (due to
splitter restriction) vs. not.
Next IRA, for reasons I don't understand (and not brave enough yet to
dive into) decides to regenerate const_int.
And my guess is, this being so late in the game that it gets
rematerialized as is in the end, causing the duplicity.
FWIW, IRA for latter case only, emits additional REG_EQUIV notes which
could also be playing a role.
deleting insn with uid = 22.
deleting insn with uid = 21.
(insn 20 12 23 2 (set (reg:DI 143)
(const_int [0x1010000])) {*movdi_64bit}
(expr_list:REG_EQUIV (const_int [0x1010000])))
(insn 23 20 24 2 (set (reg:DI 145)
(const_int [0x1010101])) *{*mvconst_internal})*
^^^^^^^^^^^
(insn 24 23 25 2 (set (reg:DI 140)
(ashift:DI (reg:DI 145)
(const_int 32 [0x20]))) {ashldi3}
(expr_list:REG_DEAD (reg:DI 145)
(expr_list:REG_EQUIV (const_int [0x1010101_00000000]))))
(insn 25 24 8 2 (set (reg:DI 136 [ t1 ])
(plus:DI (reg:DI 143)
(const_int [0x101]))) {adddi3}
(expr_list:REG_DEAD (reg:DI 143)))
(insn 17 8 18 2 (set (reg/i:DI 10 a0)
(ior:DI (reg:DI 136 [ t1 ])
(reg:DI 140))) {iordi3}
(expr_list:REG_DEAD (reg:DI 140)
(expr_list:REG_DEAD (reg:DI 136 [ t1 ]) )))
I naively tried to gate mvconst_internal to !reload_completed, but that
triggered some ICE.
It seems wrong anyways since post reload split2 also sometimes needs to
invokes the mvconst_internal splitter (afaikr for the original
0x01010... test)
Do you have any ideas/directions on how to tackle this ?
Thx,
-Vineet
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: Followup on PR/109279: large constants on RISCV
2023-06-02 2:38 Followup on PR/109279: large constants on RISCV Vineet Gupta
@ 2023-06-07 23:44 ` Jeff Law
2023-06-12 19:32 ` Vineet Gupta
0 siblings, 1 reply; 4+ messages in thread
From: Jeff Law @ 2023-06-07 23:44 UTC (permalink / raw)
To: Vineet Gupta
Cc: gcc, Kito Cheng, Palmer Dabbelt, gnu-toolchain, pinskia, GCC Patches
On 6/1/23 20:38, Vineet Gupta wrote:
> Hi Jeff,
>
> I finally got around to collecting various observations on PR/109279 -
> more importantly the state of large constants in RV backend, apologies
> in advance for the long email.
>
> It seems the various commits in area have improved the original test
> case of 0x1010101_01010101
>
> Before 2e886eef7f2b | With 2e886eef7f2b | With
> 0530254413f8 | With c104ef4b5eb1
Right. The handling of that constant shows a nice progression. On our
architecture the latter two versions are probably equivalent from a
latency standpoint, but the last is obviously best as it's smaller and
probably better on in-order architectures as well.
>
> But same commits seem to have regressed Andrew's test from same PR
> (which is the theme of this email).
> The seemingly contrived test turned out to be much more than I'd hoped for.
>
> long long f(void)
> {
> unsigned t = 0x101_0101;
> long long t1 = t;
> long long t2 = ((unsigned long long )t) << 32;
> asm("":"+r"(t1));
> return t1 | t2;
> }
[ ... ]
It may be more instructions, but I suspect they end up being the same
performance for us across all three varaints. Fusion and out-of-order
execution save the day. But I realize there may be targets where the
first is going to be preferred.
>
> Before 2e886eef7f2b | With 2e886eef7f2b | With 0530254413f8
> (ideal code) | define_insn_and_split | "splitter relaxed new
> | | pseudos"
> li a0,0x1010000 | li a5,0x1010000 | li a0,0x101_0000
> addi a0,a0,0x101 | addi a5,a5,0x101 | addi a0,a0,0x101
> slli a5,a0,32 | mv a0,a5 | li a5,0x101_0000
> or a0,a0,a5 | slli a5,a5,32 | slli a0,a0,32
> ret | or a0,a0,a5 | addi a5,a5,0x101
> | ret | or a0,a5,a0
> | ret
>
> As a baseline, RTL just before cse1 (in 260r.dfinit) in all of above is:
[ ... ]
Right. Standard looking synthesis.
>
>
> Prior to 2e886eef7f2b, cse1 could do its job: finding oldest equivalent
> registers for the fragments of const and reusing the reg.
Right. That's what I would expect.
[ ... ]
>
>
> With 2e886eef7f2b, define_insn_and_split "*mvconst_internal" recog()
> kicks in during cse1, eliding insns for a const_int.
>
> (insn 7 6 8 2 (set (reg:DI 137)
> (const_int [0x1010101])) {*mvconst_internal}
> (expr_list:REG_EQUAL (const_int [0x1010101])))
> [...]
>
> (insn 11 10 12 2 (set (reg:DI 140)
> (const_int [0x1010101_00000000])) {*mvconst_internal}
> (expr_list:REG_EQUAL (const_int [0x1010101_00000000]) ))
Understood. Not ideal, but we generally don't have good ways to limit
patterns to being available at different times during the optimization
phase. One thing you might want to try (which I thought we used at one
point) was make the pattern conditional on cse_not_expected. The goal
would be to avoid exposing the pattern until a later point in the
optimizer pipeline. It may have been the case that we dropped that over
time during development. It's all getting fuzzy at this point.
>
> Eventually split1 breaks it up using same mvconst_internal splitter, but
> the cse opportunity has been lost.
Right. I'd have to look at the pass definitions, but I suspect the
splitting pass where this happens is after the last standard CSE pass.
So we don't get a chance to CSE the constant synthesis.
> *This is a now a baseline for large consts handling for RV backend which
> we all need to be aware of*.
Understood. Though it's not as bad as you might think :-) You can
spend an inordinate amount of time improving constant synthesis,
generate code that looks really good, but in the end it may not make a
bit of different in real performance. Been there, done that. I'm not
saying we give up, but we need to keep in mind that we're often better
off trading a bit on the constant synthesis if doing so helps code where
those constants get used.
>
>
> (2) Now on to the nuances as to why things get progressively worse after
> commit 0530254413f8.
>
> It all seems to get down to register allocation passes:
>
> sched1 before 0530254413f8
>
> ;; 0--> b 0: i 22 r140=0x1010000 :alu
> ;; 1--> b 0: i 20 r137=0x1010000 :alu
> ;; 2--> b 0: i 23 r140=r140+0x101 :alu
> ;; 3--> b 0: i 21 r137=r137+0x101 :alu
> ;; 4--> b 0: i 24 r140=r140<<0x20 :alu
> ;; 5--> b 0: i 25 r136=r137 :alu
> ;; 6--> b 0: i 8 r136=asm_operands :nothing
> ;; 7--> b 0: i 17 a0=r136|r140 :alu
> ;; 8--> b 0: i 18 use a0 :nothing
>
> sched1 with 0530254413f8
>
> ;; 0--> b 0: i 22 r144=0x1010000 :alu
> ;; 1--> b 0: i 20 r143=0x1010000 :alu
> ;; 2--> b 0: i 23 r145=r144+0x101 :alu
> ;; 3--> b 0: i 21 r137=r143+0x101 :alu
> ;; 4--> b 0: i 24 r140=r145<<0x20 :alu
> ;; 5--> b 0: i 25 r136=r137 :alu
> ;; 6--> b 0: i 8 r136=asm_operands :nothing
> ;; 7--> b 0: i 17 a0=r136|r140 :alu
> ;; 8--> b 0: i 18 use a0 :nothing
>
> The insn stream is same, only differences being registers reuse (due to
> splitter restriction) vs. not.
>
> Next IRA, for reasons I don't understand (and not brave enough yet to
> dive into) decides to regenerate const_int.
Sure. It's pretty standard practice. When it finds a register that has
a known re-synthesizable value it will often replace the register with
the value. It can help in cases where register pressure it excessively
high by reducing the range of the register holding the value.
> And my guess is, this being so late in the game that it gets
> rematerialized as is in the end, causing the duplicity.
Yup. Though there is a post-reload CSE pass. It's pretty limited in
what it can do and register assignments often make it impossible to do
anything, but it's worth looking at to see if why it's not helping here.
> FWIW, IRA for latter case only, emits additional REG_EQUIV notes which
> could also be playing a role.
REG_EQUAL notes get promoted to REG_EQUIV notes in some cases. And when
other equivalences are discovered it may create a REG_EQUIV note out of
thin air.
The REG_EQUIV note essentially means that everywhere the register occurs
you can validly (from a program semantics standpoint) replace the
register with the value. It might require reloading, but it's a valid
semantic transformation which may reduce register pressure -- especially
for constants that were subject to LICM.
Contrast to REG_EQUAL which creates an equivalence at a particular point
in the IL, but the equivalence may not hold elsewhere in the IL.
>
> I naively tried to gate mvconst_internal to !reload_completed, but that
> triggered some ICE.
I wouldn't expect that to help here.
I would first start to see if using cse_not_expected in the splitter
pattern. I would also look at reload_cse_regs which should give us
some chance at seeing the value reuse if/when IRA/LRA muck things up.
jeff
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: Followup on PR/109279: large constants on RISCV
2023-06-07 23:44 ` Jeff Law
@ 2023-06-12 19:32 ` Vineet Gupta
2023-06-17 22:13 ` Jeff Law
0 siblings, 1 reply; 4+ messages in thread
From: Vineet Gupta @ 2023-06-12 19:32 UTC (permalink / raw)
To: Jeff Law
Cc: gcc, Kito Cheng, Palmer Dabbelt, gnu-toolchain, pinskia, GCC Patches
Hi Jeff,
Thx for the detailed explanation and insight.
On 6/7/23 16:44, Jeff Law wrote:
>> With 2e886eef7f2b, define_insn_and_split "*mvconst_internal" recog()
>> kicks in during cse1, eliding insns for a const_int.
>>
>> (insn 7 6 8 2 (set (reg:DI 137)
>> (const_int [0x1010101])) {*mvconst_internal}
>> (expr_list:REG_EQUAL (const_int [0x1010101])))
>> [...]
>>
>> (insn 11 10 12 2 (set (reg:DI 140)
>> (const_int [0x1010101_00000000])) {*mvconst_internal}
>> (expr_list:REG_EQUAL (const_int [0x1010101_00000000]) ))
> Understood. Not ideal, but we generally don't have good ways to limit
> patterns to being available at different times during the optimization
> phase. One thing you might want to try (which I thought we used at
> one point) was make the pattern conditional on cse_not_expected. The
> goal would be to avoid exposing the pattern until a later point in the
> optimizer pipeline. It may have been the case that we dropped that
> over time during development. It's all getting fuzzy at this point.
Gave this a try and it seems to fix Andrew's test, but then regresses
the actual large const case: 0x1010101_01010101 : the mem to const_int
transformation was being done in cse1 which no longer happens and the
const pool from initial expand remains all the way into asm generated. I
don't think we want to go back to that state
>
>>
>> Eventually split1 breaks it up using same mvconst_internal splitter,
>> but the cse opportunity has been lost.
> Right. I'd have to look at the pass definitions, but I suspect the
> splitting pass where this happens is after the last standard CSE pass.
> So we don't get a chance to CSE the constant synthesis.
Yep split1 and friends happen after cse1 and cse2. At -O2 gcse doesn't
kick in and if forced to, it is currently limited in what it can do more
so given this is post reload.
>
>> *This is a now a baseline for large consts handling for RV backend
>> which we all need to be aware of*.
> Understood. Though it's not as bad as you might think :-) You can
> spend an inordinate amount of time improving constant synthesis,
> generate code that looks really good, but in the end it may not make a
> bit of different in real performance. Been there, done that. I'm not
> saying we give up, but we need to keep in mind that we're often better
> off trading a bit on the constant synthesis if doing so helps code
> where those constants get used.
Understood :-) I was coming to same realization and this seems like a
good segway into switching topic and investigating post reload gcse for
Const Rematerialization, another pesky issue with RV and likely to have
bigger impact across a whole bunch of workloads.
>> FWIW, IRA for latter case only, emits additional REG_EQUIV notes
>> which could also be playing a role.
> REG_EQUAL notes get promoted to REG_EQUIV notes in some cases. And
> when other equivalences are discovered it may create a REG_EQUIV note
> out of thin air.
>
> The REG_EQUIV note essentially means that everywhere the register
> occurs you can validly (from a program semantics standpoint) replace
> the register with the value. It might require reloading, but it's a
> valid semantic transformation which may reduce register pressure --
> especially for constants that were subject to LICM.
>
> Contrast to REG_EQUAL which creates an equivalence at a particular
> point in the IL, but the equivalence may not hold elsewhere in the IL.
Ok. From reading gccint it seems REG_EQUIV is a stronger form of
equivalence and seems to be prefered by post reload passes, while
REG_EQUAL is more of use in pre-reload.
> I would also look at reload_cse_regs which should give us some
> chance at seeing the value reuse if/when IRA/LRA muck things up.
I'll be out of office for the rest of week, will look into this once I'm
back.
Thx,
-Vineet
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: Followup on PR/109279: large constants on RISCV
2023-06-12 19:32 ` Vineet Gupta
@ 2023-06-17 22:13 ` Jeff Law
0 siblings, 0 replies; 4+ messages in thread
From: Jeff Law @ 2023-06-17 22:13 UTC (permalink / raw)
To: Vineet Gupta
Cc: gcc, Kito Cheng, Palmer Dabbelt, gnu-toolchain, pinskia, GCC Patches
On 6/12/23 13:32, Vineet Gupta wrote:
>
> Gave this a try and it seems to fix Andrew's test, but then regresses
> the actual large const case: 0x1010101_01010101 : the mem to const_int
> transformation was being done in cse1 which no longer happens and the
> const pool from initial expand remains all the way into asm generated. I
> don't think we want to go back to that state
Was that with just -O1/-Og? I have vague memories of seeing that kind
of behavior. If it was just with -O1 that might be a reasonable tradeoff.
>
> Ok. From reading gccint it seems REG_EQUIV is a stronger form of
> equivalence and seems to be prefered by post reload passes, while
> REG_EQUAL is more of use in pre-reload.
That's a reasonable way to look at it. In fact it's the allocators that
promote REG_EQUAL to REG_EQUIV when its's safe to do so.
>
>
>> I would also look at reload_cse_regs which should give us some
>> chance at seeing the value reuse if/when IRA/LRA muck things up.
>
> I'll be out of office for the rest of week, will look into this once I'm
> back.
NP.
jeff
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2023-06-17 22:13 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-02 2:38 Followup on PR/109279: large constants on RISCV Vineet Gupta
2023-06-07 23:44 ` Jeff Law
2023-06-12 19:32 ` Vineet Gupta
2023-06-17 22:13 ` Jeff Law
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).