On 7/12/23 14:59, Jivan Hakobyan via Gcc-patches wrote:
> Accessing a local array element turns into a load whose address has
> the form (fp + (index << C1)) + C2.  When the access is inside a
> loop, that address contains a loop-invariant computation, but for
> some reason the loop-invariant motion passes cannot move it out.
> We can, however, handle it in the target-specific hook
> (legitimize_address), which provides an opportunity to rewrite the
> memory access into a form more suitable for the target architecture.
>
> This patch addresses the case above by rewriting the address to
> ((fp + C2) + (index << C1)).  I have evaluated it on SPEC2017 and got
> an improvement on leela (over 7b instructions, .39% of the dynamic
> count), which dwarfs the regression on gcc (14m instructions, .0012%
> of the dynamic count).
>
>
> gcc/ChangeLog:
> 	* config/riscv/riscv.cc (riscv_legitimize_address): Handle folding.
> 	(mem_shadd_or_shadd_rtx_p): New predicate.

So I poked a bit more in this space today.  As you may have noted,
Manolis's patch still needs another rev.  But I was able to test this
patch in conjunction with the f-m-o patch as well as the additional
improvements made to hard register cprop.

The net result was that this patch still shows a nice decrease in
instruction counts on leela.  It's a bit of a mixed bag elsewhere.

I dove a bit deeper into the small regression in x264.  In the case I
looked at, the patch regresses because the original form of the address
calculation exposes a common subexpression, i.e.

  addr1 = (reg1 << 2) + fp + C1
  addr2 = (reg1 << 2) + fp + C2

(reg1 << 2) + fp is a common subexpression, resulting in something like
this as we leave CSE:

  t = (reg1 << 2) + fp;
  addr1 = t + C1
  addr2 = t + C2
  mem (addr1)
  mem (addr2)

C1 and C2 are small constants, so combine generates

  t = (reg1 << 2) + fp;
  mem (t + C1)
  mem (t + C2)

FP elimination occurs after IRA and we get:

  t2 = sp + C3
  t = (reg << 2) + t2
  mem (t + C1)
  mem (t + C2)

Not bad.  Manolis's work should allow us to improve that a bit more.

With this patch we don't capture the CSE and ultimately generate
slightly worse code.  This kind of issue is fairly inherent in
reassociation, and given that the regression is two orders of magnitude
smaller than the improvement, my inclination is to go forward with this
patch.

I've fixed a few formatting issues, changed one conditional to use
CONST_INT_P rather than checking the code directly, and pushed the
final version to the trunk.

Thanks for your patience.

jeff
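
P.S. To make the address form above concrete, here is a small,
hypothetical C function (the function and names are mine, not taken
from the patch or from SPEC) whose array access produces that shape.
The index is loaded from memory rather than derived from the induction
variable, so the access cannot be strength-reduced into a pointer walk:

  /* buf lives in the frame at fp + C2, so each buf[idx[i] & 255]
     access computes the address (fp + (index << 2)) + C2.  The
     fp + C2 part is loop-invariant, but it never appears as a
     subexpression under this association, so the loop-invariant
     motion passes have nothing to hoist.  */
  int
  sum_indexed (const int *idx, int n)
  {
    int buf[256];

    for (int i = 0; i < 256; i++)
      buf[i] = i;

    int s = 0;
    for (int i = 0; i < n; i++)
      s += buf[idx[i] & 255];
    return s;
  }

After the rewrite the address is ((fp + C2) + (index << 2)): fp + C2 is
now a hoistable subexpression, leaving just the shift and a single add
in the loop, a shape that can presumably also map onto the Zba shNadd
instructions (hence, I assume, the new mem_shadd_or_shadd_rtx_p
predicate).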