Hi.

I re-run the benchmarks and hopefully got the same profit.
I also compared the leela's code and figured out the reason.

Actually, my and Manolis's patches do the same thing. The difference is
only execution order.
Because of f-m-o held after the register allocation it cannot eliminate
redundant move 'sp' to another register.

Here is an example.

int core_bench_state(int *ptr) {
>    int final_counts[100] = {0};

   while (*ptr) {
>      int id = foo();
>      final_counts[id]++;
>      ptr++;
>    }

   return final_counts[0];
> }


For this loop, the f-m-o pass generates the following.

.L3:

call foo
* mv a5,sp*
sh2add a0,a0,a5
lw a5,0(a0)
lw a4,4(s0)
addi s0,s0,4
addiw a5,a5,1
sw a5,0(a0)
bne a4,zero,.L3


Here '*mv a5, sp*' instruction is redundant.
Leela's FastState::try_move() function has a loop that iterates over 1.3 B
times and contains 5 memory folding cases (5 redundant moves).

Besides that, I have checked the build failure on x264_r. It is already
fixed on the third version.

On Sat, Jul 15, 2023 at 10:16 AM Jeff Law <jeffreyalaw@gmail.com> wrote:

>
>
> On 7/12/23 14:59, Jivan Hakobyan via Gcc-patches wrote:
> > Accessing local arrays element turned into load form (fp + (index <<
> > C1)) + C2 address. In the case when access is in the loop we got loop
> > invariant computation. For some reason, moving out that part cannot
> > be done in loop-invariant passes. But we can handle that in
> > target-specific hook (legitimize_address). That provides an
> > opportunity to rewrite memory access more suitable for the target
> > architecture.
> >
> > This patch solves the mentioned case by rewriting mentioned case to
> > ((fp + C2) + (index << C1)) I have evaluated it on SPEC2017 and got
> > an improvement on leela (over 7b instructions, .39% of the dynamic
> > count) and dwarfs the regression for gcc (14m instructions, .0012% of
> > the dynamic count).
> >
> >
> > gcc/ChangeLog: * config/riscv/riscv.cc (riscv_legitimize_address):
> > Handle folding. (mem_shadd_or_shadd_rtx_p): New predicate.
> So I still need to give the new version a review.  But a high level
> question -- did you re-run the benchmarks with this version to verify
> that we still saw the same nice improvement in leela?
>
> The reason I ask is when I use this on Ventana's internal tree I don't
> see any notable differences in the dynamic instruction counts.  And
> probably the most critical difference between the upstream tree and
> Ventana's tree in this space is Ventana's internal tree has an earlier
> version of the fold-mem-offsets work from Manolis.
>
> It may ultimately be the case that this work and Manolis's f-m-o patch
> have a lot of overlap in terms of their final effect on code generation.
>   Manolis's pass runs much later (after register allocation), so it's
> not going to address the loop-invariant-code-motion issue that
> originally got us looking into this space.  But his pass is generic
> enough that it helps other targets.  So we may ultimately want both.
>
> Anyway, just wanted to verify if this variant is still showing the nice
> improvement on leela that the prior version did.
>
> Jeff
>
> ps.  I know you're on PTO.  No rush on responding -- enjoy the time off.
>
>

-- 
With the best regards
Jivan Hakobyan