public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* RISC-V: Folding memory for FP + constant case
@ 2023-07-12 20:59 Jivan Hakobyan
  2023-07-15  6:16 ` Jeff Law
  2023-08-09 19:31 ` Jeff Law
  0 siblings, 2 replies; 21+ messages in thread
From: Jivan Hakobyan @ 2023-07-12 20:59 UTC (permalink / raw)
  To: gcc-patches


[-- Attachment #1.1: Type: text/plain, Size: 888 bytes --]

Accessing an element of a local array turns into a load from an address
of the form (fp + (index << C1)) + C2.
When the access happens inside a loop, part of that address computation
is loop-invariant, but for some reason it cannot be hoisted by the
loop-invariant motion passes.
We can, however, handle it in the target-specific hook
(legitimize_address), which provides an opportunity to rewrite the
memory access into a form better suited to the target architecture.

This patch handles that case by rewriting the address as
((fp + C2) + (index << C1)).
I have evaluated it on SPEC2017 and got an improvement on leela (over 7B
instructions, 0.39% of the dynamic count), which dwarfs the regression
on gcc (14M instructions, 0.0012% of the dynamic count).


gcc/ChangeLog:
        * config/riscv/riscv.cc (riscv_legitimize_address): Handle folding.
        (mem_shadd_or_shadd_rtx_p): New predicate.


-- 
With the best regards
Jivan Hakobyan

[-- Attachment #2: legitimize_address.diff --]
[-- Type: text/x-patch, Size: 2530 bytes --]

diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index e4dc8115e696ed44affe6ee8b51d635fe0eaaa33..2a7e464b855ec45f1fce4daec36d84842f3f3ea4 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -1754,6 +1754,22 @@ riscv_shorten_lw_offset (rtx base, HOST_WIDE_INT offset)
   return addr;
 }
 
+/* Helper for riscv_legitimize_address. Given X, return true if it
+   is a left shift by 1, 2 or 3 positions or a multiply by 2, 4 or 8.
+
+   These respectively represent canonical shift-add rtxs or scaled
+   memory addresses.  */
+static bool
+mem_shadd_or_shadd_rtx_p (rtx x)
+{
+  return ((GET_CODE (x) == ASHIFT
+           || GET_CODE (x) == MULT)
+          && GET_CODE (XEXP (x, 1)) == CONST_INT
+          && ((GET_CODE (x) == ASHIFT && IN_RANGE (INTVAL (XEXP (x, 1)), 1, 3))
+              || (GET_CODE (x) == MULT
+                  && IN_RANGE (exact_log2 (INTVAL (XEXP (x, 1))), 1, 3))));
+}
+
 /* This function is used to implement LEGITIMIZE_ADDRESS.  If X can
    be legitimized in a way that the generic machinery might not expect,
    return a new address, otherwise return NULL.  MODE is the mode of
@@ -1779,6 +1795,33 @@ riscv_legitimize_address (rtx x, rtx oldx ATTRIBUTE_UNUSED,
       rtx base = XEXP (x, 0);
       HOST_WIDE_INT offset = INTVAL (XEXP (x, 1));
 
+      /* Handle (plus (plus (mult (a) (mem_shadd_constant)) (fp)) (C)) case.  */
+      if (GET_CODE (base) == PLUS && mem_shadd_or_shadd_rtx_p (XEXP (base, 0))
+          && SMALL_OPERAND (offset))
+        {
+
+          rtx index = XEXP (base, 0);
+          rtx fp = XEXP (base, 1);
+          if (REGNO (fp) == VIRTUAL_STACK_VARS_REGNUM)
+            {
+
+              /* If we were given a MULT, we must fix the constant
+                 as we're going to create the ASHIFT form.  */
+              int shift_val = INTVAL (XEXP (index, 1));
+              if (GET_CODE (index) == MULT)
+                shift_val = exact_log2 (shift_val);
+
+              rtx reg1 = gen_reg_rtx (Pmode);
+              rtx reg2 = gen_reg_rtx (Pmode);
+              rtx reg3 = gen_reg_rtx (Pmode);
+              riscv_emit_binary (PLUS, reg1, fp, GEN_INT (offset));
+              riscv_emit_binary (ASHIFT, reg2, XEXP (index, 0), GEN_INT (shift_val));
+              riscv_emit_binary (PLUS, reg3, reg2, reg1);
+
+              return reg3;
+            }
+        }
+
       if (!riscv_valid_base_register_p (base, mode, false))
 	base = copy_to_mode_reg (Pmode, base);
       if (optimize_function_for_size_p (cfun)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RISC-V: Folding memory for FP + constant case
  2023-07-12 20:59 RISC-V: Folding memory for FP + constant case Jivan Hakobyan
@ 2023-07-15  6:16 ` Jeff Law
  2023-07-25 11:24   ` Jivan Hakobyan
  2023-08-09 19:31 ` Jeff Law
  1 sibling, 1 reply; 21+ messages in thread
From: Jeff Law @ 2023-07-15  6:16 UTC (permalink / raw)
  To: Jivan Hakobyan, gcc-patches



On 7/12/23 14:59, Jivan Hakobyan via Gcc-patches wrote:
> Accessing local arrays element turned into load form (fp + (index <<
> C1)) + C2 address. In the case when access is in the loop we got loop
> invariant computation. For some reason, moving out that part cannot
> be done in loop-invariant passes. But we can handle that in
> target-specific hook (legitimize_address). That provides an
> opportunity to rewrite memory access more suitable for the target
> architecture.
> 
> This patch solves the mentioned case by rewriting mentioned case to
> ((fp + C2) + (index << C1)) I have evaluated it on SPEC2017 and got
> an improvement on leela (over 7b instructions, .39% of the dynamic
> count) and dwarfs the regression for gcc (14m instructions, .0012% of
> the dynamic count).
> 
> 
> gcc/ChangeLog: * config/riscv/riscv.cc (riscv_legitimize_address):
> Handle folding. (mem_shadd_or_shadd_rtx_p): New predicate.
So I still need to give the new version a review.  But a high-level 
question -- did you re-run the benchmarks with this version to verify 
that we still saw the same nice improvement in leela?

The reason I ask is when I use this on Ventana's internal tree I don't 
see any notable differences in the dynamic instruction counts.  And 
probably the most critical difference between the upstream tree and 
Ventana's tree in this space is Ventana's internal tree has an earlier 
version of the fold-mem-offsets work from Manolis.

It may ultimately be the case that this work and Manolis's f-m-o patch 
have a lot of overlap in terms of their final effect on code generation. 
  Manolis's pass runs much later (after register allocation), so it's 
not going to address the loop-invariant-code-motion issue that 
originally got us looking into this space.  But his pass is generic 
enough that it helps other targets.  So we may ultimately want both.

Anyway, just wanted to verify if this variant is still showing the nice 
improvement on leela that the prior version did.

Jeff

ps.  I know you're on PTO.  No rush on responding -- enjoy the time off.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RISC-V: Folding memory for FP + constant case
  2023-07-15  6:16 ` Jeff Law
@ 2023-07-25 11:24   ` Jivan Hakobyan
  2023-07-26  3:31     ` Jeff Law
  0 siblings, 1 reply; 21+ messages in thread
From: Jivan Hakobyan @ 2023-07-25 11:24 UTC (permalink / raw)
  To: Jeff Law; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 3367 bytes --]

Hi.

I re-ran the benchmarks and got the same improvement as before.
I also compared leela's generated code and figured out the reason.

Actually, my patch and Manolis's patch do the same thing; the difference
is only the order of execution.
Because f-m-o runs after register allocation, it cannot eliminate the
redundant move of 'sp' into another register.

Here is an example:

int core_bench_state(int *ptr) {
    int final_counts[100] = {0};

    while (*ptr) {
        int id = foo();
        final_counts[id]++;
        ptr++;
    }

    return final_counts[0];
}


For this loop, the f-m-o pass generates the following.

.L3:
        call    foo
        mv      a5,sp
        sh2add  a0,a0,a5
        lw      a5,0(a0)
        lw      a4,4(s0)
        addi    s0,s0,4
        addiw   a5,a5,1
        sw      a5,0(a0)
        bne     a4,zero,.L3


Here the 'mv a5,sp' instruction is redundant.
Leela's FastState::try_move() function has a loop that iterates 1.3B
times and contains 5 memory-folding cases (5 redundant moves).

Besides that, I have checked the build failure on x264_r. It is already
fixed in the third version.

On Sat, Jul 15, 2023 at 10:16 AM Jeff Law <jeffreyalaw@gmail.com> wrote:

>
>
> On 7/12/23 14:59, Jivan Hakobyan via Gcc-patches wrote:
> > Accessing local arrays element turned into load form (fp + (index <<
> > C1)) + C2 address. In the case when access is in the loop we got loop
> > invariant computation. For some reason, moving out that part cannot
> > be done in loop-invariant passes. But we can handle that in
> > target-specific hook (legitimize_address). That provides an
> > opportunity to rewrite memory access more suitable for the target
> > architecture.
> >
> > This patch solves the mentioned case by rewriting mentioned case to
> > ((fp + C2) + (index << C1)) I have evaluated it on SPEC2017 and got
> > an improvement on leela (over 7b instructions, .39% of the dynamic
> > count) and dwarfs the regression for gcc (14m instructions, .0012% of
> > the dynamic count).
> >
> >
> > gcc/ChangeLog: * config/riscv/riscv.cc (riscv_legitimize_address):
> > Handle folding. (mem_shadd_or_shadd_rtx_p): New predicate.
> So I still need to give the new version a review.  But a high level
> question -- did you re-run the benchmarks with this version to verify
> that we still saw the same nice improvement in leela?
>
> The reason I ask is when I use this on Ventana's internal tree I don't
> see any notable differences in the dynamic instruction counts.  And
> probably the most critical difference between the upstream tree and
> Ventana's tree in this space is Ventana's internal tree has an earlier
> version of the fold-mem-offsets work from Manolis.
>
> It may ultimately be the case that this work and Manolis's f-m-o patch
> have a lot of overlap in terms of their final effect on code generation.
>   Manolis's pass runs much later (after register allocation), so it's
> not going to address the loop-invariant-code-motion issue that
> originally got us looking into this space.  But his pass is generic
> enough that it helps other targets.  So we may ultimately want both.
>
> Anyway, just wanted to verify if this variant is still showing the nice
> improvement on leela that the prior version did.
>
> Jeff
>
> ps.  I know you're on PTO.  No rush on responding -- enjoy the time off.
>
>

-- 
With the best regards
Jivan Hakobyan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RISC-V: Folding memory for FP + constant case
  2023-07-25 11:24   ` Jivan Hakobyan
@ 2023-07-26  3:31     ` Jeff Law
  2023-08-01 19:14       ` Vineet Gupta
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff Law @ 2023-07-26  3:31 UTC (permalink / raw)
  To: Jivan Hakobyan; +Cc: gcc-patches



On 7/25/23 05:24, Jivan Hakobyan wrote:
> Hi.
> 
> I re-run the benchmarks and hopefully got the same profit.
> I also compared the leela's code and figured out the reason.
> 
> Actually, my and Manolis's patches do the same thing. The difference is 
> only execution order.
But shouldn't your patch also allow for at least the potential to pull 
the fp+offset computation out of a loop?  I'm pretty sure Manolis's 
patch can't do that.

> Because of f-m-o held after the register allocation it cannot eliminate 
> redundant move 'sp' to another register.
Actually that's supposed to be handled by a different patch that should 
already be upstream.  Specifically;

> commit 6a2e8dcbbd4bab374b27abea375bf7a921047800
> Author: Manolis Tsamis <manolis.tsamis@vrull.eu>
> Date:   Thu May 25 13:44:41 2023 +0200
> 
>     cprop_hardreg: Enable propagation of the stack pointer if possible
>     
>     Propagation of the stack pointer in cprop_hardreg is currenty
>     forbidden in all cases, due to maybe_mode_change returning NULL.
>     Relax this restriction and allow propagation when no mode change is
>     requested.
>     
>     gcc/ChangeLog:
>     
>             * regcprop.cc (maybe_mode_change): Enable stack pointer
>             propagation.
I think there were a couple of follow-ups.  But that's the key change that 
should allow propagation of copies from the stack pointer and thus 
eliminate the mov gpr,sp instructions.  If that's not happening, then 
it's worth investigating why.

> 
> Besides that, I have checked the build failure on x264_r. It is already 
> fixed on the third version.
Yea, this was a problem with re-recognition.  I think it was fixed by:

> commit ecfa870ff29d979bd2c3d411643b551f2b6915b0
> Author: Vineet Gupta <vineetg@rivosinc.com>
> Date:   Thu Jul 20 11:15:37 2023 -0700
> 
>     RISC-V: optim const DF +0.0 store to mem [PR/110748]
>     
>     Fixes: ef85d150b5963 ("RISC-V: Enable TARGET_SUPPORTS_WIDE_INT")
>     
>     DF +0.0 is bitwise all zeros so int x0 store to mem can be used to optimize it.
[ ... ]


So I think the big question WRT your patch is does it still help the 
case where we weren't pulling the fp+offset computation out of a loop.

Jeff

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RISC-V: Folding memory for FP + constant case
  2023-07-26  3:31     ` Jeff Law
@ 2023-08-01 19:14       ` Vineet Gupta
  2023-08-01 20:16         ` Jivan Hakobyan
  2023-08-01 21:55         ` Jeff Law
  0 siblings, 2 replies; 21+ messages in thread
From: Vineet Gupta @ 2023-08-01 19:14 UTC (permalink / raw)
  To: Jeff Law, Jivan Hakobyan; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2508 bytes --]



On 7/25/23 20:31, Jeff Law via Gcc-patches wrote:
>
>
> On 7/25/23 05:24, Jivan Hakobyan wrote:
>> Hi.
>>
>> I re-run the benchmarks and hopefully got the same profit.
>> I also compared the leela's code and figured out the reason.
>>
>> Actually, my and Manolis's patches do the same thing. The difference 
>> is only execution order.
> But shouldn't your patch also allow for for at the last the potential 
> to pull the fp+offset computation out of a loop?  I'm pretty sure 
> Manolis's patch can't do that.
>
>> Because of f-m-o held after the register allocation it cannot 
>> eliminate redundant move 'sp' to another register.
> Actually that's supposed to be handled by a different patch that 
> should already be upstream.  Specifically;
>
>> commit 6a2e8dcbbd4bab374b27abea375bf7a921047800
>> Author: Manolis Tsamis <manolis.tsamis@vrull.eu>
>> Date:   Thu May 25 13:44:41 2023 +0200
>>
>>     cprop_hardreg: Enable propagation of the stack pointer if possible
>>         Propagation of the stack pointer in cprop_hardreg is currenty
>>     forbidden in all cases, due to maybe_mode_change returning NULL.
>>     Relax this restriction and allow propagation when no mode change is
>>     requested.
>>         gcc/ChangeLog:
>>                 * regcprop.cc (maybe_mode_change): Enable stack pointer
>>             propagation.
> I think there were a couple-follow-ups.  But that's the key change 
> that should allow propagation of copies from the stack pointer and 
> thus eliminate the mov gpr,sp instructions.  If that's not happening, 
> then it's worth investigating why.
>
>>
>> Besides that, I have checked the build failure on x264_r. It is 
>> already fixed on the third version.
> Yea, this was a problem with re-recognition.  I think it was fixed by:
>
>> commit ecfa870ff29d979bd2c3d411643b551f2b6915b0
>> Author: Vineet Gupta <vineetg@rivosinc.com>
>> Date:   Thu Jul 20 11:15:37 2023 -0700
>>
>>     RISC-V: optim const DF +0.0 store to mem [PR/110748]
>>         Fixes: ef85d150b5963 ("RISC-V: Enable TARGET_SUPPORTS_WIDE_INT")
>>         DF +0.0 is bitwise all zeros so int x0 store to mem can be 
>> used to optimize it.
> [ ... ]
>
>
> So I think the big question WRT your patch is does it still help the 
> case where we weren't pulling the fp+offset computation out of a loop.

I have some numbers for f-m-o v3 vs. this patch, attached here (rather 
than inline, to avoid Thunderbird mangling the table formatting).

[-- Attachment #2: fmo-vs-fold-fp --]
[-- Type: text/plain, Size: 2698 bytes --]

benchmark	workload #	upstream	upstream +	delta	upstream +	delta
				(g54e54f77c1)	f-m-o			fold-fp-off

500.perlbench_r	0		1217932817476	1217884553366	0.004%	1217928953834	0.000%
		1		743757241201	743655528133	0.014%	743695820426	0.008%
		2		703455646090	703423559298	0.005%	703455296251	0.000%
502.gcc_r	0		195004369104	194973478945	0.016%	194984188400	0.010%
		1		232719938778	232688491113	0.014%	232692379085	0.012%
		2		223443280459	223413616368	0.013%	223424151848	0.009%
		3		186233704624	186206516421	0.015%	186231137616	0.001%
		4		287406394232	287378870279	0.010%	287403707466	0.001%
503.bwaves_r	0		316194043679	316194043679	0.000%	316194043662	0.000%
		1		499293490380	499293490380	0.000%	499293490363	0.000%
		2		389365401615	389365401615	0.000%	389365401598	0.000%
		3		473514310679	473514310679	0.000%	473514310662	0.000%
505.mcf_r	0		689258694902	689254740344	0.001%	689258694887	0.000%
507.cactuBSSN_r	0		3966612364613	3966498234698	0.003%	3966612365068	0.000%
508.namd_r	0		1903766272166	1903766271701	0.000%	1903765987301	0.000%
510.parest_r	0		3512678127316	3512676752062	0.000%	3512677505662	0.008%
511.povray_r	0		3036725558618	3036722265149	0.000%	3036725556997	0.000%
519.lbm_r	0		1134454304533	1134454304533	0.000%	1134454304518	0.000%
520.omnetpp_r	0		1001937885126	1001937884542	0.000%	1001937883931	0.000%
521.wrf_r	0		3959642601629	3959541912013	0.003%	3959642615086	0.000%
523.xalancbmk_r	0		1065004269065	1064981413043	0.002%	1065004132070	0.000%
525.x264_r	0		496492857533	496459367582	0.007%	496477988435	0.003%
		1		1891248078083	1891222197535	0.001%	1890990911614	0.014%
		2		1815609267498	1815561397105	0.003%	1815341248007	0.015%
526.blender_r	0		1672203767444	1671549923427	0.039%	1672224626743  -0.001%
527.cam4_r	0		2326424925038	2320567166886	0.252%	2326333566227	0.004% <-
531.deepsjeng_r	0		1668993359340	1662816376544	0.370%	1668993353038	0.000% <-
538.imagick_r	0		3260965672876	3260965672712	0.000%	3260965672777	0.000%
541.leela_r	0		2034139863891	2034101807341	0.002%	2026647843672	0.368%    <--
544.nab_r	0		1566465507272	1565420628706	0.067%	1566465379674	0.000%
548.exchange2_r	0		2228112071994	2228109962469	0.000%	2228114278251	0.000%
549.fotonik3d_r	0		2255238867247	2255238867246	0.000%	2255238865924	0.000%
554.roms_r	0		2653150555486	2651884870455	0.048%	2653150554877	0.000%
557.xz_r	0		367892301169	367892301167	0.000%	367892301154	0.000%
		1		979549393200	979549393198	0.000%	979549393185	0.000%
		2		525066235331	525066235329	0.000%	525066235316	0.000%
997.specrand_fr	0		453112389	453112389	0.000%	453112374	0.000%
999.specrand_ir	0		453112389	453112389	0.000%	453112374	0.000%

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RISC-V: Folding memory for FP + constant case
  2023-08-01 19:14       ` Vineet Gupta
@ 2023-08-01 20:16         ` Jivan Hakobyan
  2023-08-01 21:55         ` Jeff Law
  1 sibling, 0 replies; 21+ messages in thread
From: Jivan Hakobyan @ 2023-08-01 20:16 UTC (permalink / raw)
  To: Vineet Gupta; +Cc: Jeff Law, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2844 bytes --]

Thank you for your effort.
I had evaluated it only on the intrate tests.
I am glad to see the same result on leela.

On Tue, Aug 1, 2023 at 11:14 PM Vineet Gupta <vineetg@rivosinc.com> wrote:

>
>
> On 7/25/23 20:31, Jeff Law via Gcc-patches wrote:
> >
> >
> > On 7/25/23 05:24, Jivan Hakobyan wrote:
> >> Hi.
> >>
> >> I re-run the benchmarks and hopefully got the same profit.
> >> I also compared the leela's code and figured out the reason.
> >>
> >> Actually, my and Manolis's patches do the same thing. The difference
> >> is only execution order.
> > But shouldn't your patch also allow for for at the last the potential
> > to pull the fp+offset computation out of a loop?  I'm pretty sure
> > Manolis's patch can't do that.
> >
> >> Because of f-m-o held after the register allocation it cannot
> >> eliminate redundant move 'sp' to another register.
> > Actually that's supposed to be handled by a different patch that
> > should already be upstream.  Specifically;
> >
> >> commit 6a2e8dcbbd4bab374b27abea375bf7a921047800
> >> Author: Manolis Tsamis <manolis.tsamis@vrull.eu>
> >> Date:   Thu May 25 13:44:41 2023 +0200
> >>
> >>     cprop_hardreg: Enable propagation of the stack pointer if possible
> >>         Propagation of the stack pointer in cprop_hardreg is currenty
> >>     forbidden in all cases, due to maybe_mode_change returning NULL.
> >>     Relax this restriction and allow propagation when no mode change is
> >>     requested.
> >>         gcc/ChangeLog:
> >>                 * regcprop.cc (maybe_mode_change): Enable stack pointer
> >>             propagation.
> > I think there were a couple-follow-ups.  But that's the key change
> > that should allow propagation of copies from the stack pointer and
> > thus eliminate the mov gpr,sp instructions.  If that's not happening,
> > then it's worth investigating why.
> >
> >>
> >> Besides that, I have checked the build failure on x264_r. It is
> >> already fixed on the third version.
> > Yea, this was a problem with re-recognition.  I think it was fixed by:
> >
> >> commit ecfa870ff29d979bd2c3d411643b551f2b6915b0
> >> Author: Vineet Gupta <vineetg@rivosinc.com>
> >> Date:   Thu Jul 20 11:15:37 2023 -0700
> >>
> >>     RISC-V: optim const DF +0.0 store to mem [PR/110748]
> >>         Fixes: ef85d150b5963 ("RISC-V: Enable TARGET_SUPPORTS_WIDE_INT")
> >>         DF +0.0 is bitwise all zeros so int x0 store to mem can be
> >> used to optimize it.
> > [ ... ]
> >
> >
> > So I think the big question WRT your patch is does it still help the
> > case where we weren't pulling the fp+offset computation out of a loop.
>
> I have some numbers for f-m-o v3 vs this. Attached here (vs. inline to
> avoid the Thunderbird mangling the test formatting)
>


-- 
With the best regards
Jivan Hakobyan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RISC-V: Folding memory for FP + constant case
  2023-08-01 19:14       ` Vineet Gupta
  2023-08-01 20:16         ` Jivan Hakobyan
@ 2023-08-01 21:55         ` Jeff Law
  2023-08-01 22:07           ` Philipp Tomsich
  1 sibling, 1 reply; 21+ messages in thread
From: Jeff Law @ 2023-08-01 21:55 UTC (permalink / raw)
  To: Vineet Gupta, Jivan Hakobyan; +Cc: gcc-patches



On 8/1/23 13:14, Vineet Gupta wrote:

> 
> I have some numbers for f-m-o v3 vs this. Attached here (vs. inline to 
> avoid the Thunderbird mangling the test formatting)
Thanks.  Of particular importance is the leela change.  My recollection 
was that the f-m-o work also picked up that case.  But if my memory is 
faulty (always a possibility), then that shows a clear case where 
Jivan's work picks up a case not handled by Manolis's work.

And in the other direction, we can see that deepsjeng isn't helped by 
Jivan's work, but is helped by Manolis's new pass.

I'd always hoped/expected we'd have cases where one patch clearly helped 
over the other.  While the 0.25% to 0.37% improvements for the three most 
impacted benchmarks don't move the needle much across the whole suite, 
they do add up over time.

Jeff

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RISC-V: Folding memory for FP + constant case
  2023-08-01 21:55         ` Jeff Law
@ 2023-08-01 22:07           ` Philipp Tomsich
  2023-08-01 23:03             ` Vineet Gupta
  0 siblings, 1 reply; 21+ messages in thread
From: Philipp Tomsich @ 2023-08-01 22:07 UTC (permalink / raw)
  To: Jeff Law, Manolis Tsamis; +Cc: Vineet Gupta, Jivan Hakobyan, gcc-patches

+Manolis Tsamis

On Tue, 1 Aug 2023 at 23:56, Jeff Law via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
>
>
> On 8/1/23 13:14, Vineet Gupta wrote:
>
> >
> > I have some numbers for f-m-o v3 vs this. Attached here (vs. inline to
> > avoid the Thunderbird mangling the test formatting)
> Thanks.  Of particular importance is the leela change.  My recollection
> was that the f-m-o work also picked up that case.  But if my memory is
> faulty (always a possibility), then that shows a clear case where
> Jivan's work picks up a case not handled by Manolis's work.

f-m-o originally targeted (and benefited) the leela case.  I wonder if
other optimizations/changes over the last year interfere with it and
what needs to change to accommodate this... looks like we need to
revisit against trunk.

Philipp.

> And on the other direction we can see that deepsjeng isn't helped by
> Jivan's work, but is helped by Manolis's new pass.
>
> I'd always hoped/expected we'd have cases where one patch clearly helped
> over the other.  While the .25% to .37% improvements for the three most
> impacted benchmarks doesn't move the needle much across the whole suite
> they do add up over time.
>
> Jeff

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RISC-V: Folding memory for FP + constant case
  2023-08-01 22:07           ` Philipp Tomsich
@ 2023-08-01 23:03             ` Vineet Gupta
  2023-08-01 23:06               ` Philipp Tomsich
  2023-08-01 23:21               ` Jeff Law
  0 siblings, 2 replies; 21+ messages in thread
From: Vineet Gupta @ 2023-08-01 23:03 UTC (permalink / raw)
  To: Philipp Tomsich, Jeff Law, Manolis Tsamis; +Cc: Jivan Hakobyan, gcc-patches



On 8/1/23 15:07, Philipp Tomsich wrote:
> +Manolis Tsamis
>
> On Tue, 1 Aug 2023 at 23:56, Jeff Law via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>>
>>
>> On 8/1/23 13:14, Vineet Gupta wrote:
>>
>>> I have some numbers for f-m-o v3 vs this. Attached here (vs. inline to
>>> avoid the Thunderbird mangling the test formatting)
>> Thanks.  Of particular importance is the leela change.  My recollection
>> was that the f-m-o work also picked up that case.  But if my memory is
>> faulty (always a possibility), then that shows a clear case where
>> Jivan's work picks up a case not handled by Manolis's work.
> f-m-o originally targeted (and benefited) the leela-case.  I wonder if
> other optimizations/changes over the last year interfere with this and
> what needs to be changed to accomodate this... looks like we need to
> revisit against trunk.
>
> Philipp.
>
>> And on the other direction we can see that deepsjeng isn't helped by
>> Jivan's work, but is helped by Manolis's new pass.
>>
>> I'd always hoped/expected we'd have cases where one patch clearly helped
>> over the other.  While the .25% to .37% improvements for the three most
>> impacted benchmarks doesn't move the needle much across the whole suite
>> they do add up over time.
>>
>> Jeff

I took a quick look at Leela; the significant difference comes from 
additional insns where SP is not getting propagated.

e.g.

    231b6:    mv    a4,sp
    231b8:    sh2add    a5,a5,a4

vs.

    1e824:    sh2add    a5,a5,sp

There are 5 such instances which more or less make up for the delta.

-Vineet


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RISC-V: Folding memory for FP + constant case
  2023-08-01 23:03             ` Vineet Gupta
@ 2023-08-01 23:06               ` Philipp Tomsich
  2023-08-01 23:13                 ` Vineet Gupta
  2023-08-01 23:22                 ` Jeff Law
  2023-08-01 23:21               ` Jeff Law
  1 sibling, 2 replies; 21+ messages in thread
From: Philipp Tomsich @ 2023-08-01 23:06 UTC (permalink / raw)
  To: Vineet Gupta; +Cc: Jeff Law, Manolis Tsamis, Jivan Hakobyan, gcc-patches

Very helpful! Looks as if regprop for stack_pointer is now either too
conservative — or one of our patches is missing in everyone's test
setup; we'll take a closer look.

On Wed, 2 Aug 2023 at 01:03, Vineet Gupta <vineetg@rivosinc.com> wrote:
>
>
>
> On 8/1/23 15:07, Philipp Tomsich wrote:
> > +Manolis Tsamis
> >
> > On Tue, 1 Aug 2023 at 23:56, Jeff Law via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> >>
> >>
> >> On 8/1/23 13:14, Vineet Gupta wrote:
> >>
> >>> I have some numbers for f-m-o v3 vs this. Attached here (vs. inline to
> >>> avoid the Thunderbird mangling the test formatting)
> >> Thanks.  Of particular importance is the leela change.  My recollection
> >> was that the f-m-o work also picked up that case.  But if my memory is
> >> faulty (always a possibility), then that shows a clear case where
> >> Jivan's work picks up a case not handled by Manolis's work.
> > f-m-o originally targeted (and benefited) the leela-case.  I wonder if
> > other optimizations/changes over the last year interfere with this and
> > what needs to be changed to accomodate this... looks like we need to
> > revisit against trunk.
> >
> > Philipp.
> >
> >> And on the other direction we can see that deepsjeng isn't helped by
> >> Jivan's work, but is helped by Manolis's new pass.
> >>
> >> I'd always hoped/expected we'd have cases where one patch clearly helped
> >> over the other.  While the .25% to .37% improvements for the three most
> >> impacted benchmarks doesn't move the needle much across the whole suite
> >> they do add up over time.
> >>
> >> Jeff
>
> I took a quick look at Leela, the significant difference is from
> additional insns with SP not getting propagated.
>
> e.g.
>
>     231b6:    mv    a4,sp
>     231b8:    sh2add    a5,a5,a4
>
> vs.
>
>     1e824:    sh2add    a5,a5,sp
>
> There are 5 such instances which more or less make up for the delta.
>
> -Vineet
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RISC-V: Folding memory for FP + constant case
  2023-08-01 23:06               ` Philipp Tomsich
@ 2023-08-01 23:13                 ` Vineet Gupta
  2023-08-01 23:27                   ` Jeff Law
  2023-08-01 23:22                 ` Jeff Law
  1 sibling, 1 reply; 21+ messages in thread
From: Vineet Gupta @ 2023-08-01 23:13 UTC (permalink / raw)
  To: Philipp Tomsich; +Cc: Jeff Law, Manolis Tsamis, Jivan Hakobyan, gcc-patches


On 8/1/23 16:06, Philipp Tomsich wrote:
> Very helpful! Looks as if regprop for stack_pointer is now either too
> conservative — or one of our patches is missing in everyone's test
> setup; we'll take a closer look.

FWIW, all 5 of them involve a SH2ADD that has SP as a source in the 
fold-FP case, for which f-m-o seems to be generating a MV.

>
> On Wed, 2 Aug 2023 at 01:03, Vineet Gupta <vineetg@rivosinc.com> wrote:
>>
>>
>> On 8/1/23 15:07, Philipp Tomsich wrote:
>>> +Manolis Tsamis
>>>
>>> On Tue, 1 Aug 2023 at 23:56, Jeff Law via Gcc-patches
>>> <gcc-patches@gcc.gnu.org> wrote:
>>>>
>>>> On 8/1/23 13:14, Vineet Gupta wrote:
>>>>
>>>>> I have some numbers for f-m-o v3 vs this. Attached here (vs. inline to
>>>>> avoid the Thunderbird mangling the test formatting)
>>>> Thanks.  Of particular importance is the leela change.  My recollection
>>>> was that the f-m-o work also picked up that case.  But if my memory is
>>>> faulty (always a possibility), then that shows a clear case where
>>>> Jivan's work picks up a case not handled by Manolis's work.
>>> f-m-o originally targeted (and benefited) the leela-case.  I wonder if
>>> other optimizations/changes over the last year interfere with this and
>>> what needs to be changed to accomodate this... looks like we need to
>>> revisit against trunk.
>>>
>>> Philipp.
>>>
>>>> And on the other direction we can see that deepsjeng isn't helped by
>>>> Jivan's work, but is helped by Manolis's new pass.
>>>>
>>>> I'd always hoped/expected we'd have cases where one patch clearly helped
>>>> over the other.  While the .25% to .37% improvements for the three most
>>>> impacted benchmarks doesn't move the needle much across the whole suite
>>>> they do add up over time.
>>>>
>>>> Jeff
>> I took a quick look at Leela, the significant difference is from
>> additional insns with SP not getting propagated.
>>
>> e.g.
>>
>>      231b6:    mv    a4,sp
>>      231b8:    sh2add    a5,a5,a4
>>
>> vs.
>>
>>      1e824:    sh2add    a5,a5,sp
>>
>> There are 5 such instances which more or less make up for the delta.
>>
>> -Vineet
>>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RISC-V: Folding memory for FP + constant case
  2023-08-01 23:03             ` Vineet Gupta
  2023-08-01 23:06               ` Philipp Tomsich
@ 2023-08-01 23:21               ` Jeff Law
  1 sibling, 0 replies; 21+ messages in thread
From: Jeff Law @ 2023-08-01 23:21 UTC (permalink / raw)
  To: Vineet Gupta, Philipp Tomsich, Manolis Tsamis; +Cc: Jivan Hakobyan, gcc-patches



On 8/1/23 17:03, Vineet Gupta wrote:
> 
> 
> On 8/1/23 15:07, Philipp Tomsich wrote:
>> +Manolis Tsamis
>>
>> On Tue, 1 Aug 2023 at 23:56, Jeff Law via Gcc-patches
>> <gcc-patches@gcc.gnu.org> wrote:
>>>
>>>
>>> On 8/1/23 13:14, Vineet Gupta wrote:
>>>
>>>> I have some numbers for f-m-o v3 vs this. Attached here (vs. inline to
>>>> avoid the Thunderbird mangling the test formatting)
>>> Thanks.  Of particular importance is the leela change.  My recollection
>>> was that the f-m-o work also picked up that case.  But if my memory is
>>> faulty (always a possibility), then that shows a clear case where
>>> Jivan's work picks up a case not handled by Manolis's work.
>> f-m-o originally targeted (and benefited) the leela-case.  I wonder if
>> other optimizations/changes over the last year interfere with this and
>> what needs to be changed to accommodate this... looks like we need to
>> revisit against trunk.
>>
>> Philipp.
>>
>>> And on the other direction we can see that deepsjeng isn't helped by
>>> Jivan's work, but is helped by Manolis's new pass.
>>>
>>> I'd always hoped/expected we'd have cases where one patch clearly helped
>>> over the other.  While the .25% to .37% improvements for the three most
>>> impacted benchmarks doesn't move the needle much across the whole suite
>>> they do add up over time.
>>>
>>> Jeff
> 
> I took a quick look at Leela, the significant difference is from 
> additional insns with SP not getting propagated.
> 
> e.g.
> 
>     231b6:    mv    a4,sp
>     231b8:    sh2add    a5,a5,a4
> 
> vs.
> 
>     1e824:    sh2add    a5,a5,sp
> 
> There are 5 such instances which more or less make up for the delta.
ACK.  Jivan and I have seen similar things with some other work in this 
space.  What's weird is the bits of Manolis's work that went in a month 
or so ago are supposed to address this exact issue in the post-reload 
const/copy propagation pass.

Jeff



* Re: RISC-V: Folding memory for FP + constant case
  2023-08-01 23:06               ` Philipp Tomsich
  2023-08-01 23:13                 ` Vineet Gupta
@ 2023-08-01 23:22                 ` Jeff Law
  2023-08-01 23:28                   ` Vineet Gupta
  1 sibling, 1 reply; 21+ messages in thread
From: Jeff Law @ 2023-08-01 23:22 UTC (permalink / raw)
  To: Philipp Tomsich, Vineet Gupta; +Cc: Manolis Tsamis, Jivan Hakobyan, gcc-patches



On 8/1/23 17:06, Philipp Tomsich wrote:
> Very helpful! Looks as if regcprop for stack_pointer is now either too
> conservative — or one of our patches is missing in everyone's test
> setup; we'll take a closer look.
They must not be working as expected or folks are using old trees. 
Manolis's work for regcprop has been on the trunk for about 5-6 weeks at 
this point:


> commit 893883f2f8f56984209c6ed210ee992ff71a14b0
> Author: Manolis Tsamis <manolis.tsamis@vrull.eu>
> Date:   Tue Jun 20 16:23:52 2023 +0200
> 
>     cprop_hardreg: fix ORIGINAL_REGNO/REG_ATTRS/REG_POINTER handling
>     
>     Fixes: 6a2e8dcbbd4bab3
>     
>     Propagation for the stack pointer in regcprop was enabled in
>     6a2e8dcbbd4bab3, but set ORIGINAL_REGNO/REG_ATTRS/REG_POINTER for
>     stack_pointer_rtx which caused regression (e.g., PR 110313, PR 110308).
>     
>     This fix adds special handling for stack_pointer_rtx in the places
>     where maybe_mode_change is called. This also adds a check in
>     maybe_mode_change to return the stack pointer only when the requested
>     mode matches the mode of stack_pointer_rtx.
>     
>             PR debug/110308
>     
>     gcc/ChangeLog:
>     
>             * regcprop.cc (maybe_mode_change): Check stack_pointer_rtx mode.
>             (maybe_copy_reg_attrs): New function.
>             (find_oldest_value_reg): Use maybe_copy_reg_attrs.
>             (copyprop_hardreg_forward_1): Ditto.
>     
>     gcc/testsuite/ChangeLog:
>     
>             * g++.dg/torture/pr110308.C: New test.
>     
>     Signed-off-by: Manolis Tsamis <manolis.tsamis@vrull.eu>
>     Signed-off-by: Philipp Tomsich <philipp.tomsich@vrull.eu>
> 
> commit 6a2e8dcbbd4bab374b27abea375bf7a921047800
> Author: Manolis Tsamis <manolis.tsamis@vrull.eu>
> Date:   Thu May 25 13:44:41 2023 +0200
> 
>     cprop_hardreg: Enable propagation of the stack pointer if possible
>     
>     Propagation of the stack pointer in cprop_hardreg is currently
>     forbidden in all cases, due to maybe_mode_change returning NULL.
>     Relax this restriction and allow propagation when no mode change is
>     requested.
>     
>     gcc/ChangeLog:
>     
>             * regcprop.cc (maybe_mode_change): Enable stack pointer
>             propagation.


* Re: RISC-V: Folding memory for FP + constant case
  2023-08-01 23:13                 ` Vineet Gupta
@ 2023-08-01 23:27                   ` Jeff Law
  2023-08-01 23:38                     ` Vineet Gupta
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff Law @ 2023-08-01 23:27 UTC (permalink / raw)
  To: Vineet Gupta, Philipp Tomsich; +Cc: Manolis Tsamis, Jivan Hakobyan, gcc-patches



On 8/1/23 17:13, Vineet Gupta wrote:
> 
> On 8/1/23 16:06, Philipp Tomsich wrote:
>> Very helpful! Looks as if regcprop for stack_pointer is now either too
>> conservative — or one of our patches is missing in everyone's test
>> setup; we'll take a closer look.
> 
> FWIW, all 5 of them involve a SH2ADD having SP as source in the fold FP 
> case which f-m-o seems to be generating a MV for.
To clarify f-m-o isn't generating the mv.  It's simplifying a sequence 
by pushing the constant in an addi instruction into the memory 
reference.  As a result the addi simplifies into a sp->reg copy that is 
supposed to then be propagated away.

Also note that getting FP out of the shift-add sequences is the other 
key goal of Jivan's work.  FP elimination always results in a 
spill/reload if we have a shift-add insn where one operand is FP.



jeff


* Re: RISC-V: Folding memory for FP + constant case
  2023-08-01 23:22                 ` Jeff Law
@ 2023-08-01 23:28                   ` Vineet Gupta
  0 siblings, 0 replies; 21+ messages in thread
From: Vineet Gupta @ 2023-08-01 23:28 UTC (permalink / raw)
  To: Jeff Law, Philipp Tomsich; +Cc: Manolis Tsamis, Jivan Hakobyan, gcc-patches



On 8/1/23 16:22, Jeff Law wrote:
> They must not be working as expected or folks are using old trees. 
> Manolis's work for regcprop has been on the trunk for about 5-6 weeks 
> at this point:

I have a bleeding-edge trunk from 2-3 days back.  I think we are looking 
for the following, which the tree does have.

2023-05-25 6a2e8dcbbd4b cprop_hardreg: Enable propagation of the stack 
pointer if possible

-Vineet


* Re: RISC-V: Folding memory for FP + constant case
  2023-08-01 23:27                   ` Jeff Law
@ 2023-08-01 23:38                     ` Vineet Gupta
  2023-08-01 23:52                       ` Jeff Law
  0 siblings, 1 reply; 21+ messages in thread
From: Vineet Gupta @ 2023-08-01 23:38 UTC (permalink / raw)
  To: Jeff Law, Philipp Tomsich; +Cc: Manolis Tsamis, Jivan Hakobyan, gcc-patches



On 8/1/23 16:27, Jeff Law wrote:
>
>
> On 8/1/23 17:13, Vineet Gupta wrote:
>>
>> On 8/1/23 16:06, Philipp Tomsich wrote:
>>> Very helpful! Looks as if regcprop for stack_pointer is now either too
>>> conservative — or one of our patches is missing in everyone's test
>>> setup; we'll take a closer look.
>>
>> FWIW, all 5 of them involve a SH2ADD having SP as source in the fold FP 
>> case which f-m-o seems to be generating a MV for.
> To clarify f-m-o isn't generating the mv.  It's simplifying a sequence 
> by pushing the constant in an addi instruction into the memory 
> reference.  As a result the addi simplifies into a sp->reg copy that 
> is supposed to then be propagated away.

Yep, that's clear.

>
> Also note that getting FP out of the shift-add sequences is the other 
> key goal of Jivan's work.  FP elimination always results in a 
> spill/reload if we have a shift-add insn where one operand is FP. 

Hmm, are you saying it should NOT be generating shift-add with SP as 
src, because currently that's exactly what fold FP offset *is* doing and 
is the reason it has 5 fewer insns.

-Vineet




* Re: RISC-V: Folding memory for FP + constant case
  2023-08-01 23:38                     ` Vineet Gupta
@ 2023-08-01 23:52                       ` Jeff Law
  2023-08-04  9:52                         ` Manolis Tsamis
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff Law @ 2023-08-01 23:52 UTC (permalink / raw)
  To: Vineet Gupta, Philipp Tomsich; +Cc: Manolis Tsamis, Jivan Hakobyan, gcc-patches



On 8/1/23 17:38, Vineet Gupta wrote:
>>
>> Also note that getting FP out of the shift-add sequences is the other 
>> key goal of Jivan's work.  FP elimination always results in a 
>> spill/reload if we have a shift-add insn where one operand is FP. 
> 
> Hmm, are you saying it should NOT be generating shift-add with SP as 
> src, because currently that's exactly what fold FP offset *is* doing and 
> is the reason it has 5 fewer insns.
We should not have shift-add with FP as a source prior to register 
allocation because it will almost always generate spill code.


jeff


* Re: RISC-V: Folding memory for FP + constant case
  2023-08-01 23:52                       ` Jeff Law
@ 2023-08-04  9:52                         ` Manolis Tsamis
  2023-08-04 16:23                           ` Jeff Law
  0 siblings, 1 reply; 21+ messages in thread
From: Manolis Tsamis @ 2023-08-04  9:52 UTC (permalink / raw)
  To: Jeff Law; +Cc: Vineet Gupta, Philipp Tomsich, Jivan Hakobyan, gcc-patches

Hi all,

It is true that regcprop currently does not propagate sp and hence
leela is not optimized, but from what I see this should be something
we can address.

The reason that the propagation fails is this check that I have added
when I introduced maybe_copy_reg_attrs:

else if (REG_POINTER (new_reg) != REG_POINTER (old_reg))
  {
    /* Only a single instance of STACK_POINTER_RTX must exist and we cannot
       modify it. Allow propagation if REG_POINTER for OLD_REG matches and
       don't touch ORIGINAL_REGNO and REG_ATTRS. */
    return NULL_RTX;
  }

To be honest I did add this back then just to be on the safe side of
whether a mismatch in REG_POINTER after propagation would be an issue
(since the original regcprop had caused enough problems).

I see two ways to solve this and make fmo able to optimize leela as well:
 1) Remove the REG_POINTER check in regcprop if that is safe. My
understanding is that REG_POINTER is used as a hint and there would be
no correctness issues.
 2) Mark the corresponding registers with REG_POINTER. I'm not sure
where that is supposed to happen.

Since the instructions look like this:
  (insn 113 11 16 2 (set (reg:DI 15 a5 [226])
          (reg/f:DI 2 sp)) 179 {*movdi_64bit}
       (nil))

I assume that we'd want to mark a5 as REG_POINTER anyway (which it is
not), and in that case propagation would work.
On the other hand if there's no correctness issue w.r.t. REG_POINTER
and regcprop then removing the additional check would increase
propagation opportunities in general which is also good.

Thanks,
Manolis

On Wed, Aug 2, 2023 at 2:52 AM Jeff Law <jeffreyalaw@gmail.com> wrote:
>
>
>
> On 8/1/23 17:38, Vineet Gupta wrote:
> >>
> >> Also note that getting FP out of the shift-add sequences is the other
> >> key goal of Jivan's work.  FP elimination always results in a
> >> spill/reload if we have a shift-add insn where one operand is FP.
> >
> > Hmm, are you saying it should NOT be generating shift-add with SP as
> > src, because currently that's exactly what fold FP offset *is* doing and
> > is the reason it has 5 fewer insns.
> We should not have shift-add with FP as a source prior to register
> allocation because it will almost always generate spill code.
>
>
> jeff


* Re: RISC-V: Folding memory for FP + constant case
  2023-08-04  9:52                         ` Manolis Tsamis
@ 2023-08-04 16:23                           ` Jeff Law
  2023-08-05  9:27                             ` Manolis Tsamis
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff Law @ 2023-08-04 16:23 UTC (permalink / raw)
  To: Manolis Tsamis; +Cc: Vineet Gupta, Philipp Tomsich, Jivan Hakobyan, gcc-patches



On 8/4/23 03:52, Manolis Tsamis wrote:
> Hi all,
> 
> It is true that regcprop currently does not propagate sp and hence
> leela is not optimized, but from what I see this should be something
> we can address.
> 
> The reason that the propagation fails is this check that I have added
> when I introduced maybe_copy_reg_attrs:
> 
> else if (REG_POINTER (new_reg) != REG_POINTER (old_reg))
>    {
>      /* Only a single instance of STACK_POINTER_RTX must exist and we cannot
>         modify it. Allow propagation if REG_POINTER for OLD_REG matches and
>         don't touch ORIGINAL_REGNO and REG_ATTRS. */
>      return NULL_RTX;
>    }
> 
> To be honest I did add this back then just to be on the safe side of
> whether a mismatch in REG_POINTER after propagation would be an issue
> (since the original regcprop had caused enough problems).
No worries.  Obviously not propagating is the safe thing to do and if we 
find notable missed cases we can always refine the existing code like 
we're considering now.

> 
> I see two ways to solve this and make fmo able to optimize leela as well:
>   1) Remove the REG_POINTER check in regcprop if that is safe. My
> understanding is that REG_POINTER is used as a hint and there would be
> no correctness issues.
REG_POINTER is meant to be conservatively correct.  If REG_POINTER is 
set, then the register must be a pointer to a valid memory location.  If 
REG_POINTER is not set, then it may or may not point to a valid memory 
location.  sp, fp, ap are by definition pointers and thus have 
REG_POINTER set.

With that definition we can safely copy propagate away an sp->gpr copy 
from a REG_POINTER standpoint.

>   2) Mark the corresponding registers with REG_POINTER. I'm not sure
> where that is supposed to happen.
> 
> Since the instructions look like this:
>    (insn 113 11 16 2 (set (reg:DI 15 a5 [226])
>            (reg/f:DI 2 sp)) 179 {*movdi_64bit}
>         (nil))
> 
> I assume that we'd want to mark a5 as REG_POINTER anyway (which it is
> not), and in that case propagation would work.
I'd strongly recommend against that.  The problem is the destination 
register might have been used in another context as an index register. 
We've fixed a few bugs in this space through the years; I suspect there 
may be others lurking.  You can't directly set that up in C code, but 
once you get down to RTL it can happen.


> On the other hand if there's no correctness issue w.r.t. REG_POINTER
> and regcprop then removing the additional check would increase
> propagation opportunities in general which is also good.
I think you can safely remove this limitation.

jeff


* Re: RISC-V: Folding memory for FP + constant case
  2023-08-04 16:23                           ` Jeff Law
@ 2023-08-05  9:27                             ` Manolis Tsamis
  0 siblings, 0 replies; 21+ messages in thread
From: Manolis Tsamis @ 2023-08-05  9:27 UTC (permalink / raw)
  To: Jeff Law; +Cc: Vineet Gupta, Philipp Tomsich, Jivan Hakobyan, gcc-patches

Hi Jeff,

Thanks for all the info!
Then I'll prepare/test a patch that removes this regcprop limitation
and send it out.
I have already tested that this change is enough to make fmo optimize
leela as well.

Manolis

On Fri, Aug 4, 2023 at 7:23 PM Jeff Law <jeffreyalaw@gmail.com> wrote:
>
>
>
> On 8/4/23 03:52, Manolis Tsamis wrote:
> > Hi all,
> >
> > It is true that regcprop currently does not propagate sp and hence
> > leela is not optimized, but from what I see this should be something
> > we can address.
> >
> > The reason that the propagation fails is this check that I have added
> > when I introduced maybe_copy_reg_attrs:
> >
> > else if (REG_POINTER (new_reg) != REG_POINTER (old_reg))
> >    {
> >      /* Only a single instance of STACK_POINTER_RTX must exist and we cannot
> >         modify it. Allow propagation if REG_POINTER for OLD_REG matches and
> >         don't touch ORIGINAL_REGNO and REG_ATTRS. */
> >      return NULL_RTX;
> >    }
> >
> > To be honest I did add this back then just to be on the safe side of
> > whether a mismatch in REG_POINTER after propagation would be an issue
> > (since the original regcprop had caused enough problems).
> No worries.  Obviously not propagating is the safe thing to do and if we
> find notable missed cases we can always refine the existing code like
> we're considering now.
>
> >
> > I see two ways to solve this and make fmo able to optimize leela as well:
> >   1) Remove the REG_POINTER check in regcprop if that is safe. My
> > understanding is that REG_POINTER is used as a hint and there would be
> > no correctness issues.
> REG_POINTER is meant to be conservatively correct.  If REG_POINTER is
> set, then the register must be a pointer to a valid memory location.  If
> REG_POINTER is not set, then it may or may not point to a valid memory
> location.  sp, fp, ap are by definition pointers and thus have
> REG_POINTER set.
>
> With that definition we can safely copy propagate away an sp->gpr copy
> from a REG_POINTER standpoint.
>
> >   2) Mark the corresponding registers with REG_POINTER. I'm not sure
> > where that is supposed to happen.
> >
> > Since the instructions look like this:
> >    (insn 113 11 16 2 (set (reg:DI 15 a5 [226])
> >            (reg/f:DI 2 sp)) 179 {*movdi_64bit}
> >         (nil))
> >
> > I assume that we'd want to mark a5 as REG_POINTER anyway (which it is
> > not), and in that case propagation would work.
> I'd strongly recommend against that.  The problem is the destination
> register might have been used in another context as an index register.
> We've fixed a few bugs in this space through the years; I suspect there
> may be others lurking.  You can't directly set that up in C code, but
> once you get down to RTL it can happen.
>
>
> > On the other hand if there's no correctness issue w.r.t. REG_POINTER
> > and regcprop then removing the additional check would increase
> > propagation opportunities in general which is also good.
> I think you can safely remove this limitation.
>
> jeff


* Re: RISC-V: Folding memory for FP + constant case
  2023-07-12 20:59 RISC-V: Folding memory for FP + constant case Jivan Hakobyan
  2023-07-15  6:16 ` Jeff Law
@ 2023-08-09 19:31 ` Jeff Law
  1 sibling, 0 replies; 21+ messages in thread
From: Jeff Law @ 2023-08-09 19:31 UTC (permalink / raw)
  To: Jivan Hakobyan, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2520 bytes --]



On 7/12/23 14:59, Jivan Hakobyan via Gcc-patches wrote:
> Accessing a local array element turns into a load from address
> (fp + (index << C1)) + C2.
> When the access is inside a loop, part of that computation is loop
> invariant, yet for some reason it cannot be hoisted by the
> loop-invariant motion passes.  But we can handle it in the
> target-specific legitimize_address hook, which provides an opportunity
> to rewrite the memory access into a form more suitable for the target
> architecture.
> 
> This patch solves the mentioned case by rewriting the address to
> ((fp + C2) + (index << C1)).
> I have evaluated it on SPEC2017 and got an improvement on leela (over 7b
> instructions, .39% of the dynamic count), which dwarfs the regression
> for gcc (14m instructions, .0012% of the dynamic count).
> 
> 
> gcc/ChangeLog:
>          * config/riscv/riscv.cc (riscv_legitimize_address): Handle folding.
>          (mem_shadd_or_shadd_rtx_p): New predicate.
So I poked a bit more in this space today.

As you may have noted, Manolis's patch still needs another rev.  But I 
was able to test this patch in conjunction with the f-m-o patch as well 
as the additional improvements made to hard register cprop.  The net 
result was that this patch still shows a nice decrease in instruction 
counts on leela.  It's a bit of a mixed bag elsewhere.

I dove a bit deeper into the small regression in x264.  In the case I 
looked at, the reason the patch regresses is that the original form of the 
address calculations exposes a common subexpression, i.e.

addr1 = (reg1 << 2) + fp + C1
addr2 = (reg1 << 2) + fp + C2

(reg1 << 2) + fp is a common subexpression resulting in something like 
this as we leave CSE:

t = (reg1 << 2) + fp;
addr1 = t + C1
addr2 = t + C2
mem (addr1)
mem (addr2)

C1 and C2 are small constants, so combine generates

t = (reg1 << 2) + fp;
mem (t+C1)
mem (t+C2)

FP elimination occurs after IRA and we get:

t2 = sp + C3
t = (reg << 2) + t2
mem (t + C1)
mem (t + C2)


Not bad.  Manolis's work should allow us to improve that a bit more.


With this patch we don't capture the CSE and ultimately generate 
slightly worse code.  This kind of issue is fairly inherent in 
reassociation -- and given the regression is two orders of magnitude 
smaller than the improvement, my inclination is to go forward with this 
patch.



I've fixed a few formatting issues and changed one conditional to use 
CONST_INT_P rather than checking the code directly, and pushed the final 
version to the trunk.

Thanks for your patience.

jeff

[-- Attachment #2: P --]
[-- Type: text/plain, Size: 3381 bytes --]

commit a16dc729fda9fabd6472d50cce45791cb3b6ada8
Author: Jivan Hakobyan <jivanhakobyan9@gmail.com>
Date:   Wed Aug 9 13:26:58 2023 -0600

    RISC-V: Folding memory for FP + constant case
    
    Accessing a local array element turns into a load from address
    (fp + (index << C1)) + C2.
    
    When the access is inside a loop, part of that computation is loop
    invariant, yet for some reason it cannot be hoisted by the loop-invariant
    motion passes.  But we can handle it in the target-specific
    legitimize_address hook, which provides an opportunity to rewrite the
    memory access into a form more suitable for the target architecture.
    
    This patch solves the mentioned case by rewriting the address to
    ((fp + C2) + (index << C1)).
    
    I have evaluated it on SPEC2017 and got an improvement on leela (over 7b
    instructions, .39% of the dynamic count), which dwarfs the regression for
    gcc (14m instructions, .0012% of the dynamic count).
    
    gcc/ChangeLog:
            * config/riscv/riscv.cc (riscv_legitimize_address): Handle folding.
            (mem_shadd_or_shadd_rtx_p): New function.

diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index 77892da2920..7f2041a54ba 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -1805,6 +1805,22 @@ riscv_shorten_lw_offset (rtx base, HOST_WIDE_INT offset)
   return addr;
 }
 
+/* Helper for riscv_legitimize_address. Given X, return true if it
+   is a left shift by 1, 2 or 3 positions or a multiply by 2, 4 or 8.
+
+   This respectively represent canonical shift-add rtxs or scaled
+   memory addresses.  */
+static bool
+mem_shadd_or_shadd_rtx_p (rtx x)
+{
+  return ((GET_CODE (x) == ASHIFT
+	   || GET_CODE (x) == MULT)
+	  && CONST_INT_P (XEXP (x, 1))
+	  && ((GET_CODE (x) == ASHIFT && IN_RANGE (INTVAL (XEXP (x, 1)), 1, 3))
+	      || (GET_CODE (x) == MULT
+		  && IN_RANGE (exact_log2 (INTVAL (XEXP (x, 1))), 1, 3))));
+}
+
 /* This function is used to implement LEGITIMIZE_ADDRESS.  If X can
    be legitimized in a way that the generic machinery might not expect,
    return a new address, otherwise return NULL.  MODE is the mode of
@@ -1830,6 +1846,32 @@ riscv_legitimize_address (rtx x, rtx oldx ATTRIBUTE_UNUSED,
       rtx base = XEXP (x, 0);
       HOST_WIDE_INT offset = INTVAL (XEXP (x, 1));
 
+      /* Handle (plus (plus (mult (a) (mem_shadd_constant)) (fp)) (C)) case.  */
+      if (GET_CODE (base) == PLUS && mem_shadd_or_shadd_rtx_p (XEXP (base, 0))
+	  && SMALL_OPERAND (offset))
+	{
+	  rtx index = XEXP (base, 0);
+	  rtx fp = XEXP (base, 1);
+	  if (REGNO (fp) == VIRTUAL_STACK_VARS_REGNUM)
+	    {
+
+	      /* If we were given a MULT, we must fix the constant
+		 as we're going to create the ASHIFT form.  */
+	      int shift_val = INTVAL (XEXP (index, 1));
+	      if (GET_CODE (index) == MULT)
+		shift_val = exact_log2 (shift_val);
+
+	      rtx reg1 = gen_reg_rtx (Pmode);
+	      rtx reg2 = gen_reg_rtx (Pmode);
+	      rtx reg3 = gen_reg_rtx (Pmode);
+	      riscv_emit_binary (PLUS, reg1, fp, GEN_INT (offset));
+	      riscv_emit_binary (ASHIFT, reg2, XEXP (index, 0), GEN_INT (shift_val));
+	      riscv_emit_binary (PLUS, reg3, reg2, reg1);
+
+	      return reg3;
+	    }
+	}
+
       if (!riscv_valid_base_register_p (base, mode, false))
 	base = copy_to_mode_reg (Pmode, base);
       if (optimize_function_for_size_p (cfun)


end of thread, other threads:[~2023-08-09 19:31 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-12 20:59 RISC-V: Folding memory for FP + constant case Jivan Hakobyan
2023-07-15  6:16 ` Jeff Law
2023-07-25 11:24   ` Jivan Hakobyan
2023-07-26  3:31     ` Jeff Law
2023-08-01 19:14       ` Vineet Gupta
2023-08-01 20:16         ` Jivan Hakobyan
2023-08-01 21:55         ` Jeff Law
2023-08-01 22:07           ` Philipp Tomsich
2023-08-01 23:03             ` Vineet Gupta
2023-08-01 23:06               ` Philipp Tomsich
2023-08-01 23:13                 ` Vineet Gupta
2023-08-01 23:27                   ` Jeff Law
2023-08-01 23:38                     ` Vineet Gupta
2023-08-01 23:52                       ` Jeff Law
2023-08-04  9:52                         ` Manolis Tsamis
2023-08-04 16:23                           ` Jeff Law
2023-08-05  9:27                             ` Manolis Tsamis
2023-08-01 23:22                 ` Jeff Law
2023-08-01 23:28                   ` Vineet Gupta
2023-08-01 23:21               ` Jeff Law
2023-08-09 19:31 ` Jeff Law

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).