public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH] CPROP: Allow cprop optimization when the function has a single block
@ 2023-02-01 12:24 juzhe.zhong
  2023-02-01 12:40 ` Richard Biener
  0 siblings, 1 reply; 12+ messages in thread
From: juzhe.zhong @ 2023-02-01 12:24 UTC (permalink / raw)
  To: gcc-patches
  Cc: kito.cheng, richard.sandiford, rguenther, jeffreyalaw, apinski,
	Ju-Zhe Zhong

From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>

Hi, this patch is present for GCC 14 since I understand it's not appropriate
to land it in GCC 13.

NUM_FIXED_BLOCKS = 2 since GCC define each function has aleast 2 blocks
one is entry block, the other is exit block.
So according this code, the function will not do cprop optimization when
there is only exactly one single block.

I am not sure whether it's correct to fix it like this.
Can anyone tell me why forbid cprop optimization if the function only has s
single block ?

Let's take a look at these 2 examples of RVV intrinsics:
1.    void f1 (void * in, void *out, int64_t x, int n)
      {
      vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
      vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
      vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
      vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
      __riscv_vse64_v_i64m1 (out + 2, v4, 4);
      }

asm:
     	addi	sp,sp,-16
	sw	a2,8(sp)
	sw	a3,12(sp)
	sw	a2,0(sp)
	sw	a3,4(sp)
	addi	a5,a0,1
	vsetivli	zero,4,e64,m1,ta,ma
	addi	a0,a0,2
	vle64.v	v24,0(a5)
	addi	a5,sp,8
	vlse64.v	v27,0(a5),zero
	addi	a1,a1,2
	vsetvli	zero,zero,e64,m1,tu,ma
	vle64.v	v24,0(a0)
	vsetvli	zero,zero,e64,m1,ta,ma
	vlse64.v	v25,0(sp),zero
	vadd.vv	v26,v24,v27
	vadd.vv	v24,v26,v25
	vse64.v	v24,0(a1)
	addi	sp,sp,16
	jr	ra
you can see here there are 2 vlse64.v instructions that broadcast the scalar value "x"
GCC fail to eliminate the second vlse64.v instruction since GCC doesn't do the cprop
optimization (the function only has 1 single block). It can be optimized if we apply
this patch.

2. void f1 (void * in, void *out, int64_t x, int n)
{
    if (n) {
      vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
      vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
      vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
      vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
      __riscv_vse64_v_i64m1 (out + 2, v4, 4);
    }
}

asm:
  f1:
	vsetivli	zero,4,e64,m1,ta,ma
	beq	a4,zero,.L7
	addi	sp,sp,-16
	sw	a2,8(sp)
	sw	a3,12(sp)
	addi	a5,a0,1
	vle64.v	v24,0(a5)
	addi	a0,a0,2
	addi	a5,sp,8
	vlse64.v	v25,0(a5),zero
	addi	a1,a1,2
	vsetvli	zero,zero,e64,m1,tu,ma
	vle64.v	v24,0(a0)
	vadd.vv	v26,v24,v25
	vadd.vv	v24,v26,v25
	vse64.v	v24,0(a1)
	addi	sp,sp,16
	jr	ra
.L7:
	ret

Here, if we add if (n) condition here, the program will end up with more than 1 block.
So GCC will do the cprop optimization and the second vlse64.v instruction is eliminated.

I am not sure whether this patch is correct.
Can anyone help me with that ?
Thanks.


gcc/ChangeLog:

        * cprop.cc (one_cprop_pass): Remove +1.

---
 gcc/cprop.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/cprop.cc b/gcc/cprop.cc
index 6ec0bda4a24..615bc5078b6 100644
--- a/gcc/cprop.cc
+++ b/gcc/cprop.cc
@@ -1749,7 +1749,7 @@ one_cprop_pass (void)
   int changed = 0;
 
   /* Return if there's nothing to do, or it is too expensive.  */
-  if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS + 1
+  if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS
       || gcse_or_cprop_is_too_expensive (_ ("const/copy propagation disabled")))
     return 0;
 
-- 
2.36.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
  2023-02-01 12:24 [PATCH] CPROP: Allow cprop optimization when the function has a single block juzhe.zhong
@ 2023-02-01 12:40 ` Richard Biener
  2023-02-01 12:47   ` juzhe.zhong
  0 siblings, 1 reply; 12+ messages in thread
From: Richard Biener @ 2023-02-01 12:40 UTC (permalink / raw)
  To: Ju-Zhe Zhong
  Cc: gcc-patches, kito.cheng, richard.sandiford, jeffreyalaw, apinski

On Wed, 1 Feb 2023, juzhe.zhong@rivai.ai wrote:

> From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
> 
> Hi, this patch is present for GCC 14 since I understand it's not appropriate
> to land it in GCC 13.
> 
> NUM_FIXED_BLOCKS = 2 since GCC define each function has aleast 2 blocks
> one is entry block, the other is exit block.
> So according this code, the function will not do cprop optimization when
> there is only exactly one single block.

cprop / GCSE are global dataflow problems, there's "nothing" to do for
a single block, the local problem plus its application isn't considered
but probably left for CSE.

Why does CSE not perform the desired transform?

> I am not sure whether it's correct to fix it like this.
> Can anyone tell me why forbid cprop optimization if the function only has s
> single block ?
> 
> Let's take a look at these 2 examples of RVV intrinsics:
> 1.    void f1 (void * in, void *out, int64_t x, int n)
>       {
>       vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
>       vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
>       vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
>       vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
>       __riscv_vse64_v_i64m1 (out + 2, v4, 4);
>       }
> 
> asm:
>      	addi	sp,sp,-16
> 	sw	a2,8(sp)
> 	sw	a3,12(sp)
> 	sw	a2,0(sp)
> 	sw	a3,4(sp)
> 	addi	a5,a0,1
> 	vsetivli	zero,4,e64,m1,ta,ma
> 	addi	a0,a0,2
> 	vle64.v	v24,0(a5)
> 	addi	a5,sp,8
> 	vlse64.v	v27,0(a5),zero
> 	addi	a1,a1,2
> 	vsetvli	zero,zero,e64,m1,tu,ma
> 	vle64.v	v24,0(a0)
> 	vsetvli	zero,zero,e64,m1,ta,ma
> 	vlse64.v	v25,0(sp),zero
> 	vadd.vv	v26,v24,v27
> 	vadd.vv	v24,v26,v25
> 	vse64.v	v24,0(a1)
> 	addi	sp,sp,16
> 	jr	ra
> you can see here there are 2 vlse64.v instructions that broadcast the scalar value "x"
> GCC fail to eliminate the second vlse64.v instruction since GCC doesn't do the cprop
> optimization (the function only has 1 single block). It can be optimized if we apply
> this patch.
> 
> 2. void f1 (void * in, void *out, int64_t x, int n)
> {
>     if (n) {
>       vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
>       vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
>       vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
>       vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
>       __riscv_vse64_v_i64m1 (out + 2, v4, 4);
>     }
> }
> 
> asm:
>   f1:
> 	vsetivli	zero,4,e64,m1,ta,ma
> 	beq	a4,zero,.L7
> 	addi	sp,sp,-16
> 	sw	a2,8(sp)
> 	sw	a3,12(sp)
> 	addi	a5,a0,1
> 	vle64.v	v24,0(a5)
> 	addi	a0,a0,2
> 	addi	a5,sp,8
> 	vlse64.v	v25,0(a5),zero
> 	addi	a1,a1,2
> 	vsetvli	zero,zero,e64,m1,tu,ma
> 	vle64.v	v24,0(a0)
> 	vadd.vv	v26,v24,v25
> 	vadd.vv	v24,v26,v25
> 	vse64.v	v24,0(a1)
> 	addi	sp,sp,16
> 	jr	ra
> .L7:
> 	ret
> 
> Here, if we add if (n) condition here, the program will end up with more than 1 block.
> So GCC will do the cprop optimization and the second vlse64.v instruction is eliminated.
> 
> I am not sure whether this patch is correct.
> Can anyone help me with that ?
> Thanks.
> 
> 
> gcc/ChangeLog:
> 
>         * cprop.cc (one_cprop_pass): Remove +1.
> 
> ---
>  gcc/cprop.cc | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/gcc/cprop.cc b/gcc/cprop.cc
> index 6ec0bda4a24..615bc5078b6 100644
> --- a/gcc/cprop.cc
> +++ b/gcc/cprop.cc
> @@ -1749,7 +1749,7 @@ one_cprop_pass (void)
>    int changed = 0;
>  
>    /* Return if there's nothing to do, or it is too expensive.  */
> -  if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS + 1
> +  if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS
>        || gcse_or_cprop_is_too_expensive (_ ("const/copy propagation disabled")))
>      return 0;
>  
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
  2023-02-01 12:40 ` Richard Biener
@ 2023-02-01 12:47   ` juzhe.zhong
  2023-02-01 12:51     ` Richard Biener
  0 siblings, 1 reply; 12+ messages in thread
From: juzhe.zhong @ 2023-02-01 12:47 UTC (permalink / raw)
  To: rguenther
  Cc: gcc-patches, kito.cheng, richard.sandiford, jeffreyalaw, apinski

[-- Attachment #1: Type: text/plain, Size: 9798 bytes --]

 I don't known whether CSE do the job. What I saw is CPROP do the optimization when we have more than 1 block.

This the RTL dump before CPROP:

(insn 19 18 20 4 (set (reg:VNx1DI 151)
        (if_then_else:VNx1DI (unspec:VNx1BI [
                    (const_vector:VNx1BI repeat [
                            (const_int 1 [0x1])
                        ])
                    (const_int 4 [0x4])
                    (const_int 2 [0x2]) repeated x2
                    (const_int 0 [0])
                    (reg:SI 66 vl)
                    (reg:SI 67 vtype)
                ] UNSPEC_VPREDICATE)
            (vec_duplicate:VNx1DI (reg/v:DI 148 [ x ]))
            (unspec:VNx1DI [
                    (const_int 0 [0])
                ] UNSPEC_VUNDEF))) "rvv.c":22:23 695 {pred_broadcastvnx1di}
     (nil))
(insn 20 19 21 4 (set (reg/v:VNx1DI 139 [ v3 ])
        (if_then_else:VNx1DI (unspec:VNx1BI [
                    (const_vector:VNx1BI repeat [
                            (const_int 1 [0x1])
                        ])
                    (const_int 4 [0x4])
                    (const_int 2 [0x2]) repeated x2
                    (const_int 0 [0])
                    (reg:SI 66 vl)
                    (reg:SI 67 vtype)
                ] UNSPEC_VPREDICATE)
            (plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
                (reg:VNx1DI 151))
            (unspec:VNx1DI [
                    (const_int 0 [0])
                ] UNSPEC_VUNDEF))) "rvv.c":22:23 1528 {pred_addvnx1di}
     (expr_list:REG_DEAD (reg:VNx1DI 151)
        (nil)))
(insn 21 20 22 4 (set (reg:VNx1DI 152)
        (if_then_else:VNx1DI (unspec:VNx1BI [
                    (const_vector:VNx1BI repeat [
                            (const_int 1 [0x1])
                        ])
                    (const_int 4 [0x4])
                    (const_int 2 [0x2]) repeated x2
                    (const_int 0 [0])
                    (reg:SI 66 vl)
                    (reg:SI 67 vtype)
                ] UNSPEC_VPREDICATE)
            (vec_duplicate:VNx1DI (reg/v:DI 148 [ x ]))
            (unspec:VNx1DI [
                    (const_int 0 [0])
                ] UNSPEC_VUNDEF))) "rvv.c":23:23 695 {pred_broadcastvnx1di}
     (nil))
(insn 22 21 23 4 (set (reg/v:VNx1DI 140 [ v4 ])
        (if_then_else:VNx1DI (unspec:VNx1BI [
                    (const_vector:VNx1BI repeat [
                            (const_int 1 [0x1])
                        ])
                    (const_int 4 [0x4])
                    (const_int 0 [0])
                    (const_int 2 [0x2])
                    (const_int 0 [0])
                    (reg:SI 66 vl)
                    (reg:SI 67 vtype)
                ] UNSPEC_VPREDICATE)
            (plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
                (reg:VNx1DI 152))
            (reg/v:VNx1DI 139 [ v3 ]))) "rvv.c":23:23 1528 {pred_addvnx1di}
     (expr_list:REG_DEAD (reg:VNx1DI 152)
        (expr_list:REG_DEAD (reg/v:VNx1DI 139 [ v3 ])
            (expr_list:REG_DEAD (reg/v:VNx1DI 138 [ v2 ])
                (nil)))))

After CRPOP:
(insn 15 14 16 4 (set (reg:VNx1DI 147)
        (if_then_else:VNx1DI (unspec:VNx1BI [
                    (const_vector:VNx1BI repeat [
                            (const_int 1 [0x1])
                        ])
                    (const_int 4 [0x4])
                    (const_int 2 [0x2]) repeated x2
                    (const_int 0 [0])
                    (reg:SI 66 vl)
                    (reg:SI 67 vtype)
                ] UNSPEC_VPREDICATE)
            (vec_duplicate:VNx1DI (reg/v:DI 143 [ x ]))
            (unspec:VNx1DI [
                    (const_int 0 [0])
                ] UNSPEC_VUNDEF))) "rvv.c":11:23 695 {pred_broadcastvnx1di}
     (nil))
(insn 16 15 18 4 (set (reg/v:VNx1DI 139 [ v3 ])
        (if_then_else:VNx1DI (unspec:VNx1BI [
                    (const_vector:VNx1BI repeat [
                            (const_int 1 [0x1])
                        ])
                    (const_int 4 [0x4])
                    (const_int 2 [0x2]) repeated x2
                    (const_int 0 [0])
                    (reg:SI 66 vl)
                    (reg:SI 67 vtype)
                ] UNSPEC_VPREDICATE)
            (plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
                (reg:VNx1DI 147))
            (unspec:VNx1DI [
                    (const_int 0 [0])
                ] UNSPEC_VUNDEF))) "rvv.c":11:23 1528 {pred_addvnx1di}
     (expr_list:REG_DEAD (reg:VNx1DI 147)
        (expr_list:REG_DEAD (reg/v:VNx1DI 138 [ v2 ])
            (nil))))
(insn 18 16 19 4 (set (reg/v:VNx1DI 140 [ v4 ])
        (if_then_else:VNx1DI (unspec:VNx1BI [
                    (const_vector:VNx1BI repeat [
                            (const_int 1 [0x1])
                        ])
                    (const_int 4 [0x4])
                    (const_int 2 [0x2]) repeated x2
                    (const_int 0 [0])
                    (reg:SI 66 vl)
                    (reg:SI 67 vtype)
                ] UNSPEC_VPREDICATE)
            (plus:VNx1DI (reg/v:VNx1DI 139 [ v3 ])
                (reg:VNx1DI 147))
            (unspec:VNx1DI [
                    (const_int 0 [0])
                ] UNSPEC_VUNDEF))) "rvv.c":12:23 1528 {pred_addvnx1di}
     (expr_list:REG_DEAD (reg:VNx1DI 148)
        (expr_list:REG_DEAD (reg/v:VNx1DI 139 [ v3 ])
            (nil))))

You can see CPROP remove the second the "pred_broadcast" instruction and propagate the result to the second "pred_add" instruction。



juzhe.zhong@rivai.ai
 
From: Richard Biener
Date: 2023-02-01 20:40
To: Ju-Zhe Zhong
CC: gcc-patches; kito.cheng; richard.sandiford; jeffreyalaw; apinski
Subject: Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
On Wed, 1 Feb 2023, juzhe.zhong@rivai.ai wrote:
 
> From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
> 
> Hi, this patch is present for GCC 14 since I understand it's not appropriate
> to land it in GCC 13.
> 
> NUM_FIXED_BLOCKS = 2 since GCC define each function has aleast 2 blocks
> one is entry block, the other is exit block.
> So according this code, the function will not do cprop optimization when
> there is only exactly one single block.
 
cprop / GCSE are global dataflow problems, there's "nothing" to do for
a single block, the local problem plus its application isn't considered
but probably left for CSE.
 
Why does CSE not perform the desired transform?
 
> I am not sure whether it's correct to fix it like this.
> Can anyone tell me why forbid cprop optimization if the function only has s
> single block ?
> 
> Let's take a look at these 2 examples of RVV intrinsics:
> 1.    void f1 (void * in, void *out, int64_t x, int n)
>       {
>       vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
>       vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
>       vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
>       vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
>       __riscv_vse64_v_i64m1 (out + 2, v4, 4);
>       }
> 
> asm:
>      addi sp,sp,-16
> sw a2,8(sp)
> sw a3,12(sp)
> sw a2,0(sp)
> sw a3,4(sp)
> addi a5,a0,1
> vsetivli zero,4,e64,m1,ta,ma
> addi a0,a0,2
> vle64.v v24,0(a5)
> addi a5,sp,8
> vlse64.v v27,0(a5),zero
> addi a1,a1,2
> vsetvli zero,zero,e64,m1,tu,ma
> vle64.v v24,0(a0)
> vsetvli zero,zero,e64,m1,ta,ma
> vlse64.v v25,0(sp),zero
> vadd.vv v26,v24,v27
> vadd.vv v24,v26,v25
> vse64.v v24,0(a1)
> addi sp,sp,16
> jr ra
> you can see here there are 2 vlse64.v instructions that broadcast the scalar value "x"
> GCC fail to eliminate the second vlse64.v instruction since GCC doesn't do the cprop
> optimization (the function only has 1 single block). It can be optimized if we apply
> this patch.
> 
> 2. void f1 (void * in, void *out, int64_t x, int n)
> {
>     if (n) {
>       vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
>       vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
>       vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
>       vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
>       __riscv_vse64_v_i64m1 (out + 2, v4, 4);
>     }
> }
> 
> asm:
>   f1:
> vsetivli zero,4,e64,m1,ta,ma
> beq a4,zero,.L7
> addi sp,sp,-16
> sw a2,8(sp)
> sw a3,12(sp)
> addi a5,a0,1
> vle64.v v24,0(a5)
> addi a0,a0,2
> addi a5,sp,8
> vlse64.v v25,0(a5),zero
> addi a1,a1,2
> vsetvli zero,zero,e64,m1,tu,ma
> vle64.v v24,0(a0)
> vadd.vv v26,v24,v25
> vadd.vv v24,v26,v25
> vse64.v v24,0(a1)
> addi sp,sp,16
> jr ra
> .L7:
> ret
> 
> Here, if we add if (n) condition here, the program will end up with more than 1 block.
> So GCC will do the cprop optimization and the second vlse64.v instruction is eliminated.
> 
> I am not sure whether this patch is correct.
> Can anyone help me with that ?
> Thanks.
> 
> 
> gcc/ChangeLog:
> 
>         * cprop.cc (one_cprop_pass): Remove +1.
> 
> ---
>  gcc/cprop.cc | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/gcc/cprop.cc b/gcc/cprop.cc
> index 6ec0bda4a24..615bc5078b6 100644
> --- a/gcc/cprop.cc
> +++ b/gcc/cprop.cc
> @@ -1749,7 +1749,7 @@ one_cprop_pass (void)
>    int changed = 0;
>  
>    /* Return if there's nothing to do, or it is too expensive.  */
> -  if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS + 1
> +  if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS
>        || gcse_or_cprop_is_too_expensive (_ ("const/copy propagation disabled")))
>      return 0;
>  
> 
 
-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)
 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
  2023-02-01 12:47   ` juzhe.zhong
@ 2023-02-01 12:51     ` Richard Biener
  2023-02-01 12:56       ` juzhe.zhong
  0 siblings, 1 reply; 12+ messages in thread
From: Richard Biener @ 2023-02-01 12:51 UTC (permalink / raw)
  To: juzhe.zhong
  Cc: gcc-patches, kito.cheng, richard.sandiford, jeffreyalaw, apinski

On Wed, 1 Feb 2023, juzhe.zhong@rivai.ai wrote:

>  I don't known whether CSE do the job. What I saw is CPROP do the optimization when we have more than 1 block.
> 
> This the RTL dump before CPROP:
> 
> (insn 19 18 20 4 (set (reg:VNx1DI 151)
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 2 [0x2]) repeated x2
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (vec_duplicate:VNx1DI (reg/v:DI 148 [ x ]))
>             (unspec:VNx1DI [
>                     (const_int 0 [0])
>                 ] UNSPEC_VUNDEF))) "rvv.c":22:23 695 {pred_broadcastvnx1di}
>      (nil))
> (insn 20 19 21 4 (set (reg/v:VNx1DI 139 [ v3 ])
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 2 [0x2]) repeated x2
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
>                 (reg:VNx1DI 151))
>             (unspec:VNx1DI [
>                     (const_int 0 [0])
>                 ] UNSPEC_VUNDEF))) "rvv.c":22:23 1528 {pred_addvnx1di}
>      (expr_list:REG_DEAD (reg:VNx1DI 151)
>         (nil)))
> (insn 21 20 22 4 (set (reg:VNx1DI 152)
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 2 [0x2]) repeated x2
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (vec_duplicate:VNx1DI (reg/v:DI 148 [ x ]))
>             (unspec:VNx1DI [
>                     (const_int 0 [0])
>                 ] UNSPEC_VUNDEF))) "rvv.c":23:23 695 {pred_broadcastvnx1di}
>      (nil))
> (insn 22 21 23 4 (set (reg/v:VNx1DI 140 [ v4 ])
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 0 [0])
>                     (const_int 2 [0x2])
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
>                 (reg:VNx1DI 152))
>             (reg/v:VNx1DI 139 [ v3 ]))) "rvv.c":23:23 1528 {pred_addvnx1di}
>      (expr_list:REG_DEAD (reg:VNx1DI 152)
>         (expr_list:REG_DEAD (reg/v:VNx1DI 139 [ v3 ])
>             (expr_list:REG_DEAD (reg/v:VNx1DI 138 [ v2 ])
>                 (nil)))))
> 
> After CRPOP:
> (insn 15 14 16 4 (set (reg:VNx1DI 147)
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 2 [0x2]) repeated x2
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (vec_duplicate:VNx1DI (reg/v:DI 143 [ x ]))
>             (unspec:VNx1DI [
>                     (const_int 0 [0])
>                 ] UNSPEC_VUNDEF))) "rvv.c":11:23 695 {pred_broadcastvnx1di}
>      (nil))
> (insn 16 15 18 4 (set (reg/v:VNx1DI 139 [ v3 ])
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 2 [0x2]) repeated x2
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
>                 (reg:VNx1DI 147))
>             (unspec:VNx1DI [
>                     (const_int 0 [0])
>                 ] UNSPEC_VUNDEF))) "rvv.c":11:23 1528 {pred_addvnx1di}
>      (expr_list:REG_DEAD (reg:VNx1DI 147)
>         (expr_list:REG_DEAD (reg/v:VNx1DI 138 [ v2 ])
>             (nil))))
> (insn 18 16 19 4 (set (reg/v:VNx1DI 140 [ v4 ])
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 2 [0x2]) repeated x2
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (plus:VNx1DI (reg/v:VNx1DI 139 [ v3 ])
>                 (reg:VNx1DI 147))
>             (unspec:VNx1DI [
>                     (const_int 0 [0])
>                 ] UNSPEC_VUNDEF))) "rvv.c":12:23 1528 {pred_addvnx1di}
>      (expr_list:REG_DEAD (reg:VNx1DI 148)
>         (expr_list:REG_DEAD (reg/v:VNx1DI 139 [ v3 ])
>             (nil))))
> 
> You can see CPROP remove the second the "pred_broadcast" instruction and propagate the result to the second "pred_add" instruction?

I see loads of UNSPECs, that might explain why some passes do something
and some not.  That said, not sure what exactly CPROP does, the two
pred_broadcast insns look exactly the same so CSE should CSE them?

> 
> 
> juzhe.zhong@rivai.ai
>  
> From: Richard Biener
> Date: 2023-02-01 20:40
> To: Ju-Zhe Zhong
> CC: gcc-patches; kito.cheng; richard.sandiford; jeffreyalaw; apinski
> Subject: Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
> On Wed, 1 Feb 2023, juzhe.zhong@rivai.ai wrote:
>  
> > From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
> > 
> > Hi, this patch is present for GCC 14 since I understand it's not appropriate
> > to land it in GCC 13.
> > 
> > NUM_FIXED_BLOCKS = 2 since GCC define each function has aleast 2 blocks
> > one is entry block, the other is exit block.
> > So according this code, the function will not do cprop optimization when
> > there is only exactly one single block.
>  
> cprop / GCSE are global dataflow problems, there's "nothing" to do for
> a single block, the local problem plus its application isn't considered
> but probably left for CSE.
>  
> Why does CSE not perform the desired transform?
>  
> > I am not sure whether it's correct to fix it like this.
> > Can anyone tell me why forbid cprop optimization if the function only has s
> > single block ?
> > 
> > Let's take a look at these 2 examples of RVV intrinsics:
> > 1.    void f1 (void * in, void *out, int64_t x, int n)
> >       {
> >       vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
> >       vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
> >       vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
> >       vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
> >       __riscv_vse64_v_i64m1 (out + 2, v4, 4);
> >       }
> > 
> > asm:
> >      addi sp,sp,-16
> > sw a2,8(sp)
> > sw a3,12(sp)
> > sw a2,0(sp)
> > sw a3,4(sp)
> > addi a5,a0,1
> > vsetivli zero,4,e64,m1,ta,ma
> > addi a0,a0,2
> > vle64.v v24,0(a5)
> > addi a5,sp,8
> > vlse64.v v27,0(a5),zero
> > addi a1,a1,2
> > vsetvli zero,zero,e64,m1,tu,ma
> > vle64.v v24,0(a0)
> > vsetvli zero,zero,e64,m1,ta,ma
> > vlse64.v v25,0(sp),zero
> > vadd.vv v26,v24,v27
> > vadd.vv v24,v26,v25
> > vse64.v v24,0(a1)
> > addi sp,sp,16
> > jr ra
> > you can see here there are 2 vlse64.v instructions that broadcast the scalar value "x"
> > GCC fail to eliminate the second vlse64.v instruction since GCC doesn't do the cprop
> > optimization (the function only has 1 single block). It can be optimized if we apply
> > this patch.
> > 
> > 2. void f1 (void * in, void *out, int64_t x, int n)
> > {
> >     if (n) {
> >       vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
> >       vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
> >       vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
> >       vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
> >       __riscv_vse64_v_i64m1 (out + 2, v4, 4);
> >     }
> > }
> > 
> > asm:
> >   f1:
> > vsetivli zero,4,e64,m1,ta,ma
> > beq a4,zero,.L7
> > addi sp,sp,-16
> > sw a2,8(sp)
> > sw a3,12(sp)
> > addi a5,a0,1
> > vle64.v v24,0(a5)
> > addi a0,a0,2
> > addi a5,sp,8
> > vlse64.v v25,0(a5),zero
> > addi a1,a1,2
> > vsetvli zero,zero,e64,m1,tu,ma
> > vle64.v v24,0(a0)
> > vadd.vv v26,v24,v25
> > vadd.vv v24,v26,v25
> > vse64.v v24,0(a1)
> > addi sp,sp,16
> > jr ra
> > .L7:
> > ret
> > 
> > Here, if we add if (n) condition here, the program will end up with more than 1 block.
> > So GCC will do the cprop optimization and the second vlse64.v instruction is eliminated.
> > 
> > I am not sure whether this patch is correct.
> > Can anyone help me with that ?
> > Thanks.
> > 
> > 
> > gcc/ChangeLog:
> > 
> >         * cprop.cc (one_cprop_pass): Remove +1.
> > 
> > ---
> >  gcc/cprop.cc | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/gcc/cprop.cc b/gcc/cprop.cc
> > index 6ec0bda4a24..615bc5078b6 100644
> > --- a/gcc/cprop.cc
> > +++ b/gcc/cprop.cc
> > @@ -1749,7 +1749,7 @@ one_cprop_pass (void)
> >    int changed = 0;
> >  
> >    /* Return if there's nothing to do, or it is too expensive.  */
> > -  if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS + 1
> > +  if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS
> >        || gcse_or_cprop_is_too_expensive (_ ("const/copy propagation disabled")))
> >      return 0;
> >  
> > 
>  
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
  2023-02-01 12:51     ` Richard Biener
@ 2023-02-01 12:56       ` juzhe.zhong
       [not found]         ` <0CEDD257-703F-4ED3-B927-C21279557CE6@gmail.com>
  0 siblings, 1 reply; 12+ messages in thread
From: juzhe.zhong @ 2023-02-01 12:56 UTC (permalink / raw)
  To: rguenther
  Cc: gcc-patches, kito.cheng, richard.sandiford, jeffreyalaw, apinski

[-- Attachment #1: Type: text/plain, Size: 11505 bytes --]

>> I see loads of UNSPECs, that might explain why some passes do something
>> and some not.  That said, not sure what exactly CPROP does, the two
>> pred_broadcast insns look exactly the same so CSE should CSE them?
Yes, the "source" these 2 pred_broadcast the same. However, they have different pseudos in their "dest" (operands[0])
The first one is 151, the second one is 152. And the result in these 2 pseudos are the same.
The first pred_add use 151, the second pred_add use 152.

CSE should do the job to eliminate the second pred_broadcast, but I am not sure whether  CSE can propagate the 151 pseudo
to the second pred_add ??


juzhe.zhong@rivai.ai
 
From: Richard Biener
Date: 2023-02-01 20:51
To: juzhe.zhong@rivai.ai
CC: gcc-patches; kito.cheng; richard.sandiford; jeffreyalaw; apinski
Subject: Re: Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
On Wed, 1 Feb 2023, juzhe.zhong@rivai.ai wrote:
 
>  I don't known whether CSE do the job. What I saw is CPROP do the optimization when we have more than 1 block.
> 
> This the RTL dump before CPROP:
> 
> (insn 19 18 20 4 (set (reg:VNx1DI 151)
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 2 [0x2]) repeated x2
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (vec_duplicate:VNx1DI (reg/v:DI 148 [ x ]))
>             (unspec:VNx1DI [
>                     (const_int 0 [0])
>                 ] UNSPEC_VUNDEF))) "rvv.c":22:23 695 {pred_broadcastvnx1di}
>      (nil))
> (insn 20 19 21 4 (set (reg/v:VNx1DI 139 [ v3 ])
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 2 [0x2]) repeated x2
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
>                 (reg:VNx1DI 151))
>             (unspec:VNx1DI [
>                     (const_int 0 [0])
>                 ] UNSPEC_VUNDEF))) "rvv.c":22:23 1528 {pred_addvnx1di}
>      (expr_list:REG_DEAD (reg:VNx1DI 151)
>         (nil)))
> (insn 21 20 22 4 (set (reg:VNx1DI 152)
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 2 [0x2]) repeated x2
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (vec_duplicate:VNx1DI (reg/v:DI 148 [ x ]))
>             (unspec:VNx1DI [
>                     (const_int 0 [0])
>                 ] UNSPEC_VUNDEF))) "rvv.c":23:23 695 {pred_broadcastvnx1di}
>      (nil))
> (insn 22 21 23 4 (set (reg/v:VNx1DI 140 [ v4 ])
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 0 [0])
>                     (const_int 2 [0x2])
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
>                 (reg:VNx1DI 152))
>             (reg/v:VNx1DI 139 [ v3 ]))) "rvv.c":23:23 1528 {pred_addvnx1di}
>      (expr_list:REG_DEAD (reg:VNx1DI 152)
>         (expr_list:REG_DEAD (reg/v:VNx1DI 139 [ v3 ])
>             (expr_list:REG_DEAD (reg/v:VNx1DI 138 [ v2 ])
>                 (nil)))))
> 
> After CRPOP:
> (insn 15 14 16 4 (set (reg:VNx1DI 147)
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 2 [0x2]) repeated x2
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (vec_duplicate:VNx1DI (reg/v:DI 143 [ x ]))
>             (unspec:VNx1DI [
>                     (const_int 0 [0])
>                 ] UNSPEC_VUNDEF))) "rvv.c":11:23 695 {pred_broadcastvnx1di}
>      (nil))
> (insn 16 15 18 4 (set (reg/v:VNx1DI 139 [ v3 ])
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 2 [0x2]) repeated x2
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
>                 (reg:VNx1DI 147))
>             (unspec:VNx1DI [
>                     (const_int 0 [0])
>                 ] UNSPEC_VUNDEF))) "rvv.c":11:23 1528 {pred_addvnx1di}
>      (expr_list:REG_DEAD (reg:VNx1DI 147)
>         (expr_list:REG_DEAD (reg/v:VNx1DI 138 [ v2 ])
>             (nil))))
> (insn 18 16 19 4 (set (reg/v:VNx1DI 140 [ v4 ])
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 2 [0x2]) repeated x2
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (plus:VNx1DI (reg/v:VNx1DI 139 [ v3 ])
>                 (reg:VNx1DI 147))
>             (unspec:VNx1DI [
>                     (const_int 0 [0])
>                 ] UNSPEC_VUNDEF))) "rvv.c":12:23 1528 {pred_addvnx1di}
>      (expr_list:REG_DEAD (reg:VNx1DI 148)
>         (expr_list:REG_DEAD (reg/v:VNx1DI 139 [ v3 ])
>             (nil))))
> 
> You can see CPROP remove the second the "pred_broadcast" instruction and propagate the result to the second "pred_add" instruction?
 
I see loads of UNSPECs, that might explain why some passes do something
and some not.  That said, not sure what exactly CPROP does, the two
pred_broadcast insns look exactly the same so CSE should CSE them?
 
> 
> 
> juzhe.zhong@rivai.ai
>  
> From: Richard Biener
> Date: 2023-02-01 20:40
> To: Ju-Zhe Zhong
> CC: gcc-patches; kito.cheng; richard.sandiford; jeffreyalaw; apinski
> Subject: Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
> On Wed, 1 Feb 2023, juzhe.zhong@rivai.ai wrote:
>  
> > From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
> > 
> > Hi, this patch is present for GCC 14 since I understand it's not appropriate
> > to land it in GCC 13.
> > 
> > NUM_FIXED_BLOCKS = 2 since GCC define each function has aleast 2 blocks
> > one is entry block, the other is exit block.
> > So according this code, the function will not do cprop optimization when
> > there is only exactly one single block.
>  
> cprop / GCSE are global dataflow problems, there's "nothing" to do for
> a single block, the local problem plus its application isn't considered
> but probably left for CSE.
>  
> Why does CSE not perform the desired transform?
>  
> > I am not sure whether it's correct to fix it like this.
> > Can anyone tell me why forbid cprop optimization if the function only has s
> > single block ?
> > 
> > Let's take a look at these 2 examples of RVV intrinsics:
> > 1.    void f1 (void * in, void *out, int64_t x, int n)
> >       {
> >       vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
> >       vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
> >       vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
> >       vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
> >       __riscv_vse64_v_i64m1 (out + 2, v4, 4);
> >       }
> > 
> > asm:
> >      addi sp,sp,-16
> > sw a2,8(sp)
> > sw a3,12(sp)
> > sw a2,0(sp)
> > sw a3,4(sp)
> > addi a5,a0,1
> > vsetivli zero,4,e64,m1,ta,ma
> > addi a0,a0,2
> > vle64.v v24,0(a5)
> > addi a5,sp,8
> > vlse64.v v27,0(a5),zero
> > addi a1,a1,2
> > vsetvli zero,zero,e64,m1,tu,ma
> > vle64.v v24,0(a0)
> > vsetvli zero,zero,e64,m1,ta,ma
> > vlse64.v v25,0(sp),zero
> > vadd.vv v26,v24,v27
> > vadd.vv v24,v26,v25
> > vse64.v v24,0(a1)
> > addi sp,sp,16
> > jr ra
> > you can see here there are 2 vlse64.v instructions that broadcast the scalar value "x"
> > GCC fail to eliminate the second vlse64.v instruction since GCC doesn't do the cprop
> > optimization (the function only has 1 single block). It can be optimized if we apply
> > this patch.
> > 
> > 2. void f1 (void * in, void *out, int64_t x, int n)
> > {
> >     if (n) {
> >       vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
> >       vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
> >       vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
> >       vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
> >       __riscv_vse64_v_i64m1 (out + 2, v4, 4);
> >     }
> > }
> > 
> > asm:
> >   f1:
> > vsetivli zero,4,e64,m1,ta,ma
> > beq a4,zero,.L7
> > addi sp,sp,-16
> > sw a2,8(sp)
> > sw a3,12(sp)
> > addi a5,a0,1
> > vle64.v v24,0(a5)
> > addi a0,a0,2
> > addi a5,sp,8
> > vlse64.v v25,0(a5),zero
> > addi a1,a1,2
> > vsetvli zero,zero,e64,m1,tu,ma
> > vle64.v v24,0(a0)
> > vadd.vv v26,v24,v25
> > vadd.vv v24,v26,v25
> > vse64.v v24,0(a1)
> > addi sp,sp,16
> > jr ra
> > .L7:
> > ret
> > 
> > Here, if we add if (n) condition here, the program will end up with more than 1 block.
> > So GCC will do the cprop optimization and the second vlse64.v instruction is eliminated.
> > 
> > I am not sure whether this patch is correct.
> > Can anyone help me with that ?
> > Thanks.
> > 
> > 
> > gcc/ChangeLog:
> > 
> >         * cprop.cc (one_cprop_pass): Remove +1.
> > 
> > ---
> >  gcc/cprop.cc | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/gcc/cprop.cc b/gcc/cprop.cc
> > index 6ec0bda4a24..615bc5078b6 100644
> > --- a/gcc/cprop.cc
> > +++ b/gcc/cprop.cc
> > @@ -1749,7 +1749,7 @@ one_cprop_pass (void)
> >    int changed = 0;
> >  
> >    /* Return if there's nothing to do, or it is too expensive.  */
> > -  if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS + 1
> > +  if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS
> >        || gcse_or_cprop_is_too_expensive (_ ("const/copy propagation disabled")))
> >      return 0;
> >  
> > 
>  
> 
 
-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)
 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
       [not found]         ` <0CEDD257-703F-4ED3-B927-C21279557CE6@gmail.com>
@ 2023-02-02  1:49           ` juzhe.zhong
  2023-02-02 12:26             ` Richard Biener
  0 siblings, 1 reply; 12+ messages in thread
From: juzhe.zhong @ 2023-02-02  1:49 UTC (permalink / raw)
  To: Richard Biener
  Cc: rguenther, gcc-patches, kito.cheng, richard.sandiford,
	jeffreyalaw, apinski

[-- Attachment #1: Type: text/plain, Size: 14114 bytes --]

Yeah, Thanks. You are right. CSE should do the job. 
Now I know the reason CSE failed to optimize is I include VL_REGNUM(66)/VTYPE_RENUM(67) hard reg
as the dependency of pred_broadcast:
(insn 19 18 20 4 (set (reg:VNx1DI 152)
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 2 [0x2]) repeated x2
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (vec_duplicate:VNx1DI (reg/v:DI 148 [ x ]))
>             (unspec:VNx1DI [
>                     (const_int 0 [0])
>                 ] UNSPEC_VUNDEF))) "rvv.c":22:23 695 {pred_broadcastvnx1di}
>      (nil))
Then CSE failed to set the 152 as copy.

VL_REGNUM(66)/VTYPE_RENUM(67) are the global hard reg that I should make each RVV instruction depend on them.
Since we use vsetvl instruction (which is setting global VL_REGNUM(66)/VTYPE_RENUM(67) status) to set the global status for
each RVV instruction. 
Including the dependency here is to make sure the global VL/VTYPE status is correct of each RVV instruction. (If we don't include
such dependency in RVV instruction, instruction scheduling may move the RVV instructions and vsetvl instructions randomly then
produce incorrect vsetvl configuration)

The original reg_class of VL_REGNUM(66)/VTYPE_RENUM(67) I set here:
riscv_regno_to_class [VL_REGNUM] = VL_REGS;
riscv_regno_to_class [VTYPE_RENUM] = VTYPE_REGS;
Such configuration make CSE failed.

However, if I change the reg_class :
riscv_regno_to_class [VL_REGNUM] = NO_REGS;
riscv_regno_to_class [VTYPE_RENUM] = NO_REGS;
The CSE now can do the optimization now!

1) Would you mind telling me the difference between them?
2) If I set these 2 global status register as NO_REGS, will it create issues for the global status configuration of each RVV instructions ?

Thanks.


juzhe.zhong@rivai.ai
 
From: Richard Biener
Date: 2023-02-01 23:46
To: juzhe.zhong
CC: rguenther; gcc-patches; kito.cheng; richard.sandiford; jeffreyalaw; apinski
Subject: Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block


Am 01.02.2023 um 13:57 schrieb juzhe.zhong@rivai.ai:

 
>> I see loads of UNSPECs, that might explain why some passes do something
>> and some not.  That said, not sure what exactly CPROP does, the two
>> pred_broadcast insns look exactly the same so CSE should CSE them?
Yes, the "source" these 2 pred_broadcast the same. However, they have different pseudos in their "dest" (operands[0])
The first one is 151, the second one is 152. And the result in these 2 pseudos are the same.
The first pred_add use 151, the second pred_add use 152.

CSE should do the job to eliminate the second pred_broadcast, but I am not sure whether  CSE can propagate the 151 pseudo
to the second pred_add ??

It should do that.  At least it should make setting 152 a copy.  So you need to figure why it doesn’t do that?  Maybe it uses costs and those say recomputing is cheaper than a copy?



juzhe.zhong@rivai.ai
 
From: Richard Biener
Date: 2023-02-01 20:51
To: juzhe.zhong@rivai.ai
CC: gcc-patches; kito.cheng; richard.sandiford; jeffreyalaw; apinski
Subject: Re: Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
On Wed, 1 Feb 2023, juzhe.zhong@rivai.ai wrote:
 
>  I don't known whether CSE do the job. What I saw is CPROP do the optimization when we have more than 1 block.
> 
> This the RTL dump before CPROP:
> 
> (insn 19 18 20 4 (set (reg:VNx1DI 151)
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 2 [0x2]) repeated x2
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (vec_duplicate:VNx1DI (reg/v:DI 148 [ x ]))
>             (unspec:VNx1DI [
>                     (const_int 0 [0])
>                 ] UNSPEC_VUNDEF))) "rvv.c":22:23 695 {pred_broadcastvnx1di}
>      (nil))
> (insn 20 19 21 4 (set (reg/v:VNx1DI 139 [ v3 ])
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 2 [0x2]) repeated x2
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
>                 (reg:VNx1DI 151))
>             (unspec:VNx1DI [
>                     (const_int 0 [0])
>                 ] UNSPEC_VUNDEF))) "rvv.c":22:23 1528 {pred_addvnx1di}
>      (expr_list:REG_DEAD (reg:VNx1DI 151)
>         (nil)))
> (insn 21 20 22 4 (set (reg:VNx1DI 152)
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 2 [0x2]) repeated x2
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (vec_duplicate:VNx1DI (reg/v:DI 148 [ x ]))
>             (unspec:VNx1DI [
>                     (const_int 0 [0])
>                 ] UNSPEC_VUNDEF))) "rvv.c":23:23 695 {pred_broadcastvnx1di}
>      (nil))
> (insn 22 21 23 4 (set (reg/v:VNx1DI 140 [ v4 ])
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 0 [0])
>                     (const_int 2 [0x2])
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
>                 (reg:VNx1DI 152))
>             (reg/v:VNx1DI 139 [ v3 ]))) "rvv.c":23:23 1528 {pred_addvnx1di}
>      (expr_list:REG_DEAD (reg:VNx1DI 152)
>         (expr_list:REG_DEAD (reg/v:VNx1DI 139 [ v3 ])
>             (expr_list:REG_DEAD (reg/v:VNx1DI 138 [ v2 ])
>                 (nil)))))
> 
> After CRPOP:
> (insn 15 14 16 4 (set (reg:VNx1DI 147)
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 2 [0x2]) repeated x2
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (vec_duplicate:VNx1DI (reg/v:DI 143 [ x ]))
>             (unspec:VNx1DI [
>                     (const_int 0 [0])
>                 ] UNSPEC_VUNDEF))) "rvv.c":11:23 695 {pred_broadcastvnx1di}
>      (nil))
> (insn 16 15 18 4 (set (reg/v:VNx1DI 139 [ v3 ])
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 2 [0x2]) repeated x2
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
>                 (reg:VNx1DI 147))
>             (unspec:VNx1DI [
>                     (const_int 0 [0])
>                 ] UNSPEC_VUNDEF))) "rvv.c":11:23 1528 {pred_addvnx1di}
>      (expr_list:REG_DEAD (reg:VNx1DI 147)
>         (expr_list:REG_DEAD (reg/v:VNx1DI 138 [ v2 ])
>             (nil))))
> (insn 18 16 19 4 (set (reg/v:VNx1DI 140 [ v4 ])
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 2 [0x2]) repeated x2
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (plus:VNx1DI (reg/v:VNx1DI 139 [ v3 ])
>                 (reg:VNx1DI 147))
>             (unspec:VNx1DI [
>                     (const_int 0 [0])
>                 ] UNSPEC_VUNDEF))) "rvv.c":12:23 1528 {pred_addvnx1di}
>      (expr_list:REG_DEAD (reg:VNx1DI 148)
>         (expr_list:REG_DEAD (reg/v:VNx1DI 139 [ v3 ])
>             (nil))))
> 
> You can see CPROP remove the second the "pred_broadcast" instruction and propagate the result to the second "pred_add" instruction?
 
I see loads of UNSPECs, that might explain why some passes do something
and some not.  That said, not sure what exactly CPROP does, the two
pred_broadcast insns look exactly the same so CSE should CSE them?
 
> 
> 
> juzhe.zhong@rivai.ai
>  
> From: Richard Biener
> Date: 2023-02-01 20:40
> To: Ju-Zhe Zhong
> CC: gcc-patches; kito.cheng; richard.sandiford; jeffreyalaw; apinski
> Subject: Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
> On Wed, 1 Feb 2023, juzhe.zhong@rivai.ai wrote:
>  
> > From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
> > 
> > Hi, this patch is present for GCC 14 since I understand it's not appropriate
> > to land it in GCC 13.
> > 
> > NUM_FIXED_BLOCKS = 2 since GCC define each function has aleast 2 blocks
> > one is entry block, the other is exit block.
> > So according this code, the function will not do cprop optimization when
> > there is only exactly one single block.
>  
> cprop / GCSE are global dataflow problems, there's "nothing" to do for
> a single block, the local problem plus its application isn't considered
> but probably left for CSE.
>  
> Why does CSE not perform the desired transform?
>  
> > I am not sure whether it's correct to fix it like this.
> > Can anyone tell me why forbid cprop optimization if the function only has s
> > single block ?
> > 
> > Let's take a look at these 2 examples of RVV intrinsics:
> > 1.    void f1 (void * in, void *out, int64_t x, int n)
> >       {
> >       vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
> >       vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
> >       vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
> >       vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
> >       __riscv_vse64_v_i64m1 (out + 2, v4, 4);
> >       }
> > 
> > asm:
> >      addi sp,sp,-16
> > sw a2,8(sp)
> > sw a3,12(sp)
> > sw a2,0(sp)
> > sw a3,4(sp)
> > addi a5,a0,1
> > vsetivli zero,4,e64,m1,ta,ma
> > addi a0,a0,2
> > vle64.v v24,0(a5)
> > addi a5,sp,8
> > vlse64.v v27,0(a5),zero
> > addi a1,a1,2
> > vsetvli zero,zero,e64,m1,tu,ma
> > vle64.v v24,0(a0)
> > vsetvli zero,zero,e64,m1,ta,ma
> > vlse64.v v25,0(sp),zero
> > vadd.vv v26,v24,v27
> > vadd.vv v24,v26,v25
> > vse64.v v24,0(a1)
> > addi sp,sp,16
> > jr ra
> > you can see here there are 2 vlse64.v instructions that broadcast the scalar value "x"
> > GCC fail to eliminate the second vlse64.v instruction since GCC doesn't do the cprop
> > optimization (the function only has 1 single block). It can be optimized if we apply
> > this patch.
> > 
> > 2. void f1 (void * in, void *out, int64_t x, int n)
> > {
> >     if (n) {
> >       vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
> >       vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
> >       vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
> >       vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
> >       __riscv_vse64_v_i64m1 (out + 2, v4, 4);
> >     }
> > }
> > 
> > asm:
> >   f1:
> > vsetivli zero,4,e64,m1,ta,ma
> > beq a4,zero,.L7
> > addi sp,sp,-16
> > sw a2,8(sp)
> > sw a3,12(sp)
> > addi a5,a0,1
> > vle64.v v24,0(a5)
> > addi a0,a0,2
> > addi a5,sp,8
> > vlse64.v v25,0(a5),zero
> > addi a1,a1,2
> > vsetvli zero,zero,e64,m1,tu,ma
> > vle64.v v24,0(a0)
> > vadd.vv v26,v24,v25
> > vadd.vv v24,v26,v25
> > vse64.v v24,0(a1)
> > addi sp,sp,16
> > jr ra
> > .L7:
> > ret
> > 
> > Here, if we add if (n) condition here, the program will end up with more than 1 block.
> > So GCC will do the cprop optimization and the second vlse64.v instruction is eliminated.
> > 
> > I am not sure whether this patch is correct.
> > Can anyone help me with that ?
> > Thanks.
> > 
> > 
> > gcc/ChangeLog:
> > 
> >         * cprop.cc (one_cprop_pass): Remove +1.
> > 
> > ---
> >  gcc/cprop.cc | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/gcc/cprop.cc b/gcc/cprop.cc
> > index 6ec0bda4a24..615bc5078b6 100644
> > --- a/gcc/cprop.cc
> > +++ b/gcc/cprop.cc
> > @@ -1749,7 +1749,7 @@ one_cprop_pass (void)
> >    int changed = 0;
> >  
> >    /* Return if there's nothing to do, or it is too expensive.  */
> > -  if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS + 1
> > +  if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS
> >        || gcse_or_cprop_is_too_expensive (_ ("const/copy propagation disabled")))
> >      return 0;
> >  
> > 
>  
> 
 
-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)
 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
  2023-02-02  1:49           ` juzhe.zhong
@ 2023-02-02 12:26             ` Richard Biener
  2023-02-02 12:35               ` juzhe.zhong
  2023-02-02 14:36               ` Jeff Law
  0 siblings, 2 replies; 12+ messages in thread
From: Richard Biener @ 2023-02-02 12:26 UTC (permalink / raw)
  To: juzhe.zhong
  Cc: gcc-patches, kito.cheng, richard.sandiford, jeffreyalaw, apinski

On Thu, 2 Feb 2023, juzhe.zhong@rivai.ai wrote:

> Yeah, Thanks. You are right. CSE should do the job. 
> Now I know the reason CSE failed to optimize is I include VL_REGNUM(66)/VTYPE_RENUM(67) hard reg
> as the dependency of pred_broadcast:
> (insn 19 18 20 4 (set (reg:VNx1DI 152)
> >         (if_then_else:VNx1DI (unspec:VNx1BI [
> >                     (const_vector:VNx1BI repeat [
> >                             (const_int 1 [0x1])
> >                         ])
> >                     (const_int 4 [0x4])
> >                     (const_int 2 [0x2]) repeated x2
> >                     (const_int 0 [0])
> >                     (reg:SI 66 vl)
> >                     (reg:SI 67 vtype)
> >                 ] UNSPEC_VPREDICATE)
> >             (vec_duplicate:VNx1DI (reg/v:DI 148 [ x ]))
> >             (unspec:VNx1DI [
> >                     (const_int 0 [0])
> >                 ] UNSPEC_VUNDEF))) "rvv.c":22:23 695 {pred_broadcastvnx1di}
> >      (nil))
> Then CSE failed to set the 152 as copy.
> 
> VL_REGNUM(66)/VTYPE_RENUM(67) are the global hard reg that I should make each RVV instruction depend on them.
> Since we use vsetvl instruction (which is setting global VL_REGNUM(66)/VTYPE_RENUM(67) status) to set the global status for
> each RVV instruction. 
> Including the dependency here is to make sure the global VL/VTYPE status is correct of each RVV instruction. (If we don't include
> such dependency in RVV instruction, instruction scheduling may move the RVV instructions and vsetvl instructions randomly then
> produce incorrect vsetvl configuration)
> 
> The original reg_class of VL_REGNUM(66)/VTYPE_RENUM(67) I set here:
> riscv_regno_to_class [VL_REGNUM] = VL_REGS;
> riscv_regno_to_class [VTYPE_RENUM] = VTYPE_REGS;
> Such configuration make CSE failed.
> 
> However, if I change the reg_class :
> riscv_regno_to_class [VL_REGNUM] = NO_REGS;
> riscv_regno_to_class [VTYPE_RENUM] = NO_REGS;
> The CSE now can do the optimization now!
> 
> 1) Would you mind telling me the difference between them?

No idea.  I think CSE avoids to touch hard register references because
eliding them to copies can increase register pressure.

> 2) If I set these 2 global status register as NO_REGS, will it create 
>    issues for the global status configuration of each RVV instructions ?

No idea either.  Usually these kind of dependences are introduced
by targets at the point the VL setting is introduced to avoid
pessimizing optimizations earlier.  Often, for cases like a VL
register, this is done after register allocation only and indeed
necessary to avoid the second scheduling pass from breaking things.

Richard.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
  2023-02-02 12:26             ` Richard Biener
@ 2023-02-02 12:35               ` juzhe.zhong
  2023-02-02 14:36                 ` Jeff Law
  2023-02-02 14:36               ` Jeff Law
  1 sibling, 1 reply; 12+ messages in thread
From: juzhe.zhong @ 2023-02-02 12:35 UTC (permalink / raw)
  To: rguenther
  Cc: gcc-patches, kito.cheng, richard.sandiford, jeffreyalaw, apinski

[-- Attachment #1: Type: text/plain, Size: 3077 bytes --]

Thank you so much. Kito helped me fix it already.
RVV instruction patterns can have CSE optimizations now.



juzhe.zhong@rivai.ai
 
From: Richard Biener
Date: 2023-02-02 20:26
To: juzhe.zhong@rivai.ai
CC: gcc-patches; kito.cheng; richard.sandiford; jeffreyalaw; apinski
Subject: Re: Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
On Thu, 2 Feb 2023, juzhe.zhong@rivai.ai wrote:
 
> Yeah, Thanks. You are right. CSE should do the job. 
> Now I know the reason CSE failed to optimize is I include VL_REGNUM(66)/VTYPE_RENUM(67) hard reg
> as the dependency of pred_broadcast:
> (insn 19 18 20 4 (set (reg:VNx1DI 152)
> >         (if_then_else:VNx1DI (unspec:VNx1BI [
> >                     (const_vector:VNx1BI repeat [
> >                             (const_int 1 [0x1])
> >                         ])
> >                     (const_int 4 [0x4])
> >                     (const_int 2 [0x2]) repeated x2
> >                     (const_int 0 [0])
> >                     (reg:SI 66 vl)
> >                     (reg:SI 67 vtype)
> >                 ] UNSPEC_VPREDICATE)
> >             (vec_duplicate:VNx1DI (reg/v:DI 148 [ x ]))
> >             (unspec:VNx1DI [
> >                     (const_int 0 [0])
> >                 ] UNSPEC_VUNDEF))) "rvv.c":22:23 695 {pred_broadcastvnx1di}
> >      (nil))
> Then CSE failed to set the 152 as copy.
> 
> VL_REGNUM(66)/VTYPE_RENUM(67) are the global hard reg that I should make each RVV instruction depend on them.
> Since we use vsetvl instruction (which is setting global VL_REGNUM(66)/VTYPE_RENUM(67) status) to set the global status for
> each RVV instruction. 
> Including the dependency here is to make sure the global VL/VTYPE status is correct of each RVV instruction. (If we don't include
> such dependency in RVV instruction, instruction scheduling may move the RVV instructions and vsetvl instructions randomly then
> produce incorrect vsetvl configuration)
> 
> The original reg_class of VL_REGNUM(66)/VTYPE_RENUM(67) I set here:
> riscv_regno_to_class [VL_REGNUM] = VL_REGS;
> riscv_regno_to_class [VTYPE_RENUM] = VTYPE_REGS;
> Such configuration make CSE failed.
> 
> However, if I change the reg_class :
> riscv_regno_to_class [VL_REGNUM] = NO_REGS;
> riscv_regno_to_class [VTYPE_RENUM] = NO_REGS;
> The CSE now can do the optimization now!
> 
> 1) Would you mind telling me the difference between them?
 
No idea.  I think CSE avoids to touch hard register references because
eliding them to copies can increase register pressure.
 
> 2) If I set these 2 global status register as NO_REGS, will it create 
>    issues for the global status configuration of each RVV instructions ?
 
No idea either.  Usually these kind of dependences are introduced
by targets at the point the VL setting is introduced to avoid
pessimizing optimizations earlier.  Often, for cases like a VL
register, this is done after register allocation only and indeed
necessary to avoid the second scheduling pass from breaking things.
 
Richard.
 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
  2023-02-02 12:26             ` Richard Biener
  2023-02-02 12:35               ` juzhe.zhong
@ 2023-02-02 14:36               ` Jeff Law
  1 sibling, 0 replies; 12+ messages in thread
From: Jeff Law @ 2023-02-02 14:36 UTC (permalink / raw)
  To: Richard Biener, juzhe.zhong
  Cc: gcc-patches, kito.cheng, richard.sandiford, apinski



On 2/2/23 05:26, Richard Biener wrote:
> On Thu, 2 Feb 2023, juzhe.zhong@rivai.ai wrote:
> 
>> Yeah, Thanks. You are right. CSE should do the job.
>> Now I know the reason CSE failed to optimize is I include VL_REGNUM(66)/VTYPE_RENUM(67) hard reg
>> as the dependency of pred_broadcast:
>> (insn 19 18 20 4 (set (reg:VNx1DI 152)
>>>          (if_then_else:VNx1DI (unspec:VNx1BI [
>>>                      (const_vector:VNx1BI repeat [
>>>                              (const_int 1 [0x1])
>>>                          ])
>>>                      (const_int 4 [0x4])
>>>                      (const_int 2 [0x2]) repeated x2
>>>                      (const_int 0 [0])
>>>                      (reg:SI 66 vl)
>>>                      (reg:SI 67 vtype)
>>>                  ] UNSPEC_VPREDICATE)
>>>              (vec_duplicate:VNx1DI (reg/v:DI 148 [ x ]))
>>>              (unspec:VNx1DI [
>>>                      (const_int 0 [0])
>>>                  ] UNSPEC_VUNDEF))) "rvv.c":22:23 695 {pred_broadcastvnx1di}
>>>       (nil))
>> Then CSE failed to set the 152 as copy.
>>
>> VL_REGNUM(66)/VTYPE_RENUM(67) are the global hard reg that I should make each RVV instruction depend on them.
>> Since we use vsetvl instruction (which is setting global VL_REGNUM(66)/VTYPE_RENUM(67) status) to set the global status for
>> each RVV instruction.
>> Including the dependency here is to make sure the global VL/VTYPE status is correct of each RVV instruction. (If we don't include
>> such dependency in RVV instruction, instruction scheduling may move the RVV instructions and vsetvl instructions randomly then
>> produce incorrect vsetvl configuration)
>>
>> The original reg_class of VL_REGNUM(66)/VTYPE_RENUM(67) I set here:
>> riscv_regno_to_class [VL_REGNUM] = VL_REGS;
>> riscv_regno_to_class [VTYPE_RENUM] = VTYPE_REGS;
>> Such configuration make CSE failed.
>>
>> However, if I change the reg_class :
>> riscv_regno_to_class [VL_REGNUM] = NO_REGS;
>> riscv_regno_to_class [VTYPE_RENUM] = NO_REGS;
>> The CSE now can do the optimization now!
>>
>> 1) Would you mind telling me the difference between them?
> 
> No idea.  I think CSE avoids to touch hard register references because
> eliding them to copies can increase register pressure.IIRC the costing is set up differently and for a given partition a 
pseudo will be preferred over a hard reg.  This is in addition to other 
places that test the small register class hooks.



> 
>> 2) If I set these 2 global status register as NO_REGS, will it create
>>     issues for the global status configuration of each RVV instructions ?
> 
> No idea either.  Usually these kind of dependences are introduced
> by targets at the point the VL setting is introduced to avoid
> pessimizing optimizations earlier.  Often, for cases like a VL
> register, this is done after register allocation only and indeed
> necessary to avoid the second scheduling pass from breaking things.
Yea.  I'm wondering about when the right place to introduce these 
dependencies might be.  I'm still a few months out from worrying about 
RVV, but it's not too far away.
jeff

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
  2023-02-02 12:35               ` juzhe.zhong
@ 2023-02-02 14:36                 ` Jeff Law
  2023-02-02 14:43                   ` juzhe.zhong
  2023-02-02 15:18                   ` Kito Cheng
  0 siblings, 2 replies; 12+ messages in thread
From: Jeff Law @ 2023-02-02 14:36 UTC (permalink / raw)
  To: juzhe.zhong, rguenther
  Cc: gcc-patches, kito.cheng, richard.sandiford, apinski



On 2/2/23 05:35, juzhe.zhong@rivai.ai wrote:
> Thank you so much. Kito helped me fix it already.
> RVV instruction patterns can have CSE optimizations now.
What was the issue?

jeff

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
  2023-02-02 14:36                 ` Jeff Law
@ 2023-02-02 14:43                   ` juzhe.zhong
  2023-02-02 15:18                   ` Kito Cheng
  1 sibling, 0 replies; 12+ messages in thread
From: juzhe.zhong @ 2023-02-02 14:43 UTC (permalink / raw)
  To: Jeff Law, rguenther; +Cc: gcc-patches, kito.cheng, richard.sandiford, apinski

[-- Attachment #1: Type: text/plain, Size: 1013 bytes --]

We set VL/VTYPE these 2 implicit global status denpency register as fixed reg.
Then CSE can do the optimization now.

>> Yea.  I'm wondering about when the right place to introduce these
>>dependencies might be.  I'm still a few months out from worrying about
>>RVV, but it's not too far away.
You don't need to worry about RVV. I can promise you that RVV support in GCC will be solid and
optimal. You can just try. For example, try VSETVL PASS,  this PASS implemented in GCC is much better
than LLVM. I have include so many fancy optimizations there.


juzhe.zhong@rivai.ai
 
From: Jeff Law
Date: 2023-02-02 22:36
To: juzhe.zhong@rivai.ai; rguenther
CC: gcc-patches; kito.cheng; richard.sandiford; apinski
Subject: Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
 
 
On 2/2/23 05:35, juzhe.zhong@rivai.ai wrote:
> Thank you so much. Kito helped me fix it already.
> RVV instruction patterns can have CSE optimizations now.
What was the issue?
 
jeff
 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
  2023-02-02 14:36                 ` Jeff Law
  2023-02-02 14:43                   ` juzhe.zhong
@ 2023-02-02 15:18                   ` Kito Cheng
  1 sibling, 0 replies; 12+ messages in thread
From: Kito Cheng @ 2023-02-02 15:18 UTC (permalink / raw)
  To: Jeff Law; +Cc: juzhe.zhong, rguenther, gcc-patches, richard.sandiford, apinski

> > Thank you so much. Kito helped me fix it already.
> > RVV instruction patterns can have CSE optimizations now.
> What was the issue?

VL and VTYPE isn't listed in fixed register so CSE feel that isn't
cheap (See CHEAP_REGNO in cse.cc),
but actually it's kind of mistake sett for VL and VTYPE register to
non fixed register,
it all managed by vsetvl insertion pass, and won't involved into the
register allocation
process, so it should be set 1 in FIXED_REGISTERS,

then CSE pass is happy to cse that after we fix that :)

More story behind that is we were trying to rely on RA to manage VL
and VTYPE before,
and then...we gave up and decided to manage that by ourselves.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2023-02-02 15:19 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-01 12:24 [PATCH] CPROP: Allow cprop optimization when the function has a single block juzhe.zhong
2023-02-01 12:40 ` Richard Biener
2023-02-01 12:47   ` juzhe.zhong
2023-02-01 12:51     ` Richard Biener
2023-02-01 12:56       ` juzhe.zhong
     [not found]         ` <0CEDD257-703F-4ED3-B927-C21279557CE6@gmail.com>
2023-02-02  1:49           ` juzhe.zhong
2023-02-02 12:26             ` Richard Biener
2023-02-02 12:35               ` juzhe.zhong
2023-02-02 14:36                 ` Jeff Law
2023-02-02 14:43                   ` juzhe.zhong
2023-02-02 15:18                   ` Kito Cheng
2023-02-02 14:36               ` Jeff Law

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).