From: Richard Sandiford <richard.sandiford@arm.com>
To: lehua.ding@rivai.ai
Cc: gcc-patches@gcc.gnu.org, juzhe.zhong@rivai.ai
Subject: Re: [PATCH 1/1] [fwprop]: Add the support of forwarding the vec_duplicate rtx
Date: Tue, 17 Jan 2023 16:00:57 +0000
In-Reply-To: <20230113094236.77805-1-lehua.ding@rivai.ai> (lehua ding's message of "Fri, 13 Jan 2023 17:42:36 +0800")

lehua.ding@rivai.ai writes:
> From: Lehua Ding
>
> ps: Resent with the width of each line of text adjusted.
>
> Hi,
>
> When I was adding the new RISC-V auto-vectorization support, I found
> that converting `vector-reg1 vop vector-reg2` to `scalar-reg3 vop
> vector-reg2` is not easy to handle when `vector-reg1` is a
> vec_duplicate_expr, as in the gimple IR below:
>
> ```gimple
> vect_cst__51 = [vec_duplicate_expr] z_14(D);
>
> vect_iftmp.13_53 = .LEN_COND_ADD (mask__40.9_47, vect__6.12_50, vect_cst__51, { 0.0, ... }, curr_cnt_60);
> ```
>
> I first wanted to add corresponding internal functions to gimple IR,
> such as .LEN_COND_ADD_VS, and then convert .LEN_COND_ADD to
> .LEN_COND_ADD_VS in match.pd.  That approach works, but it would add
> too many similar internal functions to gimple IR, which doesn't feel
> necessary.  I then tried to combine the instructions in the combine
> pass, but failed.  Finally, I thought of teaching the fwprop pass to
> forward `(vec_duplicate reg)`, hence this patch.
>
> Because current upstream does not support RISC-V auto-vectorization,
> I found an SVE example that can be optimized in the same way and
> simply tried it.  For the float type, one instruction can be saved;
> see the C code below.  The difference between the new and the old
> assembly code is that the new one uses a mov instruction to move the
> scalar variable directly into the vector register, while the old one
> first moves the scalar variable into a vector register outside the
> loop and then uses a sel instruction inside it.  Over the whole
> function, the new assembly code is one instruction shorter.  In
> addition, I noticed that some instructions in the new assembly code
> are scheduled ahead of the `ble .L1` instruction.
> I debugged and found that this change was made by the ce1 pass, which
> considers hoisting them to be beneficial for performance.
>
> In addition, for the int type the new assembly code has one more
> `fmov s2, w2` instruction than for the float type, so I can't judge
> whether the performance is better than before.  In fact, I mainly do
> RISC-V development work.
>
> This is an exploratory patch and has not been tested much.  I mainly
> want your suggestions on whether this approach is feasible and what
> potential problems it may have.
>
> Best,
> Lehua Ding
>
> ```c
> /* compiler options: -O3 -march=armv8.2-a+sve -S */
> void test1 (int *pred, float *x, float z, int n)
> {
>   for (int i = 0; i < n; i += 1)
>     {
>       x[i] = pred[i] != 1 ? x[i] : z;
>     }
> }
> ```
>
> The old assembly code looks like this (compiler explorer link:
> https://godbolt.org/z/hxTnEhaqY):
>
> ```asm
> test1:
>         cmp     w2, 0
>         ble     .L1
>         mov     x3, 0
>         cntw    x4
>         mov     z0.s, s0
>         whilelo p0.s, wzr, w2
>         ptrue   p2.b, all
> .L3:
>         ld1w    z2.s, p0/z, [x0, x3, lsl 2]
>         ld1w    z1.s, p0/z, [x1, x3, lsl 2]
>         cmpne   p1.s, p2/z, z2.s, #1
>         sel     z1.s, p1, z1.s, z0.s
>         st1w    z1.s, p0, [x1, x3, lsl 2]
>         add     x3, x3, x4
>         whilelo p0.s, w3, w2
>         b.any   .L3
> .L1:
>         ret
> ```
>
> The new assembly code looks like this:
>
> ```asm
> test1:
>         whilelo p0.s, wzr, w2
>         mov     x3, 0
>         cntw    x4
>         ptrue   p2.b, all
>         cmp     w2, 0
>         ble     .L1
> .L3:
>         ld1w    z2.s, p0/z, [x0, x3, lsl 2]
>         ld1w    z1.s, p0/z, [x1, x3, lsl 2]
>         cmpne   p1.s, p2/z, z2.s, #1
>         mov     z1.s, p1/m, s0
>         st1w    z1.s, p0, [x1, x3, lsl 2]
>         add     x3, x3, x4
>         whilelo p0.s, w3, w2
>         b.any   .L3
> .L1:
>         ret
> ```
>
> gcc/ChangeLog:
>
>         * config/aarch64/aarch64-sve.md (@aarch64_sel_dup_vs): Add new
>         pattern to capture the new operand order.
>         * fwprop.cc (fwprop_propagation::profitable_p): Add new check.
>         (reg_single_def_for_src_p): Add new function for src rtx.
>         (forward_propagate_into): Change to new function call.
>
> ---
>  gcc/config/aarch64/aarch64-sve.md | 20 ++++++++++++++++++++
>  gcc/fwprop.cc                     | 16 +++++++++++++++-
>  2 files changed, 35 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
> index b8cc47ef5fc..84d8ed0924d 100644
> --- a/gcc/config/aarch64/aarch64-sve.md
> +++ b/gcc/config/aarch64/aarch64-sve.md
> @@ -7636,6 +7636,26 @@
>    [(set_attr "movprfx" "*,*,yes,yes,yes,yes")]
>  )
>  
> +;; Swap the order of operands 1 and 2 so that it matches the above pattern.
> +(define_insn_and_split "@aarch64_sel_dup_vs"
> +  [(set (match_operand:SVE_ALL 0 "register_operand" "=?w, w, ??w, ?&w, ??&w, ?&w")
> +	(unspec:SVE_ALL
> +	  [(match_operand:<VPRED> 3 "register_operand" "Upl, Upl, Upl, Upl, Upl, Upl")
> +	   (match_operand:SVE_ALL 1 "aarch64_simd_reg_or_zero" "0, 0, Dz, Dz, w, w")
> +	   (vec_duplicate:SVE_ALL
> +	     (match_operand:<VEL> 2 "register_operand" "r, w, r, w, r, w"))]
> +	  UNSPEC_SEL))]
> +  "TARGET_SVE"
> +  "#"
> +  "&& 1"
> +  [(set (match_dup 0)
> +	(unspec:SVE_ALL
> +	  [(match_dup 3)
> +	   (vec_duplicate:SVE_ALL (match_dup 2))
> +	   (match_dup 1)]
> +	  UNSPEC_SEL))]
> +)
> +

I don't think this pattern is correct, because SEL isn't commutative in
the vector operands; the short scalar model below spells this out.

But the idea of the fwprop change looks OK to me in principle.  What we
have now seems conservative, based on heuristics that haven't been
updated in a long time.  So relaxing them a bit seems like a good idea.
IIRC Jeff had another case in which the current heuristics were too
strict.
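To spell out the non-commutativity: SEL takes its first vector operand
for lanes where the predicate is true and its second for lanes where it
is false, so swapping the two data operands without also inverting the
predicate selects the opposite lanes.  As a rough scalar model of the
per-lane semantics (an illustrative sketch only, not GCC or patch code;
the helper name is invented):

```c
#include <assert.h>

/* Per-lane behaviour of SEL: pick ZN where the predicate lane is
   active, ZM where it is inactive.  */
static float
sel_lane (int pg, float zn, float zm)
{
  return pg ? zn : zm;
}

int
main (void)
{
  /* Swapping the data operands flips the result whenever they differ,
     so a split that only reorders them is not behaviour-preserving.  */
  assert (sel_lane (1, 1.0f, 2.0f) == 1.0f);
  assert (sel_lane (1, 2.0f, 1.0f) == 2.0f);
  return 0;
}
```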
However:

>  ;; -------------------------------------------------------------------------
>  ;; ---- [INT,FP] Compare and select
>  ;; -------------------------------------------------------------------------

> diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc
> index ae342f59407..5d921dd3d2f 100644
> --- a/gcc/fwprop.cc
> +++ b/gcc/fwprop.cc
> @@ -342,6 +342,9 @@ fwprop_propagation::profitable_p () const
>    if (CONSTANT_P (to))
>      return true;
>  
> +  if (GET_CODE (to) == VEC_DUPLICATE)
> +    return true;
> +

I think this should be:

  if (...)
    to = XEXP (to, 0);

and should come before the REG_P test.  We don't want to treat
arbitrary duplicates as profitable.

It's not obvious that vec_duplicate is special enough that we should
treat it differently from other unary operators.  For example,
zero_extend and sign_extend don't seem fundamentally more expensive
than vec_duplicate.  On the other hand, I suppose that if we allow
unary arithmetic such as NEG, we should probably also allow binary
arithmetic that contains only a single variable operand.  So including
all unary operators wouldn't lead to a natural stopping point either.

So yeah, maybe just start with vec_duplicate, with an open door for
*_extend if someone wants that too.  And we can see where things go
from there.

>    return false;
>  }
>  
> @@ -353,6 +356,17 @@ reg_single_def_p (rtx x)
>    return REG_P (x) && crtl->ssa->single_dominating_def (REGNO (x));
>  }
>  
> +/* Check that X has a single def, or is a VEC_DUPLICATE expr whose
> +   operand has a single def.  */
> +static bool
> +reg_single_def_for_src_p (rtx x)
> +{
> +  if (GET_CODE (x) == VEC_DUPLICATE)
> +    x = XEXP (x, 0);
> +
> +  return reg_single_def_p (x);
> +}
> +
>  /* Return true if X contains a paradoxical subreg.  */
>  
>  static bool
> @@ -873,7 +887,7 @@ forward_propagate_into (use_info *use, bool reg_prop_only = false)
>    if ((reg_prop_only
>         || (def_loop != use_loop
>             && !flow_loop_nested_p (use_loop, def_loop)))
> -      && (!reg_single_def_p (dest) || !reg_single_def_p (src)))
> +      && (!reg_single_def_p (dest) || !reg_single_def_for_src_p (src)))
>      return false;
>  
>    /* Don't substitute into a non-local goto, this confuses CFG.  */

It's a while since I looked at this code, but I assume that, even after
this change, we will still require the new in-loop instruction to be no
more expensive than the old in-loop instruction.  Is that right?

All in all, I agree this looks like a reasonable thing to do.

Thanks,
Richard