public inbox for gcc-patches@gcc.gnu.org
* cpymem for RISCV with v extension
@ 2023-07-18  4:47 Joern Rennecke
  2023-08-04 20:52 ` Jeff Law
  0 siblings, 1 reply; 8+ messages in thread
From: Joern Rennecke @ 2023-07-18  4:47 UTC (permalink / raw)
  To: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 1425 bytes --]

As discussed on last week's patch call, this patch uses either a
straight copy or an opaque pattern that emits the loop as assembly to
optimize cpymem for the 'v' extension.
I used Ju-Zhe Zhong's patch - starting in git with:

Author: zhongjuzhe <66454988+zhongjuzhe@users.noreply.github.com>
Date:   Mon Mar 21 14:20:42 2022 +0800

      PR for RVV support using splitted small chunks (#334)

as a starting point, even though not all that much of the original code remains.
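
To illustrate the two cases the expander handles (a minimal sketch, not taken
from the testsuite):

#include <string.h>

/* Small constant length: expanded to a straight vle8/vse8 copy, no loop.  */
void
copy_const (char *d, const char *s)
{
  memcpy (d, s, 32);
}

/* Variable length: expanded to one of the opaque loop patterns.  */
void
copy_var (char *d, const char *s, size_t n)
{
  memcpy (d, s, n);
}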

Regression tested on x86_64-pc-linux-gnu X
    riscv-sim
    riscv-sim/-march=rv32imafdcv_zicsr_zifencei_zfh_zba_zbb_zbc_zbs_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=ilp32f
    riscv-sim/-march=rv32imafdcv_zicsr_zifencei_zfh_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=ilp32
    riscv-sim/-march=rv32imafdcv_zicsr_zifencei_zfh_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=ilp32f
    riscv-sim/-march=rv32imfdcv_zicsr_zifencei_zfh_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=ilp32
    riscv-sim/-march=rv64imafdcv_zicsr_zifencei_zfh_zba_zbb_zbc_zbs_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=lp64d
    riscv-sim/-march=rv64imafdcv_zicsr_zifencei_zfh_zba_zbb_zbs_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=lp64d
    riscv-sim/-march=rv64imafdcv_zicsr_zifencei_zfh_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=lp64d
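
Any one of these configurations can be re-run on its own with an invocation
along the lines of (illustrative):

  make check-gcc RUNTESTFLAGS="--target_board='riscv-sim/-march=rv64imafdcv_zicsr_zifencei_zfh_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=lp64d'"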

[-- Attachment #2: cpymem-diff-20230718.txt --]
[-- Type: text/plain, Size: 12304 bytes --]

2023-07-12  Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
            Joern Rennecke  <joern.rennecke@embecosm.com>

	* config/riscv/riscv-protos.h (riscv_vector::expand_block_move):
	Declare.
	* config/riscv/riscv-v.cc (riscv_vector::expand_block_move):
	New function.
	* config/riscv/riscv.md (cpymemsi): Use riscv_vector::expand_block_move.
	* config/riscv/vector.md (@cpymem_straight<P:mode><V_WHOLE:mode>):
	New define_insn patterns.
	(@cpymem_loop<P:mode><V_WHOLE:mode>): Likewise.
	(@cpymem_loop_fast<P:mode><V_WHOLE:mode>): Likewise.

diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index 16fb8dabca0..40965a00681 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -301,6 +301,7 @@ bool slide1_sew64_helper (int, machine_mode, machine_mode,
 			  machine_mode, rtx *);
 rtx gen_avl_for_scalar_move (rtx);
 void expand_tuple_move (rtx *);
+bool expand_block_move (rtx, rtx, rtx);
 machine_mode preferred_simd_mode (scalar_mode);
 opt_machine_mode get_mask_mode (machine_mode);
 void expand_vec_series (rtx, rtx, rtx);
diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index b4884a30872..e61110fa3ad 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -49,6 +49,7 @@
 #include "tm-constrs.h"
 #include "rtx-vector-builder.h"
 #include "targhooks.h"
+#include "predict.h"
 
 using namespace riscv_vector;
 
@@ -2164,6 +2165,191 @@ expand_tuple_move (rtx *ops)
     }
 }
 
+/* Used by cpymemsi in riscv.md .  */
+
+bool
+expand_block_move (rtx dest_in, rtx src_in, rtx length_in)
+{
+  /*
+    memcpy:
+	mv a3, a0                       # Copy destination
+    loop:
+	vsetvli t0, a2, e8, m8, ta, ma  # Vectors of 8b
+	vle8.v v0, (a1)                 # Load bytes
+	add a1, a1, t0                  # Bump pointer
+	sub a2, a2, t0                  # Decrement count
+	vse8.v v0, (a3)                 # Store bytes
+	add a3, a3, t0                  # Bump pointer
+	bnez a2, loop                   # Any more?
+	ret                             # Return
+  */
+  if (!TARGET_VECTOR)
+    return false;
+  HOST_WIDE_INT potential_ew
+    = (MIN (MIN (MEM_ALIGN (src_in), MEM_ALIGN (dest_in)), BITS_PER_WORD)
+       / BITS_PER_UNIT);
+  machine_mode vmode = VOIDmode;
+  bool need_loop = true;
+  bool size_p = optimize_function_for_size_p (cfun);
+  rtx src, dst;
+  rtx end = gen_reg_rtx (Pmode);
+  rtx vec;
+  rtx length_rtx = length_in;
+
+  if (CONST_INT_P (length_in))
+    {
+      HOST_WIDE_INT length = INTVAL (length_in);
+
+    /* By using LMUL=8, we can copy as many bytes in one go as there
+       are bits in a vector register.  If the entire block thus fits,
+       we don't need a loop.  */
+    if (length <= TARGET_MIN_VLEN)
+      {
+	need_loop = false;
+
+	/* If a single scalar load / store pair can do the job, leave it
+	   to the scalar code to do that.  */
+
+	if (pow2p_hwi (length) && length <= potential_ew)
+	  return false;
+      }
+
+      /* Find the vector mode to use.  Using the largest possible element
+	 size is likely to give smaller constants, and thus potentially
+	 reducing code size.  However, if we need a loop, we need to update
+	 the pointers, and that is more complicated with a larger element
+	 size, unless we use an immediate, which prevents us from dynamically
+	 using the largets transfer size that the hart supports.  And then,
+	 unless we know the *exact* vector size of the hart, we'd need
+	 multiple vsetvli / branch statements, so it's not even a size win.
+	 If, in the future, we find an RISCV-V implementation that is slower
+	 for small element widths, we might allow larger element widths for
+	 loops too.  */
+      if (need_loop)
+	potential_ew = 1;
+      for (; potential_ew; potential_ew >>= 1)
+	{
+	  scalar_int_mode elem_mode;
+	  unsigned HOST_WIDE_INT bits = potential_ew * BITS_PER_UNIT;
+	  unsigned HOST_WIDE_INT per_iter;
+	  HOST_WIDE_INT nunits;
+
+	  if (need_loop)
+	    per_iter = TARGET_MIN_VLEN;
+	  else
+	    per_iter = length;
+	  nunits = per_iter / potential_ew;
+
+	  /* Unless we get an implementation that's slow for small element
+	     size / non-word-aligned accesses, we assume that the hardware
+	     handles this well, and we don't want to complicate the code
+	     with shifting word contents around or handling extra bytes at
+	     the start and/or end.  So we want the total transfer size and
+	     alignemnt to fit with the element size.  */
+	  if (length % potential_ew != 0
+	      || !int_mode_for_size (bits, 0).exists (&elem_mode))
+	    continue;
+	  /* Find the mode to use for the copy inside the loop - or the
+	     sole copy, if there is no loop.  */
+	  if (!need_loop)
+	    {
+	      /* Try if we have an exact mode for the copy.  */
+	      if (get_vector_mode (elem_mode, nunits).exists (&vmode))
+		break;
+	      /* We might have an odd transfer size.  Try to round it up to
+		 a power of two to get a valid vector mode for a clobber.  */
+	      for (nunits = 1ULL << ceil_log2 (nunits);
+		   nunits <= TARGET_MIN_VLEN;
+		   nunits <<= 1)
+		if (get_vector_mode (elem_mode, nunits).exists (&vmode))
+		  break;
+
+	      if (vmode != VOIDmode)
+		break;
+	    }
+
+	  // The VNx*?I modes have a factor of riscv_vector_chunks for nunits.
+	  if (get_vector_mode (elem_mode,
+			       TARGET_MIN_VLEN / potential_ew
+			       * riscv_vector_chunks).exists (&vmode))
+	    break;
+
+	  /* We may get here if we tried an element size that's larger than
+	     the hardware supports, but we should at least find a suitable
+	     byte vector mode.  */
+	  gcc_assert (potential_ew > 1);
+	}
+      if (potential_ew > 1)
+	length_rtx = GEN_INT (length / potential_ew);
+    }
+  else
+    {
+      vmode = (get_vector_mode (QImode, TARGET_MIN_VLEN * riscv_vector_chunks)
+	       .require ());
+    }
+
+  /* A memcpy libcall in the worst case takes 3 instructions to prepare the
+     arguments + 1 for the call.  Since the RVV loop sequence takes 7
+     instructions, a libcall may be preferable when optimizing for size.  */
+  if (size_p && need_loop)
+    return false;
+
+  /* If we don't need a loop and have a suitable mode to describe the size,
+     just do a load / store pair and leave it up to the later lazy code
+     motion pass to insert the appropriate vsetvli.  */
+  if (!need_loop && known_eq (GET_MODE_SIZE (vmode), INTVAL (length_in)))
+    {
+      vec = gen_reg_rtx (vmode);
+      src = change_address (src_in, vmode, NULL);
+      dst = change_address (dest_in, vmode, NULL);
+      emit_move_insn (vec, src);
+      emit_move_insn (dst, vec);
+      return true;
+    }
+
+  if (CONST_POLY_INT_P (length_rtx))
+    {
+      if (GET_MODE (length_rtx) != Pmode)
+	{
+	  poly_int64 value = rtx_to_poly_int64 (length_rtx);
+	  emit_insn (gen_rtx_SET (end,
+				  gen_int_mode (poly_int64 (value.coeffs[0],
+							    value.coeffs[1]),
+						Pmode)));
+	}
+      else
+	emit_insn (gen_rtx_SET (end, length_rtx));
+    }
+  else
+    {
+      if (GET_MODE (length_rtx) != Pmode)
+	riscv_emit_move (end, gen_lowpart (Pmode, length_rtx));
+      else
+	riscv_emit_move (end, length_rtx);
+    }
+
+  /* Move the address into scratch registers.  */
+  dst = copy_addr_to_reg (XEXP (dest_in, 0));
+  src = copy_addr_to_reg (XEXP (src_in, 0));
+
+  /* Since we haven't implemented VLA handling in general, we emit
+     opaque patterns that output the appropriate instructions.  */
+  if (!need_loop)
+    emit_insn (gen_cpymem_straight (Pmode, vmode, dst, src, end));
+  /* The *_fast pattern needs 13 instructions instead of 7, and
+     considering that this code is usually memory-constrained, limit this
+     to -O3.  ??? It would make sense to differentiate here between in-order
+     and OOO microarchitectures.  */
+  else if (!size_p && optimize >= 3)
+    emit_insn (gen_cpymem_loop_fast (Pmode, vmode, dst, src, end));
+  else
+    emit_insn (gen_cpymem_loop (Pmode, vmode, dst, src, end));
+
+  /* A nop to attach notes to.  */
+  emit_insn (gen_nop ());
+  return true;
+}
+
 /* Return the vectorization machine mode for RVV according to LMUL.  */
 machine_mode
 preferred_simd_mode (scalar_mode mode)
diff --git a/gcc/config/riscv/riscv.md b/gcc/config/riscv/riscv.md
index 7edef1fb546..4e596f42576 100644
--- a/gcc/config/riscv/riscv.md
+++ b/gcc/config/riscv/riscv.md
@@ -2141,7 +2141,9 @@
 	      (use (match_operand:SI 3 "const_int_operand"))])]
   ""
 {
-  if (riscv_expand_block_move (operands[0], operands[1], operands[2]))
+  if (riscv_vector::expand_block_move (operands[0], operands[1], operands[2]))
+    DONE;
+  else if (riscv_expand_block_move (operands[0], operands[1], operands[2]))
     DONE;
   else
     FAIL;
diff --git a/gcc/config/riscv/vector.md b/gcc/config/riscv/vector.md
index 215ecb9cb58..eee58a8ff71 100644
--- a/gcc/config/riscv/vector.md
+++ b/gcc/config/riscv/vector.md
@@ -837,6 +837,97 @@
   [(set_attr "type" "vmov,vlde,vste")
    (set_attr "mode" "<VT:MODE>")])
 
+;; The (use (and (match_dup 1) (const_int 127))) is here to prevent the
+;; optimizers from changing cpymem_loop_* into this.
+(define_insn "@cpymem_straight<P:mode><V_WHOLE:mode>"
+  [(set (mem:BLK (match_operand:P 0 "register_operand" "r,r"))
+	(mem:BLK (match_operand:P 1 "register_operand" "r,r")))
+	(use (and (match_dup 1) (const_int 127)))
+   (use (match_operand:P 2 "reg_or_int_operand" "r,K"))
+   (clobber (match_scratch:V_WHOLE 3 "=&vr,&vr"))
+   (clobber (reg:SI VL_REGNUM))
+   (clobber (reg:SI VTYPE_REGNUM))]
+  "TARGET_VECTOR"
+  "@vsetvli zero,%2,e<sew>,m8,ta,ma\;vle<sew>.v %3,(%1)\;vse<sew>.v %3,(%0)
+   vsetivli zero,%2,e<sew>,m8,ta,ma\;vle<sew>.v %3,(%1)\;vse<sew>.v %3,(%0)"
+)
+
+(define_insn "@cpymem_loop<P:mode><V_WHOLE:mode>"
+  [(set (mem:BLK (match_operand:P 0 "register_operand" "+r"))
+	(mem:BLK (match_operand:P 1 "register_operand" "+r")))
+   (use (match_operand:P 2 "register_operand" "+r"))
+   (clobber (match_scratch:V_WHOLE 3 "=&vr"))
+   (clobber (match_scratch:P 4 "=&r"))
+   (clobber (match_dup 0))
+   (clobber (match_dup 1))
+   (clobber (match_dup 2))
+   (clobber (reg:SI VL_REGNUM))
+   (clobber (reg:SI VTYPE_REGNUM))]
+  "TARGET_VECTOR"
+{ output_asm_insn ("\n0:\t" "vsetvli %4,%2,e<sew>,m8,ta,ma\;"
+		   "vle<sew>.v %3,(%1)\;"
+		   "sub %2,%2,%4", operands);
+  if (<sew> != 8)
+    {
+      rtx xop[2];
+      xop[0] = operands[4];
+      xop[1] = GEN_INT (exact_log2 (<sew>/8));
+      output_asm_insn ("slli %0,%0,%1", xop);
+    }
+  output_asm_insn ("add %1,%1,%4\;"
+		   "vse<sew>.v %3,(%0)\;"
+		   "add %0,%0,%4\;"
+		   "bnez %2,0b", operands);
+  return "";
+})
+
+;; This pattern (at bltu) assumes pointers can be treated as unsigned,
+;; i.e.  objects can't straddle 0xffffffffffffffff / 0x0000000000000000 .
+(define_insn "@cpymem_loop_fast<P:mode><V_WHOLE:mode>"
+  [(set (mem:BLK (match_operand:P 0 "register_operand" "+r"))
+	(mem:BLK (match_operand:P 1 "register_operand" "+r")))
+   (use (match_operand:P 2 "register_operand" "+r"))
+   (clobber (match_scratch:V_WHOLE 3 "=&vr"))
+   (clobber (match_scratch:P 4 "=&r"))
+   (clobber (match_scratch:P 5 "=&r"))
+   (clobber (match_scratch:P 6 "=&r"))
+   (clobber (match_dup 0))
+   (clobber (match_dup 1))
+   (clobber (match_dup 2))
+   (clobber (reg:SI VL_REGNUM))
+   (clobber (reg:SI VTYPE_REGNUM))]
+  "TARGET_VECTOR"
+{
+  output_asm_insn ("vsetvli %4,%2,e<sew>,m8,ta,ma\;"
+		   "beq %4,%2,1f\;"
+		   "add %5,%0,%2\;"
+		   "sub %6,%5,%4", operands);
+  if (<sew> != 8)
+    {
+      rtx xop[2];
+      xop[0] = operands[4];
+      xop[1] = GEN_INT (exact_log2 (<sew>/8));
+      output_asm_insn ("slli %0,%0,%1", xop);
+    }
+  output_asm_insn ("\n0:\t" "vle<sew>.v %3,(%1)\;"
+		   "add %1,%1,%4\;"
+		   "vse<sew>.v %3,(%0)\;"
+		   "add %0,%0,%4\;"
+		   "bltu %0,%6,0b\;"
+		   "sub %5,%5,%0", operands);
+  if (<sew> != 8)
+    {
+      rtx xop[2];
+      xop[0] = operands[4];
+      xop[1] = GEN_INT (exact_log2 (<sew>/8));
+      output_asm_insn ("srli %0,%0,%1", xop);
+    }
+  output_asm_insn ("vsetvli %4,%5,e<sew>,m8,ta,ma\n"
+	    "1:\t" "vle<sew>.v %3,(%1)\;"
+		   "vse<sew>.v %3,(%0)", operands);
+  return "";
+})
+
 ;; -----------------------------------------------------------------
 ;; ---- Duplicate Operations
 ;; -----------------------------------------------------------------

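To make the element-width selection in expand_block_move concrete: for a
15-byte copy from 4-byte-aligned pointers, potential_ew starts at 4 and falls
through 4 and 2 because 15 is a multiple of neither, so we end up with byte
elements and round nunits up to 16 to get a vector mode for the clobber.
A stand-alone sketch of just the width-selection step (simplified; the real
code also checks that suitable integer and vector modes exist):

#include <stdbool.h>

/* Start from the common alignment, capped at the word size, and fall back
   to smaller element widths until the length is a multiple of the width.  */
static unsigned
pick_element_width (unsigned length, unsigned align_bytes,
		    unsigned word_bytes, bool need_loop)
{
  unsigned ew = align_bytes < word_bytes ? align_bytes : word_bytes;
  if (need_loop)
    /* Larger elements would complicate the pointer updates in the loop.  */
    ew = 1;
  while (ew > 1 && length % ew != 0)
    ew >>= 1;
  return ew;
}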

* Re: cpymem for RISCV with v extension
  2023-07-18  4:47 cpymem for RISCV with v extension Joern Rennecke
@ 2023-08-04 20:52 ` Jeff Law
  2023-08-15  1:46   ` Joern Rennecke
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff Law @ 2023-08-04 20:52 UTC (permalink / raw)
  To: Joern Rennecke, GCC Patches



On 7/17/23 22:47, Joern Rennecke wrote:
> Subject:
> cpymem for RISCV with v extension
> From:
> Joern Rennecke <joern.rennecke@embecosm.com>
> Date:
> 7/17/23, 22:47
> 
> To:
> GCC Patches <gcc-patches@gcc.gnu.org>
> 
> 
> As discussed on last week's patch call, this patch uses either a
> straight copy or an opaque pattern that emits the loop as assembly to
> optimize cpymem for the 'v' extension.
> I used Ju-Zhe Zhong's patch - starting in git with:
> 
> Author: zhongjuzhe<66454988+zhongjuzhe@users.noreply.github.com>
> Date:   Mon Mar 21 14:20:42 2022 +0800
> 
>        PR for RVV support using splitted small chunks (#334)
> 
> as a starting point, even though not all that much of the original code remains.
> 
> Regression tested on x86_64-pc-linux-gnu X
>      riscv-sim
>      riscv-sim/-march=rv32imafdcv_zicsr_zifencei_zfh_zba_zbb_zbc_zbs_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=ilp32f
>      riscv-sim/-march=rv32imafdcv_zicsr_zifencei_zfh_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=ilp32
>      riscv-sim/-march=rv32imafdcv_zicsr_zifencei_zfh_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=ilp32f
>      riscv-sim/-march=rv32imfdcv_zicsr_zifencei_zfh_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=ilp32
>      riscv-sim/-march=rv64imafdcv_zicsr_zifencei_zfh_zba_zbb_zbc_zbs_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=lp64d
>      riscv-sim/-march=rv64imafdcv_zicsr_zifencei_zfh_zba_zbb_zbs_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=lp64d
>      riscv-sim/-march=rv64imafdcv_zicsr_zifencei_zfh_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=lp64d
> 
> 
> cpymem-diff-20230718.txt
> 
> 2023-07-12  Ju-Zhe Zhong<juzhe.zhong@rivai.ai>
>              Joern Rennecke<joern.rennecke@embecosm.com>
> 
> 	* config/riscv/riscv-protos.h (riscv_vector::expand_block_move):
> 	Declare.
> 	* config/riscv/riscv-v.cc (riscv_vector::expand_block_move):
> 	New function.
> 	* config/riscv/riscv.md (cpymemsi): Use riscv_vector::expand_block_move.
> 	* config/riscv/vector.md (@cpymem_straight<P:mode><V_WHOLE:mode>):
> 	New define_insn patterns.
> 	(@cpymem_loop<P:mode><V_WHOLE:mode>): Likewise.
> 	(@cpymem_loop_fast<P:mode><V_WHOLE:mode>): Likewise.
> 

> diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
> index b4884a30872..e61110fa3ad 100644
> --- a/gcc/config/riscv/riscv-v.cc
> +++ b/gcc/config/riscv/riscv-v.cc
> @@ -49,6 +49,7 @@
>   #include "tm-constrs.h"
>   #include "rtx-vector-builder.h"
>   #include "targhooks.h"
> +#include "predict.h"
Not sure this is needed, but I didn't scan for it explicitly.  If it's 
not needed, then remove it.



> +  if (CONST_INT_P (length_in))
> +    {
> +      HOST_WIDE_INT length = INTVAL (length_in);
> +
> +    /* By using LMUL=8, we can copy as many bytes in one go as there
> +       are bits in a vector register.  If the entire block thus fits,
> +       we don't need a loop.  */
> +    if (length <= TARGET_MIN_VLEN)
> +      {
> +	need_loop = false;
> +
> +	/* If a single scalar load / store pair can do the job, leave it
> +	   to the scalar code to do that.  */
> +
> +	if (pow2p_hwi (length) && length <= potential_ew)
> +	  return false;
> +      }
We could probably argue over the threshold for doing the copy on the 
scalar side, but I don't think it's necessary.  Once we start seeing V 
hardware we can revisit.
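
(For the record, the current cutoff just means that e.g. an 8-byte copy
between 8-byte-aligned pointers stays on the scalar path, which on rv64 is
roughly

	ld	a5,0(a1)
	sd	a5,0(a0)

with the register choice being illustrative.)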


> +
> +      /* Find the vector mode to use.  Using the largest possible element
> +	 size is likely to give smaller constants, and thus potentially
> +	 reducing code size.  However, if we need a loop, we need to update
> +	 the pointers, and that is more complicated with a larger element
> +	 size, unless we use an immediate, which prevents us from dynamically
> +	 using the largets transfer size that the hart supports.  And then,
> +	 unless we know the*exact*  vector size of the hart, we'd need
> +	 multiple vsetvli / branch statements, so it's not even a size win.
> +	 If, in the future, we find an RISCV-V implementation that is slower
> +	 for small element widths, we might allow larger element widths for
> +	 loops too.  */
s/largets/largest/

And a space is missing in "the*exact*".

Note that I think the proposed glibc copier does allow larger element
widths for this case.

> +
> +	  /* Unless we get an implementation that's slow for small element
> +	     size / non-word-aligned accesses, we assume that the hardware
> +	     handles this well, and we don't want to complicate the code
> +	     with shifting word contents around or handling extra bytes at
> +	     the start and/or end.  So we want the total transfer size and
> +	     alignemnt to fit with the element size.  */
s/alignemnt/alignment/

Yes, let's not try to support every uarch we can envision; instead let's do
a good job on the uarches we know about.  If a uarch with slow small-element
or non-word-aligned accesses comes along, they can propose changes at that
time.



> +
> +	  // The VNx*?I modes have a factor of riscv_vector_chunks for nunits.
Comment might need updating after the recent work to adjust the modes. 
I don't recall if we kept the VNx*?I modes or not.

The needed adjustments are all comment-related, so this is OK after fixing
the comments.  Just post the update for archival purposes and consider it
pre-approved for the trunk.

Thanks, and sorry for the wait, Joern.

jeff


* Re: cpymem for RISCV with v extension
  2023-08-04 20:52 ` Jeff Law
@ 2023-08-15  1:46   ` Joern Rennecke
  2023-08-15 13:46     ` Jeff Law
  0 siblings, 1 reply; 8+ messages in thread
From: Joern Rennecke @ 2023-08-15  1:46 UTC (permalink / raw)
  To: Jeff Law; +Cc: GCC Patches

On Fri, 4 Aug 2023 at 21:52, Jeff Law <jeffreyalaw@gmail.com> wrote:

> > diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
> > index b4884a30872..e61110fa3ad 100644
> > --- a/gcc/config/riscv/riscv-v.cc
> > +++ b/gcc/config/riscv/riscv-v.cc
> > @@ -49,6 +49,7 @@
> >   #include "tm-constrs.h"
> >   #include "rtx-vector-builder.h"
> >   #include "targhooks.h"
> > +#include "predict.h"
> Not sure this is needed, but I didn't scan for it explicitly.  If it's
> not needed, then remove it.

It is needed to declare optimize_function_for_size_p.


* Re: cpymem for RISCV with v extension
  2023-08-15  1:46   ` Joern Rennecke
@ 2023-08-15 13:46     ` Jeff Law
  0 siblings, 0 replies; 8+ messages in thread
From: Jeff Law @ 2023-08-15 13:46 UTC (permalink / raw)
  To: Joern Rennecke; +Cc: GCC Patches



On 8/14/23 19:46, Joern Rennecke wrote:
> On Fri, 4 Aug 2023 at 21:52, Jeff Law <jeffreyalaw@gmail.com> wrote:
> 
>>> diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
>>> index b4884a30872..e61110fa3ad 100644
>>> --- a/gcc/config/riscv/riscv-v.cc
>>> +++ b/gcc/config/riscv/riscv-v.cc
>>> @@ -49,6 +49,7 @@
>>>    #include "tm-constrs.h"
>>>    #include "rtx-vector-builder.h"
>>>    #include "targhooks.h"
>>> +#include "predict.h"
>> Not sure this is needed, but I didn't scan for it explicitly.  If it's
>> not needed, then remove it.
> 
> It is needed to declare optimize_function_for_size_p .
Obviously a trivial nit.  Thanks for tracking it down.

jeff


* Re: cpymem for RISCV with v extension
  2023-08-15  9:16       ` juzhe.zhong
@ 2023-08-15 14:06         ` Jeff Law
  0 siblings, 0 replies; 8+ messages in thread
From: Jeff Law @ 2023-08-15 14:06 UTC (permalink / raw)
  To: juzhe.zhong, joern.rennecke
  Cc: gcc-patches, kito.cheng, Kito.cheng, Robin Dapp



On 8/15/23 03:16, juzhe.zhong@rivai.ai wrote:
> The new patch looks reasonable to me now.  Thanks for fixing it.
> 
> Could you append a testcase after finishing the test infrastructure?
> I would prefer this patch to land with a testcase, once the infrastructure is in place.
So let's call this an ACK, but ask that Joern not commit until the 
testsuite bits are in place.


jeff


* Re: cpymem for RISCV with v extension
  2023-08-15  8:12     ` Joern Rennecke
  2023-08-15  9:16       ` juzhe.zhong
@ 2023-08-15 14:04       ` Jeff Law
  1 sibling, 0 replies; 8+ messages in thread
From: Jeff Law @ 2023-08-15 14:04 UTC (permalink / raw)
  To: Joern Rennecke, 钟居哲
  Cc: gcc-patches, kito.cheng, kito.cheng, rdapp.gcc



On 8/15/23 02:12, Joern Rennecke wrote:

> It lacks the strength reduction of the opaque pattern version for -O3,
> though.  Would people also like to see that expanded into RTL?  Or
> should I just drop in the opaque pattern for that?  Or not at all,
> because everyone uses Superscalar Out-Of-Order execution?
I doubt it's going to matter all that much.  Your decision IMHO.  I'd 
like to think everyone implementing V will be OOO superscalar, but I'm 
not naive enough to believe that will hold in practice (even with the P 
extension on the way).
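
(For context, for e8 and with illustrative register assignments (a0 = dest,
a1 = src, a2 = byte count, a4/a5/a6 = scratch, v8 = vector scratch) the -O3
opaque pattern emits roughly:

	vsetvli	a4,a2,e8,m8,ta,ma
	beq	a4,a2,1f
	add	a5,a0,a2
	sub	a6,a5,a4
0:	vle8.v	v8,(a1)
	add	a1,a1,a4
	vse8.v	v8,(a0)
	add	a0,a0,a4
	bltu	a0,a6,0b
	sub	a5,a5,a0
	vsetvli	a4,a5,e8,m8,ta,ma
1:	vle8.v	v8,(a1)
	vse8.v	v8,(a0)

i.e. the vsetvli stays out of the main loop and the tail is handled with a
second vsetvli.)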

jeff


* Re: cpymem for RISCV with v extension
  2023-08-04 23:10 钟居哲
@ 2023-08-04 23:17 ` Jeff Law
  2023-08-04 23:34   ` 钟居哲
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff Law @ 2023-08-04 23:17 UTC (permalink / raw)
  To: 钟居哲, gcc-patches
  Cc: kito.cheng, kito.cheng, rdapp.gcc, Joern Rennecke



On 8/4/23 17:10, 钟居哲 wrote:
> Could you add testcases for this patch?
Testing what specifically?  Are you asking for correctness tests, 
performance/code quality tests?


> 
> +;; The (use (and (match_dup 1) (const_int 127))) is here to prevent the
> +;; optimizers from changing cpymem_loop_* into this.
> +(define_insn "@cpymem_straight<P:mode><V_WHOLE:mode>"
> +  [(set (mem:BLK (match_operand:P 0 "register_operand" "r,r"))
> +	(mem:BLK (match_operand:P 1 "register_operand" "r,r")))
> +	(use (and (match_dup 1) (const_int 127)))
> +   (use (match_operand:P 2 "reg_or_int_operand" "r,K"))
> +   (clobber (match_scratch:V_WHOLE 3 "=&vr,&vr"))
> +   (clobber (reg:SI VL_REGNUM))
> +   (clobber (reg:SI VTYPE_REGNUM))]
> +  "TARGET_VECTOR"
> +  "@vsetvli zero,%2,e<sew>,m8,ta,ma\;vle<sew>.v %3,(%1)\;vse<sew>.v %3,(%0)
> +   vsetivli zero,%2,e<sew>,m8,ta,ma\;vle<sew>.v %3,(%1)\;vse<sew>.v %3,(%0)"
> +)
> +
> +(define_insn "@cpymem_loop<P:mode><V_WHOLE:mode>"
> +  [(set (mem:BLK (match_operand:P 0 "register_operand" "+r"))
> +	(mem:BLK (match_operand:P 1 "register_operand" "+r")))
> +   (use (match_operand:P 2 "register_operand" "+r"))
> +   (clobber (match_scratch:V_WHOLE 3 "=&vr"))
> +   (clobber (match_scratch:P 4 "=&r"))
> +   (clobber (match_dup 0))
> +   (clobber (match_dup 1))
> +   (clobber (match_dup 2))
> +   (clobber (reg:SI VL_REGNUM))
> +   (clobber (reg:SI VTYPE_REGNUM))]
> +  "TARGET_VECTOR"
> +{ output_asm_insn ("\n0:\t" "vsetvli %4,%2,e<sew>,m8,ta,ma\;"
> +		   "vle<sew>.v %3,(%1)\;"
> +		   "sub %2,%2,%4", operands);
> +  if (<sew> != 8)
> +    {
> +      rtx xop[2];
> +      xop[0] = operands[4];
> +      xop[1] = GEN_INT (exact_log2 (<sew>/8));
> +      output_asm_insn ("slli %0,%0,%1", xop);
> +    }
> +  output_asm_insn ("add %1,%1,%4\;"
> +		   "vse<sew>.v %3,(%0)\;"
> +		   "add %0,%0,%4\;"
> +		   "bnez %2,0b", operands);
> +  return "";
> +})
> +
> +;; This pattern (at bltu) assumes pointers can be treated as unsigned,
> +;; i.e.  objects can't straddle 0xffffffffffffffff / 0x0000000000000000 .
> +(define_insn "@cpymem_loop_fast<P:mode><V_WHOLE:mode>"
> +  [(set (mem:BLK (match_operand:P 0 "register_operand" "+r"))
> +	(mem:BLK (match_operand:P 1 "register_operand" "+r")))
> +   (use (match_operand:P 2 "register_operand" "+r"))
> +   (clobber (match_scratch:V_WHOLE 3 "=&vr"))
> +   (clobber (match_scratch:P 4 "=&r"))
> +   (clobber (match_scratch:P 5 "=&r"))
> +   (clobber (match_scratch:P 6 "=&r"))
> +   (clobber (match_dup 0))
> +   (clobber (match_dup 1))
> +   (clobber (match_dup 2))
> +   (clobber (reg:SI VL_REGNUM))
> +   (clobber (reg:SI VTYPE_REGNUM))]
> +  "TARGET_VECTOR"
> +{
> +  output_asm_insn ("vsetvli %4,%2,e<sew>,m8,ta,ma\;"
> +		   "beq %4,%2,1f\;"
> +		   "add %5,%0,%2\;"
> +		   "sub %6,%5,%4", operands);
> +  if (<sew> != 8)
> +    {
> +      rtx xop[2];
> +      xop[0] = operands[4];
> +      xop[1] = GEN_INT (exact_log2 (<sew>/8));
> +      output_asm_insn ("slli %0,%0,%1", xop);
> +    }
> +  output_asm_insn ("\n0:\t" "vle<sew>.v %3,(%1)\;"
> +		   "add %1,%1,%4\;"
> +		   "vse<sew>.v %3,(%0)\;"
> +		   "add %0,%0,%4\;"
>>> 		   "bltu %0,%6,0b\;"
>>> 		   "sub %5,%5,%0", operands);
>>>   if (<sew> != 8)
>>>     {
>>>       rtx xop[2];
>>>       xop[0] = operands[4];
>>>       xop[1] = GEN_INT (exact_log2 (<sew>/8));
>>>       output_asm_insn ("srli %0,%0,%1", xop);
>>>      }
>>>   output_asm_insn ("vsetvli %4,%5,e<sew>,m8,ta,ma\n"
>>> 	    "1:\t" "vle<sew>.v %3,(%1)\;"
>>> 		   "vse<sew>.v %3,(%0)", operands);
>>>   return "";
>>> })
> 
> I don't think they are necessary.
What specifically do you think is not necessary?


> 
>>> Just post the update for archival purposes and consider
>>> it pre-approved for the trunk.
> 
> I am sorry, but I disagree with approving this patch so quickly.
Umm, this patch has been queued up for at least a couple weeks now.

> 
> It should be well tested.
If you refer to Joern's message, he indicated how it was tested.  Joern
is a long-time GCC developer and is well aware of how to test code.


It was tested on this set of multilibs without regressions:

>    riscv-sim
>     riscv-sim/-march=rv32imafdcv_zicsr_zifencei_zfh_zba_zbb_zbc_zbs_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=ilp32f
>     riscv-sim/-march=rv32imafdcv_zicsr_zifencei_zfh_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=ilp32
>     riscv-sim/-march=rv32imafdcv_zicsr_zifencei_zfh_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=ilp32f
>     riscv-sim/-march=rv32imfdcv_zicsr_zifencei_zfh_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=ilp32
>     riscv-sim/-march=rv64imafdcv_zicsr_zifencei_zfh_zba_zbb_zbc_zbs_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=lp64d
>     riscv-sim/-march=rv64imafdcv_zicsr_zifencei_zfh_zba_zbb_zbs_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=lp64d
>     riscv-sim/-march=rv64imafdcv_zicsr_zifencei_zfh_zve32f_zve32x_zve64d_zve64f_zve64x_zvl128b_zvl32b_zvl64b/-mabi=lp64d




> 
> 
> We should at least test the following 2 situations:
> 
> 1. an unknown number of bytes to be copied; the codegen should be as follows:
> 
> vsetvl a5,a2,e8,m8,ta,ma
> 
> vle
> 
> vse
> 
> bump counter
> 
> branch
> 
> 2. a known number of bytes to be copied, where the number of bytes allows
> us to find a VLS mode to hold it.
> 
> For example, a memcpy of 16 QImode bytes.
> 
> Then we can use V16QImode directly; the codegen should be:
> 
> vsetvli zero,16,....
> 
> vle
> 
> vse
> 
> Three simple instructions are enough.
> 
> 
> This patch should be well tested with these 2 situations before being
> approved, since LLVM does the same thing.
> 
> We should be able to have the same behavior as LLVM.
I'm not sure that's strictly necessary and I don't mind iterating a bit 
on performance issues as long as we don't have correctness problems.

But since you've raised concerns -- Joern, don't install until we've
resolved the questions at hand.  Thanks.

jeff


* cpymem for RISCV with v extension
@ 2023-08-04 23:10 钟居哲
  2023-08-04 23:17 ` Jeff Law
  0 siblings, 1 reply; 8+ messages in thread
From: 钟居哲 @ 2023-08-04 23:10 UTC (permalink / raw)
  To: gcc-patches; +Cc: kito.cheng, kito.cheng, Jeff Law, rdapp.gcc

[-- Attachment #1: Type: text/plain, Size: 4537 bytes --]

Could you add testcases for this patch?

+;; The (use (and (match_dup 1) (const_int 127))) is here to prevent the
+;; optimizers from changing cpymem_loop_* into this.
+(define_insn "@cpymem_straight<P:mode><V_WHOLE:mode>"
+  [(set (mem:BLK (match_operand:P 0 "register_operand" "r,r"))
+	(mem:BLK (match_operand:P 1 "register_operand" "r,r")))
+	(use (and (match_dup 1) (const_int 127)))
+   (use (match_operand:P 2 "reg_or_int_operand" "r,K"))
+   (clobber (match_scratch:V_WHOLE 3 "=&vr,&vr"))
+   (clobber (reg:SI VL_REGNUM))
+   (clobber (reg:SI VTYPE_REGNUM))]
+  "TARGET_VECTOR"
+  "@vsetvli zero,%2,e<sew>,m8,ta,ma\;vle<sew>.v %3,(%1)\;vse<sew>.v %3,(%0)
+   vsetivli zero,%2,e<sew>,m8,ta,ma\;vle<sew>.v %3,(%1)\;vse<sew>.v %3,(%0)"
+)
+
+(define_insn "@cpymem_loop<P:mode><V_WHOLE:mode>"
+  [(set (mem:BLK (match_operand:P 0 "register_operand" "+r"))
+	(mem:BLK (match_operand:P 1 "register_operand" "+r")))
+   (use (match_operand:P 2 "register_operand" "+r"))
+   (clobber (match_scratch:V_WHOLE 3 "=&vr"))
+   (clobber (match_scratch:P 4 "=&r"))
+   (clobber (match_dup 0))
+   (clobber (match_dup 1))
+   (clobber (match_dup 2))
+   (clobber (reg:SI VL_REGNUM))
+   (clobber (reg:SI VTYPE_REGNUM))]
+  "TARGET_VECTOR"
+{ output_asm_insn ("\n0:\t" "vsetvli %4,%2,e<sew>,m8,ta,ma\;"
+		   "vle<sew>.v %3,(%1)\;"
+		   "sub %2,%2,%4", operands);
+  if (<sew> != 8)
+    {
+      rtx xop[2];
+      xop[0] = operands[4];
+      xop[1] = GEN_INT (exact_log2 (<sew>/8));
+      output_asm_insn ("slli %0,%0,%1", xop);
+    }
+  output_asm_insn ("add %1,%1,%4\;"
+		   "vse<sew>.v %3,(%0)\;"
+		   "add %0,%0,%4\;"
+		   "bnez %2,0b", operands);
+  return "";
+})
+
+;; This pattern (at bltu) assumes pointers can be treated as unsigned,
+;; i.e.  objects can't straddle 0xffffffffffffffff / 0x0000000000000000 .
+(define_insn "@cpymem_loop_fast<P:mode><V_WHOLE:mode>"
+  [(set (mem:BLK (match_operand:P 0 "register_operand" "+r"))
+	(mem:BLK (match_operand:P 1 "register_operand" "+r")))
+   (use (match_operand:P 2 "register_operand" "+r"))
+   (clobber (match_scratch:V_WHOLE 3 "=&vr"))
+   (clobber (match_scratch:P 4 "=&r"))
+   (clobber (match_scratch:P 5 "=&r"))
+   (clobber (match_scratch:P 6 "=&r"))
+   (clobber (match_dup 0))
+   (clobber (match_dup 1))
+   (clobber (match_dup 2))
+   (clobber (reg:SI VL_REGNUM))
+   (clobber (reg:SI VTYPE_REGNUM))]
+  "TARGET_VECTOR"
+{
+  output_asm_insn ("vsetvli %4,%2,e<sew>,m8,ta,ma\;"
+		   "beq %4,%2,1f\;"
+		   "add %5,%0,%2\;"
+		   "sub %6,%5,%4", operands);
+  if (<sew> != 8)
+    {
+      rtx xop[2];
+      xop[0] = operands[4];
+      xop[1] = GEN_INT (exact_log2 (<sew>/8));
+      output_asm_insn ("slli %0,%0,%1", xop);
+    }
+  output_asm_insn ("\n0:\t" "vle<sew>.v %3,(%1)\;"
+		   "add %1,%1,%4\;"
+		   "vse<sew>.v %3,(%0)\;"
+		   "add %0,%0,%4\;"
>>  		   "bltu %0,%6,0b\;"
>>  		   "sub %5,%5,%0", operands);
>>   if (<sew> != 8)
>>     {
>>       rtx xop[2];
>>       xop[0] = operands[4];
>>       xop[1] = GEN_INT (exact_log2 (<sew>/8));
>>       output_asm_insn ("srli %0,%0,%1", xop);
>>      }
>>   output_asm_insn ("vsetvli %4,%5,e<sew>,m8,ta,ma\n"
>>  	    "1:\t" "vle<sew>.v %3,(%1)\;"
>>  		   "vse<sew>.v %3,(%0)", operands);
>>   return "";
>>  })
I don't think they are necessary.

>>      considering that this code is usually memory-constrained, limit this
>>      to -O3.  ??? It would make sense to differentiate here between in-order
>>     and OOO microarchitectures.  */
>>     else if (!size_p && optimize >= 3)
>>       emit_insn (gen_cpymem_loop_fast (Pmode, vmode, dst, src, end));
>>      else
>>       emit_insn (gen_cpymem_loop (Pmode, vmode, dst, src, end));
Why not just emit an RVV pattern?
>> Just post the update for archival purposes and consider
>> it pre-approved for the trunk.

I am sorry, but I disagree with approving this patch so quickly.
It should be well tested.

We should at least test the following 2 situations:

1. an unknown number of bytes to be copied; the codegen should be as follows:

   vsetvl a5,a2,e8,m8,ta,ma
   vle
   vse
   bump counter
   branch

2. a known number of bytes to be copied, where the number of bytes allows us
   to find a VLS mode to hold it.

   For example, a memcpy of 16 QImode bytes.

   Then we can use V16QImode directly; the codegen should be:

   vsetvli zero,16,....
   vle
   vse

   Three simple instructions are enough.

This patch should be well tested with these 2 situations before being
approved, since LLVM does the same thing.  We should be able to have the
same behavior as LLVM.
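
A test along these lines could cover both cases (the dg-options and scan
patterns here are only a sketch and would need adapting to whatever test
infrastructure we settle on):

/* { dg-do compile } */
/* { dg-options "-march=rv64gcv -mabi=lp64d -O2" } */

typedef __SIZE_TYPE__ size_t;

/* Case 1: unknown length; expect the vsetvli / vle / vse loop.  */
void
copy_var (char *d, const char *s, size_t n)
{
  __builtin_memcpy (d, s, n);
}

/* Case 2: known 16-byte length; expect a single vle / vse pair.  */
void
copy_16 (char *d, const char *s)
{
  __builtin_memcpy (d, s, 16);
}

/* { dg-final { scan-assembler "vsetv" } } */
/* { dg-final { scan-assembler "vle8\\.v" } } */
/* { dg-final { scan-assembler "vse8\\.v" } } */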


juzhe.zhong@rivai.ai


Thread overview: 8+ messages
2023-07-18  4:47 cpymem for RISCV with v extension Joern Rennecke
2023-08-04 20:52 ` Jeff Law
2023-08-15  1:46   ` Joern Rennecke
2023-08-15 13:46     ` Jeff Law
2023-08-04 23:10 钟居哲
2023-08-04 23:17 ` Jeff Law
2023-08-04 23:34   ` 钟居哲
2023-08-15  8:12     ` Joern Rennecke
2023-08-15  9:16       ` juzhe.zhong
2023-08-15 14:06         ` Jeff Law
2023-08-15 14:04       ` Jeff Law
