From: Joern Rennecke <joern.rennecke@embecosm.com>
To: 钟居哲 <juzhe.zhong@rivai.ai>
Cc: Jeff Law <jeffreyalaw@gmail.com>,
	gcc-patches <gcc-patches@gcc.gnu.org>,
	 "kito.cheng" <kito.cheng@gmail.com>,
	"kito.cheng" <kito.cheng@sifive.com>,
	 "rdapp.gcc" <rdapp.gcc@gmail.com>
Subject: Re: Re: cpymem for RISCV with v extension
Date: Tue, 15 Aug 2023 09:12:46 +0100	[thread overview]
Message-ID: <CAMqJFCrCj3+0+ZN=e6i9URF-1q=J+9ScEcyL3uLQMnCs1=L2pQ@mail.gmail.com> (raw)
In-Reply-To: <960B8F718A41FF46+2023080507344044042924@rivai.ai>

[-- Attachment #1: Type: text/plain, Size: 2018 bytes --]

On Sat, 5 Aug 2023 at 00:35, 钟居哲 <juzhe.zhong@rivai.ai> wrote:
>
> >> Testing what specifically?  Are you asking for correctness tests,
> >> performance/code quality tests?
>
> Add memcpy test using RVV instructions, just like we are adding testcases for auto-vectorization support.

I wanted to get the test infrastructure in first.
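
Something like the following is the kind of test I have in mind (an
untested sketch; the dg-options and scan patterns are illustrative
placeholders, not what will actually get committed):

/* { dg-do compile } */
/* { dg-options "-O2 -march=rv64gcv -mabi=lp64d" } */

#include <string.h>

void
copy (char *a, const char *b, int n)
{
  memcpy (a, b, n);
}

/* { dg-final { scan-assembler "vsetvli" } } */
/* { dg-final { scan-assembler "vle8\\.v" } } */
/* { dg-final { scan-assembler "vse8\\.v" } } */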

> void foo (int32_t * a, int32_t * b, int num)
> {
>   memcpy (a, b, num);
> }
>
>
> In my downstream LLVM/GCC codegen:
> foo:
> .L2:
>         vsetvli a5,a2,e8,m8,ta,ma
>         vle8.v  v24,(a1)
>         sub     a2,a2,a5
>         vse8.v  v24,(a0)
>         add     a1,a1,a5
>         add     a0,a0,a5
>         bne     a2,zero,.L2
>         ret

Yeah, it does that.

>
> Another example:
> void foo (int32_t * a, int32_t * b, int num)
> {
>   memcpy (a, b, 4);
> }
>
>
> My downstream LLVM/GCC assembly:
>
> foo:
> vsetvli zero,16,e8,m1,ta,ma
> vle8.v v24,(a1)
> vse8.v v24,(a0)
> ret

Copying 16 bytes when asked to copy 4 is problematic; mine copies 4.

Note also for:
typedef struct { int a[31]; } s;

void foo (s *a, s *b)
{
  *a = *b;
}

You get:

        vsetivli        zero,31,e32,m8,ta,ma
        vle32.v v8,0(a1)
        vse32.v v8,0(a0)

Using memcpy, the compiler unfortunately discards the alignment.
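
To make that concrete, here is a hypothetical pair of functions (the
names are made up for illustration): the struct assignment lets the
expander see the 4-byte alignment of s and use e32 as above, while the
memcpy version loses that information, so the expander has to assume
byte alignment and fall back to e8:

#include <string.h>

typedef struct { int a[31]; } s;

/* Struct assignment: the alignment of s is visible, allowing e32.  */
void copy_assign (s *a, s *b)
{
  *a = *b;
}

/* memcpy: the alignment is discarded, forcing the expander to assume
   e8 elements.  */
void copy_memcpy (s *a, s *b)
{
  memcpy (a, b, sizeof (s));
}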

> emit_insn (gen_pred_store...)

Thanks for pointing me in the right direction.  From the naming of the
patterns, the dearth of comments, and the default behaviour of the
compiler when optimizing with generic optimization options (i.e. no
vectorization), I had assumed that the infrastructure was still
missing.

I have attached a re-worked patch that uses pred_mov / pred_store and
is adapted to the refactored modes.
It lacks the strength reduction of the opaque pattern version for -O3,
though.  Would people also like to see that expanded into RTL?  Or
should I just drop in the opaque pattern for that?  Or not at all,
because everyone uses superscalar out-of-order execution?

[-- Attachment #2: cpymem-20230815.txt --]
[-- Type: text/plain, Size: 9251 bytes --]

commit 1f4b7a8e6798acab1f79de38e85d9d080a76eb4a
Author: Joern Rennecke <joern.rennecke@embecosm.com>
Date:   Tue Aug 15 08:18:53 2023 +0100

    cpymem using pred_mov / pred_store and adapted to mode refactoring.
    
    2023-07-12  Ju-Zhe Zhong  <juzhe.zhong@rivai.ai>
                Joern Rennecke  <joern.rennecke@embecosm.com>
    
    gcc/
            * config/riscv/riscv-protos.h (riscv_vector::expand_block_move):
            Declare.
            * config/riscv/riscv-v.cc (riscv_vector::expand_block_move):
            New function.
            * config/riscv/riscv.md (cpymemsi): Use riscv_vector::expand_block_move.
            Change to ..
            (cpymem<P:mode>) .. this.

diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index 2fbed04ff84..70ffdcdf180 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -315,6 +315,7 @@ bool slide1_sew64_helper (int, machine_mode, machine_mode,
 			  machine_mode, rtx *);
 rtx gen_avl_for_scalar_move (rtx);
 void expand_tuple_move (rtx *);
+bool expand_block_move (rtx, rtx, rtx);
 machine_mode preferred_simd_mode (scalar_mode);
 machine_mode get_mask_mode (machine_mode);
 void expand_vec_series (rtx, rtx, rtx);
diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index 5f9b296c92e..ea96a0ef84d 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -49,6 +49,7 @@
 #include "tm-constrs.h"
 #include "rtx-vector-builder.h"
 #include "targhooks.h"
+#include "predict.h"
 
 using namespace riscv_vector;
 
@@ -2379,6 +2380,192 @@ expand_tuple_move (rtx *ops)
     }
 }
 
+/* Used by cpymem<P:mode> in riscv.md.  */
+
+bool
+expand_block_move (rtx dst_in, rtx src_in, rtx length_in)
+{
+  /*
+    memcpy:
+	mv a3, a0                       # Copy destination
+    loop:
+	vsetvli t0, a2, e8, m8, ta, ma  # Vectors of 8b
+	vle8.v v0, (a1)                 # Load bytes
+	add a1, a1, t0                  # Bump pointer
+	sub a2, a2, t0                  # Decrement count
+	vse8.v v0, (a3)                 # Store bytes
+	add a3, a3, t0                  # Bump pointer
+	bnez a2, loop                   # Any more?
+	ret                             # Return
+  */
+  if (!TARGET_VECTOR)
+    return false;
+  HOST_WIDE_INT potential_ew
+    = (MIN (MIN (MEM_ALIGN (src_in), MEM_ALIGN (dst_in)), BITS_PER_WORD)
+       / BITS_PER_UNIT);
+  machine_mode vmode = VOIDmode;
+  bool need_loop = true;
+  bool size_p = optimize_function_for_size_p (cfun);
+  rtx src, dst;
+  rtx end = gen_reg_rtx (Pmode);
+  rtx vec;
+  rtx length_rtx = length_in;
+
+  if (CONST_INT_P (length_in))
+    {
+      HOST_WIDE_INT length = INTVAL (length_in);
+
+      /* By using LMUL=8, we can copy as many bytes in one go as there
+	 are bits in a vector register.  If the entire block thus fits,
+	 we don't need a loop.  */
+      if (length <= TARGET_MIN_VLEN)
+	{
+	  need_loop = false;
+
+	  /* If a single scalar load / store pair can do the job, leave it
+	     to the scalar code to do that.  */
+
+	  if (pow2p_hwi (length) && length <= potential_ew)
+	    return false;
+	}
+
+      /* Find the vector mode to use.  Using the largest possible element
+	 size is likely to give smaller constants, and thus potentially
+	 reduce code size.  However, if we need a loop, we need to update
+	 the pointers, and that is more complicated with a larger element
+	 size, unless we use an immediate, which prevents us from dynamically
+	 using the target's transfer size that the hart supports.  And then,
+	 unless we know the *exact* vector size of the hart, we'd need
+	 multiple vsetvli / branch statements, so it's not even a size win.
+	 If, in the future, we find an RVV implementation that is slower for
+	 small element widths, we might allow larger element widths for
+	 loops too.  */
+      if (need_loop)
+	potential_ew = 1;
+      for (; potential_ew; potential_ew >>= 1)
+	{
+	  scalar_int_mode elem_mode;
+	  unsigned HOST_WIDE_INT bits = potential_ew * BITS_PER_UNIT;
+	  unsigned HOST_WIDE_INT per_iter;
+	  HOST_WIDE_INT nunits;
+
+	  if (need_loop)
+	    per_iter = TARGET_MIN_VLEN;
+	  else
+	    per_iter = length;
+	  nunits = per_iter / potential_ew;
+
+	  /* Unless we get an implementation that's slow for small element
+	     size / non-word-aligned accesses, we assume that the hardware
+	     handles this well, and we don't want to complicate the code
+	     with shifting word contents around or handling extra bytes at
+	     the start and/or end.  So we want the total transfer size and
+	     alignment to fit with the element size.  */
+	  if (length % potential_ew != 0
+	      || !int_mode_for_size (bits, 0).exists (&elem_mode))
+	    continue;
+	  /* Find the mode to use for the copy inside the loop - or the
+	     sole copy, if there is no loop.  */
+	  if (!need_loop)
+	    {
+	      /* Try if we have an exact mode for the copy.  */
+	      if (get_vector_mode (elem_mode, nunits).exists (&vmode))
+		break;
+	      /* We might have an odd transfer size.  Try to round it up to
+		 a power of two to get a valid vector mode for a clobber.  */
+	      for (nunits = 1ULL << ceil_log2 (nunits);
+		   nunits <= TARGET_MIN_VLEN;
+		   nunits <<= 1)
+		if (get_vector_mode (elem_mode, nunits).exists (&vmode))
+		  break;
+
+	      if (vmode != VOIDmode)
+		break;
+	    }
+
+	  /* The RVVM8?I modes are notionally 8 * BYTES_PER_RISCV_VECTOR bytes
+	     wide.  BYTES_PER_RISCV_VECTOR can't be evenly divided by the
+	     sizes of larger element types; the LMUL factor of 8 makes this
+	     work at the moment with SEW of up to 8 bytes, but there are
+	     reserved encodings, so there might be larger SEWs in the future.  */
+	  if (get_vector_mode (elem_mode,
+			       exact_div (BYTES_PER_RISCV_VECTOR * 8,
+					  potential_ew)).exists (&vmode))
+	    break;
+
+	  /* We may get here if we tried an element size that's larger than
+	     the hardware supports, but we should at least find a suitable
+	     byte vector mode.  */
+	  gcc_assert (potential_ew > 1);
+	}
+      if (potential_ew > 1)
+	length_rtx = GEN_INT (length / potential_ew);
+    }
+  else
+    {
+      vmode = E_RVVM8QImode;
+    }
+
+  /* A memcpy libcall in the worst case takes 3 instructions to prepare the
+     arguments + 1 for the call.  The RVV loop sequence takes 7 instructions,
+     so when optimizing for size, a libcall may be preferable.  */
+  if (size_p && need_loop)
+    return false;
+
+  rtx cnt = length_rtx;
+  rtx label = NULL_RTX;
+  rtx dst_addr = copy_addr_to_reg (XEXP (dst_in, 0));
+  rtx src_addr = copy_addr_to_reg (XEXP (src_in, 0));
+
+  if (need_loop)
+    {
+      length_rtx = copy_to_mode_reg (Pmode, length_rtx);
+      label = gen_label_rtx ();
+
+      emit_label (label);
+      emit_insn (gen_no_side_effects_vsetvl_rtx (vmode, cnt, length_rtx));
+    }
+
+  vec = gen_reg_rtx (vmode);
+  src = change_address (src_in, vmode, src_addr);
+  dst = change_address (dst_in, vmode, dst_addr);
+
+  /* If we don't need a loop and have a suitable mode to describe the size,
+     just do a load / store pair and leave it up to the later lazy code
+     motion pass to insert the appropriate vsetvli.  */
+  if (!need_loop && known_eq (GET_MODE_SIZE (vmode), INTVAL (length_in)))
+    {
+      emit_move_insn (vec, src);
+      emit_move_insn (dst, vec);
+    }
+  else
+    {
+      machine_mode mask_mode = get_vector_mode (BImode, GET_MODE_NUNITS (vmode)).require ();
+      rtx mask = CONSTM1_RTX (mask_mode);
+      if (!satisfies_constraint_K (cnt))
+	cnt = force_reg (Pmode, cnt);
+      rtx m_ops[] = {vec, mask, RVV_VUNDEF (vmode), src};
+      emit_nonvlmax_masked_insn (code_for_pred_mov (vmode), RVV_UNOP_M,
+				 m_ops, cnt);
+      emit_insn (gen_pred_store (vmode, dst, mask, vec, cnt,
+				 get_avl_type_rtx (NONVLMAX)));
+    }
+
+  if (need_loop)
+    {
+      emit_insn (gen_rtx_SET (src_addr, gen_rtx_PLUS (Pmode, src_addr, cnt)));
+      emit_insn (gen_rtx_SET (dst_addr, gen_rtx_PLUS (Pmode, dst_addr, cnt)));
+      emit_insn (gen_rtx_SET (length_rtx, gen_rtx_MINUS (Pmode, length_rtx, cnt)));
+
+      /* Emit the loop condition.  */
+      rtx test = gen_rtx_NE (VOIDmode, end, const0_rtx);
+      emit_jump_insn (gen_cbranch4 (Pmode, test, length_rtx, const0_rtx, label));
+      emit_insn (gen_nop ());
+    }
+
+  return true;
+}
+
 /* Return the vectorization machine mode for RVV according to LMUL.  */
 machine_mode
 preferred_simd_mode (scalar_mode mode)
diff --git a/gcc/config/riscv/riscv.md b/gcc/config/riscv/riscv.md
index b456fa6abb3..5b63da4d93b 100644
--- a/gcc/config/riscv/riscv.md
+++ b/gcc/config/riscv/riscv.md
@@ -2149,14 +2149,16 @@
   DONE;
 })
 
-(define_expand "cpymemsi"
+(define_expand "cpymem<mode>"
   [(parallel [(set (match_operand:BLK 0 "general_operand")
 		   (match_operand:BLK 1 "general_operand"))
-	      (use (match_operand:SI 2 ""))
+	      (use (match_operand:P 2 ""))
 	      (use (match_operand:SI 3 "const_int_operand"))])]
   ""
 {
-  if (riscv_expand_block_move (operands[0], operands[1], operands[2]))
+  if (riscv_vector::expand_block_move (operands[0], operands[1], operands[2]))
+    DONE;
+  else if (riscv_expand_block_move (operands[0], operands[1], operands[2]))
     DONE;
   else
     FAIL;
