public inbox for gcc-patches@gcc.gnu.org
* [PATCH][1/4][committed] aarch64: Add support for Armv8.8-a memory operations and memcpy expansion
@ 2021-12-13 14:29 Kyrylo Tkachov
  2021-12-14 10:48 ` Richard Sandiford
  2021-12-15  9:21 ` Christophe Lyon
  0 siblings, 2 replies; 5+ messages in thread
From: Kyrylo Tkachov @ 2021-12-13 14:29 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1411 bytes --]

Hi all,

This patch adds the +mops architecture extension flag from the 2021 Arm Architecture extensions, Armv8.8-a.
The +mops extension introduces instructions to accelerate the standard memcpy, memset and memmove functions.
The first patch here uses the instructions in the inline memcpy expansion.
Further patches in the series will use similar instructions to inline memmove and memset.

A new param, aarch64-mops-memcpy-size-threshold, is introduced to control the size threshold above which to
emit the new sequence. Its default setting is 256 bytes, which is the same as the current threshold above
which we'd emit a libcall.
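
For reference, a compiler invocation in the spirit of the new test (the source
file name here is hypothetical) enables the extension and forces the MOPS
sequence by lowering the threshold to zero:

  gcc -O2 -march=armv8.6-a+mops --param=aarch64-mops-memcpy-size-threshold=0 -S memcpy_test.c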

Bootstrapped and tested on aarch64-none-linux-gnu.
Pushing to trunk.
Thanks,
Kyrill

gcc/ChangeLog:

	* config/aarch64/aarch64-option-extensions.def (mops): Define.
	* config/aarch64/aarch64.c (aarch64_expand_cpymem_mops): Define.
	(aarch64_expand_cpymem): Define.
	* config/aarch64/aarch64.h (AARCH64_FL_MOPS): Define.
	(AARCH64_ISA_MOPS): Define.
	(TARGET_MOPS): Define.
	(MOVE_RATIO): Adjust for TARGET_MOPS.
	* config/aarch64/aarch64.md ("unspec"): Add UNSPEC_CPYMEM.
	(aarch64_cpymemdi): New pattern.
	(cpymemdi): Adjust for TARGET_MOPS.
	* config/aarch64/aarch64.opt (aarch64-mops-memcpy-size-threshold):
	New param.
	* doc/invoke.texi (AArch64 Options): Document +mops.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/mops_1.c: New test.

[-- Attachment #2: mops_1.patch --]
[-- Type: application/octet-stream, Size: 11565 bytes --]

diff --git a/gcc/config/aarch64/aarch64-option-extensions.def b/gcc/config/aarch64/aarch64-option-extensions.def
index b61f1df9019e655503f03c9a5aa4ecffa658d0fd..3f449fba41557afd66d11f976ec9fbdcddfafb42 100644
--- a/gcc/config/aarch64/aarch64-option-extensions.def
+++ b/gcc/config/aarch64/aarch64-option-extensions.def
@@ -235,4 +235,7 @@ AARCH64_OPT_EXTENSION("pauth", AARCH64_FL_PAUTH, 0, 0, false, "paca pacg")
 /* Enabling/Disabling "ls64" only changes "ls64".  */
 AARCH64_OPT_EXTENSION("ls64", AARCH64_FL_LS64, 0, 0, false, "")
 
+/* Enabling/disabling "mops" only changes "mops".  */
+AARCH64_OPT_EXTENSION("mops", AARCH64_FL_MOPS, 0, 0, false, "")
+
 #undef AARCH64_OPT_EXTENSION
diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 2792bb29adbbb5b3145b3f767615af8edbc30b08..79d0bcd357fc75c8b85a252813b2aac0e2b6e5bd 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -231,14 +231,17 @@ extern unsigned aarch64_architecture_version;
 /* Pointer Authentication (PAUTH) extension.  */
 #define AARCH64_FL_PAUTH      (1ULL << 40)
 
+/* Armv9.0-A.  */
+#define AARCH64_FL_V9         (1ULL << 41)  /* Armv9.0-A Architecture.  */
+
 /* 64-byte atomic load/store extensions.  */
-#define AARCH64_FL_LS64      (1ULL << 41)
+#define AARCH64_FL_LS64      (1ULL << 42)
 
 /* Armv8.7-a architecture extensions.  */
-#define AARCH64_FL_V8_7       (1ULL << 42)
+#define AARCH64_FL_V8_7       (1ULL << 43)
 
-/* Armv9.0-A.  */
-#define AARCH64_FL_V9         (1ULL << 43)  /* Armv9.0-A Architecture.  */
+/* Hardware memory operation instructions.  */
+#define AARCH64_FL_MOPS       (1ULL << 44)
 
 /* Has FP and SIMD.  */
 #define AARCH64_FL_FPSIMD     (AARCH64_FL_FP | AARCH64_FL_SIMD)
@@ -310,6 +313,7 @@ extern unsigned aarch64_architecture_version;
 #define AARCH64_ISA_V8_R	   (aarch64_isa_flags & AARCH64_FL_V8_R)
 #define AARCH64_ISA_PAUTH	   (aarch64_isa_flags & AARCH64_FL_PAUTH)
 #define AARCH64_ISA_V9		   (aarch64_isa_flags & AARCH64_FL_V9)
+#define AARCH64_ISA_MOPS	   (aarch64_isa_flags & AARCH64_FL_MOPS)
 
 /* Crypto is an optional extension to AdvSIMD.  */
 #define TARGET_CRYPTO (TARGET_SIMD && AARCH64_ISA_CRYPTO)
@@ -401,6 +405,9 @@ extern unsigned aarch64_architecture_version;
 /* PAUTH instructions are enabled through +pauth.  */
 #define TARGET_PAUTH (AARCH64_ISA_PAUTH)
 
+/* MOPS instructions are enabled through +mops.  */
+#define TARGET_MOPS (AARCH64_ISA_MOPS)
+
 /* Make sure this is always defined so we don't have to check for ifdefs
    but rather use normal ifs.  */
 #ifndef TARGET_FIX_ERR_A53_835769_DEFAULT
@@ -1046,9 +1053,10 @@ typedef struct
    7-byte copy is a 4-byte + 2-byte + byte copy.  This proves inefficient
    for both size and speed of copy, so we will instead use the "cpymem"
    standard name to implement the copy.  This logic does not apply when
-   targeting -mstrict-align, so keep a sensible default in that case.  */
+   targeting -mstrict-align or TARGET_MOPS, so keep a sensible default in
+   that case.  */
 #define MOVE_RATIO(speed) \
-  (!STRICT_ALIGNMENT ? 2 : (((speed) ? 15 : AARCH64_CALL_RATIO) / 2))
+  ((!STRICT_ALIGNMENT || TARGET_MOPS) ? 2 : (((speed) ? 15 : AARCH64_CALL_RATIO) / 2))
 
 /* Like MOVE_RATIO, without -mstrict-align, make decisions in "setmem" when
    we would use more than 3 scalar instructions.
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index be24b7320d28deed9a19a0451c96bd67d2fb3104..bd754e4e2c2b43e76021d3dd4b10cc03d78d9543 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -23568,6 +23568,28 @@ aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
   *dst = aarch64_progress_pointer (*dst);
 }
 
+/* Expand a cpymem using the MOPS extension.  OPERANDS are taken
+   from the cpymem pattern.  Return true iff we succeeded.  */
+static bool
+aarch64_expand_cpymem_mops (rtx *operands)
+{
+  if (!TARGET_MOPS)
+    return false;
+  rtx addr_dst = XEXP (operands[0], 0);
+  rtx addr_src = XEXP (operands[1], 0);
+  rtx sz_reg = operands[2];
+
+  if (!REG_P (sz_reg))
+    sz_reg = force_reg (DImode, sz_reg);
+  if (!REG_P (addr_dst))
+    addr_dst = force_reg (DImode, addr_dst);
+  if (!REG_P (addr_src))
+    addr_src = force_reg (DImode, addr_src);
+  emit_insn (gen_aarch64_cpymemdi (addr_dst, addr_src, sz_reg));
+
+  return true;
+}
+
 /* Expand cpymem, as if from a __builtin_memcpy.  Return true if
    we succeed, otherwise return false, indicating that a libcall to
    memcpy should be emitted.  */
@@ -23581,19 +23603,25 @@ aarch64_expand_cpymem (rtx *operands)
   rtx base;
   machine_mode cur_mode = BLKmode;
 
-  /* Only expand fixed-size copies.  */
+  /* Variable-sized memcpy can go through the MOPS expansion if available.  */
   if (!CONST_INT_P (operands[2]))
-    return false;
+    return aarch64_expand_cpymem_mops (operands);
 
   unsigned HOST_WIDE_INT size = INTVAL (operands[2]);
 
-  /* Try to inline up to 256 bytes.  */
-  unsigned HOST_WIDE_INT max_copy_size = 256;
+  /* Try to inline up to 256 bytes or use the MOPS threshold if available.  */
+  unsigned HOST_WIDE_INT max_copy_size
+    = TARGET_MOPS ? aarch64_mops_memcpy_size_threshold : 256;
 
   bool size_p = optimize_function_for_size_p (cfun);
 
+  /* Large constant-sized cpymem should go through MOPS when possible.
+     It should be a win even for size optimization in the general case.
+     For speed optimization the choice between MOPS and the SIMD sequence
+     depends on the size of the copy, rather than number of instructions,
+     alignment etc.  */
   if (size > max_copy_size)
-    return false;
+    return aarch64_expand_cpymem_mops (operands);
 
   int copy_bits = 256;
 
@@ -23643,9 +23671,9 @@ aarch64_expand_cpymem (rtx *operands)
       nops += 2;
       n -= mode_bits;
 
-      /* Emit trailing copies using overlapping unaligned accesses - this is
-	 smaller and faster.  */
-      if (n > 0 && n < copy_bits / 2)
+      /* Emit trailing copies using overlapping unaligned accesses
+	(when !STRICT_ALIGNMENT) - this is smaller and faster.  */
+      if (n > 0 && n < copy_bits / 2 && !STRICT_ALIGNMENT)
 	{
 	  machine_mode next_mode = smallest_mode_for_size (n, MODE_INT);
 	  int n_bits = GET_MODE_BITSIZE (next_mode).to_constant ();
@@ -23657,9 +23685,25 @@ aarch64_expand_cpymem (rtx *operands)
     }
   rtx_insn *seq = get_insns ();
   end_sequence ();
+  /* MOPS sequence requires 3 instructions for the memory copying + 1 to move
+     the constant size into a register.  */
+  unsigned mops_cost = 3 + 1;
+
+  /* If MOPS is available at this point we don't consider the libcall as it's
+     not a win even on code size.  At this point only consider MOPS if
+     optimizing for size.  For speed optimizations we will have chosen between
+     the two based on copy size already.  */
+  if (TARGET_MOPS)
+    {
+      if (size_p && mops_cost < nops)
+	return aarch64_expand_cpymem_mops (operands);
+      emit_insn (seq);
+      return true;
+    }
 
   /* A memcpy libcall in the worst case takes 3 instructions to prepare the
-     arguments + 1 for the call.  */
+     arguments + 1 for the call.  When MOPS is not available and we're
+     optimizing for size a libcall may be preferable.  */
   unsigned libcall_cost = 4;
   if (size_p && libcall_cost < nops)
     return false;
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 5297b2d3f95744ac72e36814c6676cc97478d48b..d623c1b00bf62bf24420813fb7a3a6bf09ff1dc0 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -143,6 +143,7 @@ (define_c_enum "unspec" [
     UNSPEC_AUTIBSP
     UNSPEC_CALLEE_ABI
     UNSPEC_CASESI
+    UNSPEC_CPYMEM
     UNSPEC_CRC32B
     UNSPEC_CRC32CB
     UNSPEC_CRC32CH
@@ -1572,6 +1573,18 @@ (define_split
   }
 )
 
+(define_insn "aarch64_cpymemdi"
+  [(parallel [
+   (set (match_operand:DI 2 "register_operand" "+&r") (const_int 0))
+   (clobber (match_operand:DI 0 "register_operand" "+&r"))
+   (clobber (match_operand:DI 1 "register_operand" "+&r"))
+   (set (mem:BLK (match_dup 0))
+        (unspec:BLK [(mem:BLK (match_dup 1)) (match_dup 2)] UNSPEC_CPYMEM))])]
+ "TARGET_MOPS"
+ "cpyfp\t[%x0]!, [%x1]!, %x2!\;cpyfm\t[%x0]!, [%x1]!, %x2!\;cpyfe\t[%x0]!, [%x1]!, %x2!"
+ [(set_attr "length" "12")]
+)
+
 ;; 0 is dst
 ;; 1 is src
 ;; 2 is size of copy in bytes
@@ -1580,9 +1593,9 @@ (define_split
 (define_expand "cpymemdi"
   [(match_operand:BLK 0 "memory_operand")
    (match_operand:BLK 1 "memory_operand")
-   (match_operand:DI 2 "immediate_operand")
+   (match_operand:DI 2 "general_operand")
    (match_operand:DI 3 "immediate_operand")]
-   "!STRICT_ALIGNMENT"
+   "!STRICT_ALIGNMENT || TARGET_MOPS"
 {
   if (aarch64_expand_cpymem (operands))
     DONE;
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index 32191cf1acf43302c4a544b85db60b7b6e14da48..7445ed106cca9cb8a4537414499f6f28476951bf 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -280,3 +280,7 @@ Target Joined UInteger Var(aarch64_autovec_preference) Init(0) IntegerRange(0, 4
 
 -param=aarch64-loop-vect-issue-rate-niters=
 Target Joined UInteger Var(aarch64_loop_vect_issue_rate_niters) Init(6) IntegerRange(0, 65536) Param
+
+-param=aarch64-mops-memcpy-size-threshold=
+Target Joined UInteger Var(aarch64_mops_memcpy_size_threshold) Init(256) Param
+Constant memcpy size in bytes above which to start using MOPS sequence.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 510ed079b99374028e38d20f3edbac12d75f7842..9c38277e4f317332088555c9948ef19935ae890c 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -19135,6 +19135,9 @@ prior to Armv8.2-A is not supported.
 @item ls64
 Enable the 64-byte atomic load and store instructions for accelerators.
 This option is enabled by default for @option{-march=armv8.7-a}.
+@item mops
+Enable the instructions to accelerate memory operations like @code{memcpy},
+@code{memmove}, @code{memset}.
 @item flagm
 Enable the Flag Manipulation instructions Extension.
 @item pauth
diff --git a/gcc/testsuite/gcc.target/aarch64/mops_1.c b/gcc/testsuite/gcc.target/aarch64/mops_1.c
new file mode 100644
index 0000000000000000000000000000000000000000..661c14192e8a84fd4641a4d818b8db46ab4f1b28
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/mops_1.c
@@ -0,0 +1,57 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=armv8.6-a+mops --param=aarch64-mops-memcpy-size-threshold=0" } */
+/* { dg-final { check-function-bodies "**" "" "" } } */
+
+#include <stdlib.h>
+
+/* We want to inline variable-sized memcpy.
+** do_it_cpy:
+**	cpyfp	\[x1\]\!, \[x0\]\!, x2\!
+**	cpyfm	\[x1\]\!, \[x0\]\!, x2\!
+**	cpyfe	\[x1\]\!, \[x0\]\!, x2\!
+**	ret
+*/
+void do_it_cpy (char * in, char * out, size_t size)
+{
+  __builtin_memcpy (out, in, size);
+}
+
+/*
+** do_it_cpy_large:
+**	mov	x2, 1024
+**	cpyfp	\[x1\]\!, \[x0\]!, x2\!
+**	cpyfm	\[x1\]\!, \[x0\]!, x2\!
+**	cpyfe	\[x1\]\!, \[x0\]\!, x2\!
+**	ret
+*/
+void do_it_cpy_large (char * in, char * out)
+{
+  __builtin_memcpy (out, in, 1024);
+}
+
+/*
+** do_it_cpy_127:
+**	mov	x2, 127
+**	cpyfp	\[x1\]\!, \[x0\]!, x2\!
+**	cpyfm	\[x1\]\!, \[x0\]!, x2\!
+**	cpyfe	\[x1\]\!, \[x0\]\!, x2\!
+**	ret
+*/
+void do_it_cpy_127 (char * in, char * out)
+{
+  __builtin_memcpy (out, in, 127);
+}
+
+/*
+** do_it_cpy_128:
+**	mov	x2, 128
+**	cpyfp	\[x1\]\!, \[x0\]!, x2\!
+**	cpyfm	\[x1\]\!, \[x0\]!, x2\!
+**	cpyfe	\[x1\]\!, \[x0\]\!, x2\!
+**	ret
+*/
+void do_it_cpy_128 (char * in, char * out)
+{
+  __builtin_memcpy (out, in, 128);
+}
+


* Re: [PATCH][1/4][committed] aarch64: Add support for Armv8.8-a memory operations and memcpy expansion
  2021-12-13 14:29 [PATCH][1/4][committed] aarch64: Add support for Armv8.8-a memory operations and memcpy expansion Kyrylo Tkachov
@ 2021-12-14 10:48 ` Richard Sandiford
  2022-01-27 10:12   ` Kyrylo Tkachov
  2021-12-15  9:21 ` Christophe Lyon
  1 sibling, 1 reply; 5+ messages in thread
From: Richard Sandiford @ 2021-12-14 10:48 UTC (permalink / raw)
  To: Kyrylo Tkachov via Gcc-patches

Kyrylo Tkachov via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> @@ -23568,6 +23568,28 @@ aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
>    *dst = aarch64_progress_pointer (*dst);
>  }
>  
> +/* Expand a cpymem using the MOPS extension.  OPERANDS are taken
> +   from the cpymem pattern.  Return true iff we succeeded.  */
> +static bool
> +aarch64_expand_cpymem_mops (rtx *operands)
> +{
> +  if (!TARGET_MOPS)
> +    return false;
> +  rtx addr_dst = XEXP (operands[0], 0);
> +  rtx addr_src = XEXP (operands[1], 0);
> +  rtx sz_reg = operands[2];
> +
> +  if (!REG_P (sz_reg))
> +    sz_reg = force_reg (DImode, sz_reg);
> +  if (!REG_P (addr_dst))
> +    addr_dst = force_reg (DImode, addr_dst);
> +  if (!REG_P (addr_src))
> +    addr_src = force_reg (DImode, addr_src);
> +  emit_insn (gen_aarch64_cpymemdi (addr_dst, addr_src, sz_reg));
> +
> +  return true;
> +}

On this, I think it would be better to adjust the original src and dst
MEMs if possible, since they contain metadata about the size of the
access and alias set information.  It looks like the code above
generates an instruction with a wild read and a wild write instead.

It should be possible to do that with a define_expand/define_insn
pair, where the define_expand takes two extra operands for the MEMs,
but the define_insn contains the same operands as now.

Since the instruction clobbers the three registers, I think we have to
use copy_to_reg (unconditionally) to force a fresh register.  The ultimate
caller is not expecting the values of the registers in the original
address to change.

Thanks,
Richard

* Re: [PATCH][1/4][committed] aarch64: Add support for Armv8.8-a memory operations and memcpy expansion
  2021-12-13 14:29 [PATCH][1/4][committed] aarch64: Add support for Armv8.8-a memory operations and memcpy expansion Kyrylo Tkachov
  2021-12-14 10:48 ` Richard Sandiford
@ 2021-12-15  9:21 ` Christophe Lyon
  1 sibling, 0 replies; 5+ messages in thread
From: Christophe Lyon @ 2021-12-15  9:21 UTC (permalink / raw)
  To: Kyrylo Tkachov; +Cc: gcc-patches

On Mon, Dec 13, 2021 at 3:29 PM Kyrylo Tkachov via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:

> [...]
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/aarch64/mops_1.c: New test.
>

Hi Kyrill,

And this test fails with -mabi=ilp32 too, sorry for the delay.

Thanks

Christophe


* RE: [PATCH][1/4][committed] aarch64: Add support for Armv8.8-a memory operations and memcpy expansion
  2021-12-14 10:48 ` Richard Sandiford
@ 2022-01-27 10:12   ` Kyrylo Tkachov
  2022-02-01 10:22     ` Richard Sandiford
  0 siblings, 1 reply; 5+ messages in thread
From: Kyrylo Tkachov @ 2022-01-27 10:12 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

Hi Richard,

Sorry for the delay in getting back to this. I'm now working on a patch to adjust this.

> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Tuesday, December 14, 2021 10:48 AM
> To: Kyrylo Tkachov via Gcc-patches <gcc-patches@gcc.gnu.org>
> Cc: Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
> Subject: Re: [PATCH][1/4][committed] aarch64: Add support for Armv8.8-a
> memory operations and memcpy expansion
> 
> Kyrylo Tkachov via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > @@ -23568,6 +23568,28 @@ aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
> >    *dst = aarch64_progress_pointer (*dst);
> >  }
> >
> > +/* Expand a cpymem using the MOPS extension.  OPERANDS are taken
> > +   from the cpymem pattern.  Return true iff we succeeded.  */
> > +static bool
> > +aarch64_expand_cpymem_mops (rtx *operands)
> > +{
> > +  if (!TARGET_MOPS)
> > +    return false;
> > +  rtx addr_dst = XEXP (operands[0], 0);
> > +  rtx addr_src = XEXP (operands[1], 0);
> > +  rtx sz_reg = operands[2];
> > +
> > +  if (!REG_P (sz_reg))
> > +    sz_reg = force_reg (DImode, sz_reg);
> > +  if (!REG_P (addr_dst))
> > +    addr_dst = force_reg (DImode, addr_dst);
> > +  if (!REG_P (addr_src))
> > +    addr_src = force_reg (DImode, addr_src);
> > +  emit_insn (gen_aarch64_cpymemdi (addr_dst, addr_src, sz_reg));
> > +
> > +  return true;
> > +}
> 
> On this, I think it would be better to adjust the original src and dst
> MEMs if possible, since they contain metadata about the size of the
> access and alias set information.  It looks like the code above
> generates an instruction with a wild read and a wild write instead.
> 

Hmm, do you mean adjust the address of the MEMs in operands with something like replace_equiv_address_nv?

> It should be possible to do that with a define_expand/define_insn
> pair, where the define_expand takes two extra operands for the MEMs,
> but the define_insn contains the same operands as now.

Could you please expand on this a bit? This path is reached from the cpymemdi expander that already takes the two MEMs as operands and generates the aarch64_cpymemdi define_insn that uses just the address registers as operands. Should we carry the MEMs around in the define_insn as well after expand?


> 
> Since the instruction clobbers the three registers, I think we have to
> use copy_to_reg (unconditionally) to force a fresh register.  The ultimate
> caller is not expecting the values of the registers in the original
> address to change.

Thanks,
Kyrill


* Re: [PATCH][1/4][committed] aarch64: Add support for Armv8.8-a memory operations and memcpy expansion
  2022-01-27 10:12   ` Kyrylo Tkachov
@ 2022-02-01 10:22     ` Richard Sandiford
  0 siblings, 0 replies; 5+ messages in thread
From: Richard Sandiford @ 2022-02-01 10:22 UTC (permalink / raw)
  To: Kyrylo Tkachov; +Cc: gcc-patches

Kyrylo Tkachov <Kyrylo.Tkachov@arm.com> writes:
> Hi Richard,
>
> Sorry for the delay in getting back to this. I'm now working on a patch to adjust this.
>
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandiford@arm.com>
>> Sent: Tuesday, December 14, 2021 10:48 AM
>> To: Kyrylo Tkachov via Gcc-patches <gcc-patches@gcc.gnu.org>
>> Cc: Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
>> Subject: Re: [PATCH][1/4][committed] aarch64: Add support for Armv8.8-a
>> memory operations and memcpy expansion
>> 
>> Kyrylo Tkachov via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> > @@ -23568,6 +23568,28 @@ aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
>> >    *dst = aarch64_progress_pointer (*dst);
>> >  }
>> >
>> > +/* Expand a cpymem using the MOPS extension.  OPERANDS are taken
>> > +   from the cpymem pattern.  Return true iff we succeeded.  */
>> > +static bool
>> > +aarch64_expand_cpymem_mops (rtx *operands)
>> > +{
>> > +  if (!TARGET_MOPS)
>> > +    return false;
>> > +  rtx addr_dst = XEXP (operands[0], 0);
>> > +  rtx addr_src = XEXP (operands[1], 0);
>> > +  rtx sz_reg = operands[2];
>> > +
>> > +  if (!REG_P (sz_reg))
>> > +    sz_reg = force_reg (DImode, sz_reg);
>> > +  if (!REG_P (addr_dst))
>> > +    addr_dst = force_reg (DImode, addr_dst);
>> > +  if (!REG_P (addr_src))
>> > +    addr_src = force_reg (DImode, addr_src);
>> > +  emit_insn (gen_aarch64_cpymemdi (addr_dst, addr_src, sz_reg));
>> > +
>> > +  return true;
>> > +}
>> 
>> On this, I think it would be better to adjust the original src and dst
>> MEMs if possible, since they contain metadata about the size of the
>> access and alias set information.  It looks like the code above
>> generates an instruction with a wild read and a wild write instead.
>> 
>
> Hmm, do you mean adjust the address of the MEMs in operands with something like replace_equiv_address_nv?

Yeah.
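
A minimal sketch of that adjustment (untested, and assuming the five-operand
aarch64_cpymemdi expander sketched further below) could be:

static bool
aarch64_expand_cpymem_mops (rtx *operands)
{
  if (!TARGET_MOPS)
    return false;

  /* The instruction writes all three registers, so use fresh pseudos for
     the addresses and the size, and rewrite the MEMs so that their size
     and alias-set metadata is preserved rather than emitting a wild read
     and a wild write.  */
  rtx addr_dst = copy_to_mode_reg (Pmode, XEXP (operands[0], 0));
  rtx addr_src = copy_to_mode_reg (Pmode, XEXP (operands[1], 0));
  rtx mem_dst = replace_equiv_address (operands[0], addr_dst);
  rtx mem_src = replace_equiv_address (operands[1], addr_src);
  rtx sz_reg = copy_to_mode_reg (DImode, operands[2]);
  emit_insn (gen_aarch64_cpymemdi (addr_dst, addr_src, sz_reg,
				   mem_dst, mem_src));
  return true;
}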

>> It should be possible to do that with a define_expand/define_insn
>> pair, where the define_expand takes two extra operands for the MEMs,
>> but the define_insn contains the same operands as now.
>
> Could you please expand on this a bit? This path is reached from the cpymemdi expander that already takes the two MEMs as operands and generates the aarch64_cpymemdi define_insn that uses just the address registers as operands. Should we carry the MEMs around in the define_insn as well after expand?

It could be a second expander.  E.g.:

(define_expand "aarch64_cpymemdi"
  [(parallel
     [(set (match_operand 2) (const_int 0))
      (clobber (match_operand 0))
      (clobber (match_operand 1))
      (set (match_operand 3)
           (unspec:BLK [(match_operand 4) (match_dup 2)] UNSPEC_CPYMEM))])]
  "TARGET_MOPS"
)

with the existing define_insn maybe becoming *aarch64_cpymemdi.
(The define_insn doesn't need the outer parallel.)
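
With the MEMs matched explicitly, the renamed insn could then be shaped roughly
like this (again an untested sketch):

(define_insn "*aarch64_cpymemdi"
  [(set (match_operand:DI 2 "register_operand" "+&r") (const_int 0))
   (clobber (match_operand:DI 0 "register_operand" "+&r"))
   (clobber (match_operand:DI 1 "register_operand" "+&r"))
   (set (match_operand:BLK 3 "memory_operand" "+m")
        (unspec:BLK [(match_operand:BLK 4 "memory_operand" "+m")
                     (match_dup 2)] UNSPEC_CPYMEM))]
  "TARGET_MOPS
   && rtx_equal_p (XEXP (operands[3], 0), operands[0])
   && rtx_equal_p (XEXP (operands[4], 0), operands[1])"
  "cpyfp\t[%x0]!, [%x1]!, %x2!\;cpyfm\t[%x0]!, [%x1]!, %x2!\;cpyfe\t[%x0]!, [%x1]!, %x2!"
  [(set_attr "length" "12")]
)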

Thanks,
Richard
