public inbox for gcc-patches@gcc.gnu.org
* [pushed 0/8] aarch64: Fix regression in vec_init code quality
From: Richard Sandiford @ 2022-02-09 17:00 UTC (permalink / raw)
  To: gcc-patches

The main purpose of this patch series is to fix a performance
regression from GCC 8.  Before the patch:

int64x2_t s64q_1(int64_t a0, int64_t a1) {
  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
    return (int64x2_t) { a1, a0 };
  else
    return (int64x2_t) { a0, a1 };
}

generated:

        fmov    d0, x0
        ins     v0.d[1], x1
        ins     v0.d[1], x1
        ret

whereas GCC 8 generated the more respectable:

        dup     v0.2d, x0
        ins     v0.d[1], x1
        ret

But there are some related knock-on changes that IMO are needed to keep
things in a consistent and maintainable state.

There is still more cleanup and optimisation that could be done in this
area, but that's definitely GCC 13 material.

Tested on aarch64-linux-gnu and aarch64_be-elf, pushed.

Sorry for the size of the series, but it really did seem like the
best fix in the circumstances.

Richard


* [pushed 1/8] aarch64: Tighten general_operand predicates
From: Richard Sandiford @ 2022-02-09 17:00 UTC (permalink / raw)
  To: gcc-patches

This patch fixes some cases in which *general_operand was used over
*nonimmediate_operand by patterns that don't accept immediates.
This avoids some complications with later patches.
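
For reference, the kinds of operation these patterns match are lane
inserts and concatenations with zero.  A minimal example of the latter
(hypothetical, not taken from the patch itself):

#include <arm_neon.h>

/* Concatenate a 64-bit vector with zeroes; this is the operation that
   the aarch64_combinez{,_be}<mode> patterns describe.  */
int32x4_t combine_zero(int32x2_t lo) {
  return vcombine_s32(lo, vdup_n_s32(0));
}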

gcc/
	* config/aarch64/aarch64-simd.md (aarch64_simd_vec_set<mode>): Use
	aarch64_simd_nonimmediate_operand instead of
	aarch64_simd_general_operand.
	(@aarch64_combinez<mode>): Use nonimmediate_operand instead of
	general_operand.
	(@aarch64_combinez_be<mode>): Likewise.
---
 gcc/config/aarch64/aarch64-simd.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 6646e069ad2..9529bdb4997 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1039,7 +1039,7 @@ (define_insn "aarch64_simd_vec_set<mode>"
   [(set (match_operand:VALL_F16 0 "register_operand" "=w,w,w")
 	(vec_merge:VALL_F16
 	    (vec_duplicate:VALL_F16
-		(match_operand:<VEL> 1 "aarch64_simd_general_operand" "w,?r,Utv"))
+		(match_operand:<VEL> 1 "aarch64_simd_nonimmediate_operand" "w,?r,Utv"))
 	    (match_operand:VALL_F16 3 "register_operand" "0,0,0")
 	    (match_operand:SI 2 "immediate_operand" "i,i,i")))]
   "TARGET_SIMD"
@@ -4380,7 +4380,7 @@ (define_insn "store_pair_lanes<mode>"
 (define_insn "@aarch64_combinez<mode>"
   [(set (match_operand:<VDBL> 0 "register_operand" "=w,w,w")
 	(vec_concat:<VDBL>
-	  (match_operand:VDC 1 "general_operand" "w,?r,m")
+	  (match_operand:VDC 1 "nonimmediate_operand" "w,?r,m")
 	  (match_operand:VDC 2 "aarch64_simd_or_scalar_imm_zero")))]
   "TARGET_SIMD && !BYTES_BIG_ENDIAN"
   "@
@@ -4395,7 +4395,7 @@ (define_insn "@aarch64_combinez_be<mode>"
   [(set (match_operand:<VDBL> 0 "register_operand" "=w,w,w")
         (vec_concat:<VDBL>
 	  (match_operand:VDC 2 "aarch64_simd_or_scalar_imm_zero")
-	  (match_operand:VDC 1 "general_operand" "w,?r,m")))]
+	  (match_operand:VDC 1 "nonimmediate_operand" "w,?r,m")))]
   "TARGET_SIMD && BYTES_BIG_ENDIAN"
   "@
    mov\\t%0.8b, %1.8b
-- 
2.25.1



* [pushed 2/8] aarch64: Generalise vec_set predicate
From: Richard Sandiford @ 2022-02-09 17:00 UTC (permalink / raw)
  To: gcc-patches

The aarch64_simd_vec_set<mode> define_insn takes memory operands,
so this patch makes the vec_set<mode> optab expander do the same.
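
As an illustration (hypothetical example, not part of the patch),
the element being inserted can now reach the expander as a memory
operand rather than being forced into a register first:

#include <arm_neon.h>

/* Insert an element loaded from memory into lane 2 of a vector.  */
int32x4_t set_from_mem(int32x4_t v, int32_t *ptr) {
  return vsetq_lane_s32(*ptr, v, 2);
}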

gcc/
	* config/aarch64/aarch64-simd.md (vec_set<mode>): Allow the
	element to be an aarch64_simd_nonimmediate_operand.
---
 gcc/config/aarch64/aarch64-simd.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 9529bdb4997..872a3d78269 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1378,7 +1378,7 @@ (define_insn "vec_shr_<mode>"
 
 (define_expand "vec_set<mode>"
   [(match_operand:VALL_F16 0 "register_operand")
-   (match_operand:<VEL> 1 "register_operand")
+   (match_operand:<VEL> 1 "aarch64_simd_nonimmediate_operand")
    (match_operand:SI 2 "immediate_operand")]
   "TARGET_SIMD"
   {
-- 
2.25.1



* [pushed 3/8] aarch64: Generalise adjacency check for load_pair_lanes
From: Richard Sandiford @ 2022-02-09 17:00 UTC (permalink / raw)
  To: gcc-patches

This patch generalises the load_pair_lanes<mode> guard so that
it uses aarch64_check_consecutive_mems to check for consecutive
mems.  It also allows the pattern to be used for STRICT_ALIGNMENT
targets if the alignment is high enough.

The main aim is to avoid an inline test, for the sake of a later patch
that needs to repeat it.  Reusing aarch64_check_consecutive_mems seemed
simpler than writing an entirely new function.
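
For illustration, the sort of code that benefits (in the spirit of the
new vec-init-6.c test below) is a vector built from two adjacent loads,
which can then be merged into a single q-register load:

#include <arm_neon.h>

/* ptr[0] and ptr[1] are consecutive in memory, so the two 64-bit
   loads can become one 128-bit load.  */
int64x2_t load_adjacent(int64_t *ptr) {
  return (int64x2_t) { ptr[0], ptr[1] };
}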

gcc/
	* config/aarch64/aarch64-protos.h (aarch64_mergeable_load_pair_p):
	Declare.
	* config/aarch64/aarch64-simd.md (load_pair_lanes<mode>): Use
	aarch64_mergeable_load_pair_p instead of inline check.
	* config/aarch64/aarch64.cc (aarch64_expand_vector_init): Likewise.
	(aarch64_check_consecutive_mems): Allow the reversed parameter
	to be null.
	(aarch64_mergeable_load_pair_p): New function.
---
 gcc/config/aarch64/aarch64-protos.h           |  1 +
 gcc/config/aarch64/aarch64-simd.md            |  7 +--
 gcc/config/aarch64/aarch64.cc                 | 54 ++++++++++++-------
 gcc/testsuite/gcc.target/aarch64/vec-init-6.c | 12 +++++
 gcc/testsuite/gcc.target/aarch64/vec-init-7.c | 12 +++++
 5 files changed, 62 insertions(+), 24 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-init-6.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-init-7.c

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 26368538a55..b75ed35635b 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -1000,6 +1000,7 @@ void aarch64_atomic_assign_expand_fenv (tree *, tree *, tree *);
 int aarch64_ccmp_mode_to_code (machine_mode mode);
 
 bool extract_base_offset_in_addr (rtx mem, rtx *base, rtx *offset);
+bool aarch64_mergeable_load_pair_p (machine_mode, rtx, rtx);
 bool aarch64_operands_ok_for_ldpstp (rtx *, bool, machine_mode);
 bool aarch64_operands_adjust_ok_for_ldpstp (rtx *, bool, machine_mode);
 void aarch64_swap_ldrstr_operands (rtx *, bool);
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 872a3d78269..c5bc2ea658b 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -4353,11 +4353,8 @@ (define_insn "load_pair_lanes<mode>"
 	(vec_concat:<VDBL>
 	   (match_operand:VDC 1 "memory_operand" "Utq")
 	   (match_operand:VDC 2 "memory_operand" "m")))]
-  "TARGET_SIMD && !STRICT_ALIGNMENT
-   && rtx_equal_p (XEXP (operands[2], 0),
-		   plus_constant (Pmode,
-				  XEXP (operands[1], 0),
-				  GET_MODE_SIZE (<MODE>mode)))"
+  "TARGET_SIMD
+   && aarch64_mergeable_load_pair_p (<VDBL>mode, operands[1], operands[2])"
   "ldr\\t%q0, %1"
   [(set_attr "type" "neon_load1_1reg_q")]
 )
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 296145e6008..c47543aebf3 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -21063,11 +21063,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
 		 for store_pair_lanes<mode>.  */
 	      if (memory_operand (x0, inner_mode)
 		  && memory_operand (x1, inner_mode)
-		  && !STRICT_ALIGNMENT
-		  && rtx_equal_p (XEXP (x1, 0),
-				  plus_constant (Pmode,
-						 XEXP (x0, 0),
-						 GET_MODE_SIZE (inner_mode))))
+		  && aarch64_mergeable_load_pair_p (mode, x0, x1))
 		{
 		  rtx t;
 		  if (inner_mode == DFmode)
@@ -24687,14 +24683,20 @@ aarch64_sched_adjust_priority (rtx_insn *insn, int priority)
   return priority;
 }
 
-/* Check if *MEM1 and *MEM2 are consecutive memory references and,
+/* If REVERSED is null, return true if memory reference *MEM2 comes
+   immediately after memory reference *MEM1.  Do not change the references
+   in this case.
+
+   Otherwise, check if *MEM1 and *MEM2 are consecutive memory references and,
    if they are, try to make them use constant offsets from the same base
    register.  Return true on success.  When returning true, set *REVERSED
    to true if *MEM1 comes after *MEM2, false if *MEM1 comes before *MEM2.  */
 static bool
 aarch64_check_consecutive_mems (rtx *mem1, rtx *mem2, bool *reversed)
 {
-  *reversed = false;
+  if (reversed)
+    *reversed = false;
+
   if (GET_RTX_CLASS (GET_CODE (XEXP (*mem1, 0))) == RTX_AUTOINC
       || GET_RTX_CLASS (GET_CODE (XEXP (*mem2, 0))) == RTX_AUTOINC)
     return false;
@@ -24723,7 +24725,7 @@ aarch64_check_consecutive_mems (rtx *mem1, rtx *mem2, bool *reversed)
       if (known_eq (UINTVAL (offset1) + size1, UINTVAL (offset2)))
 	return true;
 
-      if (known_eq (UINTVAL (offset2) + size2, UINTVAL (offset1)))
+      if (known_eq (UINTVAL (offset2) + size2, UINTVAL (offset1)) && reversed)
 	{
 	  *reversed = true;
 	  return true;
@@ -24756,22 +24758,25 @@ aarch64_check_consecutive_mems (rtx *mem1, rtx *mem2, bool *reversed)
 
       if (known_eq (expr_offset1 + size1, expr_offset2))
 	;
-      else if (known_eq (expr_offset2 + size2, expr_offset1))
+      else if (known_eq (expr_offset2 + size2, expr_offset1) && reversed)
 	*reversed = true;
       else
 	return false;
 
-      if (base2)
+      if (reversed)
 	{
-	  rtx addr1 = plus_constant (Pmode, XEXP (*mem2, 0),
-				     expr_offset1 - expr_offset2);
-	  *mem1 = replace_equiv_address_nv (*mem1, addr1);
-	}
-      else
-	{
-	  rtx addr2 = plus_constant (Pmode, XEXP (*mem1, 0),
-				     expr_offset2 - expr_offset1);
-	  *mem2 = replace_equiv_address_nv (*mem2, addr2);
+	  if (base2)
+	    {
+	      rtx addr1 = plus_constant (Pmode, XEXP (*mem2, 0),
+					 expr_offset1 - expr_offset2);
+	      *mem1 = replace_equiv_address_nv (*mem1, addr1);
+	    }
+	  else
+	    {
+	      rtx addr2 = plus_constant (Pmode, XEXP (*mem1, 0),
+					 expr_offset2 - expr_offset1);
+	      *mem2 = replace_equiv_address_nv (*mem2, addr2);
+	    }
 	}
       return true;
     }
@@ -24779,6 +24784,17 @@ aarch64_check_consecutive_mems (rtx *mem1, rtx *mem2, bool *reversed)
   return false;
 }
 
+/* Return true if MEM1 and MEM2 can be combined into a single access
+   of mode MODE, with the combined access having the same address as MEM1.  */
+
+bool
+aarch64_mergeable_load_pair_p (machine_mode mode, rtx mem1, rtx mem2)
+{
+  if (STRICT_ALIGNMENT && MEM_ALIGN (mem1) < GET_MODE_ALIGNMENT (mode))
+    return false;
+  return aarch64_check_consecutive_mems (&mem1, &mem2, nullptr);
+}
+
 /* Given OPERANDS of consecutive load/store, check if we can merge
    them into ldp/stp.  LOAD is true if they are load instructions.
    MODE is the mode of memory operands.  */
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-6.c b/gcc/testsuite/gcc.target/aarch64/vec-init-6.c
new file mode 100644
index 00000000000..96450157498
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-init-6.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-O" } */
+
+#include <arm_neon.h>
+
+int64_t s64[2];
+float64_t f64[2];
+
+int64x2_t test_s64() { return (int64x2_t) { s64[0], s64[1] }; }
+float64x2_t test_f64() { return (float64x2_t) { f64[0], f64[1] }; }
+
+/* { dg-final { scan-assembler-not {\tins\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-7.c b/gcc/testsuite/gcc.target/aarch64/vec-init-7.c
new file mode 100644
index 00000000000..795895286db
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-init-7.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-O -mstrict-align" } */
+
+#include <arm_neon.h>
+
+int64_t s64[2] __attribute__((aligned(16)));
+float64_t f64[2] __attribute__((aligned(16)));
+
+int64x2_t test_s64() { return (int64x2_t) { s64[0], s64[1] }; }
+float64x2_t test_f64() { return (float64x2_t) { f64[0], f64[1] }; }
+
+/* { dg-final { scan-assembler-not {\tins\t} } } */
-- 
2.25.1



* [pushed 4/8] aarch64: Remove redundant vec_concat patterns
From: Richard Sandiford @ 2022-02-09 17:01 UTC (permalink / raw)
  To: gcc-patches

move_lo_quad_internal_<mode> and move_lo_quad_internal_be_<mode>
partially duplicate the later aarch64_combinez{,_be}<mode> patterns.
The duplication itself is a regression.

The only substantive differences between the two are:

* combinez uses vector MOV (ORR) instead of element MOV (DUP).
  The former seems more likely to be handled via renaming.

* combinez disparages the GPR->FPR alternative whereas move_lo_quad
  gave it equal cost.  The new test gives a token example of when
  the combinez behaviour helps.

gcc/
	* config/aarch64/aarch64-simd.md (move_lo_quad_internal_<mode>)
	(move_lo_quad_internal_be_<mode>): Delete.
	(move_lo_quad_<mode>): Use aarch64_combine<Vhalf> instead of the above.

gcc/testsuite/
	* gcc.target/aarch64/vec-init-8.c: New test.
---
 gcc/config/aarch64/aarch64-simd.md            | 37 +------------------
 gcc/testsuite/gcc.target/aarch64/vec-init-8.c | 15 ++++++++
 2 files changed, 17 insertions(+), 35 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-init-8.c

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index c5bc2ea658b..d6cd4c70fe7 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1584,46 +1584,13 @@ (define_insn "aarch64_<optab>p<mode>"
 ;; On little-endian this is { operand, zeroes }
 ;; On big-endian this is { zeroes, operand }
 
-(define_insn "move_lo_quad_internal_<mode>"
-  [(set (match_operand:VQMOV 0 "register_operand" "=w,w,w")
-	(vec_concat:VQMOV
-	  (match_operand:<VHALF> 1 "register_operand" "w,r,r")
-	  (match_operand:<VHALF> 2 "aarch64_simd_or_scalar_imm_zero")))]
-  "TARGET_SIMD && !BYTES_BIG_ENDIAN"
-  "@
-   dup\\t%d0, %1.d[0]
-   fmov\\t%d0, %1
-   dup\\t%d0, %1"
-  [(set_attr "type" "neon_dup<q>,f_mcr,neon_dup<q>")
-   (set_attr "length" "4")
-   (set_attr "arch" "simd,fp,simd")]
-)
-
-(define_insn "move_lo_quad_internal_be_<mode>"
-  [(set (match_operand:VQMOV 0 "register_operand" "=w,w,w")
-	(vec_concat:VQMOV
-	  (match_operand:<VHALF> 2 "aarch64_simd_or_scalar_imm_zero")
-	  (match_operand:<VHALF> 1 "register_operand" "w,r,r")))]
-  "TARGET_SIMD && BYTES_BIG_ENDIAN"
-  "@
-   dup\\t%d0, %1.d[0]
-   fmov\\t%d0, %1
-   dup\\t%d0, %1"
-  [(set_attr "type" "neon_dup<q>,f_mcr,neon_dup<q>")
-   (set_attr "length" "4")
-   (set_attr "arch" "simd,fp,simd")]
-)
-
 (define_expand "move_lo_quad_<mode>"
   [(match_operand:VQMOV 0 "register_operand")
    (match_operand:<VHALF> 1 "register_operand")]
   "TARGET_SIMD"
 {
-  rtx zs = CONST0_RTX (<VHALF>mode);
-  if (BYTES_BIG_ENDIAN)
-    emit_insn (gen_move_lo_quad_internal_be_<mode> (operands[0], operands[1], zs));
-  else
-    emit_insn (gen_move_lo_quad_internal_<mode> (operands[0], operands[1], zs));
+  emit_insn (gen_aarch64_combine<Vhalf> (operands[0], operands[1],
+					 CONST0_RTX (<VHALF>mode)));
   DONE;
 }
 )
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-8.c b/gcc/testsuite/gcc.target/aarch64/vec-init-8.c
new file mode 100644
index 00000000000..18f8afe10f5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-init-8.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O" } */
+
+#include <arm_neon.h>
+
+int64x2_t f1(int64_t *ptr) {
+  int64_t x = *ptr;
+  asm volatile ("" ::: "memory");
+  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
+    return (int64x2_t) { 0, x };
+  else
+    return (int64x2_t) { x, 0 };
+}
+
+/* { dg-final { scan-assembler {\tldr\td0, \[x0\]\n} } } */
-- 
2.25.1



* [pushed 5/8] aarch64: Add more vec_combine patterns
From: Richard Sandiford @ 2022-02-09 17:01 UTC (permalink / raw)
  To: gcc-patches

vec_combine is really one instruction on aarch64, provided that
the lowpart element is in the same register as the destination
vector.  This patch adds patterns for that.

The patch fixes a regression from GCC 8.  Before the patch:

int64x2_t s64q_1(int64_t a0, int64_t a1) {
  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
    return (int64x2_t) { a1, a0 };
  else
    return (int64x2_t) { a0, a1 };
}

generated:

        fmov    d0, x0
        ins     v0.d[1], x1
        ins     v0.d[1], x1
        ret

whereas GCC 8 generated the more respectable:

        dup     v0.2d, x0
        ins     v0.d[1], x1
        ret

gcc/
	* config/aarch64/predicates.md (aarch64_reg_or_mem_pair_operand):
	New predicate.
	* config/aarch64/aarch64-simd.md (*aarch64_combine_internal<mode>)
	(*aarch64_combine_internal_be<mode>): New patterns.

gcc/testsuite/
	* gcc.target/aarch64/vec-init-9.c: New test.
	* gcc.target/aarch64/vec-init-10.c: Likewise.
	* gcc.target/aarch64/vec-init-11.c: Likewise.
---
 gcc/config/aarch64/aarch64-simd.md            |  62 ++++
 gcc/config/aarch64/predicates.md              |   4 +
 .../gcc.target/aarch64/vec-init-10.c          |  15 +
 .../gcc.target/aarch64/vec-init-11.c          |  12 +
 gcc/testsuite/gcc.target/aarch64/vec-init-9.c | 267 ++++++++++++++++++
 5 files changed, 360 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-init-10.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-init-11.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-init-9.c

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index d6cd4c70fe7..ead80396e70 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -4326,6 +4326,25 @@ (define_insn "load_pair_lanes<mode>"
   [(set_attr "type" "neon_load1_1reg_q")]
 )
 
+;; This STP pattern is a partial duplicate of the general vec_concat patterns
+;; below.  The reason for having both of them is that the alternatives of
+;; the later patterns do not have consistent register preferences: the STP
+;; alternatives have no preference between GPRs and FPRs (and if anything,
+;; the GPR form is more natural for scalar integers) whereas the other
+;; alternatives *require* an FPR for operand 1 and prefer one for operand 2.
+;;
+;; Using "*" to hide the STP alternatives from the RA penalizes cases in
+;; which the destination was always memory.  On the other hand, expressing
+;; the true preferences makes GPRs seem more palatable than they really are
+;; for register destinations.
+;;
+;; Despite that, we do still want the general form to have STP alternatives,
+;; in order to handle cases where a register destination is spilled.
+;;
+;; The best compromise therefore seemed to be to have a dedicated STP
+;; pattern to catch cases in which the destination was always memory.
+;; This dedicated pattern must come first.
+
 (define_insn "store_pair_lanes<mode>"
   [(set (match_operand:<VDBL> 0 "aarch64_mem_pair_lanes_operand" "=Umn, Umn")
 	(vec_concat:<VDBL>
@@ -4338,6 +4357,49 @@ (define_insn "store_pair_lanes<mode>"
   [(set_attr "type" "neon_stp, store_16")]
 )
 
+;; Form a vector whose least significant half comes from operand 1 and whose
+;; most significant half comes from operand 2.  The register alternatives
+;; tie the least significant half to the same register as the destination,
+;; so that only the other half needs to be handled explicitly.  For the
+;; reasons given above, the STP alternatives use ? for constraints that
+;; the register alternatives either don't accept or themselves disparage.
+
+(define_insn "*aarch64_combine_internal<mode>"
+  [(set (match_operand:<VDBL> 0 "aarch64_reg_or_mem_pair_operand" "=w, w, w, Umn, Umn")
+	(vec_concat:<VDBL>
+	  (match_operand:VDC 1 "register_operand" "0, 0, 0, ?w, ?r")
+	  (match_operand:VDC 2 "aarch64_simd_nonimmediate_operand" "w, ?r, Utv, w, ?r")))]
+  "TARGET_SIMD
+   && !BYTES_BIG_ENDIAN
+   && (register_operand (operands[0], <VDBL>mode)
+       || register_operand (operands[2], <MODE>mode))"
+  "@
+   ins\t%0.d[1], %2.d[0]
+   ins\t%0.d[1], %2
+   ld1\t{%0.d}[1], %2
+   stp\t%d1, %d2, %y0
+   stp\t%x1, %x2, %y0"
+  [(set_attr "type" "neon_ins_q, neon_from_gp_q, neon_load1_one_lane_q, neon_stp, store_16")]
+)
+
+(define_insn "*aarch64_combine_internal_be<mode>"
+  [(set (match_operand:<VDBL> 0 "aarch64_reg_or_mem_pair_operand" "=w, w, w, Umn, Umn")
+	(vec_concat:<VDBL>
+	  (match_operand:VDC 2 "aarch64_simd_nonimmediate_operand" "w, ?r, Utv, ?w, ?r")
+	  (match_operand:VDC 1 "register_operand" "0, 0, 0, ?w, ?r")))]
+  "TARGET_SIMD
+   && BYTES_BIG_ENDIAN
+   && (register_operand (operands[0], <VDBL>mode)
+       || register_operand (operands[2], <MODE>mode))"
+  "@
+   ins\t%0.d[1], %2.d[0]
+   ins\t%0.d[1], %2
+   ld1\t{%0.d}[1], %2
+   stp\t%d2, %d1, %y0
+   stp\t%x2, %x1, %y0"
+  [(set_attr "type" "neon_ins_q, neon_from_gp_q, neon_load1_one_lane_q, neon_stp, store_16")]
+)
+
 ;; In this insn, operand 1 should be low, and operand 2 the high part of the
 ;; dest vector.
 
diff --git a/gcc/config/aarch64/predicates.md b/gcc/config/aarch64/predicates.md
index 7dc4c155ea8..c308015ac2c 100644
--- a/gcc/config/aarch64/predicates.md
+++ b/gcc/config/aarch64/predicates.md
@@ -254,6 +254,10 @@ (define_predicate "aarch64_mem_pair_lanes_operand"
 						  false,
 						  ADDR_QUERY_LDP_STP_N)")))
 
+(define_predicate "aarch64_reg_or_mem_pair_operand"
+  (ior (match_operand 0 "register_operand")
+       (match_operand 0 "aarch64_mem_pair_lanes_operand")))
+
 (define_predicate "aarch64_prefetch_operand"
   (match_test "aarch64_address_valid_for_prefetch_p (op, false)"))
 
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-10.c b/gcc/testsuite/gcc.target/aarch64/vec-init-10.c
new file mode 100644
index 00000000000..f5dd83b94b5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-init-10.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O" } */
+
+#include <arm_neon.h>
+
+int64x2_t f1(int64_t *x, int c) {
+  return c ? (int64x2_t) { x[0], x[2] } : (int64x2_t) { 0, 0 };
+}
+
+int64x2_t f2(int64_t *x, int i0, int i1, int c) {
+  return c ? (int64x2_t) { x[i0], x[i1] } : (int64x2_t) { 0, 0 };
+}
+
+/* { dg-final { scan-assembler-times {\t(?:ldr\td[0-9]+|ld1\t)} 4 } } */
+/* { dg-final { scan-assembler-not {\tldr\tx} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-11.c b/gcc/testsuite/gcc.target/aarch64/vec-init-11.c
new file mode 100644
index 00000000000..df242702c0c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-init-11.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-O" } */
+
+#include <arm_neon.h>
+
+void f1(int64x2_t *res, int64_t *x, int c0, int c1) {
+  res[0] = (int64x2_t) { c0 ? x[0] : 0, c1 ? x[2] : 0 };
+}
+
+/* { dg-final { scan-assembler-times {\tldr\tx[0-9]+} 2 } } */
+/* { dg-final { scan-assembler {\tstp\tx[0-9]+, x[0-9]+} } } */
+/* { dg-final { scan-assembler-not {\tldr\td} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-9.c b/gcc/testsuite/gcc.target/aarch64/vec-init-9.c
new file mode 100644
index 00000000000..8f68e06a559
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-init-9.c
@@ -0,0 +1,267 @@
+/* { dg-do compile } */
+/* { dg-options "-O" } */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } } } */
+
+#include <arm_neon.h>
+
+void ext();
+
+/*
+** s64q_1:
+**	fmov	d0, x0
+**	ins	v0\.d\[1\], x1
+**	ret
+*/
+int64x2_t s64q_1(int64_t a0, int64_t a1) {
+  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
+    return (int64x2_t) { a1, a0 };
+  else
+    return (int64x2_t) { a0, a1 };
+}
+/*
+** s64q_2:
+**	fmov	d0, x0
+**	ld1	{v0\.d}\[1\], \[x1\]
+**	ret
+*/
+int64x2_t s64q_2(int64_t a0, int64_t *ptr) {
+  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
+    return (int64x2_t) { ptr[0], a0 };
+  else
+    return (int64x2_t) { a0, ptr[0] };
+}
+/*
+** s64q_3:
+**	ldr	d0, \[x0\]
+**	ins	v0\.d\[1\], x1
+**	ret
+*/
+int64x2_t s64q_3(int64_t *ptr, int64_t a1) {
+  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
+    return (int64x2_t) { a1, ptr[0] };
+  else
+    return (int64x2_t) { ptr[0], a1 };
+}
+/*
+** s64q_4:
+**	stp	x1, x2, \[x0\]
+**	ret
+*/
+void s64q_4(int64x2_t *res, int64_t a0, int64_t a1) {
+  res[0] = (int64x2_t) { a0, a1 };
+}
+/*
+** s64q_5:
+**	stp	x1, x2, \[x0, #?8\]
+**	ret
+*/
+void s64q_5(uintptr_t res, int64_t a0, int64_t a1) {
+  *(int64x2_t *)(res + 8) = (int64x2_t) { a0, a1 };
+}
+/*
+** s64q_6:
+**	...
+**	stp	x0, x1, .*
+**	...
+**	ldr	q0, .*
+**	...
+**	ret
+*/
+int64x2_t s64q_6(int64_t a0, int64_t a1) {
+  int64x2_t res = { a0, a1 };
+  ext ();
+  return res;
+}
+
+/*
+** f64q_1:
+**	ins	v0\.d\[1\], v1\.d\[0\]
+**	ret
+*/
+float64x2_t f64q_1(float64_t a0, float64_t a1) {
+  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
+    return (float64x2_t) { a1, a0 };
+  else
+    return (float64x2_t) { a0, a1 };
+}
+/*
+** f64q_2:
+**	ld1	{v0\.d}\[1\], \[x0\]
+**	ret
+*/
+float64x2_t f64q_2(float64_t a0, float64_t *ptr) {
+  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
+    return (float64x2_t) { ptr[0], a0 };
+  else
+    return (float64x2_t) { a0, ptr[0] };
+}
+/*
+** f64q_3:
+**	ldr	d0, \[x0\]
+**	ins	v0\.d\[1\], v1\.d\[0\]
+**	ret
+*/
+float64x2_t f64q_3(float64_t a0, float64_t a1, float64_t *ptr) {
+  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
+    return (float64x2_t) { a1, ptr[0] };
+  else
+    return (float64x2_t) { ptr[0], a1 };
+}
+/*
+** f64q_4:
+**	stp	d0, d1, \[x0\]
+**	ret
+*/
+void f64q_4(float64x2_t *res, float64_t a0, float64_t a1) {
+  res[0] = (float64x2_t) { a0, a1 };
+}
+/*
+** f64q_5:
+**	stp	d0, d1, \[x0, #?8\]
+**	ret
+*/
+void f64q_5(uintptr_t res, float64_t a0, float64_t a1) {
+  *(float64x2_t *)(res + 8) = (float64x2_t) { a0, a1 };
+}
+/*
+** f64q_6:
+**	...
+**	stp	d0, d1, .*
+**	...
+**	ldr	q0, .*
+**	...
+**	ret
+*/
+float64x2_t f64q_6(float64_t a0, float64_t a1) {
+  float64x2_t res = { a0, a1 };
+  ext ();
+  return res;
+}
+
+/*
+** s32q_1:
+**	ins	v0\.d\[1\], v1\.d\[0\]
+**	ret
+*/
+int32x4_t s32q_1(int32x2_t a0, int32x2_t a1) {
+  return vcombine_s32 (a0, a1);
+}
+/*
+** s32q_2:
+**	ld1	{v0\.d}\[1\], \[x0\]
+**	ret
+*/
+int32x4_t s32q_2(int32x2_t a0, int32x2_t *ptr) {
+  return vcombine_s32 (a0, ptr[0]);
+}
+/*
+** s32q_3:
+**	ldr	d0, \[x0\]
+**	ins	v0\.d\[1\], v1\.d\[0\]
+**	ret
+*/
+int32x4_t s32q_3(int32x2_t a0, int32x2_t a1, int32x2_t *ptr) {
+  return vcombine_s32 (ptr[0], a1);
+}
+/*
+** s32q_4:
+**	stp	d0, d1, \[x0\]
+**	ret
+*/
+void s32q_4(int32x4_t *res, int32x2_t a0, int32x2_t a1) {
+  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
+    res[0] = vcombine_s32 (a1, a0);
+  else
+    res[0] = vcombine_s32 (a0, a1);
+}
+/*
+** s32q_5:
+**	stp	d0, d1, \[x0, #?8\]
+**	ret
+*/
+void s32q_5(uintptr_t res, int32x2_t a0, int32x2_t a1) {
+  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
+    *(int32x4_t *)(res + 8) = vcombine_s32 (a1, a0);
+  else
+    *(int32x4_t *)(res + 8) = vcombine_s32 (a0, a1);
+}
+/*
+** s32q_6:
+**	...
+**	stp	d0, d1, .*
+**	...
+**	ldr	q0, .*
+**	...
+**	ret
+*/
+int32x4_t s32q_6(int32x2_t a0, int32x2_t a1) {
+  int32x4_t res = (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
+		   ? vcombine_s32 (a1, a0)
+		   : vcombine_s32 (a0, a1));
+  ext ();
+  return res;
+}
+
+/*
+** f32q_1:
+**	ins	v0\.d\[1\], v1\.d\[0\]
+**	ret
+*/
+float32x4_t f32q_1(float32x2_t a0, float32x2_t a1) {
+  return vcombine_f32 (a0, a1);
+}
+/*
+** f32q_2:
+**	ld1	{v0\.d}\[1\], \[x0\]
+**	ret
+*/
+float32x4_t f32q_2(float32x2_t a0, float32x2_t *ptr) {
+  return vcombine_f32 (a0, ptr[0]);
+}
+/*
+** f32q_3:
+**	ldr	d0, \[x0\]
+**	ins	v0\.d\[1\], v1\.d\[0\]
+**	ret
+*/
+float32x4_t f32q_3(float32x2_t a0, float32x2_t a1, float32x2_t *ptr) {
+  return vcombine_f32 (ptr[0], a1);
+}
+/*
+** f32q_4:
+**	stp	d0, d1, \[x0\]
+**	ret
+*/
+void f32q_4(float32x4_t *res, float32x2_t a0, float32x2_t a1) {
+  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
+    res[0] = vcombine_f32 (a1, a0);
+  else
+    res[0] = vcombine_f32 (a0, a1);
+}
+/*
+** f32q_5:
+**	stp	d0, d1, \[x0, #?8\]
+**	ret
+*/
+void f32q_5(uintptr_t res, float32x2_t a0, float32x2_t a1) {
+  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
+    *(float32x4_t *)(res + 8) = vcombine_f32 (a1, a0);
+  else
+    *(float32x4_t *)(res + 8) = vcombine_f32 (a0, a1);
+}
+/*
+** f32q_6:
+**	...
+**	stp	d0, d1, .*
+**	...
+**	ldr	q0, .*
+**	...
+**	ret
+*/
+float32x4_t f32q_6(float32x2_t a0, float32x2_t a1) {
+  float32x4_t res = (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
+		     ? vcombine_f32 (a1, a0)
+		     : vcombine_f32 (a0, a1));
+  ext ();
+  return res;
+}
-- 
2.25.1



* [pushed 6/8] aarch64: Add a general vec_concat expander
From: Richard Sandiford @ 2022-02-09 17:01 UTC (permalink / raw)
  To: gcc-patches

After previous patches, we have a (mostly new) group of vec_concat
patterns as well as vestiges of the old move_lo/hi_quad patterns.
(A previous patch removed the move_lo_quad insns, but we still
have the move_hi_quad insns and both sets of expanders.)

This patch is the first of two to remove the old move_lo/hi_quad
stuff.  It isn't technically a regression fix, but it seemed
better to make the changes now rather than leave things in
a half-finished and inconsistent state.

This patch defines an aarch64_vec_concat expander that coerces the
element operands into a valid form, including the ones added by the
previous patch.  This in turn lets us get rid of one move_lo/hi_quad
pair.

As a side-effect, it also means that vcombines of 2 vectors make
better use of the available forms, like vec_inits of 2 scalars
already do.
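
For example (a sketch along the lines of the new vec-init-12.c test),
a vcombine whose high half comes from memory can now keep that half as
a memory operand, so that the lane-load alternative added by the
previous patch is available:

#include <arm_neon.h>

/* The high half is loaded from memory rather than moved from a
   register.  */
int32x4_t combine_mem_high(int32x2_t a0, int32x2_t *ptr) {
  return vcombine_s32 (a0, ptr[1]);
}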

gcc/
	* config/aarch64/aarch64-protos.h (aarch64_split_simd_combine):
	Delete.
	* config/aarch64/aarch64-simd.md (@aarch64_combinez<mode>): Rename
	to...
	(*aarch64_combinez<mode>): ...this.
	(@aarch64_combinez_be<mode>): Rename to...
	(*aarch64_combinez_be<mode>): ...this.
	(@aarch64_vec_concat<mode>): New expander.
	(aarch64_combine<mode>): Use it.
	(@aarch64_simd_combine<mode>): Delete.
	* config/aarch64/aarch64.cc (aarch64_split_simd_combine): Delete.
	(aarch64_expand_vector_init): Use aarch64_vec_concat.

gcc/testsuite/
	* gcc.target/aarch64/vec-init-12.c: New test.
---
 gcc/config/aarch64/aarch64-protos.h           |  2 -
 gcc/config/aarch64/aarch64-simd.md            | 76 ++++++++++++-------
 gcc/config/aarch64/aarch64.cc                 | 55 ++------------
 .../gcc.target/aarch64/vec-init-12.c          | 65 ++++++++++++++++
 4 files changed, 122 insertions(+), 76 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-init-12.c

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index b75ed35635b..392efa0b74d 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -925,8 +925,6 @@ bool aarch64_split_128bit_move_p (rtx, rtx);
 
 bool aarch64_mov128_immediate (rtx);
 
-void aarch64_split_simd_combine (rtx, rtx, rtx);
-
 void aarch64_split_simd_move (rtx, rtx);
 
 /* Check for a legitimate floating point constant for FMOV.  */
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index ead80396e70..7acde0dd099 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -4403,7 +4403,7 @@ (define_insn "*aarch64_combine_internal_be<mode>"
 ;; In this insn, operand 1 should be low, and operand 2 the high part of the
 ;; dest vector.
 
-(define_insn "@aarch64_combinez<mode>"
+(define_insn "*aarch64_combinez<mode>"
   [(set (match_operand:<VDBL> 0 "register_operand" "=w,w,w")
 	(vec_concat:<VDBL>
 	  (match_operand:VDC 1 "nonimmediate_operand" "w,?r,m")
@@ -4417,7 +4417,7 @@ (define_insn "@aarch64_combinez<mode>"
    (set_attr "arch" "simd,fp,simd")]
 )
 
-(define_insn "@aarch64_combinez_be<mode>"
+(define_insn "*aarch64_combinez_be<mode>"
   [(set (match_operand:<VDBL> 0 "register_operand" "=w,w,w")
         (vec_concat:<VDBL>
 	  (match_operand:VDC 2 "aarch64_simd_or_scalar_imm_zero")
@@ -4431,38 +4431,62 @@ (define_insn "@aarch64_combinez_be<mode>"
    (set_attr "arch" "simd,fp,simd")]
 )
 
-(define_expand "aarch64_combine<mode>"
-  [(match_operand:<VDBL> 0 "register_operand")
-   (match_operand:VDC 1 "register_operand")
-   (match_operand:VDC 2 "aarch64_simd_reg_or_zero")]
+;; Form a vector whose first half (in array order) comes from operand 1
+;; and whose second half (in array order) comes from operand 2.
+;; This operand order follows the RTL vec_concat operation.
+(define_expand "@aarch64_vec_concat<mode>"
+  [(set (match_operand:<VDBL> 0 "register_operand")
+	(vec_concat:<VDBL>
+	  (match_operand:VDC 1 "general_operand")
+	  (match_operand:VDC 2 "general_operand")))]
   "TARGET_SIMD"
 {
-  if (operands[2] == CONST0_RTX (<MODE>mode))
+  int lo = BYTES_BIG_ENDIAN ? 2 : 1;
+  int hi = BYTES_BIG_ENDIAN ? 1 : 2;
+
+  if (MEM_P (operands[1])
+      && MEM_P (operands[2])
+      && aarch64_mergeable_load_pair_p (<VDBL>mode, operands[1], operands[2]))
+    /* Use load_pair_lanes<mode>.  */
+    ;
+  else if (operands[hi] == CONST0_RTX (<MODE>mode))
     {
-      if (BYTES_BIG_ENDIAN)
-	emit_insn (gen_aarch64_combinez_be<mode> (operands[0], operands[1],
-						  operands[2]));
-      else
-	emit_insn (gen_aarch64_combinez<mode> (operands[0], operands[1],
-					       operands[2]));
+      /* Use *aarch64_combinez<mode>.  */
+      if (!nonimmediate_operand (operands[lo], <MODE>mode))
+	operands[lo] = force_reg (<MODE>mode, operands[lo]);
     }
   else
-    aarch64_split_simd_combine (operands[0], operands[1], operands[2]);
-  DONE;
-}
-)
+    {
+      /* Use *aarch64_combine_general<mode>.  */
+      operands[lo] = force_reg (<MODE>mode, operands[lo]);
+      if (!aarch64_simd_nonimmediate_operand (operands[hi], <MODE>mode))
+	{
+	  if (MEM_P (operands[hi]))
+	    {
+	      rtx addr = force_reg (Pmode, XEXP (operands[hi], 0));
+	      operands[hi] = replace_equiv_address (operands[hi], addr);
+	    }
+	  else
+	    operands[hi] = force_reg (<MODE>mode, operands[hi]);
+	}
+    }
+})
 
-(define_expand "@aarch64_simd_combine<mode>"
+;; Form a vector whose least significant half comes from operand 1 and whose
+;; most significant half comes from operand 2.  This operand order follows
+;; arm_neon.h vcombine* intrinsics.
+(define_expand "aarch64_combine<mode>"
   [(match_operand:<VDBL> 0 "register_operand")
-   (match_operand:VDC 1 "register_operand")
-   (match_operand:VDC 2 "register_operand")]
+   (match_operand:VDC 1 "general_operand")
+   (match_operand:VDC 2 "general_operand")]
   "TARGET_SIMD"
-  {
-    emit_insn (gen_move_lo_quad_<Vdbl> (operands[0], operands[1]));
-    emit_insn (gen_move_hi_quad_<Vdbl> (operands[0], operands[2]));
-    DONE;
-  }
-[(set_attr "type" "multiple")]
+{
+  if (BYTES_BIG_ENDIAN)
+    std::swap (operands[1], operands[2]);
+  emit_insn (gen_aarch64_vec_concat<mode> (operands[0], operands[1],
+					   operands[2]));
+  DONE;
+}
 )
 
 ;; <su><addsub>l<q>.
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index c47543aebf3..af42d1bedfe 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -4239,23 +4239,6 @@ aarch64_split_128bit_move_p (rtx dst, rtx src)
   return true;
 }
 
-/* Split a complex SIMD combine.  */
-
-void
-aarch64_split_simd_combine (rtx dst, rtx src1, rtx src2)
-{
-  machine_mode src_mode = GET_MODE (src1);
-  machine_mode dst_mode = GET_MODE (dst);
-
-  gcc_assert (VECTOR_MODE_P (dst_mode));
-  gcc_assert (register_operand (dst, dst_mode)
-	      && register_operand (src1, src_mode)
-	      && register_operand (src2, src_mode));
-
-  emit_insn (gen_aarch64_simd_combine (src_mode, dst, src1, src2));
-  return;
-}
-
 /* Split a complex SIMD move.  */
 
 void
@@ -20941,37 +20924,13 @@ aarch64_expand_vector_init (rtx target, rtx vals)
      of mode N in VALS and we must put their concatentation into TARGET.  */
   if (XVECLEN (vals, 0) == 2 && VECTOR_MODE_P (GET_MODE (XVECEXP (vals, 0, 0))))
     {
-      gcc_assert (known_eq (GET_MODE_SIZE (mode),
-		  2 * GET_MODE_SIZE (GET_MODE (XVECEXP (vals, 0, 0)))));
-      rtx lo = XVECEXP (vals, 0, 0);
-      rtx hi = XVECEXP (vals, 0, 1);
-      machine_mode narrow_mode = GET_MODE (lo);
-      gcc_assert (GET_MODE_INNER (narrow_mode) == inner_mode);
-      gcc_assert (narrow_mode == GET_MODE (hi));
-
-      /* When we want to concatenate a half-width vector with zeroes we can
-	 use the aarch64_combinez[_be] patterns.  Just make sure that the
-	 zeroes are in the right half.  */
-      if (BYTES_BIG_ENDIAN
-	  && aarch64_simd_imm_zero (lo, narrow_mode)
-	  && general_operand (hi, narrow_mode))
-	emit_insn (gen_aarch64_combinez_be (narrow_mode, target, hi, lo));
-      else if (!BYTES_BIG_ENDIAN
-	       && aarch64_simd_imm_zero (hi, narrow_mode)
-	       && general_operand (lo, narrow_mode))
-	emit_insn (gen_aarch64_combinez (narrow_mode, target, lo, hi));
-      else
-	{
-	  /* Else create the two half-width registers and combine them.  */
-	  if (!REG_P (lo))
-	    lo = force_reg (GET_MODE (lo), lo);
-	  if (!REG_P (hi))
-	    hi = force_reg (GET_MODE (hi), hi);
-
-	  if (BYTES_BIG_ENDIAN)
-	    std::swap (lo, hi);
-	  emit_insn (gen_aarch64_simd_combine (narrow_mode, target, lo, hi));
-	}
+      machine_mode narrow_mode = GET_MODE (XVECEXP (vals, 0, 0));
+      gcc_assert (GET_MODE_INNER (narrow_mode) == inner_mode
+		  && known_eq (GET_MODE_SIZE (mode),
+			       2 * GET_MODE_SIZE (narrow_mode)));
+      emit_insn (gen_aarch64_vec_concat (narrow_mode, target,
+					 XVECEXP (vals, 0, 0),
+					 XVECEXP (vals, 0, 1)));
      return;
    }
 
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-12.c b/gcc/testsuite/gcc.target/aarch64/vec-init-12.c
new file mode 100644
index 00000000000..c287478e2d8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-init-12.c
@@ -0,0 +1,65 @@
+/* { dg-do compile } */
+/* { dg-options "-O" } */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } } } */
+
+#include <arm_neon.h>
+
+/*
+** s32_1:
+**	ldr	q0, \[x0\]
+**	ret
+*/
+int32x4_t s32_1(int32x2_t *ptr) {
+  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
+    return vcombine_s32 (ptr[1], ptr[0]);
+  else
+    return vcombine_s32 (ptr[0], ptr[1]);
+}
+/*
+** s32_2:
+**	add	x([0-9])+, x0, #?8
+**	ld1	{v0\.d}\[1\], \[x\1\]
+**	ret
+*/
+int32x4_t s32_2(int32x2_t a0, int32x2_t *ptr) {
+  return vcombine_s32 (a0, ptr[1]);
+}
+/*
+** s32_3:
+**	ldr	d0, \[x0\], #?16
+**	ld1	{v0\.d}\[1\], \[x0\]
+**	ret
+*/
+int32x4_t s32_3(int32x2_t *ptr) {
+  return vcombine_s32 (ptr[0], ptr[2]);
+}
+
+/*
+** f32_1:
+**	ldr	q0, \[x0\]
+**	ret
+*/
+float32x4_t f32_1(float32x2_t *ptr) {
+  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
+    return vcombine_f32 (ptr[1], ptr[0]);
+  else
+    return vcombine_f32 (ptr[0], ptr[1]);
+}
+/*
+** f32_2:
+**	add	x([0-9])+, x0, #?8
+**	ld1	{v0\.d}\[1\], \[x\1\]
+**	ret
+*/
+float32x4_t f32_2(float32x2_t a0, float32x2_t *ptr) {
+  return vcombine_f32 (a0, ptr[1]);
+}
+/*
+** f32_3:
+**	ldr	d0, \[x0\], #?16
+**	ld1	{v0\.d}\[1\], \[x0\]
+**	ret
+*/
+float32x4_t f32_3(float32x2_t *ptr) {
+  return vcombine_f32 (ptr[0], ptr[2]);
+}
-- 
2.25.1



* [pushed 7/8] aarch64: Remove move_lo/hi_quad expanders
From: Richard Sandiford @ 2022-02-09 17:01 UTC (permalink / raw)
  To: gcc-patches

This patch is the second of two to remove the old
move_lo/hi_quad expanders and move_hi_quad insns.

gcc/
	* config/aarch64/aarch64-simd.md (@aarch64_split_simd_mov<mode>):
	Use aarch64_combine instead of move_lo/hi_quad.  Tabify.
	(move_lo_quad_<mode>, aarch64_simd_move_hi_quad_<mode>): Delete.
	(aarch64_simd_move_hi_quad_be_<mode>, move_hi_quad_<mode>): Delete.
	(vec_pack_trunc_<mode>): Take general_operand elements and use
	aarch64_combine rather than move_lo/hi_quad to combine them.
	(vec_pack_trunc_df): Likewise.
---
 gcc/config/aarch64/aarch64-simd.md | 111 +++++------------------------
 1 file changed, 18 insertions(+), 93 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 7acde0dd099..ef6e772503d 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -272,7 +272,7 @@ (define_split
 
 (define_expand "@aarch64_split_simd_mov<mode>"
   [(set (match_operand:VQMOV 0)
-        (match_operand:VQMOV 1))]
+	(match_operand:VQMOV 1))]
   "TARGET_SIMD"
   {
     rtx dst = operands[0];
@@ -280,23 +280,22 @@ (define_expand "@aarch64_split_simd_mov<mode>"
 
     if (GP_REGNUM_P (REGNO (src)))
       {
-        rtx src_low_part = gen_lowpart (<VHALF>mode, src);
-        rtx src_high_part = gen_highpart (<VHALF>mode, src);
+	rtx src_low_part = gen_lowpart (<VHALF>mode, src);
+	rtx src_high_part = gen_highpart (<VHALF>mode, src);
+	rtx dst_low_part = gen_lowpart (<VHALF>mode, dst);
 
-        emit_insn
-          (gen_move_lo_quad_<mode> (dst, src_low_part));
-        emit_insn
-          (gen_move_hi_quad_<mode> (dst, src_high_part));
+	emit_move_insn (dst_low_part, src_low_part);
+	emit_insn (gen_aarch64_combine<Vhalf> (dst, dst_low_part,
+					       src_high_part));
       }
-
     else
       {
-        rtx dst_low_part = gen_lowpart (<VHALF>mode, dst);
-        rtx dst_high_part = gen_highpart (<VHALF>mode, dst);
+	rtx dst_low_part = gen_lowpart (<VHALF>mode, dst);
+	rtx dst_high_part = gen_highpart (<VHALF>mode, dst);
 	rtx lo = aarch64_simd_vect_par_cnst_half (<MODE>mode, <nunits>, false);
 	rtx hi = aarch64_simd_vect_par_cnst_half (<MODE>mode, <nunits>, true);
-        emit_insn (gen_aarch64_get_half<mode> (dst_low_part, src, lo));
-        emit_insn (gen_aarch64_get_half<mode> (dst_high_part, src, hi));
+	emit_insn (gen_aarch64_get_half<mode> (dst_low_part, src, lo));
+	emit_insn (gen_aarch64_get_half<mode> (dst_high_part, src, hi));
       }
     DONE;
   }
@@ -1580,69 +1579,6 @@ (define_insn "aarch64_<optab>p<mode>"
 ;; What that means, is that the RTL descriptions of the below patterns
 ;; need to change depending on endianness.
 
-;; Move to the low architectural bits of the register.
-;; On little-endian this is { operand, zeroes }
-;; On big-endian this is { zeroes, operand }
-
-(define_expand "move_lo_quad_<mode>"
-  [(match_operand:VQMOV 0 "register_operand")
-   (match_operand:<VHALF> 1 "register_operand")]
-  "TARGET_SIMD"
-{
-  emit_insn (gen_aarch64_combine<Vhalf> (operands[0], operands[1],
-					 CONST0_RTX (<VHALF>mode)));
-  DONE;
-}
-)
-
-;; Move operand1 to the high architectural bits of the register, keeping
-;; the low architectural bits of operand2.
-;; For little-endian this is { operand2, operand1 }
-;; For big-endian this is { operand1, operand2 }
-
-(define_insn "aarch64_simd_move_hi_quad_<mode>"
-  [(set (match_operand:VQMOV 0 "register_operand" "+w,w")
-        (vec_concat:VQMOV
-          (vec_select:<VHALF>
-                (match_dup 0)
-                (match_operand:VQMOV 2 "vect_par_cnst_lo_half" ""))
-	  (match_operand:<VHALF> 1 "register_operand" "w,r")))]
-  "TARGET_SIMD && !BYTES_BIG_ENDIAN"
-  "@
-   ins\\t%0.d[1], %1.d[0]
-   ins\\t%0.d[1], %1"
-  [(set_attr "type" "neon_ins")]
-)
-
-(define_insn "aarch64_simd_move_hi_quad_be_<mode>"
-  [(set (match_operand:VQMOV 0 "register_operand" "+w,w")
-        (vec_concat:VQMOV
-	  (match_operand:<VHALF> 1 "register_operand" "w,r")
-          (vec_select:<VHALF>
-                (match_dup 0)
-                (match_operand:VQMOV 2 "vect_par_cnst_lo_half" ""))))]
-  "TARGET_SIMD && BYTES_BIG_ENDIAN"
-  "@
-   ins\\t%0.d[1], %1.d[0]
-   ins\\t%0.d[1], %1"
-  [(set_attr "type" "neon_ins")]
-)
-
-(define_expand "move_hi_quad_<mode>"
- [(match_operand:VQMOV 0 "register_operand")
-  (match_operand:<VHALF> 1 "register_operand")]
- "TARGET_SIMD"
-{
-  rtx p = aarch64_simd_vect_par_cnst_half (<MODE>mode, <nunits>, false);
-  if (BYTES_BIG_ENDIAN)
-    emit_insn (gen_aarch64_simd_move_hi_quad_be_<mode> (operands[0],
-		    operands[1], p));
-  else
-    emit_insn (gen_aarch64_simd_move_hi_quad_<mode> (operands[0],
-		    operands[1], p));
-  DONE;
-})
-
 ;; Narrowing operations.
 
 (define_insn "aarch64_xtn<mode>_insn_le"
@@ -1743,16 +1679,12 @@ (define_insn "*aarch64_narrow_trunc<mode>"
 
 (define_expand "vec_pack_trunc_<mode>"
  [(match_operand:<VNARROWD> 0 "register_operand")
-  (match_operand:VDN 1 "register_operand")
-  (match_operand:VDN 2 "register_operand")]
+  (match_operand:VDN 1 "general_operand")
+  (match_operand:VDN 2 "general_operand")]
  "TARGET_SIMD"
 {
   rtx tempreg = gen_reg_rtx (<VDBL>mode);
-  int lo = BYTES_BIG_ENDIAN ? 2 : 1;
-  int hi = BYTES_BIG_ENDIAN ? 1 : 2;
-
-  emit_insn (gen_move_lo_quad_<Vdbl> (tempreg, operands[lo]));
-  emit_insn (gen_move_hi_quad_<Vdbl> (tempreg, operands[hi]));
+  emit_insn (gen_aarch64_vec_concat<mode> (tempreg, operands[1], operands[2]));
   emit_insn (gen_trunc<Vdbl><Vnarrowd>2 (operands[0], tempreg));
   DONE;
 })
@@ -3402,20 +3334,13 @@ (define_expand "vec_pack_trunc_v2df"
 
 (define_expand "vec_pack_trunc_df"
   [(set (match_operand:V2SF 0 "register_operand")
-      (vec_concat:V2SF
-	(float_truncate:SF
-	    (match_operand:DF 1 "register_operand"))
-	(float_truncate:SF
-	    (match_operand:DF 2 "register_operand"))
-	  ))]
+	(vec_concat:V2SF
+	  (float_truncate:SF (match_operand:DF 1 "general_operand"))
+	  (float_truncate:SF (match_operand:DF 2 "general_operand"))))]
   "TARGET_SIMD"
   {
     rtx tmp = gen_reg_rtx (V2SFmode);
-    int lo = BYTES_BIG_ENDIAN ? 2 : 1;
-    int hi = BYTES_BIG_ENDIAN ? 1 : 2;
-
-    emit_insn (gen_move_lo_quad_v2df (tmp, operands[lo]));
-    emit_insn (gen_move_hi_quad_v2df (tmp, operands[hi]));
+    emit_insn (gen_aarch64_vec_concatdf (tmp, operands[1], operands[2]));
     emit_insn (gen_aarch64_float_truncate_lo_v2sf (operands[0], tmp));
     DONE;
   }
-- 
2.25.1



* [pushed 8/8] aarch64: Extend vec_concat patterns to 8-byte vectors
From: Richard Sandiford @ 2022-02-09 17:02 UTC (permalink / raw)
  To: gcc-patches

This patch extends the previous support for 16-byte vec_concat
so that it supports pairs of 4-byte elements.  This too isn't
strictly a regression fix, since the 8-byte forms weren't affected
by the same problems as the 16-byte forms, but it leaves things in
a more consistent state.
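
As a hypothetical illustration of the newly covered cases, an 8-byte
vector built from two 4-byte elements can now use the same vec_concat
patterns as the 16-byte forms:

#include <arm_neon.h>

/* Build a 64-bit vector from two 32-bit scalars.  */
float32x2_t make_pair(float32_t a0, float32_t a1) {
  return (float32x2_t) { a0, a1 };
}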

gcc/
	* config/aarch64/iterators.md (VDCSIF): New mode iterator.
	(VDBL): Handle SF.
	(single_wx, single_type, single_dtype, dblq): New mode attributes.
	* config/aarch64/aarch64-simd.md (load_pair_lanes<mode>): Extend
	from VDC to VDCSIF.
	(store_pair_lanes<mode>): Likewise.
	(*aarch64_combine_internal<mode>): Likewise.
	(*aarch64_combine_internal_be<mode>): Likewise.
	(*aarch64_combinez<mode>): Likewise.
	(*aarch64_combinez_be<mode>): Likewise.
	* config/aarch64/aarch64.cc (aarch64_classify_address): Handle
	8-byte modes for ADDR_QUERY_LDP_STP_N.
	(aarch64_print_operand): Likewise for %y.

gcc/testsuite/
	* gcc.target/aarch64/vec-init-13.c: New test.
	* gcc.target/aarch64/vec-init-14.c: Likewise.
	* gcc.target/aarch64/vec-init-15.c: Likewise.
	* gcc.target/aarch64/vec-init-16.c: Likewise.
	* gcc.target/aarch64/vec-init-17.c: Likewise.
---
 gcc/config/aarch64/aarch64-simd.md            |  72 +++++-----
 gcc/config/aarch64/aarch64.cc                 |  16 ++-
 gcc/config/aarch64/iterators.md               |  38 +++++-
 .../gcc.target/aarch64/vec-init-13.c          | 123 ++++++++++++++++++
 .../gcc.target/aarch64/vec-init-14.c          | 123 ++++++++++++++++++
 .../gcc.target/aarch64/vec-init-15.c          |  15 +++
 .../gcc.target/aarch64/vec-init-16.c          |  12 ++
 .../gcc.target/aarch64/vec-init-17.c          |  73 +++++++++++
 8 files changed, 430 insertions(+), 42 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-init-13.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-init-14.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-init-15.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-init-16.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-init-17.c

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index ef6e772503d..18733428f3f 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -4243,12 +4243,12 @@ (define_insn_and_split "aarch64_get_lane<mode>"
 (define_insn "load_pair_lanes<mode>"
   [(set (match_operand:<VDBL> 0 "register_operand" "=w")
 	(vec_concat:<VDBL>
-	   (match_operand:VDC 1 "memory_operand" "Utq")
-	   (match_operand:VDC 2 "memory_operand" "m")))]
+	   (match_operand:VDCSIF 1 "memory_operand" "Utq")
+	   (match_operand:VDCSIF 2 "memory_operand" "m")))]
   "TARGET_SIMD
    && aarch64_mergeable_load_pair_p (<VDBL>mode, operands[1], operands[2])"
-  "ldr\\t%q0, %1"
-  [(set_attr "type" "neon_load1_1reg_q")]
+  "ldr\\t%<single_dtype>0, %1"
+  [(set_attr "type" "neon_load1_1reg<dblq>")]
 )
 
 ;; This STP pattern is a partial duplicate of the general vec_concat patterns
@@ -4273,12 +4273,12 @@ (define_insn "load_pair_lanes<mode>"
 (define_insn "store_pair_lanes<mode>"
   [(set (match_operand:<VDBL> 0 "aarch64_mem_pair_lanes_operand" "=Umn, Umn")
 	(vec_concat:<VDBL>
-	   (match_operand:VDC 1 "register_operand" "w, r")
-	   (match_operand:VDC 2 "register_operand" "w, r")))]
+	   (match_operand:VDCSIF 1 "register_operand" "w, r")
+	   (match_operand:VDCSIF 2 "register_operand" "w, r")))]
   "TARGET_SIMD"
   "@
-   stp\\t%d1, %d2, %y0
-   stp\\t%x1, %x2, %y0"
+   stp\t%<single_type>1, %<single_type>2, %y0
+   stp\t%<single_wx>1, %<single_wx>2, %y0"
   [(set_attr "type" "neon_stp, store_16")]
 )
 
@@ -4292,37 +4292,37 @@ (define_insn "store_pair_lanes<mode>"
 (define_insn "*aarch64_combine_internal<mode>"
   [(set (match_operand:<VDBL> 0 "aarch64_reg_or_mem_pair_operand" "=w, w, w, Umn, Umn")
 	(vec_concat:<VDBL>
-	  (match_operand:VDC 1 "register_operand" "0, 0, 0, ?w, ?r")
-	  (match_operand:VDC 2 "aarch64_simd_nonimmediate_operand" "w, ?r, Utv, w, ?r")))]
+	  (match_operand:VDCSIF 1 "register_operand" "0, 0, 0, ?w, ?r")
+	  (match_operand:VDCSIF 2 "aarch64_simd_nonimmediate_operand" "w, ?r, Utv, w, ?r")))]
   "TARGET_SIMD
    && !BYTES_BIG_ENDIAN
    && (register_operand (operands[0], <VDBL>mode)
        || register_operand (operands[2], <MODE>mode))"
   "@
-   ins\t%0.d[1], %2.d[0]
-   ins\t%0.d[1], %2
-   ld1\t{%0.d}[1], %2
-   stp\t%d1, %d2, %y0
-   stp\t%x1, %x2, %y0"
-  [(set_attr "type" "neon_ins_q, neon_from_gp_q, neon_load1_one_lane_q, neon_stp, store_16")]
+   ins\t%0.<single_type>[1], %2.<single_type>[0]
+   ins\t%0.<single_type>[1], %<single_wx>2
+   ld1\t{%0.<single_type>}[1], %2
+   stp\t%<single_type>1, %<single_type>2, %y0
+   stp\t%<single_wx>1, %<single_wx>2, %y0"
+  [(set_attr "type" "neon_ins<dblq>, neon_from_gp<dblq>, neon_load1_one_lane<dblq>, neon_stp, store_16")]
 )
 
 (define_insn "*aarch64_combine_internal_be<mode>"
   [(set (match_operand:<VDBL> 0 "aarch64_reg_or_mem_pair_operand" "=w, w, w, Umn, Umn")
 	(vec_concat:<VDBL>
-	  (match_operand:VDC 2 "aarch64_simd_nonimmediate_operand" "w, ?r, Utv, ?w, ?r")
-	  (match_operand:VDC 1 "register_operand" "0, 0, 0, ?w, ?r")))]
+	  (match_operand:VDCSIF 2 "aarch64_simd_nonimmediate_operand" "w, ?r, Utv, ?w, ?r")
+	  (match_operand:VDCSIF 1 "register_operand" "0, 0, 0, ?w, ?r")))]
   "TARGET_SIMD
    && BYTES_BIG_ENDIAN
    && (register_operand (operands[0], <VDBL>mode)
        || register_operand (operands[2], <MODE>mode))"
   "@
-   ins\t%0.d[1], %2.d[0]
-   ins\t%0.d[1], %2
-   ld1\t{%0.d}[1], %2
-   stp\t%d2, %d1, %y0
-   stp\t%x2, %x1, %y0"
-  [(set_attr "type" "neon_ins_q, neon_from_gp_q, neon_load1_one_lane_q, neon_stp, store_16")]
+   ins\t%0.<single_type>[1], %2.<single_type>[0]
+   ins\t%0.<single_type>[1], %<single_wx>2
+   ld1\t{%0.<single_type>}[1], %2
+   stp\t%<single_type>2, %<single_type>1, %y0
+   stp\t%<single_wx>2, %<single_wx>1, %y0"
+  [(set_attr "type" "neon_ins<dblq>, neon_from_gp<dblq>, neon_load1_one_lane<dblq>, neon_stp, store_16")]
 )
 
 ;; In this insn, operand 1 should be low, and operand 2 the high part of the
@@ -4331,13 +4331,13 @@ (define_insn "*aarch64_combine_internal_be<mode>"
 (define_insn "*aarch64_combinez<mode>"
   [(set (match_operand:<VDBL> 0 "register_operand" "=w,w,w")
 	(vec_concat:<VDBL>
-	  (match_operand:VDC 1 "nonimmediate_operand" "w,?r,m")
-	  (match_operand:VDC 2 "aarch64_simd_or_scalar_imm_zero")))]
+	  (match_operand:VDCSIF 1 "nonimmediate_operand" "w,?r,m")
+	  (match_operand:VDCSIF 2 "aarch64_simd_or_scalar_imm_zero")))]
   "TARGET_SIMD && !BYTES_BIG_ENDIAN"
   "@
-   mov\\t%0.8b, %1.8b
-   fmov\t%d0, %1
-   ldr\\t%d0, %1"
+   fmov\\t%<single_type>0, %<single_type>1
+   fmov\t%<single_type>0, %<single_wx>1
+   ldr\\t%<single_type>0, %1"
   [(set_attr "type" "neon_move<q>, neon_from_gp, neon_load1_1reg")
    (set_attr "arch" "simd,fp,simd")]
 )
@@ -4345,13 +4345,13 @@ (define_insn "*aarch64_combinez<mode>"
 (define_insn "*aarch64_combinez_be<mode>"
   [(set (match_operand:<VDBL> 0 "register_operand" "=w,w,w")
         (vec_concat:<VDBL>
-	  (match_operand:VDC 2 "aarch64_simd_or_scalar_imm_zero")
-	  (match_operand:VDC 1 "nonimmediate_operand" "w,?r,m")))]
+	  (match_operand:VDCSIF 2 "aarch64_simd_or_scalar_imm_zero")
+	  (match_operand:VDCSIF 1 "nonimmediate_operand" "w,?r,m")))]
   "TARGET_SIMD && BYTES_BIG_ENDIAN"
   "@
-   mov\\t%0.8b, %1.8b
-   fmov\t%d0, %1
-   ldr\\t%d0, %1"
+   fmov\\t%<single_type>0, %<single_type>1
+   fmov\t%<single_type>0, %<single_wx>1
+   ldr\\t%<single_type>0, %1"
   [(set_attr "type" "neon_move<q>, neon_from_gp, neon_load1_1reg")
    (set_attr "arch" "simd,fp,simd")]
 )
@@ -4362,8 +4362,8 @@ (define_insn "*aarch64_combinez_be<mode>"
 (define_expand "@aarch64_vec_concat<mode>"
   [(set (match_operand:<VDBL> 0 "register_operand")
 	(vec_concat:<VDBL>
-	  (match_operand:VDC 1 "general_operand")
-	  (match_operand:VDC 2 "general_operand")))]
+	  (match_operand:VDCSIF 1 "general_operand")
+	  (match_operand:VDCSIF 2 "general_operand")))]
   "TARGET_SIMD"
 {
   int lo = BYTES_BIG_ENDIAN ? 2 : 1;
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index af42d1bedfe..7bb97bd48e4 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -9922,9 +9922,15 @@ aarch64_classify_address (struct aarch64_address_info *info,
   /* If we are dealing with ADDR_QUERY_LDP_STP_N that means the incoming mode
      corresponds to the actual size of the memory being loaded/stored and the
      mode of the corresponding addressing mode is half of that.  */
-  if (type == ADDR_QUERY_LDP_STP_N
-      && known_eq (GET_MODE_SIZE (mode), 16))
-    mode = DFmode;
+  if (type == ADDR_QUERY_LDP_STP_N)
+    {
+      if (known_eq (GET_MODE_SIZE (mode), 16))
+	mode = DFmode;
+      else if (known_eq (GET_MODE_SIZE (mode), 8))
+	mode = SFmode;
+      else
+	return false;
+    }
 
   bool allow_reg_index_p = (!load_store_pair_p
 			    && ((vec_flags == 0
@@ -11404,7 +11410,9 @@ aarch64_print_operand (FILE *f, rtx x, int code)
 	machine_mode mode = GET_MODE (x);
 
 	if (!MEM_P (x)
-	    || (code == 'y' && maybe_ne (GET_MODE_SIZE (mode), 16)))
+	    || (code == 'y'
+		&& maybe_ne (GET_MODE_SIZE (mode), 8)
+		&& maybe_ne (GET_MODE_SIZE (mode), 16)))
 	  {
 	    output_operand_lossage ("invalid operand for '%%%c'", code);
 	    return;
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index a0c02e4ac15..88067a3536a 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -236,6 +236,9 @@ (define_mode_iterator VQW [V16QI V8HI V4SI])
 ;; Double vector modes for combines.
 (define_mode_iterator VDC [V8QI V4HI V4BF V4HF V2SI V2SF DI DF])
 
+;; VDC plus SI and SF.
+(define_mode_iterator VDCSIF [V8QI V4HI V4BF V4HF V2SI V2SF SI SF DI DF])
+
 ;; Polynomial modes for vector combines.
 (define_mode_iterator VDC_P [V8QI V4HI DI])
 
@@ -1436,8 +1439,8 @@ (define_mode_attr Vhalf [(V8QI "v4qi")  (V16QI "v8qi")
 (define_mode_attr VDBL [(V8QI "V16QI") (V4HI "V8HI")
 			(V4HF "V8HF")  (V4BF "V8BF")
 			(V2SI "V4SI")  (V2SF "V4SF")
-			(SI   "V2SI")  (DI   "V2DI")
-			(DF   "V2DF")])
+			(SI   "V2SI")  (SF   "V2SF")
+			(DI   "V2DI")  (DF   "V2DF")])
 
 ;; Register suffix for double-length mode.
 (define_mode_attr Vdtype [(V4HF "8h") (V2SF "4s")])
@@ -1557,6 +1560,30 @@ (define_mode_attr Vhalftype [(V16QI "8b") (V8HI "4h")
 			     (V4SI "2s") (V8HF "4h")
 			     (V4SF "2s")])
 
+;; Whether a mode fits in W or X registers (i.e. "w" for 32-bit modes
+;; and "x" for 64-bit modes).
+(define_mode_attr single_wx [(SI   "w") (SF   "w")
+			     (V8QI "x") (V4HI "x")
+			     (V4HF "x") (V4BF "x")
+			     (V2SI "x") (V2SF "x")
+			     (DI   "x") (DF   "x")])
+
+;; Whether a mode fits in S or D registers (i.e. "s" for 32-bit modes
+;; and "d" for 64-bit modes).
+(define_mode_attr single_type [(SI   "s") (SF   "s")
+			       (V8QI "d") (V4HI "d")
+			       (V4HF "d") (V4BF "d")
+			       (V2SI "d") (V2SF "d")
+			       (DI   "d") (DF   "d")])
+
+;; Whether a double-width mode fits in D or Q registers (i.e. "d" for
+;; 32-bit modes and "q" for 64-bit modes).
+(define_mode_attr single_dtype [(SI   "d") (SF   "d")
+			        (V8QI "q") (V4HI "q")
+			        (V4HF "q") (V4BF "q")
+			        (V2SI "q") (V2SF "q")
+			        (DI   "q") (DF   "q")])
+
 ;; Define corresponding core/FP element mode for each vector mode.
 (define_mode_attr vw [(V8QI "w") (V16QI "w")
 		      (V4HI "w") (V8HI "w")
@@ -1849,6 +1876,13 @@ (define_mode_attr q [(V8QI "") (V16QI "_q")
 		     (V4x1DF "") (V4x2DF "_q")
 		     (V4x4BF "") (V4x8BF "_q")])
 
+;; Equivalent of the "q" attribute for the <VDBL> mode.
+(define_mode_attr dblq [(SI   "") (SF   "")
+		        (V8QI "_q") (V4HI "_q")
+		        (V4HF "_q") (V4BF "_q")
+		        (V2SI "_q") (V2SF "_q")
+		        (DI   "_q") (DF   "_q")])
+
 (define_mode_attr vp [(V8QI "v") (V16QI "v")
 		      (V4HI "v") (V8HI  "v")
 		      (V2SI "p") (V4SI  "v")
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-13.c b/gcc/testsuite/gcc.target/aarch64/vec-init-13.c
new file mode 100644
index 00000000000..d0f88cbe71a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-init-13.c
@@ -0,0 +1,123 @@
+/* { dg-do compile } */
+/* { dg-options "-O" } */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } } } */
+
+#include <arm_neon.h>
+
+/*
+** s64q_1:
+**	fmov	d0, x0
+**	ret
+*/
+int64x2_t s64q_1(int64_t a0) {
+  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
+    return (int64x2_t) { 0, a0 };
+  else
+    return (int64x2_t) { a0, 0 };
+}
+/*
+** s64q_2:
+**	ldr	d0, \[x0\]
+**	ret
+*/
+int64x2_t s64q_2(int64_t *ptr) {
+  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
+    return (int64x2_t) { 0, ptr[0] };
+  else
+    return (int64x2_t) { ptr[0], 0 };
+}
+/*
+** s64q_3:
+**	ldr	d0, \[x0, #?8\]
+**	ret
+*/
+int64x2_t s64q_3(int64_t *ptr) {
+  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
+    return (int64x2_t) { 0, ptr[1] };
+  else
+    return (int64x2_t) { ptr[1], 0 };
+}
+
+/*
+** f64q_1:
+**	fmov	d0, d0
+**	ret
+*/
+float64x2_t f64q_1(float64_t a0) {
+  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
+    return (float64x2_t) { 0, a0 };
+  else
+    return (float64x2_t) { a0, 0 };
+}
+/*
+** f64q_2:
+**	ldr	d0, \[x0\]
+**	ret
+*/
+float64x2_t f64q_2(float64_t *ptr) {
+  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
+    return (float64x2_t) { 0, ptr[0] };
+  else
+    return (float64x2_t) { ptr[0], 0 };
+}
+/*
+** f64q_3:
+**	ldr	d0, \[x0, #?8\]
+**	ret
+*/
+float64x2_t f64q_3(float64_t *ptr) {
+  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
+    return (float64x2_t) { 0, ptr[1] };
+  else
+    return (float64x2_t) { ptr[1], 0 };
+}
+
+/*
+** s32q_1:
+**	fmov	d0, d0
+**	ret
+*/
+int32x4_t s32q_1(int32x2_t a0, int32x2_t a1) {
+  return vcombine_s32 (a0, (int32x2_t) { 0, 0 });
+}
+/*
+** s32q_2:
+**	ldr	d0, \[x0\]
+**	ret
+*/
+int32x4_t s32q_2(int32x2_t *ptr) {
+  return vcombine_s32 (ptr[0], (int32x2_t) { 0, 0 });
+}
+/*
+** s32q_3:
+**	ldr	d0, \[x0, #?8\]
+**	ret
+*/
+int32x4_t s32q_3(int32x2_t *ptr) {
+  return vcombine_s32 (ptr[1], (int32x2_t) { 0, 0 });
+}
+
+/*
+** f32q_1:
+**	fmov	d0, d0
+**	ret
+*/
+float32x4_t f32q_1(float32x2_t a0, float32x2_t a1) {
+  return vcombine_f32 (a0, (float32x2_t) { 0, 0 });
+}
+/*
+** f32q_2:
+**	ldr	d0, \[x0\]
+**	ret
+*/
+float32x4_t f32q_2(float32x2_t *ptr) {
+  return vcombine_f32 (ptr[0], (float32x2_t) { 0, 0 });
+}
+/*
+** f32q_3:
+**	ldr	d0, \[x0, #?8\]
+**	ret
+*/
+float32x4_t f32q_3(float32x2_t *ptr) {
+  return vcombine_f32 (ptr[1], (float32x2_t) { 0, 0 });
+}
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-14.c b/gcc/testsuite/gcc.target/aarch64/vec-init-14.c
new file mode 100644
index 00000000000..02875088cd9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-init-14.c
@@ -0,0 +1,123 @@
+/* { dg-do compile } */
+/* { dg-options "-O" } */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } } } */
+
+#include <arm_neon.h>
+
+void ext();
+
+/*
+** s32_1:
+**	fmov	s0, w0
+**	ins	v0\.s\[1\], w1
+**	ret
+*/
+int32x2_t s32_1(int32_t a0, int32_t a1) {
+  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
+    return (int32x2_t) { a1, a0 };
+  else
+    return (int32x2_t) { a0, a1 };
+}
+/*
+** s32_2:
+**	fmov	s0, w0
+**	ld1	{v0\.s}\[1\], \[x1\]
+**	ret
+*/
+int32x2_t s32_2(int32_t a0, int32_t *ptr) {
+  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
+    return (int32x2_t) { ptr[0], a0 };
+  else
+    return (int32x2_t) { a0, ptr[0] };
+}
+/*
+** s32_3:
+**	ldr	s0, \[x0\]
+**	ins	v0\.s\[1\], w1
+**	ret
+*/
+int32x2_t s32_3(int32_t *ptr, int32_t a1) {
+  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
+    return (int32x2_t) { a1, ptr[0] };
+  else
+    return (int32x2_t) { ptr[0], a1 };
+}
+/*
+** s32_4:
+**	stp	w1, w2, \[x0\]
+**	ret
+*/
+void s32_4(int32x2_t *res, int32_t a0, int32_t a1) {
+  res[0] = (int32x2_t) { a0, a1 };
+}
+/*
+** s32_5:
+**	stp	w1, w2, \[x0, #?4\]
+**	ret
+*/
+void s32_5(uintptr_t res, int32_t a0, int32_t a1) {
+  *(int32x2_t *)(res + 4) = (int32x2_t) { a0, a1 };
+}
+/* Currently uses d8 to hold res across the call.  */
+int32x2_t s32_6(int32_t a0, int32_t a1) {
+  int32x2_t res = { a0, a1 };
+  ext ();
+  return res;
+}
+
+/*
+** f32_1:
+**	ins	v0\.s\[1\], v1\.s\[0\]
+**	ret
+*/
+float32x2_t f32_1(float32_t a0, float32_t a1) {
+  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
+    return (float32x2_t) { a1, a0 };
+  else
+    return (float32x2_t) { a0, a1 };
+}
+/*
+** f32_2:
+**	ld1	{v0\.s}\[1\], \[x0\]
+**	ret
+*/
+float32x2_t f32_2(float32_t a0, float32_t *ptr) {
+  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
+    return (float32x2_t) { ptr[0], a0 };
+  else
+    return (float32x2_t) { a0, ptr[0] };
+}
+/*
+** f32_3:
+**	ldr	s0, \[x0\]
+**	ins	v0\.s\[1\], v1\.s\[0\]
+**	ret
+*/
+float32x2_t f32_3(float32_t a0, float32_t a1, float32_t *ptr) {
+  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
+    return (float32x2_t) { a1, ptr[0] };
+  else
+    return (float32x2_t) { ptr[0], a1 };
+}
+/*
+** f32_4:
+**	stp	s0, s1, \[x0\]
+**	ret
+*/
+void f32_4(float32x2_t *res, float32_t a0, float32_t a1) {
+  res[0] = (float32x2_t) { a0, a1 };
+}
+/*
+** f32_5:
+**	stp	s0, s1, \[x0, #?4\]
+**	ret
+*/
+void f32_5(uintptr_t res, float32_t a0, float32_t a1) {
+  *(float32x2_t *)(res + 4) = (float32x2_t) { a0, a1 };
+}
+/* Currently uses d8 to hold res across the call.  */
+float32x2_t f32_6(float32_t a0, float32_t a1) {
+  float32x2_t res = { a0, a1 };
+  ext ();
+  return res;
+}
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-15.c b/gcc/testsuite/gcc.target/aarch64/vec-init-15.c
new file mode 100644
index 00000000000..82f0a8f55ee
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-init-15.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O" } */
+
+#include <arm_neon.h>
+
+int32x2_t f1(int32_t *x, int c) {
+  return c ? (int32x2_t) { x[0], x[2] } : (int32x2_t) { 0, 0 };
+}
+
+int32x2_t f2(int32_t *x, int i0, int i1, int c) {
+  return c ? (int32x2_t) { x[i0], x[i1] } : (int32x2_t) { 0, 0 };
+}
+
+/* { dg-final { scan-assembler-times {\t(?:ldr\ts[0-9]+|ld1\t)} 4 } } */
+/* { dg-final { scan-assembler-not {\tldr\tw} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-16.c b/gcc/testsuite/gcc.target/aarch64/vec-init-16.c
new file mode 100644
index 00000000000..e00aec7a32c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-init-16.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-O" } */
+
+#include <arm_neon.h>
+
+void f1(int32x2_t *res, int32_t *x, int c0, int c1) {
+  res[0] = (int32x2_t) { c0 ? x[0] : 0, c1 ? x[2] : 0 };
+}
+
+/* { dg-final { scan-assembler-times {\tldr\tw[0-9]+} 2 } } */
+/* { dg-final { scan-assembler {\tstp\tw[0-9]+, w[0-9]+} } } */
+/* { dg-final { scan-assembler-not {\tldr\ts} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-17.c b/gcc/testsuite/gcc.target/aarch64/vec-init-17.c
new file mode 100644
index 00000000000..86191b3ca1d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-init-17.c
@@ -0,0 +1,73 @@
+/* { dg-do compile } */
+/* { dg-options "-O" } */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } } } */
+
+#include <arm_neon.h>
+
+/*
+** s32_1:
+**	fmov	s0, w0
+**	ret
+*/
+int32x2_t s32_1(int32_t a0) {
+  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
+    return (int32x2_t) { 0, a0 };
+  else
+    return (int32x2_t) { a0, 0 };
+}
+/*
+** s32_2:
+**	ldr	s0, \[x0\]
+**	ret
+*/
+int32x2_t s32_2(int32_t *ptr) {
+  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
+    return (int32x2_t) { 0, ptr[0] };
+  else
+    return (int32x2_t) { ptr[0], 0 };
+}
+/*
+** s32_3:
+**	ldr	s0, \[x0, #?4\]
+**	ret
+*/
+int32x2_t s32_3(int32_t *ptr) {
+  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
+    return (int32x2_t) { 0, ptr[1] };
+  else
+    return (int32x2_t) { ptr[1], 0 };
+}
+
+/*
+** f32_1:
+**	fmov	s0, s0
+**	ret
+*/
+float32x2_t f32_1(float32_t a0) {
+  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
+    return (float32x2_t) { 0, a0 };
+  else
+    return (float32x2_t) { a0, 0 };
+}
+/*
+** f32_2:
+**	ldr	s0, \[x0\]
+**	ret
+*/
+float32x2_t f32_2(float32_t *ptr) {
+  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
+    return (float32x2_t) { 0, ptr[0] };
+  else
+    return (float32x2_t) { ptr[0], 0 };
+}
+/*
+** f32_3:
+**	ldr	s0, \[x0, #?4\]
+**	ret
+*/
+float32x2_t f32_3(float32_t *ptr) {
+  if (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
+    return (float32x2_t) { 0, ptr[1] };
+  else
+    return (float32x2_t) { ptr[1], 0 };
+}
-- 
2.25.1


^ permalink raw reply	[flat|nested] 9+ messages in thread
