public inbox for gcc-patches@gcc.gnu.org
* [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs
@ 2022-10-31 11:56 Tamar Christina
  2022-10-31 11:57 ` [PATCH 2/8]middle-end: Recognize scalar widening reductions Tamar Christina
                   ` (8 more replies)
  0 siblings, 9 replies; 50+ messages in thread
From: Tamar Christina @ 2022-10-31 11:56 UTC (permalink / raw)
  To: gcc-patches; +Cc: nd, rguenther, jeffreyalaw

[-- Attachment #1: Type: text/plain, Size: 8213 bytes --]

Hi All,

This patch series adds recognition of pairwise operations (reductions) in
match.pd so that we can benefit from them even at -O1, when the vectorizer
isn't enabled.

The use of these allows for much simpler codegen on AArch64 and lets us
avoid quite a lot of codegen warts.

As an example, a simple:

typedef float v4sf __attribute__((vector_size (16)));

float
foo3 (v4sf x)
{
  return x[1] + x[2];
}

currently generates:

foo3:
        dup     s1, v0.s[1]
        dup     s0, v0.s[2]
        fadd    s0, s1, s0
        ret

while with this patch series now generates:

foo3:
	ext	v0.16b, v0.16b, v0.16b, #4
	faddp	s0, v0.2s
	ret

This patch does not perform the transformation if the source is not a gimple
register; memory sources are left to the vectorizer, which is able to deal
correctly with clobbers.

The use of these instructions makes a significant difference in codegen quality
for AArch64 and Arm.
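
The conditional selection form is handled too.  As an illustrative sketch
(my own example, not one of the testcases in the series):

typedef int v4si __attribute__((vector_size (16)));

int
foo_min (v4si x)
{
  /* x[1] and x[2] are adjacent lanes of the same vector, so the new
     patterns can rewrite the selection into IFN_REDUC_MIN on a
     two-element BIT_FIELD_REF, which AArch64 can emit as a single
     sminp.  */
  return x[1] < x[2] ? x[1] : x[2];
}

Whether the pairwise instruction is actually emitted of course depends on the
target providing the corresponding reduc_* optab for the two-element mode.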

NOTE: The last entry in the series contains the tests for all of the previous
patches, as it's a bit of an all-or-nothing thing.

Bootstrapped and regtested on aarch64-none-linux-gnu and x86_64-pc-linux-gnu
with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* match.pd (adjacent_data_access_p): Import.
	Add new patterns for pairwise plus, min, max, fmax, fmin.
	* tree-cfg.cc (verify_gimple_call): Allow function arguments in IFNs.
	* tree.cc (adjacent_data_access_p): New.
	* tree.h (adjacent_data_access_p): New.

--- inline copy of patch -- 
diff --git a/gcc/match.pd b/gcc/match.pd
index 2617d56091dfbd41ae49f980ee0af3757f5ec1cf..aecaa3520b36e770d11ea9a10eb18db23c0cd9f7 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -39,7 +39,8 @@ along with GCC; see the file COPYING3.  If not see
    HONOR_NANS
    uniform_vector_p
    expand_vec_cmp_expr_p
-   bitmask_inv_cst_vector_p)
+   bitmask_inv_cst_vector_p
+   adjacent_data_access_p)
 
 /* Operator lists.  */
 (define_operator_list tcc_comparison
@@ -7195,6 +7196,47 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 
 /* Canonicalizations of BIT_FIELD_REFs.  */
 
+/* Canonicalize BIT_FIELD_REFs to pairwise operations.  */
+(for op (plus min max FMIN_ALL FMAX_ALL)
+     ifn (IFN_REDUC_PLUS IFN_REDUC_MIN IFN_REDUC_MAX
+	  IFN_REDUC_FMIN IFN_REDUC_FMAX)
+ (simplify
+  (op @0 @1)
+   (if (INTEGRAL_TYPE_P (type) || SCALAR_FLOAT_TYPE_P (type))
+    (with { poly_uint64 nloc = 0;
+	    tree src = adjacent_data_access_p (@0, @1, &nloc, true);
+	    tree ntype = build_vector_type (type, 2);
+	    tree size = TYPE_SIZE (ntype);
+	    tree pos = build_int_cst (TREE_TYPE (size), nloc);
+	    poly_uint64 _sz;
+	    poly_uint64 _total; }
+     (if (src && is_gimple_reg (src) && ntype
+	  && poly_int_tree_p (size, &_sz)
+	  && poly_int_tree_p (TYPE_SIZE (TREE_TYPE (src)), &_total)
+	  && known_ge (_total, _sz + nloc))
+      (ifn (BIT_FIELD_REF:ntype { src; } { size; } { pos; })))))))
+
+(for op (lt gt)
+     ifni (IFN_REDUC_MIN IFN_REDUC_MAX)
+     ifnf (IFN_REDUC_FMIN IFN_REDUC_FMAX)
+ (simplify
+  (cond (op @0 @1) @0 @1)
+   (if (INTEGRAL_TYPE_P (type) || SCALAR_FLOAT_TYPE_P (type))
+    (with { poly_uint64 nloc = 0;
+	    tree src = adjacent_data_access_p (@0, @1, &nloc, false);
+	    tree ntype = build_vector_type (type, 2);
+	    tree size = TYPE_SIZE (ntype);
+	    tree pos = build_int_cst (TREE_TYPE (size), nloc);
+	    poly_uint64 _sz;
+	    poly_uint64 _total; }
+     (if (src && is_gimple_reg (src) && ntype
+	  && poly_int_tree_p (size, &_sz)
+	  && poly_int_tree_p (TYPE_SIZE (TREE_TYPE (src)), &_total)
+	  && known_ge (_total, _sz + nloc))
+      (if (SCALAR_FLOAT_MODE_P (TYPE_MODE (type)))
+       (ifnf (BIT_FIELD_REF:ntype { src; } { size; } { pos; }))
+       (ifni (BIT_FIELD_REF:ntype { src; } { size; } { pos; }))))))))
+
 (simplify
  (BIT_FIELD_REF (BIT_FIELD_REF @0 @1 @2) @3 @4)
  (BIT_FIELD_REF @0 @3 { const_binop (PLUS_EXPR, bitsizetype, @2, @4); }))
diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc
index 91ec33c80a41e1e0cc6224e137dd42144724a168..b19710392940cf469de52d006603ae1e3deb6b76 100644
--- a/gcc/tree-cfg.cc
+++ b/gcc/tree-cfg.cc
@@ -3492,6 +3492,7 @@ verify_gimple_call (gcall *stmt)
     {
       tree arg = gimple_call_arg (stmt, i);
       if ((is_gimple_reg_type (TREE_TYPE (arg))
+	   && !is_gimple_variable (arg)
 	   && !is_gimple_val (arg))
 	  || (!is_gimple_reg_type (TREE_TYPE (arg))
 	      && !is_gimple_lvalue (arg)))
diff --git a/gcc/tree.h b/gcc/tree.h
index e6564aaccb7b69cd938ff60b6121aec41b7e8a59..8f8a9660c9e0605eb516de194640b8c1b531b798 100644
--- a/gcc/tree.h
+++ b/gcc/tree.h
@@ -5006,6 +5006,11 @@ extern bool integer_pow2p (const_tree);
 
 extern tree bitmask_inv_cst_vector_p (tree);
 
+/* Return the base of the access if the two operands represent adjacent
+   accesses of data such that a pairwise operation can be used, else NULL.  */
+
+extern tree adjacent_data_access_p (tree, tree, poly_uint64*, bool);
+
 /* integer_nonzerop (tree x) is nonzero if X is an integer constant
    with a nonzero value.  */
 
diff --git a/gcc/tree.cc b/gcc/tree.cc
index 007c9325b17076f474e6681c49966c59cf6b91c7..5315af38a1ead89ca5f75dc4b19de9841e29d311 100644
--- a/gcc/tree.cc
+++ b/gcc/tree.cc
@@ -10457,6 +10457,90 @@ bitmask_inv_cst_vector_p (tree t)
   return builder.build ();
 }
 
+/* Return the base of the access if OP1 and OP2 represent adjacent accesses
+   of data such that a pairwise operation can be used, otherwise return NULL.
+   OP1 must refer to the lower subpart and OP2 to the subpart directly above
+   it.  If POS is not NULL and a value is returned, *POS is set to the bit
+   position of the lower access.  If COMMUTATIVE_P, the check is also tried
+   with OP1 and OP2 swapped.  */
+
+tree adjacent_data_access_p (tree op1, tree op2, poly_uint64 *pos,
+			     bool commutative_p)
+{
+  gcc_assert (op1);
+  gcc_assert (op2);
+  if (TREE_CODE (op1) != TREE_CODE (op2)
+      || TREE_TYPE (op1) != TREE_TYPE (op2))
+    return NULL;
+
+  tree type = TREE_TYPE (op1);
+  gimple *stmt1 = NULL, *stmt2 = NULL;
+  unsigned int bits = GET_MODE_BITSIZE (GET_MODE_INNER (TYPE_MODE (type)));
+
+  if (TREE_CODE (op1) == BIT_FIELD_REF
+      && operand_equal_p (TREE_OPERAND (op1, 0), TREE_OPERAND (op2, 0), 0)
+      && operand_equal_p (TREE_OPERAND (op1, 1), TREE_OPERAND (op2, 1), 0)
+      && known_eq (bit_field_size (op1), bits))
+    {
+      poly_uint64 offset1 = bit_field_offset (op1);
+      poly_uint64 offset2 = bit_field_offset (op2);
+      if (known_eq (offset2 - offset1, bits))
+	{
+	  if (pos)
+	    *pos = offset1;
+	  return TREE_OPERAND (op1, 0);
+	}
+      else if (commutative_p && known_eq (offset1 - offset2, bits))
+	{
+	  if (pos)
+	    *pos = offset2;
+	  return TREE_OPERAND (op1, 0);
+	}
+    }
+  else if (TREE_CODE (op1) == ARRAY_REF
+	   && operand_equal_p (get_base_address (op1), get_base_address (op2)))
+    {
+      wide_int size1 = wi::to_wide (array_ref_element_size (op1));
+      wide_int size2 = wi::to_wide (array_ref_element_size (op2));
+      if (wi::ne_p (size1, size2) || wi::ne_p (size1, bits / 8)
+	  || !tree_fits_poly_uint64_p (TREE_OPERAND (op1, 1))
+	  || !tree_fits_poly_uint64_p (TREE_OPERAND (op2, 1)))
+	return NULL;
+
+      poly_uint64 offset1 = tree_to_poly_uint64 (TREE_OPERAND (op1, 1));
+      poly_uint64 offset2 = tree_to_poly_uint64 (TREE_OPERAND (op2, 1));
+      if (known_eq (offset2 - offset1, 1UL))
+	{
+	  if (pos)
+	    *pos = offset1 * bits;
+	  return TREE_OPERAND (op1, 0);
+	}
+      else if (commutative_p && known_eq (offset1 - offset2, 1UL))
+	{
+	  if (pos)
+	    *pos = offset2 * bits;
+	  return TREE_OPERAND (op1, 0);
+	}
+    }
+  else if (TREE_CODE (op1) == SSA_NAME
+	   && (stmt1 = SSA_NAME_DEF_STMT (op1)) != NULL
+	   && (stmt2 = SSA_NAME_DEF_STMT (op2)) != NULL
+	   && is_gimple_assign (stmt1)
+	   && is_gimple_assign (stmt2))
+    {
+      if (gimple_assign_rhs_code (stmt1) != ARRAY_REF
+	  && gimple_assign_rhs_code (stmt1) != BIT_FIELD_REF
+	  && gimple_assign_rhs_code (stmt2) != ARRAY_REF
+	  && gimple_assign_rhs_code (stmt2) != BIT_FIELD_REF)
+	return NULL;
+
+      return adjacent_data_access_p (gimple_assign_rhs1 (stmt1),
+				     gimple_assign_rhs1 (stmt2), pos,
+				     commutative_p);
+    }
+
+  return NULL;
+}
+
 /* If VECTOR_CST T has a single nonzero element, return the index of that
    element, otherwise return -1.  */
 




-- 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 2/8]middle-end: Recognize scalar widening reductions
  2022-10-31 11:56 [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs Tamar Christina
@ 2022-10-31 11:57 ` Tamar Christina
  2022-10-31 21:42   ` Jeff Law
  2022-11-07 13:21   ` Richard Biener
  2022-10-31 11:57 ` [PATCH 3/8]middle-end: Support extractions of subvectors from arbitrary element position inside a vector Tamar Christina
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 50+ messages in thread
From: Tamar Christina @ 2022-10-31 11:57 UTC (permalink / raw)
  To: gcc-patches; +Cc: nd, rguenther, jeffreyalaw

[-- Attachment #1: Type: text/plain, Size: 4321 bytes --]

Hi All,

This adds new optabs and an IFN for REDUC_PLUS_WIDEN, where the resulting
scalar reduction has twice the precision of the input elements.

At some point in a later patch I will also teach the vectorizer to recognize
this IFN once I figure out how the various bits of reductions work.

For now it's generated only by the match.pd pattern.
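
As a scalar reference for the intended semantics (my own sketch with a made-up
helper name; the md.texi text in the patch is what's authoritative), the
signed variant for 16-bit elements behaves like:

#include <stdint.h>

/* Sign-extend each element to twice its precision and accumulate in the
   wider type, so the accumulation cannot wrap in the narrow element type.  */
int32_t
reduc_splus_widen_ref (const int16_t *x, int n)
{
  int32_t sum = 0;
  for (int i = 0; i < n; i++)
    sum += x[i];
  return sum;
}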

Bootstrapped and regtested on aarch64-none-linux-gnu and x86_64-pc-linux-gnu
with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* internal-fn.def (REDUC_PLUS_WIDEN): New.
	* doc/md.texi: Document it.
	* match.pd: Recognize widening plus.
	* optabs.def (reduc_splus_widen_scal_optab,
	reduc_uplus_widen_scal_optab): New.

--- inline copy of patch -- 
diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 34825549ed4e315b07d36dc3d63bae0cc0a3932d..c08691ab4c9a4bfe55ae81e5e228a414d6242d78 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5284,6 +5284,20 @@ Compute the sum of the elements of a vector. The vector is operand 1, and
 operand 0 is the scalar result, with mode equal to the mode of the elements of
 the input vector.
 
+@cindex @code{reduc_uplus_widen_scal_@var{m}} instruction pattern
+@item @samp{reduc_uplus_widen_scal_@var{m}}
+Compute the sum of the elements of a vector, zero-extending the elements of
+@var{m} to a mode that has twice their precision.  The vector is operand 1,
+and operand 0 is the scalar result, whose mode has twice the precision of
+the element mode of the input vector.
+
+@cindex @code{reduc_splus_widen_scal_@var{m}} instruction pattern
+@item @samp{reduc_splus_widen_scal_@var{m}}
+Compute the sum of the elements of a vector, sign-extending the elements of
+@var{m} to a mode that has twice their precision.  The vector is operand 1,
+and operand 0 is the scalar result, whose mode has twice the precision of
+the element mode of the input vector.
+
 @cindex @code{reduc_and_scal_@var{m}} instruction pattern
 @item @samp{reduc_and_scal_@var{m}}
 @cindex @code{reduc_ior_scal_@var{m}} instruction pattern
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 5e672183f4def9d0cdc29cf12fe17e8cff928f9f..f64a8421b1087b6c0f3602dc556876b0fd15c7ad 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -215,6 +215,9 @@ DEF_INTERNAL_OPTAB_FN (RSQRT, ECF_CONST, rsqrt, unary)
 
 DEF_INTERNAL_OPTAB_FN (REDUC_PLUS, ECF_CONST | ECF_NOTHROW,
 		       reduc_plus_scal, unary)
+DEF_INTERNAL_SIGNED_OPTAB_FN (REDUC_PLUS_WIDEN, ECF_CONST | ECF_NOTHROW,
+			      first, reduc_splus_widen_scal,
+			      reduc_uplus_widen_scal, unary)
 DEF_INTERNAL_SIGNED_OPTAB_FN (REDUC_MAX, ECF_CONST | ECF_NOTHROW, first,
 			      reduc_smax_scal, reduc_umax_scal, unary)
 DEF_INTERNAL_SIGNED_OPTAB_FN (REDUC_MIN, ECF_CONST | ECF_NOTHROW, first,
diff --git a/gcc/match.pd b/gcc/match.pd
index aecaa3520b36e770d11ea9a10eb18db23c0cd9f7..1d407414bee278c64c00d425d9f025c1c58d853d 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -7237,6 +7237,14 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
        (ifnf (BIT_FIELD_REF:ntype { src; } { size; } { pos; }))
        (ifni (BIT_FIELD_REF:ntype { src; } { size; } { pos; }))))))))
 
+/* Widening reduction conversions.  */
+(simplify
+ (convert (IFN_REDUC_PLUS @0))
+ (if (element_precision (TREE_TYPE (@0)) * 2 == element_precision (type)
+      && TYPE_UNSIGNED (type) == TYPE_UNSIGNED (TREE_TYPE (@0))
+      && ANY_INTEGRAL_TYPE_P (type) && ANY_INTEGRAL_TYPE_P (TREE_TYPE (@0)))
+  (IFN_REDUC_PLUS_WIDEN @0)))
+
 (simplify
  (BIT_FIELD_REF (BIT_FIELD_REF @0 @1 @2) @3 @4)
  (BIT_FIELD_REF @0 @3 { const_binop (PLUS_EXPR, bitsizetype, @2, @4); }))
diff --git a/gcc/optabs.def b/gcc/optabs.def
index a6db2342bed6baf13ecbd84112c8432c6972e6fe..9947aed67fb8a3b675cb0aab9aeb059f89644106 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -346,6 +346,8 @@ OPTAB_D (reduc_fmin_scal_optab, "reduc_fmin_scal_$a")
 OPTAB_D (reduc_smax_scal_optab, "reduc_smax_scal_$a")
 OPTAB_D (reduc_smin_scal_optab, "reduc_smin_scal_$a")
 OPTAB_D (reduc_plus_scal_optab, "reduc_plus_scal_$a")
+OPTAB_D (reduc_splus_widen_scal_optab, "reduc_splus_widen_scal_$a")
+OPTAB_D (reduc_uplus_widen_scal_optab, "reduc_uplus_widen_scal_$a")
 OPTAB_D (reduc_umax_scal_optab, "reduc_umax_scal_$a")
 OPTAB_D (reduc_umin_scal_optab, "reduc_umin_scal_$a")
 OPTAB_D (reduc_and_scal_optab,  "reduc_and_scal_$a")




-- 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 3/8]middle-end: Support extractions of subvectors from arbitrary element position inside a vector
  2022-10-31 11:56 [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs Tamar Christina
  2022-10-31 11:57 ` [PATCH 2/8]middle-end: Recognize scalar widening reductions Tamar Christina
@ 2022-10-31 11:57 ` Tamar Christina
  2022-10-31 21:44   ` Jeff Law
  2022-11-01 14:25   ` Richard Sandiford
  2022-10-31 11:58 ` [PATCH 4/8]AArch64 aarch64: Implement widening reduction patterns Tamar Christina
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 50+ messages in thread
From: Tamar Christina @ 2022-10-31 11:57 UTC (permalink / raw)
  To: gcc-patches; +Cc: nd, rguenther, jeffreyalaw

[-- Attachment #1: Type: text/plain, Size: 4657 bytes --]

Hi All,

The current vector extract pattern can only extract from a vector when the
position to extract is a multiple of the bitsize of the vector as a whole.

That means extracting something like a V2SI from a V4SI vector at position 32
isn't possible, as 32 is not a multiple of 64.  Ideally this optab would have
worked on multiples of the element size, but too many targets rely on the
current semantics now.

So instead add a new case which allows any extraction as long as the bit
position is a multiple of the element size.  We use a VEC_PERM to shuffle the
wanted elements into the bottom part of the vector and then use a subreg to
extract the values out.  This now allows various vector operations that were
previously being decomposed into very inefficient scalar operations to stay in
vector form.
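
The selector is simply a rotation that brings the first wanted element down to
lane 0.  A sketch of the index computation (my own illustration with made-up
names; the real code fills in a vec_perm_builder):

/* For an extraction starting at element POS of a vector with NELTS
   elements, lane I of the permuted vector takes input element
   (I + POS) % NELTS.  E.g. extracting a V2SI at bit 32 from a V4SI gives
   POS = 1 and the selector { 1, 2, 3, 0 }, and the V2SI subreg of the
   permuted vector then holds elements 1 and 2.  */
static void
build_rotate_sel (int nelts, int pos, int *sel)
{
  for (int i = 0; i < nelts; ++i)
    sel[i] = (i + pos) % nelts;
}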

NOTE: I added 3 testcases, but only the 3rd one is fixed by this patch.

The 1st one is missed because we don't optimize VEC_PERM expressions into
bitfields.  The 2nd one is missed because extract_bit_field only works on
vector modes; in this case the intermediate extract is DImode.

On targets where the scalar mode is tieable to vector modes the extract
should work fine.

However, I ran out of time to fix the first two and so will do so in GCC 14.
For now this catches the case that my new pattern introduces more easily.

Bootstrapped and regtested on aarch64-none-linux-gnu and x86_64-pc-linux-gnu
with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* expmed.cc (extract_bit_field_1): Add support for vector element
	extracts.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/ext_1.c: New.

--- inline copy of patch -- 
diff --git a/gcc/expmed.cc b/gcc/expmed.cc
index bab020c07222afa38305ef8d7333f271b1965b78..ffdf65210d17580a216477cfe4ac1598941ac9e4 100644
--- a/gcc/expmed.cc
+++ b/gcc/expmed.cc
@@ -1718,6 +1718,45 @@ extract_bit_field_1 (rtx str_rtx, poly_uint64 bitsize, poly_uint64 bitnum,
 	      return target;
 	    }
 	}
+      else if (!known_eq (bitnum, 0U)
+	       && multiple_p (GET_MODE_UNIT_BITSIZE (tmode), bitnum, &pos))
+	{
+	  /* Rotate the wanted elements down to the start of the vector.  */
+	  poly_uint64 nunits = GET_MODE_NUNITS (new_mode);
+	  int nelts = nunits.to_constant ();
+	  vec_perm_builder sel (nunits, nelts, 1);
+	  int delta = -pos.to_constant ();
+	  for (int i = 0; i < nelts; ++i)
+	    sel.quick_push ((i - delta) % nelts);
+	  vec_perm_indices indices (sel, 1, nunits);
+
+	  if (can_vec_perm_const_p (new_mode, new_mode, indices, false))
+	    {
+	      class expand_operand ops[4];
+	      machine_mode outermode = new_mode;
+	      machine_mode innermode = tmode;
+	      enum insn_code icode
+		= direct_optab_handler (vec_perm_optab, outermode);
+	      target = gen_reg_rtx (outermode);
+	      if (icode != CODE_FOR_nothing)
+		{
+		  rtx sel = vec_perm_indices_to_rtx (outermode, indices);
+		  create_output_operand (&ops[0], target, outermode);
+		  ops[0].target = 1;
+		  create_input_operand (&ops[1], op0, outermode);
+		  create_input_operand (&ops[2], op0, outermode);
+		  create_input_operand (&ops[3], sel, outermode);
+		  if (maybe_expand_insn (icode, 4, ops))
+		    return simplify_gen_subreg (innermode, target, outermode, 0);
+		}
+	      else if (targetm.vectorize.vec_perm_const != NULL)
+		{
+		  if (targetm.vectorize.vec_perm_const (outermode, outermode,
+							target, op0, op0, indices))
+		    return simplify_gen_subreg (innermode, target, outermode, 0);
+		}
+	    }
+	}
     }
 
   /* See if we can get a better vector mode before extracting.  */
diff --git a/gcc/testsuite/gcc.target/aarch64/ext_1.c b/gcc/testsuite/gcc.target/aarch64/ext_1.c
new file mode 100644
index 0000000000000000000000000000000000000000..18a10a14f1161584267a8472e571b3bc2ddf887a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ext_1.c
@@ -0,0 +1,54 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O" } */
+/* { dg-final { check-function-bodies "**" "" "" } } */
+
+#include <string.h>
+
+typedef unsigned int v4si __attribute__((vector_size (16)));
+typedef unsigned int v2si __attribute__((vector_size (8)));
+
+/*
+** extract: { xfail *-*-* }
+**	ext	v0.16b, v0.16b, v0.16b, #4
+**	ret
+*/
+v2si extract (v4si x)
+{
+    v2si res = {x[1], x[2]};
+    return res;
+}
+
+/*
+** extract1: { xfail *-*-* }
+**	ext	v0.16b, v0.16b, v0.16b, #4
+**	ret
+*/
+v2si extract1 (v4si x)
+{
+    v2si res;
+    memcpy (&res, ((int*)&x)+1, sizeof(res));
+    return res;
+}
+
+typedef struct cast {
+  int a;
+  v2si b __attribute__((packed));
+} cast_t;
+
+typedef union Data {
+   v4si x;
+   cast_t y;
+} data;  
+
+/*
+** extract2:
+**	ext	v0.16b, v0.16b, v0.16b, #4
+**	ret
+*/
+v2si extract2 (v4si x)
+{
+    data d;
+    d.x = x;
+    return d.y.b;
+}
+




-- 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 4/8]AArch64 aarch64: Implement widening reduction patterns
  2022-10-31 11:56 [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs Tamar Christina
  2022-10-31 11:57 ` [PATCH 2/8]middle-end: Recognize scalar widening reductions Tamar Christina
  2022-10-31 11:57 ` [PATCH 3/8]middle-end: Support extractions of subvectors from arbitrary element position inside a vector Tamar Christina
@ 2022-10-31 11:58 ` Tamar Christina
  2022-11-01 14:41   ` Richard Sandiford
  2022-10-31 11:58 ` [PATCH 5/8]AArch64 aarch64: Make existing V2HF be usable Tamar Christina
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 50+ messages in thread
From: Tamar Christina @ 2022-10-31 11:58 UTC (permalink / raw)
  To: gcc-patches
  Cc: nd, Richard.Earnshaw, Marcus.Shawcroft, Kyrylo.Tkachov,
	richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 5664 bytes --]

Hi All,

This implements the new widening reduction optabs in the AArch64 backend.
Instead of introducing a duplicate definition for the same thing, I have
renamed the intrinsics definitions to use the new optab names.
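
The user-visible intrinsics are unchanged; only the pattern names are.  A
small usage sketch (my own example, not part of the patch):

#include <arm_neon.h>

/* vaddlv_u8 sums eight 8-bit lanes into a 16-bit scalar (UADDLV); the
   define_insn behind it now also serves as reduc_uplus_widen_scal_v8qi
   for the generic optab added in patch 2/8.  */
uint16_t
sum_lanes (uint8x8_t x)
{
  return vaddlv_u8 (x);
}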

Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* config/aarch64/aarch64-simd-builtins.def (saddlv, uaddlv): Rename to
	reduc_splus_widen_scal_ and reduc_uplus_widen_scal_ respectively.
	* config/aarch64/aarch64-simd.md (aarch64_<su>addlv<mode>): Renamed to
	...
	(reduc_<su>plus_widen_scal_<mode>): ... This.
	* config/aarch64/arm_neon.h (vaddlv_s8, vaddlv_s16, vaddlv_u8,
	vaddlv_u16, vaddlvq_s8, vaddlvq_s16, vaddlvq_s32, vaddlvq_u8,
	vaddlvq_u16, vaddlvq_u32, vaddlv_s32, vaddlv_u32): Use it.

--- inline copy of patch -- 
diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def b/gcc/config/aarch64/aarch64-simd-builtins.def
index cf46b31627b84476a25762ffc708fd84a4086e43..a4b21e1495c5699d8557a4bcb9e73ef98ae60b35 100644
--- a/gcc/config/aarch64/aarch64-simd-builtins.def
+++ b/gcc/config/aarch64/aarch64-simd-builtins.def
@@ -190,9 +190,9 @@
   BUILTIN_VDQV_L (UNOP, saddlp, 0, NONE)
   BUILTIN_VDQV_L (UNOPU, uaddlp, 0, NONE)
 
-  /* Implemented by aarch64_<su>addlv<mode>.  */
-  BUILTIN_VDQV_L (UNOP, saddlv, 0, NONE)
-  BUILTIN_VDQV_L (UNOPU, uaddlv, 0, NONE)
+  /* Implemented by reduc_<su>plus_widen_scal_<mode>.  */
+  BUILTIN_VDQV_L (UNOP, reduc_splus_widen_scal_, 10, NONE)
+  BUILTIN_VDQV_L (UNOPU, reduc_uplus_widen_scal_, 10, NONE)
 
   /* Implemented by aarch64_<su>abd<mode>.  */
   BUILTIN_VDQ_BHSI (BINOP, sabd, 0, NONE)
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index cf8c094bd4b76981cef2dd5dd7b8e6be0d56101f..25aed74f8cf939562ed65a578fe32ca76605b58a 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -3455,7 +3455,7 @@ (define_expand "reduc_plus_scal_v4sf"
   DONE;
 })
 
-(define_insn "aarch64_<su>addlv<mode>"
+(define_insn "reduc_<su>plus_widen_scal_<mode>"
  [(set (match_operand:<VWIDE_S> 0 "register_operand" "=w")
        (unspec:<VWIDE_S> [(match_operand:VDQV_L 1 "register_operand" "w")]
 		    USADDLV))]
diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index cf6af728ca99dae1cb6ab647466cfec32f7e913e..7b2c4c016191bcd6c3e075d27810faedb23854b7 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -3664,70 +3664,70 @@ __extension__ extern __inline int16_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vaddlv_s8 (int8x8_t __a)
 {
-  return __builtin_aarch64_saddlvv8qi (__a);
+  return __builtin_aarch64_reduc_splus_widen_scal_v8qi (__a);
 }
 
 __extension__ extern __inline int32_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vaddlv_s16 (int16x4_t __a)
 {
-  return __builtin_aarch64_saddlvv4hi (__a);
+  return __builtin_aarch64_reduc_splus_widen_scal_v4hi (__a);
 }
 
 __extension__ extern __inline uint16_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vaddlv_u8 (uint8x8_t __a)
 {
-  return __builtin_aarch64_uaddlvv8qi_uu (__a);
+  return __builtin_aarch64_reduc_uplus_widen_scal_v8qi_uu (__a);
 }
 
 __extension__ extern __inline uint32_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vaddlv_u16 (uint16x4_t __a)
 {
-  return __builtin_aarch64_uaddlvv4hi_uu (__a);
+  return __builtin_aarch64_reduc_uplus_widen_scal_v4hi_uu (__a);
 }
 
 __extension__ extern __inline int16_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vaddlvq_s8 (int8x16_t __a)
 {
-  return __builtin_aarch64_saddlvv16qi (__a);
+  return __builtin_aarch64_reduc_splus_widen_scal_v16qi (__a);
 }
 
 __extension__ extern __inline int32_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vaddlvq_s16 (int16x8_t __a)
 {
-  return __builtin_aarch64_saddlvv8hi (__a);
+  return __builtin_aarch64_reduc_splus_widen_scal_v8hi (__a);
 }
 
 __extension__ extern __inline int64_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vaddlvq_s32 (int32x4_t __a)
 {
-  return __builtin_aarch64_saddlvv4si (__a);
+  return __builtin_aarch64_reduc_splus_widen_scal_v4si (__a);
 }
 
 __extension__ extern __inline uint16_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vaddlvq_u8 (uint8x16_t __a)
 {
-  return __builtin_aarch64_uaddlvv16qi_uu (__a);
+  return __builtin_aarch64_reduc_uplus_widen_scal_v16qi_uu (__a);
 }
 
 __extension__ extern __inline uint32_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vaddlvq_u16 (uint16x8_t __a)
 {
-  return __builtin_aarch64_uaddlvv8hi_uu (__a);
+  return __builtin_aarch64_reduc_uplus_widen_scal_v8hi_uu (__a);
 }
 
 __extension__ extern __inline uint64_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vaddlvq_u32 (uint32x4_t __a)
 {
-  return __builtin_aarch64_uaddlvv4si_uu (__a);
+  return __builtin_aarch64_reduc_uplus_widen_scal_v4si_uu (__a);
 }
 
 __extension__ extern __inline float32x2_t
@@ -6461,14 +6461,14 @@ __extension__ extern __inline int64_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vaddlv_s32 (int32x2_t __a)
 {
-  return __builtin_aarch64_saddlvv2si (__a);
+  return __builtin_aarch64_reduc_splus_widen_scal_v2si (__a);
 }
 
 __extension__ extern __inline uint64_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vaddlv_u32 (uint32x2_t __a)
 {
-  return __builtin_aarch64_uaddlvv2si_uu (__a);
+  return __builtin_aarch64_reduc_uplus_widen_scal_v2si_uu (__a);
 }
 
 __extension__ extern __inline int16x4_t




-- 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 5/8]AArch64 aarch64: Make existing V2HF be usable.
  2022-10-31 11:56 [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs Tamar Christina
                   ` (2 preceding siblings ...)
  2022-10-31 11:58 ` [PATCH 4/8]AArch64 aarch64: Implement widening reduction patterns Tamar Christina
@ 2022-10-31 11:58 ` Tamar Christina
  2022-11-01 14:58   ` Richard Sandiford
  2022-10-31 11:59 ` [PATCH 6/8]AArch64: Add peephole and scheduling logic for pairwise operations that appear late in RTL Tamar Christina
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 50+ messages in thread
From: Tamar Christina @ 2022-10-31 11:58 UTC (permalink / raw)
  To: gcc-patches
  Cc: nd, Richard.Earnshaw, Marcus.Shawcroft, Kyrylo.Tkachov,
	richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 17495 bytes --]

Hi All,

The backend has an existing V2HFmode that is used by pairwise operations.
This mode was, however, never made fully functional.  Amongst other things it
was never declared as a vector type, which made it unusable from the mid-end.

It's also lacking an implementation for loads/stores, so reload ICEs if this
mode is ever used.  This patch finishes the implementation by providing both.

Note that I have created a new iterator VHSDF_P instead of extending VHSDF,
because the existing iterator is used in far more places than just
loads/stores.

It's also used, for instance, in intrinsics, and extending it would force me
to provide support for mangling the type even though we never expose it
through intrinsics.
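
As an illustrative sketch of what this enables (my own example; the actual
tests live in the last patch of the series), an adjacent-lane FP16 addition
such as:

typedef _Float16 v4hf __attribute__((vector_size (8)));

_Float16
foo (v4hf x)
{
  /* x[0] and x[1] are adjacent V2HF lanes; with V2HF usable this can go
     through the earlier match.pd patterns and become a single pairwise
     add on a .2h arrangement instead of two extracts plus an fadd,
     assuming TARGET_SIMD_F16INST (+fp16).  */
  return x[0] + x[1];
}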

Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (*aarch64_simd_movv2hf): New.
	(mov<mode>, movmisalign<mode>, aarch64_dup_lane<mode>,
	aarch64_store_lane0<mode>, aarch64_simd_vec_set<mode>,
	@aarch64_simd_vec_copy_lane<mode>, vec_set<mode>,
	reduc_<optab>_scal_<mode>, reduc_<fmaxmin>_scal_<mode>,
	aarch64_reduc_<optab>_internal<mode>, aarch64_get_lane<mode>,
	vec_init<mode><Vel>, vec_extract<mode><Vel>): Support V2HF.
	* config/aarch64/aarch64.cc (aarch64_classify_vector_mode):
	Add E_V2HFmode.
	* config/aarch64/iterators.md (VHSDF_P): New.
	(V2F, VALL_F16_FULL, nunits, Vtype, Vmtype, Vetype, stype, VEL,
	Vel, q, vp): Add V2HF.
	* config/arm/types.md (neon_fp_reduc_add_h): New.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/sve/slp_1.c: Update testcase.

--- inline copy of patch -- 
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 25aed74f8cf939562ed65a578fe32ca76605b58a..93a2888f567460ad10ec050ea7d4f701df4729d1 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -19,10 +19,10 @@
 ;; <http://www.gnu.org/licenses/>.
 
 (define_expand "mov<mode>"
-  [(set (match_operand:VALL_F16 0 "nonimmediate_operand")
-	(match_operand:VALL_F16 1 "general_operand"))]
+  [(set (match_operand:VALL_F16_FULL 0 "nonimmediate_operand")
+	(match_operand:VALL_F16_FULL 1 "general_operand"))]
   "TARGET_SIMD"
-  "
+{
   /* Force the operand into a register if it is not an
      immediate whose use can be replaced with xzr.
      If the mode is 16 bytes wide, then we will be doing
@@ -46,12 +46,11 @@ (define_expand "mov<mode>"
       aarch64_expand_vector_init (operands[0], operands[1]);
       DONE;
     }
-  "
-)
+})
 
 (define_expand "movmisalign<mode>"
-  [(set (match_operand:VALL_F16 0 "nonimmediate_operand")
-        (match_operand:VALL_F16 1 "general_operand"))]
+  [(set (match_operand:VALL_F16_FULL 0 "nonimmediate_operand")
+        (match_operand:VALL_F16_FULL 1 "general_operand"))]
   "TARGET_SIMD && !STRICT_ALIGNMENT"
 {
   /* This pattern is not permitted to fail during expansion: if both arguments
@@ -85,10 +84,10 @@ (define_insn "aarch64_simd_dup<mode>"
 )
 
 (define_insn "aarch64_dup_lane<mode>"
-  [(set (match_operand:VALL_F16 0 "register_operand" "=w")
-	(vec_duplicate:VALL_F16
+  [(set (match_operand:VALL_F16_FULL 0 "register_operand" "=w")
+	(vec_duplicate:VALL_F16_FULL
 	  (vec_select:<VEL>
-	    (match_operand:VALL_F16 1 "register_operand" "w")
+	    (match_operand:VALL_F16_FULL 1 "register_operand" "w")
 	    (parallel [(match_operand:SI 2 "immediate_operand" "i")])
           )))]
   "TARGET_SIMD"
@@ -142,6 +141,29 @@ (define_insn "*aarch64_simd_mov<VDMOV:mode>"
 		     mov_reg, neon_move<q>")]
 )
 
+(define_insn "*aarch64_simd_movv2hf"
+  [(set (match_operand:V2HF 0 "nonimmediate_operand"
+		"=w, m,  m,  w, ?r, ?w, ?r, w, w")
+	(match_operand:V2HF 1 "general_operand"
+		"m,  Dz, w,  w,  w,  r,  r, Dz, Dn"))]
+  "TARGET_SIMD_F16INST
+   && (register_operand (operands[0], V2HFmode)
+       || aarch64_simd_reg_or_zero (operands[1], V2HFmode))"
+   "@
+    ldr\\t%s0, %1
+    str\\twzr, %0
+    str\\t%s1, %0
+    mov\\t%0.2s[0], %1.2s[0]
+    umov\\t%w0, %1.s[0]
+    fmov\\t%s0, %1
+    mov\\t%0, %1
+    movi\\t%d0, 0
+    * return aarch64_output_simd_mov_immediate (operands[1], 32);"
+  [(set_attr "type" "neon_load1_1reg, store_8, neon_store1_1reg,\
+		     neon_logic, neon_to_gp, f_mcr,\
+		     mov_reg, neon_move, neon_move")]
+)
+
 (define_insn "*aarch64_simd_mov<VQMOV:mode>"
   [(set (match_operand:VQMOV 0 "nonimmediate_operand"
 		"=w, Umn,  m,  w, ?r, ?w, ?r, w")
@@ -182,7 +204,7 @@ (define_insn "*aarch64_simd_mov<VQMOV:mode>"
 
 (define_insn "aarch64_store_lane0<mode>"
   [(set (match_operand:<VEL> 0 "memory_operand" "=m")
-	(vec_select:<VEL> (match_operand:VALL_F16 1 "register_operand" "w")
+	(vec_select:<VEL> (match_operand:VALL_F16_FULL 1 "register_operand" "w")
 			(parallel [(match_operand 2 "const_int_operand" "n")])))]
   "TARGET_SIMD
    && ENDIAN_LANE_N (<nunits>, INTVAL (operands[2])) == 0"
@@ -1035,11 +1057,11 @@ (define_insn "one_cmpl<mode>2"
 )
 
 (define_insn "aarch64_simd_vec_set<mode>"
-  [(set (match_operand:VALL_F16 0 "register_operand" "=w,w,w")
-	(vec_merge:VALL_F16
-	    (vec_duplicate:VALL_F16
+  [(set (match_operand:VALL_F16_FULL 0 "register_operand" "=w,w,w")
+	(vec_merge:VALL_F16_FULL
+	    (vec_duplicate:VALL_F16_FULL
 		(match_operand:<VEL> 1 "aarch64_simd_nonimmediate_operand" "w,?r,Utv"))
-	    (match_operand:VALL_F16 3 "register_operand" "0,0,0")
+	    (match_operand:VALL_F16_FULL 3 "register_operand" "0,0,0")
 	    (match_operand:SI 2 "immediate_operand" "i,i,i")))]
   "TARGET_SIMD"
   {
@@ -1061,14 +1083,14 @@ (define_insn "aarch64_simd_vec_set<mode>"
 )
 
 (define_insn "@aarch64_simd_vec_copy_lane<mode>"
-  [(set (match_operand:VALL_F16 0 "register_operand" "=w")
-	(vec_merge:VALL_F16
-	    (vec_duplicate:VALL_F16
+  [(set (match_operand:VALL_F16_FULL 0 "register_operand" "=w")
+	(vec_merge:VALL_F16_FULL
+	    (vec_duplicate:VALL_F16_FULL
 	      (vec_select:<VEL>
-		(match_operand:VALL_F16 3 "register_operand" "w")
+		(match_operand:VALL_F16_FULL 3 "register_operand" "w")
 		(parallel
 		  [(match_operand:SI 4 "immediate_operand" "i")])))
-	    (match_operand:VALL_F16 1 "register_operand" "0")
+	    (match_operand:VALL_F16_FULL 1 "register_operand" "0")
 	    (match_operand:SI 2 "immediate_operand" "i")))]
   "TARGET_SIMD"
   {
@@ -1376,7 +1398,7 @@ (define_insn "vec_shr_<mode>"
 )
 
 (define_expand "vec_set<mode>"
-  [(match_operand:VALL_F16 0 "register_operand")
+  [(match_operand:VALL_F16_FULL 0 "register_operand")
    (match_operand:<VEL> 1 "aarch64_simd_nonimmediate_operand")
    (match_operand:SI 2 "immediate_operand")]
   "TARGET_SIMD"
@@ -3503,7 +3525,7 @@ (define_insn "popcount<mode>2"
 ;; gimple_fold'd to the IFN_REDUC_(MAX|MIN) function.  (This is FP smax/smin).
 (define_expand "reduc_<optab>_scal_<mode>"
   [(match_operand:<VEL> 0 "register_operand")
-   (unspec:<VEL> [(match_operand:VHSDF 1 "register_operand")]
+   (unspec:<VEL> [(match_operand:VHSDF_P 1 "register_operand")]
 		 FMAXMINV)]
   "TARGET_SIMD"
   {
@@ -3518,7 +3540,7 @@ (define_expand "reduc_<optab>_scal_<mode>"
 
 (define_expand "reduc_<fmaxmin>_scal_<mode>"
   [(match_operand:<VEL> 0 "register_operand")
-   (unspec:<VEL> [(match_operand:VHSDF 1 "register_operand")]
+   (unspec:<VEL> [(match_operand:VHSDF_P 1 "register_operand")]
 		 FMAXMINNMV)]
   "TARGET_SIMD"
   {
@@ -3562,8 +3584,8 @@ (define_insn "aarch64_reduc_<optab>_internalv2si"
 )
 
 (define_insn "aarch64_reduc_<optab>_internal<mode>"
- [(set (match_operand:VHSDF 0 "register_operand" "=w")
-       (unspec:VHSDF [(match_operand:VHSDF 1 "register_operand" "w")]
+ [(set (match_operand:VHSDF_P 0 "register_operand" "=w")
+       (unspec:VHSDF_P [(match_operand:VHSDF_P 1 "register_operand" "w")]
 		      FMAXMINV))]
  "TARGET_SIMD"
  "<maxmin_uns_op><vp>\\t%<Vetype>0, %1.<Vtype>"
@@ -4208,7 +4230,7 @@ (define_insn "*aarch64_get_lane_zero_extend<GPI:mode><VDQQH:mode>"
 (define_insn_and_split "aarch64_get_lane<mode>"
   [(set (match_operand:<VEL> 0 "aarch64_simd_nonimmediate_operand" "=?r, w, Utv")
 	(vec_select:<VEL>
-	  (match_operand:VALL_F16 1 "register_operand" "w, w, w")
+	  (match_operand:VALL_F16_FULL 1 "register_operand" "w, w, w")
 	  (parallel [(match_operand:SI 2 "immediate_operand" "i, i, i")])))]
   "TARGET_SIMD"
   {
@@ -7989,7 +8011,7 @@ (define_expand "aarch64_st1<VALL_F16:mode>"
 ;; Standard pattern name vec_init<mode><Vel>.
 
 (define_expand "vec_init<mode><Vel>"
-  [(match_operand:VALL_F16 0 "register_operand")
+  [(match_operand:VALL_F16_FULL 0 "register_operand")
    (match_operand 1 "" "")]
   "TARGET_SIMD"
 {
@@ -8068,7 +8090,7 @@ (define_insn "aarch64_urecpe<mode>"
 
 (define_expand "vec_extract<mode><Vel>"
   [(match_operand:<VEL> 0 "aarch64_simd_nonimmediate_operand")
-   (match_operand:VALL_F16 1 "register_operand")
+   (match_operand:VALL_F16_FULL 1 "register_operand")
    (match_operand:SI 2 "immediate_operand")]
   "TARGET_SIMD"
 {
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index f05bac713e88ea8c7feaa2367d55bd523ca66f57..1e08f8453688210afe1566092b19b59c9bdd0c97 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -3566,6 +3566,7 @@ aarch64_classify_vector_mode (machine_mode mode)
     case E_V8BFmode:
     case E_V4SFmode:
     case E_V2DFmode:
+    case E_V2HFmode:
       return TARGET_SIMD ? VEC_ADVSIMD : 0;
 
     default:
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index 37d8161a33b1c399d80be82afa67613a087389d4..1df09f7fe2eb35aed96113476541e0faa5393551 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -160,6 +160,10 @@ (define_mode_iterator VDQF [V2SF V4SF V2DF])
 (define_mode_iterator VHSDF [(V4HF "TARGET_SIMD_F16INST")
 			     (V8HF "TARGET_SIMD_F16INST")
 			     V2SF V4SF V2DF])
+;; Advanced SIMD Float modes suitable for pairwise operations.
+(define_mode_iterator VHSDF_P [(V4HF "TARGET_SIMD_F16INST")
+			       (V8HF "TARGET_SIMD_F16INST")
+			       V2SF V4SF V2DF (V2HF "TARGET_SIMD_F16INST")])
 
 ;; Advanced SIMD Float modes, and DF.
 (define_mode_iterator VDQF_DF [V2SF V4SF V2DF DF])
@@ -188,15 +192,23 @@ (define_mode_iterator VDQF_COND [V2SF V2SI V4SF V4SI V2DF V2DI])
 (define_mode_iterator VALLF [V2SF V4SF V2DF SF DF])
 
 ;; Advanced SIMD Float modes with 2 elements.
-(define_mode_iterator V2F [V2SF V2DF])
+(define_mode_iterator V2F [V2SF V2DF V2HF])
 
 ;; All Advanced SIMD modes on which we support any arithmetic operations.
 (define_mode_iterator VALL [V8QI V16QI V4HI V8HI V2SI V4SI V2DI V2SF V4SF V2DF])
 
-;; All Advanced SIMD modes suitable for moving, loading, and storing.
+;; All Advanced SIMD modes suitable for moving, loading, and storing
+;; except V2HF.
 (define_mode_iterator VALL_F16 [V8QI V16QI V4HI V8HI V2SI V4SI V2DI
 				V4HF V8HF V4BF V8BF V2SF V4SF V2DF])
 
+;; All Advanced SIMD modes suitable for moving, loading, and storing
+;; including V2HF.
+(define_mode_iterator VALL_F16_FULL [V8QI V16QI V4HI V8HI V2SI V4SI V2DI
+				     V4HF V8HF V4BF V8BF V2SF V4SF V2DF
+				     (V2HF "TARGET_SIMD_F16INST")])
+
+
 ;; The VALL_F16 modes except the 128-bit 2-element ones.
 (define_mode_iterator VALL_F16_NO_V2Q [V8QI V16QI V4HI V8HI V2SI V4SI
 				V4HF V8HF V2SF V4SF])
@@ -1076,7 +1088,7 @@ (define_mode_attr nunits [(V8QI "8") (V16QI "16")
 			  (V2SF "2") (V4SF "4")
 			  (V1DF "1") (V2DF "2")
 			  (DI "1") (DF "1")
-			  (V8DI "8")])
+			  (V8DI "8") (V2HF "2")])
 
 ;; Map a mode to the number of bits in it, if the size of the mode
 ;; is constant.
@@ -1090,6 +1102,7 @@ (define_mode_attr s [(HF "h") (SF "s") (DF "d") (SI "s") (DI "d")])
 
 ;; Give the length suffix letter for a sign- or zero-extension.
 (define_mode_attr size [(QI "b") (HI "h") (SI "w")])
+(define_mode_attr sizel [(QI "b") (HI "h") (SI "")])
 
 ;; Give the number of bits in the mode
 (define_mode_attr sizen [(QI "8") (HI "16") (SI "32") (DI "64")])
@@ -1134,8 +1147,9 @@ (define_mode_attr Vtype [(V8QI "8b") (V16QI "16b")
                          (V2SI "2s") (V4SI  "4s")
                          (DI   "1d") (DF    "1d")
                          (V2DI "2d") (V2SF "2s")
-			 (V4SF "4s") (V2DF "2d")
-			 (V4HF "4h") (V8HF "8h")
+			 (V2HF "2h") (V4SF "4s")
+			 (V2DF "2d") (V4HF "4h")
+			 (V8HF "8h")
 			 (V2x8QI "8b") (V2x4HI "4h")
 			 (V2x2SI "2s") (V2x1DI  "1d")
 			 (V2x4HF "4h") (V2x2SF "2s")
@@ -1175,9 +1189,10 @@ (define_mode_attr Vmtype [(V8QI ".8b") (V16QI ".16b")
 			 (V4HI ".4h") (V8HI  ".8h")
 			 (V2SI ".2s") (V4SI  ".4s")
 			 (V2DI ".2d") (V4HF ".4h")
-			 (V8HF ".8h") (V4BF ".4h")
-			 (V8BF ".8h") (V2SF ".2s")
-			 (V4SF ".4s") (V2DF ".2d")
+			 (V8HF ".8h") (V2HF ".2h")
+			 (V4BF ".4h") (V8BF ".8h")
+			 (V2SF ".2s") (V4SF ".4s")
+			 (V2DF ".2d")
 			 (DI   "")    (SI   "")
 			 (HI   "")    (QI   "")
 			 (TI   "")    (HF   "")
@@ -1193,7 +1208,7 @@ (define_mode_attr Vmntype [(V8HI ".8b") (V4SI ".4h")
 (define_mode_attr Vetype [(V8QI "b") (V16QI "b")
 			  (V4HI "h") (V8HI  "h")
 			  (V2SI "s") (V4SI  "s")
-			  (V2DI "d")
+			  (V2DI "d") (V2HF  "h")
 			  (V4HF "h") (V8HF  "h")
 			  (V2SF "s") (V4SF  "s")
 			  (V2DF "d")
@@ -1285,7 +1300,7 @@ (define_mode_attr Vcwtype [(VNx16QI "b") (VNx8QI "h") (VNx4QI "w") (VNx2QI "d")
 ;; more accurately.
 (define_mode_attr stype [(V8QI "b") (V16QI "b") (V4HI "s") (V8HI "s")
 			 (V2SI "s") (V4SI "s") (V2DI "d") (V4HF "s")
-			 (V8HF "s") (V2SF "s") (V4SF "s") (V2DF "d")
+			 (V8HF "s") (V2SF "s") (V4SF "s") (V2DF "d") (V2HF "s")
 			 (HF "s") (SF "s") (DF "d") (QI "b") (HI "s")
 			 (SI "s") (DI "d")])
 
@@ -1360,8 +1375,8 @@ (define_mode_attr VEL [(V8QI  "QI") (V16QI "QI")
 		       (V4HF "HF") (V8HF  "HF")
 		       (V2SF "SF") (V4SF  "SF")
 		       (DF   "DF") (V2DF  "DF")
-		       (SI   "SI") (HI    "HI")
-		       (QI   "QI")
+		       (SI   "SI") (V2HF  "HF")
+		       (QI   "QI") (HI    "HI")
 		       (V4BF "BF") (V8BF "BF")
 		       (VNx16QI "QI") (VNx8QI "QI") (VNx4QI "QI") (VNx2QI "QI")
 		       (VNx8HI "HI") (VNx4HI "HI") (VNx2HI "HI")
@@ -1381,7 +1396,7 @@ (define_mode_attr Vel [(V8QI "qi") (V16QI "qi")
 		       (V2SF "sf") (V4SF "sf")
 		       (V2DF "df") (DF   "df")
 		       (SI   "si") (HI   "hi")
-		       (QI   "qi")
+		       (QI   "qi") (V2HF "hf")
 		       (V4BF "bf") (V8BF "bf")
 		       (VNx16QI "qi") (VNx8QI "qi") (VNx4QI "qi") (VNx2QI "qi")
 		       (VNx8HI "hi") (VNx4HI "hi") (VNx2HI "hi")
@@ -1866,7 +1881,7 @@ (define_mode_attr q [(V8QI "") (V16QI "_q")
 		     (V4HF "") (V8HF "_q")
 		     (V4BF "") (V8BF "_q")
 		     (V2SF "") (V4SF  "_q")
-			       (V2DF  "_q")
+		     (V2HF "") (V2DF  "_q")
 		     (QI "") (HI "") (SI "") (DI "") (HF "") (SF "") (DF "")
 		     (V2x8QI "") (V2x16QI "_q")
 		     (V2x4HI "") (V2x8HI "_q")
@@ -1905,6 +1920,7 @@ (define_mode_attr vp [(V8QI "v") (V16QI "v")
 		      (V2SI "p") (V4SI  "v")
 		      (V2DI "p") (V2DF  "p")
 		      (V2SF "p") (V4SF  "v")
+		      (V2HF "p")
 		      (V4HF "v") (V8HF  "v")])
 
 (define_mode_attr vsi2qi [(V2SI "v8qi") (V4SI "v16qi")
diff --git a/gcc/config/arm/types.md b/gcc/config/arm/types.md
index 7d0504bdd944e9c0d1b545b0b66a9a1adc808714..3cfbc7a93cca1bea4925853e51d0a147c5722247 100644
--- a/gcc/config/arm/types.md
+++ b/gcc/config/arm/types.md
@@ -483,6 +483,7 @@ (define_attr "autodetect_type"
 ; neon_fp_minmax_s_q
 ; neon_fp_minmax_d
 ; neon_fp_minmax_d_q
+; neon_fp_reduc_add_h
 ; neon_fp_reduc_add_s
 ; neon_fp_reduc_add_s_q
 ; neon_fp_reduc_add_d
@@ -1033,6 +1034,7 @@ (define_attr "type"
   neon_fp_minmax_d,\
   neon_fp_minmax_d_q,\
 \
+  neon_fp_reduc_add_h,\
   neon_fp_reduc_add_s,\
   neon_fp_reduc_add_s_q,\
   neon_fp_reduc_add_d,\
@@ -1257,8 +1259,8 @@ (define_attr "is_neon_type" "yes,no"
           neon_fp_compare_d, neon_fp_compare_d_q, neon_fp_minmax_s,\
           neon_fp_minmax_s_q, neon_fp_minmax_d, neon_fp_minmax_d_q,\
           neon_fp_neg_s, neon_fp_neg_s_q, neon_fp_neg_d, neon_fp_neg_d_q,\
-          neon_fp_reduc_add_s, neon_fp_reduc_add_s_q, neon_fp_reduc_add_d,\
-          neon_fp_reduc_add_d_q, neon_fp_reduc_minmax_s,
+          neon_fp_reduc_add_h, neon_fp_reduc_add_s, neon_fp_reduc_add_s_q,\
+          neon_fp_reduc_add_d, neon_fp_reduc_add_d_q, neon_fp_reduc_minmax_s,\
           neon_fp_reduc_minmax_s_q, neon_fp_reduc_minmax_d,\
           neon_fp_reduc_minmax_d_q,\
           neon_fp_cvt_narrow_s_q, neon_fp_cvt_narrow_d_q,\
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
index 07d71a63414b1066ea431e287286ad048515711a..8e35e0b574d49913b43c7d8d4f4ba75f127f42e9 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
@@ -30,11 +30,9 @@ vec_slp_##TYPE (TYPE *restrict a, TYPE b, TYPE c, int n)	\
 TEST_ALL (VEC_PERM)
 
 /* We should use one DUP for each of the 8-, 16- and 32-bit types,
-   although we currently use LD1RW for _Float16.  We should use two
-   DUPs for each of the three 64-bit types.  */
+   We should use two DUPs for each of the three 64-bit types.  */
 /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.h, [hw]} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.s, [sw]} 2 } } */
-/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 1 } } */
+/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.s, [sw]} 3 } } */
 /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, [dx]} 9 } } */
 /* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 3 } } */
 /* { dg-final { scan-assembler-not {\tzip2\t} } } */




-- 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 6/8]AArch64: Add peephole and scheduling logic for pairwise operations that appear late in RTL.
  2022-10-31 11:56 [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs Tamar Christina
                   ` (3 preceding siblings ...)
  2022-10-31 11:58 ` [PATCH 5/8]AArch64 aarch64: Make existing V2HF be usable Tamar Christina
@ 2022-10-31 11:59 ` Tamar Christina
  2022-10-31 11:59 ` [PATCH 7/8]AArch64: Consolidate zero and sign extension patterns and add missing ones Tamar Christina
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 50+ messages in thread
From: Tamar Christina @ 2022-10-31 11:59 UTC (permalink / raw)
  To: gcc-patches
  Cc: nd, Richard.Earnshaw, Marcus.Shawcroft, Kyrylo.Tkachov,
	richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 3123 bytes --]

Hi All,

Does what it says on the tin.  In case some pairwise operations only form in
RTL due to a split, combine or any other RTL pass, still try to recognize them.
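
As a rough, hypothetical sketch of the shape involved (the function name below
is made up, and whether a given case is already caught at the gimple level by
the earlier patches depends on which passes run), the peephole targets a lane-1
extract feeding an add with the low lane of the same vector register:

typedef double v2df __attribute__((vector_size (16)));

double
pairwise_add (v2df x)
{
  /* If this survives to RTL as a vec_select of lane 1 followed by a scalar
     add with the low lane, the new peephole can rewrite it as a pairwise
     add.  */
  return x[1] + x[0];
}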

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md: Add new peepholes.
	* config/aarch64/aarch64.cc (aarch_macro_fusion_pair_p): Schedule
	sequential PLUS operations next to each other to increase the chance of
	forming pairwise operations.

--- inline copy of patch -- 
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 93a2888f567460ad10ec050ea7d4f701df4729d1..20e9adbf7b9b484f9a19f0c62770930dc3941eb2 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -3425,6 +3425,22 @@ (define_insn "aarch64_faddp<mode>"
   [(set_attr "type" "neon_fp_reduc_add_<stype><q>")]
 )
 
+(define_peephole2
+  [(set (match_operand:<VEL> 0 "register_operand")
+	(vec_select:<VEL>
+	  (match_operand:VHSDF 1 "register_operand")
+	  (parallel [(match_operand 2 "const_int_operand")])))
+   (set (match_operand:<VEL> 3 "register_operand")
+	(plus:<VEL>
+	  (match_dup 0)
+	  (match_operand:<VEL> 5 "register_operand")))]
+  "TARGET_SIMD
+   && ENDIAN_LANE_N (<nunits>, INTVAL (operands[2])) == 1
+   && REGNO (operands[5]) == REGNO (operands[1])
+   && peep2_reg_dead_p (2, operands[0])"
+  [(set (match_dup 3) (unspec:<VEL> [(match_dup 1)] UNSPEC_FADDV))]
+)
+
 (define_insn "reduc_plus_scal_<mode>"
  [(set (match_operand:<VEL> 0 "register_operand" "=w")
        (unspec:<VEL> [(match_operand:VDQV 1 "register_operand" "w")]
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index f3bd71c9f10868f9e6ab50d8e36ed3ee3d48ac22..4023b1729d92bf37f5a2fc8fc8cd3a5194532079 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -25372,6 +25372,29 @@ aarch_macro_fusion_pair_p (rtx_insn *prev, rtx_insn *curr)
         }
     }
 
+  /* Try to schedule vec_select and add together so the peephole works.  */
+  if (simple_sets_p && REG_P (SET_DEST (prev_set)) && REG_P (SET_DEST (curr_set))
+      && GET_CODE (SET_SRC (prev_set)) == VEC_SELECT && GET_CODE (SET_SRC (curr_set)) == PLUS)
+  {
+    /* We're trying to match:
+       prev (vec_select) == (set (reg r0)
+				 (vec_select (reg r1) n)
+       curr (plus) == (set (reg r2)
+			   (plus (reg r0) (reg r1)))  */
+    rtx prev_src = SET_SRC (prev_set);
+    rtx curr_src = SET_SRC (curr_set);
+    rtx parallel = XEXP (prev_src, 1);
+    auto idx
+      = ENDIAN_LANE_N (GET_MODE_NUNITS (GET_MODE (XEXP (prev_src, 0))), 1);
+    if (GET_CODE (parallel) == PARALLEL
+	&& XVECLEN (parallel, 0) == 1
+	&& known_eq (INTVAL (XVECEXP (parallel, 0, 0)), idx)
+	&& GET_MODE (SET_DEST (prev_set)) == GET_MODE (curr_src)
+	&& GET_MODE_INNER (GET_MODE (XEXP (prev_src, 0)))
+	    == GET_MODE (XEXP (curr_src, 1)))
+      return true;
+  }
+
   /* Fuse compare (CMP/CMN/TST/BICS) and conditional branch.  */
   if (aarch64_fusion_enabled_p (AARCH64_FUSE_CMP_BRANCH)
       && prev_set && curr_set && any_condjump_p (curr)




-- 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 7/8]AArch64: Consolidate zero and sign extension patterns and add missing ones.
  2022-10-31 11:56 [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs Tamar Christina
                   ` (4 preceding siblings ...)
  2022-10-31 11:59 ` [PATCH 6/8]AArch64: Add peephole and scheduling logic for pairwise operations that appear late in RTL Tamar Christina
@ 2022-10-31 11:59 ` Tamar Christina
  2022-11-30  4:28   ` Tamar Christina
  2022-12-06 15:59   ` Richard Sandiford
  2022-10-31 12:00 ` [PATCH 8/8]AArch64: Have reload not choose to do add on the scalar side if both values exist on the SIMD side Tamar Christina
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 50+ messages in thread
From: Tamar Christina @ 2022-10-31 11:59 UTC (permalink / raw)
  To: gcc-patches
  Cc: nd, Richard.Earnshaw, Marcus.Shawcroft, Kyrylo.Tkachov,
	richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 14104 bytes --]

Hi All,

The target has various zero- and sign-extension patterns.  These however live in
various locations around the MD file and almost all of them are split
differently.  Due to this scattering of patterns we also ended up missing valid
extensions.  For instance, smov is almost never generated.
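
As a purely illustrative sketch (the function and type names here are made up
rather than taken from the patch or testsuite), a sign-extending lane extract
such as the following is the kind of case that should ideally end up as a
single smov:

typedef short v4hi __attribute__((vector_size (8)));

long long
lane_to_gp (v4hi x)
{
  /* Sign-extend a 16-bit lane straight to 64 bits; with the consolidated
     patterns this should be able to use a single smov rather than a move
     followed by a separate extend.  */
  return x[0];
}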

This change tries to make this more manageable by consolidating the patterns as
much as possible and, in doing so, fixing the missing alternatives.

There were also some duplicate patterns.  Note that the
zero_extend<*_ONLY:mode><SD_HSDI:mode>2 patterns are nearly identical; however,
QImode lacks an alternative that the others have, so I have left them as
3 different patterns next to each other.

In a lot of cases the wrong iterator was used, leaving out cases that should
exist.

I've also changed the masks used for zero extensions to hex instead of decimal,
as it's clearer what they do that way and it aligns better with the output of
other compilers.

This leaves the bulk of the extensions in just 3 patterns.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md
	(*aarch64_get_lane_zero_extend<GPI:mode><VDQQH:mode>): Changed to ...
	(*aarch64_get_lane_zero_extend<GPI:mode><VDQV_L:mode>): ... This.
	(*aarch64_get_lane_extenddi<VS:mode>): New.
	* config/aarch64/aarch64.md (<optab>sidi2, *extendsidi2_aarch64,
	<optab>qihi2, *extendqihi2_aarch64, *zero_extendsidi2_aarch64): Remove
	duplicate patterns.
	(<ANY_EXTEND:optab><SHORT:mode><GPI:mode>2,
	*extend<SHORT:mode><GPI:mode>2_aarch64): Remove, consolidate
	into ...
	(extend<ALLX:mode><SD_HSDI:mode>2): ... This.
	(*zero_extendqihi2_aarch64,
	*zero_extend<SHORT:mode><GPI:mode>2_aarch64): Remove, consolidate into
	...
	(zero_extend<SI_ONLY:mode><SD_HSDI:mode>2,
	zero_extend<HI_ONLY:mode><SD_HSDI:mode>2,
	(zero_extend<QI_ONLY:mode><SD_HSDI:mode>2): ... This.
	(*ands<GPI:mode>_compare0): Renamed to ...
	(*ands<SD_HSDI:mode>_compare0): ... This.
	* config/aarch64/iterators.md (HI_ONLY, QI_ONLY): New.
	(short_mask): Use hex rather than dec and add SI.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/ands_3.c: Update codegen.
	* gcc.target/aarch64/sve/slp_1.c: Likewise.
	* gcc.target/aarch64/tst_5.c: Likewise.
	* gcc.target/aarch64/tst_6.c: Likewise.

--- inline copy of patch -- 
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 8a84a8560e982b8155b18541f5504801b3330124..d0b37c4dd48aeafd3d87c90dc3270e71af5a72b9 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -4237,19 +4237,34 @@ (define_insn "*aarch64_get_lane_extend<GPI:mode><VDQQH:mode>"
   [(set_attr "type" "neon_to_gp<VDQQH:q>")]
 )
 
-(define_insn "*aarch64_get_lane_zero_extend<GPI:mode><VDQQH:mode>"
+(define_insn "*aarch64_get_lane_extenddi<VS:mode>"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+	(sign_extend:DI
+	  (vec_select:<VS:VEL>
+	    (match_operand:VS 1 "register_operand" "w")
+	    (parallel [(match_operand:SI 2 "immediate_operand" "i")]))))]
+  "TARGET_SIMD"
+  {
+    operands[2] = aarch64_endian_lane_rtx (<VS:MODE>mode,
+					   INTVAL (operands[2]));
+    return "smov\\t%x0, %1.<VS:Vetype>[%2]";
+  }
+  [(set_attr "type" "neon_to_gp<VS:q>")]
+)
+
+(define_insn "*aarch64_get_lane_zero_extend<GPI:mode><VDQV_L:mode>"
   [(set (match_operand:GPI 0 "register_operand" "=r")
 	(zero_extend:GPI
-	  (vec_select:<VDQQH:VEL>
-	    (match_operand:VDQQH 1 "register_operand" "w")
+	  (vec_select:<VDQV_L:VEL>
+	    (match_operand:VDQV_L 1 "register_operand" "w")
 	    (parallel [(match_operand:SI 2 "immediate_operand" "i")]))))]
   "TARGET_SIMD"
   {
-    operands[2] = aarch64_endian_lane_rtx (<VDQQH:MODE>mode,
+    operands[2] = aarch64_endian_lane_rtx (<VDQV_L:MODE>mode,
 					   INTVAL (operands[2]));
-    return "umov\\t%w0, %1.<VDQQH:Vetype>[%2]";
+    return "umov\\t%w0, %1.<VDQV_L:Vetype>[%2]";
   }
-  [(set_attr "type" "neon_to_gp<VDQQH:q>")]
+  [(set_attr "type" "neon_to_gp<VDQV_L:q>")]
 )
 
 ;; Lane extraction of a value, neither sign nor zero extension
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 3ea16dbc2557c6a4f37104d44a49f77f768eb53d..09ae1118371f82ca63146fceb953eb9e820d05a4 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -1911,22 +1911,6 @@ (define_insn "storewb_pair<TX:mode>_<P:mode>"
 ;; Sign/Zero extension
 ;; -------------------------------------------------------------------
 
-(define_expand "<optab>sidi2"
-  [(set (match_operand:DI 0 "register_operand")
-	(ANY_EXTEND:DI (match_operand:SI 1 "nonimmediate_operand")))]
-  ""
-)
-
-(define_insn "*extendsidi2_aarch64"
-  [(set (match_operand:DI 0 "register_operand" "=r,r")
-        (sign_extend:DI (match_operand:SI 1 "nonimmediate_operand" "r,m")))]
-  ""
-  "@
-   sxtw\t%0, %w1
-   ldrsw\t%0, %1"
-  [(set_attr "type" "extend,load_4")]
-)
-
 (define_insn "*load_pair_extendsidi2_aarch64"
   [(set (match_operand:DI 0 "register_operand" "=r")
 	(sign_extend:DI (match_operand:SI 1 "aarch64_mem_pair_operand" "Ump")))
@@ -1940,21 +1924,6 @@ (define_insn "*load_pair_extendsidi2_aarch64"
   [(set_attr "type" "load_8")]
 )
 
-(define_insn "*zero_extendsidi2_aarch64"
-  [(set (match_operand:DI 0 "register_operand" "=r,r,w,w,r,w")
-        (zero_extend:DI (match_operand:SI 1 "nonimmediate_operand" "r,m,r,m,w,w")))]
-  ""
-  "@
-   uxtw\t%0, %w1
-   ldr\t%w0, %1
-   fmov\t%s0, %w1
-   ldr\t%s0, %1
-   fmov\t%w0, %s1
-   fmov\t%s0, %s1"
-  [(set_attr "type" "mov_reg,load_4,f_mcr,f_loads,f_mrc,fmov")
-   (set_attr "arch" "*,*,fp,fp,fp,fp")]
-)
-
 (define_insn "*load_pair_zero_extendsidi2_aarch64"
   [(set (match_operand:DI 0 "register_operand" "=r,w")
 	(zero_extend:DI (match_operand:SI 1 "aarch64_mem_pair_operand" "Ump,Ump")))
@@ -1971,61 +1940,64 @@ (define_insn "*load_pair_zero_extendsidi2_aarch64"
    (set_attr "arch" "*,fp")]
 )
 
-(define_expand "<ANY_EXTEND:optab><SHORT:mode><GPI:mode>2"
-  [(set (match_operand:GPI 0 "register_operand")
-        (ANY_EXTEND:GPI (match_operand:SHORT 1 "nonimmediate_operand")))]
-  ""
-)
-
-(define_insn "*extend<SHORT:mode><GPI:mode>2_aarch64"
-  [(set (match_operand:GPI 0 "register_operand" "=r,r,r")
-        (sign_extend:GPI (match_operand:SHORT 1 "nonimmediate_operand" "r,m,w")))]
+(define_insn "extend<ALLX:mode><SD_HSDI:mode>2"
+  [(set (match_operand:SD_HSDI 0 "register_operand" "=r,r,r")
+        (sign_extend:SD_HSDI
+	  (match_operand:ALLX 1 "nonimmediate_operand" "r,m,w")))]
   ""
   "@
-   sxt<SHORT:size>\t%<GPI:w>0, %w1
-   ldrs<SHORT:size>\t%<GPI:w>0, %1
-   smov\t%<GPI:w>0, %1.<SHORT:size>[0]"
+   sxt<ALLX:size>\t%<SD_HSDI:w>0, %w1
+   ldrs<ALLX:size>\t%<SD_HSDI:w>0, %1
+   smov\t%<SD_HSDI:w>0, %1.<ALLX:Vetype>[0]"
   [(set_attr "type" "extend,load_4,neon_to_gp")
    (set_attr "arch" "*,*,fp")]
 )
 
-(define_insn "*zero_extend<SHORT:mode><GPI:mode>2_aarch64"
-  [(set (match_operand:GPI 0 "register_operand" "=r,r,w,r")
-        (zero_extend:GPI (match_operand:SHORT 1 "nonimmediate_operand" "r,m,m,w")))]
+(define_insn "zero_extend<SI_ONLY:mode><SD_HSDI:mode>2"
+  [(set (match_operand:SD_HSDI 0 "register_operand" "=r,r,w,w,r,w")
+        (zero_extend:SD_HSDI
+	  (match_operand:SI_ONLY 1 "nonimmediate_operand" "r,m,r,m,w,w")))]
   ""
   "@
-   and\t%<GPI:w>0, %<GPI:w>1, <SHORT:short_mask>
-   ldr<SHORT:size>\t%w0, %1
-   ldr\t%<SHORT:size>0, %1
-   umov\t%w0, %1.<SHORT:size>[0]"
-  [(set_attr "type" "logic_imm,load_4,f_loads,neon_to_gp")
-   (set_attr "arch" "*,*,fp,fp")]
-)
-
-(define_expand "<optab>qihi2"
-  [(set (match_operand:HI 0 "register_operand")
-        (ANY_EXTEND:HI (match_operand:QI 1 "nonimmediate_operand")))]
-  ""
+   uxt<SI_ONLY:size>\t%<SD_HSDI:w>0, %w1
+   ldr<SI_ONLY:sizel>\t%w0, %1
+   fmov\t%<SI_ONLY:Vetype>0, %w1
+   ldr\t%<SI_ONLY:Vetype>0, %1
+   fmov\t%w0, %<SI_ONLY:Vetype>1
+   fmov\t%<SI_ONLY:Vetype>0, %<SI_ONLY:Vetype>1"
+  [(set_attr "type" "mov_reg,load_4,f_mcr,f_loads,f_mrc,fmov")
+   (set_attr "arch" "*,*,fp,fp,fp,fp")]
 )
 
-(define_insn "*extendqihi2_aarch64"
-  [(set (match_operand:HI 0 "register_operand" "=r,r")
-	(sign_extend:HI (match_operand:QI 1 "nonimmediate_operand" "r,m")))]
+(define_insn "zero_extend<HI_ONLY:mode><SD_HSDI:mode>2"
+  [(set (match_operand:SD_HSDI 0 "register_operand" "=r,r,w,w,r,w")
+        (zero_extend:SD_HSDI
+	  (match_operand:HI_ONLY 1 "nonimmediate_operand" "r,m,r,m,w,w")))]
   ""
   "@
-   sxtb\t%w0, %w1
-   ldrsb\t%w0, %1"
-  [(set_attr "type" "extend,load_4")]
+   uxt<HI_ONLY:size>\t%<SD_HSDI:w>0, %w1
+   ldr<HI_ONLY:sizel>\t%w0, %1
+   fmov\t%<HI_ONLY:Vetype>0, %w1
+   ldr\t%<HI_ONLY:Vetype>0, %1
+   umov\t%w0, %1.<HI_ONLY:Vetype>[0]
+   fmov\t%<HI_ONLY:Vetype>0, %<HI_ONLY:Vetype>1"
+  [(set_attr "type" "mov_reg,load_4,f_mcr,f_loads,f_mrc,fmov")
+   (set_attr "arch" "*,*,fp16,fp,fp,fp16")]
 )
 
-(define_insn "*zero_extendqihi2_aarch64"
-  [(set (match_operand:HI 0 "register_operand" "=r,r")
-	(zero_extend:HI (match_operand:QI 1 "nonimmediate_operand" "r,m")))]
+(define_insn "zero_extend<QI_ONLY:mode><SD_HSDI:mode>2"
+  [(set (match_operand:SD_HSDI 0 "register_operand" "=r,r,w,r,w")
+        (zero_extend:SD_HSDI
+	  (match_operand:QI_ONLY 1 "nonimmediate_operand" "r,m,m,w,w")))]
   ""
   "@
-   and\t%w0, %w1, 255
-   ldrb\t%w0, %1"
-  [(set_attr "type" "logic_imm,load_4")]
+   uxt<QI_ONLY:size>\t%<SD_HSDI:w>0, %w1
+   ldr<QI_ONLY:sizel>\t%w0, %1
+   ldr\t%<QI_ONLY:Vetype>0, %1
+   umov\t%w0, %1.<QI_ONLY:Vetype>[0]
+   dup\t%<QI_ONLY:Vetype>0, %1.<QI_ONLY:Vetype>[0]"
+  [(set_attr "type" "mov_reg,load_4,f_loads,f_mrc,fmov")
+   (set_attr "arch" "*,*,fp,fp,fp")]
 )
 
 ;; -------------------------------------------------------------------
@@ -5029,15 +5001,15 @@ (define_insn "*and<mode>_compare0"
   [(set_attr "type" "alus_imm")]
 )
 
-(define_insn "*ands<GPI:mode>_compare0"
+(define_insn "*ands<SD_HSDI:mode>_compare0"
   [(set (reg:CC_NZ CC_REGNUM)
 	(compare:CC_NZ
-	 (zero_extend:GPI (match_operand:SHORT 1 "register_operand" "r"))
+	 (zero_extend:SD_HSDI (match_operand:ALLX 1 "register_operand" "r"))
 	 (const_int 0)))
-   (set (match_operand:GPI 0 "register_operand" "=r")
-	(zero_extend:GPI (match_dup 1)))]
+   (set (match_operand:SD_HSDI 0 "register_operand" "=r")
+	(zero_extend:SD_HSDI (match_dup 1)))]
   ""
-  "ands\\t%<GPI:w>0, %<GPI:w>1, <short_mask>"
+  "ands\\t%<SD_HSDI:w>0, %<SD_HSDI:w>1, <ALLX:short_mask>"
   [(set_attr "type" "alus_imm")]
 )
 
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index 1df09f7fe2eb35aed96113476541e0faa5393551..e904407b2169e589b7007ff966b2d9347a6d0fd2 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -41,6 +41,8 @@ (define_mode_iterator SHORT [QI HI])
 ;; Iterators for single modes, for "@" patterns.
 (define_mode_iterator SI_ONLY [SI])
 (define_mode_iterator DI_ONLY [DI])
+(define_mode_iterator HI_ONLY [HI])
+(define_mode_iterator QI_ONLY [QI])
 
 ;; Iterator for all integer modes (up to 64-bit)
 (define_mode_iterator ALLI [QI HI SI DI])
@@ -1033,7 +1035,7 @@ (define_mode_attr w2 [(HF "x") (SF "x") (DF "w")])
 ;; For width of fp registers in fcvt instruction
 (define_mode_attr fpw [(DI "s") (SI "d")])
 
-(define_mode_attr short_mask [(HI "65535") (QI "255")])
+(define_mode_attr short_mask [(SI "0xffffffff") (HI "0xffff") (QI "0xff")])
 
 ;; For constraints used in scalar immediate vector moves
 (define_mode_attr hq [(HI "h") (QI "q")])
diff --git a/gcc/testsuite/gcc.target/aarch64/ands_3.c b/gcc/testsuite/gcc.target/aarch64/ands_3.c
index 42cb7f0f0bc86a4aceb09851c31eb2e888d93403..421aa5cea7a51ad810cc9c5653a149cb21bb871c 100644
--- a/gcc/testsuite/gcc.target/aarch64/ands_3.c
+++ b/gcc/testsuite/gcc.target/aarch64/ands_3.c
@@ -9,4 +9,4 @@ f9 (unsigned char x, int y)
   return x;
 }
 
-/* { dg-final { scan-assembler "ands\t(x|w)\[0-9\]+,\[ \t\]*(x|w)\[0-9\]+,\[ \t\]*255" } } */
+/* { dg-final { scan-assembler "ands\t(x|w)\[0-9\]+,\[ \t\]*(x|w)\[0-9\]+,\[ \t\]*0xff" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
index 8e35e0b574d49913b43c7d8d4f4ba75f127f42e9..03288976b3397cdbe0e822f94f2a6448d9fa9a52 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
@@ -51,7 +51,6 @@ TEST_ALL (VEC_PERM)
 /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
 /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 6 } } */
 /* { dg-final { scan-assembler-not {\tldr} } } */
-/* { dg-final { scan-assembler-times {\tstr} 2 } } */
-/* { dg-final { scan-assembler-times {\tstr\th[0-9]+} 2 } } */
+/* { dg-final { scan-assembler-times {\tins\tv[0-9]+\.h\[1\], v[0-9]+\.h\[0\]} 1 } } */
 
 /* { dg-final { scan-assembler-not {\tuqdec} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/tst_5.c b/gcc/testsuite/gcc.target/aarch64/tst_5.c
index 0de40a6c47a7d63c1b7a81aeba438a096c0041b8..19034cd74ed07ea4d670c25d9ab3d1cff805a483 100644
--- a/gcc/testsuite/gcc.target/aarch64/tst_5.c
+++ b/gcc/testsuite/gcc.target/aarch64/tst_5.c
@@ -4,7 +4,7 @@
 int
 f255 (int x)
 {
-  if (x & 255)
+  if (x & 0xff)
     return 1;
   return x;
 }
@@ -12,10 +12,10 @@ f255 (int x)
 int
 f65535 (int x)
 {
-  if (x & 65535)
+  if (x & 0xffff)
     return 1;
   return x;
 }
 
-/* { dg-final { scan-assembler "tst\t(x|w)\[0-9\]+,\[ \t\]*255" } } */
-/* { dg-final { scan-assembler "tst\t(x|w)\[0-9\]+,\[ \t\]*65535" } } */
+/* { dg-final { scan-assembler "tst\t(x|w)\[0-9\]+,\[ \t\]*0xff" } } */
+/* { dg-final { scan-assembler "tst\t(x|w)\[0-9\]+,\[ \t\]*0xffff" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/tst_6.c b/gcc/testsuite/gcc.target/aarch64/tst_6.c
index f15ec114c391fed79cc43b7740fde83fb3d4ea53..1c047cfae214b60e5bf003e6781a277202fcc588 100644
--- a/gcc/testsuite/gcc.target/aarch64/tst_6.c
+++ b/gcc/testsuite/gcc.target/aarch64/tst_6.c
@@ -7,4 +7,4 @@ foo (long x)
    return ((short) x != 0) ? x : 1;
 }
 
-/* { dg-final { scan-assembler "tst\t(x|w)\[0-9\]+,\[ \t\]*65535" } } */
+/* { dg-final { scan-assembler "tst\t(x|w)\[0-9\]+,\[ \t\]*0xffff" } } */




-- 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 8/8]AArch64: Have reload not choose to do add on the scalar side if both values exist on the SIMD side.
  2022-10-31 11:56 [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs Tamar Christina
                   ` (5 preceding siblings ...)
  2022-10-31 11:59 ` [PATCH 7/8]AArch64: Consolidate zero and sign extension patterns and add missing ones Tamar Christina
@ 2022-10-31 12:00 ` Tamar Christina
  2022-11-01 15:04   ` Richard Sandiford
  2022-10-31 21:41 ` [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs Jeff Law
  2022-11-05 11:32 ` Richard Biener
  8 siblings, 1 reply; 50+ messages in thread
From: Tamar Christina @ 2022-10-31 12:00 UTC (permalink / raw)
  To: gcc-patches
  Cc: nd, Richard.Earnshaw, Marcus.Shawcroft, Kyrylo.Tkachov,
	richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 11322 bytes --]

Hi All,

Currently we oftentimes generate an r -> r add even if it means we need two
reloads to perform it, i.e. in the case where the values are on the SIMD side.

The pairwise operations expose these more now and so we get suboptimal codegen.

Normally I would have liked to use ^ or $ here, but while this works for the
simple examples, reload inexplicably falls apart on examples that should have
been trivial.  It forces an r -> w move in order to use the w ADD, which is
counter to what ^ and $ should do.

However, ! seems to fix all the regressions and still maintains the good
codegen.

I have tried looking into whether it's our costings that are off, but I can't
see anything logical here.  So I'd like to push this change instead, along
with tests that augment the other testcases that guard the r -> r variants.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* config/aarch64/aarch64.md (*add<mode>3_aarch64): Add ! to the r -> r
	alternative.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/simd/scalar_addp.c: New test.
	* gcc.target/aarch64/simd/scalar_faddp.c: New test.
	* gcc.target/aarch64/simd/scalar_faddp2.c: New test.
	* gcc.target/aarch64/simd/scalar_fmaxp.c: New test.
	* gcc.target/aarch64/simd/scalar_fminp.c: New test.
	* gcc.target/aarch64/simd/scalar_maxp.c: New test.
	* gcc.target/aarch64/simd/scalar_minp.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 09ae1118371f82ca63146fceb953eb9e820d05a4..c333fb1f72725992bb304c560f1245a242d5192d 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -2043,7 +2043,7 @@ (define_expand "add<mode>3"
 
 (define_insn "*add<mode>3_aarch64"
   [(set
-    (match_operand:GPI 0 "register_operand" "=rk,rk,w,rk,r,r,rk")
+    (match_operand:GPI 0 "register_operand" "=rk,!rk,w,rk,r,r,rk")
     (plus:GPI
      (match_operand:GPI 1 "register_operand" "%rk,rk,w,rk,rk,0,rk")
      (match_operand:GPI 2 "aarch64_pluslong_operand" "I,r,w,J,Uaa,Uai,Uav")))]
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/scalar_addp.c b/gcc/testsuite/gcc.target/aarch64/simd/scalar_addp.c
new file mode 100644
index 0000000000000000000000000000000000000000..5b8d40f19884fc7b4e7decd80758bc36fa76d058
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/simd/scalar_addp.c
@@ -0,0 +1,70 @@
+/* { dg-do assemble } */
+/* { dg-additional-options "-save-temps -O1 -std=c99" } */
+/* { dg-final { check-function-bodies "**" "" "" { target { le } } } } */
+
+typedef long long v2di __attribute__((vector_size (16)));
+typedef unsigned long long v2udi __attribute__((vector_size (16)));
+typedef int v2si __attribute__((vector_size (16)));
+typedef unsigned int v2usi __attribute__((vector_size (16)));
+
+/*
+** foo:
+** 	addp	d0, v0.2d
+** 	fmov	x0, d0
+** 	ret
+*/
+long long
+foo (v2di x)
+{
+  return x[1] + x[0];
+}
+
+/*
+** foo1:
+** 	saddlp	v0.1d, v0.2s
+** 	fmov	x0, d0
+** 	ret
+*/
+long long
+foo1 (v2si x)
+{
+  return x[1] + x[0];
+}
+
+/*
+** foo2:
+** 	uaddlp	v0.1d, v0.2s
+** 	fmov	x0, d0
+** 	ret
+*/
+unsigned long long
+foo2 (v2usi x)
+{
+  return x[1] + x[0];
+}
+
+/*
+** foo3:
+** 	uaddlp	v0.1d, v0.2s
+** 	add	d0, d0, d1
+** 	fmov	x0, d0
+** 	ret
+*/
+unsigned long long
+foo3 (v2usi x, v2udi y)
+{
+  return (x[1] + x[0]) + y[0];
+}
+
+/*
+** foo4:
+** 	saddlp	v0.1d, v0.2s
+** 	add	d0, d0, d1
+** 	fmov	x0, d0
+** 	ret
+*/
+long long
+foo4 (v2si x, v2di y)
+{
+  return (x[1] + x[0]) + y[0];
+}
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/scalar_faddp.c b/gcc/testsuite/gcc.target/aarch64/simd/scalar_faddp.c
new file mode 100644
index 0000000000000000000000000000000000000000..ff455e060fc833b2f63e89c467b91a76fbe31aff
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/simd/scalar_faddp.c
@@ -0,0 +1,66 @@
+/* { dg-do assemble } */
+/* { dg-require-effective-target arm_v8_2a_fp16_scalar_ok } */
+/* { dg-add-options arm_v8_2a_fp16_scalar } */
+/* { dg-additional-options "-save-temps -O1" } */
+/* { dg-final { check-function-bodies "**" "" "" { target { le } } } } */
+
+typedef double v2df __attribute__((vector_size (16)));
+typedef float v4sf __attribute__((vector_size (16)));
+typedef __fp16 v8hf __attribute__((vector_size (16)));
+
+/*
+** foo:
+** 	faddp	d0, v0.2d
+** 	ret
+*/
+double
+foo (v2df x)
+{
+  return x[1] + x[0];
+}
+
+/*
+** foo1:
+** 	faddp	s0, v0.2s
+** 	ret
+*/
+float
+foo1 (v4sf x)
+{
+  return x[0] + x[1];
+}
+
+/*
+** foo2:
+** 	faddp	h0, v0.2h
+** 	ret
+*/
+__fp16
+foo2 (v8hf x)
+{
+  return x[0] + x[1];
+}
+
+/*
+** foo3:
+** 	ext	v0.16b, v0.16b, v0.16b, #4
+** 	faddp	s0, v0.2s
+** 	ret
+*/
+float
+foo3 (v4sf x)
+{
+  return x[1] + x[2];
+}
+
+/*
+** foo4:
+** 	dup	s0, v0.s\[3\]
+** 	faddp	h0, v0.2h
+** 	ret
+*/
+__fp16
+foo4 (v8hf x)
+{
+  return x[6] + x[7];
+}
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/scalar_faddp2.c b/gcc/testsuite/gcc.target/aarch64/simd/scalar_faddp2.c
new file mode 100644
index 0000000000000000000000000000000000000000..04412c3b45c51648e46ff20f730b1213e940391a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/simd/scalar_faddp2.c
@@ -0,0 +1,14 @@
+/* { dg-do assemble } */
+/* { dg-additional-options "-save-temps -O1 -w" } */
+
+typedef __m128i __attribute__((__vector_size__(2 * sizeof(long))));
+double a[];
+*b;
+fn1() {
+  __m128i c;
+  *(__m128i *)a = c;
+  *b = a[0] + a[1];
+}
+
+/* { dg-final { scan-assembler-times {faddp\td0, v0\.2d} 1 } } */
+
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/scalar_fmaxp.c b/gcc/testsuite/gcc.target/aarch64/simd/scalar_fmaxp.c
new file mode 100644
index 0000000000000000000000000000000000000000..aa1d2bf17cd707b74d8f7c574506610ab4fd7299
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/simd/scalar_fmaxp.c
@@ -0,0 +1,56 @@
+/* { dg-do assemble } */
+/* { dg-require-effective-target arm_v8_2a_fp16_scalar_ok } */
+/* { dg-add-options arm_v8_2a_fp16_scalar } */
+/* { dg-additional-options "-save-temps -O1" } */
+/* { dg-final { check-function-bodies "**" "" "" { target { le } } } } */
+
+typedef double v2df __attribute__((vector_size (16)));
+typedef float v4sf __attribute__((vector_size (16)));
+typedef __fp16 v8hf __attribute__((vector_size (16)));
+
+/*
+** foo:
+** 	fmaxnmp	d0, v0.2d
+** 	ret
+*/
+double
+foo (v2df x)
+{
+  return x[0] > x[1] ? x[0] : x[1];
+}
+
+/* 
+** foo1:
+** 	fmaxnmp	s0, v0.2s
+** 	ret
+*/
+float
+foo1 (v4sf x)
+{
+  return x[0] > x[1] ? x[0] : x[1];
+}
+
+/*
+** foo2:
+** 	fmaxnmp	h0, v0.2h
+** 	ret
+*/
+__fp16
+foo2 (v8hf x)
+{
+  return x[0] > x[1] ? x[0] : x[1];
+}
+
+/* 
+** foo3:
+** 	fmaxnmp	s0, v0.2s
+** 	fcvt	d0, s0
+** 	fadd	d0, d0, d1
+** 	ret
+*/
+double
+foo3 (v4sf x, v2df y)
+{
+  return (x[0] > x[1] ? x[0] : x[1]) + y[0];
+}
+
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/scalar_fminp.c b/gcc/testsuite/gcc.target/aarch64/simd/scalar_fminp.c
new file mode 100644
index 0000000000000000000000000000000000000000..6136c5272069c4d86f09951cdff25f1494e839f0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/simd/scalar_fminp.c
@@ -0,0 +1,55 @@
+/* { dg-do assemble } */
+/* { dg-require-effective-target arm_v8_2a_fp16_scalar_ok } */
+/* { dg-add-options arm_v8_2a_fp16_scalar } */
+/* { dg-additional-options "-save-temps -O1" } */
+/* { dg-final { check-function-bodies "**" "" "" { target { le } } } } */
+
+typedef double v2df __attribute__((vector_size (16)));
+typedef float v4sf __attribute__((vector_size (16)));
+typedef __fp16 v8hf __attribute__((vector_size (16)));
+
+/*
+** foo:
+** 	fminnmp	d0, v0.2d
+** 	ret
+*/
+double
+foo (v2df x)
+{
+  return x[0] < x[1] ? x[0] : x[1];
+}
+
+/* 
+** foo1:
+** 	fminnmp	s0, v0.2s
+** 	ret
+*/
+float
+foo1 (v4sf x)
+{
+  return x[0] < x[1] ? x[0] : x[1];
+}
+
+/* 
+** foo2:
+** 	fminnmp	h0, v0.2h
+** 	ret
+*/
+__fp16
+foo2 (v8hf x)
+{
+  return x[0] < x[1] ? x[0] : x[1];
+}
+
+/* 
+** foo3:
+** 	fminnmp	s0, v0.2s
+** 	fcvt	d0, s0
+** 	fadd	d0, d0, d1
+** 	ret
+*/
+double
+foo3 (v4sf x, v2df y)
+{
+  return (x[0] < x[1] ? x[0] : x[1]) + y[0];
+}
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/scalar_maxp.c b/gcc/testsuite/gcc.target/aarch64/simd/scalar_maxp.c
new file mode 100644
index 0000000000000000000000000000000000000000..e219a13abc745b83dca58633fd2d812e276d6b2d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/simd/scalar_maxp.c
@@ -0,0 +1,74 @@
+/* { dg-do assemble } */
+/* { dg-additional-options "-save-temps -O1 -std=c99" } */
+/* { dg-final { check-function-bodies "**" "" "" { target { le } } } } */
+
+typedef long long v2di __attribute__((vector_size (16)));
+typedef unsigned long long v2udi __attribute__((vector_size (16)));
+typedef int v2si __attribute__((vector_size (16)));
+typedef unsigned int v2usi __attribute__((vector_size (16)));
+
+/*
+** foo:
+** 	umov	x0, v0.d\[1\]
+** 	fmov	x1, d0
+** 	cmp	x0, x1
+** 	csel	x0, x0, x1, ge
+** 	ret
+*/
+long long
+foo (v2di x)
+{
+  return x[0] > x[1] ? x[0] : x[1];
+}
+
+/* 
+** foo1:
+** 	smaxp	v0.2s, v0.2s, v0.2s
+** 	smov	x0, v0.s\[0\]
+** 	ret
+*/
+long long
+foo1 (v2si x)
+{
+  return x[0] > x[1] ? x[0] : x[1];
+}
+
+/* 
+** foo2:
+** 	umaxp	v0.2s, v0.2s, v0.2s
+** 	fmov	w0, s0
+** 	ret
+*/
+unsigned long long
+foo2 (v2usi x)
+{
+  return x[0] > x[1] ? x[0] : x[1];
+}
+
+/* 
+** foo3:
+** 	umaxp	v0.2s, v0.2s, v0.2s
+** 	fmov	w0, s0
+** 	fmov	x1, d1
+** 	add	x0, x1, w0, uxtw
+** 	ret
+*/
+unsigned long long
+foo3 (v2usi x, v2udi y)
+{
+  return (x[0] > x[1] ? x[0] : x[1]) + y[0];
+}
+
+/* 
+** foo4:
+** 	smaxp	v0.2s, v0.2s, v0.2s
+** 	fmov	w0, s0
+** 	fmov	x1, d1
+** 	add	x0, x1, w0, sxtw
+** 	ret
+*/
+long long
+foo4 (v2si x, v2di y)
+{
+  return (x[0] > x[1] ? x[0] : x[1]) + y[0];
+}
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/scalar_minp.c b/gcc/testsuite/gcc.target/aarch64/simd/scalar_minp.c
new file mode 100644
index 0000000000000000000000000000000000000000..2a32fb4ea3edaa4c547a7a481c3ddca6b477430e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/simd/scalar_minp.c
@@ -0,0 +1,74 @@
+/* { dg-do assemble } */
+/* { dg-additional-options "-save-temps -O1 -std=c99" } */
+/* { dg-final { check-function-bodies "**" "" "" { target { le } } } } */
+
+typedef long long v2di __attribute__((vector_size (16)));
+typedef unsigned long long v2udi __attribute__((vector_size (16)));
+typedef int v2si __attribute__((vector_size (16)));
+typedef unsigned int v2usi __attribute__((vector_size (16)));
+
+/*
+** foo:
+** 	umov	x0, v0.d\[1\]
+** 	fmov	x1, d0
+** 	cmp	x0, x1
+** 	csel	x0, x0, x1, le
+** 	ret
+*/
+long long
+foo (v2di x)
+{
+  return x[0] < x[1] ? x[0] : x[1];
+}
+
+/* 
+** foo1:
+** 	sminp	v0.2s, v0.2s, v0.2s
+** 	smov	x0, v0.s\[0\]
+** 	ret
+*/
+long long
+foo1 (v2si x)
+{
+  return x[0] < x[1] ? x[0] : x[1];
+}
+
+/* 
+** foo2:
+** 	uminp	v0.2s, v0.2s, v0.2s
+** 	fmov	w0, s0
+** 	ret
+*/
+unsigned long long
+foo2 (v2usi x)
+{
+  return x[0] < x[1] ? x[0] : x[1];
+}
+
+/* 
+** foo3:
+** 	uminp	v0.2s, v0.2s, v0.2s
+** 	fmov	w0, s0
+** 	fmov	x1, d1
+** 	add	x0, x1, w0, uxtw
+** 	ret
+*/
+unsigned long long
+foo3 (v2usi x, v2udi y)
+{
+  return (x[0] < x[1] ? x[0] : x[1]) + y[0];
+}
+
+/* 
+** foo4:
+** 	sminp	v0.2s, v0.2s, v0.2s
+** 	fmov	w0, s0
+** 	fmov	x1, d1
+** 	add	x0, x1, w0, sxtw
+** 	ret
+*/
+long long
+foo4 (v2si x, v2di y)
+{
+  return (x[0] < x[1] ? x[0] : x[1]) + y[0];
+}




-- 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs
  2022-10-31 11:56 [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs Tamar Christina
                   ` (6 preceding siblings ...)
  2022-10-31 12:00 ` [PATCH 8/8]AArch64: Have reload not choose to do add on the scalar side if both values exist on the SIMD side Tamar Christina
@ 2022-10-31 21:41 ` Jeff Law
  2022-11-05 11:32 ` Richard Biener
  8 siblings, 0 replies; 50+ messages in thread
From: Jeff Law @ 2022-10-31 21:41 UTC (permalink / raw)
  To: Tamar Christina, gcc-patches; +Cc: nd, rguenther


On 10/31/22 05:56, Tamar Christina wrote:
> Hi All,
>
> This patch series is to add recognition of pairwise operations (reductions)
> in match.pd such that we can benefit from them even at -O1 when the vectorizer
> isn't enabled.
>
> Ths use of these allow for a lot simpler codegen in AArch64 and allows us to
> avoid quite a lot of codegen warts.
>
> As an example a simple:
>
> typedef float v4sf __attribute__((vector_size (16)));
>
> float
> foo3 (v4sf x)
> {
>    return x[1] + x[2];
> }
>
> currently generates:
>
> foo3:
>          dup     s1, v0.s[1]
>          dup     s0, v0.s[2]
>          fadd    s0, s1, s0
>          ret
>
> while with this patch series now generates:
>
> foo3:
> 	ext	v0.16b, v0.16b, v0.16b, #4
> 	faddp	s0, v0.2s
> 	ret
>
> This patch will not perform the operation if the source is not a gimple
> register and leaves memory sources to the vectorizer as it's able to deal
> correctly with clobbers.
>
> The use of these instruction makes a significant difference in codegen quality
> for AArch64 and Arm.
>
> NOTE: The last entry in the series contains tests for all of the previous
> patches as it's a bit of an all or nothing thing.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
> and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 	* match.pd (adjacent_data_access_p): Import.
> 	Add new pattern for bitwise plus, min, max, fmax, fmin.
> 	* tree-cfg.cc (verify_gimple_call): Allow function arguments in IFNs.
> 	* tree.cc (adjacent_data_access_p): New.
> 	* tree.h (adjacent_data_access_p): New.

Nice stuff.  I'd pondered some similar stuff at Tachyum, but got dragged 
away before it could be implemented.





> diff --git a/gcc/tree.cc b/gcc/tree.cc
> index 007c9325b17076f474e6681c49966c59cf6b91c7..5315af38a1ead89ca5f75dc4b19de9841e29d311 100644
> --- a/gcc/tree.cc
> +++ b/gcc/tree.cc
> @@ -10457,6 +10457,90 @@ bitmask_inv_cst_vector_p (tree t)
>     return builder.build ();
>   }
>   
> +/* Returns base address if the two operands represent adjacent access of data
> +   such that a pairwise operation can be used.  OP1 must be a lower subpart
> +   than OP2.  If POS is not NULL then on return if a value is returned POS
> +   will indicate the position of the lower address.  If COMMUTATIVE_P then
> +   the operation is also tried by flipping op1 and op2.  */
> +
> +tree adjacent_data_access_p (tree op1, tree op2, poly_uint64 *pos,
> +			     bool commutative_p)

Formatting nit.  Return type on a different line.
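
For clarity, a quick sketch of the requested layout (the declaration itself is
unchanged from the hunk above, only the formatting differs):

    /* GNU style: return type on its own line.  */
    tree
    adjacent_data_access_p (tree op1, tree op2, poly_uint64 *pos,
                            bool commutative_p)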


OK with that fixed.


jeff



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/8]middle-end: Recognize scalar widening reductions
  2022-10-31 11:57 ` [PATCH 2/8]middle-end: Recognize scalar widening reductions Tamar Christina
@ 2022-10-31 21:42   ` Jeff Law
  2022-11-07 13:21   ` Richard Biener
  1 sibling, 0 replies; 50+ messages in thread
From: Jeff Law @ 2022-10-31 21:42 UTC (permalink / raw)
  To: Tamar Christina, gcc-patches; +Cc: nd, rguenther


On 10/31/22 05:57, Tamar Christina wrote:
> Hi All,
>
> This adds a new optab and IFNs for REDUC_PLUS_WIDEN where the resulting
> scalar reduction has twice the precision of the input elements.
>
> At some point in a later patch I will also teach the vectorizer to recognize
> this builtin once I figure out how the various bits of reductions work.
>
> For now it's generated only by the match.pd pattern.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
> and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 	* internal-fn.def (REDUC_PLUS_WIDEN): New.
> 	* doc/md.texi: Document it.
> 	* match.pd: Recognize widening plus.
> 	* optabs.def (reduc_splus_widen_scal_optab,
> 	reduc_uplus_widen_scal_optab): New.

OK

jeff



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 3/8]middle-end: Support extractions of subvectors from arbitrary element position inside a vector
  2022-10-31 11:57 ` [PATCH 3/8]middle-end: Support extractions of subvectors from arbitrary element position inside a vector Tamar Christina
@ 2022-10-31 21:44   ` Jeff Law
  2022-11-01 14:25   ` Richard Sandiford
  1 sibling, 0 replies; 50+ messages in thread
From: Jeff Law @ 2022-10-31 21:44 UTC (permalink / raw)
  To: Tamar Christina, gcc-patches; +Cc: nd, rguenther


On 10/31/22 05:57, Tamar Christina wrote:
> Hi All,
>
> The current vector extract pattern can only extract from a vector when the
> position to extract is a multiple of the vector bitsize as a whole.
>
> That means extract something like a V2SI from a V4SI vector from position 32
> isn't possible as 32 is not a multiple of 64.  Ideally this optab should have
> worked on multiple of the element size, but too many targets rely on this
> semantic now.
>
> So instead add a new case which allows any extraction as long as the bit pos
> is a multiple of the element size.  We use a VEC_PERM to shuffle the elements
> into the bottom parts of the vector and then use a subreg to extract the values
> out.  This now allows various vector operations that before were being
> decomposed into very inefficient scalar operations.
>
> NOTE: I added 3 testcases, I only fixed the 3rd one.
>
> The 1st one missed because we don't optimize VEC_PERM expressions into
> bitfields.  The 2nd one is missed because extract_bit_field only works on
> vector modes.  In this case the intermediate extract is DImode.
>
> On targets where the scalar mode is tiable to vector modes the extract should
> work fine.
>
> However I ran out of time to fix the first two and so will do so in GCC 14.
> For now this catches the case that my pattern now introduces more easily.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
> and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 	* expmed.cc (extract_bit_field_1): Add support for vector element
> 	extracts.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.target/aarch64/ext_1.c: New.

OK.

jeff



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 3/8]middle-end: Support extractions of subvectors from arbitrary element position inside a vector
  2022-10-31 11:57 ` [PATCH 3/8]middle-end: Support extractions of subvectors from arbitrary element position inside a vector Tamar Christina
  2022-10-31 21:44   ` Jeff Law
@ 2022-11-01 14:25   ` Richard Sandiford
  2022-11-11 14:33     ` Tamar Christina
  1 sibling, 1 reply; 50+ messages in thread
From: Richard Sandiford @ 2022-11-01 14:25 UTC (permalink / raw)
  To: Tamar Christina via Gcc-patches; +Cc: Tamar Christina, nd, rguenther

Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> Hi All,
>
> The current vector extract pattern can only extract from a vector when the
> position to extract is a multiple of the vector bitsize as a whole.
>
> That means extract something like a V2SI from a V4SI vector from position 32
> isn't possible as 32 is not a multiple of 64.  Ideally this optab should have
> worked on multiple of the element size, but too many targets rely on this
> semantic now.
>
> So instead add a new case which allows any extraction as long as the bit pos
> is a multiple of the element size.  We use a VEC_PERM to shuffle the elements
> into the bottom parts of the vector and then use a subreg to extract the values
> out.  This now allows various vector operations that before were being
> decomposed into very inefficient scalar operations.
>
> NOTE: I added 3 testcases, I only fixed the 3rd one.
>
> The 1st one missed because we don't optimize VEC_PERM expressions into
> bitfields.  The 2nd one is missed because extract_bit_field only works on
> vector modes.  In this case the intermediate extract is DImode.
>
> On targets where the scalar mode is tiable to vector modes the extract should
> work fine.
>
> However I ran out of time to fix the first two and so will do so in GCC 14.
> For now this catches the case that my pattern now introduces more easily.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
> and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 	* expmed.cc (extract_bit_field_1): Add support for vector element
> 	extracts.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.target/aarch64/ext_1.c: New.
>
> --- inline copy of patch -- 
> diff --git a/gcc/expmed.cc b/gcc/expmed.cc
> index bab020c07222afa38305ef8d7333f271b1965b78..ffdf65210d17580a216477cfe4ac1598941ac9e4 100644
> --- a/gcc/expmed.cc
> +++ b/gcc/expmed.cc
> @@ -1718,6 +1718,45 @@ extract_bit_field_1 (rtx str_rtx, poly_uint64 bitsize, poly_uint64 bitnum,
>  	      return target;
>  	    }
>  	}
> +      else if (!known_eq (bitnum, 0U)
> +	       && multiple_p (GET_MODE_UNIT_BITSIZE (tmode), bitnum, &pos))
> +	{
> +	  /* The encoding has a single stepped pattern.  */
> +	  poly_uint64 nunits = GET_MODE_NUNITS (new_mode);
> +	  int nelts = nunits.to_constant ();
> +	  vec_perm_builder sel (nunits, nelts, 1);
> +	  int delta = -pos.to_constant ();
> +	  for (int i = 0; i < nelts; ++i)
> +	    sel.quick_push ((i - delta) % nelts);
> +	  vec_perm_indices indices (sel, 1, nunits);

Thanks for doing this, looks good.  But I don't think the to_constant
calls are safe.  new_mode and pos could in principle be nonconstant.

To build a stepped pattern, we just need:

    vec_perm_builder sel (nunits, 1, 3);

and then push pos, pos + 1, and pos + 2 to it.  There's no need to
clamp the position to nelts, it happens automatically.
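
Roughly, as a sketch only (it reuses the names from the hunk above; the exact
types and conversions are assumptions rather than tested code):

    /* One stepped pattern of three elements starting at POS; the
       encoding extends it to the full vector, wrapping as needed.  */
    vec_perm_builder sel (nunits, 1, 3);
    for (int i = 0; i < 3; ++i)
      sel.quick_push (pos + i);
    vec_perm_indices indices (sel, 1, nunits);
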

> +
> +	  if (can_vec_perm_const_p (new_mode, new_mode, indices, false))
> +	    {
> +	      class expand_operand ops[4];
> +	      machine_mode outermode = new_mode;
> +	      machine_mode innermode = tmode;
> +	      enum insn_code icode
> +		= direct_optab_handler (vec_perm_optab, outermode);
> +	      target = gen_reg_rtx (outermode);
> +	      if (icode != CODE_FOR_nothing)
> +		{
> +		  rtx sel = vec_perm_indices_to_rtx (outermode, indices);
> +		  create_output_operand (&ops[0], target, outermode);
> +		  ops[0].target = 1;
> +		  create_input_operand (&ops[1], op0, outermode);
> +		  create_input_operand (&ops[2], op0, outermode);
> +		  create_input_operand (&ops[3], sel, outermode);

I think this should be GET_MODE (sel).  Looks like the current
version would ICE for float vectors.  That said...

> +		  if (maybe_expand_insn (icode, 4, ops))
> +		    return simplify_gen_subreg (innermode, target, outermode, 0);
> +		}
> +	      else if (targetm.vectorize.vec_perm_const != NULL)
> +		{
> +		  if (targetm.vectorize.vec_perm_const (outermode, outermode,
> +							target, op0, op0, indices))
> +		    return simplify_gen_subreg (innermode, target, outermode, 0);
> +		}

...can we use expand_vec_perm_const here?  It will try the constant
expansion first, which is the preferred order.  It also has a few
variations up its sleeve.
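
Something like the following, as a rough sketch only (the NULL_RTX target and
the choice of sel_mode here are assumptions, not taken from the patch):

    /* Let expand_vec_perm_const pick the strategy: it tries the constant
       permute expansion first and has its own fallbacks.  */
    if (rtx res = expand_vec_perm_const (new_mode, op0, op0, sel,
                                         new_mode, NULL_RTX))
      return simplify_gen_subreg (tmode, res, new_mode, 0);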

Thanks,
Richard


> +	    }
> +	}
>      }
>  
>    /* See if we can get a better vector mode before extracting.  */
> diff --git a/gcc/testsuite/gcc.target/aarch64/ext_1.c b/gcc/testsuite/gcc.target/aarch64/ext_1.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..18a10a14f1161584267a8472e571b3bc2ddf887a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/ext_1.c
> @@ -0,0 +1,54 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +#include <string.h>
> +
> +typedef unsigned int v4si __attribute__((vector_size (16)));
> +typedef unsigned int v2si __attribute__((vector_size (8)));
> +
> +/*
> +** extract: { xfail *-*-* }
> +**	ext	v0.16b, v0.16b, v0.16b, #4
> +**	ret
> +*/
> +v2si extract (v4si x)
> +{
> +    v2si res = {x[1], x[2]};
> +    return res;
> +}
> +
> +/*
> +** extract1: { xfail *-*-* }
> +**	ext	v0.16b, v0.16b, v0.16b, #4
> +**	ret
> +*/
> +v2si extract1 (v4si x)
> +{
> +    v2si res;
> +    memcpy (&res, ((int*)&x)+1, sizeof(res));
> +    return res;
> +}
> +
> +typedef struct cast {
> +  int a;
> +  v2si b __attribute__((packed));
> +} cast_t;
> +
> +typedef union Data {
> +   v4si x;
> +   cast_t y;
> +} data;  
> +
> +/*
> +** extract2:
> +**	ext	v0.16b, v0.16b, v0.16b, #4
> +**	ret
> +*/
> +v2si extract2 (v4si x)
> +{
> +    data d;
> +    d.x = x;
> +    return d.y.b;
> +}
> +

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 4/8]AArch64 aarch64: Implement widening reduction patterns
  2022-10-31 11:58 ` [PATCH 4/8]AArch64 aarch64: Implement widening reduction patterns Tamar Christina
@ 2022-11-01 14:41   ` Richard Sandiford
  0 siblings, 0 replies; 50+ messages in thread
From: Richard Sandiford @ 2022-11-01 14:41 UTC (permalink / raw)
  To: Tamar Christina
  Cc: gcc-patches, nd, Richard.Earnshaw, Marcus.Shawcroft, Kyrylo.Tkachov

Tamar Christina <tamar.christina@arm.com> writes:
> Hi All,
>
> This implements the new widening reduction optab in the backend.
> Instead of introducing a duplicate definition for the same thing I have
> renamed the intrinsics defintions to use the same optab.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 	* config/aarch64/aarch64-simd-builtins.def (saddlv, uaddlv): Rename to
> 	reduc_splus_widen_scal_ and reduc_uplus_widen_scal_ respectively.
> 	* config/aarch64/aarch64-simd.md (aarch64_<su>addlv<mode>): Renamed to
> 	...
> 	(reduc_<su>plus_widen_scal_<mode>): ... This.
> 	* config/aarch64/arm_neon.h (vaddlv_s8, vaddlv_s16, vaddlv_u8,
> 	vaddlv_u16, vaddlvq_s8, vaddlvq_s16, vaddlvq_s32, vaddlvq_u8,
> 	vaddlvq_u16, vaddlvq_u32, vaddlv_s32, vaddlv_u32): Use it.

OK, thanks.

Richard

> --- inline copy of patch -- 
> diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def b/gcc/config/aarch64/aarch64-simd-builtins.def
> index cf46b31627b84476a25762ffc708fd84a4086e43..a4b21e1495c5699d8557a4bcb9e73ef98ae60b35 100644
> --- a/gcc/config/aarch64/aarch64-simd-builtins.def
> +++ b/gcc/config/aarch64/aarch64-simd-builtins.def
> @@ -190,9 +190,9 @@
>    BUILTIN_VDQV_L (UNOP, saddlp, 0, NONE)
>    BUILTIN_VDQV_L (UNOPU, uaddlp, 0, NONE)
>  
> -  /* Implemented by aarch64_<su>addlv<mode>.  */
> -  BUILTIN_VDQV_L (UNOP, saddlv, 0, NONE)
> -  BUILTIN_VDQV_L (UNOPU, uaddlv, 0, NONE)
> +  /* Implemented by reduc_<su>plus_widen_scal_<mode>.  */
> +  BUILTIN_VDQV_L (UNOP, reduc_splus_widen_scal_, 10, NONE)
> +  BUILTIN_VDQV_L (UNOPU, reduc_uplus_widen_scal_, 10, NONE)
>  
>    /* Implemented by aarch64_<su>abd<mode>.  */
>    BUILTIN_VDQ_BHSI (BINOP, sabd, 0, NONE)
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index cf8c094bd4b76981cef2dd5dd7b8e6be0d56101f..25aed74f8cf939562ed65a578fe32ca76605b58a 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -3455,7 +3455,7 @@ (define_expand "reduc_plus_scal_v4sf"
>    DONE;
>  })
>  
> -(define_insn "aarch64_<su>addlv<mode>"
> +(define_insn "reduc_<su>plus_widen_scal_<mode>"
>   [(set (match_operand:<VWIDE_S> 0 "register_operand" "=w")
>         (unspec:<VWIDE_S> [(match_operand:VDQV_L 1 "register_operand" "w")]
>  		    USADDLV))]
> diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
> index cf6af728ca99dae1cb6ab647466cfec32f7e913e..7b2c4c016191bcd6c3e075d27810faedb23854b7 100644
> --- a/gcc/config/aarch64/arm_neon.h
> +++ b/gcc/config/aarch64/arm_neon.h
> @@ -3664,70 +3664,70 @@ __extension__ extern __inline int16_t
>  __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
>  vaddlv_s8 (int8x8_t __a)
>  {
> -  return __builtin_aarch64_saddlvv8qi (__a);
> +  return __builtin_aarch64_reduc_splus_widen_scal_v8qi (__a);
>  }
>  
>  __extension__ extern __inline int32_t
>  __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
>  vaddlv_s16 (int16x4_t __a)
>  {
> -  return __builtin_aarch64_saddlvv4hi (__a);
> +  return __builtin_aarch64_reduc_splus_widen_scal_v4hi (__a);
>  }
>  
>  __extension__ extern __inline uint16_t
>  __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
>  vaddlv_u8 (uint8x8_t __a)
>  {
> -  return __builtin_aarch64_uaddlvv8qi_uu (__a);
> +  return __builtin_aarch64_reduc_uplus_widen_scal_v8qi_uu (__a);
>  }
>  
>  __extension__ extern __inline uint32_t
>  __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
>  vaddlv_u16 (uint16x4_t __a)
>  {
> -  return __builtin_aarch64_uaddlvv4hi_uu (__a);
> +  return __builtin_aarch64_reduc_uplus_widen_scal_v4hi_uu (__a);
>  }
>  
>  __extension__ extern __inline int16_t
>  __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
>  vaddlvq_s8 (int8x16_t __a)
>  {
> -  return __builtin_aarch64_saddlvv16qi (__a);
> +  return __builtin_aarch64_reduc_splus_widen_scal_v16qi (__a);
>  }
>  
>  __extension__ extern __inline int32_t
>  __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
>  vaddlvq_s16 (int16x8_t __a)
>  {
> -  return __builtin_aarch64_saddlvv8hi (__a);
> +  return __builtin_aarch64_reduc_splus_widen_scal_v8hi (__a);
>  }
>  
>  __extension__ extern __inline int64_t
>  __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
>  vaddlvq_s32 (int32x4_t __a)
>  {
> -  return __builtin_aarch64_saddlvv4si (__a);
> +  return __builtin_aarch64_reduc_splus_widen_scal_v4si (__a);
>  }
>  
>  __extension__ extern __inline uint16_t
>  __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
>  vaddlvq_u8 (uint8x16_t __a)
>  {
> -  return __builtin_aarch64_uaddlvv16qi_uu (__a);
> +  return __builtin_aarch64_reduc_uplus_widen_scal_v16qi_uu (__a);
>  }
>  
>  __extension__ extern __inline uint32_t
>  __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
>  vaddlvq_u16 (uint16x8_t __a)
>  {
> -  return __builtin_aarch64_uaddlvv8hi_uu (__a);
> +  return __builtin_aarch64_reduc_uplus_widen_scal_v8hi_uu (__a);
>  }
>  
>  __extension__ extern __inline uint64_t
>  __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
>  vaddlvq_u32 (uint32x4_t __a)
>  {
> -  return __builtin_aarch64_uaddlvv4si_uu (__a);
> +  return __builtin_aarch64_reduc_uplus_widen_scal_v4si_uu (__a);
>  }
>  
>  __extension__ extern __inline float32x2_t
> @@ -6461,14 +6461,14 @@ __extension__ extern __inline int64_t
>  __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
>  vaddlv_s32 (int32x2_t __a)
>  {
> -  return __builtin_aarch64_saddlvv2si (__a);
> +  return __builtin_aarch64_reduc_splus_widen_scal_v2si (__a);
>  }
>  
>  __extension__ extern __inline uint64_t
>  __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
>  vaddlv_u32 (uint32x2_t __a)
>  {
> -  return __builtin_aarch64_uaddlvv2si_uu (__a);
> +  return __builtin_aarch64_reduc_uplus_widen_scal_v2si_uu (__a);
>  }
>  
>  __extension__ extern __inline int16x4_t

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 5/8]AArch64 aarch64: Make existing V2HF be usable.
  2022-10-31 11:58 ` [PATCH 5/8]AArch64 aarch64: Make existing V2HF be usable Tamar Christina
@ 2022-11-01 14:58   ` Richard Sandiford
  2022-11-01 15:11     ` Tamar Christina
  2022-11-11 14:39     ` Tamar Christina
  0 siblings, 2 replies; 50+ messages in thread
From: Richard Sandiford @ 2022-11-01 14:58 UTC (permalink / raw)
  To: Tamar Christina
  Cc: gcc-patches, nd, Richard.Earnshaw, Marcus.Shawcroft, Kyrylo.Tkachov

Tamar Christina <tamar.christina@arm.com> writes:
> Hi All,
>
> The backend has an existing V2HFmode that is used by pairwise operations.
> This mode was however never made fully functional.  Amongst other things it was
> never declared as a vector type which made it unusable from the mid-end.
>
> It's also lacking an implementation for load/stores so reload ICEs if this mode
> is every used.  This finishes the implementation by providing the above.
>
> Note that I have created a new iterator VHSDF_P instead of extending VHSDF
> because the previous iterator is used in far more things than just load/stores.
>
> It's also used for instance in intrinsics and extending this would force me to
> provide support for mangling the type while we never expose it through
> intrinsics.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 	* config/aarch64/aarch64-simd.md (*aarch64_simd_movv2hf): New.
> 	(mov<mode>, movmisalign<mode>, aarch64_dup_lane<mode>,
> 	aarch64_store_lane0<mode>, aarch64_simd_vec_set<mode>,
> 	@aarch64_simd_vec_copy_lane<mode>, vec_set<mode>,
> 	reduc_<optab>_scal_<mode>, reduc_<fmaxmin>_scal_<mode>,
> 	aarch64_reduc_<optab>_internal<mode>, aarch64_get_lane<mode>,
> 	vec_init<mode><Vel>, vec_extract<mode><Vel>): Support V2HF.
> 	* config/aarch64/aarch64.cc (aarch64_classify_vector_mode):
> 	Add E_V2HFmode.
> 	* config/aarch64/iterators.md (VHSDF_P): New.
> 	(V2F, VALL_F16_FULL, nunits, Vtype, Vmtype, Vetype, stype, VEL,
> 	Vel, q, vp): Add V2HF.
> 	* config/arm/types.md (neon_fp_reduc_add_h): New.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.target/aarch64/sve/slp_1.c: Update testcase.
>
> --- inline copy of patch -- 
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index 25aed74f8cf939562ed65a578fe32ca76605b58a..93a2888f567460ad10ec050ea7d4f701df4729d1 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -19,10 +19,10 @@
>  ;; <http://www.gnu.org/licenses/>.
>  
>  (define_expand "mov<mode>"
> -  [(set (match_operand:VALL_F16 0 "nonimmediate_operand")
> -	(match_operand:VALL_F16 1 "general_operand"))]
> +  [(set (match_operand:VALL_F16_FULL 0 "nonimmediate_operand")
> +	(match_operand:VALL_F16_FULL 1 "general_operand"))]
>    "TARGET_SIMD"
> -  "
> +{
>    /* Force the operand into a register if it is not an
>       immediate whose use can be replaced with xzr.
>       If the mode is 16 bytes wide, then we will be doing
> @@ -46,12 +46,11 @@ (define_expand "mov<mode>"
>        aarch64_expand_vector_init (operands[0], operands[1]);
>        DONE;
>      }
> -  "
> -)
> +})
>  
>  (define_expand "movmisalign<mode>"
> -  [(set (match_operand:VALL_F16 0 "nonimmediate_operand")
> -        (match_operand:VALL_F16 1 "general_operand"))]
> +  [(set (match_operand:VALL_F16_FULL 0 "nonimmediate_operand")
> +        (match_operand:VALL_F16_FULL 1 "general_operand"))]
>    "TARGET_SIMD && !STRICT_ALIGNMENT"
>  {
>    /* This pattern is not permitted to fail during expansion: if both arguments
> @@ -85,10 +84,10 @@ (define_insn "aarch64_simd_dup<mode>"
>  )
>  
>  (define_insn "aarch64_dup_lane<mode>"
> -  [(set (match_operand:VALL_F16 0 "register_operand" "=w")
> -	(vec_duplicate:VALL_F16
> +  [(set (match_operand:VALL_F16_FULL 0 "register_operand" "=w")
> +	(vec_duplicate:VALL_F16_FULL
>  	  (vec_select:<VEL>
> -	    (match_operand:VALL_F16 1 "register_operand" "w")
> +	    (match_operand:VALL_F16_FULL 1 "register_operand" "w")
>  	    (parallel [(match_operand:SI 2 "immediate_operand" "i")])
>            )))]
>    "TARGET_SIMD"
> @@ -142,6 +141,29 @@ (define_insn "*aarch64_simd_mov<VDMOV:mode>"
>  		     mov_reg, neon_move<q>")]
>  )
>  
> +(define_insn "*aarch64_simd_movv2hf"
> +  [(set (match_operand:V2HF 0 "nonimmediate_operand"
> +		"=w, m,  m,  w, ?r, ?w, ?r, w, w")
> +	(match_operand:V2HF 1 "general_operand"
> +		"m,  Dz, w,  w,  w,  r,  r, Dz, Dn"))]
> +  "TARGET_SIMD_F16INST
> +   && (register_operand (operands[0], V2HFmode)
> +       || aarch64_simd_reg_or_zero (operands[1], V2HFmode))"
> +   "@
> +    ldr\\t%s0, %1
> +    str\\twzr, %0
> +    str\\t%s1, %0
> +    mov\\t%0.2s[0], %1.2s[0]
> +    umov\\t%w0, %1.s[0]
> +    fmov\\t%s0, %1
> +    mov\\t%0, %1
> +    movi\\t%d0, 0
> +    * return aarch64_output_simd_mov_immediate (operands[1], 32);"
> +  [(set_attr "type" "neon_load1_1reg, store_8, neon_store1_1reg,\
> +		     neon_logic, neon_to_gp, f_mcr,\
> +		     mov_reg, neon_move, neon_move")]
> +)
> +
>  (define_insn "*aarch64_simd_mov<VQMOV:mode>"
>    [(set (match_operand:VQMOV 0 "nonimmediate_operand"
>  		"=w, Umn,  m,  w, ?r, ?w, ?r, w")
> @@ -182,7 +204,7 @@ (define_insn "*aarch64_simd_mov<VQMOV:mode>"
>  
>  (define_insn "aarch64_store_lane0<mode>"
>    [(set (match_operand:<VEL> 0 "memory_operand" "=m")
> -	(vec_select:<VEL> (match_operand:VALL_F16 1 "register_operand" "w")
> +	(vec_select:<VEL> (match_operand:VALL_F16_FULL 1 "register_operand" "w")
>  			(parallel [(match_operand 2 "const_int_operand" "n")])))]
>    "TARGET_SIMD
>     && ENDIAN_LANE_N (<nunits>, INTVAL (operands[2])) == 0"
> @@ -1035,11 +1057,11 @@ (define_insn "one_cmpl<mode>2"
>  )
>  
>  (define_insn "aarch64_simd_vec_set<mode>"
> -  [(set (match_operand:VALL_F16 0 "register_operand" "=w,w,w")
> -	(vec_merge:VALL_F16
> -	    (vec_duplicate:VALL_F16
> +  [(set (match_operand:VALL_F16_FULL 0 "register_operand" "=w,w,w")
> +	(vec_merge:VALL_F16_FULL
> +	    (vec_duplicate:VALL_F16_FULL
>  		(match_operand:<VEL> 1 "aarch64_simd_nonimmediate_operand" "w,?r,Utv"))
> -	    (match_operand:VALL_F16 3 "register_operand" "0,0,0")
> +	    (match_operand:VALL_F16_FULL 3 "register_operand" "0,0,0")
>  	    (match_operand:SI 2 "immediate_operand" "i,i,i")))]
>    "TARGET_SIMD"
>    {
> @@ -1061,14 +1083,14 @@ (define_insn "aarch64_simd_vec_set<mode>"
>  )
>  
>  (define_insn "@aarch64_simd_vec_copy_lane<mode>"
> -  [(set (match_operand:VALL_F16 0 "register_operand" "=w")
> -	(vec_merge:VALL_F16
> -	    (vec_duplicate:VALL_F16
> +  [(set (match_operand:VALL_F16_FULL 0 "register_operand" "=w")
> +	(vec_merge:VALL_F16_FULL
> +	    (vec_duplicate:VALL_F16_FULL
>  	      (vec_select:<VEL>
> -		(match_operand:VALL_F16 3 "register_operand" "w")
> +		(match_operand:VALL_F16_FULL 3 "register_operand" "w")
>  		(parallel
>  		  [(match_operand:SI 4 "immediate_operand" "i")])))
> -	    (match_operand:VALL_F16 1 "register_operand" "0")
> +	    (match_operand:VALL_F16_FULL 1 "register_operand" "0")
>  	    (match_operand:SI 2 "immediate_operand" "i")))]
>    "TARGET_SIMD"
>    {
> @@ -1376,7 +1398,7 @@ (define_insn "vec_shr_<mode>"
>  )
>  
>  (define_expand "vec_set<mode>"
> -  [(match_operand:VALL_F16 0 "register_operand")
> +  [(match_operand:VALL_F16_FULL 0 "register_operand")
>     (match_operand:<VEL> 1 "aarch64_simd_nonimmediate_operand")
>     (match_operand:SI 2 "immediate_operand")]
>    "TARGET_SIMD"
> @@ -3503,7 +3525,7 @@ (define_insn "popcount<mode>2"
>  ;; gimple_fold'd to the IFN_REDUC_(MAX|MIN) function.  (This is FP smax/smin).
>  (define_expand "reduc_<optab>_scal_<mode>"
>    [(match_operand:<VEL> 0 "register_operand")
> -   (unspec:<VEL> [(match_operand:VHSDF 1 "register_operand")]
> +   (unspec:<VEL> [(match_operand:VHSDF_P 1 "register_operand")]
>  		 FMAXMINV)]
>    "TARGET_SIMD"
>    {
> @@ -3518,7 +3540,7 @@ (define_expand "reduc_<optab>_scal_<mode>"
>  
>  (define_expand "reduc_<fmaxmin>_scal_<mode>"
>    [(match_operand:<VEL> 0 "register_operand")
> -   (unspec:<VEL> [(match_operand:VHSDF 1 "register_operand")]
> +   (unspec:<VEL> [(match_operand:VHSDF_P 1 "register_operand")]
>  		 FMAXMINNMV)]
>    "TARGET_SIMD"
>    {
> @@ -3562,8 +3584,8 @@ (define_insn "aarch64_reduc_<optab>_internalv2si"
>  )
>  
>  (define_insn "aarch64_reduc_<optab>_internal<mode>"
> - [(set (match_operand:VHSDF 0 "register_operand" "=w")
> -       (unspec:VHSDF [(match_operand:VHSDF 1 "register_operand" "w")]
> + [(set (match_operand:VHSDF_P 0 "register_operand" "=w")
> +       (unspec:VHSDF_P [(match_operand:VHSDF_P 1 "register_operand" "w")]
>  		      FMAXMINV))]
>   "TARGET_SIMD"
>   "<maxmin_uns_op><vp>\\t%<Vetype>0, %1.<Vtype>"
> @@ -4208,7 +4230,7 @@ (define_insn "*aarch64_get_lane_zero_extend<GPI:mode><VDQQH:mode>"
>  (define_insn_and_split "aarch64_get_lane<mode>"
>    [(set (match_operand:<VEL> 0 "aarch64_simd_nonimmediate_operand" "=?r, w, Utv")
>  	(vec_select:<VEL>
> -	  (match_operand:VALL_F16 1 "register_operand" "w, w, w")
> +	  (match_operand:VALL_F16_FULL 1 "register_operand" "w, w, w")
>  	  (parallel [(match_operand:SI 2 "immediate_operand" "i, i, i")])))]
>    "TARGET_SIMD"
>    {
> @@ -7989,7 +8011,7 @@ (define_expand "aarch64_st1<VALL_F16:mode>"
>  ;; Standard pattern name vec_init<mode><Vel>.
>  
>  (define_expand "vec_init<mode><Vel>"
> -  [(match_operand:VALL_F16 0 "register_operand")
> +  [(match_operand:VALL_F16_FULL 0 "register_operand")
>     (match_operand 1 "" "")]
>    "TARGET_SIMD"
>  {
> @@ -8068,7 +8090,7 @@ (define_insn "aarch64_urecpe<mode>"
>  
>  (define_expand "vec_extract<mode><Vel>"
>    [(match_operand:<VEL> 0 "aarch64_simd_nonimmediate_operand")
> -   (match_operand:VALL_F16 1 "register_operand")
> +   (match_operand:VALL_F16_FULL 1 "register_operand")
>     (match_operand:SI 2 "immediate_operand")]
>    "TARGET_SIMD"
>  {
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index f05bac713e88ea8c7feaa2367d55bd523ca66f57..1e08f8453688210afe1566092b19b59c9bdd0c97 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -3566,6 +3566,7 @@ aarch64_classify_vector_mode (machine_mode mode)
>      case E_V8BFmode:
>      case E_V4SFmode:
>      case E_V2DFmode:
> +    case E_V2HFmode:
>        return TARGET_SIMD ? VEC_ADVSIMD : 0;
>  
>      default:
> diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
> index 37d8161a33b1c399d80be82afa67613a087389d4..1df09f7fe2eb35aed96113476541e0faa5393551 100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -160,6 +160,10 @@ (define_mode_iterator VDQF [V2SF V4SF V2DF])
>  (define_mode_iterator VHSDF [(V4HF "TARGET_SIMD_F16INST")
>  			     (V8HF "TARGET_SIMD_F16INST")
>  			     V2SF V4SF V2DF])
> +;; Advanced SIMD Float modes suitable for pairwise operations.
> +(define_mode_iterator VHSDF_P [(V4HF "TARGET_SIMD_F16INST")
> +			       (V8HF "TARGET_SIMD_F16INST")
> +			       V2SF V4SF V2DF (V2HF "TARGET_SIMD_F16INST")])
>  
>  ;; Advanced SIMD Float modes, and DF.
>  (define_mode_iterator VDQF_DF [V2SF V4SF V2DF DF])
> @@ -188,15 +192,23 @@ (define_mode_iterator VDQF_COND [V2SF V2SI V4SF V4SI V2DF V2DI])
>  (define_mode_iterator VALLF [V2SF V4SF V2DF SF DF])
>  
>  ;; Advanced SIMD Float modes with 2 elements.
> -(define_mode_iterator V2F [V2SF V2DF])
> +(define_mode_iterator V2F [V2SF V2DF V2HF])
>  
>  ;; All Advanced SIMD modes on which we support any arithmetic operations.
>  (define_mode_iterator VALL [V8QI V16QI V4HI V8HI V2SI V4SI V2DI V2SF V4SF V2DF])
>  
> -;; All Advanced SIMD modes suitable for moving, loading, and storing.
> +;; All Advanced SIMD modes suitable for moving, loading, and storing
> +;; except V2HF.
>  (define_mode_iterator VALL_F16 [V8QI V16QI V4HI V8HI V2SI V4SI V2DI
>  				V4HF V8HF V4BF V8BF V2SF V4SF V2DF])
>  
> +;; All Advanced SIMD modes suitable for moving, loading, and storing
> +;; including V2HF
> +(define_mode_iterator VALL_F16_FULL [V8QI V16QI V4HI V8HI V2SI V4SI V2DI
> +				     V4HF V8HF V4BF V8BF V2SF V4SF V2DF
> +				     (V2HF "TARGET_SIMD_F16INST")])

This name might cause confusion with the SVE iterators, where FULL
means "every bit of the register is used".  How about something like
VMOVE instead?

With this change, I guess VALL_F16 represents "The set of all modes
for which the vld1 intrinsics are provided" and VMOVE or whatever
is "All Advanced SIMD modes suitable for moving, loading, and storing".
That is, VMOVE extends VALL_F16 with modes that are not manifested
via intrinsics.

> +
> +
>  ;; The VALL_F16 modes except the 128-bit 2-element ones.
>  (define_mode_iterator VALL_F16_NO_V2Q [V8QI V16QI V4HI V8HI V2SI V4SI
>  				V4HF V8HF V2SF V4SF])
> @@ -1076,7 +1088,7 @@ (define_mode_attr nunits [(V8QI "8") (V16QI "16")
>  			  (V2SF "2") (V4SF "4")
>  			  (V1DF "1") (V2DF "2")
>  			  (DI "1") (DF "1")
> -			  (V8DI "8")])
> +			  (V8DI "8") (V2HF "2")])
>  
>  ;; Map a mode to the number of bits in it, if the size of the mode
>  ;; is constant.
> @@ -1090,6 +1102,7 @@ (define_mode_attr s [(HF "h") (SF "s") (DF "d") (SI "s") (DI "d")])
>  
>  ;; Give the length suffix letter for a sign- or zero-extension.
>  (define_mode_attr size [(QI "b") (HI "h") (SI "w")])
> +(define_mode_attr sizel [(QI "b") (HI "h") (SI "")])
>  
>  ;; Give the number of bits in the mode
>  (define_mode_attr sizen [(QI "8") (HI "16") (SI "32") (DI "64")])
> @@ -1134,8 +1147,9 @@ (define_mode_attr Vtype [(V8QI "8b") (V16QI "16b")
>                           (V2SI "2s") (V4SI  "4s")
>                           (DI   "1d") (DF    "1d")
>                           (V2DI "2d") (V2SF "2s")
> -			 (V4SF "4s") (V2DF "2d")
> -			 (V4HF "4h") (V8HF "8h")
> +			 (V2HF "2h") (V4SF "4s")
> +			 (V2DF "2d") (V4HF "4h")
> +			 (V8HF "8h")
>  			 (V2x8QI "8b") (V2x4HI "4h")
>  			 (V2x2SI "2s") (V2x1DI  "1d")
>  			 (V2x4HF "4h") (V2x2SF "2s")

Where is the 2h used, and is it valid syntax in that context?

Same for later instances of 2h.

Thanks,
Richard

> @@ -1175,9 +1189,10 @@ (define_mode_attr Vmtype [(V8QI ".8b") (V16QI ".16b")
>  			 (V4HI ".4h") (V8HI  ".8h")
>  			 (V2SI ".2s") (V4SI  ".4s")
>  			 (V2DI ".2d") (V4HF ".4h")
> -			 (V8HF ".8h") (V4BF ".4h")
> -			 (V8BF ".8h") (V2SF ".2s")
> -			 (V4SF ".4s") (V2DF ".2d")
> +			 (V8HF ".8h") (V2HF ".2h")
> +			 (V4BF ".4h") (V8BF ".8h")
> +			 (V2SF ".2s") (V4SF ".4s")
> +			 (V2DF ".2d")
>  			 (DI   "")    (SI   "")
>  			 (HI   "")    (QI   "")
>  			 (TI   "")    (HF   "")
> @@ -1193,7 +1208,7 @@ (define_mode_attr Vmntype [(V8HI ".8b") (V4SI ".4h")
>  (define_mode_attr Vetype [(V8QI "b") (V16QI "b")
>  			  (V4HI "h") (V8HI  "h")
>  			  (V2SI "s") (V4SI  "s")
> -			  (V2DI "d")
> +			  (V2DI "d") (V2HF  "h")
>  			  (V4HF "h") (V8HF  "h")
>  			  (V2SF "s") (V4SF  "s")
>  			  (V2DF "d")
> @@ -1285,7 +1300,7 @@ (define_mode_attr Vcwtype [(VNx16QI "b") (VNx8QI "h") (VNx4QI "w") (VNx2QI "d")
>  ;; more accurately.
>  (define_mode_attr stype [(V8QI "b") (V16QI "b") (V4HI "s") (V8HI "s")
>  			 (V2SI "s") (V4SI "s") (V2DI "d") (V4HF "s")
> -			 (V8HF "s") (V2SF "s") (V4SF "s") (V2DF "d")
> +			 (V8HF "s") (V2SF "s") (V4SF "s") (V2DF "d") (V2HF "s")
>  			 (HF "s") (SF "s") (DF "d") (QI "b") (HI "s")
>  			 (SI "s") (DI "d")])
>  
> @@ -1360,8 +1375,8 @@ (define_mode_attr VEL [(V8QI  "QI") (V16QI "QI")
>  		       (V4HF "HF") (V8HF  "HF")
>  		       (V2SF "SF") (V4SF  "SF")
>  		       (DF   "DF") (V2DF  "DF")
> -		       (SI   "SI") (HI    "HI")
> -		       (QI   "QI")
> +		       (SI   "SI") (V2HF  "HF")
> +		       (QI   "QI") (HI    "HI")
>  		       (V4BF "BF") (V8BF "BF")
>  		       (VNx16QI "QI") (VNx8QI "QI") (VNx4QI "QI") (VNx2QI "QI")
>  		       (VNx8HI "HI") (VNx4HI "HI") (VNx2HI "HI")
> @@ -1381,7 +1396,7 @@ (define_mode_attr Vel [(V8QI "qi") (V16QI "qi")
>  		       (V2SF "sf") (V4SF "sf")
>  		       (V2DF "df") (DF   "df")
>  		       (SI   "si") (HI   "hi")
> -		       (QI   "qi")
> +		       (QI   "qi") (V2HF "hf")
>  		       (V4BF "bf") (V8BF "bf")
>  		       (VNx16QI "qi") (VNx8QI "qi") (VNx4QI "qi") (VNx2QI "qi")
>  		       (VNx8HI "hi") (VNx4HI "hi") (VNx2HI "hi")
> @@ -1866,7 +1881,7 @@ (define_mode_attr q [(V8QI "") (V16QI "_q")
>  		     (V4HF "") (V8HF "_q")
>  		     (V4BF "") (V8BF "_q")
>  		     (V2SF "") (V4SF  "_q")
> -			       (V2DF  "_q")
> +		     (V2HF "") (V2DF  "_q")
>  		     (QI "") (HI "") (SI "") (DI "") (HF "") (SF "") (DF "")
>  		     (V2x8QI "") (V2x16QI "_q")
>  		     (V2x4HI "") (V2x8HI "_q")
> @@ -1905,6 +1920,7 @@ (define_mode_attr vp [(V8QI "v") (V16QI "v")
>  		      (V2SI "p") (V4SI  "v")
>  		      (V2DI "p") (V2DF  "p")
>  		      (V2SF "p") (V4SF  "v")
> +		      (V2HF "p")
>  		      (V4HF "v") (V8HF  "v")])
>  
>  (define_mode_attr vsi2qi [(V2SI "v8qi") (V4SI "v16qi")
> diff --git a/gcc/config/arm/types.md b/gcc/config/arm/types.md
> index 7d0504bdd944e9c0d1b545b0b66a9a1adc808714..3cfbc7a93cca1bea4925853e51d0a147c5722247 100644
> --- a/gcc/config/arm/types.md
> +++ b/gcc/config/arm/types.md
> @@ -483,6 +483,7 @@ (define_attr "autodetect_type"
>  ; neon_fp_minmax_s_q
>  ; neon_fp_minmax_d
>  ; neon_fp_minmax_d_q
> +; neon_fp_reduc_add_h
>  ; neon_fp_reduc_add_s
>  ; neon_fp_reduc_add_s_q
>  ; neon_fp_reduc_add_d
> @@ -1033,6 +1034,7 @@ (define_attr "type"
>    neon_fp_minmax_d,\
>    neon_fp_minmax_d_q,\
>  \
> +  neon_fp_reduc_add_h,\
>    neon_fp_reduc_add_s,\
>    neon_fp_reduc_add_s_q,\
>    neon_fp_reduc_add_d,\
> @@ -1257,8 +1259,8 @@ (define_attr "is_neon_type" "yes,no"
>            neon_fp_compare_d, neon_fp_compare_d_q, neon_fp_minmax_s,\
>            neon_fp_minmax_s_q, neon_fp_minmax_d, neon_fp_minmax_d_q,\
>            neon_fp_neg_s, neon_fp_neg_s_q, neon_fp_neg_d, neon_fp_neg_d_q,\
> -          neon_fp_reduc_add_s, neon_fp_reduc_add_s_q, neon_fp_reduc_add_d,\
> -          neon_fp_reduc_add_d_q, neon_fp_reduc_minmax_s,
> +          neon_fp_reduc_add_h, neon_fp_reduc_add_s, neon_fp_reduc_add_s_q,\
> +          neon_fp_reduc_add_d, neon_fp_reduc_add_d_q, neon_fp_reduc_minmax_s,\
>            neon_fp_reduc_minmax_s_q, neon_fp_reduc_minmax_d,\
>            neon_fp_reduc_minmax_d_q,\
>            neon_fp_cvt_narrow_s_q, neon_fp_cvt_narrow_d_q,\
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> index 07d71a63414b1066ea431e287286ad048515711a..8e35e0b574d49913b43c7d8d4f4ba75f127f42e9 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> @@ -30,11 +30,9 @@ vec_slp_##TYPE (TYPE *restrict a, TYPE b, TYPE c, int n)	\
>  TEST_ALL (VEC_PERM)
>  
>  /* We should use one DUP for each of the 8-, 16- and 32-bit types,
> -   although we currently use LD1RW for _Float16.  We should use two
> -   DUPs for each of the three 64-bit types.  */
> +   We should use two DUPs for each of the three 64-bit types.  */
>  /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.h, [hw]} 2 } } */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.s, [sw]} 2 } } */
> -/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 1 } } */
> +/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.s, [sw]} 3 } } */
>  /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, [dx]} 9 } } */
>  /* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 3 } } */
>  /* { dg-final { scan-assembler-not {\tzip2\t} } } */


* Re: [PATCH 8/8]AArch64: Have reload not choose to do add on the scalar side if both values exist on the SIMD side.
  2022-10-31 12:00 ` [PATCH 8/8]AArch64: Have reload not choose to do add on the scalar side if both values exist on the SIMD side Tamar Christina
@ 2022-11-01 15:04   ` Richard Sandiford
  2022-11-01 15:20     ` Tamar Christina
  0 siblings, 1 reply; 50+ messages in thread
From: Richard Sandiford @ 2022-11-01 15:04 UTC (permalink / raw)
  To: Tamar Christina
  Cc: gcc-patches, nd, Richard.Earnshaw, Marcus.Shawcroft, Kyrylo.Tkachov

Tamar Christina <tamar.christina@arm.com> writes:
> Hi All,
>
> Currently we often generate an r -> r add even if it means we need two
> reloads to perform it, i.e. in the case that the values are on the SIMD side.
>
> The pairwise operations expose these more now and so we get suboptimal codegen.
>
> Normally I would have liked to use ^ or $ here, but while this works for the
> simple examples, reload inexplicably falls apart on examples that should have
> been trivial. It forces a move to r -> w to use the w ADD, which is counter to
> what ^ and $ should do.
>
> However, ! seems to fix all the regressions and still maintains the good codegen.
>
> I have tried looking into whether it's our costings that are off, but I can't
> see anything logical here.  So I'd like to push this change instead along with
> tests that augment the other testcases that guard the r -> r variants.

This feels like a hack though.  r<-r+r is one of the simplest things
the processor can do, so I don't think it makes logical sense to mark
it with !, which means "prohibitively expensive".  It's likely to
push operations that require reloads onto the SIMD side.

Thanks,
Richard
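
For reference (not part of the posted patch): in GCC constraint strings '!'
marks an alternative as prohibitively expensive, '?' merely disparages it, and
'^'/'$' are the reload-sensitive analogues that only apply the penalty when
the operand actually needs a reload.  A minimal sketch of the '^' variant
Tamar mentions having tried, based on the hunk quoted below, would look
roughly like:

  (define_insn "*add<mode>3_aarch64"
    [(set
      (match_operand:GPI 0 "register_operand" "=rk,^rk,w,rk,r,r,rk")
      (plus:GPI
       (match_operand:GPI 1 "register_operand" "%rk,rk,w,rk,rk,0,rk")
       (match_operand:GPI 2 "aarch64_pluslong_operand" "I,r,w,J,Uaa,Uai,Uav")))]
    ...)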

> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 	* config/aarch64/aarch64.md (*add<mode>3_aarch64): Add ! to the r -> r
> 	alternative.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.target/aarch64/simd/scalar_addp.c: New test.
> 	* gcc.target/aarch64/simd/scalar_faddp.c: New test.
> 	* gcc.target/aarch64/simd/scalar_faddp2.c: New test.
> 	* gcc.target/aarch64/simd/scalar_fmaxp.c: New test.
> 	* gcc.target/aarch64/simd/scalar_fminp.c: New test.
> 	* gcc.target/aarch64/simd/scalar_maxp.c: New test.
> 	* gcc.target/aarch64/simd/scalar_minp.c: New test.
>
> --- inline copy of patch -- 
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 09ae1118371f82ca63146fceb953eb9e820d05a4..c333fb1f72725992bb304c560f1245a242d5192d 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -2043,7 +2043,7 @@ (define_expand "add<mode>3"
>  
>  (define_insn "*add<mode>3_aarch64"
>    [(set
> -    (match_operand:GPI 0 "register_operand" "=rk,rk,w,rk,r,r,rk")
> +    (match_operand:GPI 0 "register_operand" "=rk,!rk,w,rk,r,r,rk")
>      (plus:GPI
>       (match_operand:GPI 1 "register_operand" "%rk,rk,w,rk,rk,0,rk")
>       (match_operand:GPI 2 "aarch64_pluslong_operand" "I,r,w,J,Uaa,Uai,Uav")))]
> diff --git a/gcc/testsuite/gcc.target/aarch64/simd/scalar_addp.c b/gcc/testsuite/gcc.target/aarch64/simd/scalar_addp.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..5b8d40f19884fc7b4e7decd80758bc36fa76d058
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/simd/scalar_addp.c
> @@ -0,0 +1,70 @@
> +/* { dg-do assemble } */
> +/* { dg-additional-options "-save-temps -O1 -std=c99" } */
> +/* { dg-final { check-function-bodies "**" "" "" { target { le } } } } */
> +
> +typedef long long v2di __attribute__((vector_size (16)));
> +typedef unsigned long long v2udi __attribute__((vector_size (16)));
> +typedef int v2si __attribute__((vector_size (16)));
> +typedef unsigned int v2usi __attribute__((vector_size (16)));
> +
> +/*
> +** foo:
> +** 	addp	d0, v0.2d
> +** 	fmov	x0, d0
> +** 	ret
> +*/
> +long long
> +foo (v2di x)
> +{
> +  return x[1] + x[0];
> +}
> +
> +/*
> +** foo1:
> +** 	saddlp	v0.1d, v0.2s
> +** 	fmov	x0, d0
> +** 	ret
> +*/
> +long long
> +foo1 (v2si x)
> +{
> +  return x[1] + x[0];
> +}
> +
> +/*
> +** foo2:
> +** 	uaddlp	v0.1d, v0.2s
> +** 	fmov	x0, d0
> +** 	ret
> +*/
> +unsigned long long
> +foo2 (v2usi x)
> +{
> +  return x[1] + x[0];
> +}
> +
> +/*
> +** foo3:
> +** 	uaddlp	v0.1d, v0.2s
> +** 	add	d0, d0, d1
> +** 	fmov	x0, d0
> +** 	ret
> +*/
> +unsigned long long
> +foo3 (v2usi x, v2udi y)
> +{
> +  return (x[1] + x[0]) + y[0];
> +}
> +
> +/*
> +** foo4:
> +** 	saddlp	v0.1d, v0.2s
> +** 	add	d0, d0, d1
> +** 	fmov	x0, d0
> +** 	ret
> +*/
> +long long
> +foo4 (v2si x, v2di y)
> +{
> +  return (x[1] + x[0]) + y[0];
> +}
> diff --git a/gcc/testsuite/gcc.target/aarch64/simd/scalar_faddp.c b/gcc/testsuite/gcc.target/aarch64/simd/scalar_faddp.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..ff455e060fc833b2f63e89c467b91a76fbe31aff
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/simd/scalar_faddp.c
> @@ -0,0 +1,66 @@
> +/* { dg-do assemble } */
> +/* { dg-require-effective-target arm_v8_2a_fp16_scalar_ok } */
> +/* { dg-add-options arm_v8_2a_fp16_scalar } */
> +/* { dg-additional-options "-save-temps -O1" } */
> +/* { dg-final { check-function-bodies "**" "" "" { target { le } } } } */
> +
> +typedef double v2df __attribute__((vector_size (16)));
> +typedef float v4sf __attribute__((vector_size (16)));
> +typedef __fp16 v8hf __attribute__((vector_size (16)));
> +
> +/*
> +** foo:
> +** 	faddp	d0, v0.2d
> +** 	ret
> +*/
> +double
> +foo (v2df x)
> +{
> +  return x[1] + x[0];
> +}
> +
> +/*
> +** foo1:
> +** 	faddp	s0, v0.2s
> +** 	ret
> +*/
> +float
> +foo1 (v4sf x)
> +{
> +  return x[0] + x[1];
> +}
> +
> +/*
> +** foo2:
> +** 	faddp	h0, v0.2h
> +** 	ret
> +*/
> +__fp16
> +foo2 (v8hf x)
> +{
> +  return x[0] + x[1];
> +}
> +
> +/*
> +** foo3:
> +** 	ext	v0.16b, v0.16b, v0.16b, #4
> +** 	faddp	s0, v0.2s
> +** 	ret
> +*/
> +float
> +foo3 (v4sf x)
> +{
> +  return x[1] + x[2];
> +}
> +
> +/*
> +** foo4:
> +** 	dup	s0, v0.s\[3\]
> +** 	faddp	h0, v0.2h
> +** 	ret
> +*/
> +__fp16
> +foo4 (v8hf x)
> +{
> +  return x[6] + x[7];
> +}
> diff --git a/gcc/testsuite/gcc.target/aarch64/simd/scalar_faddp2.c b/gcc/testsuite/gcc.target/aarch64/simd/scalar_faddp2.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..04412c3b45c51648e46ff20f730b1213e940391a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/simd/scalar_faddp2.c
> @@ -0,0 +1,14 @@
> +/* { dg-do assemble } */
> +/* { dg-additional-options "-save-temps -O1 -w" } */
> +
> +typedef __m128i __attribute__((__vector_size__(2 * sizeof(long))));
> +double a[];
> +*b;
> +fn1() {
> +  __m128i c;
> +  *(__m128i *)a = c;
> +  *b = a[0] + a[1];
> +}
> +
> +/* { dg-final { scan-assembler-times {faddp\td0, v0\.2d} 1 } } */
> +
> diff --git a/gcc/testsuite/gcc.target/aarch64/simd/scalar_fmaxp.c b/gcc/testsuite/gcc.target/aarch64/simd/scalar_fmaxp.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..aa1d2bf17cd707b74d8f7c574506610ab4fd7299
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/simd/scalar_fmaxp.c
> @@ -0,0 +1,56 @@
> +/* { dg-do assemble } */
> +/* { dg-require-effective-target arm_v8_2a_fp16_scalar_ok } */
> +/* { dg-add-options arm_v8_2a_fp16_scalar } */
> +/* { dg-additional-options "-save-temps -O1" } */
> +/* { dg-final { check-function-bodies "**" "" "" { target { le } } } } */
> +
> +typedef double v2df __attribute__((vector_size (16)));
> +typedef float v4sf __attribute__((vector_size (16)));
> +typedef __fp16 v8hf __attribute__((vector_size (16)));
> +
> +/*
> +** foo:
> +** 	fmaxnmp	d0, v0.2d
> +** 	ret
> +*/
> +double
> +foo (v2df x)
> +{
> +  return x[0] > x[1] ? x[0] : x[1];
> +}
> +
> +/* 
> +** foo1:
> +** 	fmaxnmp	s0, v0.2s
> +** 	ret
> +*/
> +float
> +foo1 (v4sf x)
> +{
> +  return x[0] > x[1] ? x[0] : x[1];
> +}
> +
> +/*
> +** foo2:
> +** 	fmaxnmp	h0, v0.2h
> +** 	ret
> +*/
> +__fp16
> +foo2 (v8hf x)
> +{
> +  return x[0] > x[1] ? x[0] : x[1];
> +}
> +
> +/* 
> +** foo3:
> +** 	fmaxnmp	s0, v0.2s
> +** 	fcvt	d0, s0
> +** 	fadd	d0, d0, d1
> +** 	ret
> +*/
> +double
> +foo3 (v4sf x, v2df y)
> +{
> +  return (x[0] > x[1] ? x[0] : x[1]) + y[0];
> +}
> +
> diff --git a/gcc/testsuite/gcc.target/aarch64/simd/scalar_fminp.c b/gcc/testsuite/gcc.target/aarch64/simd/scalar_fminp.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..6136c5272069c4d86f09951cdff25f1494e839f0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/simd/scalar_fminp.c
> @@ -0,0 +1,55 @@
> +/* { dg-do assemble } */
> +/* { dg-require-effective-target arm_v8_2a_fp16_scalar_ok } */
> +/* { dg-add-options arm_v8_2a_fp16_scalar } */
> +/* { dg-additional-options "-save-temps -O1" } */
> +/* { dg-final { check-function-bodies "**" "" "" { target { le } } } } */
> +
> +typedef double v2df __attribute__((vector_size (16)));
> +typedef float v4sf __attribute__((vector_size (16)));
> +typedef __fp16 v8hf __attribute__((vector_size (16)));
> +
> +/*
> +** foo:
> +** 	fminnmp	d0, v0.2d
> +** 	ret
> +*/
> +double
> +foo (v2df x)
> +{
> +  return x[0] < x[1] ? x[0] : x[1];
> +}
> +
> +/* 
> +** foo1:
> +** 	fminnmp	s0, v0.2s
> +** 	ret
> +*/
> +float
> +foo1 (v4sf x)
> +{
> +  return x[0] < x[1] ? x[0] : x[1];
> +}
> +
> +/* 
> +** foo2:
> +** 	fminnmp	h0, v0.2h
> +** 	ret
> +*/
> +__fp16
> +foo2 (v8hf x)
> +{
> +  return x[0] < x[1] ? x[0] : x[1];
> +}
> +
> +/* 
> +** foo3:
> +** 	fminnmp	s0, v0.2s
> +** 	fcvt	d0, s0
> +** 	fadd	d0, d0, d1
> +** 	ret
> +*/
> +double
> +foo3 (v4sf x, v2df y)
> +{
> +  return (x[0] < x[1] ? x[0] : x[1]) + y[0];
> +}
> diff --git a/gcc/testsuite/gcc.target/aarch64/simd/scalar_maxp.c b/gcc/testsuite/gcc.target/aarch64/simd/scalar_maxp.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..e219a13abc745b83dca58633fd2d812e276d6b2d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/simd/scalar_maxp.c
> @@ -0,0 +1,74 @@
> +/* { dg-do assemble } */
> +/* { dg-additional-options "-save-temps -O1 -std=c99" } */
> +/* { dg-final { check-function-bodies "**" "" "" { target { le } } } } */
> +
> +typedef long long v2di __attribute__((vector_size (16)));
> +typedef unsigned long long v2udi __attribute__((vector_size (16)));
> +typedef int v2si __attribute__((vector_size (16)));
> +typedef unsigned int v2usi __attribute__((vector_size (16)));
> +
> +/*
> +** foo:
> +** 	umov	x0, v0.d\[1\]
> +** 	fmov	x1, d0
> +** 	cmp	x0, x1
> +** 	csel	x0, x0, x1, ge
> +** 	ret
> +*/
> +long long
> +foo (v2di x)
> +{
> +  return x[0] > x[1] ? x[0] : x[1];
> +}
> +
> +/* 
> +** foo1:
> +** 	smaxp	v0.2s, v0.2s, v0.2s
> +** 	smov	x0, v0.s\[0\]
> +** 	ret
> +*/
> +long long
> +foo1 (v2si x)
> +{
> +  return x[0] > x[1] ? x[0] : x[1];
> +}
> +
> +/* 
> +** foo2:
> +** 	umaxp	v0.2s, v0.2s, v0.2s
> +** 	fmov	w0, s0
> +** 	ret
> +*/
> +unsigned long long
> +foo2 (v2usi x)
> +{
> +  return x[0] > x[1] ? x[0] : x[1];
> +}
> +
> +/* 
> +** foo3:
> +** 	umaxp	v0.2s, v0.2s, v0.2s
> +** 	fmov	w0, s0
> +** 	fmov	x1, d1
> +** 	add	x0, x1, w0, uxtw
> +** 	ret
> +*/
> +unsigned long long
> +foo3 (v2usi x, v2udi y)
> +{
> +  return (x[0] > x[1] ? x[0] : x[1]) + y[0];
> +}
> +
> +/* 
> +** foo4:
> +** 	smaxp	v0.2s, v0.2s, v0.2s
> +** 	fmov	w0, s0
> +** 	fmov	x1, d1
> +** 	add	x0, x1, w0, sxtw
> +** 	ret
> +*/
> +long long
> +foo4 (v2si x, v2di y)
> +{
> +  return (x[0] > x[1] ? x[0] : x[1]) + y[0];
> +}
> diff --git a/gcc/testsuite/gcc.target/aarch64/simd/scalar_minp.c b/gcc/testsuite/gcc.target/aarch64/simd/scalar_minp.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..2a32fb4ea3edaa4c547a7a481c3ddca6b477430e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/simd/scalar_minp.c
> @@ -0,0 +1,74 @@
> +/* { dg-do assemble } */
> +/* { dg-additional-options "-save-temps -O1 -std=c99" } */
> +/* { dg-final { check-function-bodies "**" "" "" { target { le } } } } */
> +
> +typedef long long v2di __attribute__((vector_size (16)));
> +typedef unsigned long long v2udi __attribute__((vector_size (16)));
> +typedef int v2si __attribute__((vector_size (16)));
> +typedef unsigned int v2usi __attribute__((vector_size (16)));
> +
> +/*
> +** foo:
> +** 	umov	x0, v0.d\[1\]
> +** 	fmov	x1, d0
> +** 	cmp	x0, x1
> +** 	csel	x0, x0, x1, le
> +** 	ret
> +*/
> +long long
> +foo (v2di x)
> +{
> +  return x[0] < x[1] ? x[0] : x[1];
> +}
> +
> +/* 
> +** foo1:
> +** 	sminp	v0.2s, v0.2s, v0.2s
> +** 	smov	x0, v0.s\[0\]
> +** 	ret
> +*/
> +long long
> +foo1 (v2si x)
> +{
> +  return x[0] < x[1] ? x[0] : x[1];
> +}
> +
> +/* 
> +** foo2:
> +** 	uminp	v0.2s, v0.2s, v0.2s
> +** 	fmov	w0, s0
> +** 	ret
> +*/
> +unsigned long long
> +foo2 (v2usi x)
> +{
> +  return x[0] < x[1] ? x[0] : x[1];
> +}
> +
> +/* 
> +** foo3:
> +** 	uminp	v0.2s, v0.2s, v0.2s
> +** 	fmov	w0, s0
> +** 	fmov	x1, d1
> +** 	add	x0, x1, w0, uxtw
> +** 	ret
> +*/
> +unsigned long long
> +foo3 (v2usi x, v2udi y)
> +{
> +  return (x[0] < x[1] ? x[0] : x[1]) + y[0];
> +}
> +
> +/* 
> +** foo4:
> +** 	sminp	v0.2s, v0.2s, v0.2s
> +** 	fmov	w0, s0
> +** 	fmov	x1, d1
> +** 	add	x0, x1, w0, sxtw
> +** 	ret
> +*/
> +long long
> +foo4 (v2si x, v2di y)
> +{
> +  return (x[0] < x[1] ? x[0] : x[1]) + y[0];
> +}


* RE: [PATCH 5/8]AArch64 aarch64: Make existing V2HF be usable.
  2022-11-01 14:58   ` Richard Sandiford
@ 2022-11-01 15:11     ` Tamar Christina
  2022-11-11 14:39     ` Tamar Christina
  1 sibling, 0 replies; 50+ messages in thread
From: Tamar Christina @ 2022-11-01 15:11 UTC (permalink / raw)
  To: Richard Sandiford
  Cc: gcc-patches, nd, Richard Earnshaw, Marcus Shawcroft, Kyrylo Tkachov

> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Tuesday, November 1, 2022 2:59 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
> <Richard.Earnshaw@arm.com>; Marcus Shawcroft
> <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
> Subject: Re: [PATCH 5/8]AArch64 aarch64: Make existing V2HF be usable.
> 
> Tamar Christina <tamar.christina@arm.com> writes:
> > Hi All,
> >
> > The backend has an existing V2HFmode that is used by pairwise operations.
> > This mode was however never made fully functional.  Amongst other
> > things it was never declared as a vector type which made it unusable from
> the mid-end.
> >
> > It's also lacking an implementation for load/stores so reload ICEs if
> > this mode is every used.  This finishes the implementation by providing the
> above.
> >
> > Note that I have created a new iterator VHSDF_P instead of extending
> > VHSDF because the previous iterator is used in far more things than just
> load/stores.
> >
> > It's also used for instance in intrinsics and extending this would
> > force me to provide support for mangling the type while we never
> > expose it through intrinsics.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > 	* config/aarch64/aarch64-simd.md (*aarch64_simd_movv2hf): New.
> > 	(mov<mode>, movmisalign<mode>, aarch64_dup_lane<mode>,
> > 	aarch64_store_lane0<mode>, aarch64_simd_vec_set<mode>,
> > 	@aarch64_simd_vec_copy_lane<mode>, vec_set<mode>,
> > 	reduc_<optab>_scal_<mode>, reduc_<fmaxmin>_scal_<mode>,
> > 	aarch64_reduc_<optab>_internal<mode>,
> aarch64_get_lane<mode>,
> > 	vec_init<mode><Vel>, vec_extract<mode><Vel>): Support V2HF.
> > 	* config/aarch64/aarch64.cc (aarch64_classify_vector_mode):
> > 	Add E_V2HFmode.
> > 	* config/aarch64/iterators.md (VHSDF_P): New.
> > 	(V2F, VALL_F16_FULL, nunits, Vtype, Vmtype, Vetype, stype, VEL,
> > 	Vel, q, vp): Add V2HF.
> > 	* config/arm/types.md (neon_fp_reduc_add_h): New.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 	* gcc.target/aarch64/sve/slp_1.c: Update testcase.
> >
> > --- inline copy of patch --
> > diff --git a/gcc/config/aarch64/aarch64-simd.md
> > b/gcc/config/aarch64/aarch64-simd.md
> > index
> >
> 25aed74f8cf939562ed65a578fe32ca76605b58a..93a2888f567460ad10ec050ea7
> d4
> > f701df4729d1 100644
> > --- a/gcc/config/aarch64/aarch64-simd.md
> > +++ b/gcc/config/aarch64/aarch64-simd.md
> > @@ -19,10 +19,10 @@
> >  ;; <http://www.gnu.org/licenses/>.
> >
> >  (define_expand "mov<mode>"
> > -  [(set (match_operand:VALL_F16 0 "nonimmediate_operand")
> > -	(match_operand:VALL_F16 1 "general_operand"))]
> > +  [(set (match_operand:VALL_F16_FULL 0 "nonimmediate_operand")
> > +	(match_operand:VALL_F16_FULL 1 "general_operand"))]
> >    "TARGET_SIMD"
> > -  "
> > +{
> >    /* Force the operand into a register if it is not an
> >       immediate whose use can be replaced with xzr.
> >       If the mode is 16 bytes wide, then we will be doing @@ -46,12
> > +46,11 @@ (define_expand "mov<mode>"
> >        aarch64_expand_vector_init (operands[0], operands[1]);
> >        DONE;
> >      }
> > -  "
> > -)
> > +})
> >
> >  (define_expand "movmisalign<mode>"
> > -  [(set (match_operand:VALL_F16 0 "nonimmediate_operand")
> > -        (match_operand:VALL_F16 1 "general_operand"))]
> > +  [(set (match_operand:VALL_F16_FULL 0 "nonimmediate_operand")
> > +        (match_operand:VALL_F16_FULL 1 "general_operand"))]
> >    "TARGET_SIMD && !STRICT_ALIGNMENT"
> >  {
> >    /* This pattern is not permitted to fail during expansion: if both
> > arguments @@ -85,10 +84,10 @@ (define_insn
> "aarch64_simd_dup<mode>"
> >  )
> >
> >  (define_insn "aarch64_dup_lane<mode>"
> > -  [(set (match_operand:VALL_F16 0 "register_operand" "=w")
> > -	(vec_duplicate:VALL_F16
> > +  [(set (match_operand:VALL_F16_FULL 0 "register_operand" "=w")
> > +	(vec_duplicate:VALL_F16_FULL
> >  	  (vec_select:<VEL>
> > -	    (match_operand:VALL_F16 1 "register_operand" "w")
> > +	    (match_operand:VALL_F16_FULL 1 "register_operand" "w")
> >  	    (parallel [(match_operand:SI 2 "immediate_operand" "i")])
> >            )))]
> >    "TARGET_SIMD"
> > @@ -142,6 +141,29 @@ (define_insn
> "*aarch64_simd_mov<VDMOV:mode>"
> >  		     mov_reg, neon_move<q>")]
> >  )
> >
> > +(define_insn "*aarch64_simd_movv2hf"
> > +  [(set (match_operand:V2HF 0 "nonimmediate_operand"
> > +		"=w, m,  m,  w, ?r, ?w, ?r, w, w")
> > +	(match_operand:V2HF 1 "general_operand"
> > +		"m,  Dz, w,  w,  w,  r,  r, Dz, Dn"))]
> > +  "TARGET_SIMD_F16INST
> > +   && (register_operand (operands[0], V2HFmode)
> > +       || aarch64_simd_reg_or_zero (operands[1], V2HFmode))"
> > +   "@
> > +    ldr\\t%s0, %1
> > +    str\\twzr, %0
> > +    str\\t%s1, %0
> > +    mov\\t%0.2s[0], %1.2s[0]
> > +    umov\\t%w0, %1.s[0]
> > +    fmov\\t%s0, %1
> > +    mov\\t%0, %1
> > +    movi\\t%d0, 0
> > +    * return aarch64_output_simd_mov_immediate (operands[1], 32);"
> > +  [(set_attr "type" "neon_load1_1reg, store_8, neon_store1_1reg,\
> > +		     neon_logic, neon_to_gp, f_mcr,\
> > +		     mov_reg, neon_move, neon_move")]
> > +)
> > +
> >  (define_insn "*aarch64_simd_mov<VQMOV:mode>"
> >    [(set (match_operand:VQMOV 0 "nonimmediate_operand"
> >  		"=w, Umn,  m,  w, ?r, ?w, ?r, w")
> > @@ -182,7 +204,7 @@ (define_insn
> "*aarch64_simd_mov<VQMOV:mode>"
> >
> >  (define_insn "aarch64_store_lane0<mode>"
> >    [(set (match_operand:<VEL> 0 "memory_operand" "=m")
> > -	(vec_select:<VEL> (match_operand:VALL_F16 1 "register_operand"
> "w")
> > +	(vec_select:<VEL> (match_operand:VALL_F16_FULL 1
> "register_operand"
> > +"w")
> >  			(parallel [(match_operand 2 "const_int_operand"
> "n")])))]
> >    "TARGET_SIMD
> >     && ENDIAN_LANE_N (<nunits>, INTVAL (operands[2])) == 0"
> > @@ -1035,11 +1057,11 @@ (define_insn "one_cmpl<mode>2"
> >  )
> >
> >  (define_insn "aarch64_simd_vec_set<mode>"
> > -  [(set (match_operand:VALL_F16 0 "register_operand" "=w,w,w")
> > -	(vec_merge:VALL_F16
> > -	    (vec_duplicate:VALL_F16
> > +  [(set (match_operand:VALL_F16_FULL 0 "register_operand" "=w,w,w")
> > +	(vec_merge:VALL_F16_FULL
> > +	    (vec_duplicate:VALL_F16_FULL
> >  		(match_operand:<VEL> 1
> "aarch64_simd_nonimmediate_operand" "w,?r,Utv"))
> > -	    (match_operand:VALL_F16 3 "register_operand" "0,0,0")
> > +	    (match_operand:VALL_F16_FULL 3 "register_operand" "0,0,0")
> >  	    (match_operand:SI 2 "immediate_operand" "i,i,i")))]
> >    "TARGET_SIMD"
> >    {
> > @@ -1061,14 +1083,14 @@ (define_insn "aarch64_simd_vec_set<mode>"
> >  )
> >
> >  (define_insn "@aarch64_simd_vec_copy_lane<mode>"
> > -  [(set (match_operand:VALL_F16 0 "register_operand" "=w")
> > -	(vec_merge:VALL_F16
> > -	    (vec_duplicate:VALL_F16
> > +  [(set (match_operand:VALL_F16_FULL 0 "register_operand" "=w")
> > +	(vec_merge:VALL_F16_FULL
> > +	    (vec_duplicate:VALL_F16_FULL
> >  	      (vec_select:<VEL>
> > -		(match_operand:VALL_F16 3 "register_operand" "w")
> > +		(match_operand:VALL_F16_FULL 3 "register_operand" "w")
> >  		(parallel
> >  		  [(match_operand:SI 4 "immediate_operand" "i")])))
> > -	    (match_operand:VALL_F16 1 "register_operand" "0")
> > +	    (match_operand:VALL_F16_FULL 1 "register_operand" "0")
> >  	    (match_operand:SI 2 "immediate_operand" "i")))]
> >    "TARGET_SIMD"
> >    {
> > @@ -1376,7 +1398,7 @@ (define_insn "vec_shr_<mode>"
> >  )
> >
> >  (define_expand "vec_set<mode>"
> > -  [(match_operand:VALL_F16 0 "register_operand")
> > +  [(match_operand:VALL_F16_FULL 0 "register_operand")
> >     (match_operand:<VEL> 1 "aarch64_simd_nonimmediate_operand")
> >     (match_operand:SI 2 "immediate_operand")]
> >    "TARGET_SIMD"
> > @@ -3503,7 +3525,7 @@ (define_insn "popcount<mode>2"
> >  ;; gimple_fold'd to the IFN_REDUC_(MAX|MIN) function.  (This is FP
> smax/smin).
> >  (define_expand "reduc_<optab>_scal_<mode>"
> >    [(match_operand:<VEL> 0 "register_operand")
> > -   (unspec:<VEL> [(match_operand:VHSDF 1 "register_operand")]
> > +   (unspec:<VEL> [(match_operand:VHSDF_P 1 "register_operand")]
> >  		 FMAXMINV)]
> >    "TARGET_SIMD"
> >    {
> > @@ -3518,7 +3540,7 @@ (define_expand "reduc_<optab>_scal_<mode>"
> >
> >  (define_expand "reduc_<fmaxmin>_scal_<mode>"
> >    [(match_operand:<VEL> 0 "register_operand")
> > -   (unspec:<VEL> [(match_operand:VHSDF 1 "register_operand")]
> > +   (unspec:<VEL> [(match_operand:VHSDF_P 1 "register_operand")]
> >  		 FMAXMINNMV)]
> >    "TARGET_SIMD"
> >    {
> > @@ -3562,8 +3584,8 @@ (define_insn
> "aarch64_reduc_<optab>_internalv2si"
> >  )
> >
> >  (define_insn "aarch64_reduc_<optab>_internal<mode>"
> > - [(set (match_operand:VHSDF 0 "register_operand" "=w")
> > -       (unspec:VHSDF [(match_operand:VHSDF 1 "register_operand" "w")]
> > + [(set (match_operand:VHSDF_P 0 "register_operand" "=w")
> > +       (unspec:VHSDF_P [(match_operand:VHSDF_P 1 "register_operand"
> > + "w")]
> >  		      FMAXMINV))]
> >   "TARGET_SIMD"
> >   "<maxmin_uns_op><vp>\\t%<Vetype>0, %1.<Vtype>"
> > @@ -4208,7 +4230,7 @@ (define_insn
> "*aarch64_get_lane_zero_extend<GPI:mode><VDQQH:mode>"
> >  (define_insn_and_split "aarch64_get_lane<mode>"
> >    [(set (match_operand:<VEL> 0 "aarch64_simd_nonimmediate_operand"
> "=?r, w, Utv")
> >  	(vec_select:<VEL>
> > -	  (match_operand:VALL_F16 1 "register_operand" "w, w, w")
> > +	  (match_operand:VALL_F16_FULL 1 "register_operand" "w, w, w")
> >  	  (parallel [(match_operand:SI 2 "immediate_operand" "i, i, i")])))]
> >    "TARGET_SIMD"
> >    {
> > @@ -7989,7 +8011,7 @@ (define_expand "aarch64_st1<VALL_F16:mode>"
> >  ;; Standard pattern name vec_init<mode><Vel>.
> >
> >  (define_expand "vec_init<mode><Vel>"
> > -  [(match_operand:VALL_F16 0 "register_operand")
> > +  [(match_operand:VALL_F16_FULL 0 "register_operand")
> >     (match_operand 1 "" "")]
> >    "TARGET_SIMD"
> >  {
> > @@ -8068,7 +8090,7 @@ (define_insn "aarch64_urecpe<mode>"
> >
> >  (define_expand "vec_extract<mode><Vel>"
> >    [(match_operand:<VEL> 0 "aarch64_simd_nonimmediate_operand")
> > -   (match_operand:VALL_F16 1 "register_operand")
> > +   (match_operand:VALL_F16_FULL 1 "register_operand")
> >     (match_operand:SI 2 "immediate_operand")]
> >    "TARGET_SIMD"
> >  {
> > diff --git a/gcc/config/aarch64/aarch64.cc
> > b/gcc/config/aarch64/aarch64.cc index
> >
> f05bac713e88ea8c7feaa2367d55bd523ca66f57..1e08f8453688210afe1566092b
> 19
> > b59c9bdd0c97 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -3566,6 +3566,7 @@ aarch64_classify_vector_mode (machine_mode
> mode)
> >      case E_V8BFmode:
> >      case E_V4SFmode:
> >      case E_V2DFmode:
> > +    case E_V2HFmode:
> >        return TARGET_SIMD ? VEC_ADVSIMD : 0;
> >
> >      default:
> > diff --git a/gcc/config/aarch64/iterators.md
> > b/gcc/config/aarch64/iterators.md index
> >
> 37d8161a33b1c399d80be82afa67613a087389d4..1df09f7fe2eb35aed96113476
> 541
> > e0faa5393551 100644
> > --- a/gcc/config/aarch64/iterators.md
> > +++ b/gcc/config/aarch64/iterators.md
> > @@ -160,6 +160,10 @@ (define_mode_iterator VDQF [V2SF V4SF V2DF])
> > (define_mode_iterator VHSDF [(V4HF "TARGET_SIMD_F16INST")
> >  			     (V8HF "TARGET_SIMD_F16INST")
> >  			     V2SF V4SF V2DF])
> > +;; Advanced SIMD Float modes suitable for pairwise operations.
> > +(define_mode_iterator VHSDF_P [(V4HF "TARGET_SIMD_F16INST")
> > +			       (V8HF "TARGET_SIMD_F16INST")
> > +			       V2SF V4SF V2DF (V2HF
> "TARGET_SIMD_F16INST")])
> >
> >  ;; Advanced SIMD Float modes, and DF.
> >  (define_mode_iterator VDQF_DF [V2SF V4SF V2DF DF]) @@ -188,15
> +192,23
> > @@ (define_mode_iterator VDQF_COND [V2SF V2SI V4SF V4SI V2DF
> V2DI])
> > (define_mode_iterator VALLF [V2SF V4SF V2DF SF DF])
> >
> >  ;; Advanced SIMD Float modes with 2 elements.
> > -(define_mode_iterator V2F [V2SF V2DF])
> > +(define_mode_iterator V2F [V2SF V2DF V2HF])
> >
> >  ;; All Advanced SIMD modes on which we support any arithmetic
> operations.
> >  (define_mode_iterator VALL [V8QI V16QI V4HI V8HI V2SI V4SI V2DI V2SF
> > V4SF V2DF])
> >
> > -;; All Advanced SIMD modes suitable for moving, loading, and storing.
> > +;; All Advanced SIMD modes suitable for moving, loading, and storing
> > +;; except V2HF.
> >  (define_mode_iterator VALL_F16 [V8QI V16QI V4HI V8HI V2SI V4SI V2DI
> >  				V4HF V8HF V4BF V8BF V2SF V4SF V2DF])
> >
> > +;; All Advanced SIMD modes suitable for moving, loading, and storing
> > +;; including V2HF (define_mode_iterator VALL_F16_FULL [V8QI V16QI
> > +V4HI V8HI V2SI V4SI V2DI
> > +				     V4HF V8HF V4BF V8BF V2SF V4SF V2DF
> > +				     (V2HF "TARGET_SIMD_F16INST")])
> 
> This name might cause confusion with the SVE iterators, where FULL means
> "every bit of the register is used".  How about something like VMOVE
> instead?
> 
> With this change, I guess VALL_F16 represents "The set of all modes for
> which the vld1 intrinsics are provided" and VMOVE or whatever is "All
> Advanced SIMD modes suitable for moving, loading, and storing".
> That is, VMOVE extends VALL_F16 with modes that are not manifested via
> intrinsics.
> 
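
For illustration only, a sketch of the iterator under the suggested VMOVE name
(same mode list as the posted VALL_F16_FULL; only the name changes):

  ;; All Advanced SIMD modes suitable for moving, loading, and storing,
  ;; including V2HF.
  (define_mode_iterator VMOVE [V8QI V16QI V4HI V8HI V2SI V4SI V2DI
                               V4HF V8HF V4BF V8BF V2SF V4SF V2DF
                               (V2HF "TARGET_SIMD_F16INST")])
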
> > +
> > +
> >  ;; The VALL_F16 modes except the 128-bit 2-element ones.
> >  (define_mode_iterator VALL_F16_NO_V2Q [V8QI V16QI V4HI V8HI V2SI
> V4SI
> >  				V4HF V8HF V2SF V4SF])
> > @@ -1076,7 +1088,7 @@ (define_mode_attr nunits [(V8QI "8") (V16QI
> "16")
> >  			  (V2SF "2") (V4SF "4")
> >  			  (V1DF "1") (V2DF "2")
> >  			  (DI "1") (DF "1")
> > -			  (V8DI "8")])
> > +			  (V8DI "8") (V2HF "2")])
> >
> >  ;; Map a mode to the number of bits in it, if the size of the mode
> > ;; is constant.
> > @@ -1090,6 +1102,7 @@ (define_mode_attr s [(HF "h") (SF "s") (DF "d")
> > (SI "s") (DI "d")])
> >
> >  ;; Give the length suffix letter for a sign- or zero-extension.
> >  (define_mode_attr size [(QI "b") (HI "h") (SI "w")])
> > +(define_mode_attr sizel [(QI "b") (HI "h") (SI "")])
> >
> >  ;; Give the number of bits in the mode  (define_mode_attr sizen [(QI
> > "8") (HI "16") (SI "32") (DI "64")]) @@ -1134,8 +1147,9 @@
> > (define_mode_attr Vtype [(V8QI "8b") (V16QI "16b")
> >                           (V2SI "2s") (V4SI  "4s")
> >                           (DI   "1d") (DF    "1d")
> >                           (V2DI "2d") (V2SF "2s")
> > -			 (V4SF "4s") (V2DF "2d")
> > -			 (V4HF "4h") (V8HF "8h")
> > +			 (V2HF "2h") (V4SF "4s")
> > +			 (V2DF "2d") (V4HF "4h")
> > +			 (V8HF "8h")
> >  			 (V2x8QI "8b") (V2x4HI "4h")
> >  			 (V2x2SI "2s") (V2x1DI  "1d")
> >  			 (V2x4HF "4h") (V2x2SF "2s")
> 
> Where is the 2h used, and is it valid syntax in that context?
> 

The singular instance in the ISA where 2h is valid syntax is for faddp.
I'll double check the usage contexts but it should be the only place.

I'll check and get back to you as I respin the patch.

Thanks,
Tamar
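
For reference, the one place the .2h arrangement shows up in the tests posted
in this series is the two-lane half-precision pairwise add checked in
scalar_faddp.c:

  faddp	h0, v0.2h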


> Same for later instances of 2h.
> 
> Thanks,
> Richard
> 
> > @@ -1175,9 +1189,10 @@ (define_mode_attr Vmtype [(V8QI ".8b")
> (V16QI ".16b")
> >  			 (V4HI ".4h") (V8HI  ".8h")
> >  			 (V2SI ".2s") (V4SI  ".4s")
> >  			 (V2DI ".2d") (V4HF ".4h")
> > -			 (V8HF ".8h") (V4BF ".4h")
> > -			 (V8BF ".8h") (V2SF ".2s")
> > -			 (V4SF ".4s") (V2DF ".2d")
> > +			 (V8HF ".8h") (V2HF ".2h")
> > +			 (V4BF ".4h") (V8BF ".8h")
> > +			 (V2SF ".2s") (V4SF ".4s")
> > +			 (V2DF ".2d")
> >  			 (DI   "")    (SI   "")
> >  			 (HI   "")    (QI   "")
> >  			 (TI   "")    (HF   "")
> > @@ -1193,7 +1208,7 @@ (define_mode_attr Vmntype [(V8HI ".8b") (V4SI
> > ".4h")  (define_mode_attr Vetype [(V8QI "b") (V16QI "b")
> >  			  (V4HI "h") (V8HI  "h")
> >  			  (V2SI "s") (V4SI  "s")
> > -			  (V2DI "d")
> > +			  (V2DI "d") (V2HF  "h")
> >  			  (V4HF "h") (V8HF  "h")
> >  			  (V2SF "s") (V4SF  "s")
> >  			  (V2DF "d")
> > @@ -1285,7 +1300,7 @@ (define_mode_attr Vcwtype [(VNx16QI "b")
> (VNx8QI
> > "h") (VNx4QI "w") (VNx2QI "d")  ;; more accurately.
> >  (define_mode_attr stype [(V8QI "b") (V16QI "b") (V4HI "s") (V8HI "s")
> >  			 (V2SI "s") (V4SI "s") (V2DI "d") (V4HF "s")
> > -			 (V8HF "s") (V2SF "s") (V4SF "s") (V2DF "d")
> > +			 (V8HF "s") (V2SF "s") (V4SF "s") (V2DF "d") (V2HF
> "s")
> >  			 (HF "s") (SF "s") (DF "d") (QI "b") (HI "s")
> >  			 (SI "s") (DI "d")])
> >
> > @@ -1360,8 +1375,8 @@ (define_mode_attr VEL [(V8QI  "QI") (V16QI "QI")
> >  		       (V4HF "HF") (V8HF  "HF")
> >  		       (V2SF "SF") (V4SF  "SF")
> >  		       (DF   "DF") (V2DF  "DF")
> > -		       (SI   "SI") (HI    "HI")
> > -		       (QI   "QI")
> > +		       (SI   "SI") (V2HF  "HF")
> > +		       (QI   "QI") (HI    "HI")
> >  		       (V4BF "BF") (V8BF "BF")
> >  		       (VNx16QI "QI") (VNx8QI "QI") (VNx4QI "QI") (VNx2QI
> "QI")
> >  		       (VNx8HI "HI") (VNx4HI "HI") (VNx2HI "HI") @@ -1381,7
> +1396,7
> > @@ (define_mode_attr Vel [(V8QI "qi") (V16QI "qi")
> >  		       (V2SF "sf") (V4SF "sf")
> >  		       (V2DF "df") (DF   "df")
> >  		       (SI   "si") (HI   "hi")
> > -		       (QI   "qi")
> > +		       (QI   "qi") (V2HF "hf")
> >  		       (V4BF "bf") (V8BF "bf")
> >  		       (VNx16QI "qi") (VNx8QI "qi") (VNx4QI "qi") (VNx2QI "qi")
> >  		       (VNx8HI "hi") (VNx4HI "hi") (VNx2HI "hi") @@ -1866,7
> +1881,7
> > @@ (define_mode_attr q [(V8QI "") (V16QI "_q")
> >  		     (V4HF "") (V8HF "_q")
> >  		     (V4BF "") (V8BF "_q")
> >  		     (V2SF "") (V4SF  "_q")
> > -			       (V2DF  "_q")
> > +		     (V2HF "") (V2DF  "_q")
> >  		     (QI "") (HI "") (SI "") (DI "") (HF "") (SF "") (DF "")
> >  		     (V2x8QI "") (V2x16QI "_q")
> >  		     (V2x4HI "") (V2x8HI "_q")
> > @@ -1905,6 +1920,7 @@ (define_mode_attr vp [(V8QI "v") (V16QI "v")
> >  		      (V2SI "p") (V4SI  "v")
> >  		      (V2DI "p") (V2DF  "p")
> >  		      (V2SF "p") (V4SF  "v")
> > +		      (V2HF "p")
> >  		      (V4HF "v") (V8HF  "v")])
> >
> >  (define_mode_attr vsi2qi [(V2SI "v8qi") (V4SI "v16qi") diff --git
> > a/gcc/config/arm/types.md b/gcc/config/arm/types.md index
> >
> 7d0504bdd944e9c0d1b545b0b66a9a1adc808714..3cfbc7a93cca1bea4925853e5
> 1d0
> > a147c5722247 100644
> > --- a/gcc/config/arm/types.md
> > +++ b/gcc/config/arm/types.md
> > @@ -483,6 +483,7 @@ (define_attr "autodetect_type"
> >  ; neon_fp_minmax_s_q
> >  ; neon_fp_minmax_d
> >  ; neon_fp_minmax_d_q
> > +; neon_fp_reduc_add_h
> >  ; neon_fp_reduc_add_s
> >  ; neon_fp_reduc_add_s_q
> >  ; neon_fp_reduc_add_d
> > @@ -1033,6 +1034,7 @@ (define_attr "type"
> >    neon_fp_minmax_d,\
> >    neon_fp_minmax_d_q,\
> >  \
> > +  neon_fp_reduc_add_h,\
> >    neon_fp_reduc_add_s,\
> >    neon_fp_reduc_add_s_q,\
> >    neon_fp_reduc_add_d,\
> > @@ -1257,8 +1259,8 @@ (define_attr "is_neon_type" "yes,no"
> >            neon_fp_compare_d, neon_fp_compare_d_q, neon_fp_minmax_s,\
> >            neon_fp_minmax_s_q, neon_fp_minmax_d,
> neon_fp_minmax_d_q,\
> >            neon_fp_neg_s, neon_fp_neg_s_q, neon_fp_neg_d,
> neon_fp_neg_d_q,\
> > -          neon_fp_reduc_add_s, neon_fp_reduc_add_s_q,
> neon_fp_reduc_add_d,\
> > -          neon_fp_reduc_add_d_q, neon_fp_reduc_minmax_s,
> > +          neon_fp_reduc_add_h, neon_fp_reduc_add_s,
> neon_fp_reduc_add_s_q,\
> > +          neon_fp_reduc_add_d, neon_fp_reduc_add_d_q,
> > + neon_fp_reduc_minmax_s,\
> >            neon_fp_reduc_minmax_s_q, neon_fp_reduc_minmax_d,\
> >            neon_fp_reduc_minmax_d_q,\
> >            neon_fp_cvt_narrow_s_q, neon_fp_cvt_narrow_d_q,\ diff --git
> > a/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> > b/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> > index
> >
> 07d71a63414b1066ea431e287286ad048515711a..8e35e0b574d49913b43c7d8d
> 4f4b
> > a75f127f42e9 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> > @@ -30,11 +30,9 @@ vec_slp_##TYPE (TYPE *restrict a, TYPE b, TYPE c, int
> n)	\
> >  TEST_ALL (VEC_PERM)
> >
> >  /* We should use one DUP for each of the 8-, 16- and 32-bit types,
> > -   although we currently use LD1RW for _Float16.  We should use two
> > -   DUPs for each of the three 64-bit types.  */
> > +   We should use two DUPs for each of the three 64-bit types.  */
> >  /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.h, [hw]} 2 } }
> > */
> > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.s, [sw]} 2 } }
> > */
> > -/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 1 } } */
> > +/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.s, [sw]} 3 } }
> > +*/
> >  /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, [dx]} 9 } }
> > */
> >  /* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d,
> > z[0-9]+\.d\n} 3 } } */
> >  /* { dg-final { scan-assembler-not {\tzip2\t} } } */


* RE: [PATCH 8/8]AArch64: Have reload not choose to do add on the scalar side if both values exist on the SIMD side.
  2022-11-01 15:04   ` Richard Sandiford
@ 2022-11-01 15:20     ` Tamar Christina
  0 siblings, 0 replies; 50+ messages in thread
From: Tamar Christina @ 2022-11-01 15:20 UTC (permalink / raw)
  To: Richard Sandiford
  Cc: gcc-patches, nd, Richard Earnshaw, Marcus Shawcroft, Kyrylo Tkachov

> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Tuesday, November 1, 2022 3:05 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
> <Richard.Earnshaw@arm.com>; Marcus Shawcroft
> <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
> Subject: Re: [PATCH 8/8]AArch64: Have reload not choose to do add on the
> scalar side if both values exist on the SIMD side.
> 
> Tamar Christina <tamar.christina@arm.com> writes:
> > Hi All,
> >
> > Currently we often generate an r -> r add even if it means we
> > need two reloads to perform it, i.e. in the case that the values are on the
> SIMD side.
> >
> > The pairwise operations expose these more now and so we get suboptimal
> codegen.
> >
> > Normally I would have liked to use ^ or $ here, but while this works
> > for the simple examples, reload inexplicably falls apart on examples
> > that should have been trivial. It forces a move to r -> w to use the w
> > ADD, which is counter to what ^ and $ should do.
> >
> > However, ! seems to fix all the regressions and still maintains the good
> codegen.
> >
> > I have tried looking into whether it's our costings that are off, but
> > I can't see anything logical here.  So I'd like to push this change
> > instead along with tests that augment the other testcases that guard the r ->
> r variants.
> 
> This feels like a hack though.  r<-r+r is one of the simplest things the processor
> can do, so I don't think it makes logical sense to mark it with !, which means
> "prohibitively expensive".  It's likely to push operations that require reloads
> onto the SIMD side.

I agree. Though at the moment, reload isn't behaving as it should. It's almost as if
the register transfer costs are not taken into account when deciding on an alternative.

It seems to think that r->r and w->w are equally cheap even when the value has already
been assigned to w.  For instance, some of the testcases below don't work correctly because of this.

I don't think I can influence this costing, and as I mentioned, ^ works for the simple example
but then somehow makes w->w cheaper even though the value was assigned to r.

I'm not really sure where to look here, but the current version is equally broken:
it basically always forces the value to r.

Thanks,
Tamar

> 
> Thanks,
> Richard
> 
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > 	* config/aarch64/aarch64.md (*add<mode>3_aarch64): Add ! to the
> r -> r
> > 	alternative.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 	* gcc.target/aarch64/simd/scalar_addp.c: New test.
> > 	* gcc.target/aarch64/simd/scalar_faddp.c: New test.
> > 	* gcc.target/aarch64/simd/scalar_faddp2.c: New test.
> > 	* gcc.target/aarch64/simd/scalar_fmaxp.c: New test.
> > 	* gcc.target/aarch64/simd/scalar_fminp.c: New test.
> > 	* gcc.target/aarch64/simd/scalar_maxp.c: New test.
> > 	* gcc.target/aarch64/simd/scalar_minp.c: New test.
> >
> > --- inline copy of patch --
> > diff --git a/gcc/config/aarch64/aarch64.md
> > b/gcc/config/aarch64/aarch64.md index
> >
> 09ae1118371f82ca63146fceb953eb9e820d05a4..c333fb1f72725992bb304c560f
> 12
> > 45a242d5192d 100644
> > --- a/gcc/config/aarch64/aarch64.md
> > +++ b/gcc/config/aarch64/aarch64.md
> > @@ -2043,7 +2043,7 @@ (define_expand "add<mode>3"
> >
> >  (define_insn "*add<mode>3_aarch64"
> >    [(set
> > -    (match_operand:GPI 0 "register_operand" "=rk,rk,w,rk,r,r,rk")
> > +    (match_operand:GPI 0 "register_operand" "=rk,!rk,w,rk,r,r,rk")
> >      (plus:GPI
> >       (match_operand:GPI 1 "register_operand" "%rk,rk,w,rk,rk,0,rk")
> >       (match_operand:GPI 2 "aarch64_pluslong_operand"
> > "I,r,w,J,Uaa,Uai,Uav")))] diff --git
> > a/gcc/testsuite/gcc.target/aarch64/simd/scalar_addp.c
> > b/gcc/testsuite/gcc.target/aarch64/simd/scalar_addp.c
> > new file mode 100644
> > index
> >
> 0000000000000000000000000000000000000000..5b8d40f19884fc7b4e7decd80
> 758
> > bc36fa76d058
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/simd/scalar_addp.c
> > @@ -0,0 +1,70 @@
> > +/* { dg-do assemble } */
> > +/* { dg-additional-options "-save-temps -O1 -std=c99" } */
> > +/* { dg-final { check-function-bodies "**" "" "" { target { le } } }
> > +} */
> > +
> > +typedef long long v2di __attribute__((vector_size (16))); typedef
> > +unsigned long long v2udi __attribute__((vector_size (16))); typedef
> > +int v2si __attribute__((vector_size (16))); typedef unsigned int
> > +v2usi __attribute__((vector_size (16)));
> > +
> > +/*
> > +** foo:
> > +** 	addp	d0, v0.2d
> > +** 	fmov	x0, d0
> > +** 	ret
> > +*/
> > +long long
> > +foo (v2di x)
> > +{
> > +  return x[1] + x[0];
> > +}
> > +
> > +/*
> > +** foo1:
> > +** 	saddlp	v0.1d, v0.2s
> > +** 	fmov	x0, d0
> > +** 	ret
> > +*/
> > +long long
> > +foo1 (v2si x)
> > +{
> > +  return x[1] + x[0];
> > +}
> > +
> > +/*
> > +** foo2:
> > +** 	uaddlp	v0.1d, v0.2s
> > +** 	fmov	x0, d0
> > +** 	ret
> > +*/
> > +unsigned long long
> > +foo2 (v2usi x)
> > +{
> > +  return x[1] + x[0];
> > +}
> > +
> > +/*
> > +** foo3:
> > +** 	uaddlp	v0.1d, v0.2s
> > +** 	add	d0, d0, d1
> > +** 	fmov	x0, d0
> > +** 	ret
> > +*/
> > +unsigned long long
> > +foo3 (v2usi x, v2udi y)
> > +{
> > +  return (x[1] + x[0]) + y[0];
> > +}
> > +
> > +/*
> > +** foo4:
> > +** 	saddlp	v0.1d, v0.2s
> > +** 	add	d0, d0, d1
> > +** 	fmov	x0, d0
> > +** 	ret
> > +*/
> > +long long
> > +foo4 (v2si x, v2di y)
> > +{
> > +  return (x[1] + x[0]) + y[0];
> > +}
> > diff --git a/gcc/testsuite/gcc.target/aarch64/simd/scalar_faddp.c
> > b/gcc/testsuite/gcc.target/aarch64/simd/scalar_faddp.c
> > new file mode 100644
> > index
> >
> 0000000000000000000000000000000000000000..ff455e060fc833b2f63e89c467
> b9
> > 1a76fbe31aff
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/simd/scalar_faddp.c
> > @@ -0,0 +1,66 @@
> > +/* { dg-do assemble } */
> > +/* { dg-require-effective-target arm_v8_2a_fp16_scalar_ok } */
> > +/* { dg-add-options arm_v8_2a_fp16_scalar } */
> > +/* { dg-additional-options "-save-temps -O1" } */
> > +/* { dg-final { check-function-bodies "**" "" "" { target { le } } }
> > +} */
> > +
> > +typedef double v2df __attribute__((vector_size (16))); typedef float
> > +v4sf __attribute__((vector_size (16))); typedef __fp16 v8hf
> > +__attribute__((vector_size (16)));
> > +
> > +/*
> > +** foo:
> > +** 	faddp	d0, v0.2d
> > +** 	ret
> > +*/
> > +double
> > +foo (v2df x)
> > +{
> > +  return x[1] + x[0];
> > +}
> > +
> > +/*
> > +** foo1:
> > +** 	faddp	s0, v0.2s
> > +** 	ret
> > +*/
> > +float
> > +foo1 (v4sf x)
> > +{
> > +  return x[0] + x[1];
> > +}
> > +
> > +/*
> > +** foo2:
> > +** 	faddp	h0, v0.2h
> > +** 	ret
> > +*/
> > +__fp16
> > +foo2 (v8hf x)
> > +{
> > +  return x[0] + x[1];
> > +}
> > +
> > +/*
> > +** foo3:
> > +** 	ext	v0.16b, v0.16b, v0.16b, #4
> > +** 	faddp	s0, v0.2s
> > +** 	ret
> > +*/
> > +float
> > +foo3 (v4sf x)
> > +{
> > +  return x[1] + x[2];
> > +}
> > +
> > +/*
> > +** foo4:
> > +** 	dup	s0, v0.s\[3\]
> > +** 	faddp	h0, v0.2h
> > +** 	ret
> > +*/
> > +__fp16
> > +foo4 (v8hf x)
> > +{
> > +  return x[6] + x[7];
> > +}
> > diff --git a/gcc/testsuite/gcc.target/aarch64/simd/scalar_faddp2.c
> > b/gcc/testsuite/gcc.target/aarch64/simd/scalar_faddp2.c
> > new file mode 100644
> > index
> >
> 0000000000000000000000000000000000000000..04412c3b45c51648e46ff20f73
> 0b
> > 1213e940391a
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/simd/scalar_faddp2.c
> > @@ -0,0 +1,14 @@
> > +/* { dg-do assemble } */
> > +/* { dg-additional-options "-save-temps -O1 -w" } */
> > +
> > +typedef __m128i __attribute__((__vector_size__(2 * sizeof(long))));
> > +double a[]; *b;
> > +fn1() {
> > +  __m128i c;
> > +  *(__m128i *)a = c;
> > +  *b = a[0] + a[1];
> > +}
> > +
> > +/* { dg-final { scan-assembler-times {faddp\td0, v0\.2d} 1 } } */
> > +
> > diff --git a/gcc/testsuite/gcc.target/aarch64/simd/scalar_fmaxp.c
> > b/gcc/testsuite/gcc.target/aarch64/simd/scalar_fmaxp.c
> > new file mode 100644
> > index
> >
> 0000000000000000000000000000000000000000..aa1d2bf17cd707b74d8f7c5745
> 06
> > 610ab4fd7299
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/simd/scalar_fmaxp.c
> > @@ -0,0 +1,56 @@
> > +/* { dg-do assemble } */
> > +/* { dg-require-effective-target arm_v8_2a_fp16_scalar_ok } */
> > +/* { dg-add-options arm_v8_2a_fp16_scalar } */
> > +/* { dg-additional-options "-save-temps -O1" } */
> > +/* { dg-final { check-function-bodies "**" "" "" { target { le } } }
> > +} */
> > +
> > +typedef double v2df __attribute__((vector_size (16))); typedef float
> > +v4sf __attribute__((vector_size (16))); typedef __fp16 v8hf
> > +__attribute__((vector_size (16)));
> > +
> > +/*
> > +** foo:
> > +** 	fmaxnmp	d0, v0.2d
> > +** 	ret
> > +*/
> > +double
> > +foo (v2df x)
> > +{
> > +  return x[0] > x[1] ? x[0] : x[1];
> > +}
> > +
> > +/*
> > +** foo1:
> > +** 	fmaxnmp	s0, v0.2s
> > +** 	ret
> > +*/
> > +float
> > +foo1 (v4sf x)
> > +{
> > +  return x[0] > x[1] ? x[0] : x[1];
> > +}
> > +
> > +/*
> > +** foo2:
> > +** 	fmaxnmp	h0, v0.2h
> > +** 	ret
> > +*/
> > +__fp16
> > +foo2 (v8hf x)
> > +{
> > +  return x[0] > x[1] ? x[0] : x[1];
> > +}
> > +
> > +/*
> > +** foo3:
> > +** 	fmaxnmp	s0, v0.2s
> > +** 	fcvt	d0, s0
> > +** 	fadd	d0, d0, d1
> > +** 	ret
> > +*/
> > +double
> > +foo3 (v4sf x, v2df y)
> > +{
> > +  return (x[0] > x[1] ? x[0] : x[1]) + y[0]; }
> > +
> > diff --git a/gcc/testsuite/gcc.target/aarch64/simd/scalar_fminp.c
> > b/gcc/testsuite/gcc.target/aarch64/simd/scalar_fminp.c
> > new file mode 100644
> > index
> >
> 0000000000000000000000000000000000000000..6136c5272069c4d86f09951cdff
> 2
> > 5f1494e839f0
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/simd/scalar_fminp.c
> > @@ -0,0 +1,55 @@
> > +/* { dg-do assemble } */
> > +/* { dg-require-effective-target arm_v8_2a_fp16_scalar_ok } */
> > +/* { dg-add-options arm_v8_2a_fp16_scalar } */
> > +/* { dg-additional-options "-save-temps -O1" } */
> > +/* { dg-final { check-function-bodies "**" "" "" { target { le } } }
> > +} */
> > +
> > +typedef double v2df __attribute__((vector_size (16))); typedef float
> > +v4sf __attribute__((vector_size (16))); typedef __fp16 v8hf
> > +__attribute__((vector_size (16)));
> > +
> > +/*
> > +** foo:
> > +** 	fminnmp	d0, v0.2d
> > +** 	ret
> > +*/
> > +double
> > +foo (v2df x)
> > +{
> > +  return x[0] < x[1] ? x[0] : x[1];
> > +}
> > +
> > +/*
> > +** foo1:
> > +** 	fminnmp	s0, v0.2s
> > +** 	ret
> > +*/
> > +float
> > +foo1 (v4sf x)
> > +{
> > +  return x[0] < x[1] ? x[0] : x[1];
> > +}
> > +
> > +/*
> > +** foo2:
> > +** 	fminnmp	h0, v0.2h
> > +** 	ret
> > +*/
> > +__fp16
> > +foo2 (v8hf x)
> > +{
> > +  return x[0] < x[1] ? x[0] : x[1];
> > +}
> > +
> > +/*
> > +** foo3:
> > +** 	fminnmp	s0, v0.2s
> > +** 	fcvt	d0, s0
> > +** 	fadd	d0, d0, d1
> > +** 	ret
> > +*/
> > +double
> > +foo3 (v4sf x, v2df y)
> > +{
> > +  return (x[0] < x[1] ? x[0] : x[1]) + y[0]; }
> > diff --git a/gcc/testsuite/gcc.target/aarch64/simd/scalar_maxp.c
> > b/gcc/testsuite/gcc.target/aarch64/simd/scalar_maxp.c
> > new file mode 100644
> > index
> >
> 0000000000000000000000000000000000000000..e219a13abc745b83dca58633f
> d2d
> > 812e276d6b2d
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/simd/scalar_maxp.c
> > @@ -0,0 +1,74 @@
> > +/* { dg-do assemble } */
> > +/* { dg-additional-options "-save-temps -O1 -std=c99" } */
> > +/* { dg-final { check-function-bodies "**" "" "" { target { le } } }
> > +} */
> > +
> > +typedef long long v2di __attribute__((vector_size (16))); typedef
> > +unsigned long long v2udi __attribute__((vector_size (16))); typedef
> > +int v2si __attribute__((vector_size (16))); typedef unsigned int
> > +v2usi __attribute__((vector_size (16)));
> > +
> > +/*
> > +** foo:
> > +** 	umov	x0, v0.d\[1\]
> > +** 	fmov	x1, d0
> > +** 	cmp	x0, x1
> > +** 	csel	x0, x0, x1, ge
> > +** 	ret
> > +*/
> > +long long
> > +foo (v2di x)
> > +{
> > +  return x[0] > x[1] ? x[0] : x[1];
> > +}
> > +
> > +/*
> > +** foo1:
> > +** 	smaxp	v0.2s, v0.2s, v0.2s
> > +** 	smov	x0, v0.s\[0\]
> > +** 	ret
> > +*/
> > +long long
> > +foo1 (v2si x)
> > +{
> > +  return x[0] > x[1] ? x[0] : x[1];
> > +}
> > +
> > +/*
> > +** foo2:
> > +** 	umaxp	v0.2s, v0.2s, v0.2s
> > +** 	fmov	w0, s0
> > +** 	ret
> > +*/
> > +unsigned long long
> > +foo2 (v2usi x)
> > +{
> > +  return x[0] > x[1] ? x[0] : x[1];
> > +}
> > +
> > +/*
> > +** foo3:
> > +** 	umaxp	v0.2s, v0.2s, v0.2s
> > +** 	fmov	w0, s0
> > +** 	fmov	x1, d1
> > +** 	add	x0, x1, w0, uxtw
> > +** 	ret
> > +*/
> > +unsigned long long
> > +foo3 (v2usi x, v2udi y)
> > +{
> > +  return (x[0] > x[1] ? x[0] : x[1]) + y[0]; }
> > +
> > +/*
> > +** foo4:
> > +** 	smaxp	v0.2s, v0.2s, v0.2s
> > +** 	fmov	w0, s0
> > +** 	fmov	x1, d1
> > +** 	add	x0, x1, w0, sxtw
> > +** 	ret
> > +*/
> > +long long
> > +foo4 (v2si x, v2di y)
> > +{
> > +  return (x[0] > x[1] ? x[0] : x[1]) + y[0]; }
> > diff --git a/gcc/testsuite/gcc.target/aarch64/simd/scalar_minp.c
> > b/gcc/testsuite/gcc.target/aarch64/simd/scalar_minp.c
> > new file mode 100644
> > index
> >
> 0000000000000000000000000000000000000000..2a32fb4ea3edaa4c547a7a481c
> 3d
> > dca6b477430e
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/simd/scalar_minp.c
> > @@ -0,0 +1,74 @@
> > +/* { dg-do assemble } */
> > +/* { dg-additional-options "-save-temps -O1 -std=c99" } */
> > +/* { dg-final { check-function-bodies "**" "" "" { target { le } } }
> > +} */
> > +
> > +typedef long long v2di __attribute__((vector_size (16))); typedef
> > +unsigned long long v2udi __attribute__((vector_size (16))); typedef
> > +int v2si __attribute__((vector_size (16))); typedef unsigned int
> > +v2usi __attribute__((vector_size (16)));
> > +
> > +/*
> > +** foo:
> > +** 	umov	x0, v0.d\[1\]
> > +** 	fmov	x1, d0
> > +** 	cmp	x0, x1
> > +** 	csel	x0, x0, x1, le
> > +** 	ret
> > +*/
> > +long long
> > +foo (v2di x)
> > +{
> > +  return x[0] < x[1] ? x[0] : x[1];
> > +}
> > +
> > +/*
> > +** foo1:
> > +** 	sminp	v0.2s, v0.2s, v0.2s
> > +** 	smov	x0, v0.s\[0\]
> > +** 	ret
> > +*/
> > +long long
> > +foo1 (v2si x)
> > +{
> > +  return x[0] < x[1] ? x[0] : x[1];
> > +}
> > +
> > +/*
> > +** foo2:
> > +** 	uminp	v0.2s, v0.2s, v0.2s
> > +** 	fmov	w0, s0
> > +** 	ret
> > +*/
> > +unsigned long long
> > +foo2 (v2usi x)
> > +{
> > +  return x[0] < x[1] ? x[0] : x[1];
> > +}
> > +
> > +/*
> > +** foo3:
> > +** 	uminp	v0.2s, v0.2s, v0.2s
> > +** 	fmov	w0, s0
> > +** 	fmov	x1, d1
> > +** 	add	x0, x1, w0, uxtw
> > +** 	ret
> > +*/
> > +unsigned long long
> > +foo3 (v2usi x, v2udi y)
> > +{
> > +  return (x[0] < x[1] ? x[0] : x[1]) + y[0]; }
> > +
> > +/*
> > +** foo4:
> > +** 	sminp	v0.2s, v0.2s, v0.2s
> > +** 	fmov	w0, s0
> > +** 	fmov	x1, d1
> > +** 	add	x0, x1, w0, sxtw
> > +** 	ret
> > +*/
> > +long long
> > +foo4 (v2si x, v2di y)
> > +{
> > +  return (x[0] < x[1] ? x[0] : x[1]) + y[0]; }


* Re: [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs
  2022-10-31 11:56 [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs Tamar Christina
                   ` (7 preceding siblings ...)
  2022-10-31 21:41 ` [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs Jeff Law
@ 2022-11-05 11:32 ` Richard Biener
  2022-11-07  7:16   ` Tamar Christina
  8 siblings, 1 reply; 50+ messages in thread
From: Richard Biener @ 2022-11-05 11:32 UTC (permalink / raw)
  To: Tamar Christina; +Cc: gcc-patches, nd, rguenther

On Mon, Oct 31, 2022 at 1:00 PM Tamar Christina via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Hi All,
>
> This patch series is to add recognition of pairwise operations (reductions)
> in match.pd such that we can benefit from them even at -O1 when the vectorizer
> isn't enabled.
>
> The use of these allows for a lot simpler codegen in AArch64 and allows us to
> avoid quite a lot of codegen warts.
>
> As an example a simple:
>
> typedef float v4sf __attribute__((vector_size (16)));
>
> float
> foo3 (v4sf x)
> {
>   return x[1] + x[2];
> }
>
> currently generates:
>
> foo3:
>         dup     s1, v0.s[1]
>         dup     s0, v0.s[2]
>         fadd    s0, s1, s0
>         ret
>
> while with this patch series now generates:
>
> foo3:
>         ext     v0.16b, v0.16b, v0.16b, #4
>         faddp   s0, v0.2s
>         ret
>
> This patch will not perform the operation if the source is not a gimple
> register and leaves memory sources to the vectorizer as it's able to deal
> correctly with clobbers.

But the vectorizer should also be able to cope with the above.  I don't think
we want to do this as part of general folding.  Iff, then this belongs in
specific points of the pass pipeline, no?

> The use of these instructions makes a significant difference in codegen quality
> for AArch64 and Arm.
>
> NOTE: The last entry in the series contains tests for all of the previous
> patches as it's a bit of an all or nothing thing.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
> and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>         * match.pd (adjacent_data_access_p): Import.
>         Add new pattern for bitwise plus, min, max, fmax, fmin.
>         * tree-cfg.cc (verify_gimple_call): Allow function arguments in IFNs.
>         * tree.cc (adjacent_data_access_p): New.
>         * tree.h (adjacent_data_access_p): New.
>
> --- inline copy of patch --
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 2617d56091dfbd41ae49f980ee0af3757f5ec1cf..aecaa3520b36e770d11ea9a10eb18db23c0cd9f7 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -39,7 +39,8 @@ along with GCC; see the file COPYING3.  If not see
>     HONOR_NANS
>     uniform_vector_p
>     expand_vec_cmp_expr_p
> -   bitmask_inv_cst_vector_p)
> +   bitmask_inv_cst_vector_p
> +   adjacent_data_access_p)
>
>  /* Operator lists.  */
>  (define_operator_list tcc_comparison
> @@ -7195,6 +7196,47 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>
>  /* Canonicalizations of BIT_FIELD_REFs.  */
>
> +/* Canonicalize BIT_FIELD_REFS to pairwise operations. */
> +(for op (plus min max FMIN_ALL FMAX_ALL)
> +     ifn (IFN_REDUC_PLUS IFN_REDUC_MIN IFN_REDUC_MAX
> +         IFN_REDUC_FMIN IFN_REDUC_FMAX)
> + (simplify
> +  (op @0 @1)
> +   (if (INTEGRAL_TYPE_P (type) || SCALAR_FLOAT_TYPE_P (type))
> +    (with { poly_uint64 nloc = 0;
> +           tree src = adjacent_data_access_p (@0, @1, &nloc, true);
> +           tree ntype = build_vector_type (type, 2);
> +           tree size = TYPE_SIZE (ntype);
> +           tree pos = build_int_cst (TREE_TYPE (size), nloc);
> +           poly_uint64 _sz;
> +           poly_uint64 _total; }
> +     (if (src && is_gimple_reg (src) && ntype
> +         && poly_int_tree_p (size, &_sz)
> +         && poly_int_tree_p (TYPE_SIZE (TREE_TYPE (src)), &_total)
> +         && known_ge (_total, _sz + nloc))
> +      (ifn (BIT_FIELD_REF:ntype { src; } { size; } { pos; })))))))
> +
> +(for op (lt gt)
> +     ifni (IFN_REDUC_MIN IFN_REDUC_MAX)
> +     ifnf (IFN_REDUC_FMIN IFN_REDUC_FMAX)
> + (simplify
> +  (cond (op @0 @1) @0 @1)
> +   (if (INTEGRAL_TYPE_P (type) || SCALAR_FLOAT_TYPE_P (type))
> +    (with { poly_uint64 nloc = 0;
> +           tree src = adjacent_data_access_p (@0, @1, &nloc, false);
> +           tree ntype = build_vector_type (type, 2);
> +           tree size = TYPE_SIZE (ntype);
> +           tree pos = build_int_cst (TREE_TYPE (size), nloc);
> +           poly_uint64 _sz;
> +           poly_uint64 _total; }
> +     (if (src && is_gimple_reg (src) && ntype
> +         && poly_int_tree_p (size, &_sz)
> +         && poly_int_tree_p (TYPE_SIZE (TREE_TYPE (src)), &_total)
> +         && known_ge (_total, _sz + nloc))
> +      (if (SCALAR_FLOAT_MODE_P (TYPE_MODE (type)))
> +       (ifnf (BIT_FIELD_REF:ntype { src; } { size; } { pos; }))
> +       (ifni (BIT_FIELD_REF:ntype { src; } { size; } { pos; }))))))))
> +
>  (simplify
>   (BIT_FIELD_REF (BIT_FIELD_REF @0 @1 @2) @3 @4)
>   (BIT_FIELD_REF @0 @3 { const_binop (PLUS_EXPR, bitsizetype, @2, @4); }))
> diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc
> index 91ec33c80a41e1e0cc6224e137dd42144724a168..b19710392940cf469de52d006603ae1e3deb6b76 100644
> --- a/gcc/tree-cfg.cc
> +++ b/gcc/tree-cfg.cc
> @@ -3492,6 +3492,7 @@ verify_gimple_call (gcall *stmt)
>      {
>        tree arg = gimple_call_arg (stmt, i);
>        if ((is_gimple_reg_type (TREE_TYPE (arg))
> +          && !is_gimple_variable (arg)
>            && !is_gimple_val (arg))
>           || (!is_gimple_reg_type (TREE_TYPE (arg))
>               && !is_gimple_lvalue (arg)))
> diff --git a/gcc/tree.h b/gcc/tree.h
> index e6564aaccb7b69cd938ff60b6121aec41b7e8a59..8f8a9660c9e0605eb516de194640b8c1b531b798 100644
> --- a/gcc/tree.h
> +++ b/gcc/tree.h
> @@ -5006,6 +5006,11 @@ extern bool integer_pow2p (const_tree);
>
>  extern tree bitmask_inv_cst_vector_p (tree);
>
> +/* TRUE if the two operands represent adjacent access of data such that a
> +   pairwise operation can be used.  */
> +
> +extern tree adjacent_data_access_p (tree, tree, poly_uint64*, bool);
> +
>  /* integer_nonzerop (tree x) is nonzero if X is an integer constant
>     with a nonzero value.  */
>
> diff --git a/gcc/tree.cc b/gcc/tree.cc
> index 007c9325b17076f474e6681c49966c59cf6b91c7..5315af38a1ead89ca5f75dc4b19de9841e29d311 100644
> --- a/gcc/tree.cc
> +++ b/gcc/tree.cc
> @@ -10457,6 +10457,90 @@ bitmask_inv_cst_vector_p (tree t)
>    return builder.build ();
>  }
>
> +/* Returns base address if the two operands represent adjacent access of data
> +   such that a pairwise operation can be used.  OP1 must be a lower subpart
> +   than OP2.  If POS is not NULL then on return if a value is returned POS
> +   will indicate the position of the lower address.  If COMMUTATIVE_P then
> +   the operation is also tried by flipping op1 and op2.  */
> +
> +tree adjacent_data_access_p (tree op1, tree op2, poly_uint64 *pos,
> +                            bool commutative_p)
> +{
> +  gcc_assert (op1);
> +  gcc_assert (op2);
> +  if (TREE_CODE (op1) != TREE_CODE (op2)
> +      || TREE_TYPE (op1) != TREE_TYPE (op2))
> +    return NULL;
> +
> +  tree type = TREE_TYPE (op1);
> +  gimple *stmt1 = NULL, *stmt2 = NULL;
> +  unsigned int bits = GET_MODE_BITSIZE (GET_MODE_INNER (TYPE_MODE (type)));
> +
> +  if (TREE_CODE (op1) == BIT_FIELD_REF
> +      && operand_equal_p (TREE_OPERAND (op1, 0), TREE_OPERAND (op2, 0), 0)
> +      && operand_equal_p (TREE_OPERAND (op1, 1), TREE_OPERAND (op2, 1), 0)
> +      && known_eq (bit_field_size (op1), bits))
> +    {
> +      poly_uint64 offset1 = bit_field_offset (op1);
> +      poly_uint64 offset2 = bit_field_offset (op2);
> +      if (known_eq (offset2 - offset1, bits))
> +       {
> +         if (pos)
> +           *pos = offset1;
> +         return TREE_OPERAND (op1, 0);
> +       }
> +      else if (commutative_p && known_eq (offset1 - offset2, bits))
> +       {
> +         if (pos)
> +           *pos = offset2;
> +         return TREE_OPERAND (op1, 0);
> +       }
> +    }
> +  else if (TREE_CODE (op1) == ARRAY_REF
> +          && operand_equal_p (get_base_address (op1), get_base_address (op2)))
> +    {
> +      wide_int size1 = wi::to_wide (array_ref_element_size (op1));
> +      wide_int size2 = wi::to_wide (array_ref_element_size (op2));
> +      if (wi::ne_p (size1, size2) || wi::ne_p (size1, bits / 8)
> +         || !tree_fits_poly_uint64_p (TREE_OPERAND (op1, 1))
> +         || !tree_fits_poly_uint64_p (TREE_OPERAND (op2, 1)))
> +       return NULL;
> +
> +      poly_uint64 offset1 = tree_to_poly_uint64 (TREE_OPERAND (op1, 1));
> +      poly_uint64 offset2 = tree_to_poly_uint64 (TREE_OPERAND (op2, 1));
> +      if (known_eq (offset2 - offset1, 1UL))
> +       {
> +         if (pos)
> +           *pos = offset1 * bits;
> +         return TREE_OPERAND (op1, 0);
> +       }
> +      else if (commutative_p && known_eq (offset1 - offset2, 1UL))
> +       {
> +         if (pos)
> +           *pos = offset2 * bits;
> +         return TREE_OPERAND (op1, 0);
> +       }
> +    }
> +  else if (TREE_CODE (op1) == SSA_NAME
> +          && (stmt1 = SSA_NAME_DEF_STMT (op1)) != NULL
> +          && (stmt2 = SSA_NAME_DEF_STMT (op2)) != NULL
> +          && is_gimple_assign (stmt1)
> +          && is_gimple_assign (stmt2))
> +    {
> +      if (gimple_assign_rhs_code (stmt1) != ARRAY_REF
> +         && gimple_assign_rhs_code (stmt1) != BIT_FIELD_REF
> +         && gimple_assign_rhs_code (stmt2) != ARRAY_REF
> +         && gimple_assign_rhs_code (stmt2) != BIT_FIELD_REF)
> +       return NULL;
> +
> +      return adjacent_data_access_p (gimple_assign_rhs1 (stmt1),
> +                                    gimple_assign_rhs1 (stmt2), pos,
> +                                    commutative_p);
> +    }
> +
> +  return NULL;
> +}
> +
>  /* If VECTOR_CST T has a single nonzero element, return the index of that
>     element, otherwise return -1.  */
>
>
>
>
>
> --

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs
  2022-11-05 11:32 ` Richard Biener
@ 2022-11-07  7:16   ` Tamar Christina
  2022-11-07 10:17     ` Richard Biener
  0 siblings, 1 reply; 50+ messages in thread
From: Tamar Christina @ 2022-11-07  7:16 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, nd, rguenther

> -----Original Message-----
> From: Richard Biener <richard.guenther@gmail.com>
> Sent: Saturday, November 5, 2022 11:33 AM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; rguenther@suse.de
> Subject: Re: [PATCH 1/8]middle-end: Recognize scalar reductions from
> bitfields and array_refs
> 
> On Mon, Oct 31, 2022 at 1:00 PM Tamar Christina via Gcc-patches <gcc-
> patches@gcc.gnu.org> wrote:
> >
> > Hi All,
> >
> > This patch series is to add recognition of pairwise operations
> > (reductions) in match.pd such that we can benefit from them even at
> > -O1 when the vectorizer isn't enabled.
> >
> > Ths use of these allow for a lot simpler codegen in AArch64 and allows
> > us to avoid quite a lot of codegen warts.
> >
> > As an example a simple:
> >
> > typedef float v4sf __attribute__((vector_size (16)));
> >
> > float
> > foo3 (v4sf x)
> > {
> >   return x[1] + x[2];
> > }
> >
> > currently generates:
> >
> > foo3:
> >         dup     s1, v0.s[1]
> >         dup     s0, v0.s[2]
> >         fadd    s0, s1, s0
> >         ret
> >
> > while with this patch series now generates:
> >
> > foo3:
> >         ext     v0.16b, v0.16b, v0.16b, #4
> >         faddp   s0, v0.2s
> >         ret
> >
> > This patch will not perform the operation if the source is not a
> > gimple register and leaves memory sources to the vectorizer as it's
> > able to deal correctly with clobbers.
> 
> But the vectorizer should also be able to cope with the above.  

There are several problems with leaving it up to the vectorizer:

1. We only get it at -O2 and higher.
2. The way the vectorizer costs the reduction makes the resulting cost always too high for AArch64.

As an example the following:

typedef unsigned int u32v4 __attribute__((vector_size(16)));
unsigned int f (u32v4 a, u32v4 b)
{
    return a[0] + a[1];
}

Doesn't get SLP'ed because the vectorizer costs it as:

node 0x485eb30 0 times vec_perm costs 0 in body
_1 + _2 1 times vector_stmt costs 1 in body
_1 + _2 1 times vec_perm costs 2 in body
_1 + _2 1 times vec_to_scalar costs 2 in body

And so ultimately you fail because:

/app/example.c:8:17: note: Cost model analysis for part in loop 0:
  Vector cost: 5
  Scalar cost: 3

This looks like it's because the vectorizer costs the operation that creates the BIT_FIELD_REF <a_3(D), 64, 0>
for the reduction as requiring two scalar extracts and a permute.  While it ultimately does produce a
BIT_FIELD_REF <a_3(D), 64, 0>, that's not what it costs.

This causes the reduction to almost always be more expensive, so unless the rest of the SLP tree amortizes
the cost we never generate it.
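
As a rough illustration (a GIMPLE-like sketch rather than an actual dump, with made-up
SSA names), the scalar statements being replaced are two lane extracts feeding an add:

  _1 = BIT_FIELD_REF <a_3(D), 32, 0>;
  _2 = BIT_FIELD_REF <a_3(D), 32, 32>;
  _5 = _1 + _2;

while what the SLP path ultimately emits is a single wider extract feeding a reduction:

  _6 = BIT_FIELD_REF <a_3(D), 64, 0>;
  _7 = .REDUC_PLUS (_6);

yet the extract is costed as a vec_perm plus a vec_to_scalar instead of as the single
BIT_FIELD_REF it ends up being.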

3. The SLP only happens on operations that are SLP shaped and where SLP didn't fail.

As a simple example, the vectorizer can't SLP the following:

unsigned int f (u32v4 a, u32v4 b)
{
    a[0] += b[0];
    return a[0] + a[1];
}

Because there's not enough VF here and it can't unroll.  This and many others fail because they're not
SLP-able operations, or SLP build fails.

This causes us to generate for e.g. this example:

f:
        dup     s2, v0.s[1]
        fmov    w1, s1
        add     v0.2s, v2.2s, v0.2s
        fmov    w0, s0
        add     w0, w0, w1
        ret

instead of with my patch:

f:
        addp    v0.2s, v0.2s, v0.2s
        add     v0.2s, v0.2s, v1.2s
        fmov    w0, s0
        ret

which is significantly better code.  So I don't think the vectorizer is the right solution for this.

> I don't think
> we want to do this as part of general folding.  Iff, then this belongs in specific
> points of the pass pipeline, no?

The reason I currently have it as such is because in general the compiler doesn't really deal with
horizontal reductions at all.  Also since the vectorizer itself can introduce reductions I figured it's
better to have one representation for this.  So admittedly perhaps this should only be done after
vectorization as that's when today we expect reductions to be in Gimple.

As for having it at a specific point in the pass pipeline, I have it as a general one since a number of
passes could create the form for the reduction; for instance vec_lower could break up an operation
to allow this to match.  The bigger BIT_FIELD_REF it creates could also lead to other optimizations.
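
Concretely, for the foo3 example quoted above the transform is intended to produce
something like (again a GIMPLE-like sketch with made-up SSA names):

  _2 = BIT_FIELD_REF <x_1(D), 64, 32>;
  _3 = .REDUC_PLUS (_2);

i.e. one 64-bit BIT_FIELD_REF at bit position 32 covering x[1] and x[2], fed into the
reduction IFN; that larger reference is what later passes could pick up.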

Additionally you had mentioned last time that Andrew was trying to move min/max detection to match.pd,
so I had figured this was the correct place for it.

That said I have no intuition for what would be better here, since the check is quite cheap.  But do you have
a particular place you want this moved to then?  Ideally I'd want it before the last FRE pass, but perhaps
isel?

Thanks,
Tamar

> 
> > The use of these instruction makes a significant difference in codegen
> > quality for AArch64 and Arm.
> >
> > NOTE: The last entry in the series contains tests for all of the
> > previous patches as it's a bit of an all or nothing thing.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
> > and no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> >         * match.pd (adjacent_data_access_p): Import.
> >         Add new pattern for bitwise plus, min, max, fmax, fmin.
> >         * tree-cfg.cc (verify_gimple_call): Allow function arguments in IFNs.
> >         * tree.cc (adjacent_data_access_p): New.
> >         * tree.h (adjacent_data_access_p): New.
> >
> > --- inline copy of patch --
> > diff --git a/gcc/match.pd b/gcc/match.pd index
> >
> 2617d56091dfbd41ae49f980ee0af3757f5ec1cf..aecaa3520b36e770d11ea9a10
> eb1
> > 8db23c0cd9f7 100644
> > --- a/gcc/match.pd
> > +++ b/gcc/match.pd
> > @@ -39,7 +39,8 @@ along with GCC; see the file COPYING3.  If not see
> >     HONOR_NANS
> >     uniform_vector_p
> >     expand_vec_cmp_expr_p
> > -   bitmask_inv_cst_vector_p)
> > +   bitmask_inv_cst_vector_p
> > +   adjacent_data_access_p)
> >
> >  /* Operator lists.  */
> >  (define_operator_list tcc_comparison
> > @@ -7195,6 +7196,47 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> >
> >  /* Canonicalizations of BIT_FIELD_REFs.  */
> >
> > +/* Canonicalize BIT_FIELD_REFS to pairwise operations. */ (for op
> > +(plus min max FMIN_ALL FMAX_ALL)
> > +     ifn (IFN_REDUC_PLUS IFN_REDUC_MIN IFN_REDUC_MAX
> > +         IFN_REDUC_FMIN IFN_REDUC_FMAX)  (simplify
> > +  (op @0 @1)
> > +   (if (INTEGRAL_TYPE_P (type) || SCALAR_FLOAT_TYPE_P (type))
> > +    (with { poly_uint64 nloc = 0;
> > +           tree src = adjacent_data_access_p (@0, @1, &nloc, true);
> > +           tree ntype = build_vector_type (type, 2);
> > +           tree size = TYPE_SIZE (ntype);
> > +           tree pos = build_int_cst (TREE_TYPE (size), nloc);
> > +           poly_uint64 _sz;
> > +           poly_uint64 _total; }
> > +     (if (src && is_gimple_reg (src) && ntype
> > +         && poly_int_tree_p (size, &_sz)
> > +         && poly_int_tree_p (TYPE_SIZE (TREE_TYPE (src)), &_total)
> > +         && known_ge (_total, _sz + nloc))
> > +      (ifn (BIT_FIELD_REF:ntype { src; } { size; } { pos; })))))))
> > +
> > +(for op (lt gt)
> > +     ifni (IFN_REDUC_MIN IFN_REDUC_MAX)
> > +     ifnf (IFN_REDUC_FMIN IFN_REDUC_FMAX)  (simplify
> > +  (cond (op @0 @1) @0 @1)
> > +   (if (INTEGRAL_TYPE_P (type) || SCALAR_FLOAT_TYPE_P (type))
> > +    (with { poly_uint64 nloc = 0;
> > +           tree src = adjacent_data_access_p (@0, @1, &nloc, false);
> > +           tree ntype = build_vector_type (type, 2);
> > +           tree size = TYPE_SIZE (ntype);
> > +           tree pos = build_int_cst (TREE_TYPE (size), nloc);
> > +           poly_uint64 _sz;
> > +           poly_uint64 _total; }
> > +     (if (src && is_gimple_reg (src) && ntype
> > +         && poly_int_tree_p (size, &_sz)
> > +         && poly_int_tree_p (TYPE_SIZE (TREE_TYPE (src)), &_total)
> > +         && known_ge (_total, _sz + nloc))
> > +      (if (SCALAR_FLOAT_MODE_P (TYPE_MODE (type)))
> > +       (ifnf (BIT_FIELD_REF:ntype { src; } { size; } { pos; }))
> > +       (ifni (BIT_FIELD_REF:ntype { src; } { size; } { pos; }))))))))
> > +
> >  (simplify
> >   (BIT_FIELD_REF (BIT_FIELD_REF @0 @1 @2) @3 @4)
> >   (BIT_FIELD_REF @0 @3 { const_binop (PLUS_EXPR, bitsizetype, @2, @4);
> > })) diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc index
> >
> 91ec33c80a41e1e0cc6224e137dd42144724a168..b19710392940cf469de52d006
> 603
> > ae1e3deb6b76 100644
> > --- a/gcc/tree-cfg.cc
> > +++ b/gcc/tree-cfg.cc
> > @@ -3492,6 +3492,7 @@ verify_gimple_call (gcall *stmt)
> >      {
> >        tree arg = gimple_call_arg (stmt, i);
> >        if ((is_gimple_reg_type (TREE_TYPE (arg))
> > +          && !is_gimple_variable (arg)
> >            && !is_gimple_val (arg))
> >           || (!is_gimple_reg_type (TREE_TYPE (arg))
> >               && !is_gimple_lvalue (arg))) diff --git a/gcc/tree.h
> > b/gcc/tree.h index
> >
> e6564aaccb7b69cd938ff60b6121aec41b7e8a59..8f8a9660c9e0605eb516de194
> 640
> > b8c1b531b798 100644
> > --- a/gcc/tree.h
> > +++ b/gcc/tree.h
> > @@ -5006,6 +5006,11 @@ extern bool integer_pow2p (const_tree);
> >
> >  extern tree bitmask_inv_cst_vector_p (tree);
> >
> > +/* TRUE if the two operands represent adjacent access of data such that a
> > +   pairwise operation can be used.  */
> > +
> > +extern tree adjacent_data_access_p (tree, tree, poly_uint64*, bool);
> > +
> >  /* integer_nonzerop (tree x) is nonzero if X is an integer constant
> >     with a nonzero value.  */
> >
> > diff --git a/gcc/tree.cc b/gcc/tree.cc index
> >
> 007c9325b17076f474e6681c49966c59cf6b91c7..5315af38a1ead89ca5f75dc4b19
> d
> > e9841e29d311 100644
> > --- a/gcc/tree.cc
> > +++ b/gcc/tree.cc
> > @@ -10457,6 +10457,90 @@ bitmask_inv_cst_vector_p (tree t)
> >    return builder.build ();
> >  }
> >
> > +/* Returns base address if the two operands represent adjacent access of
> data
> > +   such that a pairwise operation can be used.  OP1 must be a lower
> subpart
> > +   than OP2.  If POS is not NULL then on return if a value is returned POS
> > +   will indicate the position of the lower address.  If COMMUTATIVE_P
> then
> > +   the operation is also tried by flipping op1 and op2.  */
> > +
> > +tree adjacent_data_access_p (tree op1, tree op2, poly_uint64 *pos,
> > +                            bool commutative_p) {
> > +  gcc_assert (op1);
> > +  gcc_assert (op2);
> > +  if (TREE_CODE (op1) != TREE_CODE (op2)
> > +      || TREE_TYPE (op1) != TREE_TYPE (op2))
> > +    return NULL;
> > +
> > +  tree type = TREE_TYPE (op1);
> > +  gimple *stmt1 = NULL, *stmt2 = NULL;  unsigned int bits =
> > + GET_MODE_BITSIZE (GET_MODE_INNER (TYPE_MODE (type)));
> > +
> > +  if (TREE_CODE (op1) == BIT_FIELD_REF
> > +      && operand_equal_p (TREE_OPERAND (op1, 0), TREE_OPERAND (op2,
> 0), 0)
> > +      && operand_equal_p (TREE_OPERAND (op1, 1), TREE_OPERAND (op2,
> 1), 0)
> > +      && known_eq (bit_field_size (op1), bits))
> > +    {
> > +      poly_uint64 offset1 = bit_field_offset (op1);
> > +      poly_uint64 offset2 = bit_field_offset (op2);
> > +      if (known_eq (offset2 - offset1, bits))
> > +       {
> > +         if (pos)
> > +           *pos = offset1;
> > +         return TREE_OPERAND (op1, 0);
> > +       }
> > +      else if (commutative_p && known_eq (offset1 - offset2, bits))
> > +       {
> > +         if (pos)
> > +           *pos = offset2;
> > +         return TREE_OPERAND (op1, 0);
> > +       }
> > +    }
> > +  else if (TREE_CODE (op1) == ARRAY_REF
> > +          && operand_equal_p (get_base_address (op1), get_base_address
> (op2)))
> > +    {
> > +      wide_int size1 = wi::to_wide (array_ref_element_size (op1));
> > +      wide_int size2 = wi::to_wide (array_ref_element_size (op2));
> > +      if (wi::ne_p (size1, size2) || wi::ne_p (size1, bits / 8)
> > +         || !tree_fits_poly_uint64_p (TREE_OPERAND (op1, 1))
> > +         || !tree_fits_poly_uint64_p (TREE_OPERAND (op2, 1)))
> > +       return NULL;
> > +
> > +      poly_uint64 offset1 = tree_to_poly_uint64 (TREE_OPERAND (op1, 1));
> > +      poly_uint64 offset2 = tree_to_poly_uint64 (TREE_OPERAND (op2, 1));
> > +      if (known_eq (offset2 - offset1, 1UL))
> > +       {
> > +         if (pos)
> > +           *pos = offset1 * bits;
> > +         return TREE_OPERAND (op1, 0);
> > +       }
> > +      else if (commutative_p && known_eq (offset1 - offset2, 1UL))
> > +       {
> > +         if (pos)
> > +           *pos = offset2 * bits;
> > +         return TREE_OPERAND (op1, 0);
> > +       }
> > +    }
> > +  else if (TREE_CODE (op1) == SSA_NAME
> > +          && (stmt1 = SSA_NAME_DEF_STMT (op1)) != NULL
> > +          && (stmt2 = SSA_NAME_DEF_STMT (op2)) != NULL
> > +          && is_gimple_assign (stmt1)
> > +          && is_gimple_assign (stmt2))
> > +    {
> > +      if (gimple_assign_rhs_code (stmt1) != ARRAY_REF
> > +         && gimple_assign_rhs_code (stmt1) != BIT_FIELD_REF
> > +         && gimple_assign_rhs_code (stmt2) != ARRAY_REF
> > +         && gimple_assign_rhs_code (stmt2) != BIT_FIELD_REF)
> > +       return NULL;
> > +
> > +      return adjacent_data_access_p (gimple_assign_rhs1 (stmt1),
> > +                                    gimple_assign_rhs1 (stmt2), pos,
> > +                                    commutative_p);
> > +    }
> > +
> > +  return NULL;
> > +}
> > +
> >  /* If VECTOR_CST T has a single nonzero element, return the index of that
> >     element, otherwise return -1.  */
> >
> >
> >
> >
> >
> > --

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs
  2022-11-07  7:16   ` Tamar Christina
@ 2022-11-07 10:17     ` Richard Biener
  2022-11-07 11:00       ` Tamar Christina
  0 siblings, 1 reply; 50+ messages in thread
From: Richard Biener @ 2022-11-07 10:17 UTC (permalink / raw)
  To: Tamar Christina; +Cc: Richard Biener, gcc-patches, nd

On Mon, 7 Nov 2022, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <richard.guenther@gmail.com>
> > Sent: Saturday, November 5, 2022 11:33 AM
> > To: Tamar Christina <Tamar.Christina@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; rguenther@suse.de
> > Subject: Re: [PATCH 1/8]middle-end: Recognize scalar reductions from
> > bitfields and array_refs
> > 
> > On Mon, Oct 31, 2022 at 1:00 PM Tamar Christina via Gcc-patches <gcc-
> > patches@gcc.gnu.org> wrote:
> > >
> > > Hi All,
> > >
> > > This patch series is to add recognition of pairwise operations
> > > (reductions) in match.pd such that we can benefit from them even at
> > > -O1 when the vectorizer isn't enabled.
> > >
> > > Ths use of these allow for a lot simpler codegen in AArch64 and allows
> > > us to avoid quite a lot of codegen warts.
> > >
> > > As an example a simple:
> > >
> > > typedef float v4sf __attribute__((vector_size (16)));
> > >
> > > float
> > > foo3 (v4sf x)
> > > {
> > >   return x[1] + x[2];
> > > }
> > >
> > > currently generates:
> > >
> > > foo3:
> > >         dup     s1, v0.s[1]
> > >         dup     s0, v0.s[2]
> > >         fadd    s0, s1, s0
> > >         ret
> > >
> > > while with this patch series now generates:
> > >
> > > foo3:
> > >         ext     v0.16b, v0.16b, v0.16b, #4
> > >         faddp   s0, v0.2s
> > >         ret
> > >
> > > This patch will not perform the operation if the source is not a
> > > gimple register and leaves memory sources to the vectorizer as it's
> > > able to deal correctly with clobbers.
> > 
> > But the vectorizer should also be able to cope with the above.  
> 
> There are several problems with leaving it up to the vectorizer to do:
> 
> 1. We only get it at -O2 and higher.
> 2. The way the vectorizer costs the reduction makes the resulting cost always too high for AArch64.
> 
> As an example the following:
> 
> typedef unsigned int u32v4 __attribute__((vector_size(16)));
> unsigned int f (u32v4 a, u32v4 b)
> {
>     return a[0] + a[1];
> }
> 
> Doesn't get SLP'ed because the vectorizer costs it as:
> 
> node 0x485eb30 0 times vec_perm costs 0 in body
> _1 + _2 1 times vector_stmt costs 1 in body
> _1 + _2 1 times vec_perm costs 2 in body
> _1 + _2 1 times vec_to_scalar costs 2 in body
> 
> And so ultimately you fail because:
> 
> /app/example.c:8:17: note: Cost model analysis for part in loop 0:
>   Vector cost: 5
>   Scalar cost: 3
> 
> This looks like it's because the vectorizer costs the operation to create the BIT_FIELD_REF <a_3(D), 64, 0>;
> For the reduction as requiring two scalar extracts and a permute. While it ultimately does produce a
> BIT_FIELD_REF <a_3(D), 64, 0>; that's not what it costs.
> 
> This causes the reduction to almost always be more expensive, so unless the rest of the SLP tree amortizes
> the cost we never generate them.

On x86 for example the hadds are prohibitively expensive here.  Are you sure
the horizontal add is actually profitable on arm?  Your pattern-matching
has no cost modeling at all?

> 3. The SLP only happens on operation that are SLP shaped and where SLP didn't fail.
> 
> As a simple example, the vectorizer can't SLP the following:
> 
> unsigned int f (u32v4 a, u32v4 b)
> {
>     a[0] += b[0];
>     return a[0] + a[1];
> }
> 
> Because there's not enough VF here and it can't unroll. This and many others fail because they're not an
> SLP-able operation, or SLP build fails.

That's of course because the pattern matching for reductions is too simple
here, getting us a group size of three.  Bad association would make your
simple pattern matching fail as well.

> This causes us to generate for e.g. this example:
> 
> f:
>         dup     s2, v0.s[1]
>         fmov    w1, s1
>         add     v0.2s, v2.2s, v0.2s
>         fmov    w0, s0
>         add     w0, w0, w1
>         ret
> 
> instead of with my patch:
> 
> f:
>         addp    v0.2s, v0.2s, v0.2s
>         add     v0.2s, v0.2s, v1.2s
>         fmov    w0, s0
>         ret
> 
> which is significantly better code.  So I don't think the vectorizer is the right solution for this.

Simple pattern matching isn't either.  In fact basic-block SLP is supposed
to be the advanced pattern matching including a cost model.

IMHO the correct approach is to improve that:
vect_slp_check_for_constructors, plus how we handle/recover from SLP
discovery failures as in your second example above.

> > I don't think
> > we want to do this as part of general folding.  Iff, then this belongs in specific
> > points of the pass pipeline, no?
> 
> The reason I currently have it as such is because in general the compiler doesn't really deal with
> horizontal reductions at all.  Also since the vectorizer itself can introduce reductions I figured it's
> better to have one representation for this.  So admittedly perhaps this should only be done after
> vectorization as that's when today we expect reductions to be in Gimple.
> 
> As for having it in a specific point in the pass pipeline, I have it as a general one since a number of
> passes could create the form for the reduction, for instance vec_lower could break up an operation
> to allow this to match.  The bigger BIT_FIELD_EXPR it creates could also lead to other optimizations.
> 
> Additionally you had mentioned last time that Andrew was trying to move min/max detection to match.pd
> So I had figured this was the correct place for it.

That's mostly because we have fold-const patterns for ?: min/max and
CFG patterns for min/max in phiopt and it's possible to unify both.

> That said I have no intuition for what would be better here. Since the check is quite cheap.  But do you have
> a particular place you want this move to then?  Ideally I'd want it before the last FRE pass, but perhaps
> isel?

As said, I think it belongs where we can do costing, which means the
vectorizer.  Iff there are two/three instruction sequences that can
be peepholed, do it in the target's machine description instead.

Richard.

> Thanks,
> Tamar
> 
> > 
> > > The use of these instruction makes a significant difference in codegen
> > > quality for AArch64 and Arm.
> > >
> > > NOTE: The last entry in the series contains tests for all of the
> > > previous patches as it's a bit of an all or nothing thing.
> > >
> > > Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
> > > and no issues.
> > >
> > > Ok for master?
> > >
> > > Thanks,
> > > Tamar
> > >
> > > gcc/ChangeLog:
> > >
> > >         * match.pd (adjacent_data_access_p): Import.
> > >         Add new pattern for bitwise plus, min, max, fmax, fmin.
> > >         * tree-cfg.cc (verify_gimple_call): Allow function arguments in IFNs.
> > >         * tree.cc (adjacent_data_access_p): New.
> > >         * tree.h (adjacent_data_access_p): New.
> > >
> > > --- inline copy of patch --
> > > diff --git a/gcc/match.pd b/gcc/match.pd index
> > >
> > 2617d56091dfbd41ae49f980ee0af3757f5ec1cf..aecaa3520b36e770d11ea9a10
> > eb1
> > > 8db23c0cd9f7 100644
> > > --- a/gcc/match.pd
> > > +++ b/gcc/match.pd
> > > @@ -39,7 +39,8 @@ along with GCC; see the file COPYING3.  If not see
> > >     HONOR_NANS
> > >     uniform_vector_p
> > >     expand_vec_cmp_expr_p
> > > -   bitmask_inv_cst_vector_p)
> > > +   bitmask_inv_cst_vector_p
> > > +   adjacent_data_access_p)
> > >
> > >  /* Operator lists.  */
> > >  (define_operator_list tcc_comparison
> > > @@ -7195,6 +7196,47 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> > >
> > >  /* Canonicalizations of BIT_FIELD_REFs.  */
> > >
> > > +/* Canonicalize BIT_FIELD_REFS to pairwise operations. */ (for op
> > > +(plus min max FMIN_ALL FMAX_ALL)
> > > +     ifn (IFN_REDUC_PLUS IFN_REDUC_MIN IFN_REDUC_MAX
> > > +         IFN_REDUC_FMIN IFN_REDUC_FMAX)  (simplify
> > > +  (op @0 @1)
> > > +   (if (INTEGRAL_TYPE_P (type) || SCALAR_FLOAT_TYPE_P (type))
> > > +    (with { poly_uint64 nloc = 0;
> > > +           tree src = adjacent_data_access_p (@0, @1, &nloc, true);
> > > +           tree ntype = build_vector_type (type, 2);
> > > +           tree size = TYPE_SIZE (ntype);
> > > +           tree pos = build_int_cst (TREE_TYPE (size), nloc);
> > > +           poly_uint64 _sz;
> > > +           poly_uint64 _total; }
> > > +     (if (src && is_gimple_reg (src) && ntype
> > > +         && poly_int_tree_p (size, &_sz)
> > > +         && poly_int_tree_p (TYPE_SIZE (TREE_TYPE (src)), &_total)
> > > +         && known_ge (_total, _sz + nloc))
> > > +      (ifn (BIT_FIELD_REF:ntype { src; } { size; } { pos; })))))))
> > > +
> > > +(for op (lt gt)
> > > +     ifni (IFN_REDUC_MIN IFN_REDUC_MAX)
> > > +     ifnf (IFN_REDUC_FMIN IFN_REDUC_FMAX)  (simplify
> > > +  (cond (op @0 @1) @0 @1)
> > > +   (if (INTEGRAL_TYPE_P (type) || SCALAR_FLOAT_TYPE_P (type))
> > > +    (with { poly_uint64 nloc = 0;
> > > +           tree src = adjacent_data_access_p (@0, @1, &nloc, false);
> > > +           tree ntype = build_vector_type (type, 2);
> > > +           tree size = TYPE_SIZE (ntype);
> > > +           tree pos = build_int_cst (TREE_TYPE (size), nloc);
> > > +           poly_uint64 _sz;
> > > +           poly_uint64 _total; }
> > > +     (if (src && is_gimple_reg (src) && ntype
> > > +         && poly_int_tree_p (size, &_sz)
> > > +         && poly_int_tree_p (TYPE_SIZE (TREE_TYPE (src)), &_total)
> > > +         && known_ge (_total, _sz + nloc))
> > > +      (if (SCALAR_FLOAT_MODE_P (TYPE_MODE (type)))
> > > +       (ifnf (BIT_FIELD_REF:ntype { src; } { size; } { pos; }))
> > > +       (ifni (BIT_FIELD_REF:ntype { src; } { size; } { pos; }))))))))
> > > +
> > >  (simplify
> > >   (BIT_FIELD_REF (BIT_FIELD_REF @0 @1 @2) @3 @4)
> > >   (BIT_FIELD_REF @0 @3 { const_binop (PLUS_EXPR, bitsizetype, @2, @4);
> > > })) diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc index
> > >
> > 91ec33c80a41e1e0cc6224e137dd42144724a168..b19710392940cf469de52d006
> > 603
> > > ae1e3deb6b76 100644
> > > --- a/gcc/tree-cfg.cc
> > > +++ b/gcc/tree-cfg.cc
> > > @@ -3492,6 +3492,7 @@ verify_gimple_call (gcall *stmt)
> > >      {
> > >        tree arg = gimple_call_arg (stmt, i);
> > >        if ((is_gimple_reg_type (TREE_TYPE (arg))
> > > +          && !is_gimple_variable (arg)
> > >            && !is_gimple_val (arg))
> > >           || (!is_gimple_reg_type (TREE_TYPE (arg))
> > >               && !is_gimple_lvalue (arg))) diff --git a/gcc/tree.h
> > > b/gcc/tree.h index
> > >
> > e6564aaccb7b69cd938ff60b6121aec41b7e8a59..8f8a9660c9e0605eb516de194
> > 640
> > > b8c1b531b798 100644
> > > --- a/gcc/tree.h
> > > +++ b/gcc/tree.h
> > > @@ -5006,6 +5006,11 @@ extern bool integer_pow2p (const_tree);
> > >
> > >  extern tree bitmask_inv_cst_vector_p (tree);
> > >
> > > +/* TRUE if the two operands represent adjacent access of data such that a
> > > +   pairwise operation can be used.  */
> > > +
> > > +extern tree adjacent_data_access_p (tree, tree, poly_uint64*, bool);
> > > +
> > >  /* integer_nonzerop (tree x) is nonzero if X is an integer constant
> > >     with a nonzero value.  */
> > >
> > > diff --git a/gcc/tree.cc b/gcc/tree.cc index
> > >
> > 007c9325b17076f474e6681c49966c59cf6b91c7..5315af38a1ead89ca5f75dc4b19
> > d
> > > e9841e29d311 100644
> > > --- a/gcc/tree.cc
> > > +++ b/gcc/tree.cc
> > > @@ -10457,6 +10457,90 @@ bitmask_inv_cst_vector_p (tree t)
> > >    return builder.build ();
> > >  }
> > >
> > > +/* Returns base address if the two operands represent adjacent access of
> > data
> > > +   such that a pairwise operation can be used.  OP1 must be a lower
> > subpart
> > > +   than OP2.  If POS is not NULL then on return if a value is returned POS
> > > +   will indicate the position of the lower address.  If COMMUTATIVE_P
> > then
> > > +   the operation is also tried by flipping op1 and op2.  */
> > > +
> > > +tree adjacent_data_access_p (tree op1, tree op2, poly_uint64 *pos,
> > > +                            bool commutative_p) {
> > > +  gcc_assert (op1);
> > > +  gcc_assert (op2);
> > > +  if (TREE_CODE (op1) != TREE_CODE (op2)
> > > +      || TREE_TYPE (op1) != TREE_TYPE (op2))
> > > +    return NULL;
> > > +
> > > +  tree type = TREE_TYPE (op1);
> > > +  gimple *stmt1 = NULL, *stmt2 = NULL;  unsigned int bits =
> > > + GET_MODE_BITSIZE (GET_MODE_INNER (TYPE_MODE (type)));
> > > +
> > > +  if (TREE_CODE (op1) == BIT_FIELD_REF
> > > +      && operand_equal_p (TREE_OPERAND (op1, 0), TREE_OPERAND (op2,
> > 0), 0)
> > > +      && operand_equal_p (TREE_OPERAND (op1, 1), TREE_OPERAND (op2,
> > 1), 0)
> > > +      && known_eq (bit_field_size (op1), bits))
> > > +    {
> > > +      poly_uint64 offset1 = bit_field_offset (op1);
> > > +      poly_uint64 offset2 = bit_field_offset (op2);
> > > +      if (known_eq (offset2 - offset1, bits))
> > > +       {
> > > +         if (pos)
> > > +           *pos = offset1;
> > > +         return TREE_OPERAND (op1, 0);
> > > +       }
> > > +      else if (commutative_p && known_eq (offset1 - offset2, bits))
> > > +       {
> > > +         if (pos)
> > > +           *pos = offset2;
> > > +         return TREE_OPERAND (op1, 0);
> > > +       }
> > > +    }
> > > +  else if (TREE_CODE (op1) == ARRAY_REF
> > > +          && operand_equal_p (get_base_address (op1), get_base_address
> > (op2)))
> > > +    {
> > > +      wide_int size1 = wi::to_wide (array_ref_element_size (op1));
> > > +      wide_int size2 = wi::to_wide (array_ref_element_size (op2));
> > > +      if (wi::ne_p (size1, size2) || wi::ne_p (size1, bits / 8)
> > > +         || !tree_fits_poly_uint64_p (TREE_OPERAND (op1, 1))
> > > +         || !tree_fits_poly_uint64_p (TREE_OPERAND (op2, 1)))
> > > +       return NULL;
> > > +
> > > +      poly_uint64 offset1 = tree_to_poly_uint64 (TREE_OPERAND (op1, 1));
> > > +      poly_uint64 offset2 = tree_to_poly_uint64 (TREE_OPERAND (op2, 1));
> > > +      if (known_eq (offset2 - offset1, 1UL))
> > > +       {
> > > +         if (pos)
> > > +           *pos = offset1 * bits;
> > > +         return TREE_OPERAND (op1, 0);
> > > +       }
> > > +      else if (commutative_p && known_eq (offset1 - offset2, 1UL))
> > > +       {
> > > +         if (pos)
> > > +           *pos = offset2 * bits;
> > > +         return TREE_OPERAND (op1, 0);
> > > +       }
> > > +    }
> > > +  else if (TREE_CODE (op1) == SSA_NAME
> > > +          && (stmt1 = SSA_NAME_DEF_STMT (op1)) != NULL
> > > +          && (stmt2 = SSA_NAME_DEF_STMT (op2)) != NULL
> > > +          && is_gimple_assign (stmt1)
> > > +          && is_gimple_assign (stmt2))
> > > +    {
> > > +      if (gimple_assign_rhs_code (stmt1) != ARRAY_REF
> > > +         && gimple_assign_rhs_code (stmt1) != BIT_FIELD_REF
> > > +         && gimple_assign_rhs_code (stmt2) != ARRAY_REF
> > > +         && gimple_assign_rhs_code (stmt2) != BIT_FIELD_REF)
> > > +       return NULL;
> > > +
> > > +      return adjacent_data_access_p (gimple_assign_rhs1 (stmt1),
> > > +                                    gimple_assign_rhs1 (stmt2), pos,
> > > +                                    commutative_p);
> > > +    }
> > > +
> > > +  return NULL;
> > > +}
> > > +
> > >  /* If VECTOR_CST T has a single nonzero element, return the index of that
> > >     element, otherwise return -1.  */
> > >
> > >
> > >
> > >
> > >
> > > --
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs
  2022-11-07 10:17     ` Richard Biener
@ 2022-11-07 11:00       ` Tamar Christina
  2022-11-07 11:22         ` Richard Biener
  0 siblings, 1 reply; 50+ messages in thread
From: Tamar Christina @ 2022-11-07 11:00 UTC (permalink / raw)
  To: Richard Biener; +Cc: Richard Biener, gcc-patches, nd

> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Monday, November 7, 2022 10:18 AM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: Richard Biener <richard.guenther@gmail.com>; gcc-
> patches@gcc.gnu.org; nd <nd@arm.com>
> Subject: RE: [PATCH 1/8]middle-end: Recognize scalar reductions from
> bitfields and array_refs
> 
> On Mon, 7 Nov 2022, Tamar Christina wrote:
> 
> > > -----Original Message-----
> > > From: Richard Biener <richard.guenther@gmail.com>
> > > Sent: Saturday, November 5, 2022 11:33 AM
> > > To: Tamar Christina <Tamar.Christina@arm.com>
> > > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; rguenther@suse.de
> > > Subject: Re: [PATCH 1/8]middle-end: Recognize scalar reductions from
> > > bitfields and array_refs
> > >
> > > On Mon, Oct 31, 2022 at 1:00 PM Tamar Christina via Gcc-patches
> > > <gcc- patches@gcc.gnu.org> wrote:
> > > >
> > > > Hi All,
> > > >
> > > > This patch series is to add recognition of pairwise operations
> > > > (reductions) in match.pd such that we can benefit from them even
> > > > at
> > > > -O1 when the vectorizer isn't enabled.
> > > >
> > > > Ths use of these allow for a lot simpler codegen in AArch64 and
> > > > allows us to avoid quite a lot of codegen warts.
> > > >
> > > > As an example a simple:
> > > >
> > > > typedef float v4sf __attribute__((vector_size (16)));
> > > >
> > > > float
> > > > foo3 (v4sf x)
> > > > {
> > > >   return x[1] + x[2];
> > > > }
> > > >
> > > > currently generates:
> > > >
> > > > foo3:
> > > >         dup     s1, v0.s[1]
> > > >         dup     s0, v0.s[2]
> > > >         fadd    s0, s1, s0
> > > >         ret
> > > >
> > > > while with this patch series now generates:
> > > >
> > > > foo3:
> > > >         ext     v0.16b, v0.16b, v0.16b, #4
> > > >         faddp   s0, v0.2s
> > > >         ret
> > > >
> > > > This patch will not perform the operation if the source is not a
> > > > gimple register and leaves memory sources to the vectorizer as
> > > > it's able to deal correctly with clobbers.
> > >
> > > But the vectorizer should also be able to cope with the above.
> >
> > There are several problems with leaving it up to the vectorizer to do:
> >
> > 1. We only get it at -O2 and higher.
> > 2. The way the vectorizer costs the reduction makes the resulting cost
> always too high for AArch64.
> >
> > As an example the following:
> >
> > typedef unsigned int u32v4 __attribute__((vector_size(16))); unsigned
> > int f (u32v4 a, u32v4 b) {
> >     return a[0] + a[1];
> > }
> >
> > Doesn't get SLP'ed because the vectorizer costs it as:
> >
> > node 0x485eb30 0 times vec_perm costs 0 in body
> > _1 + _2 1 times vector_stmt costs 1 in body
> > _1 + _2 1 times vec_perm costs 2 in body
> > _1 + _2 1 times vec_to_scalar costs 2 in body
> >
> > And so ultimately you fail because:
> >
> > /app/example.c:8:17: note: Cost model analysis for part in loop 0:
> >   Vector cost: 5
> >   Scalar cost: 3
> >
> > This looks like it's because the vectorizer costs the operation to
> > create the BIT_FIELD_REF <a_3(D), 64, 0>; For the reduction as
> > requiring two scalar extracts and a permute. While it ultimately does
> produce a BIT_FIELD_REF <a_3(D), 64, 0>; that's not what it costs.
> >
> > This causes the reduction to almost always be more expensive, so
> > unless the rest of the SLP tree amortizes the cost we never generate them.
> 
> On x86 for example the hadds are prohibitly expensive here.  Are you sure
> the horizontal add is actually profitable on arm?  Your pattern-matching has
> no cost modeling at all?

Yes, they are dirt cheap; that's why we use them for a lot of our codegen,
e.g. for compressing values.

> 
> > 3. The SLP only happens on operation that are SLP shaped and where SLP
> didn't fail.
> >
> > As a simple example, the vectorizer can't SLP the following:
> >
> > unsigned int f (u32v4 a, u32v4 b)
> > {
> >     a[0] += b[0];
> >     return a[0] + a[1];
> > }
> >
> > Because there's not enough VF here and it can't unroll. This and many
> > others fail because they're not an SLP-able operation, or SLP build fails.
> 
> That's of course because the pattern matching for reductions is too simple
> here, getting us a group size of three.  Bad association would make your
> simple pattern matching fail as well.
> 
> > This causes us to generate for e.g. this example:
> >
> > f:
> >         dup     s2, v0.s[1]
> >         fmov    w1, s1
> >         add     v0.2s, v2.2s, v0.2s
> >         fmov    w0, s0
> >         add     w0, w0, w1
> >         ret
> >
> > instead of with my patch:
> >
> > f:
> >         addp    v0.2s, v0.2s, v0.2s
> >         add     v0.2s, v0.2s, v1.2s
> >         fmov    w0, s0
> >         ret
> >
> > which is significantly better code.  So I don't think the vectorizer is the right
> solution for this.
> 
> Simple pattern matching isn't either.  In fact basic-block SLP is supposed to be
> the advanced pattern matching including a cost model.

The cost model seems a bit moot here, at least on AArch64.  There is no sequence
of events that would make these pairwise operations more expensive than the
alternative, which is to do a vector extraction and cross a register file to do a
simple addition.

And in fact the ISA classifies these instructions as scalar not vector, and it doesn't
seem right to need the vectorizer for something that's basic codegen.

It seems like the problem here is that the current reductions are designed around
x86-specific limitations.  So perhaps the solution here is to just have an
AArch64-specific Gimple pass, gate this transform on a target hook, or add new
cheap reduction codes.

> 
> IMHO the correct approach is to improve that,
> vect_slp_check_for_constructors plus how we handle/recover from SLP
> discovery fails as in your second example above.

Is this feasible in the general sense?  SLP tree decomposition would then require
you to cost every possible subtree that gets built.  That seems quite expensive...

The bigger the tree, the more you have to decompose.

> 
> > > I don't think
> > > we want to do this as part of general folding.  Iff, then this
> > > belongs in specific points of the pass pipeline, no?
> >
> > The reason I currently have it as such is because in general the
> > compiler doesn't really deal with horizontal reductions at all.  Also
> > since the vectorizer itself can introduce reductions I figured it's
> > better to have one representation for this.  So admittedly perhaps this
> should only be done after vectorization as that's when today we expect
> reductions to be in Gimple.
> >
> > As for having it in a specific point in the pass pipeline, I have it
> > as a general one since a number of passes could create the form for
> > the reduction, for instance vec_lower could break up an operation to allow
> this to match.  The bigger BIT_FIELD_EXPR it creates could also lead to other
> optimizations.
> >
> > Additionally you had mentioned last time that Andrew was trying to
> > move min/max detection to match.pd So I had figured this was the correct
> place for it.
> 
> That's mostly because we have fold-const patterns for ?: min/max and CFG
> patterns for min/max in phiopt and it's possible to unify both.
> 
> > That said I have no intuition for what would be better here. Since the
> > check is quite cheap.  But do you have a particular place you want
> > this move to then?  Ideally I'd want it before the last FRE pass, but perhaps
> isel?
> 
> As said, I think it belongs where we can do costing which means the
> vectorizer.  Iff there are two/three instruction sequences that can be
> peepholed do it in the targets machine description instead.

We can't do it in RTL, because we don't know whether things are sequential until
after the register allocator, and for integer modes these would by then have been
assigned to scalar hard registers.  And so this is unrecoverable.

So quite literally, you cannot peephole this. You also cannot use combine
because there's no way to ensure that reload generates the same register
from subregs.

Thanks,
Tamar

> Richard.
> 
> > Thanks,
> > Tamar
> >
> > >
> > > > The use of these instruction makes a significant difference in
> > > > codegen quality for AArch64 and Arm.
> > > >
> > > > NOTE: The last entry in the series contains tests for all of the
> > > > previous patches as it's a bit of an all or nothing thing.
> > > >
> > > > Bootstrapped Regtested on aarch64-none-linux-gnu,
> > > > x86_64-pc-linux-gnu and no issues.
> > > >
> > > > Ok for master?
> > > >
> > > > Thanks,
> > > > Tamar
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > >         * match.pd (adjacent_data_access_p): Import.
> > > >         Add new pattern for bitwise plus, min, max, fmax, fmin.
> > > >         * tree-cfg.cc (verify_gimple_call): Allow function arguments in IFNs.
> > > >         * tree.cc (adjacent_data_access_p): New.
> > > >         * tree.h (adjacent_data_access_p): New.
> > > >
> > > > --- inline copy of patch --
> > > > diff --git a/gcc/match.pd b/gcc/match.pd index
> > > >
> > >
> 2617d56091dfbd41ae49f980ee0af3757f5ec1cf..aecaa3520b36e770d11ea9a10
> > > eb1
> > > > 8db23c0cd9f7 100644
> > > > --- a/gcc/match.pd
> > > > +++ b/gcc/match.pd
> > > > @@ -39,7 +39,8 @@ along with GCC; see the file COPYING3.  If not see
> > > >     HONOR_NANS
> > > >     uniform_vector_p
> > > >     expand_vec_cmp_expr_p
> > > > -   bitmask_inv_cst_vector_p)
> > > > +   bitmask_inv_cst_vector_p
> > > > +   adjacent_data_access_p)
> > > >
> > > >  /* Operator lists.  */
> > > >  (define_operator_list tcc_comparison @@ -7195,6 +7196,47 @@
> > > > DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> > > >
> > > >  /* Canonicalizations of BIT_FIELD_REFs.  */
> > > >
> > > > +/* Canonicalize BIT_FIELD_REFS to pairwise operations. */ (for op
> > > > +(plus min max FMIN_ALL FMAX_ALL)
> > > > +     ifn (IFN_REDUC_PLUS IFN_REDUC_MIN IFN_REDUC_MAX
> > > > +         IFN_REDUC_FMIN IFN_REDUC_FMAX)  (simplify
> > > > +  (op @0 @1)
> > > > +   (if (INTEGRAL_TYPE_P (type) || SCALAR_FLOAT_TYPE_P (type))
> > > > +    (with { poly_uint64 nloc = 0;
> > > > +           tree src = adjacent_data_access_p (@0, @1, &nloc, true);
> > > > +           tree ntype = build_vector_type (type, 2);
> > > > +           tree size = TYPE_SIZE (ntype);
> > > > +           tree pos = build_int_cst (TREE_TYPE (size), nloc);
> > > > +           poly_uint64 _sz;
> > > > +           poly_uint64 _total; }
> > > > +     (if (src && is_gimple_reg (src) && ntype
> > > > +         && poly_int_tree_p (size, &_sz)
> > > > +         && poly_int_tree_p (TYPE_SIZE (TREE_TYPE (src)), &_total)
> > > > +         && known_ge (_total, _sz + nloc))
> > > > +      (ifn (BIT_FIELD_REF:ntype { src; } { size; } { pos;
> > > > +})))))))
> > > > +
> > > > +(for op (lt gt)
> > > > +     ifni (IFN_REDUC_MIN IFN_REDUC_MAX)
> > > > +     ifnf (IFN_REDUC_FMIN IFN_REDUC_FMAX)  (simplify
> > > > +  (cond (op @0 @1) @0 @1)
> > > > +   (if (INTEGRAL_TYPE_P (type) || SCALAR_FLOAT_TYPE_P (type))
> > > > +    (with { poly_uint64 nloc = 0;
> > > > +           tree src = adjacent_data_access_p (@0, @1, &nloc, false);
> > > > +           tree ntype = build_vector_type (type, 2);
> > > > +           tree size = TYPE_SIZE (ntype);
> > > > +           tree pos = build_int_cst (TREE_TYPE (size), nloc);
> > > > +           poly_uint64 _sz;
> > > > +           poly_uint64 _total; }
> > > > +     (if (src && is_gimple_reg (src) && ntype
> > > > +         && poly_int_tree_p (size, &_sz)
> > > > +         && poly_int_tree_p (TYPE_SIZE (TREE_TYPE (src)), &_total)
> > > > +         && known_ge (_total, _sz + nloc))
> > > > +      (if (SCALAR_FLOAT_MODE_P (TYPE_MODE (type)))
> > > > +       (ifnf (BIT_FIELD_REF:ntype { src; } { size; } { pos; }))
> > > > +       (ifni (BIT_FIELD_REF:ntype { src; } { size; } { pos;
> > > > +}))))))))
> > > > +
> > > >  (simplify
> > > >   (BIT_FIELD_REF (BIT_FIELD_REF @0 @1 @2) @3 @4)
> > > >   (BIT_FIELD_REF @0 @3 { const_binop (PLUS_EXPR, bitsizetype, @2,
> > > > @4);
> > > > })) diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc index
> > > >
> > >
> 91ec33c80a41e1e0cc6224e137dd42144724a168..b19710392940cf469de52d006
> > > 603
> > > > ae1e3deb6b76 100644
> > > > --- a/gcc/tree-cfg.cc
> > > > +++ b/gcc/tree-cfg.cc
> > > > @@ -3492,6 +3492,7 @@ verify_gimple_call (gcall *stmt)
> > > >      {
> > > >        tree arg = gimple_call_arg (stmt, i);
> > > >        if ((is_gimple_reg_type (TREE_TYPE (arg))
> > > > +          && !is_gimple_variable (arg)
> > > >            && !is_gimple_val (arg))
> > > >           || (!is_gimple_reg_type (TREE_TYPE (arg))
> > > >               && !is_gimple_lvalue (arg))) diff --git a/gcc/tree.h
> > > > b/gcc/tree.h index
> > > >
> > >
> e6564aaccb7b69cd938ff60b6121aec41b7e8a59..8f8a9660c9e0605eb516de194
> > > 640
> > > > b8c1b531b798 100644
> > > > --- a/gcc/tree.h
> > > > +++ b/gcc/tree.h
> > > > @@ -5006,6 +5006,11 @@ extern bool integer_pow2p (const_tree);
> > > >
> > > >  extern tree bitmask_inv_cst_vector_p (tree);
> > > >
> > > > +/* TRUE if the two operands represent adjacent access of data such
> that a
> > > > +   pairwise operation can be used.  */
> > > > +
> > > > +extern tree adjacent_data_access_p (tree, tree, poly_uint64*,
> > > > +bool);
> > > > +
> > > >  /* integer_nonzerop (tree x) is nonzero if X is an integer constant
> > > >     with a nonzero value.  */
> > > >
> > > > diff --git a/gcc/tree.cc b/gcc/tree.cc index
> > > >
> > >
> 007c9325b17076f474e6681c49966c59cf6b91c7..5315af38a1ead89ca5f75dc4b1
> > > 9
> > > d
> > > > e9841e29d311 100644
> > > > --- a/gcc/tree.cc
> > > > +++ b/gcc/tree.cc
> > > > @@ -10457,6 +10457,90 @@ bitmask_inv_cst_vector_p (tree t)
> > > >    return builder.build ();
> > > >  }
> > > >
> > > > +/* Returns base address if the two operands represent adjacent
> > > > +access of
> > > data
> > > > +   such that a pairwise operation can be used.  OP1 must be a
> > > > + lower
> > > subpart
> > > > +   than OP2.  If POS is not NULL then on return if a value is returned
> POS
> > > > +   will indicate the position of the lower address.  If
> > > > + COMMUTATIVE_P
> > > then
> > > > +   the operation is also tried by flipping op1 and op2.  */
> > > > +
> > > > +tree adjacent_data_access_p (tree op1, tree op2, poly_uint64 *pos,
> > > > +                            bool commutative_p) {
> > > > +  gcc_assert (op1);
> > > > +  gcc_assert (op2);
> > > > +  if (TREE_CODE (op1) != TREE_CODE (op2)
> > > > +      || TREE_TYPE (op1) != TREE_TYPE (op2))
> > > > +    return NULL;
> > > > +
> > > > +  tree type = TREE_TYPE (op1);
> > > > +  gimple *stmt1 = NULL, *stmt2 = NULL;  unsigned int bits =
> > > > + GET_MODE_BITSIZE (GET_MODE_INNER (TYPE_MODE (type)));
> > > > +
> > > > +  if (TREE_CODE (op1) == BIT_FIELD_REF
> > > > +      && operand_equal_p (TREE_OPERAND (op1, 0), TREE_OPERAND
> > > > + (op2,
> > > 0), 0)
> > > > +      && operand_equal_p (TREE_OPERAND (op1, 1), TREE_OPERAND
> > > > + (op2,
> > > 1), 0)
> > > > +      && known_eq (bit_field_size (op1), bits))
> > > > +    {
> > > > +      poly_uint64 offset1 = bit_field_offset (op1);
> > > > +      poly_uint64 offset2 = bit_field_offset (op2);
> > > > +      if (known_eq (offset2 - offset1, bits))
> > > > +       {
> > > > +         if (pos)
> > > > +           *pos = offset1;
> > > > +         return TREE_OPERAND (op1, 0);
> > > > +       }
> > > > +      else if (commutative_p && known_eq (offset1 - offset2, bits))
> > > > +       {
> > > > +         if (pos)
> > > > +           *pos = offset2;
> > > > +         return TREE_OPERAND (op1, 0);
> > > > +       }
> > > > +    }
> > > > +  else if (TREE_CODE (op1) == ARRAY_REF
> > > > +          && operand_equal_p (get_base_address (op1),
> > > > + get_base_address
> > > (op2)))
> > > > +    {
> > > > +      wide_int size1 = wi::to_wide (array_ref_element_size (op1));
> > > > +      wide_int size2 = wi::to_wide (array_ref_element_size (op2));
> > > > +      if (wi::ne_p (size1, size2) || wi::ne_p (size1, bits / 8)
> > > > +         || !tree_fits_poly_uint64_p (TREE_OPERAND (op1, 1))
> > > > +         || !tree_fits_poly_uint64_p (TREE_OPERAND (op2, 1)))
> > > > +       return NULL;
> > > > +
> > > > +      poly_uint64 offset1 = tree_to_poly_uint64 (TREE_OPERAND (op1,
> 1));
> > > > +      poly_uint64 offset2 = tree_to_poly_uint64 (TREE_OPERAND (op2,
> 1));
> > > > +      if (known_eq (offset2 - offset1, 1UL))
> > > > +       {
> > > > +         if (pos)
> > > > +           *pos = offset1 * bits;
> > > > +         return TREE_OPERAND (op1, 0);
> > > > +       }
> > > > +      else if (commutative_p && known_eq (offset1 - offset2, 1UL))
> > > > +       {
> > > > +         if (pos)
> > > > +           *pos = offset2 * bits;
> > > > +         return TREE_OPERAND (op1, 0);
> > > > +       }
> > > > +    }
> > > > +  else if (TREE_CODE (op1) == SSA_NAME
> > > > +          && (stmt1 = SSA_NAME_DEF_STMT (op1)) != NULL
> > > > +          && (stmt2 = SSA_NAME_DEF_STMT (op2)) != NULL
> > > > +          && is_gimple_assign (stmt1)
> > > > +          && is_gimple_assign (stmt2))
> > > > +    {
> > > > +      if (gimple_assign_rhs_code (stmt1) != ARRAY_REF
> > > > +         && gimple_assign_rhs_code (stmt1) != BIT_FIELD_REF
> > > > +         && gimple_assign_rhs_code (stmt2) != ARRAY_REF
> > > > +         && gimple_assign_rhs_code (stmt2) != BIT_FIELD_REF)
> > > > +       return NULL;
> > > > +
> > > > +      return adjacent_data_access_p (gimple_assign_rhs1 (stmt1),
> > > > +                                    gimple_assign_rhs1 (stmt2), pos,
> > > > +                                    commutative_p);
> > > > +    }
> > > > +
> > > > +  return NULL;
> > > > +}
> > > > +
> > > >  /* If VECTOR_CST T has a single nonzero element, return the index of
> that
> > > >     element, otherwise return -1.  */
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > --
> >
> 
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461
> Nuernberg, Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald,
> Boudien Moerman; HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs
  2022-11-07 11:00       ` Tamar Christina
@ 2022-11-07 11:22         ` Richard Biener
  2022-11-07 11:56           ` Tamar Christina
  0 siblings, 1 reply; 50+ messages in thread
From: Richard Biener @ 2022-11-07 11:22 UTC (permalink / raw)
  To: Tamar Christina; +Cc: Richard Biener, gcc-patches, nd

On Mon, 7 Nov 2022, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguenther@suse.de>
> > Sent: Monday, November 7, 2022 10:18 AM
> > To: Tamar Christina <Tamar.Christina@arm.com>
> > Cc: Richard Biener <richard.guenther@gmail.com>; gcc-
> > patches@gcc.gnu.org; nd <nd@arm.com>
> > Subject: RE: [PATCH 1/8]middle-end: Recognize scalar reductions from
> > bitfields and array_refs
> > 
> > On Mon, 7 Nov 2022, Tamar Christina wrote:
> > 
> > > > -----Original Message-----
> > > > From: Richard Biener <richard.guenther@gmail.com>
> > > > Sent: Saturday, November 5, 2022 11:33 AM
> > > > To: Tamar Christina <Tamar.Christina@arm.com>
> > > > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; rguenther@suse.de
> > > > Subject: Re: [PATCH 1/8]middle-end: Recognize scalar reductions from
> > > > bitfields and array_refs
> > > >
> > > > On Mon, Oct 31, 2022 at 1:00 PM Tamar Christina via Gcc-patches
> > > > <gcc- patches@gcc.gnu.org> wrote:
> > > > >
> > > > > Hi All,
> > > > >
> > > > > This patch series is to add recognition of pairwise operations
> > > > > (reductions) in match.pd such that we can benefit from them even
> > > > > at
> > > > > -O1 when the vectorizer isn't enabled.
> > > > >
> > > > > Ths use of these allow for a lot simpler codegen in AArch64 and
> > > > > allows us to avoid quite a lot of codegen warts.
> > > > >
> > > > > As an example a simple:
> > > > >
> > > > > typedef float v4sf __attribute__((vector_size (16)));
> > > > >
> > > > > float
> > > > > foo3 (v4sf x)
> > > > > {
> > > > >   return x[1] + x[2];
> > > > > }
> > > > >
> > > > > currently generates:
> > > > >
> > > > > foo3:
> > > > >         dup     s1, v0.s[1]
> > > > >         dup     s0, v0.s[2]
> > > > >         fadd    s0, s1, s0
> > > > >         ret
> > > > >
> > > > > while with this patch series now generates:
> > > > >
> > > > > foo3:
> > > > >         ext     v0.16b, v0.16b, v0.16b, #4
> > > > >         faddp   s0, v0.2s
> > > > >         ret
> > > > >
> > > > > This patch will not perform the operation if the source is not a
> > > > > gimple register and leaves memory sources to the vectorizer as
> > > > > it's able to deal correctly with clobbers.
> > > >
> > > > But the vectorizer should also be able to cope with the above.
> > >
> > > There are several problems with leaving it up to the vectorizer to do:
> > >
> > > 1. We only get it at -O2 and higher.
> > > 2. The way the vectorizer costs the reduction makes the resulting cost
> > always too high for AArch64.
> > >
> > > As an example the following:
> > >
> > > typedef unsigned int u32v4 __attribute__((vector_size(16))); unsigned
> > > int f (u32v4 a, u32v4 b) {
> > >     return a[0] + a[1];
> > > }
> > >
> > > Doesn't get SLP'ed because the vectorizer costs it as:
> > >
> > > node 0x485eb30 0 times vec_perm costs 0 in body
> > > _1 + _2 1 times vector_stmt costs 1 in body
> > > _1 + _2 1 times vec_perm costs 2 in body
> > > _1 + _2 1 times vec_to_scalar costs 2 in body
> > >
> > > And so ultimately you fail because:
> > >
> > > /app/example.c:8:17: note: Cost model analysis for part in loop 0:
> > >   Vector cost: 5
> > >   Scalar cost: 3
> > >
> > > This looks like it's because the vectorizer costs the operation to
> > > create the BIT_FIELD_REF <a_3(D), 64, 0>; For the reduction as
> > > requiring two scalar extracts and a permute. While it ultimately does
> > produce a BIT_FIELD_REF <a_3(D), 64, 0>; that's not what it costs.
> > >
> > > This causes the reduction to almost always be more expensive, so
> > > unless the rest of the SLP tree amortizes the cost we never generate them.
> > 
> > On x86 for example the hadds are prohibitively expensive here.  Are you sure
> > the horizontal add is actually profitable on arm?  Your pattern-matching has
> > no cost modeling at all?
> 
> Yes, they are dirt cheap, that's why we use them for a lot of our codegen for
> e.g. compressing values.
> 
> > 
> > > 3. The SLP only happens on operation that are SLP shaped and where SLP
> > didn't fail.
> > >
> > > As a simple example, the vectorizer can't SLP the following:
> > >
> > > unsigned int f (u32v4 a, u32v4 b)
> > > {
> > >     a[0] += b[0];
> > >     return a[0] + a[1];
> > > }
> > >
> > > Because there's not enough VF here and it can't unroll. This and many
> > > others fail because they're not an SLP-able operation, or SLP build fails.
> > 
> > That's of course because the pattern matching for reductions is too simple
> > here, getting us a group size of three.  Bad association would make your
> > simple pattern matching fail as well.
> > 
> > > This causes us to generate for e.g. this example:
> > >
> > > f:
> > >         dup     s2, v0.s[1]
> > >         fmov    w1, s1
> > >         add     v0.2s, v2.2s, v0.2s
> > >         fmov    w0, s0
> > >         add     w0, w0, w1
> > >         ret
> > >
> > > instead of with my patch:
> > >
> > > f:
> > >         addp    v0.2s, v0.2s, v0.2s
> > >         add     v0.2s, v0.2s, v1.2s
> > >         fmov    w0, s0
> > >         ret
> > >
> > > which is significantly better code.  So I don't think the vectorizer is the right
> > solution for this.
> > 
> > Simple pattern matching isn't either.  In fact basic-block SLP is supposed to be
> > the advanced pattern matching including a cost model.
> 
> The cost model seems a bit moot here, at least on AArch64.  There is no sequence
> of events that would make these pairwise operations more expensive than the
> alternative, which is to do vector extraction and crossing a register file to do simple
> addition.
> 
> And in fact the ISA classifies these instructions as scalar not vector, and it doesn't
> seem right to need the vectorizer for something that's basic codegen.

I probably fail to decipher the asm, 'addp    v0.2s, v0.2s, v0.2s'
either hides the fact that the output is scalar or that the input is 
vector.

> It seems like the problem here is that the current reductions are designed around
> x86 specific limitations.  So perhaps the solution here is to just have an AArch64
> specific Gimple pass or gate this transform on a target hook, or new cheap reduction
> codes.

No, the current reductions are designed to be used by the vectorizer - you
are using them for GIMPLE peepholing.  x86 assumes cost modeling gets
applied before using them (but IIRC x86 doesn't have integer horizontal
reductions, but the backend has patterns to create optimal sequences
for them).

The problem with doing the pattern matching too early is that .REDUC_PLUS
isn't recognized widely so any followup simplifications are unlikely
(like reassociating after inlining, etc.).
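
As a rough sketch of the kind of missed followup meant here (purely
illustrative, not from the patch; the typedef matches the earlier examples
and the function names are made up):

typedef unsigned int u32v4 __attribute__((vector_size (16)));

static inline unsigned int
sum2 (u32v4 a)
{
  /* With the proposed match.pd pattern this add of two adjacent lanes is
     turned into .REDUC_PLUS of a two-element BIT_FIELD_REF of 'a'.  */
  return a[0] + a[1];
}

unsigned int
use (u32v4 a, unsigned int c)
{
  /* After inlining, plain a[0] + a[1] + c could still be reassociated and
     simplified, but a .REDUC_PLUS (...) + c form is opaque to most of the
     existing match.pd/reassoc patterns.  */
  return sum2 (a) + c;
}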

> > 
> > IMHO the correct approach is to improve that,
> > vect_slp_check_for_constructors plus how we handle/recover from SLP
> > discovery fails as in your second example above.
> 
> Is this feasible in the general sense? SLP tree decomposition would then require
> you to cost every sub tree possible that gets built.  That seems quite expensive...

Well, that's kind-of what we do.  But how do you figure the "optimal"
way to match a[0] + a[1] + a[0] + a[1]?

Your match.pd pattern will apply on each and every add in a chain;
the SLP pattern matching is careful to only start matching from
the _last_ element of a chain, so it is actually cheaper.  It just
isn't very clever in pruning a non-power-of-two chain or in
splitting a chain at points where different sources come in.
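
A minimal example of the "different sources" case (again illustrative only,
assuming the u32v4 typedef from the examples above):

unsigned int
mixed_chain (u32v4 a, unsigned int c)
{
  /* The add chain mixes lanes of 'a' with an unrelated scalar 'c'; an
     ideal matcher would split the chain after a[0] + a[1] and reduce only
     that part.  */
  return a[0] + a[1] + c;
}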

> The bigger the tree, the more you have to decompose and the longer it takes.

And the more (random) places you have where eventually your pattern
matches.

> > 
> > > > I don't think
> > > > we want to do this as part of general folding.  Iff, then this
> > > > belongs in specific points of the pass pipeline, no?
> > >
> > > The reason I currently have it as such is because in general the
> > > compiler doesn't really deal with horizontal reductions at all.  Also
> > > since the vectorizer itself can introduce reductions I figured it's
> > > better to have one representation for this.  So admittedly perhaps this
> > should only be done after vectorization as that's when today we expect
> > reductions to be in Gimple.
> > >
> > > As for having it in a specific point in the pass pipeline, I have it
> > > as a general one since a number of passes could create the form for
> > > the reduction, for instance vec_lower could break up an operation to allow
> > this to match.  The bigger BIT_FIELD_EXPR it creates could also lead to other
> > optimizations.
> > >
> > > Additionally you had mentioned last time that Andrew was trying to
> > > move min/max detection to match.pd So I had figured this was the correct
> > place for it.
> > 
> > That's mostly because we have fold-const patterns for ?: min/max and CFG
> > patterns for min/max in phiopt and it's possible to unify both.
> > 
> > > That said I have no intuition for what would be better here. Since the
> > > check is quite cheap.  But do you have a particular place you want
> > > this move to then?  Ideally I'd want it before the last FRE pass, but perhaps
> > isel?
> > 
> > As said, I think it belongs where we can do costing which means the
> > vectorizer.  Iff there are two/three instruction sequences that can be
> > peepholed do it in the targets machine description instead.
> 
> We can't do it in RTL, because we don't know whether things are sequential
> until after register allocation; for integer modes these would by then have
> been assigned to scalar hard registers.  And so this is unrecoverable.
> 
> So quite literally, you cannot peephole this. You also cannot use combine
> because there's no way to ensure that reload generates the same register
> from subregs.

So it's not easily possible within the current infrastructure.  But
it does look like ARM might eventually benefit from something like
STV on x86?

Richard.

> Thanks,
> Tamar
> 
> > Richard.
> > 
> > > Thanks,
> > > Tamar
> > >
> > > >
> > > > > The use of these instruction makes a significant difference in
> > > > > codegen quality for AArch64 and Arm.
> > > > >
> > > > > NOTE: The last entry in the series contains tests for all of the
> > > > > previous patches as it's a bit of an all or nothing thing.
> > > > >
> > > > > Bootstrapped Regtested on aarch64-none-linux-gnu,
> > > > > x86_64-pc-linux-gnu and no issues.
> > > > >
> > > > > Ok for master?
> > > > >
> > > > > Thanks,
> > > > > Tamar
> > > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > >         * match.pd (adjacent_data_access_p): Import.
> > > > >         Add new pattern for bitwise plus, min, max, fmax, fmin.
> > > > >         * tree-cfg.cc (verify_gimple_call): Allow function arguments in IFNs.
> > > > >         * tree.cc (adjacent_data_access_p): New.
> > > > >         * tree.h (adjacent_data_access_p): New.
> > > > >
> > > > > --- inline copy of patch --
> > > > > diff --git a/gcc/match.pd b/gcc/match.pd index
> > > > >
> > > >
> > 2617d56091dfbd41ae49f980ee0af3757f5ec1cf..aecaa3520b36e770d11ea9a10
> > > > eb1
> > > > > 8db23c0cd9f7 100644
> > > > > --- a/gcc/match.pd
> > > > > +++ b/gcc/match.pd
> > > > > @@ -39,7 +39,8 @@ along with GCC; see the file COPYING3.  If not see
> > > > >     HONOR_NANS
> > > > >     uniform_vector_p
> > > > >     expand_vec_cmp_expr_p
> > > > > -   bitmask_inv_cst_vector_p)
> > > > > +   bitmask_inv_cst_vector_p
> > > > > +   adjacent_data_access_p)
> > > > >
> > > > >  /* Operator lists.  */
> > > > >  (define_operator_list tcc_comparison @@ -7195,6 +7196,47 @@
> > > > > DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> > > > >
> > > > >  /* Canonicalizations of BIT_FIELD_REFs.  */
> > > > >
> > > > > +/* Canonicalize BIT_FIELD_REFS to pairwise operations. */ (for op
> > > > > +(plus min max FMIN_ALL FMAX_ALL)
> > > > > +     ifn (IFN_REDUC_PLUS IFN_REDUC_MIN IFN_REDUC_MAX
> > > > > +         IFN_REDUC_FMIN IFN_REDUC_FMAX)  (simplify
> > > > > +  (op @0 @1)
> > > > > +   (if (INTEGRAL_TYPE_P (type) || SCALAR_FLOAT_TYPE_P (type))
> > > > > +    (with { poly_uint64 nloc = 0;
> > > > > +           tree src = adjacent_data_access_p (@0, @1, &nloc, true);
> > > > > +           tree ntype = build_vector_type (type, 2);
> > > > > +           tree size = TYPE_SIZE (ntype);
> > > > > +           tree pos = build_int_cst (TREE_TYPE (size), nloc);
> > > > > +           poly_uint64 _sz;
> > > > > +           poly_uint64 _total; }
> > > > > +     (if (src && is_gimple_reg (src) && ntype
> > > > > +         && poly_int_tree_p (size, &_sz)
> > > > > +         && poly_int_tree_p (TYPE_SIZE (TREE_TYPE (src)), &_total)
> > > > > +         && known_ge (_total, _sz + nloc))
> > > > > +      (ifn (BIT_FIELD_REF:ntype { src; } { size; } { pos;
> > > > > +})))))))
> > > > > +
> > > > > +(for op (lt gt)
> > > > > +     ifni (IFN_REDUC_MIN IFN_REDUC_MAX)
> > > > > +     ifnf (IFN_REDUC_FMIN IFN_REDUC_FMAX)  (simplify
> > > > > +  (cond (op @0 @1) @0 @1)
> > > > > +   (if (INTEGRAL_TYPE_P (type) || SCALAR_FLOAT_TYPE_P (type))
> > > > > +    (with { poly_uint64 nloc = 0;
> > > > > +           tree src = adjacent_data_access_p (@0, @1, &nloc, false);
> > > > > +           tree ntype = build_vector_type (type, 2);
> > > > > +           tree size = TYPE_SIZE (ntype);
> > > > > +           tree pos = build_int_cst (TREE_TYPE (size), nloc);
> > > > > +           poly_uint64 _sz;
> > > > > +           poly_uint64 _total; }
> > > > > +     (if (src && is_gimple_reg (src) && ntype
> > > > > +         && poly_int_tree_p (size, &_sz)
> > > > > +         && poly_int_tree_p (TYPE_SIZE (TREE_TYPE (src)), &_total)
> > > > > +         && known_ge (_total, _sz + nloc))
> > > > > +      (if (SCALAR_FLOAT_MODE_P (TYPE_MODE (type)))
> > > > > +       (ifnf (BIT_FIELD_REF:ntype { src; } { size; } { pos; }))
> > > > > +       (ifni (BIT_FIELD_REF:ntype { src; } { size; } { pos;
> > > > > +}))))))))
> > > > > +
> > > > >  (simplify
> > > > >   (BIT_FIELD_REF (BIT_FIELD_REF @0 @1 @2) @3 @4)
> > > > >   (BIT_FIELD_REF @0 @3 { const_binop (PLUS_EXPR, bitsizetype, @2,
> > > > > @4);
> > > > > })) diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc index
> > > > >
> > > >
> > 91ec33c80a41e1e0cc6224e137dd42144724a168..b19710392940cf469de52d006
> > > > 603
> > > > > ae1e3deb6b76 100644
> > > > > --- a/gcc/tree-cfg.cc
> > > > > +++ b/gcc/tree-cfg.cc
> > > > > @@ -3492,6 +3492,7 @@ verify_gimple_call (gcall *stmt)
> > > > >      {
> > > > >        tree arg = gimple_call_arg (stmt, i);
> > > > >        if ((is_gimple_reg_type (TREE_TYPE (arg))
> > > > > +          && !is_gimple_variable (arg)
> > > > >            && !is_gimple_val (arg))
> > > > >           || (!is_gimple_reg_type (TREE_TYPE (arg))
> > > > >               && !is_gimple_lvalue (arg))) diff --git a/gcc/tree.h
> > > > > b/gcc/tree.h index
> > > > >
> > > >
> > e6564aaccb7b69cd938ff60b6121aec41b7e8a59..8f8a9660c9e0605eb516de194
> > > > 640
> > > > > b8c1b531b798 100644
> > > > > --- a/gcc/tree.h
> > > > > +++ b/gcc/tree.h
> > > > > @@ -5006,6 +5006,11 @@ extern bool integer_pow2p (const_tree);
> > > > >
> > > > >  extern tree bitmask_inv_cst_vector_p (tree);
> > > > >
> > > > > +/* TRUE if the two operands represent adjacent access of data such
> > that a
> > > > > +   pairwise operation can be used.  */
> > > > > +
> > > > > +extern tree adjacent_data_access_p (tree, tree, poly_uint64*,
> > > > > +bool);
> > > > > +
> > > > >  /* integer_nonzerop (tree x) is nonzero if X is an integer constant
> > > > >     with a nonzero value.  */
> > > > >
> > > > > diff --git a/gcc/tree.cc b/gcc/tree.cc index
> > > > >
> > > >
> > 007c9325b17076f474e6681c49966c59cf6b91c7..5315af38a1ead89ca5f75dc4b1
> > > > 9
> > > > d
> > > > > e9841e29d311 100644
> > > > > --- a/gcc/tree.cc
> > > > > +++ b/gcc/tree.cc
> > > > > @@ -10457,6 +10457,90 @@ bitmask_inv_cst_vector_p (tree t)
> > > > >    return builder.build ();
> > > > >  }
> > > > >
> > > > > +/* Returns base address if the two operands represent adjacent
> > > > > +access of
> > > > data
> > > > > +   such that a pairwise operation can be used.  OP1 must be a
> > > > > + lower
> > > > subpart
> > > > > +   than OP2.  If POS is not NULL then on return if a value is returned
> > POS
> > > > > +   will indicate the position of the lower address.  If
> > > > > + COMMUTATIVE_P
> > > > then
> > > > > +   the operation is also tried by flipping op1 and op2.  */
> > > > > +
> > > > > +tree adjacent_data_access_p (tree op1, tree op2, poly_uint64 *pos,
> > > > > +                            bool commutative_p) {
> > > > > +  gcc_assert (op1);
> > > > > +  gcc_assert (op2);
> > > > > +  if (TREE_CODE (op1) != TREE_CODE (op2)
> > > > > +      || TREE_TYPE (op1) != TREE_TYPE (op2))
> > > > > +    return NULL;
> > > > > +
> > > > > +  tree type = TREE_TYPE (op1);
> > > > > +  gimple *stmt1 = NULL, *stmt2 = NULL;  unsigned int bits =
> > > > > + GET_MODE_BITSIZE (GET_MODE_INNER (TYPE_MODE (type)));
> > > > > +
> > > > > +  if (TREE_CODE (op1) == BIT_FIELD_REF
> > > > > +      && operand_equal_p (TREE_OPERAND (op1, 0), TREE_OPERAND
> > > > > + (op2,
> > > > 0), 0)
> > > > > +      && operand_equal_p (TREE_OPERAND (op1, 1), TREE_OPERAND
> > > > > + (op2,
> > > > 1), 0)
> > > > > +      && known_eq (bit_field_size (op1), bits))
> > > > > +    {
> > > > > +      poly_uint64 offset1 = bit_field_offset (op1);
> > > > > +      poly_uint64 offset2 = bit_field_offset (op2);
> > > > > +      if (known_eq (offset2 - offset1, bits))
> > > > > +       {
> > > > > +         if (pos)
> > > > > +           *pos = offset1;
> > > > > +         return TREE_OPERAND (op1, 0);
> > > > > +       }
> > > > > +      else if (commutative_p && known_eq (offset1 - offset2, bits))
> > > > > +       {
> > > > > +         if (pos)
> > > > > +           *pos = offset2;
> > > > > +         return TREE_OPERAND (op1, 0);
> > > > > +       }
> > > > > +    }
> > > > > +  else if (TREE_CODE (op1) == ARRAY_REF
> > > > > +          && operand_equal_p (get_base_address (op1),
> > > > > + get_base_address
> > > > (op2)))
> > > > > +    {
> > > > > +      wide_int size1 = wi::to_wide (array_ref_element_size (op1));
> > > > > +      wide_int size2 = wi::to_wide (array_ref_element_size (op2));
> > > > > +      if (wi::ne_p (size1, size2) || wi::ne_p (size1, bits / 8)
> > > > > +         || !tree_fits_poly_uint64_p (TREE_OPERAND (op1, 1))
> > > > > +         || !tree_fits_poly_uint64_p (TREE_OPERAND (op2, 1)))
> > > > > +       return NULL;
> > > > > +
> > > > > +      poly_uint64 offset1 = tree_to_poly_uint64 (TREE_OPERAND (op1,
> > 1));
> > > > > +      poly_uint64 offset2 = tree_to_poly_uint64 (TREE_OPERAND (op2,
> > 1));
> > > > > +      if (known_eq (offset2 - offset1, 1UL))
> > > > > +       {
> > > > > +         if (pos)
> > > > > +           *pos = offset1 * bits;
> > > > > +         return TREE_OPERAND (op1, 0);
> > > > > +       }
> > > > > +      else if (commutative_p && known_eq (offset1 - offset2, 1UL))
> > > > > +       {
> > > > > +         if (pos)
> > > > > +           *pos = offset2 * bits;
> > > > > +         return TREE_OPERAND (op1, 0);
> > > > > +       }
> > > > > +    }
> > > > > +  else if (TREE_CODE (op1) == SSA_NAME
> > > > > +          && (stmt1 = SSA_NAME_DEF_STMT (op1)) != NULL
> > > > > +          && (stmt2 = SSA_NAME_DEF_STMT (op2)) != NULL
> > > > > +          && is_gimple_assign (stmt1)
> > > > > +          && is_gimple_assign (stmt2))
> > > > > +    {
> > > > > +      if (gimple_assign_rhs_code (stmt1) != ARRAY_REF
> > > > > +         && gimple_assign_rhs_code (stmt1) != BIT_FIELD_REF
> > > > > +         && gimple_assign_rhs_code (stmt2) != ARRAY_REF
> > > > > +         && gimple_assign_rhs_code (stmt2) != BIT_FIELD_REF)
> > > > > +       return NULL;
> > > > > +
> > > > > +      return adjacent_data_access_p (gimple_assign_rhs1 (stmt1),
> > > > > +                                    gimple_assign_rhs1 (stmt2), pos,
> > > > > +                                    commutative_p);
> > > > > +    }
> > > > > +
> > > > > +  return NULL;
> > > > > +}
> > > > > +
> > > > >  /* If VECTOR_CST T has a single nonzero element, return the index of
> > that
> > > > >     element, otherwise return -1.  */
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > >
> > 
> > --
> > Richard Biener <rguenther@suse.de>
> > SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461
> > Nuernberg, Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald,
> > Boudien Moerman; HRB 36809 (AG Nuernberg)
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs
  2022-11-07 11:22         ` Richard Biener
@ 2022-11-07 11:56           ` Tamar Christina
  2022-11-22 10:36             ` Richard Sandiford
  0 siblings, 1 reply; 50+ messages in thread
From: Tamar Christina @ 2022-11-07 11:56 UTC (permalink / raw)
  To: Richard Biener; +Cc: Richard Biener, gcc-patches, nd

> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Monday, November 7, 2022 11:23 AM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: Richard Biener <richard.guenther@gmail.com>; gcc-
> patches@gcc.gnu.org; nd <nd@arm.com>
> Subject: RE: [PATCH 1/8]middle-end: Recognize scalar reductions from
> bitfields and array_refs
> 
> On Mon, 7 Nov 2022, Tamar Christina wrote:
> 
> > > -----Original Message-----
> > > From: Richard Biener <rguenther@suse.de>
> > > Sent: Monday, November 7, 2022 10:18 AM
> > > To: Tamar Christina <Tamar.Christina@arm.com>
> > > Cc: Richard Biener <richard.guenther@gmail.com>; gcc-
> > > patches@gcc.gnu.org; nd <nd@arm.com>
> > > Subject: RE: [PATCH 1/8]middle-end: Recognize scalar reductions from
> > > bitfields and array_refs
> > >
> > > On Mon, 7 Nov 2022, Tamar Christina wrote:
> > >
> > > > > -----Original Message-----
> > > > > From: Richard Biener <richard.guenther@gmail.com>
> > > > > Sent: Saturday, November 5, 2022 11:33 AM
> > > > > To: Tamar Christina <Tamar.Christina@arm.com>
> > > > > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>;
> rguenther@suse.de
> > > > > Subject: Re: [PATCH 1/8]middle-end: Recognize scalar reductions
> > > > > from bitfields and array_refs
> > > > >
> > > > > On Mon, Oct 31, 2022 at 1:00 PM Tamar Christina via Gcc-patches
> > > > > <gcc- patches@gcc.gnu.org> wrote:
> > > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > This patch series is to add recognition of pairwise operations
> > > > > > (reductions) in match.pd such that we can benefit from them
> > > > > > even at
> > > > > > -O1 when the vectorizer isn't enabled.
> > > > > >
> > > > > > Ths use of these allow for a lot simpler codegen in AArch64
> > > > > > and allows us to avoid quite a lot of codegen warts.
> > > > > >
> > > > > > As an example a simple:
> > > > > >
> > > > > > typedef float v4sf __attribute__((vector_size (16)));
> > > > > >
> > > > > > float
> > > > > > foo3 (v4sf x)
> > > > > > {
> > > > > >   return x[1] + x[2];
> > > > > > }
> > > > > >
> > > > > > currently generates:
> > > > > >
> > > > > > foo3:
> > > > > >         dup     s1, v0.s[1]
> > > > > >         dup     s0, v0.s[2]
> > > > > >         fadd    s0, s1, s0
> > > > > >         ret
> > > > > >
> > > > > > while with this patch series now generates:
> > > > > >
> > > > > > foo3:
> > > > > >         ext     v0.16b, v0.16b, v0.16b, #4
> > > > > >         faddp   s0, v0.2s
> > > > > >         ret
> > > > > >
> > > > > > This patch will not perform the operation if the source is not
> > > > > > a gimple register and leaves memory sources to the vectorizer
> > > > > > as it's able to deal correctly with clobbers.
> > > > >
> > > > > But the vectorizer should also be able to cope with the above.
> > > >
> > > > There are several problems with leaving it up to the vectorizer to do:
> > > >
> > > > 1. We only get it at -O2 and higher.
> > > > 2. The way the vectorizer costs the reduction makes the resulting
> > > > cost
> > > always too high for AArch64.
> > > >
> > > > As an example the following:
> > > >
> > > > typedef unsigned int u32v4 __attribute__((vector_size(16)));
> > > > unsigned int f (u32v4 a, u32v4 b) {
> > > >     return a[0] + a[1];
> > > > }
> > > >
> > > > Doesn't get SLP'ed because the vectorizer costs it as:
> > > >
> > > > node 0x485eb30 0 times vec_perm costs 0 in body
> > > > _1 + _2 1 times vector_stmt costs 1 in body
> > > > _1 + _2 1 times vec_perm costs 2 in body
> > > > _1 + _2 1 times vec_to_scalar costs 2 in body
> > > >
> > > > And so ultimately you fail because:
> > > >
> > > > /app/example.c:8:17: note: Cost model analysis for part in loop 0:
> > > >   Vector cost: 5
> > > >   Scalar cost: 3
> > > >
> > > > This looks like it's because the vectorizer costs the operation to
> > > > create the BIT_FIELD_REF <a_3(D), 64, 0>; For the reduction as
> > > > requiring two scalar extracts and a permute. While it ultimately
> > > > does
> > > produce a BIT_FIELD_REF <a_3(D), 64, 0>; that's not what it costs.
> > > >
> > > > This causes the reduction to almost always be more expensive, so
> > > > unless the rest of the SLP tree amortizes the cost we never generate
> them.
> > >
> > > On x86 for example the hadds are prohibitively expensive here.  Are you
> > > sure the horizontal add is actually profitable on arm?  Your
> > > pattern-matching has no cost modeling at all?
> >
> > Yes, they are dirt cheap, that's why we use them for a lot of our
> > codegen for e.g. compressing values.
> >
> > >
> > > > 3. The SLP only happens on operation that are SLP shaped and where
> > > > SLP
> > > didn't fail.
> > > >
> > > > As a simple example, the vectorizer can't SLP the following:
> > > >
> > > > unsigned int f (u32v4 a, u32v4 b)
> > > > {
> > > >     a[0] += b[0];
> > > >     return a[0] + a[1];
> > > > }
> > > >
> > > > Because there's not enough VF here and it can't unroll. This and
> > > > many others fail because they're not an SLP-able operation, or SLP build
> fails.
> > >
> > > That's of course because the pattern matching for reductions is too
> > > simple here, getting us a group size of three.  Bad association
> > > would make your simple pattern matching fail as well.
> > >
> > > > This causes us to generate for e.g. this example:
> > > >
> > > > f:
> > > >         dup     s2, v0.s[1]
> > > >         fmov    w1, s1
> > > >         add     v0.2s, v2.2s, v0.2s
> > > >         fmov    w0, s0
> > > >         add     w0, w0, w1
> > > >         ret
> > > >
> > > > instead of with my patch:
> > > >
> > > > f:
> > > >         addp    v0.2s, v0.2s, v0.2s
> > > >         add     v0.2s, v0.2s, v1.2s
> > > >         fmov    w0, s0
> > > >         ret
> > > >
> > > > which is significantly better code.  So I don't think the
> > > > vectorizer is the right
> > > solution for this.
> > >
> > > Simple pattern matching isn't either.  In fact basic-block SLP is
> > > supposed to be the advanced pattern matching including a cost model.
> >
> > The cost model seems a bit moot here, at least on AArch64.  There is
> > no sequence of events that would make these pairwise operations more
> > expensive than the alternative, which is to do vector extraction and
> > crossing a register file to do simple addition.
> >
> > And in fact the ISA classifies these instructions as scalar not
> > vector, and it doesn't seem right to need the vectorizer for something
> that's basic codegen.
> 
> I probably fail to decipher the asm, 'addp    v0.2s, v0.2s, v0.2s'
> either hides the fact that the output is scalar or that the input is vector.

That's because of the codegen trick we use to get it for integers as well.
E.g. https://developer.arm.com/documentation/dui0801/h/A64-SIMD-Scalar-Instructions/FADDP--scalar-
is the reduction for floats.  The addp v0.2s form is just because there wasn't a
point in having both a three-operand and a two-operand version of the instruction.

> 
> > It seems like the problem here is that the current reductions are
> > designed around
> > x86 specific limitations.  So perhaps the solution here is to just
> > have an AArch64 specific Gimple pass or gate this transform on a
> > target hook, or new cheap reduction codes.
> 
> No, the current reductions are designed to be used by the vectorizer - you
> are using them for GIMPLE peepholing.  x86 assumes cost modeling gets
> applied before using them (but IIRC x86 doesn't have integer horizontal
> reductions, but the backend has patterns to create optimal sequences for
> them).
> 
> The problem with doing the pattern matching too early is that .REDUC_PLUS
> isn't recognized widely so any followup simplifications are unlikely (like
> reassociating after inlining, etc.). 

Agreed, but that can be solved by doing the replacement late per the previous emails.

> > >
> > > IMHO the correct approach is to improve that,
> > > vect_slp_check_for_constructors plus how we handle/recover from SLP
> > > discovery fails as in your second example above.
> >
> > Is this feasible in the general sense? SLP tree decomposition would
> > then require you to cost every sub tree possible that gets built.  That seems
> quite expensive...
> 
> Well, that's kind-of what we do.  But how do you figure the "optimal"
> way to match a[0] + a[1] + a[0] + a[1]?
> 

I'd expect either pre- or post-order to end up doing the optimal thing,
so either match the first two, or the second two, first.

If it decided to match the middle two that's also fine, but that requires
re-association to get to the final sequence.

> Your match.pd pattern will apply on each and every add in a chain; the SLP
> pattern matching is careful to only start matching from the _last_ element of
> a chain, so it is actually cheaper.  It just isn't very clever in pruning a non-power-
> of-two chain or in splitting a chain at points where different sources come in.
> 
> > The bigger the tree, the more you have to decompose and the longer it takes.
> 
> And the more (random) places you have where eventually your pattern
> matches.
> 

Yes true, but isn't the point of match.pd to amortize the costs of doing this by
grouping the same class of operations together?

> > >
> > > > > I don't think
> > > > > we want to do this as part of general folding.  Iff, then this
> > > > > belongs in specific points of the pass pipeline, no?
> > > >
> > > > The reason I currently have it as such is because in general the
> > > > compiler doesn't really deal with horizontal reductions at all.
> > > > Also since the vectorizer itself can introduce reductions I
> > > > figured it's better to have one representation for this.  So
> > > > admittedly perhaps this
> > > should only be done after vectorization as that's when today we
> > > expect reductions to be in Gimple.
> > > >
> > > > As for having it in a specific point in the pass pipeline, I have
> > > > it as a general one since a number of passes could create the form
> > > > for the reduction, for instance vec_lower could break up an
> > > > operation to allow
> > > this to match.  The bigger BIT_FIELD_EXPR it creates could also lead
> > > to other optimizations.
> > > >
> > > > Additionally you had mentioned last time that Andrew was trying to
> > > > move min/max detection to match.pd So I had figured this was the
> > > > correct
> > > place for it.
> > >
> > > That's mostly because we have fold-const patterns for ?: min/max and
> > > CFG patterns for min/max in phiopt and it's possible to unify both.
> > >
> > > > That said I have no intuition for what would be better here. Since
> > > > the check is quite cheap.  But do you have a particular place you
> > > > want this move to then?  Ideally I'd want it before the last FRE
> > > > pass, but perhaps
> > > isel?
> > >
> > > As said, I think it belongs where we can do costing which means the
> > > vectorizer.  Iff there are two/three instruction sequences that can
> > > be peepholed do it in the targets machine description instead.
> >
> > We can't do it in RTL, because we don't know whether things are
> > sequential until after register allocation; for integer modes these
> > would by then have been assigned to scalar hard registers.  And so this
> > is unrecoverable.
> >
> > So quite literally, you cannot peephole this. You also cannot use
> > combine because there's no way to ensure that reload generates the
> > same register from subregs.
> 
> So it's not easily possible within the current infrastructure.  But it does look
> like ARM might eventually benefit from something like STV on x86?
> 

I'm not sure.  The problem with trying to do this in RTL is that you'd have to be
able to decide from two pseudos whether they come from extracts that are
sequential.  When coming in from a hard register that's easy, yes.  When coming in
from a load, or any other operation that produces pseudos, that becomes harder.

But ok, I guess from this thread I can see the patch is dead so I'll drop it.

Thanks,
Tamar

> Richard.
> 
> > Thanks,
> > Tamar
> >
> > > Richard.
> > >
> > > > Thanks,
> > > > Tamar
> > > >
> > > > >
> > > > > > The use of these instruction makes a significant difference in
> > > > > > codegen quality for AArch64 and Arm.
> > > > > >
> > > > > > NOTE: The last entry in the series contains tests for all of
> > > > > > the previous patches as it's a bit of an all or nothing thing.
> > > > > >
> > > > > > Bootstrapped Regtested on aarch64-none-linux-gnu,
> > > > > > x86_64-pc-linux-gnu and no issues.
> > > > > >
> > > > > > Ok for master?
> > > > > >
> > > > > > Thanks,
> > > > > > Tamar
> > > > > >
> > > > > > gcc/ChangeLog:
> > > > > >
> > > > > >         * match.pd (adjacent_data_access_p): Import.
> > > > > >         Add new pattern for bitwise plus, min, max, fmax, fmin.
> > > > > >         * tree-cfg.cc (verify_gimple_call): Allow function arguments in
> IFNs.
> > > > > >         * tree.cc (adjacent_data_access_p): New.
> > > > > >         * tree.h (adjacent_data_access_p): New.
> > > > > >
> > > > > > --- inline copy of patch --
> > > > > > diff --git a/gcc/match.pd b/gcc/match.pd index
> > > > > >
> > > > >
> > >
> 2617d56091dfbd41ae49f980ee0af3757f5ec1cf..aecaa3520b36e770d11ea9a10
> > > > > eb1
> > > > > > 8db23c0cd9f7 100644
> > > > > > --- a/gcc/match.pd
> > > > > > +++ b/gcc/match.pd
> > > > > > @@ -39,7 +39,8 @@ along with GCC; see the file COPYING3.  If not
> see
> > > > > >     HONOR_NANS
> > > > > >     uniform_vector_p
> > > > > >     expand_vec_cmp_expr_p
> > > > > > -   bitmask_inv_cst_vector_p)
> > > > > > +   bitmask_inv_cst_vector_p
> > > > > > +   adjacent_data_access_p)
> > > > > >
> > > > > >  /* Operator lists.  */
> > > > > >  (define_operator_list tcc_comparison @@ -7195,6 +7196,47 @@
> > > > > > DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> > > > > >
> > > > > >  /* Canonicalizations of BIT_FIELD_REFs.  */
> > > > > >
> > > > > > +/* Canonicalize BIT_FIELD_REFS to pairwise operations. */
> > > > > > +(for op (plus min max FMIN_ALL FMAX_ALL)
> > > > > > +     ifn (IFN_REDUC_PLUS IFN_REDUC_MIN IFN_REDUC_MAX
> > > > > > +         IFN_REDUC_FMIN IFN_REDUC_FMAX)  (simplify
> > > > > > +  (op @0 @1)
> > > > > > +   (if (INTEGRAL_TYPE_P (type) || SCALAR_FLOAT_TYPE_P (type))
> > > > > > +    (with { poly_uint64 nloc = 0;
> > > > > > +           tree src = adjacent_data_access_p (@0, @1, &nloc, true);
> > > > > > +           tree ntype = build_vector_type (type, 2);
> > > > > > +           tree size = TYPE_SIZE (ntype);
> > > > > > +           tree pos = build_int_cst (TREE_TYPE (size), nloc);
> > > > > > +           poly_uint64 _sz;
> > > > > > +           poly_uint64 _total; }
> > > > > > +     (if (src && is_gimple_reg (src) && ntype
> > > > > > +         && poly_int_tree_p (size, &_sz)
> > > > > > +         && poly_int_tree_p (TYPE_SIZE (TREE_TYPE (src)), &_total)
> > > > > > +         && known_ge (_total, _sz + nloc))
> > > > > > +      (ifn (BIT_FIELD_REF:ntype { src; } { size; } { pos;
> > > > > > +})))))))
> > > > > > +
> > > > > > +(for op (lt gt)
> > > > > > +     ifni (IFN_REDUC_MIN IFN_REDUC_MAX)
> > > > > > +     ifnf (IFN_REDUC_FMIN IFN_REDUC_FMAX)  (simplify
> > > > > > +  (cond (op @0 @1) @0 @1)
> > > > > > +   (if (INTEGRAL_TYPE_P (type) || SCALAR_FLOAT_TYPE_P (type))
> > > > > > +    (with { poly_uint64 nloc = 0;
> > > > > > +           tree src = adjacent_data_access_p (@0, @1, &nloc, false);
> > > > > > +           tree ntype = build_vector_type (type, 2);
> > > > > > +           tree size = TYPE_SIZE (ntype);
> > > > > > +           tree pos = build_int_cst (TREE_TYPE (size), nloc);
> > > > > > +           poly_uint64 _sz;
> > > > > > +           poly_uint64 _total; }
> > > > > > +     (if (src && is_gimple_reg (src) && ntype
> > > > > > +         && poly_int_tree_p (size, &_sz)
> > > > > > +         && poly_int_tree_p (TYPE_SIZE (TREE_TYPE (src)), &_total)
> > > > > > +         && known_ge (_total, _sz + nloc))
> > > > > > +      (if (SCALAR_FLOAT_MODE_P (TYPE_MODE (type)))
> > > > > > +       (ifnf (BIT_FIELD_REF:ntype { src; } { size; } { pos; }))
> > > > > > +       (ifni (BIT_FIELD_REF:ntype { src; } { size; } { pos;
> > > > > > +}))))))))
> > > > > > +
> > > > > >  (simplify
> > > > > >   (BIT_FIELD_REF (BIT_FIELD_REF @0 @1 @2) @3 @4)
> > > > > >   (BIT_FIELD_REF @0 @3 { const_binop (PLUS_EXPR, bitsizetype,
> > > > > > @2, @4);
> > > > > > })) diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc index
> > > > > >
> > > > >
> > >
> 91ec33c80a41e1e0cc6224e137dd42144724a168..b19710392940cf469de52d006
> > > > > 603
> > > > > > ae1e3deb6b76 100644
> > > > > > --- a/gcc/tree-cfg.cc
> > > > > > +++ b/gcc/tree-cfg.cc
> > > > > > @@ -3492,6 +3492,7 @@ verify_gimple_call (gcall *stmt)
> > > > > >      {
> > > > > >        tree arg = gimple_call_arg (stmt, i);
> > > > > >        if ((is_gimple_reg_type (TREE_TYPE (arg))
> > > > > > +          && !is_gimple_variable (arg)
> > > > > >            && !is_gimple_val (arg))
> > > > > >           || (!is_gimple_reg_type (TREE_TYPE (arg))
> > > > > >               && !is_gimple_lvalue (arg))) diff --git
> > > > > > a/gcc/tree.h b/gcc/tree.h index
> > > > > >
> > > > >
> > >
> e6564aaccb7b69cd938ff60b6121aec41b7e8a59..8f8a9660c9e0605eb516de194
> > > > > 640
> > > > > > b8c1b531b798 100644
> > > > > > --- a/gcc/tree.h
> > > > > > +++ b/gcc/tree.h
> > > > > > @@ -5006,6 +5006,11 @@ extern bool integer_pow2p (const_tree);
> > > > > >
> > > > > >  extern tree bitmask_inv_cst_vector_p (tree);
> > > > > >
> > > > > > +/* TRUE if the two operands represent adjacent access of data
> > > > > > +such
> > > that a
> > > > > > +   pairwise operation can be used.  */
> > > > > > +
> > > > > > +extern tree adjacent_data_access_p (tree, tree, poly_uint64*,
> > > > > > +bool);
> > > > > > +
> > > > > >  /* integer_nonzerop (tree x) is nonzero if X is an integer constant
> > > > > >     with a nonzero value.  */
> > > > > >
> > > > > > diff --git a/gcc/tree.cc b/gcc/tree.cc index
> > > > > >
> > > > >
> > >
> 007c9325b17076f474e6681c49966c59cf6b91c7..5315af38a1ead89ca5f75dc4b1
> > > > > 9
> > > > > d
> > > > > > e9841e29d311 100644
> > > > > > --- a/gcc/tree.cc
> > > > > > +++ b/gcc/tree.cc
> > > > > > @@ -10457,6 +10457,90 @@ bitmask_inv_cst_vector_p (tree t)
> > > > > >    return builder.build ();
> > > > > >  }
> > > > > >
> > > > > > +/* Returns base address if the two operands represent
> > > > > > +adjacent access of
> > > > > data
> > > > > > +   such that a pairwise operation can be used.  OP1 must be a
> > > > > > + lower
> > > > > subpart
> > > > > > +   than OP2.  If POS is not NULL then on return if a value is
> > > > > > + returned
> > > POS
> > > > > > +   will indicate the position of the lower address.  If
> > > > > > + COMMUTATIVE_P
> > > > > then
> > > > > > +   the operation is also tried by flipping op1 and op2.  */
> > > > > > +
> > > > > > +tree adjacent_data_access_p (tree op1, tree op2, poly_uint64
> *pos,
> > > > > > +                            bool commutative_p) {
> > > > > > +  gcc_assert (op1);
> > > > > > +  gcc_assert (op2);
> > > > > > +  if (TREE_CODE (op1) != TREE_CODE (op2)
> > > > > > +      || TREE_TYPE (op1) != TREE_TYPE (op2))
> > > > > > +    return NULL;
> > > > > > +
> > > > > > +  tree type = TREE_TYPE (op1);  gimple *stmt1 = NULL, *stmt2
> > > > > > + = NULL;  unsigned int bits = GET_MODE_BITSIZE
> > > > > > + (GET_MODE_INNER (TYPE_MODE (type)));
> > > > > > +
> > > > > > +  if (TREE_CODE (op1) == BIT_FIELD_REF
> > > > > > +      && operand_equal_p (TREE_OPERAND (op1, 0),
> TREE_OPERAND
> > > > > > + (op2,
> > > > > 0), 0)
> > > > > > +      && operand_equal_p (TREE_OPERAND (op1, 1),
> TREE_OPERAND
> > > > > > + (op2,
> > > > > 1), 0)
> > > > > > +      && known_eq (bit_field_size (op1), bits))
> > > > > > +    {
> > > > > > +      poly_uint64 offset1 = bit_field_offset (op1);
> > > > > > +      poly_uint64 offset2 = bit_field_offset (op2);
> > > > > > +      if (known_eq (offset2 - offset1, bits))
> > > > > > +       {
> > > > > > +         if (pos)
> > > > > > +           *pos = offset1;
> > > > > > +         return TREE_OPERAND (op1, 0);
> > > > > > +       }
> > > > > > +      else if (commutative_p && known_eq (offset1 - offset2, bits))
> > > > > > +       {
> > > > > > +         if (pos)
> > > > > > +           *pos = offset2;
> > > > > > +         return TREE_OPERAND (op1, 0);
> > > > > > +       }
> > > > > > +    }
> > > > > > +  else if (TREE_CODE (op1) == ARRAY_REF
> > > > > > +          && operand_equal_p (get_base_address (op1),
> > > > > > + get_base_address
> > > > > (op2)))
> > > > > > +    {
> > > > > > +      wide_int size1 = wi::to_wide (array_ref_element_size (op1));
> > > > > > +      wide_int size2 = wi::to_wide (array_ref_element_size (op2));
> > > > > > +      if (wi::ne_p (size1, size2) || wi::ne_p (size1, bits / 8)
> > > > > > +         || !tree_fits_poly_uint64_p (TREE_OPERAND (op1, 1))
> > > > > > +         || !tree_fits_poly_uint64_p (TREE_OPERAND (op2, 1)))
> > > > > > +       return NULL;
> > > > > > +
> > > > > > +      poly_uint64 offset1 = tree_to_poly_uint64 (TREE_OPERAND
> > > > > > + (op1,
> > > 1));
> > > > > > +      poly_uint64 offset2 = tree_to_poly_uint64 (TREE_OPERAND
> > > > > > + (op2,
> > > 1));
> > > > > > +      if (known_eq (offset2 - offset1, 1UL))
> > > > > > +       {
> > > > > > +         if (pos)
> > > > > > +           *pos = offset1 * bits;
> > > > > > +         return TREE_OPERAND (op1, 0);
> > > > > > +       }
> > > > > > +      else if (commutative_p && known_eq (offset1 - offset2, 1UL))
> > > > > > +       {
> > > > > > +         if (pos)
> > > > > > +           *pos = offset2 * bits;
> > > > > > +         return TREE_OPERAND (op1, 0);
> > > > > > +       }
> > > > > > +    }
> > > > > > +  else if (TREE_CODE (op1) == SSA_NAME
> > > > > > +          && (stmt1 = SSA_NAME_DEF_STMT (op1)) != NULL
> > > > > > +          && (stmt2 = SSA_NAME_DEF_STMT (op2)) != NULL
> > > > > > +          && is_gimple_assign (stmt1)
> > > > > > +          && is_gimple_assign (stmt2))
> > > > > > +    {
> > > > > > +      if (gimple_assign_rhs_code (stmt1) != ARRAY_REF
> > > > > > +         && gimple_assign_rhs_code (stmt1) != BIT_FIELD_REF
> > > > > > +         && gimple_assign_rhs_code (stmt2) != ARRAY_REF
> > > > > > +         && gimple_assign_rhs_code (stmt2) != BIT_FIELD_REF)
> > > > > > +       return NULL;
> > > > > > +
> > > > > > +      return adjacent_data_access_p (gimple_assign_rhs1 (stmt1),
> > > > > > +                                    gimple_assign_rhs1 (stmt2), pos,
> > > > > > +                                    commutative_p);
> > > > > > +    }
> > > > > > +
> > > > > > +  return NULL;
> > > > > > +}
> > > > > > +
> > > > > >  /* If VECTOR_CST T has a single nonzero element, return the
> > > > > > index of
> > > that
> > > > > >     element, otherwise return -1.  */
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > >
> > >
> > > --
> > > Richard Biener <rguenther@suse.de>
> > > SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461
> > > Nuernberg, Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald,
> > > Boudien Moerman; HRB 36809 (AG Nuernberg)
> >
> 
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461
> Nuernberg, Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald,
> Boudien Moerman; HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/8]middle-end: Recognize scalar widening reductions
  2022-10-31 11:57 ` [PATCH 2/8]middle-end: Recognize scalar widening reductions Tamar Christina
  2022-10-31 21:42   ` Jeff Law
@ 2022-11-07 13:21   ` Richard Biener
  1 sibling, 0 replies; 50+ messages in thread
From: Richard Biener @ 2022-11-07 13:21 UTC (permalink / raw)
  To: Tamar Christina; +Cc: gcc-patches, nd, jeffreyalaw

On Mon, 31 Oct 2022, Tamar Christina wrote:

> Hi All,
> 
> This adds a new optab and IFNs for REDUC_PLUS_WIDEN where the resulting
> scalar reduction has twice the precision of the input elements.
> 
> At some point in a later patch I will also teach the vectorizer to recognize
> this builtin once I figure out how the various bits of reductions work.
> 
> For now it's generated only by the match.pd pattern.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
> and no issues.
> 
> Ok for master?
> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	* internal-fn.def (REDUC_PLUS_WIDEN): New.
> 	* doc/md.texi: Document it.
> 	* match.pd: Recognize widening plus.
> 	* optabs.def (reduc_splus_widen_scal_optab,
> 	reduc_uplus_widen_scal_optab): New.
> 
> --- inline copy of patch -- 
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index 34825549ed4e315b07d36dc3d63bae0cc0a3932d..c08691ab4c9a4bfe55ae81e5e228a414d6242d78 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -5284,6 +5284,20 @@ Compute the sum of the elements of a vector. The vector is operand 1, and
>  operand 0 is the scalar result, with mode equal to the mode of the elements of
>  the input vector.
>  
> +@cindex @code{reduc_uplus_widen_scal_@var{m}} instruction pattern
> +@item @samp{reduc_uplus_widen_scal_@var{m}}
> +Compute the sum of the elements of a vector and zero-extend @var{m} to a mode
> +that has twice the precision of @var{m}.  The vector is operand 1, and
> +operand 0 is the scalar result, with mode equal to twice the precision of the
> +mode of the elements of the input vector.
> +
> +@cindex @code{reduc_splus_widen_scal_@var{m}} instruction pattern
> +@item @samp{reduc_splus_widen_scal_@var{m}}
> +Compute the sum of the elements of a vector and sign-extend @var{m} to a mode
> +that has twice the precision of @var{m}.  The vector is operand 1, and
> +operand 0 is the scalar result, with mode equal to twice the precision of the
> +mode of the elements of the input vector.
> +
>  @cindex @code{reduc_and_scal_@var{m}} instruction pattern
>  @item @samp{reduc_and_scal_@var{m}}
>  @cindex @code{reduc_ior_scal_@var{m}} instruction pattern
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index 5e672183f4def9d0cdc29cf12fe17e8cff928f9f..f64a8421b1087b6c0f3602dc556876b0fd15c7ad 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -215,6 +215,9 @@ DEF_INTERNAL_OPTAB_FN (RSQRT, ECF_CONST, rsqrt, unary)
>  
>  DEF_INTERNAL_OPTAB_FN (REDUC_PLUS, ECF_CONST | ECF_NOTHROW,
>  		       reduc_plus_scal, unary)
> +DEF_INTERNAL_SIGNED_OPTAB_FN (REDUC_PLUS_WIDEN, ECF_CONST | ECF_NOTHROW,
> +			      first, reduc_splus_widen_scal,
> +			      reduc_uplus_widen_scal, unary)
>  DEF_INTERNAL_SIGNED_OPTAB_FN (REDUC_MAX, ECF_CONST | ECF_NOTHROW, first,
>  			      reduc_smax_scal, reduc_umax_scal, unary)
>  DEF_INTERNAL_SIGNED_OPTAB_FN (REDUC_MIN, ECF_CONST | ECF_NOTHROW, first,
> diff --git a/gcc/match.pd b/gcc/match.pd
> index aecaa3520b36e770d11ea9a10eb18db23c0cd9f7..1d407414bee278c64c00d425d9f025c1c58d853d 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -7237,6 +7237,14 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>         (ifnf (BIT_FIELD_REF:ntype { src; } { size; } { pos; }))
>         (ifni (BIT_FIELD_REF:ntype { src; } { size; } { pos; }))))))))
>  
> +/* Widening reduction conversions. */
> +(simplify
> + (convert (IFN_REDUC_PLUS @0))
> + (if (element_precision (TREE_TYPE (@0)) * 2 == element_precision (type)
> +      && TYPE_UNSIGNED (type) == TYPE_UNSIGNED (TREE_TYPE (@0))
> +      && ANY_INTEGRAL_TYPE_P (type) && ANY_INTEGRAL_TYPE_P (TREE_TYPE(@0)))
> +  (IFN_REDUC_PLUS_WIDEN @0)))

But that's not the same?  REDUC_PLUS_WIDEN first widens, then sums while
REDUC_PLUS on overflow "truncates", no?
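
To make the difference concrete, here is a sketch (illustrative only, not
part of the patch; the function names are made up and the scalar loops just
model the two internal functions):

typedef unsigned char v16qi __attribute__((vector_size (16)));

/* Models (unsigned short) .REDUC_PLUS (x): sum in the element type,
   wrapping modulo 256, then extend.  With all 16 elements equal to 255
   this yields (16 * 255) & 0xff == 240.  */
unsigned short
narrow_sum_then_extend (v16qi x)
{
  unsigned char sum = 0;
  for (int i = 0; i < 16; i++)
    sum += x[i];
  return sum;
}

/* Models .REDUC_PLUS_WIDEN (x): widen each element to the double-width
   type first and sum there.  With all 16 elements equal to 255 this
   yields 16 * 255 == 4080.  */
unsigned short
widen_then_sum (v16qi x)
{
  unsigned short sum = 0;
  for (int i = 0; i < 16; i++)
    sum += x[i];
  return sum;
}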

> +
>  (simplify
>   (BIT_FIELD_REF (BIT_FIELD_REF @0 @1 @2) @3 @4)
>   (BIT_FIELD_REF @0 @3 { const_binop (PLUS_EXPR, bitsizetype, @2, @4); }))
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index a6db2342bed6baf13ecbd84112c8432c6972e6fe..9947aed67fb8a3b675cb0aab9aeb059f89644106 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -346,6 +346,8 @@ OPTAB_D (reduc_fmin_scal_optab, "reduc_fmin_scal_$a")
>  OPTAB_D (reduc_smax_scal_optab, "reduc_smax_scal_$a")
>  OPTAB_D (reduc_smin_scal_optab, "reduc_smin_scal_$a")
>  OPTAB_D (reduc_plus_scal_optab, "reduc_plus_scal_$a")
> +OPTAB_D (reduc_splus_widen_scal_optab, "reduc_splus_widen_scal_$a")
> +OPTAB_D (reduc_uplus_widen_scal_optab, "reduc_uplus_widen_scal_$a")
>  OPTAB_D (reduc_umax_scal_optab, "reduc_umax_scal_$a")
>  OPTAB_D (reduc_umin_scal_optab, "reduc_umin_scal_$a")
>  OPTAB_D (reduc_and_scal_optab,  "reduc_and_scal_$a")
> 
> 
> 
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH 3/8]middle-end: Support extractions of subvectors from arbitrary element position inside a vector
  2022-11-01 14:25   ` Richard Sandiford
@ 2022-11-11 14:33     ` Tamar Christina
  2022-11-15  8:35       ` Hongtao Liu
  0 siblings, 1 reply; 50+ messages in thread
From: Tamar Christina @ 2022-11-11 14:33 UTC (permalink / raw)
  To: Richard Sandiford, Tamar Christina via Gcc-patches; +Cc: nd, rguenther

[-- Attachment #1: Type: text/plain, Size: 4249 bytes --]

Hi,

> 
> ...can we use expand_vec_perm_const here?  It will try the constant
> expansion first, which is the preferred order.  It also has a few variations up
> its sleeve.
> 

We can; however, this function seems to be incorrectly assuming it can always
convert the input mode to a QI vector mode.  When I started using it we got a number
of miscompilations in the AArch64 codegen.  This had the knock-on effect of uncovering
bugs in both the AArch64 backend and i386.  I'll send patches out for those separately.

For now here's the new patch using that hook and updating the permute expansion code:

Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* expmed.cc (extract_bit_field_1): Add support for vector element
	extracts.
	* optabs.cc (expand_vec_perm_const): Add checks before converting
	permute to QImode fallback.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/ext_1.c: New.

--- inline copy of patch ---

diff --git a/gcc/expmed.cc b/gcc/expmed.cc
index bab020c07222afa38305ef8d7333f271b1965b78..7d38045ae525c8a4665a0c1384fc515e4de88c67 100644
--- a/gcc/expmed.cc
+++ b/gcc/expmed.cc
@@ -1718,6 +1718,21 @@ extract_bit_field_1 (rtx str_rtx, poly_uint64 bitsize, poly_uint64 bitnum,
 	      return target;
 	    }
 	}
+      else if (!known_eq (bitnum, 0U)
+	       && multiple_p (GET_MODE_UNIT_BITSIZE (tmode), bitnum, &pos))
+	{
+	  /* The encoding has a single stepped pattern.  */
+	  poly_uint64 nunits = GET_MODE_NUNITS (new_mode);
+	  vec_perm_builder sel (nunits, 1, 3);
+	  sel.quick_push (pos);
+	  sel.quick_push (pos + 1);
+	  sel.quick_push (pos + 2);
+
+	  rtx res
+	    = expand_vec_perm_const (new_mode, op0, op0, sel, new_mode, NULL);
+	  if (res)
+	    return simplify_gen_subreg (tmode, res, new_mode, 0);
+	}
     }
 
   /* See if we can get a better vector mode before extracting.  */
diff --git a/gcc/optabs.cc b/gcc/optabs.cc
index cff37ccb0dfc3dd79b97d0abfd872f340855dc96..f338df410265dfe55b6896160090a453cc6a28d9 100644
--- a/gcc/optabs.cc
+++ b/gcc/optabs.cc
@@ -6267,6 +6267,7 @@ expand_vec_perm_const (machine_mode mode, rtx v0, rtx v1,
       v0_qi = gen_lowpart (qimode, v0);
       v1_qi = gen_lowpart (qimode, v1);
       if (targetm.vectorize.vec_perm_const != NULL
+	  && targetm.can_change_mode_class (mode, qimode, ALL_REGS)
 	  && targetm.vectorize.vec_perm_const (qimode, qimode, target_qi, v0_qi,
 					       v1_qi, qimode_indices))
 	return gen_lowpart (mode, target_qi);
@@ -6311,7 +6312,8 @@ expand_vec_perm_const (machine_mode mode, rtx v0, rtx v1,
     }
 
   if (qimode != VOIDmode
-      && selector_fits_mode_p (qimode, qimode_indices))
+      && selector_fits_mode_p (qimode, qimode_indices)
+      && targetm.can_change_mode_class (mode, qimode, ALL_REGS))
     {
       icode = direct_optab_handler (vec_perm_optab, qimode);
       if (icode != CODE_FOR_nothing)
diff --git a/gcc/testsuite/gcc.target/aarch64/ext_1.c b/gcc/testsuite/gcc.target/aarch64/ext_1.c
new file mode 100644
index 0000000000000000000000000000000000000000..18a10a14f1161584267a8472e571b3bc2ddf887a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ext_1.c
@@ -0,0 +1,54 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O" } */
+/* { dg-final { check-function-bodies "**" "" "" } } */
+
+#include <string.h>
+
+typedef unsigned int v4si __attribute__((vector_size (16)));
+typedef unsigned int v2si __attribute__((vector_size (8)));
+
+/*
+** extract: { xfail *-*-* }
+**	ext	v0.16b, v0.16b, v0.16b, #4
+**	ret
+*/
+v2si extract (v4si x)
+{
+    v2si res = {x[1], x[2]};
+    return res;
+}
+
+/*
+** extract1: { xfail *-*-* }
+**	ext	v0.16b, v0.16b, v0.16b, #4
+**	ret
+*/
+v2si extract1 (v4si x)
+{
+    v2si res;
+    memcpy (&res, ((int*)&x)+1, sizeof(res));
+    return res;
+}
+
+typedef struct cast {
+  int a;
+  v2si b __attribute__((packed));
+} cast_t;
+
+typedef union Data {
+   v4si x;
+   cast_t y;
+} data;  
+
+/*
+** extract2:
+**	ext	v0.16b, v0.16b, v0.16b, #4
+**	ret
+*/
+v2si extract2 (v4si x)
+{
+    data d;
+    d.x = x;
+    return d.y.b;
+}
+


^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH 5/8]AArch64 aarch64: Make existing V2HF be usable.
  2022-11-01 14:58   ` Richard Sandiford
  2022-11-01 15:11     ` Tamar Christina
@ 2022-11-11 14:39     ` Tamar Christina
  2022-11-22 16:01       ` Tamar Christina
  2022-12-06 10:28       ` Richard Sandiford
  1 sibling, 2 replies; 50+ messages in thread
From: Tamar Christina @ 2022-11-11 14:39 UTC (permalink / raw)
  To: Richard Sandiford
  Cc: gcc-patches, nd, Richard Earnshaw, Marcus Shawcroft, Kyrylo Tkachov

[-- Attachment #1: Type: text/plain, Size: 18112 bytes --]

Hi,


> This name might cause confusion with the SVE iterators, where FULL means
> "every bit of the register is used".  How about something like VMOVE
> instead?
> 
> With this change, I guess VALL_F16 represents "The set of all modes for
> which the vld1 intrinsics are provided" and VMOVE or whatever is "All
> Advanced SIMD modes suitable for moving, loading, and storing".
> That is, VMOVE extends VALL_F16 with modes that are not manifested via
> intrinsics.
> 

Done.

> Where is the 2h used, and is it valid syntax in that context?
> 
> Same for later instances of 2h.

They are, but they weren't meant to be in this patch.  They belong in a separate FP16 series that
I won't get to finish for GCC 13 due to not being able to finish writing all the tests.  I have moved them
to that patch series though.

While the addp patch series has been killed, this patch is still good standalone and improves codegen
as shown in the updated testcase.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (*aarch64_simd_movv2hf): New.
	(mov<mode>, movmisalign<mode>, aarch64_dup_lane<mode>,
	aarch64_store_lane0<mode>, aarch64_simd_vec_set<mode>,
	@aarch64_simd_vec_copy_lane<mode>, vec_set<mode>,
	reduc_<optab>_scal_<mode>, reduc_<fmaxmin>_scal_<mode>,
	aarch64_reduc_<optab>_internal<mode>, aarch64_get_lane<mode>,
	vec_init<mode><Vel>, vec_extract<mode><Vel>): Support V2HF.
	(aarch64_simd_dupv2hf): New.
	* config/aarch64/aarch64.cc (aarch64_classify_vector_mode):
	Add E_V2HFmode.
	* config/aarch64/iterators.md (VHSDF_P): New.
	(V2F, VMOVE, nunits, Vtype, Vmtype, Vetype, stype, VEL,
	Vel, q, vp): Add V2HF.
	* config/arm/types.md (neon_fp_reduc_add_h): New.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/sve/slp_1.c: Update testcase.

--- inline copy of patch ---

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index f4152160084d6b6f34bd69f0ba6386c1ab50f77e..487a31010245accec28e779661e6c2d578fca4b7 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -19,10 +19,10 @@
 ;; <http://www.gnu.org/licenses/>.
 
 (define_expand "mov<mode>"
-  [(set (match_operand:VALL_F16 0 "nonimmediate_operand")
-	(match_operand:VALL_F16 1 "general_operand"))]
+  [(set (match_operand:VMOVE 0 "nonimmediate_operand")
+	(match_operand:VMOVE 1 "general_operand"))]
   "TARGET_SIMD"
-  "
+{
   /* Force the operand into a register if it is not an
      immediate whose use can be replaced with xzr.
      If the mode is 16 bytes wide, then we will be doing
@@ -46,12 +46,11 @@ (define_expand "mov<mode>"
       aarch64_expand_vector_init (operands[0], operands[1]);
       DONE;
     }
-  "
-)
+})
 
 (define_expand "movmisalign<mode>"
-  [(set (match_operand:VALL_F16 0 "nonimmediate_operand")
-        (match_operand:VALL_F16 1 "general_operand"))]
+  [(set (match_operand:VMOVE 0 "nonimmediate_operand")
+        (match_operand:VMOVE 1 "general_operand"))]
   "TARGET_SIMD && !STRICT_ALIGNMENT"
 {
   /* This pattern is not permitted to fail during expansion: if both arguments
@@ -73,6 +72,16 @@ (define_insn "aarch64_simd_dup<mode>"
   [(set_attr "type" "neon_dup<q>, neon_from_gp<q>")]
 )
 
+(define_insn "aarch64_simd_dupv2hf"
+  [(set (match_operand:V2HF 0 "register_operand" "=w")
+	(vec_duplicate:V2HF
+	  (match_operand:HF 1 "register_operand" "0")))]
+  "TARGET_SIMD"
+  "@
+   sli\\t%d0, %d1, 16"
+  [(set_attr "type" "neon_shift_imm")]
+)
+
 (define_insn "aarch64_simd_dup<mode>"
   [(set (match_operand:VDQF_F16 0 "register_operand" "=w,w")
 	(vec_duplicate:VDQF_F16
@@ -85,10 +94,10 @@ (define_insn "aarch64_simd_dup<mode>"
 )
 
 (define_insn "aarch64_dup_lane<mode>"
-  [(set (match_operand:VALL_F16 0 "register_operand" "=w")
-	(vec_duplicate:VALL_F16
+  [(set (match_operand:VMOVE 0 "register_operand" "=w")
+	(vec_duplicate:VMOVE
 	  (vec_select:<VEL>
-	    (match_operand:VALL_F16 1 "register_operand" "w")
+	    (match_operand:VMOVE 1 "register_operand" "w")
 	    (parallel [(match_operand:SI 2 "immediate_operand" "i")])
           )))]
   "TARGET_SIMD"
@@ -142,6 +151,29 @@ (define_insn "*aarch64_simd_mov<VDMOV:mode>"
 		     mov_reg, neon_move<q>")]
 )
 
+(define_insn "*aarch64_simd_movv2hf"
+  [(set (match_operand:V2HF 0 "nonimmediate_operand"
+		"=w, m,  m,  w, ?r, ?w, ?r, w, w")
+	(match_operand:V2HF 1 "general_operand"
+		"m,  Dz, w,  w,  w,  r,  r, Dz, Dn"))]
+  "TARGET_SIMD_F16INST
+   && (register_operand (operands[0], V2HFmode)
+       || aarch64_simd_reg_or_zero (operands[1], V2HFmode))"
+   "@
+    ldr\\t%s0, %1
+    str\\twzr, %0
+    str\\t%s1, %0
+    mov\\t%0.2s[0], %1.2s[0]
+    umov\\t%w0, %1.s[0]
+    fmov\\t%s0, %1
+    mov\\t%0, %1
+    movi\\t%d0, 0
+    * return aarch64_output_simd_mov_immediate (operands[1], 32);"
+  [(set_attr "type" "neon_load1_1reg, store_8, neon_store1_1reg,\
+		     neon_logic, neon_to_gp, f_mcr,\
+		     mov_reg, neon_move, neon_move")]
+)
+
 (define_insn "*aarch64_simd_mov<VQMOV:mode>"
   [(set (match_operand:VQMOV 0 "nonimmediate_operand"
 		"=w, Umn,  m,  w, ?r, ?w, ?r, w")
@@ -182,7 +214,7 @@ (define_insn "*aarch64_simd_mov<VQMOV:mode>"
 
 (define_insn "aarch64_store_lane0<mode>"
   [(set (match_operand:<VEL> 0 "memory_operand" "=m")
-	(vec_select:<VEL> (match_operand:VALL_F16 1 "register_operand" "w")
+	(vec_select:<VEL> (match_operand:VMOVE 1 "register_operand" "w")
 			(parallel [(match_operand 2 "const_int_operand" "n")])))]
   "TARGET_SIMD
    && ENDIAN_LANE_N (<nunits>, INTVAL (operands[2])) == 0"
@@ -1035,11 +1067,11 @@ (define_insn "one_cmpl<mode>2"
 )
 
 (define_insn "aarch64_simd_vec_set<mode>"
-  [(set (match_operand:VALL_F16 0 "register_operand" "=w,w,w")
-	(vec_merge:VALL_F16
-	    (vec_duplicate:VALL_F16
+  [(set (match_operand:VMOVE 0 "register_operand" "=w,w,w")
+	(vec_merge:VMOVE
+	    (vec_duplicate:VMOVE
 		(match_operand:<VEL> 1 "aarch64_simd_nonimmediate_operand" "w,?r,Utv"))
-	    (match_operand:VALL_F16 3 "register_operand" "0,0,0")
+	    (match_operand:VMOVE 3 "register_operand" "0,0,0")
 	    (match_operand:SI 2 "immediate_operand" "i,i,i")))]
   "TARGET_SIMD"
   {
@@ -1061,14 +1093,14 @@ (define_insn "aarch64_simd_vec_set<mode>"
 )
 
 (define_insn "@aarch64_simd_vec_copy_lane<mode>"
-  [(set (match_operand:VALL_F16 0 "register_operand" "=w")
-	(vec_merge:VALL_F16
-	    (vec_duplicate:VALL_F16
+  [(set (match_operand:VMOVE 0 "register_operand" "=w")
+	(vec_merge:VMOVE
+	    (vec_duplicate:VMOVE
 	      (vec_select:<VEL>
-		(match_operand:VALL_F16 3 "register_operand" "w")
+		(match_operand:VMOVE 3 "register_operand" "w")
 		(parallel
 		  [(match_operand:SI 4 "immediate_operand" "i")])))
-	    (match_operand:VALL_F16 1 "register_operand" "0")
+	    (match_operand:VMOVE 1 "register_operand" "0")
 	    (match_operand:SI 2 "immediate_operand" "i")))]
   "TARGET_SIMD"
   {
@@ -1376,7 +1408,7 @@ (define_insn "vec_shr_<mode>"
 )
 
 (define_expand "vec_set<mode>"
-  [(match_operand:VALL_F16 0 "register_operand")
+  [(match_operand:VMOVE 0 "register_operand")
    (match_operand:<VEL> 1 "aarch64_simd_nonimmediate_operand")
    (match_operand:SI 2 "immediate_operand")]
   "TARGET_SIMD"
@@ -3495,7 +3527,7 @@ (define_insn "popcount<mode>2"
 ;; gimple_fold'd to the IFN_REDUC_(MAX|MIN) function.  (This is FP smax/smin).
 (define_expand "reduc_<optab>_scal_<mode>"
   [(match_operand:<VEL> 0 "register_operand")
-   (unspec:<VEL> [(match_operand:VHSDF 1 "register_operand")]
+   (unspec:<VEL> [(match_operand:VHSDF_P 1 "register_operand")]
 		 FMAXMINV)]
   "TARGET_SIMD"
   {
@@ -3510,7 +3542,7 @@ (define_expand "reduc_<optab>_scal_<mode>"
 
 (define_expand "reduc_<fmaxmin>_scal_<mode>"
   [(match_operand:<VEL> 0 "register_operand")
-   (unspec:<VEL> [(match_operand:VHSDF 1 "register_operand")]
+   (unspec:<VEL> [(match_operand:VHSDF_P 1 "register_operand")]
 		 FMAXMINNMV)]
   "TARGET_SIMD"
   {
@@ -3554,8 +3586,8 @@ (define_insn "aarch64_reduc_<optab>_internalv2si"
 )
 
 (define_insn "aarch64_reduc_<optab>_internal<mode>"
- [(set (match_operand:VHSDF 0 "register_operand" "=w")
-       (unspec:VHSDF [(match_operand:VHSDF 1 "register_operand" "w")]
+ [(set (match_operand:VHSDF_P 0 "register_operand" "=w")
+       (unspec:VHSDF_P [(match_operand:VHSDF_P 1 "register_operand" "w")]
 		      FMAXMINV))]
  "TARGET_SIMD"
  "<maxmin_uns_op><vp>\\t%<Vetype>0, %1.<Vtype>"
@@ -4200,7 +4232,7 @@ (define_insn "*aarch64_get_lane_zero_extend<GPI:mode><VDQQH:mode>"
 (define_insn_and_split "aarch64_get_lane<mode>"
   [(set (match_operand:<VEL> 0 "aarch64_simd_nonimmediate_operand" "=?r, w, Utv")
 	(vec_select:<VEL>
-	  (match_operand:VALL_F16 1 "register_operand" "w, w, w")
+	  (match_operand:VMOVE 1 "register_operand" "w, w, w")
 	  (parallel [(match_operand:SI 2 "immediate_operand" "i, i, i")])))]
   "TARGET_SIMD"
   {
@@ -7981,7 +8013,7 @@ (define_expand "aarch64_st1<VALL_F16:mode>"
 ;; Standard pattern name vec_init<mode><Vel>.
 
 (define_expand "vec_init<mode><Vel>"
-  [(match_operand:VALL_F16 0 "register_operand")
+  [(match_operand:VMOVE 0 "register_operand")
    (match_operand 1 "" "")]
   "TARGET_SIMD"
 {
@@ -8060,7 +8092,7 @@ (define_insn "aarch64_urecpe<mode>"
 
 (define_expand "vec_extract<mode><Vel>"
   [(match_operand:<VEL> 0 "aarch64_simd_nonimmediate_operand")
-   (match_operand:VALL_F16 1 "register_operand")
+   (match_operand:VMOVE 1 "register_operand")
    (match_operand:SI 2 "immediate_operand")]
   "TARGET_SIMD"
 {
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 84dbe2f4ea7d03b424602ed98a34e7824217dc91..35671cb86e374f9ded21d0e4944c63bc2cbc0901 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -3566,6 +3566,7 @@ aarch64_classify_vector_mode (machine_mode mode)
     case E_V8BFmode:
     case E_V4SFmode:
     case E_V2DFmode:
+    case E_V2HFmode:
       return TARGET_SIMD ? VEC_ADVSIMD : 0;
 
     default:
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index 37d8161a33b1c399d80be82afa67613a087389d4..dfcf86a440e316c2abdbcc646363d39e458d1a91 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -160,6 +160,10 @@ (define_mode_iterator VDQF [V2SF V4SF V2DF])
 (define_mode_iterator VHSDF [(V4HF "TARGET_SIMD_F16INST")
 			     (V8HF "TARGET_SIMD_F16INST")
 			     V2SF V4SF V2DF])
+;; Advanced SIMD Float modes suitable for pairwise operations.
+(define_mode_iterator VHSDF_P [(V4HF "TARGET_SIMD_F16INST")
+			       (V8HF "TARGET_SIMD_F16INST")
+			       V2SF V4SF V2DF (V2HF "TARGET_SIMD_F16INST")])
 
 ;; Advanced SIMD Float modes, and DF.
 (define_mode_iterator VDQF_DF [V2SF V4SF V2DF DF])
@@ -188,15 +192,23 @@ (define_mode_iterator VDQF_COND [V2SF V2SI V4SF V4SI V2DF V2DI])
 (define_mode_iterator VALLF [V2SF V4SF V2DF SF DF])
 
 ;; Advanced SIMD Float modes with 2 elements.
-(define_mode_iterator V2F [V2SF V2DF])
+(define_mode_iterator V2F [V2SF V2DF V2HF])
 
 ;; All Advanced SIMD modes on which we support any arithmetic operations.
 (define_mode_iterator VALL [V8QI V16QI V4HI V8HI V2SI V4SI V2DI V2SF V4SF V2DF])
 
-;; All Advanced SIMD modes suitable for moving, loading, and storing.
+;; All Advanced SIMD modes suitable for moving, loading, and storing
+;; except V2HF.
 (define_mode_iterator VALL_F16 [V8QI V16QI V4HI V8HI V2SI V4SI V2DI
 				V4HF V8HF V4BF V8BF V2SF V4SF V2DF])
 
+;; All Advanced SIMD modes suitable for moving, loading, and storing
+;; including V2HF
+(define_mode_iterator VMOVE [V8QI V16QI V4HI V8HI V2SI V4SI V2DI
+			     V4HF V8HF V4BF V8BF V2SF V4SF V2DF
+			     (V2HF "TARGET_SIMD_F16INST")])
+
+
 ;; The VALL_F16 modes except the 128-bit 2-element ones.
 (define_mode_iterator VALL_F16_NO_V2Q [V8QI V16QI V4HI V8HI V2SI V4SI
 				V4HF V8HF V2SF V4SF])
@@ -1076,7 +1088,7 @@ (define_mode_attr nunits [(V8QI "8") (V16QI "16")
 			  (V2SF "2") (V4SF "4")
 			  (V1DF "1") (V2DF "2")
 			  (DI "1") (DF "1")
-			  (V8DI "8")])
+			  (V8DI "8") (V2HF "2")])
 
 ;; Map a mode to the number of bits in it, if the size of the mode
 ;; is constant.
@@ -1090,6 +1102,7 @@ (define_mode_attr s [(HF "h") (SF "s") (DF "d") (SI "s") (DI "d")])
 
 ;; Give the length suffix letter for a sign- or zero-extension.
 (define_mode_attr size [(QI "b") (HI "h") (SI "w")])
+(define_mode_attr sizel [(QI "b") (HI "h") (SI "")])
 
 ;; Give the number of bits in the mode
 (define_mode_attr sizen [(QI "8") (HI "16") (SI "32") (DI "64")])
@@ -1193,7 +1206,7 @@ (define_mode_attr Vmntype [(V8HI ".8b") (V4SI ".4h")
 (define_mode_attr Vetype [(V8QI "b") (V16QI "b")
 			  (V4HI "h") (V8HI  "h")
 			  (V2SI "s") (V4SI  "s")
-			  (V2DI "d")
+			  (V2DI "d") (V2HF  "h")
 			  (V4HF "h") (V8HF  "h")
 			  (V2SF "s") (V4SF  "s")
 			  (V2DF "d")
@@ -1285,7 +1298,7 @@ (define_mode_attr Vcwtype [(VNx16QI "b") (VNx8QI "h") (VNx4QI "w") (VNx2QI "d")
 ;; more accurately.
 (define_mode_attr stype [(V8QI "b") (V16QI "b") (V4HI "s") (V8HI "s")
 			 (V2SI "s") (V4SI "s") (V2DI "d") (V4HF "s")
-			 (V8HF "s") (V2SF "s") (V4SF "s") (V2DF "d")
+			 (V8HF "s") (V2SF "s") (V4SF "s") (V2DF "d") (V2HF "s")
 			 (HF "s") (SF "s") (DF "d") (QI "b") (HI "s")
 			 (SI "s") (DI "d")])
 
@@ -1360,8 +1373,8 @@ (define_mode_attr VEL [(V8QI  "QI") (V16QI "QI")
 		       (V4HF "HF") (V8HF  "HF")
 		       (V2SF "SF") (V4SF  "SF")
 		       (DF   "DF") (V2DF  "DF")
-		       (SI   "SI") (HI    "HI")
-		       (QI   "QI")
+		       (SI   "SI") (V2HF  "HF")
+		       (QI   "QI") (HI    "HI")
 		       (V4BF "BF") (V8BF "BF")
 		       (VNx16QI "QI") (VNx8QI "QI") (VNx4QI "QI") (VNx2QI "QI")
 		       (VNx8HI "HI") (VNx4HI "HI") (VNx2HI "HI")
@@ -1381,7 +1394,7 @@ (define_mode_attr Vel [(V8QI "qi") (V16QI "qi")
 		       (V2SF "sf") (V4SF "sf")
 		       (V2DF "df") (DF   "df")
 		       (SI   "si") (HI   "hi")
-		       (QI   "qi")
+		       (QI   "qi") (V2HF "hf")
 		       (V4BF "bf") (V8BF "bf")
 		       (VNx16QI "qi") (VNx8QI "qi") (VNx4QI "qi") (VNx2QI "qi")
 		       (VNx8HI "hi") (VNx4HI "hi") (VNx2HI "hi")
@@ -1866,7 +1879,7 @@ (define_mode_attr q [(V8QI "") (V16QI "_q")
 		     (V4HF "") (V8HF "_q")
 		     (V4BF "") (V8BF "_q")
 		     (V2SF "") (V4SF  "_q")
-			       (V2DF  "_q")
+		     (V2HF "") (V2DF  "_q")
 		     (QI "") (HI "") (SI "") (DI "") (HF "") (SF "") (DF "")
 		     (V2x8QI "") (V2x16QI "_q")
 		     (V2x4HI "") (V2x8HI "_q")
@@ -1905,6 +1918,7 @@ (define_mode_attr vp [(V8QI "v") (V16QI "v")
 		      (V2SI "p") (V4SI  "v")
 		      (V2DI "p") (V2DF  "p")
 		      (V2SF "p") (V4SF  "v")
+		      (V2HF "p")
 		      (V4HF "v") (V8HF  "v")])
 
 (define_mode_attr vsi2qi [(V2SI "v8qi") (V4SI "v16qi")
diff --git a/gcc/config/arm/types.md b/gcc/config/arm/types.md
index 7d0504bdd944e9c0d1b545b0b66a9a1adc808714..3cfbc7a93cca1bea4925853e51d0a147c5722247 100644
--- a/gcc/config/arm/types.md
+++ b/gcc/config/arm/types.md
@@ -483,6 +483,7 @@ (define_attr "autodetect_type"
 ; neon_fp_minmax_s_q
 ; neon_fp_minmax_d
 ; neon_fp_minmax_d_q
+; neon_fp_reduc_add_h
 ; neon_fp_reduc_add_s
 ; neon_fp_reduc_add_s_q
 ; neon_fp_reduc_add_d
@@ -1033,6 +1034,7 @@ (define_attr "type"
   neon_fp_minmax_d,\
   neon_fp_minmax_d_q,\
 \
+  neon_fp_reduc_add_h,\
   neon_fp_reduc_add_s,\
   neon_fp_reduc_add_s_q,\
   neon_fp_reduc_add_d,\
@@ -1257,8 +1259,8 @@ (define_attr "is_neon_type" "yes,no"
           neon_fp_compare_d, neon_fp_compare_d_q, neon_fp_minmax_s,\
           neon_fp_minmax_s_q, neon_fp_minmax_d, neon_fp_minmax_d_q,\
           neon_fp_neg_s, neon_fp_neg_s_q, neon_fp_neg_d, neon_fp_neg_d_q,\
-          neon_fp_reduc_add_s, neon_fp_reduc_add_s_q, neon_fp_reduc_add_d,\
-          neon_fp_reduc_add_d_q, neon_fp_reduc_minmax_s,
+          neon_fp_reduc_add_h, neon_fp_reduc_add_s, neon_fp_reduc_add_s_q,\
+          neon_fp_reduc_add_d, neon_fp_reduc_add_d_q, neon_fp_reduc_minmax_s,\
           neon_fp_reduc_minmax_s_q, neon_fp_reduc_minmax_d,\
           neon_fp_reduc_minmax_d_q,\
           neon_fp_cvt_narrow_s_q, neon_fp_cvt_narrow_d_q,\
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
index 07d71a63414b1066ea431e287286ad048515711a..e6021c5a42748701e5326a5c387a39a0bbadc9e5 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
@@ -30,11 +30,9 @@ vec_slp_##TYPE (TYPE *restrict a, TYPE b, TYPE c, int n)	\
 TEST_ALL (VEC_PERM)
 
 /* We should use one DUP for each of the 8-, 16- and 32-bit types,
-   although we currently use LD1RW for _Float16.  We should use two
-   DUPs for each of the three 64-bit types.  */
+   We should use two DUPs for each of the three 64-bit types.  */
 /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.h, [hw]} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.s, [sw]} 2 } } */
-/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 1 } } */
+/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.s, [sw]} 3 } } */
 /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, [dx]} 9 } } */
 /* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 3 } } */
 /* { dg-final { scan-assembler-not {\tzip2\t} } } */
@@ -53,7 +51,7 @@ TEST_ALL (VEC_PERM)
 /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
 /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 6 } } */
 /* { dg-final { scan-assembler-not {\tldr} } } */
-/* { dg-final { scan-assembler-times {\tstr} 2 } } */
-/* { dg-final { scan-assembler-times {\tstr\th[0-9]+} 2 } } */
+/* { dg-final { scan-assembler-not {\tstr} } } */
+/* { dg-final { scan-assembler-not {\tstr\th[0-9]+} } } */
 
 /* { dg-final { scan-assembler-not {\tuqdec} } } */

[-- Attachment #2: rb16244.patch --]
[-- Type: application/octet-stream, Size: 15802 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 3/8]middle-end: Support extractions of subvectors from arbitrary element position inside a vector
  2022-11-11 14:33     ` Tamar Christina
@ 2022-11-15  8:35       ` Hongtao Liu
  2022-11-15  8:51         ` Tamar Christina
  0 siblings, 1 reply; 50+ messages in thread
From: Hongtao Liu @ 2022-11-15  8:35 UTC (permalink / raw)
  To: Tamar Christina
  Cc: Richard Sandiford, Tamar Christina via Gcc-patches, nd, rguenther

Hi:
  I'm following up from https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606040.html.
>      }
>
>    /* See if we can get a better vector mode before extracting.  */
> diff --git a/gcc/optabs.cc b/gcc/optabs.cc
> index cff37ccb0dfc3dd79b97d0abfd872f340855dc96..f338df410265dfe55b6896160090a453cc6a28d9 100644
> --- a/gcc/optabs.cc
> +++ b/gcc/optabs.cc
> @@ -6267,6 +6267,7 @@ expand_vec_perm_const (machine_mode mode, rtx v0, rtx v1,
>        v0_qi = gen_lowpart (qimode, v0);
>        v1_qi = gen_lowpart (qimode, v1);
>        if (targetm.vectorize.vec_perm_const != NULL
> +         && targetm.can_change_mode_class (mode, qimode, ALL_REGS)
It looks like you want to guard gen_lowpart; wouldn't it be better to
use validate_subreg or (tmp = gen_lowpart_if_possible (mode,
target_qi))?
IMHO, targetm.can_change_mode_class is mostly used for RA, not to
guard gen_lowpart.
I did similar things in
https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579296.html
(Also, ALL_REGS doesn't cover all cases for registers which are
available for both qimode and mode: a failure for ALL_REGS doesn't mean
the subreg is invalid, it just means some parts of ALL_REGS can't do the
mode change.  With a subset of ALL_REGS, there could still be a reg
class for which targetm.can_change_mode_class returns true.)
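
Roughly what I have in mind, as an untested sketch (tmp would be a new
local rtx; the rest is the existing code from the hunk above):

  rtx tmp;
  if (targetm.vectorize.vec_perm_const != NULL
      /* Let the subreg machinery decide whether the QImode result can
	 be viewed back in MODE, instead of asking the RA-oriented hook.  */
      && (tmp = gen_lowpart_if_possible (mode, target_qi)) != NULL_RTX
      && targetm.vectorize.vec_perm_const (qimode, qimode, target_qi, v0_qi,
					   v1_qi, qimode_indices))
    return tmp;
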
>           && targetm.vectorize.vec_perm_const (qimode, qimode, target_qi, v0_qi,
>                                                v1_qi, qimode_indices))
>         return gen_lowpart (mode, target_qi);
> @@ -6311,7 +6312,8 @@ expand_vec_perm_const (machine_mode mode, rtx v0, rtx v1,
>      }
>
>    if (qimode != VOIDmode
> -      && selector_fits_mode_p (qimode, qimode_indices))
> +      && selector_fits_mode_p (qimode, qimode_indices)
> +      && targetm.can_change_mode_class (mode, qimode, ALL_REGS))
>      {
>        icode = direct_optab_handler (vec_perm_optab, qimode);
>        if (icode != CODE_FOR_nothing)
> diff --git a/gcc/testsuite/gcc.target/aarch64/ext_1.c b/gcc/testsuite/gcc.target/aarch64/ext_1.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..18a10a14f1161584267a8472e571b3bc2ddf887a




-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH 3/8]middle-end: Support extractions of subvectors from arbitrary element position inside a vector
  2022-11-15  8:35       ` Hongtao Liu
@ 2022-11-15  8:51         ` Tamar Christina
  2022-11-15  9:37           ` Hongtao Liu
  0 siblings, 1 reply; 50+ messages in thread
From: Tamar Christina @ 2022-11-15  8:51 UTC (permalink / raw)
  To: Hongtao Liu
  Cc: Richard Sandiford, Tamar Christina via Gcc-patches, nd, rguenther

> -----Original Message-----
> From: Hongtao Liu <crazylht@gmail.com>
> Sent: Tuesday, November 15, 2022 8:36 AM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: Richard Sandiford <Richard.Sandiford@arm.com>; Tamar Christina via
> Gcc-patches <gcc-patches@gcc.gnu.org>; nd <nd@arm.com>;
> rguenther@suse.de
> Subject: Re: [PATCH 3/8]middle-end: Support extractions of subvectors from
> arbitrary element position inside a vector
> 
> Hi:
>   I'm from https://gcc.gnu.org/pipermail/gcc-patches/2022-
> November/606040.html.
> >      }
> >
> >    /* See if we can get a better vector mode before extracting.  */
> > diff --git a/gcc/optabs.cc b/gcc/optabs.cc index
> >
> cff37ccb0dfc3dd79b97d0abfd872f340855dc96..f338df410265dfe55b689616009
> 0
> > a453cc6a28d9 100644
> > --- a/gcc/optabs.cc
> > +++ b/gcc/optabs.cc
> > @@ -6267,6 +6267,7 @@ expand_vec_perm_const (machine_mode mode,
> rtx v0, rtx v1,
> >        v0_qi = gen_lowpart (qimode, v0);
> >        v1_qi = gen_lowpart (qimode, v1);
> >        if (targetm.vectorize.vec_perm_const != NULL
> > +         && targetm.can_change_mode_class (mode, qimode, ALL_REGS)
> It looks like you want to guard gen_lowpart, shouldn't it be better to use
> validate_subreg  or (tmp = gen_lowpart_if_possible (mode, target_qi)).
> IMHO, targetm.can_change_mode_class is mostly used for RA, but not to
> guard gen_lowpart.

Hmm, I don't think this is quite true; there are existing usages in expr.cc and rtlanal.cc
that do this and aren't part of RA.  As I mentioned before, for instance the
canonicalization of vec_select to subreg in rtlanal.cc uses this.

So there is already existing precedent for this.  And the documentation for
the hook says:

"This hook returns true if it is possible to bitcast values held in registers of class rclass from mode from to mode to and if doing so preserves the low-order bits that are common to both modes. The result is only meaningful if rclass has registers that can hold both from and to. The default implementation returns true"

So it looks like its use outside of RA is perfectly valid, and the documentation's
example also covers use from the mid-end.

But if the mid-end maintainers are happy I'll use something else.

Tamar

> I did similar things in
> https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579296.html
> (and ALL_REGS doesn't cover all cases for registers which are both available
> for qimode and mode, ALL_REGS fail doesn't mean it can't be subreg, it just
> means parts of ALL_REGS can't be subreg. but with a subset of ALL_REGS,
> there could be a reg class which return true for
> targetm.can_change_mode_class)
> >           && targetm.vectorize.vec_perm_const (qimode, qimode, target_qi,
> v0_qi,
> >                                                v1_qi, qimode_indices))
> >         return gen_lowpart (mode, target_qi); @@ -6311,7 +6312,8 @@
> > expand_vec_perm_const (machine_mode mode, rtx v0, rtx v1,
> >      }
> >
> >    if (qimode != VOIDmode
> > -      && selector_fits_mode_p (qimode, qimode_indices))
> > +      && selector_fits_mode_p (qimode, qimode_indices)
> > +      && targetm.can_change_mode_class (mode, qimode, ALL_REGS))
> >      {
> >        icode = direct_optab_handler (vec_perm_optab, qimode);
> >        if (icode != CODE_FOR_nothing)
> > diff --git a/gcc/testsuite/gcc.target/aarch64/ext_1.c
> > b/gcc/testsuite/gcc.target/aarch64/ext_1.c
> > new file mode 100644
> > index
> >
> 0000000000000000000000000000000000000000..18a10a14f1161584267a8472e5
> 71
> > b3bc2ddf887a
> 
> 
> 
> 
> --
> BR,
> Hongtao

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 3/8]middle-end: Support extractions of subvectors from arbitrary element position inside a vector
  2022-11-15  8:51         ` Tamar Christina
@ 2022-11-15  9:37           ` Hongtao Liu
  2022-11-15 10:00             ` Tamar Christina
  0 siblings, 1 reply; 50+ messages in thread
From: Hongtao Liu @ 2022-11-15  9:37 UTC (permalink / raw)
  To: Tamar Christina
  Cc: Richard Sandiford, Tamar Christina via Gcc-patches, nd, rguenther

On Tue, Nov 15, 2022 at 4:51 PM Tamar Christina <Tamar.Christina@arm.com> wrote:
>
> > -----Original Message-----
> > From: Hongtao Liu <crazylht@gmail.com>
> > Sent: Tuesday, November 15, 2022 8:36 AM
> > To: Tamar Christina <Tamar.Christina@arm.com>
> > Cc: Richard Sandiford <Richard.Sandiford@arm.com>; Tamar Christina via
> > Gcc-patches <gcc-patches@gcc.gnu.org>; nd <nd@arm.com>;
> > rguenther@suse.de
> > Subject: Re: [PATCH 3/8]middle-end: Support extractions of subvectors from
> > arbitrary element position inside a vector
> >
> > Hi:
> >   I'm from https://gcc.gnu.org/pipermail/gcc-patches/2022-
> > November/606040.html.
> > >      }
> > >
> > >    /* See if we can get a better vector mode before extracting.  */
> > > diff --git a/gcc/optabs.cc b/gcc/optabs.cc index
> > >
> > cff37ccb0dfc3dd79b97d0abfd872f340855dc96..f338df410265dfe55b689616009
> > 0
> > > a453cc6a28d9 100644
> > > --- a/gcc/optabs.cc
> > > +++ b/gcc/optabs.cc
> > > @@ -6267,6 +6267,7 @@ expand_vec_perm_const (machine_mode mode,
> > rtx v0, rtx v1,
> > >        v0_qi = gen_lowpart (qimode, v0);
> > >        v1_qi = gen_lowpart (qimode, v1);
> > >        if (targetm.vectorize.vec_perm_const != NULL
> > > +         && targetm.can_change_mode_class (mode, qimode, ALL_REGS)
> > It looks like you want to guard gen_lowpart, shouldn't it be better to use
> > validate_subreg  or (tmp = gen_lowpart_if_possible (mode, target_qi)).
> > IMHO, targetm.can_change_mode_class is mostly used for RA, but not to
> > guard gen_lowpart.
>
> Hmm I don't think this is quite true, there are existing usages in expr.cc and rtanal.cc
> That do this and aren't part of RA.  As I mentioned before for instance the
> canoncalization of vec_select to subreg in rtlanal for instances uses this.
In theory, we need to iterate through all reg classes that can be
assigned for both qimode and mode; if any regclass returns true for
targetm.can_change_mode_class, the bitcast (validate_subreg) should be
OK.
Here we just pass ALL_REGS.
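
Something like the untested helper below is what I mean by iterating
(just to illustrate the semantics; it is probably too expensive to do
for real, and it ignores whether the class can actually hold both
modes):

  /* Return true if any register class allows changing FROM to TO.
     A query against ALL_REGS alone can miss such a class.  */
  static bool
  any_class_can_change_mode_p (machine_mode from, machine_mode to)
  {
    for (int i = 0; i < N_REG_CLASSES; i++)
      if (targetm.can_change_mode_class (from, to, (reg_class_t) i))
	return true;
    return false;
  }
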
>
> So there are already existing precedence for this.  And the documentation for
> the hook says:
>
> "This hook returns true if it is possible to bitcast values held in registers of class rclass from mode from to mode to and if doing so preserves the low-order bits that are common to both modes. The result is only meaningful if rclass has registers that can hold both from and to. The default implementation returns true"
>
> So it looks like it's use outside of RA is perfectly valid.. and the documentation also mentions
> in the example the use from the mid-end as an example.
>
> But if the mid-end maintainers are happy I'll use something else.
>
> Tamar
>
> > I did similar things in
> > https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579296.html
> > (and ALL_REGS doesn't cover all cases for registers which are both available
> > for qimode and mode, ALL_REGS fail doesn't mean it can't be subreg, it just
> > means parts of ALL_REGS can't be subreg. but with a subset of ALL_REGS,
> > there could be a reg class which return true for
> > targetm.can_change_mode_class)
> > >           && targetm.vectorize.vec_perm_const (qimode, qimode, target_qi,
> > v0_qi,
> > >                                                v1_qi, qimode_indices))
> > >         return gen_lowpart (mode, target_qi); @@ -6311,7 +6312,8 @@
> > > expand_vec_perm_const (machine_mode mode, rtx v0, rtx v1,
> > >      }
> > >
> > >    if (qimode != VOIDmode
> > > -      && selector_fits_mode_p (qimode, qimode_indices))
> > > +      && selector_fits_mode_p (qimode, qimode_indices)
> > > +      && targetm.can_change_mode_class (mode, qimode, ALL_REGS))
> > >      {
> > >        icode = direct_optab_handler (vec_perm_optab, qimode);
> > >        if (icode != CODE_FOR_nothing)
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/ext_1.c
> > > b/gcc/testsuite/gcc.target/aarch64/ext_1.c
> > > new file mode 100644
> > > index
> > >
> > 0000000000000000000000000000000000000000..18a10a14f1161584267a8472e5
> > 71
> > > b3bc2ddf887a
> >
> >
> >
> >
> > --
> > BR,
> > Hongtao



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH 3/8]middle-end: Support extractions of subvectors from arbitrary element position inside a vector
  2022-11-15  9:37           ` Hongtao Liu
@ 2022-11-15 10:00             ` Tamar Christina
  2022-11-15 17:39               ` Richard Sandiford
  0 siblings, 1 reply; 50+ messages in thread
From: Tamar Christina @ 2022-11-15 10:00 UTC (permalink / raw)
  To: Hongtao Liu
  Cc: Richard Sandiford, Tamar Christina via Gcc-patches, nd, rguenther

> -----Original Message-----
> From: Hongtao Liu <crazylht@gmail.com>
> Sent: Tuesday, November 15, 2022 9:37 AM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: Richard Sandiford <Richard.Sandiford@arm.com>; Tamar Christina via
> Gcc-patches <gcc-patches@gcc.gnu.org>; nd <nd@arm.com>;
> rguenther@suse.de
> Subject: Re: [PATCH 3/8]middle-end: Support extractions of subvectors from
> arbitrary element position inside a vector
> 
> On Tue, Nov 15, 2022 at 4:51 PM Tamar Christina
> <Tamar.Christina@arm.com> wrote:
> >
> > > -----Original Message-----
> > > From: Hongtao Liu <crazylht@gmail.com>
> > > Sent: Tuesday, November 15, 2022 8:36 AM
> > > To: Tamar Christina <Tamar.Christina@arm.com>
> > > Cc: Richard Sandiford <Richard.Sandiford@arm.com>; Tamar Christina
> > > via Gcc-patches <gcc-patches@gcc.gnu.org>; nd <nd@arm.com>;
> > > rguenther@suse.de
> > > Subject: Re: [PATCH 3/8]middle-end: Support extractions of
> > > subvectors from arbitrary element position inside a vector
> > >
> > > Hi:
> > >   I'm from https://gcc.gnu.org/pipermail/gcc-patches/2022-
> > > November/606040.html.
> > > >      }
> > > >
> > > >    /* See if we can get a better vector mode before extracting.
> > > > */ diff --git a/gcc/optabs.cc b/gcc/optabs.cc index
> > > >
> > >
> cff37ccb0dfc3dd79b97d0abfd872f340855dc96..f338df410265dfe55b68961600
> > > 9
> > > 0
> > > > a453cc6a28d9 100644
> > > > --- a/gcc/optabs.cc
> > > > +++ b/gcc/optabs.cc
> > > > @@ -6267,6 +6267,7 @@ expand_vec_perm_const (machine_mode
> mode,
> > > rtx v0, rtx v1,
> > > >        v0_qi = gen_lowpart (qimode, v0);
> > > >        v1_qi = gen_lowpart (qimode, v1);
> > > >        if (targetm.vectorize.vec_perm_const != NULL
> > > > +         && targetm.can_change_mode_class (mode, qimode,
> > > > + ALL_REGS)
> > > It looks like you want to guard gen_lowpart, shouldn't it be better
> > > to use validate_subreg  or (tmp = gen_lowpart_if_possible (mode,
> target_qi)).
> > > IMHO, targetm.can_change_mode_class is mostly used for RA, but not
> > > to guard gen_lowpart.
> >
> > Hmm I don't think this is quite true, there are existing usages in
> > expr.cc and rtanal.cc That do this and aren't part of RA.  As I
> > mentioned before for instance the canoncalization of vec_select to subreg
> in rtlanal for instances uses this.
> In theory, we need to iterate through all reg classes that can be assigned for
> both qimode and mode, if any regclass returns true for
> targetm.can_change_mode_class, the bitcast(validate_subreg) should be ok.
> Here we just passed ALL_REGS.

Yes, and most targets where this transformation is valid return true here.

I've checked:
 * alpha
 * arm
 * aarch64
 * rs6000
 * s390
 * sparc
 * pa
 * mips

And even the default example from the documentation that other targets use
would return true, as the size of the modes is the same.

X86 and RISCV are the only two targets that I found (but I didn't check all)
that return a blanket result based on just the register classes.

That is to say, there are more targets that adhere to the interpretation that
rclass here means "should be possible in some class in rclass" rather than
"should be possible in ALL classes of rclass".

> >
> > So there are already existing precedence for this.  And the
> > documentation for the hook says:
> >
> > "This hook returns true if it is possible to bitcast values held in registers of
> class rclass from mode from to mode to and if doing so preserves the low-
> order bits that are common to both modes. The result is only meaningful if
> rclass has registers that can hold both from and to. The default
> implementation returns true"
> >
> > So it looks like it's use outside of RA is perfectly valid.. and the
> > documentation also mentions in the example the use from the mid-end as
> an example.
> >
> > But if the mid-end maintainers are happy I'll use something else.
> >
> > Tamar
> >
> > > I did similar things in
> > > https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579296.html
> > > (and ALL_REGS doesn't cover all cases for registers which are both
> > > available for qimode and mode, ALL_REGS fail doesn't mean it can't
> > > be subreg, it just means parts of ALL_REGS can't be subreg. but with
> > > a subset of ALL_REGS, there could be a reg class which return true
> > > for
> > > targetm.can_change_mode_class)
> > > >           && targetm.vectorize.vec_perm_const (qimode, qimode,
> > > > target_qi,
> > > v0_qi,
> > > >                                                v1_qi, qimode_indices))
> > > >         return gen_lowpart (mode, target_qi); @@ -6311,7 +6312,8
> > > > @@ expand_vec_perm_const (machine_mode mode, rtx v0, rtx v1,
> > > >      }
> > > >
> > > >    if (qimode != VOIDmode
> > > > -      && selector_fits_mode_p (qimode, qimode_indices))
> > > > +      && selector_fits_mode_p (qimode, qimode_indices)
> > > > +      && targetm.can_change_mode_class (mode, qimode, ALL_REGS))
> > > >      {
> > > >        icode = direct_optab_handler (vec_perm_optab, qimode);
> > > >        if (icode != CODE_FOR_nothing) diff --git
> > > > a/gcc/testsuite/gcc.target/aarch64/ext_1.c
> > > > b/gcc/testsuite/gcc.target/aarch64/ext_1.c
> > > > new file mode 100644
> > > > index
> > > >
> > >
> 0000000000000000000000000000000000000000..18a10a14f1161584267a8472e5
> > > 71
> > > > b3bc2ddf887a
> > >
> > >
> > >
> > >
> > > --
> > > BR,
> > > Hongtao
> 
> 
> 
> --
> BR,
> Hongtao

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 3/8]middle-end: Support extractions of subvectors from arbitrary element position inside a vector
  2022-11-15 10:00             ` Tamar Christina
@ 2022-11-15 17:39               ` Richard Sandiford
  2022-11-17  8:04                 ` Hongtao Liu
  0 siblings, 1 reply; 50+ messages in thread
From: Richard Sandiford @ 2022-11-15 17:39 UTC (permalink / raw)
  To: Tamar Christina
  Cc: Hongtao Liu, Tamar Christina via Gcc-patches, nd, rguenther

Tamar Christina <Tamar.Christina@arm.com> writes:
>> -----Original Message-----
>> From: Hongtao Liu <crazylht@gmail.com>
>> Sent: Tuesday, November 15, 2022 9:37 AM
>> To: Tamar Christina <Tamar.Christina@arm.com>
>> Cc: Richard Sandiford <Richard.Sandiford@arm.com>; Tamar Christina via
>> Gcc-patches <gcc-patches@gcc.gnu.org>; nd <nd@arm.com>;
>> rguenther@suse.de
>> Subject: Re: [PATCH 3/8]middle-end: Support extractions of subvectors from
>> arbitrary element position inside a vector
>>
>> On Tue, Nov 15, 2022 at 4:51 PM Tamar Christina
>> <Tamar.Christina@arm.com> wrote:
>> >
>> > > -----Original Message-----
>> > > From: Hongtao Liu <crazylht@gmail.com>
>> > > Sent: Tuesday, November 15, 2022 8:36 AM
>> > > To: Tamar Christina <Tamar.Christina@arm.com>
>> > > Cc: Richard Sandiford <Richard.Sandiford@arm.com>; Tamar Christina
>> > > via Gcc-patches <gcc-patches@gcc.gnu.org>; nd <nd@arm.com>;
>> > > rguenther@suse.de
>> > > Subject: Re: [PATCH 3/8]middle-end: Support extractions of
>> > > subvectors from arbitrary element position inside a vector
>> > >
>> > > Hi:
>> > >   I'm from https://gcc.gnu.org/pipermail/gcc-patches/2022-
>> > > November/606040.html.
>> > > >      }
>> > > >
>> > > >    /* See if we can get a better vector mode before extracting.
>> > > > */ diff --git a/gcc/optabs.cc b/gcc/optabs.cc index
>> > > >
>> > >
>> cff37ccb0dfc3dd79b97d0abfd872f340855dc96..f338df410265dfe55b68961600
>> > > 9
>> > > 0
>> > > > a453cc6a28d9 100644
>> > > > --- a/gcc/optabs.cc
>> > > > +++ b/gcc/optabs.cc
>> > > > @@ -6267,6 +6267,7 @@ expand_vec_perm_const (machine_mode
>> mode,
>> > > rtx v0, rtx v1,
>> > > >        v0_qi = gen_lowpart (qimode, v0);
>> > > >        v1_qi = gen_lowpart (qimode, v1);
>> > > >        if (targetm.vectorize.vec_perm_const != NULL
>> > > > +         && targetm.can_change_mode_class (mode, qimode,
>> > > > + ALL_REGS)
>> > > It looks like you want to guard gen_lowpart, shouldn't it be better
>> > > to use validate_subreg  or (tmp = gen_lowpart_if_possible (mode,
>> target_qi)).
>> > > IMHO, targetm.can_change_mode_class is mostly used for RA, but not
>> > > to guard gen_lowpart.
>> >
>> > Hmm I don't think this is quite true, there are existing usages in
>> > expr.cc and rtanal.cc That do this and aren't part of RA.  As I
>> > mentioned before for instance the canoncalization of vec_select to subreg
>> in rtlanal for instances uses this.
>> In theory, we need to iterate through all reg classes that can be assigned for
>> both qimode and mode, if any regclass returns true for
>> targetm.can_change_mode_class, the bitcast(validate_subreg) should be ok.
>> Here we just passed ALL_REGS.
>
> Yes, and most targets where this transformation is valid return true here.
>
> I've checked:
>  * alpha
>  * arm
>  * aarch64
>  * rs6000
>  * s390
>  * sparc
>  * pa
>  * mips
>
> And even the default example that other targets use from the documentation
> would return true as the size of the modes are the same.
>
> X86 and RISCV are the only two targets that I found (but didn't check all) that
> blankly return a result based on just the register classes.
>
> That is to say, there are more targets that adhere to the interpretation that
> rclass here means "should be possible in some class in rclass" rather than
> "should be possible in ALL classes of rclass".

Yeah, I agree.  A query "can something stored in ALL_REGS change from
mode M1 to mode M2?" is meaningful if at least one register R in ALL_REGS
can hold both M1 and M2.  It's then the target's job to answer
conservatively so that the result covers all such R.

In principle it's OK for a target to err on the side of caution and forbid
things that are actually OK.  But that's going to risk losing performance
in some cases, and sometimes that loss of performance will be unacceptable.
IMO that's what's happening here.  The target is applying x87 rules to
things that (AIUI) are never stored in x87 registers, and so losing
performance as a result.

Note that the RA also uses ALL_REGS for some things, so this usage
isn't specific to non-RA code.

IMO it's not the job of target-independent code to iterate through
individual classes and aggregate the result.  One of the reasons for
having union classes is to avoid the need to do that.  And ALL_REGS
is the ultimate union class. :-)
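
Purely as an illustration of what "answer conservatively for a union
class" can look like (this is not any particular target's hook, and
SPECIAL_REGS is just a made-up class name for the problematic
registers):

  static bool
  example_can_change_mode_class (machine_mode from, machine_mode to,
				 reg_class_t rclass)
  {
    /* Registers that cannot hold both FROM and TO do not affect the
       answer; here the made-up special registers only hold scalar FP
       modes, so vector-to-vector changes ignore them entirely.  */
    if (VECTOR_MODE_P (from) && VECTOR_MODE_P (to))
      return true;
    /* For anything the special registers might hold, be conservative
       for every class that overlaps them, including ALL_REGS.  */
    return !reg_classes_intersect_p (rclass, SPECIAL_REGS);
  }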

The patch looks correct to me.

Thanks,
Richard

>> >
>> > So there are already existing precedence for this.  And the
>> > documentation for the hook says:
>> >
>> > "This hook returns true if it is possible to bitcast values held in registers of
>> class rclass from mode from to mode to and if doing so preserves the low-
>> order bits that are common to both modes. The result is only meaningful if
>> rclass has registers that can hold both from and to. The default
>> implementation returns true"
>> >
>> > So it looks like it's use outside of RA is perfectly valid.. and the
>> > documentation also mentions in the example the use from the mid-end as
>> an example.
>> >
>> > But if the mid-end maintainers are happy I'll use something else.
>> >
>> > Tamar
>> >
>> > > I did similar things in
>> > > https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579296.html
>> > > (and ALL_REGS doesn't cover all cases for registers which are both
>> > > available for qimode and mode, ALL_REGS fail doesn't mean it can't
>> > > be subreg, it just means parts of ALL_REGS can't be subreg. but with
>> > > a subset of ALL_REGS, there could be a reg class which return true
>> > > for
>> > > targetm.can_change_mode_class)
>> > > >           && targetm.vectorize.vec_perm_const (qimode, qimode,
>> > > > target_qi,
>> > > v0_qi,
>> > > >                                                v1_qi, qimode_indices))
>> > > >         return gen_lowpart (mode, target_qi); @@ -6311,7 +6312,8
>> > > > @@ expand_vec_perm_const (machine_mode mode, rtx v0, rtx v1,
>> > > >      }
>> > > >
>> > > >    if (qimode != VOIDmode
>> > > > -      && selector_fits_mode_p (qimode, qimode_indices))
>> > > > +      && selector_fits_mode_p (qimode, qimode_indices)
>> > > > +      && targetm.can_change_mode_class (mode, qimode, ALL_REGS))
>> > > >      {
>> > > >        icode = direct_optab_handler (vec_perm_optab, qimode);
>> > > >        if (icode != CODE_FOR_nothing) diff --git
>> > > > a/gcc/testsuite/gcc.target/aarch64/ext_1.c
>> > > > b/gcc/testsuite/gcc.target/aarch64/ext_1.c
>> > > > new file mode 100644
>> > > > index
>> > > >
>> > >
>> 0000000000000000000000000000000000000000..18a10a14f1161584267a8472e5
>> > > 71
>> > > > b3bc2ddf887a
>> > >
>> > >
>> > >
>> > >
>> > > --
>> > > BR,
>> > > Hongtao
>>
>>
>>
>> --
>> BR,
>> Hongtao

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 3/8]middle-end: Support extractions of subvectors from arbitrary element position inside a vector
  2022-11-15 17:39               ` Richard Sandiford
@ 2022-11-17  8:04                 ` Hongtao Liu
  2022-11-17  9:39                   ` Richard Sandiford
  0 siblings, 1 reply; 50+ messages in thread
From: Hongtao Liu @ 2022-11-17  8:04 UTC (permalink / raw)
  To: Tamar Christina, Hongtao Liu, Tamar Christina via Gcc-patches,
	nd, rguenther, richard.sandiford

On Wed, Nov 16, 2022 at 1:39 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> -----Original Message-----
> >> From: Hongtao Liu <crazylht@gmail.com>
> >> Sent: Tuesday, November 15, 2022 9:37 AM
> >> To: Tamar Christina <Tamar.Christina@arm.com>
> >> Cc: Richard Sandiford <Richard.Sandiford@arm.com>; Tamar Christina via
> >> Gcc-patches <gcc-patches@gcc.gnu.org>; nd <nd@arm.com>;
> >> rguenther@suse.de
> >> Subject: Re: [PATCH 3/8]middle-end: Support extractions of subvectors from
> >> arbitrary element position inside a vector
> >>
> >> On Tue, Nov 15, 2022 at 4:51 PM Tamar Christina
> >> <Tamar.Christina@arm.com> wrote:
> >> >
> >> > > -----Original Message-----
> >> > > From: Hongtao Liu <crazylht@gmail.com>
> >> > > Sent: Tuesday, November 15, 2022 8:36 AM
> >> > > To: Tamar Christina <Tamar.Christina@arm.com>
> >> > > Cc: Richard Sandiford <Richard.Sandiford@arm.com>; Tamar Christina
> >> > > via Gcc-patches <gcc-patches@gcc.gnu.org>; nd <nd@arm.com>;
> >> > > rguenther@suse.de
> >> > > Subject: Re: [PATCH 3/8]middle-end: Support extractions of
> >> > > subvectors from arbitrary element position inside a vector
> >> > >
> >> > > Hi:
> >> > >   I'm from https://gcc.gnu.org/pipermail/gcc-patches/2022-
> >> > > November/606040.html.
> >> > > >      }
> >> > > >
> >> > > >    /* See if we can get a better vector mode before extracting.
> >> > > > */ diff --git a/gcc/optabs.cc b/gcc/optabs.cc index
> >> > > >
> >> > >
> >> cff37ccb0dfc3dd79b97d0abfd872f340855dc96..f338df410265dfe55b68961600
> >> > > 9
> >> > > 0
> >> > > > a453cc6a28d9 100644
> >> > > > --- a/gcc/optabs.cc
> >> > > > +++ b/gcc/optabs.cc
> >> > > > @@ -6267,6 +6267,7 @@ expand_vec_perm_const (machine_mode
> >> mode,
> >> > > rtx v0, rtx v1,
> >> > > >        v0_qi = gen_lowpart (qimode, v0);
> >> > > >        v1_qi = gen_lowpart (qimode, v1);
> >> > > >        if (targetm.vectorize.vec_perm_const != NULL
> >> > > > +         && targetm.can_change_mode_class (mode, qimode,
> >> > > > + ALL_REGS)
> >> > > It looks like you want to guard gen_lowpart, shouldn't it be better
> >> > > to use validate_subreg  or (tmp = gen_lowpart_if_possible (mode,
> >> target_qi)).
> >> > > IMHO, targetm.can_change_mode_class is mostly used for RA, but not
> >> > > to guard gen_lowpart.
> >> >
> >> > Hmm I don't think this is quite true, there are existing usages in
> >> > expr.cc and rtanal.cc That do this and aren't part of RA.  As I
> >> > mentioned before for instance the canoncalization of vec_select to subreg
> >> in rtlanal for instances uses this.
> >> In theory, we need to iterate through all reg classes that can be assigned for
> >> both qimode and mode, if any regclass returns true for
> >> targetm.can_change_mode_class, the bitcast(validate_subreg) should be ok.
> >> Here we just passed ALL_REGS.
> >
> > Yes, and most targets where this transformation is valid return true here.
> >
> > I've checked:
> >  * alpha
> >  * arm
> >  * aarch64
> >  * rs6000
> >  * s390
> >  * sparc
> >  * pa
> >  * mips
> >
> > And even the default example that other targets use from the documentation
> > would return true as the size of the modes are the same.
> >
> > X86 and RISCV are the only two targets that I found (but didn't check all) that
> > blankly return a result based on just the register classes.
> >
> > That is to say, there are more targets that adhere to the interpretation that
> > rclass here means "should be possible in some class in rclass" rather than
> > "should be possible in ALL classes of rclass".
>
> Yeah, I agree.  A query "can something stored in ALL_REGS change from
> mode M1 to mode M2?" is meaningful if at least one register R in ALL_REGS
> can hold both M1 and M2.  It's then the target's job to answer
> conservatively so that the result covers all such R.
>
> In principle it's OK for a target to err on the side of caution and forbid
> things that are actually OK.  But that's going to risk losing performance
> in some cases, and sometimes that loss of performance will be unacceptable.
> IMO that's what's happening here.  The target is applying x87 rules to
> things that (AIUI) are never stored in x87 registers, and so losing
Yes, it can be optimized, since some modes will never be assigned to x87 registers.
> performance as a result.
>
> Note that the RA also uses ALL_REGS for some things, so this usage
> isn't specific to non-RA code.
The RA passes the minimal reg class (REGNO_REG_CLASS) that contains REGN
when deciding can_change_mode_class, not ALL_REGS:
/* Given a hard REGN a FROM mode and a TO mode, return true if
   REGN can change from mode FROM to mode TO.  */
#define REG_CAN_CHANGE_MODE_P(REGN, FROM, TO)                          \
  (targetm.can_change_mode_class (FROM, TO, REGNO_REG_CLASS (REGN)))

So I still think that using can_change_mode_class outside of the RA, with
ALL_REGS passed in, to decide whether it's OK to generate a subreg is not
a good idea.
>
> IMO it's not the job of target-independent code to iterate through
> individual classes and aggregate the result.  One of the reasons for
> having union classes is to avoid the need to do that.  And ALL_REGS
> is the ultimate union class. :-)
>
> The patch looks correct to me.
>
> Thanks,
> Richard
>
> >> >
> >> > So there are already existing precedence for this.  And the
> >> > documentation for the hook says:
> >> >
> >> > "This hook returns true if it is possible to bitcast values held in registers of
> >> class rclass from mode from to mode to and if doing so preserves the low-
> >> order bits that are common to both modes. The result is only meaningful if
> >> rclass has registers that can hold both from and to. The default
> >> implementation returns true"
> >> >
> >> > So it looks like it's use outside of RA is perfectly valid.. and the
> >> > documentation also mentions in the example the use from the mid-end as
> >> an example.
> >> >
> >> > But if the mid-end maintainers are happy I'll use something else.
> >> >
> >> > Tamar
> >> >
> >> > > I did similar things in
> >> > > https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579296.html
> >> > > (and ALL_REGS doesn't cover all cases for registers which are both
> >> > > available for qimode and mode, ALL_REGS fail doesn't mean it can't
> >> > > be subreg, it just means parts of ALL_REGS can't be subreg. but with
> >> > > a subset of ALL_REGS, there could be a reg class which return true
> >> > > for
> >> > > targetm.can_change_mode_class)
> >> > > >           && targetm.vectorize.vec_perm_const (qimode, qimode,
> >> > > > target_qi,
> >> > > v0_qi,
> >> > > >                                                v1_qi, qimode_indices))
> >> > > >         return gen_lowpart (mode, target_qi); @@ -6311,7 +6312,8
> >> > > > @@ expand_vec_perm_const (machine_mode mode, rtx v0, rtx v1,
> >> > > >      }
> >> > > >
> >> > > >    if (qimode != VOIDmode
> >> > > > -      && selector_fits_mode_p (qimode, qimode_indices))
> >> > > > +      && selector_fits_mode_p (qimode, qimode_indices)
> >> > > > +      && targetm.can_change_mode_class (mode, qimode, ALL_REGS))
> >> > > >      {
> >> > > >        icode = direct_optab_handler (vec_perm_optab, qimode);
> >> > > >        if (icode != CODE_FOR_nothing) diff --git
> >> > > > a/gcc/testsuite/gcc.target/aarch64/ext_1.c
> >> > > > b/gcc/testsuite/gcc.target/aarch64/ext_1.c
> >> > > > new file mode 100644
> >> > > > index
> >> > > >
> >> > >
> >> 0000000000000000000000000000000000000000..18a10a14f1161584267a8472e5
> >> > > 71
> >> > > > b3bc2ddf887a
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > > BR,
> >> > > Hongtao
> >>
> >>
> >>
> >> --
> >> BR,
> >> Hongtao



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 3/8]middle-end: Support extractions of subvectors from arbitrary element position inside a vector
  2022-11-17  8:04                 ` Hongtao Liu
@ 2022-11-17  9:39                   ` Richard Sandiford
  2022-11-17 10:20                     ` Hongtao Liu
  0 siblings, 1 reply; 50+ messages in thread
From: Richard Sandiford @ 2022-11-17  9:39 UTC (permalink / raw)
  To: Hongtao Liu
  Cc: Tamar Christina, Tamar Christina via Gcc-patches, nd, rguenther

Hongtao Liu <crazylht@gmail.com> writes:
> On Wed, Nov 16, 2022 at 1:39 AM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Tamar Christina <Tamar.Christina@arm.com> writes:
>> >> -----Original Message-----
>> >> From: Hongtao Liu <crazylht@gmail.com>
>> >> Sent: Tuesday, November 15, 2022 9:37 AM
>> >> To: Tamar Christina <Tamar.Christina@arm.com>
>> >> Cc: Richard Sandiford <Richard.Sandiford@arm.com>; Tamar Christina via
>> >> Gcc-patches <gcc-patches@gcc.gnu.org>; nd <nd@arm.com>;
>> >> rguenther@suse.de
>> >> Subject: Re: [PATCH 3/8]middle-end: Support extractions of subvectors from
>> >> arbitrary element position inside a vector
>> >>
>> >> On Tue, Nov 15, 2022 at 4:51 PM Tamar Christina
>> >> <Tamar.Christina@arm.com> wrote:
>> >> >
>> >> > > -----Original Message-----
>> >> > > From: Hongtao Liu <crazylht@gmail.com>
>> >> > > Sent: Tuesday, November 15, 2022 8:36 AM
>> >> > > To: Tamar Christina <Tamar.Christina@arm.com>
>> >> > > Cc: Richard Sandiford <Richard.Sandiford@arm.com>; Tamar Christina
>> >> > > via Gcc-patches <gcc-patches@gcc.gnu.org>; nd <nd@arm.com>;
>> >> > > rguenther@suse.de
>> >> > > Subject: Re: [PATCH 3/8]middle-end: Support extractions of
>> >> > > subvectors from arbitrary element position inside a vector
>> >> > >
>> >> > > Hi:
>> >> > >   I'm from https://gcc.gnu.org/pipermail/gcc-patches/2022-
>> >> > > November/606040.html.
>> >> > > >      }
>> >> > > >
>> >> > > >    /* See if we can get a better vector mode before extracting.
>> >> > > > */ diff --git a/gcc/optabs.cc b/gcc/optabs.cc index
>> >> > > >
>> >> > >
>> >> cff37ccb0dfc3dd79b97d0abfd872f340855dc96..f338df410265dfe55b68961600
>> >> > > 9
>> >> > > 0
>> >> > > > a453cc6a28d9 100644
>> >> > > > --- a/gcc/optabs.cc
>> >> > > > +++ b/gcc/optabs.cc
>> >> > > > @@ -6267,6 +6267,7 @@ expand_vec_perm_const (machine_mode
>> >> mode,
>> >> > > rtx v0, rtx v1,
>> >> > > >        v0_qi = gen_lowpart (qimode, v0);
>> >> > > >        v1_qi = gen_lowpart (qimode, v1);
>> >> > > >        if (targetm.vectorize.vec_perm_const != NULL
>> >> > > > +         && targetm.can_change_mode_class (mode, qimode,
>> >> > > > + ALL_REGS)
>> >> > > It looks like you want to guard gen_lowpart, shouldn't it be better
>> >> > > to use validate_subreg  or (tmp = gen_lowpart_if_possible (mode,
>> >> target_qi)).
>> >> > > IMHO, targetm.can_change_mode_class is mostly used for RA, but not
>> >> > > to guard gen_lowpart.
>> >> >
>> >> > Hmm I don't think this is quite true, there are existing usages in
>> >> > expr.cc and rtanal.cc That do this and aren't part of RA.  As I
>> >> > mentioned before for instance the canoncalization of vec_select to subreg
>> >> in rtlanal for instances uses this.
>> >> In theory, we need to iterate through all reg classes that can be assigned for
>> >> both qimode and mode, if any regclass returns true for
>> >> targetm.can_change_mode_class, the bitcast(validate_subreg) should be ok.
>> >> Here we just passed ALL_REGS.
>> >
>> > Yes, and most targets where this transformation is valid return true here.
>> >
>> > I've checked:
>> >  * alpha
>> >  * arm
>> >  * aarch64
>> >  * rs6000
>> >  * s390
>> >  * sparc
>> >  * pa
>> >  * mips
>> >
>> > And even the default example that other targets use from the documentation
>> > would return true as the size of the modes are the same.
>> >
>> > X86 and RISCV are the only two targets that I found (but didn't check all) that
>> > blankly return a result based on just the register classes.
>> >
>> > That is to say, there are more targets that adhere to the interpretation that
>> > rclass here means "should be possible in some class in rclass" rather than
>> > "should be possible in ALL classes of rclass".
>>
>> Yeah, I agree.  A query "can something stored in ALL_REGS change from
>> mode M1 to mode M2?" is meaningful if at least one register R in ALL_REGS
>> can hold both M1 and M2.  It's then the target's job to answer
>> conservatively so that the result covers all such R.
>>
>> In principle it's OK for a target to err on the side of caution and forbid
>> things that are actually OK.  But that's going to risk losing performance
>> in some cases, and sometimes that loss of performance will be unacceptable.
>> IMO that's what's happening here.  The target is applying x87 rules to
>> things that (AIUI) are never stored in x87 registers, and so losing
> Yes, it can be optimized since some mode will never assigned to x87 registers.
>> performance as a result.
>>
>> Note that the RA also uses ALL_REGS for some things, so this usage
>> isn't specific to non-RA code.
> RA passes the minimal reg class(REGNO_REG_CLASS) which contains REGN
> to decide if can_change_mode_class, not ALL_REGS.
> 511/* Given a hard REGN a FROM mode and a TO mode, return true if
> 512   REGN can change from mode FROM to mode TO.  */
> 513#define REG_CAN_CHANGE_MODE_P(REGN, FROM, TO)                          \
> 514  (targetm.can_change_mode_class (FROM, TO, REGNO_REG_CLASS (REGN)))
> 515
>
> So I still think using can_change_mode_class outside of RA with
> ALL_REGS passed to decide whether it's ok to generate subreg is not a
> good idea.

But if the argument is that the only valid uses of can_change_mode_class
are through this macro, the hook isn't describing a class property,
it's describing the property of individual registers.  If we say that
querying individual registers is the only valid thing to do, then
we should change the hook to take a register number rather than
a class enum.

The reason we have a class-based rather than register-based interface
is because it is useful to query classes before you've picked a
specific register.
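
To illustrate the two interfaces side by side (the surrounding context
here is hypothetical; only the hook and the macro themselves are real):

/* Before register allocation no specific hard register is known,
   so the class-based hook is the natural query.  */
static bool
pre_ra_bitcast_ok_p (machine_mode from, machine_mode to)
{
  return targetm.can_change_mode_class (from, to, ALL_REGS);
}

/* Once a specific hard register REGNO is known (as in the RA or in
   validate_subreg), the per-register wrapper narrows the query to
   REGNO_REG_CLASS (REGNO).  */
static bool
known_reg_bitcast_ok_p (unsigned int regno, machine_mode from,
                        machine_mode to)
{
  return REG_CAN_CHANGE_MODE_P (regno, from, to);
}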

Thanks,
Richard

>> IMO it's not the job of target-independent code to iterate through
>> individual classes and aggregate the result.  One of the reasons for
>> having union classes is to avoid the need to do that.  And ALL_REGS
>> is the ultimate union class. :-)
>>
>> The patch looks correct to me.
>>
>> Thanks,
>> Richard
>>
>> >> >
>> >> > So there are already existing precedence for this.  And the
>> >> > documentation for the hook says:
>> >> >
>> >> > "This hook returns true if it is possible to bitcast values held in registers of
>> >> class rclass from mode from to mode to and if doing so preserves the low-
>> >> order bits that are common to both modes. The result is only meaningful if
>> >> rclass has registers that can hold both from and to. The default
>> >> implementation returns true"
>> >> >
>> >> > So it looks like it's use outside of RA is perfectly valid.. and the
>> >> > documentation also mentions in the example the use from the mid-end as
>> >> an example.
>> >> >
>> >> > But if the mid-end maintainers are happy I'll use something else.
>> >> >
>> >> > Tamar
>> >> >
>> >> > > I did similar things in
>> >> > > https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579296.html
>> >> > > (and ALL_REGS doesn't cover all cases for registers which are both
>> >> > > available for qimode and mode, ALL_REGS fail doesn't mean it can't
>> >> > > be subreg, it just means parts of ALL_REGS can't be subreg. but with
>> >> > > a subset of ALL_REGS, there could be a reg class which return true
>> >> > > for
>> >> > > targetm.can_change_mode_class)
>> >> > > >           && targetm.vectorize.vec_perm_const (qimode, qimode,
>> >> > > > target_qi,
>> >> > > v0_qi,
>> >> > > >                                                v1_qi, qimode_indices))
>> >> > > >         return gen_lowpart (mode, target_qi); @@ -6311,7 +6312,8
>> >> > > > @@ expand_vec_perm_const (machine_mode mode, rtx v0, rtx v1,
>> >> > > >      }
>> >> > > >
>> >> > > >    if (qimode != VOIDmode
>> >> > > > -      && selector_fits_mode_p (qimode, qimode_indices))
>> >> > > > +      && selector_fits_mode_p (qimode, qimode_indices)
>> >> > > > +      && targetm.can_change_mode_class (mode, qimode, ALL_REGS))
>> >> > > >      {
>> >> > > >        icode = direct_optab_handler (vec_perm_optab, qimode);
>> >> > > >        if (icode != CODE_FOR_nothing) diff --git
>> >> > > > a/gcc/testsuite/gcc.target/aarch64/ext_1.c
>> >> > > > b/gcc/testsuite/gcc.target/aarch64/ext_1.c
>> >> > > > new file mode 100644
>> >> > > > index
>> >> > > >
>> >> > >
>> >> 0000000000000000000000000000000000000000..18a10a14f1161584267a8472e5
>> >> > > 71
>> >> > > > b3bc2ddf887a
>> >> > >
>> >> > >
>> >> > >
>> >> > >
>> >> > > --
>> >> > > BR,
>> >> > > Hongtao
>> >>
>> >>
>> >>
>> >> --
>> >> BR,
>> >> Hongtao

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 3/8]middle-end: Support extractions of subvectors from arbitrary element position inside a vector
  2022-11-17  9:39                   ` Richard Sandiford
@ 2022-11-17 10:20                     ` Hongtao Liu
  2022-11-17 13:59                       ` Richard Sandiford
  0 siblings, 1 reply; 50+ messages in thread
From: Hongtao Liu @ 2022-11-17 10:20 UTC (permalink / raw)
  To: Hongtao Liu, Tamar Christina, Tamar Christina via Gcc-patches,
	nd, rguenther, richard.sandiford

On Thu, Nov 17, 2022 at 5:39 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Hongtao Liu <crazylht@gmail.com> writes:
> > On Wed, Nov 16, 2022 at 1:39 AM Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> >> -----Original Message-----
> >> >> From: Hongtao Liu <crazylht@gmail.com>
> >> >> Sent: Tuesday, November 15, 2022 9:37 AM
> >> >> To: Tamar Christina <Tamar.Christina@arm.com>
> >> >> Cc: Richard Sandiford <Richard.Sandiford@arm.com>; Tamar Christina via
> >> >> Gcc-patches <gcc-patches@gcc.gnu.org>; nd <nd@arm.com>;
> >> >> rguenther@suse.de
> >> >> Subject: Re: [PATCH 3/8]middle-end: Support extractions of subvectors from
> >> >> arbitrary element position inside a vector
> >> >>
> >> >> On Tue, Nov 15, 2022 at 4:51 PM Tamar Christina
> >> >> <Tamar.Christina@arm.com> wrote:
> >> >> >
> >> >> > > -----Original Message-----
> >> >> > > From: Hongtao Liu <crazylht@gmail.com>
> >> >> > > Sent: Tuesday, November 15, 2022 8:36 AM
> >> >> > > To: Tamar Christina <Tamar.Christina@arm.com>
> >> >> > > Cc: Richard Sandiford <Richard.Sandiford@arm.com>; Tamar Christina
> >> >> > > via Gcc-patches <gcc-patches@gcc.gnu.org>; nd <nd@arm.com>;
> >> >> > > rguenther@suse.de
> >> >> > > Subject: Re: [PATCH 3/8]middle-end: Support extractions of
> >> >> > > subvectors from arbitrary element position inside a vector
> >> >> > >
> >> >> > > Hi:
> >> >> > >   I'm from https://gcc.gnu.org/pipermail/gcc-patches/2022-
> >> >> > > November/606040.html.
> >> >> > > >      }
> >> >> > > >
> >> >> > > >    /* See if we can get a better vector mode before extracting.
> >> >> > > > */ diff --git a/gcc/optabs.cc b/gcc/optabs.cc index
> >> >> > > >
> >> >> > >
> >> >> cff37ccb0dfc3dd79b97d0abfd872f340855dc96..f338df410265dfe55b68961600
> >> >> > > 9
> >> >> > > 0
> >> >> > > > a453cc6a28d9 100644
> >> >> > > > --- a/gcc/optabs.cc
> >> >> > > > +++ b/gcc/optabs.cc
> >> >> > > > @@ -6267,6 +6267,7 @@ expand_vec_perm_const (machine_mode
> >> >> mode,
> >> >> > > rtx v0, rtx v1,
> >> >> > > >        v0_qi = gen_lowpart (qimode, v0);
> >> >> > > >        v1_qi = gen_lowpart (qimode, v1);
> >> >> > > >        if (targetm.vectorize.vec_perm_const != NULL
> >> >> > > > +         && targetm.can_change_mode_class (mode, qimode,
> >> >> > > > + ALL_REGS)
> >> >> > > It looks like you want to guard gen_lowpart, shouldn't it be better
> >> >> > > to use validate_subreg  or (tmp = gen_lowpart_if_possible (mode,
> >> >> target_qi)).
> >> >> > > IMHO, targetm.can_change_mode_class is mostly used for RA, but not
> >> >> > > to guard gen_lowpart.
> >> >> >
> >> >> > Hmm I don't think this is quite true, there are existing usages in
> >> >> > expr.cc and rtanal.cc That do this and aren't part of RA.  As I
> >> >> > mentioned before for instance the canoncalization of vec_select to subreg
> >> >> in rtlanal for instances uses this.
> >> >> In theory, we need to iterate through all reg classes that can be assigned for
> >> >> both qimode and mode, if any regclass returns true for
> >> >> targetm.can_change_mode_class, the bitcast(validate_subreg) should be ok.
> >> >> Here we just passed ALL_REGS.
> >> >
> >> > Yes, and most targets where this transformation is valid return true here.
> >> >
> >> > I've checked:
> >> >  * alpha
> >> >  * arm
> >> >  * aarch64
> >> >  * rs6000
> >> >  * s390
> >> >  * sparc
> >> >  * pa
> >> >  * mips
> >> >
> >> > And even the default example that other targets use from the documentation
> >> > would return true as the size of the modes are the same.
> >> >
> >> > X86 and RISCV are the only two targets that I found (but didn't check all) that
> >> > blankly return a result based on just the register classes.
> >> >
> >> > That is to say, there are more targets that adhere to the interpretation that
> >> > rclass here means "should be possible in some class in rclass" rather than
> >> > "should be possible in ALL classes of rclass".
> >>
> >> Yeah, I agree.  A query "can something stored in ALL_REGS change from
> >> mode M1 to mode M2?" is meaningful if at least one register R in ALL_REGS
> >> can hold both M1 and M2.  It's then the target's job to answer
> >> conservatively so that the result covers all such R.
> >>
> >> In principle it's OK for a target to err on the side of caution and forbid
> >> things that are actually OK.  But that's going to risk losing performance
> >> in some cases, and sometimes that loss of performance will be unacceptable.
> >> IMO that's what's happening here.  The target is applying x87 rules to
> >> things that (AIUI) are never stored in x87 registers, and so losing
> > Yes, it can be optimized since some mode will never assigned to x87 registers.
> >> performance as a result.
> >>
> >> Note that the RA also uses ALL_REGS for some things, so this usage
> >> isn't specific to non-RA code.
> > RA passes the minimal reg class(REGNO_REG_CLASS) which contains REGN
> > to decide if can_change_mode_class, not ALL_REGS.
> > 511/* Given a hard REGN a FROM mode and a TO mode, return true if
> > 512   REGN can change from mode FROM to mode TO.  */
> > 513#define REG_CAN_CHANGE_MODE_P(REGN, FROM, TO)                          \
> > 514  (targetm.can_change_mode_class (FROM, TO, REGNO_REG_CLASS (REGN)))
> > 515
> >
> > So I still think using can_change_mode_class outside of RA with
> > ALL_REGS passed to decide whether it's ok to generate subreg is not a
> > good idea.
>
> But if the argument is that the only valid uses of can_change_mode_class
> are through this macro, the hook isn't describing a class property,
> it's describing the property of individual registers.  If we say that
> querying individual registers is the only valid thing to do them
> we should change the hook to take a register number rather than
> a class enum.
>
> The reason we have a class-based rather than register-based interface
> is because it is useful to query classes before you've picked a
> specific register.
For individual registers in the minimal reg class, we assume they are
no different from each other; I guess that's why we have
REGNO_REG_CLASS and class-based interfaces rather than register-based
interfaces.
But ALL_REGS is not the minimal reg class, it's the largest, so using
it here is not that suitable.
If the argument is that the hook should return true when some r in rclass
is OK for the mode change, then why would the RA use REGNO_REG_CLASS
rather than ALL_REGS?
Another spot is validate_subreg, where we use the minimal reg class
instead of ALL_REGS:
  /* This is a normal subreg.  Verify that the offset is representable.  */

  /* For hard registers, we already have most of these rules collected in
     subreg_offset_representable_p.  */
  if (reg && REG_P (reg) && HARD_REGISTER_P (reg))
    {
      unsigned int regno = REGNO (reg);

      if ((COMPLEX_MODE_P (imode) || VECTOR_MODE_P (imode))
          && GET_MODE_INNER (imode) == omode)
        ;
      else if (!REG_CAN_CHANGE_MODE_P (regno, imode, omode))
        return false;

I think we do need some hook in the middle end to query things like
whether some r in rclass is OK for a mode change, but not by reusing
can_change_mode_class.
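
For illustration, the kind of query I have in mind would be roughly the
sketch below (hypothetical, not an existing interface; the name and loop
are made up, and a real version would also have to check that rc can
hold both modes at all):

/* Hypothetical: return true if some register class contained in RCLASS
   allows a FROM -> TO mode change.  */
static bool
some_subclass_can_change_mode_p (machine_mode from, machine_mode to,
                                 reg_class_t rclass)
{
  for (int rc = 0; rc < N_REG_CLASSES; rc++)
    if (reg_class_subset_p ((reg_class_t) rc, rclass)
        && targetm.can_change_mode_class (from, to, (reg_class_t) rc))
      return true;
  return false;
}
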
>
> Thanks,
> Richard
>
> >> IMO it's not the job of target-independent code to iterate through
> >> individual classes and aggregate the result.  One of the reasons for
> >> having union classes is to avoid the need to do that.  And ALL_REGS
> >> is the ultimate union class. :-)
> >>
> >> The patch looks correct to me.
> >>
> >> Thanks,
> >> Richard
> >>
> >> >> >
> >> >> > So there are already existing precedence for this.  And the
> >> >> > documentation for the hook says:
> >> >> >
> >> >> > "This hook returns true if it is possible to bitcast values held in registers of
> >> >> class rclass from mode from to mode to and if doing so preserves the low-
> >> >> order bits that are common to both modes. The result is only meaningful if
> >> >> rclass has registers that can hold both from and to. The default
> >> >> implementation returns true"
> >> >> >
> >> >> > So it looks like it's use outside of RA is perfectly valid.. and the
> >> >> > documentation also mentions in the example the use from the mid-end as
> >> >> an example.
> >> >> >
> >> >> > But if the mid-end maintainers are happy I'll use something else.
> >> >> >
> >> >> > Tamar
> >> >> >
> >> >> > > I did similar things in
> >> >> > > https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579296.html
> >> >> > > (and ALL_REGS doesn't cover all cases for registers which are both
> >> >> > > available for qimode and mode, ALL_REGS fail doesn't mean it can't
> >> >> > > be subreg, it just means parts of ALL_REGS can't be subreg. but with
> >> >> > > a subset of ALL_REGS, there could be a reg class which return true
> >> >> > > for
> >> >> > > targetm.can_change_mode_class)
> >> >> > > >           && targetm.vectorize.vec_perm_const (qimode, qimode,
> >> >> > > > target_qi,
> >> >> > > v0_qi,
> >> >> > > >                                                v1_qi, qimode_indices))
> >> >> > > >         return gen_lowpart (mode, target_qi); @@ -6311,7 +6312,8
> >> >> > > > @@ expand_vec_perm_const (machine_mode mode, rtx v0, rtx v1,
> >> >> > > >      }
> >> >> > > >
> >> >> > > >    if (qimode != VOIDmode
> >> >> > > > -      && selector_fits_mode_p (qimode, qimode_indices))
> >> >> > > > +      && selector_fits_mode_p (qimode, qimode_indices)
> >> >> > > > +      && targetm.can_change_mode_class (mode, qimode, ALL_REGS))
> >> >> > > >      {
> >> >> > > >        icode = direct_optab_handler (vec_perm_optab, qimode);
> >> >> > > >        if (icode != CODE_FOR_nothing) diff --git
> >> >> > > > a/gcc/testsuite/gcc.target/aarch64/ext_1.c
> >> >> > > > b/gcc/testsuite/gcc.target/aarch64/ext_1.c
> >> >> > > > new file mode 100644
> >> >> > > > index
> >> >> > > >
> >> >> > >
> >> >> 0000000000000000000000000000000000000000..18a10a14f1161584267a8472e5
> >> >> > > 71
> >> >> > > > b3bc2ddf887a
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > > --
> >> >> > > BR,
> >> >> > > Hongtao
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> BR,
> >> >> Hongtao



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 3/8]middle-end: Support extractions of subvectors from arbitrary element position inside a vector
  2022-11-17 10:20                     ` Hongtao Liu
@ 2022-11-17 13:59                       ` Richard Sandiford
  2022-11-18  2:31                         ` Hongtao Liu
  0 siblings, 1 reply; 50+ messages in thread
From: Richard Sandiford @ 2022-11-17 13:59 UTC (permalink / raw)
  To: Hongtao Liu
  Cc: Tamar Christina, Tamar Christina via Gcc-patches, nd, rguenther

Hongtao Liu <crazylht@gmail.com> writes:
> On Thu, Nov 17, 2022 at 5:39 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Hongtao Liu <crazylht@gmail.com> writes:
>> > On Wed, Nov 16, 2022 at 1:39 AM Richard Sandiford
>> > <richard.sandiford@arm.com> wrote:
>> >>
>> >> Tamar Christina <Tamar.Christina@arm.com> writes:
>> >> >> -----Original Message-----
>> >> >> From: Hongtao Liu <crazylht@gmail.com>
>> >> >> Sent: Tuesday, November 15, 2022 9:37 AM
>> >> >> To: Tamar Christina <Tamar.Christina@arm.com>
>> >> >> Cc: Richard Sandiford <Richard.Sandiford@arm.com>; Tamar Christina via
>> >> >> Gcc-patches <gcc-patches@gcc.gnu.org>; nd <nd@arm.com>;
>> >> >> rguenther@suse.de
>> >> >> Subject: Re: [PATCH 3/8]middle-end: Support extractions of subvectors from
>> >> >> arbitrary element position inside a vector
>> >> >>
>> >> >> On Tue, Nov 15, 2022 at 4:51 PM Tamar Christina
>> >> >> <Tamar.Christina@arm.com> wrote:
>> >> >> >
>> >> >> > > -----Original Message-----
>> >> >> > > From: Hongtao Liu <crazylht@gmail.com>
>> >> >> > > Sent: Tuesday, November 15, 2022 8:36 AM
>> >> >> > > To: Tamar Christina <Tamar.Christina@arm.com>
>> >> >> > > Cc: Richard Sandiford <Richard.Sandiford@arm.com>; Tamar Christina
>> >> >> > > via Gcc-patches <gcc-patches@gcc.gnu.org>; nd <nd@arm.com>;
>> >> >> > > rguenther@suse.de
>> >> >> > > Subject: Re: [PATCH 3/8]middle-end: Support extractions of
>> >> >> > > subvectors from arbitrary element position inside a vector
>> >> >> > >
>> >> >> > > Hi:
>> >> >> > >   I'm from https://gcc.gnu.org/pipermail/gcc-patches/2022-
>> >> >> > > November/606040.html.
>> >> >> > > >      }
>> >> >> > > >
>> >> >> > > >    /* See if we can get a better vector mode before extracting.
>> >> >> > > > */ diff --git a/gcc/optabs.cc b/gcc/optabs.cc index
>> >> >> > > >
>> >> >> > >
>> >> >> cff37ccb0dfc3dd79b97d0abfd872f340855dc96..f338df410265dfe55b68961600
>> >> >> > > 9
>> >> >> > > 0
>> >> >> > > > a453cc6a28d9 100644
>> >> >> > > > --- a/gcc/optabs.cc
>> >> >> > > > +++ b/gcc/optabs.cc
>> >> >> > > > @@ -6267,6 +6267,7 @@ expand_vec_perm_const (machine_mode
>> >> >> mode,
>> >> >> > > rtx v0, rtx v1,
>> >> >> > > >        v0_qi = gen_lowpart (qimode, v0);
>> >> >> > > >        v1_qi = gen_lowpart (qimode, v1);
>> >> >> > > >        if (targetm.vectorize.vec_perm_const != NULL
>> >> >> > > > +         && targetm.can_change_mode_class (mode, qimode,
>> >> >> > > > + ALL_REGS)
>> >> >> > > It looks like you want to guard gen_lowpart, shouldn't it be better
>> >> >> > > to use validate_subreg  or (tmp = gen_lowpart_if_possible (mode,
>> >> >> target_qi)).
>> >> >> > > IMHO, targetm.can_change_mode_class is mostly used for RA, but not
>> >> >> > > to guard gen_lowpart.
>> >> >> >
>> >> >> > Hmm I don't think this is quite true, there are existing usages in
>> >> >> > expr.cc and rtanal.cc That do this and aren't part of RA.  As I
>> >> >> > mentioned before for instance the canoncalization of vec_select to subreg
>> >> >> in rtlanal for instances uses this.
>> >> >> In theory, we need to iterate through all reg classes that can be assigned for
>> >> >> both qimode and mode, if any regclass returns true for
>> >> >> targetm.can_change_mode_class, the bitcast(validate_subreg) should be ok.
>> >> >> Here we just passed ALL_REGS.
>> >> >
>> >> > Yes, and most targets where this transformation is valid return true here.
>> >> >
>> >> > I've checked:
>> >> >  * alpha
>> >> >  * arm
>> >> >  * aarch64
>> >> >  * rs6000
>> >> >  * s390
>> >> >  * sparc
>> >> >  * pa
>> >> >  * mips
>> >> >
>> >> > And even the default example that other targets use from the documentation
>> >> > would return true as the size of the modes are the same.
>> >> >
>> >> > X86 and RISCV are the only two targets that I found (but didn't check all) that
>> >> > blankly return a result based on just the register classes.
>> >> >
>> >> > That is to say, there are more targets that adhere to the interpretation that
>> >> > rclass here means "should be possible in some class in rclass" rather than
>> >> > "should be possible in ALL classes of rclass".
>> >>
>> >> Yeah, I agree.  A query "can something stored in ALL_REGS change from
>> >> mode M1 to mode M2?" is meaningful if at least one register R in ALL_REGS
>> >> can hold both M1 and M2.  It's then the target's job to answer
>> >> conservatively so that the result covers all such R.
>> >>
>> >> In principle it's OK for a target to err on the side of caution and forbid
>> >> things that are actually OK.  But that's going to risk losing performance
>> >> in some cases, and sometimes that loss of performance will be unacceptable.
>> >> IMO that's what's happening here.  The target is applying x87 rules to
>> >> things that (AIUI) are never stored in x87 registers, and so losing
>> > Yes, it can be optimized since some mode will never assigned to x87 registers.
>> >> performance as a result.
>> >>
>> >> Note that the RA also uses ALL_REGS for some things, so this usage
>> >> isn't specific to non-RA code.
>> > RA passes the minimal reg class(REGNO_REG_CLASS) which contains REGN
>> > to decide if can_change_mode_class, not ALL_REGS.
>> > 511/* Given a hard REGN a FROM mode and a TO mode, return true if
>> > 512   REGN can change from mode FROM to mode TO.  */
>> > 513#define REG_CAN_CHANGE_MODE_P(REGN, FROM, TO)                          \
>> > 514  (targetm.can_change_mode_class (FROM, TO, REGNO_REG_CLASS (REGN)))
>> > 515
>> >
>> > So I still think using can_change_mode_class outside of RA with
>> > ALL_REGS passed to decide whether it's ok to generate subreg is not a
>> > good idea.
>>
>> But if the argument is that the only valid uses of can_change_mode_class
>> are through this macro, the hook isn't describing a class property,
>> it's describing the property of individual registers.  If we say that
>> querying individual registers is the only valid thing to do them
>> we should change the hook to take a register number rather than
>> a class enum.
>>
>> The reason we have a class-based rather than register-based interface
>> is because it is useful to query classes before you've picked a
>> specific register.
> For individual registers in the minimal reg class, we assume they are
> not different from each other, I guess that's why we have
> REGNO_REG_CLASS and class-based interfaces other than register-based
> interfaces.

I don't think that's necessarily true.  We have things like
hard_regno_nregs that operate on individual registers.  And even
the x86 implementation of the hook uses subset operations rather
than comparing the class for equality, which suggests some attempt
to handle classes other than those returned by REGNO_REG_CLASS.

> But for ALL_REGS, it's not the minimal reg class, it's the largest.
> Using it It's not that suitable.

But I think it is suitable for the gimple level, where we have no
information that would narrow the choice to a specific register class.

> If the argument is if some r in rclass is ok for mode change, the hook
> would return true, then why would RA use REGNO_REG_CLASS other than
> ALL_REGS.

If you know which hard register something is stored in, it makes
sense to ask about the closest enclosing class rather than something
more general.  If a particular mode can be stored in both general
registers and floating-point registers, and if floating-point registers
have restrictions that general registers don't, ALL_REGS should honour
the floating-point constraints.  But it wouldn't make sense to use the
floating-point constraints for something that is known to be in a
general register.
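
As a purely illustrative sketch of that principle (EXAMPLE_FP_REGS,
example_fp_can_hold_p and the specific restriction are invented here,
not taken from any real port):

/* Hypothetical target hook.  The FP restriction only constrains queries
   about classes that overlap EXAMPLE_FP_REGS and about modes that
   EXAMPLE_FP_REGS can actually hold; a query about GENERAL_REGS, or
   about modes the FP registers cannot hold at all, is unaffected.  */
static bool
example_can_change_mode_class (machine_mode from, machine_mode to,
                               reg_class_t rclass)
{
  if (reg_classes_intersect_p (rclass, EXAMPLE_FP_REGS)
      && example_fp_can_hold_p (from)
      && example_fp_can_hold_p (to)
      /* Invented restriction: FP registers only allow size-preserving
         mode changes.  */
      && maybe_ne (GET_MODE_SIZE (from), GET_MODE_SIZE (to)))
    return false;

  return true;
}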

> Another spot is in validate_subreg, we're using the minimal reg class
> instead of ALL_REGS.
>  973  /* This is a normal subreg.  Verify that the offset is representable.  */
>  974
>  975  /* For hard registers, we already have most of these rules collected in
>  976     subreg_offset_representable_p.  */
>  977  if (reg && REG_P (reg) && HARD_REGISTER_P (reg))
>  978    {
>  979      unsigned int regno = REGNO (reg);
>  980
>  981      if ((COMPLEX_MODE_P (imode) || VECTOR_MODE_P (imode))
>  982          && GET_MODE_INNER (imode) == omode)
>  983        ;
>  984      else if (!REG_CAN_CHANGE_MODE_P (regno, imode, omode))
>  985        return false;

But here too, we're using REG_CAN_CHANGE_MODE_P because we know
the register number.  It wouldn't make sense to ask about a more
general class than necessary.

REG_CANNOT_CHANGE_MODE_P (as it was then) was added in 2002 as a
convenient interface to CANNOT_CHANGE_MODE_CLASS.  CANNOT_CHANGE_MODE_CLASS
in turn replaced CLASS_CANNOT_CHANGE_MODE_P, which only took two modes,
and wasn't given any class information.  So this interface was
originally a query about modes, not a query about classes.  The class
information was added later since (understandably) modes weren't always
enough on their own.  But I think it is still fundamentally a query
about modes, with the class provided for context, rather than a query
about classes, with modes provided by context.

> I think we do need some hook in the middle end to query things like if
> some r in rclass is ok for mode change?  but not reusing
> can_change_mode_class.

But if we add a hook to say "are mode changes from mode M1 to mode M2 OK?",
which is what Tamar's patch and some other existing code want to know,
I fear we'll just reintroduce the old CLASS_CANNOT_CHANGE_MODE_P (but
hopefully without embedding the negative sense).  I don't think it makes
sense to have that hook alongside the existing one.  It would require
targets to duplicate information and would run the risk of conflicting
information for corner cases.  IMO it would repeat the mistake of having
both hard_regno_nregs and class_max_nregs; really, the latter ought
to be calculated from the former.
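
(Purely for illustration, and not a proposal: a class_max_nregs-style
answer derived from hard_regno_nregs might look like the naive loop
below.  The function name is made up.)

/* Maximise hard_regno_nregs over the registers in RCLASS that can
   hold MODE.  */
static unsigned int
example_class_max_nregs (reg_class_t rclass, machine_mode mode)
{
  unsigned int max_nregs = 1;
  for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
    if (TEST_HARD_REG_BIT (reg_class_contents[rclass], regno)
        && targetm.hard_regno_mode_ok (regno, mode))
      max_nregs = MAX (max_nregs, targetm.hard_regno_nregs (regno, mode));
  return max_nregs;
}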

Thanks,
Richard

>> Thanks,
>> Richard
>>
>> >> IMO it's not the job of target-independent code to iterate through
>> >> individual classes and aggregate the result.  One of the reasons for
>> >> having union classes is to avoid the need to do that.  And ALL_REGS
>> >> is the ultimate union class. :-)
>> >>
>> >> The patch looks correct to me.
>> >>
>> >> Thanks,
>> >> Richard
>> >>
>> >> >> >
>> >> >> > So there are already existing precedence for this.  And the
>> >> >> > documentation for the hook says:
>> >> >> >
>> >> >> > "This hook returns true if it is possible to bitcast values held in registers of
>> >> >> class rclass from mode from to mode to and if doing so preserves the low-
>> >> >> order bits that are common to both modes. The result is only meaningful if
>> >> >> rclass has registers that can hold both from and to. The default
>> >> >> implementation returns true"
>> >> >> >
>> >> >> > So it looks like it's use outside of RA is perfectly valid.. and the
>> >> >> > documentation also mentions in the example the use from the mid-end as
>> >> >> an example.
>> >> >> >
>> >> >> > But if the mid-end maintainers are happy I'll use something else.
>> >> >> >
>> >> >> > Tamar
>> >> >> >
>> >> >> > > I did similar things in
>> >> >> > > https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579296.html
>> >> >> > > (and ALL_REGS doesn't cover all cases for registers which are both
>> >> >> > > available for qimode and mode, ALL_REGS fail doesn't mean it can't
>> >> >> > > be subreg, it just means parts of ALL_REGS can't be subreg. but with
>> >> >> > > a subset of ALL_REGS, there could be a reg class which return true
>> >> >> > > for
>> >> >> > > targetm.can_change_mode_class)
>> >> >> > > >           && targetm.vectorize.vec_perm_const (qimode, qimode,
>> >> >> > > > target_qi,
>> >> >> > > v0_qi,
>> >> >> > > >                                                v1_qi, qimode_indices))
>> >> >> > > >         return gen_lowpart (mode, target_qi); @@ -6311,7 +6312,8
>> >> >> > > > @@ expand_vec_perm_const (machine_mode mode, rtx v0, rtx v1,
>> >> >> > > >      }
>> >> >> > > >
>> >> >> > > >    if (qimode != VOIDmode
>> >> >> > > > -      && selector_fits_mode_p (qimode, qimode_indices))
>> >> >> > > > +      && selector_fits_mode_p (qimode, qimode_indices)
>> >> >> > > > +      && targetm.can_change_mode_class (mode, qimode, ALL_REGS))
>> >> >> > > >      {
>> >> >> > > >        icode = direct_optab_handler (vec_perm_optab, qimode);
>> >> >> > > >        if (icode != CODE_FOR_nothing) diff --git
>> >> >> > > > a/gcc/testsuite/gcc.target/aarch64/ext_1.c
>> >> >> > > > b/gcc/testsuite/gcc.target/aarch64/ext_1.c
>> >> >> > > > new file mode 100644
>> >> >> > > > index
>> >> >> > > >
>> >> >> > >
>> >> >> 0000000000000000000000000000000000000000..18a10a14f1161584267a8472e5
>> >> >> > > 71
>> >> >> > > > b3bc2ddf887a
>> >> >> > >
>> >> >> > >
>> >> >> > >
>> >> >> > >
>> >> >> > > --
>> >> >> > > BR,
>> >> >> > > Hongtao
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> BR,
>> >> >> Hongtao

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 3/8]middle-end: Support extractions of subvectors from arbitrary element position inside a vector
  2022-11-17 13:59                       ` Richard Sandiford
@ 2022-11-18  2:31                         ` Hongtao Liu
  2022-11-18  9:16                           ` Richard Sandiford
  0 siblings, 1 reply; 50+ messages in thread
From: Hongtao Liu @ 2022-11-18  2:31 UTC (permalink / raw)
  To: Hongtao Liu, Tamar Christina, Tamar Christina via Gcc-patches,
	nd, rguenther, richard.sandiford

On Thu, Nov 17, 2022 at 9:59 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Hongtao Liu <crazylht@gmail.com> writes:
> > On Thu, Nov 17, 2022 at 5:39 PM Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Hongtao Liu <crazylht@gmail.com> writes:
> >> > On Wed, Nov 16, 2022 at 1:39 AM Richard Sandiford
> >> > <richard.sandiford@arm.com> wrote:
> >> >>
> >> >> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> >> >> -----Original Message-----
> >> >> >> From: Hongtao Liu <crazylht@gmail.com>
> >> >> >> Sent: Tuesday, November 15, 2022 9:37 AM
> >> >> >> To: Tamar Christina <Tamar.Christina@arm.com>
> >> >> >> Cc: Richard Sandiford <Richard.Sandiford@arm.com>; Tamar Christina via
> >> >> >> Gcc-patches <gcc-patches@gcc.gnu.org>; nd <nd@arm.com>;
> >> >> >> rguenther@suse.de
> >> >> >> Subject: Re: [PATCH 3/8]middle-end: Support extractions of subvectors from
> >> >> >> arbitrary element position inside a vector
> >> >> >>
> >> >> >> On Tue, Nov 15, 2022 at 4:51 PM Tamar Christina
> >> >> >> <Tamar.Christina@arm.com> wrote:
> >> >> >> >
> >> >> >> > > -----Original Message-----
> >> >> >> > > From: Hongtao Liu <crazylht@gmail.com>
> >> >> >> > > Sent: Tuesday, November 15, 2022 8:36 AM
> >> >> >> > > To: Tamar Christina <Tamar.Christina@arm.com>
> >> >> >> > > Cc: Richard Sandiford <Richard.Sandiford@arm.com>; Tamar Christina
> >> >> >> > > via Gcc-patches <gcc-patches@gcc.gnu.org>; nd <nd@arm.com>;
> >> >> >> > > rguenther@suse.de
> >> >> >> > > Subject: Re: [PATCH 3/8]middle-end: Support extractions of
> >> >> >> > > subvectors from arbitrary element position inside a vector
> >> >> >> > >
> >> >> >> > > Hi:
> >> >> >> > >   I'm from https://gcc.gnu.org/pipermail/gcc-patches/2022-
> >> >> >> > > November/606040.html.
> >> >> >> > > >      }
> >> >> >> > > >
> >> >> >> > > >    /* See if we can get a better vector mode before extracting.
> >> >> >> > > > */ diff --git a/gcc/optabs.cc b/gcc/optabs.cc index
> >> >> >> > > >
> >> >> >> > >
> >> >> >> cff37ccb0dfc3dd79b97d0abfd872f340855dc96..f338df410265dfe55b68961600
> >> >> >> > > 9
> >> >> >> > > 0
> >> >> >> > > > a453cc6a28d9 100644
> >> >> >> > > > --- a/gcc/optabs.cc
> >> >> >> > > > +++ b/gcc/optabs.cc
> >> >> >> > > > @@ -6267,6 +6267,7 @@ expand_vec_perm_const (machine_mode
> >> >> >> mode,
> >> >> >> > > rtx v0, rtx v1,
> >> >> >> > > >        v0_qi = gen_lowpart (qimode, v0);
> >> >> >> > > >        v1_qi = gen_lowpart (qimode, v1);
> >> >> >> > > >        if (targetm.vectorize.vec_perm_const != NULL
> >> >> >> > > > +         && targetm.can_change_mode_class (mode, qimode,
> >> >> >> > > > + ALL_REGS)
> >> >> >> > > It looks like you want to guard gen_lowpart, shouldn't it be better
> >> >> >> > > to use validate_subreg  or (tmp = gen_lowpart_if_possible (mode,
> >> >> >> target_qi)).
> >> >> >> > > IMHO, targetm.can_change_mode_class is mostly used for RA, but not
> >> >> >> > > to guard gen_lowpart.
> >> >> >> >
> >> >> >> > Hmm I don't think this is quite true, there are existing usages in
> >> >> >> > expr.cc and rtanal.cc That do this and aren't part of RA.  As I
> >> >> >> > mentioned before for instance the canoncalization of vec_select to subreg
> >> >> >> in rtlanal for instances uses this.
> >> >> >> In theory, we need to iterate through all reg classes that can be assigned for
> >> >> >> both qimode and mode, if any regclass returns true for
> >> >> >> targetm.can_change_mode_class, the bitcast(validate_subreg) should be ok.
> >> >> >> Here we just passed ALL_REGS.
> >> >> >
> >> >> > Yes, and most targets where this transformation is valid return true here.
> >> >> >
> >> >> > I've checked:
> >> >> >  * alpha
> >> >> >  * arm
> >> >> >  * aarch64
> >> >> >  * rs6000
> >> >> >  * s390
> >> >> >  * sparc
> >> >> >  * pa
> >> >> >  * mips
> >> >> >
> >> >> > And even the default example that other targets use from the documentation
> >> >> > would return true as the size of the modes are the same.
> >> >> >
> >> >> > X86 and RISCV are the only two targets that I found (but didn't check all) that
> >> >> > blankly return a result based on just the register classes.
> >> >> >
> >> >> > That is to say, there are more targets that adhere to the interpretation that
> >> >> > rclass here means "should be possible in some class in rclass" rather than
> >> >> > "should be possible in ALL classes of rclass".
> >> >>
> >> >> Yeah, I agree.  A query "can something stored in ALL_REGS change from
> >> >> mode M1 to mode M2?" is meaningful if at least one register R in ALL_REGS
> >> >> can hold both M1 and M2.  It's then the target's job to answer
> >> >> conservatively so that the result covers all such R.
> >> >>
> >> >> In principle it's OK for a target to err on the side of caution and forbid
> >> >> things that are actually OK.  But that's going to risk losing performance
> >> >> in some cases, and sometimes that loss of performance will be unacceptable.
> >> >> IMO that's what's happening here.  The target is applying x87 rules to
> >> >> things that (AIUI) are never stored in x87 registers, and so losing
> >> > Yes, it can be optimized since some mode will never assigned to x87 registers.
> >> >> performance as a result.
> >> >>
> >> >> Note that the RA also uses ALL_REGS for some things, so this usage
> >> >> isn't specific to non-RA code.
> >> > RA passes the minimal reg class(REGNO_REG_CLASS) which contains REGN
> >> > to decide if can_change_mode_class, not ALL_REGS.
> >> > 511/* Given a hard REGN a FROM mode and a TO mode, return true if
> >> > 512   REGN can change from mode FROM to mode TO.  */
> >> > 513#define REG_CAN_CHANGE_MODE_P(REGN, FROM, TO)                          \
> >> > 514  (targetm.can_change_mode_class (FROM, TO, REGNO_REG_CLASS (REGN)))
> >> > 515
> >> >
> >> > So I still think using can_change_mode_class outside of RA with
> >> > ALL_REGS passed to decide whether it's ok to generate subreg is not a
> >> > good idea.
> >>
> >> But if the argument is that the only valid uses of can_change_mode_class
> >> are through this macro, the hook isn't describing a class property,
> >> it's describing the property of individual registers.  If we say that
> >> querying individual registers is the only valid thing to do them
> >> we should change the hook to take a register number rather than
> >> a class enum.
> >>
> >> The reason we have a class-based rather than register-based interface
> >> is because it is useful to query classes before you've picked a
> >> specific register.
> > For individual registers in the minimal reg class, we assume they are
> > not different from each other, I guess that's why we have
> > REGNO_REG_CLASS and class-based interfaces other than register-based
> > interfaces.
>
> I don't think that's necessarily true.  We have things like
> hard_regno_nregs that operate on individual registers.  And even
> the x86 implementation of the hook uses subset operations rather
> than comparing the class for equality, which suggests some attempt
> to handle classes other than those returned by REGNO_REG_CLASS.
From the GCC documentation for subreg: if any reg is not OK for the
subreg mode change, can_change_mode_class should return false for every
class that includes reg.
---------------------------------------------------------------------
The rules above apply to both pseudo regs and hard regs. If the semantics
are not correct for particular combinations of m1, m2 and hard reg,
the target specific code must ensure that those combinations are never
used. For example:
TARGET_CAN_CHANGE_MODE_CLASS (m2, m1, class)
must be false for ****every class class***** that includes reg.
--------------------------------------------------------------------
>
> > But for ALL_REGS, it's not the minimal reg class, it's the largest.
> > Using it It's not that suitable.
>
> But I think it is suitable for the gimple level, where we have no
> information that would narrow the choice to a specific register class.
>
Maybe we should have different semantics for ALL_REGS (and outside of
the RA?), since it's probably a combination of many subclasses, while the
RA always queries the smallest reg class rather than ALL_REGS (unless
ALL_REGS is the minimal reg class containing regno).
> > If the argument is if some r in rclass is ok for mode change, the hook
> > would return true, then why would RA use REGNO_REG_CLASS other than
> > ALL_REGS.
>
> If you know which hard register something is stored in, it makes
> sense to ask about the closest enclosing class rather than something
> more general.  If a particular mode can be stored in both general
> registers and floating-point registers, and if floating-point registers
> have restrictions that general registers don't, ALL_REGS should honour
> the floating-point constraints.  But it wouldn't make sense to use the
> floating-point constraints for something that is known to be in a
> general register.
Yes, the backend should optimize the hook according to hard_regno_mode_ok.
But it's OK to be conservative; it should be a performance issue, not a
correctness issue, no?
>
> > Another spot is in validate_subreg, we're using the minimal reg class
> > instead of ALL_REGS.
> >  973  /* This is a normal subreg.  Verify that the offset is representable.  */
> >  974
> >  975  /* For hard registers, we already have most of these rules collected in
> >  976     subreg_offset_representable_p.  */
> >  977  if (reg && REG_P (reg) && HARD_REGISTER_P (reg))
> >  978    {
> >  979      unsigned int regno = REGNO (reg);
> >  980
> >  981      if ((COMPLEX_MODE_P (imode) || VECTOR_MODE_P (imode))
> >  982          && GET_MODE_INNER (imode) == omode)
> >  983        ;
> >  984      else if (!REG_CAN_CHANGE_MODE_P (regno, imode, omode))
> >  985        return false;
>
> But here too, we're using REG_CAN_CHANGE_MODE_P because we know
> the register number.  It wouldn't make sense to ask about a more
> general class than necessary.
>
> REG_CANNOT_CHANGE_MODE_P (as it was then) was added in 2002 as a
> convenient interface to CANNOT_CHANGE_MODE_CLASS.  CANNOT_CHANGE_MODE_CLASS
> in turn replaced CLASS_CANNOT_CHANGE_MODE_P, which only took two modes,
> and wasn't given any class information.  So this interface was
> originally a query about modes, not a query about classes.  The class
> information was added later since (understandably) modes weren't always
> enough on their own.  But I think it is still fundamentally a query
> about modes, with the class provided for context, rather than a query
> about classes, with modes provided by context.
>
> > I think we do need some hook in the middle end to query things like if
> > some r in rclass is ok for mode change?  but not reusing
> > can_change_mode_class.
>
> But if we add a hook to say "are mode changes from mode M1 to mode M2 OK?",
> which is what Tamar's patch and some other existing code wants to know,
> I fear we'll just reintroduce the old CLASS_CANNOT_CHANGE_MODE_P (but
> hopefully without embedding the negative sense).  I don't think it makes
> sense to have that hook alongside the existing one.  It would require
> targets to duplicate information and would run the risk on conflicting
> information for corner cases.  IMO it would repeat the mistake of having
> both hard_regno_nregs and class_max_nregs; really, the latter ought
> to be calculated from the former.
We already have conflicting information between validate_subreg and
REG_CAN_CHANGE_MODE_P, and it caused the ICE that Tamar wants to fix
in another x86 patch.
I agree we shouldn't introduce more mess before we clean up what
already exists.
>
> Thanks,
> Richard
>
> >> Thanks,
> >> Richard
> >>
> >> >> IMO it's not the job of target-independent code to iterate through
> >> >> individual classes and aggregate the result.  One of the reasons for
> >> >> having union classes is to avoid the need to do that.  And ALL_REGS
> >> >> is the ultimate union class. :-)
> >> >>
> >> >> The patch looks correct to me.
> >> >>
> >> >> Thanks,
> >> >> Richard
> >> >>
> >> >> >> >
> >> >> >> > So there are already existing precedence for this.  And the
> >> >> >> > documentation for the hook says:
> >> >> >> >
> >> >> >> > "This hook returns true if it is possible to bitcast values held in registers of
> >> >> >> class rclass from mode from to mode to and if doing so preserves the low-
> >> >> >> order bits that are common to both modes. The result is only meaningful if
> >> >> >> rclass has registers that can hold both from and to. The default
> >> >> >> implementation returns true"
> >> >> >> >
> >> >> >> > So it looks like it's use outside of RA is perfectly valid.. and the
> >> >> >> > documentation also mentions in the example the use from the mid-end as
> >> >> >> an example.
> >> >> >> >
> >> >> >> > But if the mid-end maintainers are happy I'll use something else.
> >> >> >> >
> >> >> >> > Tamar
> >> >> >> >
> >> >> >> > > I did similar things in
> >> >> >> > > https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579296.html
> >> >> >> > > (and ALL_REGS doesn't cover all cases for registers which are both
> >> >> >> > > available for qimode and mode, ALL_REGS fail doesn't mean it can't
> >> >> >> > > be subreg, it just means parts of ALL_REGS can't be subreg. but with
> >> >> >> > > a subset of ALL_REGS, there could be a reg class which return true
> >> >> >> > > for
> >> >> >> > > targetm.can_change_mode_class)
> >> >> >> > > >           && targetm.vectorize.vec_perm_const (qimode, qimode,
> >> >> >> > > > target_qi,
> >> >> >> > > v0_qi,
> >> >> >> > > >                                                v1_qi, qimode_indices))
> >> >> >> > > >         return gen_lowpart (mode, target_qi); @@ -6311,7 +6312,8
> >> >> >> > > > @@ expand_vec_perm_const (machine_mode mode, rtx v0, rtx v1,
> >> >> >> > > >      }
> >> >> >> > > >
> >> >> >> > > >    if (qimode != VOIDmode
> >> >> >> > > > -      && selector_fits_mode_p (qimode, qimode_indices))
> >> >> >> > > > +      && selector_fits_mode_p (qimode, qimode_indices)
> >> >> >> > > > +      && targetm.can_change_mode_class (mode, qimode, ALL_REGS))
> >> >> >> > > >      {
> >> >> >> > > >        icode = direct_optab_handler (vec_perm_optab, qimode);
> >> >> >> > > >        if (icode != CODE_FOR_nothing) diff --git
> >> >> >> > > > a/gcc/testsuite/gcc.target/aarch64/ext_1.c
> >> >> >> > > > b/gcc/testsuite/gcc.target/aarch64/ext_1.c
> >> >> >> > > > new file mode 100644
> >> >> >> > > > index
> >> >> >> > > >
> >> >> >> > >
> >> >> >> 0000000000000000000000000000000000000000..18a10a14f1161584267a8472e5
> >> >> >> > > 71
> >> >> >> > > > b3bc2ddf887a
> >> >> >> > >
> >> >> >> > >
> >> >> >> > >
> >> >> >> > >
> >> >> >> > > --
> >> >> >> > > BR,
> >> >> >> > > Hongtao
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> BR,
> >> >> >> Hongtao



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 3/8]middle-end: Support extractions of subvectors from arbitrary element position inside a vector
  2022-11-18  2:31                         ` Hongtao Liu
@ 2022-11-18  9:16                           ` Richard Sandiford
  0 siblings, 0 replies; 50+ messages in thread
From: Richard Sandiford @ 2022-11-18  9:16 UTC (permalink / raw)
  To: Hongtao Liu
  Cc: Tamar Christina, Tamar Christina via Gcc-patches, nd, rguenther

Hongtao Liu <crazylht@gmail.com> writes:
> On Thu, Nov 17, 2022 at 9:59 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Hongtao Liu <crazylht@gmail.com> writes:
>> > On Thu, Nov 17, 2022 at 5:39 PM Richard Sandiford
>> > <richard.sandiford@arm.com> wrote:
>> >>
>> >> Hongtao Liu <crazylht@gmail.com> writes:
>> >> > On Wed, Nov 16, 2022 at 1:39 AM Richard Sandiford
>> >> > <richard.sandiford@arm.com> wrote:
>> >> >>
>> >> >> Tamar Christina <Tamar.Christina@arm.com> writes:
>> >> >> >> -----Original Message-----
>> >> >> >> From: Hongtao Liu <crazylht@gmail.com>
>> >> >> >> Sent: Tuesday, November 15, 2022 9:37 AM
>> >> >> >> To: Tamar Christina <Tamar.Christina@arm.com>
>> >> >> >> Cc: Richard Sandiford <Richard.Sandiford@arm.com>; Tamar Christina via
>> >> >> >> Gcc-patches <gcc-patches@gcc.gnu.org>; nd <nd@arm.com>;
>> >> >> >> rguenther@suse.de
>> >> >> >> Subject: Re: [PATCH 3/8]middle-end: Support extractions of subvectors from
>> >> >> >> arbitrary element position inside a vector
>> >> >> >>
>> >> >> >> On Tue, Nov 15, 2022 at 4:51 PM Tamar Christina
>> >> >> >> <Tamar.Christina@arm.com> wrote:
>> >> >> >> >
>> >> >> >> > > -----Original Message-----
>> >> >> >> > > From: Hongtao Liu <crazylht@gmail.com>
>> >> >> >> > > Sent: Tuesday, November 15, 2022 8:36 AM
>> >> >> >> > > To: Tamar Christina <Tamar.Christina@arm.com>
>> >> >> >> > > Cc: Richard Sandiford <Richard.Sandiford@arm.com>; Tamar Christina
>> >> >> >> > > via Gcc-patches <gcc-patches@gcc.gnu.org>; nd <nd@arm.com>;
>> >> >> >> > > rguenther@suse.de
>> >> >> >> > > Subject: Re: [PATCH 3/8]middle-end: Support extractions of
>> >> >> >> > > subvectors from arbitrary element position inside a vector
>> >> >> >> > >
>> >> >> >> > > Hi:
>> >> >> >> > >   I'm from https://gcc.gnu.org/pipermail/gcc-patches/2022-
>> >> >> >> > > November/606040.html.
>> >> >> >> > > >      }
>> >> >> >> > > >
>> >> >> >> > > >    /* See if we can get a better vector mode before extracting.
>> >> >> >> > > > */ diff --git a/gcc/optabs.cc b/gcc/optabs.cc index
>> >> >> >> > > >
>> >> >> >> > >
>> >> >> >> cff37ccb0dfc3dd79b97d0abfd872f340855dc96..f338df410265dfe55b68961600
>> >> >> >> > > 9
>> >> >> >> > > 0
>> >> >> >> > > > a453cc6a28d9 100644
>> >> >> >> > > > --- a/gcc/optabs.cc
>> >> >> >> > > > +++ b/gcc/optabs.cc
>> >> >> >> > > > @@ -6267,6 +6267,7 @@ expand_vec_perm_const (machine_mode
>> >> >> >> mode,
>> >> >> >> > > rtx v0, rtx v1,
>> >> >> >> > > >        v0_qi = gen_lowpart (qimode, v0);
>> >> >> >> > > >        v1_qi = gen_lowpart (qimode, v1);
>> >> >> >> > > >        if (targetm.vectorize.vec_perm_const != NULL
>> >> >> >> > > > +         && targetm.can_change_mode_class (mode, qimode,
>> >> >> >> > > > + ALL_REGS)
>> >> >> >> > > It looks like you want to guard gen_lowpart, shouldn't it be better
>> >> >> >> > > to use validate_subreg  or (tmp = gen_lowpart_if_possible (mode,
>> >> >> >> target_qi)).
>> >> >> >> > > IMHO, targetm.can_change_mode_class is mostly used for RA, but not
>> >> >> >> > > to guard gen_lowpart.
>> >> >> >> >
>> >> >> >> > Hmm I don't think this is quite true, there are existing usages in
>> >> >> >> > expr.cc and rtanal.cc That do this and aren't part of RA.  As I
>> >> >> >> > mentioned before for instance the canoncalization of vec_select to subreg
>> >> >> >> in rtlanal for instances uses this.
>> >> >> >> In theory, we need to iterate through all reg classes that can be assigned for
>> >> >> >> both qimode and mode, if any regclass returns true for
>> >> >> >> targetm.can_change_mode_class, the bitcast(validate_subreg) should be ok.
>> >> >> >> Here we just passed ALL_REGS.
>> >> >> >
>> >> >> > Yes, and most targets where this transformation is valid return true here.
>> >> >> >
>> >> >> > I've checked:
>> >> >> >  * alpha
>> >> >> >  * arm
>> >> >> >  * aarch64
>> >> >> >  * rs6000
>> >> >> >  * s390
>> >> >> >  * sparc
>> >> >> >  * pa
>> >> >> >  * mips
>> >> >> >
>> >> >> > And even the default example that other targets use from the documentation
>> >> >> > would return true as the size of the modes are the same.
>> >> >> >
>> >> >> > X86 and RISCV are the only two targets that I found (but didn't check all) that
>> >> >> > blankly return a result based on just the register classes.
>> >> >> >
>> >> >> > That is to say, there are more targets that adhere to the interpretation that
>> >> >> > rclass here means "should be possible in some class in rclass" rather than
>> >> >> > "should be possible in ALL classes of rclass".
>> >> >>
>> >> >> Yeah, I agree.  A query "can something stored in ALL_REGS change from
>> >> >> mode M1 to mode M2?" is meaningful if at least one register R in ALL_REGS
>> >> >> can hold both M1 and M2.  It's then the target's job to answer
>> >> >> conservatively so that the result covers all such R.
>> >> >>
>> >> >> In principle it's OK for a target to err on the side of caution and forbid
>> >> >> things that are actually OK.  But that's going to risk losing performance
>> >> >> in some cases, and sometimes that loss of performance will be unacceptable.
>> >> >> IMO that's what's happening here.  The target is applying x87 rules to
>> >> >> things that (AIUI) are never stored in x87 registers, and so losing
>> >> > Yes, it can be optimized since some mode will never assigned to x87 registers.
>> >> >> performance as a result.
>> >> >>
>> >> >> Note that the RA also uses ALL_REGS for some things, so this usage
>> >> >> isn't specific to non-RA code.
>> >> > RA passes the minimal reg class(REGNO_REG_CLASS) which contains REGN
>> >> > to decide if can_change_mode_class, not ALL_REGS.
>> >> > 511/* Given a hard REGN a FROM mode and a TO mode, return true if
>> >> > 512   REGN can change from mode FROM to mode TO.  */
>> >> > 513#define REG_CAN_CHANGE_MODE_P(REGN, FROM, TO)                          \
>> >> > 514  (targetm.can_change_mode_class (FROM, TO, REGNO_REG_CLASS (REGN)))
>> >> > 515
>> >> >
>> >> > So I still think using can_change_mode_class outside of RA with
>> >> > ALL_REGS passed to decide whether it's ok to generate subreg is not a
>> >> > good idea.
>> >>
>> >> But if the argument is that the only valid uses of can_change_mode_class
>> >> are through this macro, the hook isn't describing a class property,
>> >> it's describing the property of individual registers.  If we say that
>> >> querying individual registers is the only valid thing to do them
>> >> we should change the hook to take a register number rather than
>> >> a class enum.
>> >>
>> >> The reason we have a class-based rather than register-based interface
>> >> is because it is useful to query classes before you've picked a
>> >> specific register.
>> > For individual registers in the minimal reg class, we assume they are
>> > not different from each other, I guess that's why we have
>> > REGNO_REG_CLASS and class-based interfaces other than register-based
>> > interfaces.
>>
>> I don't think that's necessarily true.  We have things like
>> hard_regno_nregs that operate on individual registers.  And even
>> the x86 implementation of the hook uses subset operations rather
>> than comparing the class for equality, which suggests some attempt
>> to handle classes other than those returned by REGNO_REG_CLASS.
> From the gcc doc for subreg, it says if any reg is not ok for subreg,
> can_change_mode_class should be false for all classes that include
> reg.
> ---------------------------------------------------------------------
> The rules above apply to both pseudo regs and hard regs. If the semantics
> are not correct for particular combinations of m1, m2 and hard reg,
> the target specific code must ensure that those combinations are never
> used. For example:
> TARGET_CAN_CHANGE_MODE_CLASS (m2, m1, class)
> must be false for ****every class class***** that includes reg.
> --------------------------------------------------------------------

Yeah.  My point was that the x86 hook is right to look at other classes,
and is right to use subset tests.  It corresponds to my later general
registers vs. floating-point registers example: ALL_REGS contains
floating-point registers, so ALL_REGS is bound by the floating-point
register constraints.

My point was: you seemed to be saying that the only valid use of the
hook was through REG_CAN_CHANGE_MODE_P, or for the minimal register
classes.  If that were true, the hook wouldn't need to handle superclasses.
But even the x86 hook (rightly) handles those too, even though they
would never be returned by REGNO_REG_CLASS.

The reason the current x86 hook doesn't really follow the documented
behaviour is that it sometimes (but not always) ignores m1 and m2.
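
As a purely illustrative sketch of that class-based reading (this is not
the real i386 or aarch64 hook; FP_REGS and the same-size shortcut are
assumptions made up for the example), a target whose floating-point
registers restrict mode changes could answer conservatively for every
class that overlaps them, which is how a union class such as ALL_REGS
ends up bound by the floating-point constraints:

  /* Hypothetical target hook: reject size-changing mode changes for
     any class that might contain a floating-point register.  */
  static bool
  example_can_change_mode_class (machine_mode from, machine_mode to,
				 reg_class_t rclass)
  {
    /* Same-size changes are assumed harmless for this example target.  */
    if (known_eq (GET_MODE_SIZE (from), GET_MODE_SIZE (to)))
      return true;
    /* Superclasses such as ALL_REGS overlap FP_REGS, so the
       intersection test makes them inherit the FP restriction.  */
    return !reg_classes_intersect_p (FP_REGS, rclass);
  }

Written this way, a query with ALL_REGS is answered by its most
restrictive overlapping subset, which is consistent with the "must be
false for every class that includes reg" rule quoted above.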

>> > But for ALL_REGS, it's not the minimal reg class, it's the largest.
>> > Using it It's not that suitable.
>>
>> But I think it is suitable for the gimple level, where we have no
>> information that would narrow the choice to a specific register class.
>>
> Maybe we should have different semantics for ALL_REGS(and outside of
> RA?) since it's probably a combination of many subclasses, and for RA,
> it always query the smallest REG class, and not using ALL_REG(unless
> it is the minimal reg class containing regno).
>> > If the argument is if some r in rclass is ok for mode change, the hook
>> > would return true, then why would RA use REGNO_REG_CLASS other than
>> > ALL_REGS.
>>
>> If you know which hard register something is stored in, it makes
>> sense to ask about the closest enclosing class rather than something
>> more general.  If a particular mode can be stored in both general
>> registers and floating-point registers, and if floating-point registers
>> have restrictions that general registers don't, ALL_REGS should honour
>> the floating-point constraints.  But it wouldn't make sense to use the
>> floating-point constraints for something that is known to be in a
>> general register.
> Yes, the backend should optimize the hook according to hard_regno_mode_ok.
> But it's ok to be conservative, it should be a performance issue, not
> correctness issue, no?

Yeah, it's OK to be conservative.  As I understood it, the issue
in Tamar's case was that ix86_vectorize_vec_perm_const assumed that
target-independent code would handle some cases by converting the
permutation to operate on a vector of QIs, whereas x86_can_change_mode_class
said that the associated mode change wasn't valid.  Neither of the hooks
is wrong individually; it's the combination that causes problems.

>> > Another spot is in validate_subreg, we're using the minimal reg class
>> > instead of ALL_REGS.
>> >  973  /* This is a normal subreg.  Verify that the offset is representable.  */
>> >  974
>> >  975  /* For hard registers, we already have most of these rules collected in
>> >  976     subreg_offset_representable_p.  */
>> >  977  if (reg && REG_P (reg) && HARD_REGISTER_P (reg))
>> >  978    {
>> >  979      unsigned int regno = REGNO (reg);
>> >  980
>> >  981      if ((COMPLEX_MODE_P (imode) || VECTOR_MODE_P (imode))
>> >  982          && GET_MODE_INNER (imode) == omode)
>> >  983        ;
>> >  984      else if (!REG_CAN_CHANGE_MODE_P (regno, imode, omode))
>> >  985        return false;
>>
>> But here too, we're using REG_CAN_CHANGE_MODE_P because we know
>> the register number.  It wouldn't make sense to ask about a more
>> general class than necessary.
>>
>> REG_CANNOT_CHANGE_MODE_P (as it was then) was added in 2002 as a
>> convenient interface to CANNOT_CHANGE_MODE_CLASS.  CANNOT_CHANGE_MODE_CLASS
>> in turn replaced CLASS_CANNOT_CHANGE_MODE_P, which only took two modes,
>> and wasn't given any class information.  So this interface was
>> originally a query about modes, not a query about classes.  The class
>> information was added later since (understandably) modes weren't always
>> enough on their own.  But I think it is still fundamentally a query
>> about modes, with the class provided for context, rather than a query
>> about classes, with modes provided by context.
>>
>> > I think we do need some hook in the middle end to query things like if
>> > some r in rclass is ok for mode change?  but not reusing
>> > can_change_mode_class.
>>
>> But if we add a hook to say "are mode changes from mode M1 to mode M2 OK?",
>> which is what Tamar's patch and some other existing code wants to know,
>> I fear we'll just reintroduce the old CLASS_CANNOT_CHANGE_MODE_P (but
>> hopefully without embedding the negative sense).  I don't think it makes
>> sense to have that hook alongside the existing one.  It would require
>> targets to duplicate information and would run the risk on conflicting
>> information for corner cases.  IMO it would repeat the mistake of having
>> both hard_regno_nregs and class_max_nregs; really, the latter ought
>> to be calculated from the former.
> We already have conflict info between validate_subreg and
> REG_CAN_CHANGE_MODE_P, and it caused the ICE which Tamar wants to fix
> in another x86 patch.

Ah, haven't got to that thread yet.

Thanks,
Richard

> I agree we shouldn't introduce more mess before we clean up the existing mess.
>>
>> Thanks,
>> Richard
>>
>> >> Thanks,
>> >> Richard
>> >>
>> >> >> IMO it's not the job of target-independent code to iterate through
>> >> >> individual classes and aggregate the result.  One of the reasons for
>> >> >> having union classes is to avoid the need to do that.  And ALL_REGS
>> >> >> is the ultimate union class. :-)
>> >> >>
>> >> >> The patch looks correct to me.
>> >> >>
>> >> >> Thanks,
>> >> >> Richard
>> >> >>
>> >> >> >> >
>> >> >> >> > So there are already existing precedence for this.  And the
>> >> >> >> > documentation for the hook says:
>> >> >> >> >
>> >> >> >> > "This hook returns true if it is possible to bitcast values held in registers of
>> >> >> >> class rclass from mode from to mode to and if doing so preserves the low-
>> >> >> >> order bits that are common to both modes. The result is only meaningful if
>> >> >> >> rclass has registers that can hold both from and to. The default
>> >> >> >> implementation returns true"
>> >> >> >> >
>> >> >> >> > So it looks like it's use outside of RA is perfectly valid.. and the
>> >> >> >> > documentation also mentions in the example the use from the mid-end as
>> >> >> >> an example.
>> >> >> >> >
>> >> >> >> > But if the mid-end maintainers are happy I'll use something else.
>> >> >> >> >
>> >> >> >> > Tamar
>> >> >> >> >
>> >> >> >> > > I did similar things in
>> >> >> >> > > https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579296.html
>> >> >> >> > > (and ALL_REGS doesn't cover all cases for registers which are both
>> >> >> >> > > available for qimode and mode, ALL_REGS fail doesn't mean it can't
>> >> >> >> > > be subreg, it just means parts of ALL_REGS can't be subreg. but with
>> >> >> >> > > a subset of ALL_REGS, there could be a reg class which return true
>> >> >> >> > > for
>> >> >> >> > > targetm.can_change_mode_class)
>> >> >> >> > > >           && targetm.vectorize.vec_perm_const (qimode, qimode,
>> >> >> >> > > > target_qi,
>> >> >> >> > > v0_qi,
>> >> >> >> > > >                                                v1_qi, qimode_indices))
>> >> >> >> > > >         return gen_lowpart (mode, target_qi); @@ -6311,7 +6312,8
>> >> >> >> > > > @@ expand_vec_perm_const (machine_mode mode, rtx v0, rtx v1,
>> >> >> >> > > >      }
>> >> >> >> > > >
>> >> >> >> > > >    if (qimode != VOIDmode
>> >> >> >> > > > -      && selector_fits_mode_p (qimode, qimode_indices))
>> >> >> >> > > > +      && selector_fits_mode_p (qimode, qimode_indices)
>> >> >> >> > > > +      && targetm.can_change_mode_class (mode, qimode, ALL_REGS))
>> >> >> >> > > >      {
>> >> >> >> > > >        icode = direct_optab_handler (vec_perm_optab, qimode);
>> >> >> >> > > >        if (icode != CODE_FOR_nothing) diff --git
>> >> >> >> > > > a/gcc/testsuite/gcc.target/aarch64/ext_1.c
>> >> >> >> > > > b/gcc/testsuite/gcc.target/aarch64/ext_1.c
>> >> >> >> > > > new file mode 100644
>> >> >> >> > > > index
>> >> >> >> > > >
>> >> >> >> > >
>> >> >> >> 0000000000000000000000000000000000000000..18a10a14f1161584267a8472e5
>> >> >> >> > > 71
>> >> >> >> > > > b3bc2ddf887a
>> >> >> >> > >
>> >> >> >> > >
>> >> >> >> > >
>> >> >> >> > >
>> >> >> >> > > --
>> >> >> >> > > BR,
>> >> >> >> > > Hongtao
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> --
>> >> >> >> BR,
>> >> >> >> Hongtao

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs
  2022-11-07 11:56           ` Tamar Christina
@ 2022-11-22 10:36             ` Richard Sandiford
  2022-11-22 10:58               ` Richard Biener
  0 siblings, 1 reply; 50+ messages in thread
From: Richard Sandiford @ 2022-11-22 10:36 UTC (permalink / raw)
  To: Tamar Christina via Gcc-patches
  Cc: Richard Biener, Tamar Christina, Richard Biener, nd

Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> So it's not easily possible the within current infrastructure.  But it does look
>> like ARM might eventually benefit from something like STV on x86?
>> 
>
> I'm not sure.  The problem with trying to do this in RTL is that you'd have to be
> able to decide from two psuedos whether they come from extracts that are
> sequential. When coming in from a hard register that's easy yes.  When coming in
> from a load, or any other operation that produces psuedos that becomes harder.

Yeah.

Just in case anyone reading the above is tempted to implement STV for
AArch64: I think it would set a bad precedent if we had a paste-&-adjust
version of the x86 pass.  AFAIK, the target capabilities and constraints
are mostly modelled correctly using existing mechanisms, so I don't
think there's anything particularly target-specific about the process
of forcing things to be on the general or SIMD/FP side.

So if we did have an STV-ish thing for AArch64, I think it should be
a target-independent pass that uses hooks and recog, even if the pass
is initially enabled for AArch64 only.

(FWIW, on the patch itself, I tend to agree that this is really an
SLP optimisation.  If the vectoriser fails to see the benefit, or if
it fails to handle more complex cases, then it would be good to try
to fix that.)

Thanks,
Richard

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs
  2022-11-22 10:36             ` Richard Sandiford
@ 2022-11-22 10:58               ` Richard Biener
  2022-11-22 11:02                 ` Tamar Christina
  0 siblings, 1 reply; 50+ messages in thread
From: Richard Biener @ 2022-11-22 10:58 UTC (permalink / raw)
  To: Richard Sandiford
  Cc: Tamar Christina via Gcc-patches, Tamar Christina, Richard Biener, nd

On Tue, 22 Nov 2022, Richard Sandiford wrote:

> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> So it's not easily possible the within current infrastructure.  But it does look
> >> like ARM might eventually benefit from something like STV on x86?
> >> 
> >
> > I'm not sure.  The problem with trying to do this in RTL is that you'd have to be
> > able to decide from two psuedos whether they come from extracts that are
> > sequential. When coming in from a hard register that's easy yes.  When coming in
> > from a load, or any other operation that produces psuedos that becomes harder.
> 
> Yeah.
> 
> Just in case anyone reading the above is tempted to implement STV for
> AArch64: I think it would set a bad precedent if we had a paste-&-adjust
> version of the x86 pass.  AFAIK, the target capabilities and constraints
> are mostly modelled correctly using existing mechanisms, so I don't
> think there's anything particularly target-specific about the process
> of forcing things to be on the general or SIMD/FP side.
> 
> So if we did have an STV-ish thing for AArch64, I think it should be
> a target-independent pass that uses hooks and recog, even if the pass
> is initially enabled for AArch64 only.

Agreed - maybe some of the x86 code can be leveraged, but of course
the cost modeling is the most difficult to get right - IIRC the x86
backend resorts to backend specific tuning flags rather than trying
to get rtx_cost or insn_cost "correct" here.

> (FWIW, on the patch itself, I tend to agree that this is really an
> SLP optimisation.  If the vectoriser fails to see the benefit, or if
> it fails to handle more complex cases, then it would be good to try
> to fix that.)

Also agreed - but costing is hard ;)

Richard.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs
  2022-11-22 10:58               ` Richard Biener
@ 2022-11-22 11:02                 ` Tamar Christina
  2022-11-22 11:06                   ` Richard Sandiford
  0 siblings, 1 reply; 50+ messages in thread
From: Tamar Christina @ 2022-11-22 11:02 UTC (permalink / raw)
  To: Richard Biener, Richard Sandiford
  Cc: Tamar Christina via Gcc-patches, Richard Biener, nd

> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Tuesday, November 22, 2022 10:59 AM
> To: Richard Sandiford <Richard.Sandiford@arm.com>
> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; Tamar
> Christina <Tamar.Christina@arm.com>; Richard Biener
> <richard.guenther@gmail.com>; nd <nd@arm.com>
> Subject: Re: [PATCH 1/8]middle-end: Recognize scalar reductions from
> bitfields and array_refs
> 
> On Tue, 22 Nov 2022, Richard Sandiford wrote:
> 
> > Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > >> So it's not easily possible the within current infrastructure.  But
> > >> it does look like ARM might eventually benefit from something like STV
> on x86?
> > >>
> > >
> > > I'm not sure.  The problem with trying to do this in RTL is that
> > > you'd have to be able to decide from two psuedos whether they come
> > > from extracts that are sequential. When coming in from a hard
> > > register that's easy yes.  When coming in from a load, or any other
> operation that produces psuedos that becomes harder.
> >
> > Yeah.
> >
> > Just in case anyone reading the above is tempted to implement STV for
> > AArch64: I think it would set a bad precedent if we had a
> > paste-&-adjust version of the x86 pass.  AFAIK, the target
> > capabilities and constraints are mostly modelled correctly using
> > existing mechanisms, so I don't think there's anything particularly
> > target-specific about the process of forcing things to be on the general or
> SIMD/FP side.
> >
> > So if we did have an STV-ish thing for AArch64, I think it should be a
> > target-independent pass that uses hooks and recog, even if the pass is
> > initially enabled for AArch64 only.
> 
> Agreed - maybe some of the x86 code can be leveraged, but of course the
> cost modeling is the most difficult to get right - IIRC the x86 backend resorts
> to backend specific tuning flags rather than trying to get rtx_cost or insn_cost
> "correct" here.
> 
> > (FWIW, on the patch itself, I tend to agree that this is really an SLP
> > optimisation.  If the vectoriser fails to see the benefit, or if it
> > fails to handle more complex cases, then it would be good to try to
> > fix that.)
> 
> Also agreed - but costing is hard ;)

I guess I still disagree here, but I've clearly been out-Richard'ed.  The problem is still
that this is just basic codegen.  I still don't think it requires -O2 to be usable.

So I guess the only correct implementation is to use an STV-like patch.  But given
that this is already the second attempt (the first, RTL one was rejected by Richard
and the second, GIMPLE one by Richi), I'd like to get agreement on this STV
approach before I waste months more on it.

Thanks,
Tamar

> 
> Richard.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs
  2022-11-22 11:02                 ` Tamar Christina
@ 2022-11-22 11:06                   ` Richard Sandiford
  2022-11-22 11:08                     ` Richard Biener
  0 siblings, 1 reply; 50+ messages in thread
From: Richard Sandiford @ 2022-11-22 11:06 UTC (permalink / raw)
  To: Tamar Christina
  Cc: Richard Biener, Tamar Christina via Gcc-patches, Richard Biener, nd

Tamar Christina <Tamar.Christina@arm.com> writes:
>> -----Original Message-----
>> From: Richard Biener <rguenther@suse.de>
>> Sent: Tuesday, November 22, 2022 10:59 AM
>> To: Richard Sandiford <Richard.Sandiford@arm.com>
>> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; Tamar
>> Christina <Tamar.Christina@arm.com>; Richard Biener
>> <richard.guenther@gmail.com>; nd <nd@arm.com>
>> Subject: Re: [PATCH 1/8]middle-end: Recognize scalar reductions from
>> bitfields and array_refs
>>
>> On Tue, 22 Nov 2022, Richard Sandiford wrote:
>>
>> > Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> > >> So it's not easily possible the within current infrastructure.  But
>> > >> it does look like ARM might eventually benefit from something like STV
>> on x86?
>> > >>
>> > >
>> > > I'm not sure.  The problem with trying to do this in RTL is that
>> > > you'd have to be able to decide from two psuedos whether they come
>> > > from extracts that are sequential. When coming in from a hard
>> > > register that's easy yes.  When coming in from a load, or any other
>> operation that produces psuedos that becomes harder.
>> >
>> > Yeah.
>> >
>> > Just in case anyone reading the above is tempted to implement STV for
>> > AArch64: I think it would set a bad precedent if we had a
>> > paste-&-adjust version of the x86 pass.  AFAIK, the target
>> > capabilities and constraints are mostly modelled correctly using
>> > existing mechanisms, so I don't think there's anything particularly
>> > target-specific about the process of forcing things to be on the general or
>> SIMD/FP side.
>> >
>> > So if we did have an STV-ish thing for AArch64, I think it should be a
>> > target-independent pass that uses hooks and recog, even if the pass is
>> > initially enabled for AArch64 only.
>>
>> Agreed - maybe some of the x86 code can be leveraged, but of course the
>> cost modeling is the most difficult to get right - IIRC the x86 backend resorts
>> to backend specific tuning flags rather than trying to get rtx_cost or insn_cost
>> "correct" here.
>>
>> > (FWIW, on the patch itself, I tend to agree that this is really an SLP
>> > optimisation.  If the vectoriser fails to see the benefit, or if it
>> > fails to handle more complex cases, then it would be good to try to
>> > fix that.)
>>
>> Also agreed - but costing is hard ;)
>
> I guess I still disagree here, but I've clearly been out-Richard'ed.  The problem is still
> that this is just basic codegen.  I still don't think it requires -O2 to be usable.
>
> So I guess the only correct implementation is to use an STV-like patch.  But given
> that this is already the second attempt (the first, RTL one was rejected by Richard
> and the second, GIMPLE one by Richi), I'd like to get agreement on this STV
> approach before I waste months more on it.

I don't think this in itself is a good motivation for STV.  My comment
above was more about the idea of STV for AArch64 in general (since it
had been raised).

Personally I still think the reduction should be generated in gimple.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs
  2022-11-22 11:06                   ` Richard Sandiford
@ 2022-11-22 11:08                     ` Richard Biener
  2022-11-22 14:33                       ` Jeff Law
  0 siblings, 1 reply; 50+ messages in thread
From: Richard Biener @ 2022-11-22 11:08 UTC (permalink / raw)
  To: Richard Sandiford
  Cc: Tamar Christina, Tamar Christina via Gcc-patches, Richard Biener, nd

On Tue, 22 Nov 2022, Richard Sandiford wrote:

> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> -----Original Message-----
> >> From: Richard Biener <rguenther@suse.de>
> >> Sent: Tuesday, November 22, 2022 10:59 AM
> >> To: Richard Sandiford <Richard.Sandiford@arm.com>
> >> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; Tamar
> >> Christina <Tamar.Christina@arm.com>; Richard Biener
> >> <richard.guenther@gmail.com>; nd <nd@arm.com>
> >> Subject: Re: [PATCH 1/8]middle-end: Recognize scalar reductions from
> >> bitfields and array_refs
> >>
> >> On Tue, 22 Nov 2022, Richard Sandiford wrote:
> >>
> >> > Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> > >> So it's not easily possible the within current infrastructure.  But
> >> > >> it does look like ARM might eventually benefit from something like STV
> >> on x86?
> >> > >>
> >> > >
> >> > > I'm not sure.  The problem with trying to do this in RTL is that
> >> > > you'd have to be able to decide from two psuedos whether they come
> >> > > from extracts that are sequential. When coming in from a hard
> >> > > register that's easy yes.  When coming in from a load, or any other
> >> operation that produces psuedos that becomes harder.
> >> >
> >> > Yeah.
> >> >
> >> > Just in case anyone reading the above is tempted to implement STV for
> >> > AArch64: I think it would set a bad precedent if we had a
> >> > paste-&-adjust version of the x86 pass.  AFAIK, the target
> >> > capabilities and constraints are mostly modelled correctly using
> >> > existing mechanisms, so I don't think there's anything particularly
> >> > target-specific about the process of forcing things to be on the general or
> >> SIMD/FP side.
> >> >
> >> > So if we did have an STV-ish thing for AArch64, I think it should be a
> >> > target-independent pass that uses hooks and recog, even if the pass is
> >> > initially enabled for AArch64 only.
> >>
> >> Agreed - maybe some of the x86 code can be leveraged, but of course the
> >> cost modeling is the most difficult to get right - IIRC the x86 backend resorts
> >> to backend specific tuning flags rather than trying to get rtx_cost or insn_cost
> >> "correct" here.
> >>
> >> > (FWIW, on the patch itself, I tend to agree that this is really an SLP
> >> > optimisation.  If the vectoriser fails to see the benefit, or if it
> >> > fails to handle more complex cases, then it would be good to try to
> >> > fix that.)
> >>
> >> Also agreed - but costing is hard ;)
> >
> > I guess I still disagree here, but I've clearly been out-Richard'ed.  The problem is still
> > that this is just basic codegen.  I still don't think it requires -O2 to be usable.
> >
> > So I guess the only correct implementation is to use an STV-like patch.  But given
> > that this is already the second attempt (the first, RTL one was rejected by Richard
> > and the second, GIMPLE one by Richi), I'd like to get agreement on this STV
> > approach before I waste months more on it.
> 
> I don't think this in itself is a good motivation for STV.  My comment
> above was more about the idea of STV for AArch64 in general (since it
> had been raised).
> 
> Personally I still think the reduction should be generated in gimple.

I agree, and the proper place to generate the reduction is in SLP.

Richard.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs
  2022-11-22 11:08                     ` Richard Biener
@ 2022-11-22 14:33                       ` Jeff Law
  0 siblings, 0 replies; 50+ messages in thread
From: Jeff Law @ 2022-11-22 14:33 UTC (permalink / raw)
  To: Richard Biener, Richard Sandiford
  Cc: Tamar Christina, Tamar Christina via Gcc-patches, Richard Biener, nd


On 11/22/22 04:08, Richard Biener via Gcc-patches wrote:
> On Tue, 22 Nov 2022, Richard Sandiford wrote:
>
>> Tamar Christina <Tamar.Christina@arm.com> writes:
>>>> -----Original Message-----
>>>> From: Richard Biener <rguenther@suse.de>
>>>> Sent: Tuesday, November 22, 2022 10:59 AM
>>>> To: Richard Sandiford <Richard.Sandiford@arm.com>
>>>> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; Tamar
>>>> Christina <Tamar.Christina@arm.com>; Richard Biener
>>>> <richard.guenther@gmail.com>; nd <nd@arm.com>
>>>> Subject: Re: [PATCH 1/8]middle-end: Recognize scalar reductions from
>>>> bitfields and array_refs
>>>>
>>>> On Tue, 22 Nov 2022, Richard Sandiford wrote:
>>>>
>>>>> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>>>>>>> So it's not easily possible the within current infrastructure.  But
>>>>>>> it does look like ARM might eventually benefit from something like STV
>>>> on x86?
>>>>>> I'm not sure.  The problem with trying to do this in RTL is that
>>>>>> you'd have to be able to decide from two psuedos whether they come
>>>>>> from extracts that are sequential. When coming in from a hard
>>>>>> register that's easy yes.  When coming in from a load, or any other
>>>> operation that produces psuedos that becomes harder.
>>>>> Yeah.
>>>>>
>>>>> Just in case anyone reading the above is tempted to implement STV for
>>>>> AArch64: I think it would set a bad precedent if we had a
>>>>> paste-&-adjust version of the x86 pass.  AFAIK, the target
>>>>> capabilities and constraints are mostly modelled correctly using
>>>>> existing mechanisms, so I don't think there's anything particularly
>>>>> target-specific about the process of forcing things to be on the general or
>>>> SIMD/FP side.
>>>>> So if we did have an STV-ish thing for AArch64, I think it should be a
>>>>> target-independent pass that uses hooks and recog, even if the pass is
>>>>> initially enabled for AArch64 only.
>>>> Agreed - maybe some of the x86 code can be leveraged, but of course the
>>>> cost modeling is the most difficult to get right - IIRC the x86 backend resorts
>>>> to backend specific tuning flags rather than trying to get rtx_cost or insn_cost
>>>> "correct" here.
>>>>
>>>>> (FWIW, on the patch itself, I tend to agree that this is really an SLP
>>>>> optimisation.  If the vectoriser fails to see the benefit, or if it
>>>>> fails to handle more complex cases, then it would be good to try to
>>>>> fix that.)
>>>> Also agreed - but costing is hard ;)
>>> I guess I still disagree here, but I've clearly been out-Richard'ed.  The problem is still
>>> that this is just basic codegen.  I still don't think it requires -O2 to be usable.
>>>
>>> So I guess the only correct implementation is to use an STV-like patch.  But given
>>> that this is already the second attempt (the first, RTL one was rejected by Richard
>>> and the second, GIMPLE one by Richi), I'd like to get agreement on this STV
>>> approach before I waste months more on it.
>> I don't think this in itself is a good motivation for STV.  My comment
>> above was more about the idea of STV for AArch64 in general (since it
>> had been raised).
>>
>> Personally I still think the reduction should be generated in gimple.
> I agree, and the proper place to generate the reduction is in SLP.

Sorry to have sent things astray with my earlier ACK.  It looked 
reasonable to me.

jeff


^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH 5/8]AArch64 aarch64: Make existing V2HF be usable.
  2022-11-11 14:39     ` Tamar Christina
@ 2022-11-22 16:01       ` Tamar Christina
  2022-11-30  4:26         ` Tamar Christina
  2022-12-06 10:28       ` Richard Sandiford
  1 sibling, 1 reply; 50+ messages in thread
From: Tamar Christina @ 2022-11-22 16:01 UTC (permalink / raw)
  To: Tamar Christina, Richard Sandiford
  Cc: gcc-patches, nd, Richard Earnshaw, Marcus Shawcroft, Kyrylo Tkachov

Ping

> -----Original Message-----
> From: Gcc-patches <gcc-patches-
> bounces+tamar.christina=arm.com@gcc.gnu.org> On Behalf Of Tamar
> Christina via Gcc-patches
> Sent: Friday, November 11, 2022 2:40 PM
> To: Richard Sandiford <Richard.Sandiford@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
> <Richard.Earnshaw@arm.com>; Marcus Shawcroft
> <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
> Subject: RE: [PATCH 5/8]AArch64 aarch64: Make existing V2HF be usable.
> 
> Hi,
> 
> 
> > This name might cause confusion with the SVE iterators, where FULL
> > means "every bit of the register is used".  How about something like
> > VMOVE instead?
> >
> > With this change, I guess VALL_F16 represents "The set of all modes
> > for which the vld1 intrinsics are provided" and VMOVE or whatever is
> > "All Advanced SIMD modes suitable for moving, loading, and storing".
> > That is, VMOVE extends VALL_F16 with modes that are not manifested via
> > intrinsics.
> >
> 
> Done.
> 
> > Where is the 2h used, and is it valid syntax in that context?
> >
> > Same for later instances of 2h.
> 
> They are, but they weren't meant to be in this patch.  They belong in a
> separate FP16 series that I won't get to finish for GCC 13 due not being able
> to finish writing all the tests.  I have moved them to that patch series though.
> 
> While the addp patch series has been killed, this patch is still good standalone
> and improves codegen as shown in the updated testcase.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> 
> Ok for master?
> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	* config/aarch64/aarch64-simd.md (*aarch64_simd_movv2hf): New.
> 	(mov<mode>, movmisalign<mode>, aarch64_dup_lane<mode>,
> 	aarch64_store_lane0<mode>, aarch64_simd_vec_set<mode>,
> 	@aarch64_simd_vec_copy_lane<mode>, vec_set<mode>,
> 	reduc_<optab>_scal_<mode>, reduc_<fmaxmin>_scal_<mode>,
> 	aarch64_reduc_<optab>_internal<mode>,
> aarch64_get_lane<mode>,
> 	vec_init<mode><Vel>, vec_extract<mode><Vel>): Support V2HF.
> 	(aarch64_simd_dupv2hf): New.
> 	* config/aarch64/aarch64.cc (aarch64_classify_vector_mode):
> 	Add E_V2HFmode.
> 	* config/aarch64/iterators.md (VHSDF_P): New.
> 	(V2F, VMOVE, nunits, Vtype, Vmtype, Vetype, stype, VEL,
> 	Vel, q, vp): Add V2HF.
> 	* config/arm/types.md (neon_fp_reduc_add_h): New.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.target/aarch64/sve/slp_1.c: Update testcase.
> 
> --- inline copy of patch ---
> 
> diff --git a/gcc/config/aarch64/aarch64-simd.md
> b/gcc/config/aarch64/aarch64-simd.md
> index
> f4152160084d6b6f34bd69f0ba6386c1ab50f77e..487a31010245accec28e779661
> e6c2d578fca4b7 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -19,10 +19,10 @@
>  ;; <http://www.gnu.org/licenses/>.
> 
>  (define_expand "mov<mode>"
> -  [(set (match_operand:VALL_F16 0 "nonimmediate_operand")
> -	(match_operand:VALL_F16 1 "general_operand"))]
> +  [(set (match_operand:VMOVE 0 "nonimmediate_operand")
> +	(match_operand:VMOVE 1 "general_operand"))]
>    "TARGET_SIMD"
> -  "
> +{
>    /* Force the operand into a register if it is not an
>       immediate whose use can be replaced with xzr.
>       If the mode is 16 bytes wide, then we will be doing @@ -46,12 +46,11 @@
> (define_expand "mov<mode>"
>        aarch64_expand_vector_init (operands[0], operands[1]);
>        DONE;
>      }
> -  "
> -)
> +})
> 
>  (define_expand "movmisalign<mode>"
> -  [(set (match_operand:VALL_F16 0 "nonimmediate_operand")
> -        (match_operand:VALL_F16 1 "general_operand"))]
> +  [(set (match_operand:VMOVE 0 "nonimmediate_operand")
> +        (match_operand:VMOVE 1 "general_operand"))]
>    "TARGET_SIMD && !STRICT_ALIGNMENT"
>  {
>    /* This pattern is not permitted to fail during expansion: if both arguments
> @@ -73,6 +72,16 @@ (define_insn "aarch64_simd_dup<mode>"
>    [(set_attr "type" "neon_dup<q>, neon_from_gp<q>")]
>  )
> 
> +(define_insn "aarch64_simd_dupv2hf"
> +  [(set (match_operand:V2HF 0 "register_operand" "=w")
> +	(vec_duplicate:V2HF
> +	  (match_operand:HF 1 "register_operand" "0")))]
> +  "TARGET_SIMD"
> +  "@
> +   sli\\t%d0, %d1, 16"
> +  [(set_attr "type" "neon_shift_imm")]
> +)
> +
>  (define_insn "aarch64_simd_dup<mode>"
>    [(set (match_operand:VDQF_F16 0 "register_operand" "=w,w")
>  	(vec_duplicate:VDQF_F16
> @@ -85,10 +94,10 @@ (define_insn "aarch64_simd_dup<mode>"
>  )
> 
>  (define_insn "aarch64_dup_lane<mode>"
> -  [(set (match_operand:VALL_F16 0 "register_operand" "=w")
> -	(vec_duplicate:VALL_F16
> +  [(set (match_operand:VMOVE 0 "register_operand" "=w")
> +	(vec_duplicate:VMOVE
>  	  (vec_select:<VEL>
> -	    (match_operand:VALL_F16 1 "register_operand" "w")
> +	    (match_operand:VMOVE 1 "register_operand" "w")
>  	    (parallel [(match_operand:SI 2 "immediate_operand" "i")])
>            )))]
>    "TARGET_SIMD"
> @@ -142,6 +151,29 @@ (define_insn
> "*aarch64_simd_mov<VDMOV:mode>"
>  		     mov_reg, neon_move<q>")]
>  )
> 
> +(define_insn "*aarch64_simd_movv2hf"
> +  [(set (match_operand:V2HF 0 "nonimmediate_operand"
> +		"=w, m,  m,  w, ?r, ?w, ?r, w, w")
> +	(match_operand:V2HF 1 "general_operand"
> +		"m,  Dz, w,  w,  w,  r,  r, Dz, Dn"))]
> +  "TARGET_SIMD_F16INST
> +   && (register_operand (operands[0], V2HFmode)
> +       || aarch64_simd_reg_or_zero (operands[1], V2HFmode))"
> +   "@
> +    ldr\\t%s0, %1
> +    str\\twzr, %0
> +    str\\t%s1, %0
> +    mov\\t%0.2s[0], %1.2s[0]
> +    umov\\t%w0, %1.s[0]
> +    fmov\\t%s0, %1
> +    mov\\t%0, %1
> +    movi\\t%d0, 0
> +    * return aarch64_output_simd_mov_immediate (operands[1], 32);"
> +  [(set_attr "type" "neon_load1_1reg, store_8, neon_store1_1reg,\
> +		     neon_logic, neon_to_gp, f_mcr,\
> +		     mov_reg, neon_move, neon_move")]
> +)
> +
>  (define_insn "*aarch64_simd_mov<VQMOV:mode>"
>    [(set (match_operand:VQMOV 0 "nonimmediate_operand"
>  		"=w, Umn,  m,  w, ?r, ?w, ?r, w")
> @@ -182,7 +214,7 @@ (define_insn "*aarch64_simd_mov<VQMOV:mode>"
> 
>  (define_insn "aarch64_store_lane0<mode>"
>    [(set (match_operand:<VEL> 0 "memory_operand" "=m")
> -	(vec_select:<VEL> (match_operand:VALL_F16 1 "register_operand"
> "w")
> +	(vec_select:<VEL> (match_operand:VMOVE 1 "register_operand"
> "w")
>  			(parallel [(match_operand 2 "const_int_operand"
> "n")])))]
>    "TARGET_SIMD
>     && ENDIAN_LANE_N (<nunits>, INTVAL (operands[2])) == 0"
> @@ -1035,11 +1067,11 @@ (define_insn "one_cmpl<mode>2"
>  )
> 
>  (define_insn "aarch64_simd_vec_set<mode>"
> -  [(set (match_operand:VALL_F16 0 "register_operand" "=w,w,w")
> -	(vec_merge:VALL_F16
> -	    (vec_duplicate:VALL_F16
> +  [(set (match_operand:VMOVE 0 "register_operand" "=w,w,w")
> +	(vec_merge:VMOVE
> +	    (vec_duplicate:VMOVE
>  		(match_operand:<VEL> 1
> "aarch64_simd_nonimmediate_operand" "w,?r,Utv"))
> -	    (match_operand:VALL_F16 3 "register_operand" "0,0,0")
> +	    (match_operand:VMOVE 3 "register_operand" "0,0,0")
>  	    (match_operand:SI 2 "immediate_operand" "i,i,i")))]
>    "TARGET_SIMD"
>    {
> @@ -1061,14 +1093,14 @@ (define_insn "aarch64_simd_vec_set<mode>"
>  )
> 
>  (define_insn "@aarch64_simd_vec_copy_lane<mode>"
> -  [(set (match_operand:VALL_F16 0 "register_operand" "=w")
> -	(vec_merge:VALL_F16
> -	    (vec_duplicate:VALL_F16
> +  [(set (match_operand:VMOVE 0 "register_operand" "=w")
> +	(vec_merge:VMOVE
> +	    (vec_duplicate:VMOVE
>  	      (vec_select:<VEL>
> -		(match_operand:VALL_F16 3 "register_operand" "w")
> +		(match_operand:VMOVE 3 "register_operand" "w")
>  		(parallel
>  		  [(match_operand:SI 4 "immediate_operand" "i")])))
> -	    (match_operand:VALL_F16 1 "register_operand" "0")
> +	    (match_operand:VMOVE 1 "register_operand" "0")
>  	    (match_operand:SI 2 "immediate_operand" "i")))]
>    "TARGET_SIMD"
>    {
> @@ -1376,7 +1408,7 @@ (define_insn "vec_shr_<mode>"
>  )
> 
>  (define_expand "vec_set<mode>"
> -  [(match_operand:VALL_F16 0 "register_operand")
> +  [(match_operand:VMOVE 0 "register_operand")
>     (match_operand:<VEL> 1 "aarch64_simd_nonimmediate_operand")
>     (match_operand:SI 2 "immediate_operand")]
>    "TARGET_SIMD"
> @@ -3495,7 +3527,7 @@ (define_insn "popcount<mode>2"
>  ;; gimple_fold'd to the IFN_REDUC_(MAX|MIN) function.  (This is FP
> smax/smin).
>  (define_expand "reduc_<optab>_scal_<mode>"
>    [(match_operand:<VEL> 0 "register_operand")
> -   (unspec:<VEL> [(match_operand:VHSDF 1 "register_operand")]
> +   (unspec:<VEL> [(match_operand:VHSDF_P 1 "register_operand")]
>  		 FMAXMINV)]
>    "TARGET_SIMD"
>    {
> @@ -3510,7 +3542,7 @@ (define_expand "reduc_<optab>_scal_<mode>"
> 
>  (define_expand "reduc_<fmaxmin>_scal_<mode>"
>    [(match_operand:<VEL> 0 "register_operand")
> -   (unspec:<VEL> [(match_operand:VHSDF 1 "register_operand")]
> +   (unspec:<VEL> [(match_operand:VHSDF_P 1 "register_operand")]
>  		 FMAXMINNMV)]
>    "TARGET_SIMD"
>    {
> @@ -3554,8 +3586,8 @@ (define_insn
> "aarch64_reduc_<optab>_internalv2si"
>  )
> 
>  (define_insn "aarch64_reduc_<optab>_internal<mode>"
> - [(set (match_operand:VHSDF 0 "register_operand" "=w")
> -       (unspec:VHSDF [(match_operand:VHSDF 1 "register_operand" "w")]
> + [(set (match_operand:VHSDF_P 0 "register_operand" "=w")
> +       (unspec:VHSDF_P [(match_operand:VHSDF_P 1 "register_operand"
> + "w")]
>  		      FMAXMINV))]
>   "TARGET_SIMD"
>   "<maxmin_uns_op><vp>\\t%<Vetype>0, %1.<Vtype>"
> @@ -4200,7 +4232,7 @@ (define_insn
> "*aarch64_get_lane_zero_extend<GPI:mode><VDQQH:mode>"
>  (define_insn_and_split "aarch64_get_lane<mode>"
>    [(set (match_operand:<VEL> 0 "aarch64_simd_nonimmediate_operand"
> "=?r, w, Utv")
>  	(vec_select:<VEL>
> -	  (match_operand:VALL_F16 1 "register_operand" "w, w, w")
> +	  (match_operand:VMOVE 1 "register_operand" "w, w, w")
>  	  (parallel [(match_operand:SI 2 "immediate_operand" "i, i, i")])))]
>    "TARGET_SIMD"
>    {
> @@ -7981,7 +8013,7 @@ (define_expand "aarch64_st1<VALL_F16:mode>"
>  ;; Standard pattern name vec_init<mode><Vel>.
> 
>  (define_expand "vec_init<mode><Vel>"
> -  [(match_operand:VALL_F16 0 "register_operand")
> +  [(match_operand:VMOVE 0 "register_operand")
>     (match_operand 1 "" "")]
>    "TARGET_SIMD"
>  {
> @@ -8060,7 +8092,7 @@ (define_insn "aarch64_urecpe<mode>"
> 
>  (define_expand "vec_extract<mode><Vel>"
>    [(match_operand:<VEL> 0 "aarch64_simd_nonimmediate_operand")
> -   (match_operand:VALL_F16 1 "register_operand")
> +   (match_operand:VMOVE 1 "register_operand")
>     (match_operand:SI 2 "immediate_operand")]
>    "TARGET_SIMD"
>  {
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index
> 84dbe2f4ea7d03b424602ed98a34e7824217dc91..35671cb86e374f9ded21d0e4
> 944c63bc2cbc0901 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -3566,6 +3566,7 @@ aarch64_classify_vector_mode (machine_mode
> mode)
>      case E_V8BFmode:
>      case E_V4SFmode:
>      case E_V2DFmode:
> +    case E_V2HFmode:
>        return TARGET_SIMD ? VEC_ADVSIMD : 0;
> 
>      default:
> diff --git a/gcc/config/aarch64/iterators.md
> b/gcc/config/aarch64/iterators.md index
> 37d8161a33b1c399d80be82afa67613a087389d4..dfcf86a440e316c2abdbcc6463
> 63d39e458d1a91 100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -160,6 +160,10 @@ (define_mode_iterator VDQF [V2SF V4SF V2DF])
> (define_mode_iterator VHSDF [(V4HF "TARGET_SIMD_F16INST")
>  			     (V8HF "TARGET_SIMD_F16INST")
>  			     V2SF V4SF V2DF])
> +;; Advanced SIMD Float modes suitable for pairwise operations.
> +(define_mode_iterator VHSDF_P [(V4HF "TARGET_SIMD_F16INST")
> +			       (V8HF "TARGET_SIMD_F16INST")
> +			       V2SF V4SF V2DF (V2HF
> "TARGET_SIMD_F16INST")])
> 
>  ;; Advanced SIMD Float modes, and DF.
>  (define_mode_iterator VDQF_DF [V2SF V4SF V2DF DF]) @@ -188,15 +192,23
> @@ (define_mode_iterator VDQF_COND [V2SF V2SI V4SF V4SI V2DF V2DI])
> (define_mode_iterator VALLF [V2SF V4SF V2DF SF DF])
> 
>  ;; Advanced SIMD Float modes with 2 elements.
> -(define_mode_iterator V2F [V2SF V2DF])
> +(define_mode_iterator V2F [V2SF V2DF V2HF])
> 
>  ;; All Advanced SIMD modes on which we support any arithmetic operations.
>  (define_mode_iterator VALL [V8QI V16QI V4HI V8HI V2SI V4SI V2DI V2SF
> V4SF V2DF])
> 
> -;; All Advanced SIMD modes suitable for moving, loading, and storing.
> +;; All Advanced SIMD modes suitable for moving, loading, and storing ;;
> +except V2HF.
>  (define_mode_iterator VALL_F16 [V8QI V16QI V4HI V8HI V2SI V4SI V2DI
>  				V4HF V8HF V4BF V8BF V2SF V4SF V2DF])
> 
> +;; All Advanced SIMD modes suitable for moving, loading, and storing ;;
> +including V2HF (define_mode_iterator VMOVE [V8QI V16QI V4HI V8HI V2SI
> +V4SI V2DI
> +			     V4HF V8HF V4BF V8BF V2SF V4SF V2DF
> +			     (V2HF "TARGET_SIMD_F16INST")])
> +
> +
>  ;; The VALL_F16 modes except the 128-bit 2-element ones.
>  (define_mode_iterator VALL_F16_NO_V2Q [V8QI V16QI V4HI V8HI V2SI
> V4SI
>  				V4HF V8HF V2SF V4SF])
> @@ -1076,7 +1088,7 @@ (define_mode_attr nunits [(V8QI "8") (V16QI "16")
>  			  (V2SF "2") (V4SF "4")
>  			  (V1DF "1") (V2DF "2")
>  			  (DI "1") (DF "1")
> -			  (V8DI "8")])
> +			  (V8DI "8") (V2HF "2")])
> 
>  ;; Map a mode to the number of bits in it, if the size of the mode  ;; is
> constant.
> @@ -1090,6 +1102,7 @@ (define_mode_attr s [(HF "h") (SF "s") (DF "d") (SI
> "s") (DI "d")])
> 
>  ;; Give the length suffix letter for a sign- or zero-extension.
>  (define_mode_attr size [(QI "b") (HI "h") (SI "w")])
> +(define_mode_attr sizel [(QI "b") (HI "h") (SI "")])
> 
>  ;; Give the number of bits in the mode
>  (define_mode_attr sizen [(QI "8") (HI "16") (SI "32") (DI "64")]) @@ -1193,7
> +1206,7 @@ (define_mode_attr Vmntype [(V8HI ".8b") (V4SI ".4h")
> (define_mode_attr Vetype [(V8QI "b") (V16QI "b")
>  			  (V4HI "h") (V8HI  "h")
>  			  (V2SI "s") (V4SI  "s")
> -			  (V2DI "d")
> +			  (V2DI "d") (V2HF  "h")
>  			  (V4HF "h") (V8HF  "h")
>  			  (V2SF "s") (V4SF  "s")
>  			  (V2DF "d")
> @@ -1285,7 +1298,7 @@ (define_mode_attr Vcwtype [(VNx16QI "b")
> (VNx8QI "h") (VNx4QI "w") (VNx2QI "d")  ;; more accurately.
>  (define_mode_attr stype [(V8QI "b") (V16QI "b") (V4HI "s") (V8HI "s")
>  			 (V2SI "s") (V4SI "s") (V2DI "d") (V4HF "s")
> -			 (V8HF "s") (V2SF "s") (V4SF "s") (V2DF "d")
> +			 (V8HF "s") (V2SF "s") (V4SF "s") (V2DF "d") (V2HF
> "s")
>  			 (HF "s") (SF "s") (DF "d") (QI "b") (HI "s")
>  			 (SI "s") (DI "d")])
> 
> @@ -1360,8 +1373,8 @@ (define_mode_attr VEL [(V8QI  "QI") (V16QI "QI")
>  		       (V4HF "HF") (V8HF  "HF")
>  		       (V2SF "SF") (V4SF  "SF")
>  		       (DF   "DF") (V2DF  "DF")
> -		       (SI   "SI") (HI    "HI")
> -		       (QI   "QI")
> +		       (SI   "SI") (V2HF  "HF")
> +		       (QI   "QI") (HI    "HI")
>  		       (V4BF "BF") (V8BF "BF")
>  		       (VNx16QI "QI") (VNx8QI "QI") (VNx4QI "QI") (VNx2QI
> "QI")
>  		       (VNx8HI "HI") (VNx4HI "HI") (VNx2HI "HI") @@ -1381,7
> +1394,7 @@ (define_mode_attr Vel [(V8QI "qi") (V16QI "qi")
>  		       (V2SF "sf") (V4SF "sf")
>  		       (V2DF "df") (DF   "df")
>  		       (SI   "si") (HI   "hi")
> -		       (QI   "qi")
> +		       (QI   "qi") (V2HF "hf")
>  		       (V4BF "bf") (V8BF "bf")
>  		       (VNx16QI "qi") (VNx8QI "qi") (VNx4QI "qi") (VNx2QI "qi")
>  		       (VNx8HI "hi") (VNx4HI "hi") (VNx2HI "hi") @@ -1866,7
> +1879,7 @@ (define_mode_attr q [(V8QI "") (V16QI "_q")
>  		     (V4HF "") (V8HF "_q")
>  		     (V4BF "") (V8BF "_q")
>  		     (V2SF "") (V4SF  "_q")
> -			       (V2DF  "_q")
> +		     (V2HF "") (V2DF  "_q")
>  		     (QI "") (HI "") (SI "") (DI "") (HF "") (SF "") (DF "")
>  		     (V2x8QI "") (V2x16QI "_q")
>  		     (V2x4HI "") (V2x8HI "_q")
> @@ -1905,6 +1918,7 @@ (define_mode_attr vp [(V8QI "v") (V16QI "v")
>  		      (V2SI "p") (V4SI  "v")
>  		      (V2DI "p") (V2DF  "p")
>  		      (V2SF "p") (V4SF  "v")
> +		      (V2HF "p")
>  		      (V4HF "v") (V8HF  "v")])
> 
>  (define_mode_attr vsi2qi [(V2SI "v8qi") (V4SI "v16qi") diff --git
> a/gcc/config/arm/types.md b/gcc/config/arm/types.md index
> 7d0504bdd944e9c0d1b545b0b66a9a1adc808714..3cfbc7a93cca1bea4925853e5
> 1d0a147c5722247 100644
> --- a/gcc/config/arm/types.md
> +++ b/gcc/config/arm/types.md
> @@ -483,6 +483,7 @@ (define_attr "autodetect_type"
>  ; neon_fp_minmax_s_q
>  ; neon_fp_minmax_d
>  ; neon_fp_minmax_d_q
> +; neon_fp_reduc_add_h
>  ; neon_fp_reduc_add_s
>  ; neon_fp_reduc_add_s_q
>  ; neon_fp_reduc_add_d
> @@ -1033,6 +1034,7 @@ (define_attr "type"
>    neon_fp_minmax_d,\
>    neon_fp_minmax_d_q,\
>  \
> +  neon_fp_reduc_add_h,\
>    neon_fp_reduc_add_s,\
>    neon_fp_reduc_add_s_q,\
>    neon_fp_reduc_add_d,\
> @@ -1257,8 +1259,8 @@ (define_attr "is_neon_type" "yes,no"
>            neon_fp_compare_d, neon_fp_compare_d_q, neon_fp_minmax_s,\
>            neon_fp_minmax_s_q, neon_fp_minmax_d, neon_fp_minmax_d_q,\
>            neon_fp_neg_s, neon_fp_neg_s_q, neon_fp_neg_d,
> neon_fp_neg_d_q,\
> -          neon_fp_reduc_add_s, neon_fp_reduc_add_s_q,
> neon_fp_reduc_add_d,\
> -          neon_fp_reduc_add_d_q, neon_fp_reduc_minmax_s,
> +          neon_fp_reduc_add_h, neon_fp_reduc_add_s,
> neon_fp_reduc_add_s_q,\
> +          neon_fp_reduc_add_d, neon_fp_reduc_add_d_q,
> + neon_fp_reduc_minmax_s,\
>            neon_fp_reduc_minmax_s_q, neon_fp_reduc_minmax_d,\
>            neon_fp_reduc_minmax_d_q,\
>            neon_fp_cvt_narrow_s_q, neon_fp_cvt_narrow_d_q,\ diff --git
> a/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> b/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> index
> 07d71a63414b1066ea431e287286ad048515711a..e6021c5a42748701e5326a5c3
> 87a39a0bbadc9e5 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> @@ -30,11 +30,9 @@ vec_slp_##TYPE (TYPE *restrict a, TYPE b, TYPE c, int n)
> 	\
>  TEST_ALL (VEC_PERM)
> 
>  /* We should use one DUP for each of the 8-, 16- and 32-bit types,
> -   although we currently use LD1RW for _Float16.  We should use two
> -   DUPs for each of the three 64-bit types.  */
> +   We should use two DUPs for each of the three 64-bit types.  */
>  /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.h, [hw]} 2 } } */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.s, [sw]} 2 } } */
> -/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 1 } } */
> +/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.s, [sw]} 3 } } */
>  /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, [dx]} 9 } } */
>  /* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-
> 9]+\.d\n} 3 } } */
>  /* { dg-final { scan-assembler-not {\tzip2\t} } } */ @@ -53,7 +51,7 @@
> TEST_ALL (VEC_PERM)
>  /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
>  /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 6 } } */
>  /* { dg-final { scan-assembler-not {\tldr} } } */
> -/* { dg-final { scan-assembler-times {\tstr} 2 } } */
> -/* { dg-final { scan-assembler-times {\tstr\th[0-9]+} 2 } } */
> +/* { dg-final { scan-assembler-not {\tstr} } } */
> +/* { dg-final { scan-assembler-not {\tstr\th[0-9]+} } } */
> 
>  /* { dg-final { scan-assembler-not {\tuqdec} } } */

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH 5/8]AArch64 aarch64: Make existing V2HF be usable.
  2022-11-22 16:01       ` Tamar Christina
@ 2022-11-30  4:26         ` Tamar Christina
  0 siblings, 0 replies; 50+ messages in thread
From: Tamar Christina @ 2022-11-30  4:26 UTC (permalink / raw)
  To: Richard Sandiford
  Cc: gcc-patches, nd, Richard Earnshaw, Marcus Shawcroft, Kyrylo Tkachov

Ping x3

> -----Original Message-----
> From: Tamar Christina
> Sent: Tuesday, November 22, 2022 4:01 PM
> To: Tamar Christina <Tamar.Christina@arm.com>; Richard Sandiford
> <Richard.Sandiford@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
> <Richard.Earnshaw@arm.com>; Marcus Shawcroft
> <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
> Subject: RE: [PATCH 5/8]AArch64 aarch64: Make existing V2HF be usable.
> 
> Ping
> 
> > -----Original Message-----
> > From: Gcc-patches <gcc-patches-
> > bounces+tamar.christina=arm.com@gcc.gnu.org> On Behalf Of Tamar
> > Christina via Gcc-patches
> > Sent: Friday, November 11, 2022 2:40 PM
> > To: Richard Sandiford <Richard.Sandiford@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
> > <Richard.Earnshaw@arm.com>; Marcus Shawcroft
> > <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
> <Kyrylo.Tkachov@arm.com>
> > Subject: RE: [PATCH 5/8]AArch64 aarch64: Make existing V2HF be usable.
> >
> > Hi,
> >
> >
> > > This name might cause confusion with the SVE iterators, where FULL
> > > means "every bit of the register is used".  How about something like
> > > VMOVE instead?
> > >
> > > With this change, I guess VALL_F16 represents "The set of all modes
> > > for which the vld1 intrinsics are provided" and VMOVE or whatever is
> > > "All Advanced SIMD modes suitable for moving, loading, and storing".
> > > That is, VMOVE extends VALL_F16 with modes that are not manifested
> > > via intrinsics.
> > >
> >
> > Done.
> >
> > > Where is the 2h used, and is it valid syntax in that context?
> > >
> > > Same for later instances of 2h.
> >
> > They are, but they weren't meant to be in this patch.  They belong in
> > a separate FP16 series that I won't get to finish for GCC 13 due to not
> > being able to finish writing all the tests.  I have moved them to that
> > patch series though.
> >
> > While the addp patch series has been killed, this patch is still good
> > standalone and improves codegen as shown in the updated testcase.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > 	* config/aarch64/aarch64-simd.md (*aarch64_simd_movv2hf): New.
> > 	(mov<mode>, movmisalign<mode>, aarch64_dup_lane<mode>,
> > 	aarch64_store_lane0<mode>, aarch64_simd_vec_set<mode>,
> > 	@aarch64_simd_vec_copy_lane<mode>, vec_set<mode>,
> > 	reduc_<optab>_scal_<mode>, reduc_<fmaxmin>_scal_<mode>,
> > 	aarch64_reduc_<optab>_internal<mode>,
> > aarch64_get_lane<mode>,
> > 	vec_init<mode><Vel>, vec_extract<mode><Vel>): Support V2HF.
> > 	(aarch64_simd_dupv2hf): New.
> > 	* config/aarch64/aarch64.cc (aarch64_classify_vector_mode):
> > 	Add E_V2HFmode.
> > 	* config/aarch64/iterators.md (VHSDF_P): New.
> > 	(V2F, VMOVE, nunits, Vtype, Vmtype, Vetype, stype, VEL,
> > 	Vel, q, vp): Add V2HF.
> > 	* config/arm/types.md (neon_fp_reduc_add_h): New.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 	* gcc.target/aarch64/sve/slp_1.c: Update testcase.
> >
> > --- inline copy of patch ---
> >
> > diff --git a/gcc/config/aarch64/aarch64-simd.md
> > b/gcc/config/aarch64/aarch64-simd.md
> > index
> >
> f4152160084d6b6f34bd69f0ba6386c1ab50f77e..487a31010245accec28e779661
> > e6c2d578fca4b7 100644
> > --- a/gcc/config/aarch64/aarch64-simd.md
> > +++ b/gcc/config/aarch64/aarch64-simd.md
> > @@ -19,10 +19,10 @@
> >  ;; <http://www.gnu.org/licenses/>.
> >
> >  (define_expand "mov<mode>"
> > -  [(set (match_operand:VALL_F16 0 "nonimmediate_operand")
> > -	(match_operand:VALL_F16 1 "general_operand"))]
> > +  [(set (match_operand:VMOVE 0 "nonimmediate_operand")
> > +	(match_operand:VMOVE 1 "general_operand"))]
> >    "TARGET_SIMD"
> > -  "
> > +{
> >    /* Force the operand into a register if it is not an
> >       immediate whose use can be replaced with xzr.
> >       If the mode is 16 bytes wide, then we will be doing @@ -46,12
> > +46,11 @@ (define_expand "mov<mode>"
> >        aarch64_expand_vector_init (operands[0], operands[1]);
> >        DONE;
> >      }
> > -  "
> > -)
> > +})
> >
> >  (define_expand "movmisalign<mode>"
> > -  [(set (match_operand:VALL_F16 0 "nonimmediate_operand")
> > -        (match_operand:VALL_F16 1 "general_operand"))]
> > +  [(set (match_operand:VMOVE 0 "nonimmediate_operand")
> > +        (match_operand:VMOVE 1 "general_operand"))]
> >    "TARGET_SIMD && !STRICT_ALIGNMENT"
> >  {
> >    /* This pattern is not permitted to fail during expansion: if both
> > arguments @@ -73,6 +72,16 @@ (define_insn
> "aarch64_simd_dup<mode>"
> >    [(set_attr "type" "neon_dup<q>, neon_from_gp<q>")]
> >  )
> >
> > +(define_insn "aarch64_simd_dupv2hf"
> > +  [(set (match_operand:V2HF 0 "register_operand" "=w")
> > +	(vec_duplicate:V2HF
> > +	  (match_operand:HF 1 "register_operand" "0")))]
> > +  "TARGET_SIMD"
> > +  "@
> > +   sli\\t%d0, %d1, 16"
> > +  [(set_attr "type" "neon_shift_imm")]
> > +)
> > +
> >  (define_insn "aarch64_simd_dup<mode>"
> >    [(set (match_operand:VDQF_F16 0 "register_operand" "=w,w")
> >  	(vec_duplicate:VDQF_F16
> > @@ -85,10 +94,10 @@ (define_insn "aarch64_simd_dup<mode>"
> >  )
> >
> >  (define_insn "aarch64_dup_lane<mode>"
> > -  [(set (match_operand:VALL_F16 0 "register_operand" "=w")
> > -	(vec_duplicate:VALL_F16
> > +  [(set (match_operand:VMOVE 0 "register_operand" "=w")
> > +	(vec_duplicate:VMOVE
> >  	  (vec_select:<VEL>
> > -	    (match_operand:VALL_F16 1 "register_operand" "w")
> > +	    (match_operand:VMOVE 1 "register_operand" "w")
> >  	    (parallel [(match_operand:SI 2 "immediate_operand" "i")])
> >            )))]
> >    "TARGET_SIMD"
> > @@ -142,6 +151,29 @@ (define_insn
> > "*aarch64_simd_mov<VDMOV:mode>"
> >  		     mov_reg, neon_move<q>")]
> >  )
> >
> > +(define_insn "*aarch64_simd_movv2hf"
> > +  [(set (match_operand:V2HF 0 "nonimmediate_operand"
> > +		"=w, m,  m,  w, ?r, ?w, ?r, w, w")
> > +	(match_operand:V2HF 1 "general_operand"
> > +		"m,  Dz, w,  w,  w,  r,  r, Dz, Dn"))]
> > +  "TARGET_SIMD_F16INST
> > +   && (register_operand (operands[0], V2HFmode)
> > +       || aarch64_simd_reg_or_zero (operands[1], V2HFmode))"
> > +   "@
> > +    ldr\\t%s0, %1
> > +    str\\twzr, %0
> > +    str\\t%s1, %0
> > +    mov\\t%0.2s[0], %1.2s[0]
> > +    umov\\t%w0, %1.s[0]
> > +    fmov\\t%s0, %1
> > +    mov\\t%0, %1
> > +    movi\\t%d0, 0
> > +    * return aarch64_output_simd_mov_immediate (operands[1], 32);"
> > +  [(set_attr "type" "neon_load1_1reg, store_8, neon_store1_1reg,\
> > +		     neon_logic, neon_to_gp, f_mcr,\
> > +		     mov_reg, neon_move, neon_move")]
> > +)
> > +
> >  (define_insn "*aarch64_simd_mov<VQMOV:mode>"
> >    [(set (match_operand:VQMOV 0 "nonimmediate_operand"
> >  		"=w, Umn,  m,  w, ?r, ?w, ?r, w")
> > @@ -182,7 +214,7 @@ (define_insn
> "*aarch64_simd_mov<VQMOV:mode>"
> >
> >  (define_insn "aarch64_store_lane0<mode>"
> >    [(set (match_operand:<VEL> 0 "memory_operand" "=m")
> > -	(vec_select:<VEL> (match_operand:VALL_F16 1 "register_operand"
> > "w")
> > +	(vec_select:<VEL> (match_operand:VMOVE 1 "register_operand"
> > "w")
> >  			(parallel [(match_operand 2 "const_int_operand"
> > "n")])))]
> >    "TARGET_SIMD
> >     && ENDIAN_LANE_N (<nunits>, INTVAL (operands[2])) == 0"
> > @@ -1035,11 +1067,11 @@ (define_insn "one_cmpl<mode>2"
> >  )
> >
> >  (define_insn "aarch64_simd_vec_set<mode>"
> > -  [(set (match_operand:VALL_F16 0 "register_operand" "=w,w,w")
> > -	(vec_merge:VALL_F16
> > -	    (vec_duplicate:VALL_F16
> > +  [(set (match_operand:VMOVE 0 "register_operand" "=w,w,w")
> > +	(vec_merge:VMOVE
> > +	    (vec_duplicate:VMOVE
> >  		(match_operand:<VEL> 1
> > "aarch64_simd_nonimmediate_operand" "w,?r,Utv"))
> > -	    (match_operand:VALL_F16 3 "register_operand" "0,0,0")
> > +	    (match_operand:VMOVE 3 "register_operand" "0,0,0")
> >  	    (match_operand:SI 2 "immediate_operand" "i,i,i")))]
> >    "TARGET_SIMD"
> >    {
> > @@ -1061,14 +1093,14 @@ (define_insn "aarch64_simd_vec_set<mode>"
> >  )
> >
> >  (define_insn "@aarch64_simd_vec_copy_lane<mode>"
> > -  [(set (match_operand:VALL_F16 0 "register_operand" "=w")
> > -	(vec_merge:VALL_F16
> > -	    (vec_duplicate:VALL_F16
> > +  [(set (match_operand:VMOVE 0 "register_operand" "=w")
> > +	(vec_merge:VMOVE
> > +	    (vec_duplicate:VMOVE
> >  	      (vec_select:<VEL>
> > -		(match_operand:VALL_F16 3 "register_operand" "w")
> > +		(match_operand:VMOVE 3 "register_operand" "w")
> >  		(parallel
> >  		  [(match_operand:SI 4 "immediate_operand" "i")])))
> > -	    (match_operand:VALL_F16 1 "register_operand" "0")
> > +	    (match_operand:VMOVE 1 "register_operand" "0")
> >  	    (match_operand:SI 2 "immediate_operand" "i")))]
> >    "TARGET_SIMD"
> >    {
> > @@ -1376,7 +1408,7 @@ (define_insn "vec_shr_<mode>"
> >  )
> >
> >  (define_expand "vec_set<mode>"
> > -  [(match_operand:VALL_F16 0 "register_operand")
> > +  [(match_operand:VMOVE 0 "register_operand")
> >     (match_operand:<VEL> 1 "aarch64_simd_nonimmediate_operand")
> >     (match_operand:SI 2 "immediate_operand")]
> >    "TARGET_SIMD"
> > @@ -3495,7 +3527,7 @@ (define_insn "popcount<mode>2"
> >  ;; gimple_fold'd to the IFN_REDUC_(MAX|MIN) function.  (This is FP
> > smax/smin).
> >  (define_expand "reduc_<optab>_scal_<mode>"
> >    [(match_operand:<VEL> 0 "register_operand")
> > -   (unspec:<VEL> [(match_operand:VHSDF 1 "register_operand")]
> > +   (unspec:<VEL> [(match_operand:VHSDF_P 1 "register_operand")]
> >  		 FMAXMINV)]
> >    "TARGET_SIMD"
> >    {
> > @@ -3510,7 +3542,7 @@ (define_expand "reduc_<optab>_scal_<mode>"
> >
> >  (define_expand "reduc_<fmaxmin>_scal_<mode>"
> >    [(match_operand:<VEL> 0 "register_operand")
> > -   (unspec:<VEL> [(match_operand:VHSDF 1 "register_operand")]
> > +   (unspec:<VEL> [(match_operand:VHSDF_P 1 "register_operand")]
> >  		 FMAXMINNMV)]
> >    "TARGET_SIMD"
> >    {
> > @@ -3554,8 +3586,8 @@ (define_insn
> > "aarch64_reduc_<optab>_internalv2si"
> >  )
> >
> >  (define_insn "aarch64_reduc_<optab>_internal<mode>"
> > - [(set (match_operand:VHSDF 0 "register_operand" "=w")
> > -       (unspec:VHSDF [(match_operand:VHSDF 1 "register_operand" "w")]
> > + [(set (match_operand:VHSDF_P 0 "register_operand" "=w")
> > +       (unspec:VHSDF_P [(match_operand:VHSDF_P 1 "register_operand"
> > + "w")]
> >  		      FMAXMINV))]
> >   "TARGET_SIMD"
> >   "<maxmin_uns_op><vp>\\t%<Vetype>0, %1.<Vtype>"
> > @@ -4200,7 +4232,7 @@ (define_insn
> > "*aarch64_get_lane_zero_extend<GPI:mode><VDQQH:mode>"
> >  (define_insn_and_split "aarch64_get_lane<mode>"
> >    [(set (match_operand:<VEL> 0 "aarch64_simd_nonimmediate_operand"
> > "=?r, w, Utv")
> >  	(vec_select:<VEL>
> > -	  (match_operand:VALL_F16 1 "register_operand" "w, w, w")
> > +	  (match_operand:VMOVE 1 "register_operand" "w, w, w")
> >  	  (parallel [(match_operand:SI 2 "immediate_operand" "i, i, i")])))]
> >    "TARGET_SIMD"
> >    {
> > @@ -7981,7 +8013,7 @@ (define_expand "aarch64_st1<VALL_F16:mode>"
> >  ;; Standard pattern name vec_init<mode><Vel>.
> >
> >  (define_expand "vec_init<mode><Vel>"
> > -  [(match_operand:VALL_F16 0 "register_operand")
> > +  [(match_operand:VMOVE 0 "register_operand")
> >     (match_operand 1 "" "")]
> >    "TARGET_SIMD"
> >  {
> > @@ -8060,7 +8092,7 @@ (define_insn "aarch64_urecpe<mode>"
> >
> >  (define_expand "vec_extract<mode><Vel>"
> >    [(match_operand:<VEL> 0 "aarch64_simd_nonimmediate_operand")
> > -   (match_operand:VALL_F16 1 "register_operand")
> > +   (match_operand:VMOVE 1 "register_operand")
> >     (match_operand:SI 2 "immediate_operand")]
> >    "TARGET_SIMD"
> >  {
> > diff --git a/gcc/config/aarch64/aarch64.cc
> > b/gcc/config/aarch64/aarch64.cc index
> >
> 84dbe2f4ea7d03b424602ed98a34e7824217dc91..35671cb86e374f9ded21d0e4
> > 944c63bc2cbc0901 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -3566,6 +3566,7 @@ aarch64_classify_vector_mode (machine_mode
> > mode)
> >      case E_V8BFmode:
> >      case E_V4SFmode:
> >      case E_V2DFmode:
> > +    case E_V2HFmode:
> >        return TARGET_SIMD ? VEC_ADVSIMD : 0;
> >
> >      default:
> > diff --git a/gcc/config/aarch64/iterators.md
> > b/gcc/config/aarch64/iterators.md index
> >
> 37d8161a33b1c399d80be82afa67613a087389d4..dfcf86a440e316c2abdbcc6463
> > 63d39e458d1a91 100644
> > --- a/gcc/config/aarch64/iterators.md
> > +++ b/gcc/config/aarch64/iterators.md
> > @@ -160,6 +160,10 @@ (define_mode_iterator VDQF [V2SF V4SF V2DF])
> > (define_mode_iterator VHSDF [(V4HF "TARGET_SIMD_F16INST")
> >  			     (V8HF "TARGET_SIMD_F16INST")
> >  			     V2SF V4SF V2DF])
> > +;; Advanced SIMD Float modes suitable for pairwise operations.
> > +(define_mode_iterator VHSDF_P [(V4HF "TARGET_SIMD_F16INST")
> > +			       (V8HF "TARGET_SIMD_F16INST")
> > +			       V2SF V4SF V2DF (V2HF
> > "TARGET_SIMD_F16INST")])
> >
> >  ;; Advanced SIMD Float modes, and DF.
> >  (define_mode_iterator VDQF_DF [V2SF V4SF V2DF DF]) @@ -188,15
> +192,23
> > @@ (define_mode_iterator VDQF_COND [V2SF V2SI V4SF V4SI V2DF
> V2DI])
> > (define_mode_iterator VALLF [V2SF V4SF V2DF SF DF])
> >
> >  ;; Advanced SIMD Float modes with 2 elements.
> > -(define_mode_iterator V2F [V2SF V2DF])
> > +(define_mode_iterator V2F [V2SF V2DF V2HF])
> >
> >  ;; All Advanced SIMD modes on which we support any arithmetic
> operations.
> >  (define_mode_iterator VALL [V8QI V16QI V4HI V8HI V2SI V4SI V2DI V2SF
> > V4SF V2DF])
> >
> > -;; All Advanced SIMD modes suitable for moving, loading, and storing.
> > +;; All Advanced SIMD modes suitable for moving, loading, and storing
> > +;; except V2HF.
> >  (define_mode_iterator VALL_F16 [V8QI V16QI V4HI V8HI V2SI V4SI V2DI
> >  				V4HF V8HF V4BF V8BF V2SF V4SF V2DF])
> >
> > +;; All Advanced SIMD modes suitable for moving, loading, and storing
> > +;; including V2HF (define_mode_iterator VMOVE [V8QI V16QI V4HI V8HI
> > +V2SI V4SI V2DI
> > +			     V4HF V8HF V4BF V8BF V2SF V4SF V2DF
> > +			     (V2HF "TARGET_SIMD_F16INST")])
> > +
> > +
> >  ;; The VALL_F16 modes except the 128-bit 2-element ones.
> >  (define_mode_iterator VALL_F16_NO_V2Q [V8QI V16QI V4HI V8HI V2SI
> V4SI
> >  				V4HF V8HF V2SF V4SF])
> > @@ -1076,7 +1088,7 @@ (define_mode_attr nunits [(V8QI "8") (V16QI
> "16")
> >  			  (V2SF "2") (V4SF "4")
> >  			  (V1DF "1") (V2DF "2")
> >  			  (DI "1") (DF "1")
> > -			  (V8DI "8")])
> > +			  (V8DI "8") (V2HF "2")])
> >
> >  ;; Map a mode to the number of bits in it, if the size of the mode
> > ;; is constant.
> > @@ -1090,6 +1102,7 @@ (define_mode_attr s [(HF "h") (SF "s") (DF "d")
> > (SI
> > "s") (DI "d")])
> >
> >  ;; Give the length suffix letter for a sign- or zero-extension.
> >  (define_mode_attr size [(QI "b") (HI "h") (SI "w")])
> > +(define_mode_attr sizel [(QI "b") (HI "h") (SI "")])
> >
> >  ;; Give the number of bits in the mode  (define_mode_attr sizen [(QI
> > "8") (HI "16") (SI "32") (DI "64")]) @@ -1193,7
> > +1206,7 @@ (define_mode_attr Vmntype [(V8HI ".8b") (V4SI ".4h")
> > (define_mode_attr Vetype [(V8QI "b") (V16QI "b")
> >  			  (V4HI "h") (V8HI  "h")
> >  			  (V2SI "s") (V4SI  "s")
> > -			  (V2DI "d")
> > +			  (V2DI "d") (V2HF  "h")
> >  			  (V4HF "h") (V8HF  "h")
> >  			  (V2SF "s") (V4SF  "s")
> >  			  (V2DF "d")
> > @@ -1285,7 +1298,7 @@ (define_mode_attr Vcwtype [(VNx16QI "b")
> (VNx8QI
> > "h") (VNx4QI "w") (VNx2QI "d")  ;; more accurately.
> >  (define_mode_attr stype [(V8QI "b") (V16QI "b") (V4HI "s") (V8HI "s")
> >  			 (V2SI "s") (V4SI "s") (V2DI "d") (V4HF "s")
> > -			 (V8HF "s") (V2SF "s") (V4SF "s") (V2DF "d")
> > +			 (V8HF "s") (V2SF "s") (V4SF "s") (V2DF "d") (V2HF
> > "s")
> >  			 (HF "s") (SF "s") (DF "d") (QI "b") (HI "s")
> >  			 (SI "s") (DI "d")])
> >
> > @@ -1360,8 +1373,8 @@ (define_mode_attr VEL [(V8QI  "QI") (V16QI "QI")
> >  		       (V4HF "HF") (V8HF  "HF")
> >  		       (V2SF "SF") (V4SF  "SF")
> >  		       (DF   "DF") (V2DF  "DF")
> > -		       (SI   "SI") (HI    "HI")
> > -		       (QI   "QI")
> > +		       (SI   "SI") (V2HF  "HF")
> > +		       (QI   "QI") (HI    "HI")
> >  		       (V4BF "BF") (V8BF "BF")
> >  		       (VNx16QI "QI") (VNx8QI "QI") (VNx4QI "QI") (VNx2QI
> > "QI")
> >  		       (VNx8HI "HI") (VNx4HI "HI") (VNx2HI "HI") @@ -1381,7
> > +1394,7 @@ (define_mode_attr Vel [(V8QI "qi") (V16QI "qi")
> >  		       (V2SF "sf") (V4SF "sf")
> >  		       (V2DF "df") (DF   "df")
> >  		       (SI   "si") (HI   "hi")
> > -		       (QI   "qi")
> > +		       (QI   "qi") (V2HF "hf")
> >  		       (V4BF "bf") (V8BF "bf")
> >  		       (VNx16QI "qi") (VNx8QI "qi") (VNx4QI "qi") (VNx2QI "qi")
> >  		       (VNx8HI "hi") (VNx4HI "hi") (VNx2HI "hi") @@ -1866,7
> > +1879,7 @@ (define_mode_attr q [(V8QI "") (V16QI "_q")
> >  		     (V4HF "") (V8HF "_q")
> >  		     (V4BF "") (V8BF "_q")
> >  		     (V2SF "") (V4SF  "_q")
> > -			       (V2DF  "_q")
> > +		     (V2HF "") (V2DF  "_q")
> >  		     (QI "") (HI "") (SI "") (DI "") (HF "") (SF "") (DF "")
> >  		     (V2x8QI "") (V2x16QI "_q")
> >  		     (V2x4HI "") (V2x8HI "_q")
> > @@ -1905,6 +1918,7 @@ (define_mode_attr vp [(V8QI "v") (V16QI "v")
> >  		      (V2SI "p") (V4SI  "v")
> >  		      (V2DI "p") (V2DF  "p")
> >  		      (V2SF "p") (V4SF  "v")
> > +		      (V2HF "p")
> >  		      (V4HF "v") (V8HF  "v")])
> >
> >  (define_mode_attr vsi2qi [(V2SI "v8qi") (V4SI "v16qi") diff --git
> > a/gcc/config/arm/types.md b/gcc/config/arm/types.md index
> >
> 7d0504bdd944e9c0d1b545b0b66a9a1adc808714..3cfbc7a93cca1bea4925853e5
> > 1d0a147c5722247 100644
> > --- a/gcc/config/arm/types.md
> > +++ b/gcc/config/arm/types.md
> > @@ -483,6 +483,7 @@ (define_attr "autodetect_type"
> >  ; neon_fp_minmax_s_q
> >  ; neon_fp_minmax_d
> >  ; neon_fp_minmax_d_q
> > +; neon_fp_reduc_add_h
> >  ; neon_fp_reduc_add_s
> >  ; neon_fp_reduc_add_s_q
> >  ; neon_fp_reduc_add_d
> > @@ -1033,6 +1034,7 @@ (define_attr "type"
> >    neon_fp_minmax_d,\
> >    neon_fp_minmax_d_q,\
> >  \
> > +  neon_fp_reduc_add_h,\
> >    neon_fp_reduc_add_s,\
> >    neon_fp_reduc_add_s_q,\
> >    neon_fp_reduc_add_d,\
> > @@ -1257,8 +1259,8 @@ (define_attr "is_neon_type" "yes,no"
> >            neon_fp_compare_d, neon_fp_compare_d_q, neon_fp_minmax_s,\
> >            neon_fp_minmax_s_q, neon_fp_minmax_d,
> neon_fp_minmax_d_q,\
> >            neon_fp_neg_s, neon_fp_neg_s_q, neon_fp_neg_d,
> > neon_fp_neg_d_q,\
> > -          neon_fp_reduc_add_s, neon_fp_reduc_add_s_q,
> > neon_fp_reduc_add_d,\
> > -          neon_fp_reduc_add_d_q, neon_fp_reduc_minmax_s,
> > +          neon_fp_reduc_add_h, neon_fp_reduc_add_s,
> > neon_fp_reduc_add_s_q,\
> > +          neon_fp_reduc_add_d, neon_fp_reduc_add_d_q,
> > + neon_fp_reduc_minmax_s,\
> >            neon_fp_reduc_minmax_s_q, neon_fp_reduc_minmax_d,\
> >            neon_fp_reduc_minmax_d_q,\
> >            neon_fp_cvt_narrow_s_q, neon_fp_cvt_narrow_d_q,\ diff --git
> > a/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> > b/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> > index
> >
> 07d71a63414b1066ea431e287286ad048515711a..e6021c5a42748701e5326a5c3
> > 87a39a0bbadc9e5 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> > @@ -30,11 +30,9 @@ vec_slp_##TYPE (TYPE *restrict a, TYPE b, TYPE c, int
> n)
> > 	\
> >  TEST_ALL (VEC_PERM)
> >
> >  /* We should use one DUP for each of the 8-, 16- and 32-bit types,
> > -   although we currently use LD1RW for _Float16.  We should use two
> > -   DUPs for each of the three 64-bit types.  */
> > +   We should use two DUPs for each of the three 64-bit types.  */
> >  /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.h, [hw]} 2 } }
> > */
> > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.s, [sw]} 2 } }
> > */
> > -/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 1 } } */
> > +/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.s, [sw]} 3 } }
> > +*/
> >  /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, [dx]} 9 } }
> > */
> >  /* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d,
> > z[0- 9]+\.d\n} 3 } } */
> >  /* { dg-final { scan-assembler-not {\tzip2\t} } } */ @@ -53,7 +51,7
> > @@ TEST_ALL (VEC_PERM)
> >  /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
> >  /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 6 } } */
> >  /* { dg-final { scan-assembler-not {\tldr} } } */
> > -/* { dg-final { scan-assembler-times {\tstr} 2 } } */
> > -/* { dg-final { scan-assembler-times {\tstr\th[0-9]+} 2 } } */
> > +/* { dg-final { scan-assembler-not {\tstr} } } */
> > +/* { dg-final { scan-assembler-not {\tstr\th[0-9]+} } } */
> >
> >  /* { dg-final { scan-assembler-not {\tuqdec} } } */

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH 7/8]AArch64: Consolidate zero and sign extension patterns and add missing ones.
  2022-10-31 11:59 ` [PATCH 7/8]AArch64: Consolidate zero and sign extension patterns and add missing ones Tamar Christina
@ 2022-11-30  4:28   ` Tamar Christina
  2022-12-06 15:59   ` Richard Sandiford
  1 sibling, 0 replies; 50+ messages in thread
From: Tamar Christina @ 2022-11-30  4:28 UTC (permalink / raw)
  To: Tamar Christina, gcc-patches
  Cc: nd, Richard Earnshaw, Marcus Shawcroft, Kyrylo Tkachov,
	Richard Sandiford

Ping.

> -----Original Message-----
> From: Tamar Christina <tamar.christina@arm.com>
> Sent: Monday, October 31, 2022 12:00 PM
> To: gcc-patches@gcc.gnu.org
> Cc: nd <nd@arm.com>; Richard Earnshaw <Richard.Earnshaw@arm.com>;
> Marcus Shawcroft <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
> <Kyrylo.Tkachov@arm.com>; Richard Sandiford
> <Richard.Sandiford@arm.com>
> Subject: [PATCH 7/8]AArch64: Consolidate zero and sign extension patterns
> and add missing ones.
> 
> Hi All,
> 
> The target has various zero and sign extension patterns.  These, however,
> live in various locations around the MD file and almost all of them are split
> differently.  Due to the scattered patterns we also ended up missing valid
> extensions.  For instance, smov is almost never generated.
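> 
> As a rough illustration (the function below is just a made-up example, not
> one of the testcases in this series), the kind of code this affects is:
> 
> #include <arm_neon.h>
> 
> long long
> lane_to_s64 (int16x4_t v)
> {
>   /* Sign-extending a 16-bit lane straight to 64 bits should ideally be a
>      single smov x0, v0.h[0] rather than a lane move followed by a separate
>      sign extend.  */
>   return vget_lane_s16 (v, 0);
> }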
> 
> This change tries to make this more manageable by consolidating the
> patterns as much as possible and, in doing so, fixes the missing alternatives.
> 
> There were also some duplicate patterns.  Note that the
> zero_extend<*_ONLY:mode><SD_HSDI:mode>2 patterns are nearly identical;
> however, QImode lacks an alternative that the others have, so I have left
> them as 3 different patterns next to each other.
> 
> In a lot of cases the wrong iterator was used, leaving out cases that should
> exist.
> 
> I've also changed the masks used for zero extensions to hex instead of
> decimal, as it's clearer what they do that way and it aligns better with the
> output of other compilers.
> 
> This leaves the bulk of the extensions in just 3 patterns.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> 
> Ok for master?
> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	* config/aarch64/aarch64-simd.md
> 	(*aarch64_get_lane_zero_extend<GPI:mode><VDQQH:mode>):
> Changed to ...
> 	(*aarch64_get_lane_zero_extend<GPI:mode><VDQV_L:mode>): ...
> This.
> 	(*aarch64_get_lane_extenddi<VS:mode>): New.
> 	* config/aarch64/aarch64.md (<optab>sidi2, *extendsidi2_aarch64,
> 	<optab>qihi2, *extendqihi2_aarch64, *zero_extendsidi2_aarch64):
> Remove
> 	duplicate patterns.
> 	(<ANY_EXTEND:optab><SHORT:mode><GPI:mode>2,
> 	*extend<SHORT:mode><GPI:mode>2_aarch64): Remove,
> consolidate
> 	into ...
> 	(extend<ALLX:mode><SD_HSDI:mode>2): ... This.
> 	(*zero_extendqihi2_aarch64,
> 	*zero_extend<SHORT:mode><GPI:mode>2_aarch64): Remove,
> consolidate into
> 	...
> 	(zero_extend<SI_ONLY:mode><SD_HSDI:mode>2,
> 	zero_extend<HI_ONLY:mode><SD_HSDI:mode>2,
> 	zero_extend<QI_ONLY:mode><SD_HSDI:mode>2):
> 	(*ands<GPI:mode>_compare0): Renamed to ...
> 	(*ands<SD_HSDI:mode>_compare0): ... This.
> 	* config/aarch64/iterators.md (HI_ONLY, QI_ONLY): New.
> 	(short_mask): Use hex rather than dec and add SI.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.target/aarch64/ands_3.c: Update codegen.
> 	* gcc.target/aarch64/sve/slp_1.c: Likewise.
> 	* gcc.target/aarch64/tst_5.c: Likewise.
> 	* gcc.target/aarch64/tst_6.c: Likewise.
> 
> --- inline copy of patch --
> diff --git a/gcc/config/aarch64/aarch64-simd.md
> b/gcc/config/aarch64/aarch64-simd.md
> index
> 8a84a8560e982b8155b18541f5504801b3330124..d0b37c4dd48aeafd3d87c90dc
> 3270e71af5a72b9 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -4237,19 +4237,34 @@ (define_insn
> "*aarch64_get_lane_extend<GPI:mode><VDQQH:mode>"
>    [(set_attr "type" "neon_to_gp<VDQQH:q>")]
>  )
> 
> -(define_insn
> "*aarch64_get_lane_zero_extend<GPI:mode><VDQQH:mode>"
> +(define_insn "*aarch64_get_lane_extenddi<VS:mode>"
> +  [(set (match_operand:DI 0 "register_operand" "=r")
> +	(sign_extend:DI
> +	  (vec_select:<VS:VEL>
> +	    (match_operand:VS 1 "register_operand" "w")
> +	    (parallel [(match_operand:SI 2 "immediate_operand" "i")]))))]
> +  "TARGET_SIMD"
> +  {
> +    operands[2] = aarch64_endian_lane_rtx (<VS:MODE>mode,
> +					   INTVAL (operands[2]));
> +    return "smov\\t%x0, %1.<VS:Vetype>[%2]";
> +  }
> +  [(set_attr "type" "neon_to_gp<VS:q>")]
> +)
> +
> +(define_insn
> "*aarch64_get_lane_zero_extend<GPI:mode><VDQV_L:mode>"
>    [(set (match_operand:GPI 0 "register_operand" "=r")
>  	(zero_extend:GPI
> -	  (vec_select:<VDQQH:VEL>
> -	    (match_operand:VDQQH 1 "register_operand" "w")
> +	  (vec_select:<VDQV_L:VEL>
> +	    (match_operand:VDQV_L 1 "register_operand" "w")
>  	    (parallel [(match_operand:SI 2 "immediate_operand" "i")]))))]
>    "TARGET_SIMD"
>    {
> -    operands[2] = aarch64_endian_lane_rtx (<VDQQH:MODE>mode,
> +    operands[2] = aarch64_endian_lane_rtx (<VDQV_L:MODE>mode,
>  					   INTVAL (operands[2]));
> -    return "umov\\t%w0, %1.<VDQQH:Vetype>[%2]";
> +    return "umov\\t%w0, %1.<VDQV_L:Vetype>[%2]";
>    }
> -  [(set_attr "type" "neon_to_gp<VDQQH:q>")]
> +  [(set_attr "type" "neon_to_gp<VDQV_L:q>")]
>  )
> 
>  ;; Lane extraction of a value, neither sign nor zero extension diff --git
> a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md index
> 3ea16dbc2557c6a4f37104d44a49f77f768eb53d..09ae1118371f82ca63146fceb9
> 53eb9e820d05a4 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -1911,22 +1911,6 @@ (define_insn
> "storewb_pair<TX:mode>_<P:mode>"
>  ;; Sign/Zero extension
>  ;; -------------------------------------------------------------------
> 
> -(define_expand "<optab>sidi2"
> -  [(set (match_operand:DI 0 "register_operand")
> -	(ANY_EXTEND:DI (match_operand:SI 1 "nonimmediate_operand")))]
> -  ""
> -)
> -
> -(define_insn "*extendsidi2_aarch64"
> -  [(set (match_operand:DI 0 "register_operand" "=r,r")
> -        (sign_extend:DI (match_operand:SI 1 "nonimmediate_operand"
> "r,m")))]
> -  ""
> -  "@
> -   sxtw\t%0, %w1
> -   ldrsw\t%0, %1"
> -  [(set_attr "type" "extend,load_4")]
> -)
> -
>  (define_insn "*load_pair_extendsidi2_aarch64"
>    [(set (match_operand:DI 0 "register_operand" "=r")
>  	(sign_extend:DI (match_operand:SI 1 "aarch64_mem_pair_operand"
> "Ump"))) @@ -1940,21 +1924,6 @@ (define_insn
> "*load_pair_extendsidi2_aarch64"
>    [(set_attr "type" "load_8")]
>  )
> 
> -(define_insn "*zero_extendsidi2_aarch64"
> -  [(set (match_operand:DI 0 "register_operand" "=r,r,w,w,r,w")
> -        (zero_extend:DI (match_operand:SI 1 "nonimmediate_operand"
> "r,m,r,m,w,w")))]
> -  ""
> -  "@
> -   uxtw\t%0, %w1
> -   ldr\t%w0, %1
> -   fmov\t%s0, %w1
> -   ldr\t%s0, %1
> -   fmov\t%w0, %s1
> -   fmov\t%s0, %s1"
> -  [(set_attr "type" "mov_reg,load_4,f_mcr,f_loads,f_mrc,fmov")
> -   (set_attr "arch" "*,*,fp,fp,fp,fp")]
> -)
> -
>  (define_insn "*load_pair_zero_extendsidi2_aarch64"
>    [(set (match_operand:DI 0 "register_operand" "=r,w")
>  	(zero_extend:DI (match_operand:SI 1
> "aarch64_mem_pair_operand" "Ump,Ump"))) @@ -1971,61 +1940,64 @@
> (define_insn "*load_pair_zero_extendsidi2_aarch64"
>     (set_attr "arch" "*,fp")]
>  )
> 
> -(define_expand "<ANY_EXTEND:optab><SHORT:mode><GPI:mode>2"
> -  [(set (match_operand:GPI 0 "register_operand")
> -        (ANY_EXTEND:GPI (match_operand:SHORT 1
> "nonimmediate_operand")))]
> -  ""
> -)
> -
> -(define_insn "*extend<SHORT:mode><GPI:mode>2_aarch64"
> -  [(set (match_operand:GPI 0 "register_operand" "=r,r,r")
> -        (sign_extend:GPI (match_operand:SHORT 1 "nonimmediate_operand"
> "r,m,w")))]
> +(define_insn "extend<ALLX:mode><SD_HSDI:mode>2"
> +  [(set (match_operand:SD_HSDI 0 "register_operand" "=r,r,r")
> +        (sign_extend:SD_HSDI
> +	  (match_operand:ALLX 1 "nonimmediate_operand" "r,m,w")))]
>    ""
>    "@
> -   sxt<SHORT:size>\t%<GPI:w>0, %w1
> -   ldrs<SHORT:size>\t%<GPI:w>0, %1
> -   smov\t%<GPI:w>0, %1.<SHORT:size>[0]"
> +   sxt<ALLX:size>\t%<SD_HSDI:w>0, %w1
> +   ldrs<ALLX:size>\t%<SD_HSDI:w>0, %1
> +   smov\t%<SD_HSDI:w>0, %1.<ALLX:Vetype>[0]"
>    [(set_attr "type" "extend,load_4,neon_to_gp")
>     (set_attr "arch" "*,*,fp")]
>  )
> 
> -(define_insn "*zero_extend<SHORT:mode><GPI:mode>2_aarch64"
> -  [(set (match_operand:GPI 0 "register_operand" "=r,r,w,r")
> -        (zero_extend:GPI (match_operand:SHORT 1 "nonimmediate_operand"
> "r,m,m,w")))]
> +(define_insn "zero_extend<SI_ONLY:mode><SD_HSDI:mode>2"
> +  [(set (match_operand:SD_HSDI 0 "register_operand" "=r,r,w,w,r,w")
> +        (zero_extend:SD_HSDI
> +	  (match_operand:SI_ONLY 1 "nonimmediate_operand"
> "r,m,r,m,w,w")))]
>    ""
>    "@
> -   and\t%<GPI:w>0, %<GPI:w>1, <SHORT:short_mask>
> -   ldr<SHORT:size>\t%w0, %1
> -   ldr\t%<SHORT:size>0, %1
> -   umov\t%w0, %1.<SHORT:size>[0]"
> -  [(set_attr "type" "logic_imm,load_4,f_loads,neon_to_gp")
> -   (set_attr "arch" "*,*,fp,fp")]
> -)
> -
> -(define_expand "<optab>qihi2"
> -  [(set (match_operand:HI 0 "register_operand")
> -        (ANY_EXTEND:HI (match_operand:QI 1 "nonimmediate_operand")))]
> -  ""
> +   uxt<SI_ONLY:size>\t%<SD_HSDI:w>0, %w1
> +   ldr<SI_ONLY:sizel>\t%w0, %1
> +   fmov\t%<SI_ONLY:Vetype>0, %w1
> +   ldr\t%<SI_ONLY:Vetype>0, %1
> +   fmov\t%w0, %<SI_ONLY:Vetype>1
> +   fmov\t%<SI_ONLY:Vetype>0, %<SI_ONLY:Vetype>1"
> +  [(set_attr "type" "mov_reg,load_4,f_mcr,f_loads,f_mrc,fmov")
> +   (set_attr "arch" "*,*,fp,fp,fp,fp")]
>  )
> 
> -(define_insn "*extendqihi2_aarch64"
> -  [(set (match_operand:HI 0 "register_operand" "=r,r")
> -	(sign_extend:HI (match_operand:QI 1 "nonimmediate_operand"
> "r,m")))]
> +(define_insn "zero_extend<HI_ONLY:mode><SD_HSDI:mode>2"
> +  [(set (match_operand:SD_HSDI 0 "register_operand" "=r,r,w,w,r,w")
> +        (zero_extend:SD_HSDI
> +	  (match_operand:HI_ONLY 1 "nonimmediate_operand"
> "r,m,r,m,w,w")))]
>    ""
>    "@
> -   sxtb\t%w0, %w1
> -   ldrsb\t%w0, %1"
> -  [(set_attr "type" "extend,load_4")]
> +   uxt<HI_ONLY:size>\t%<SD_HSDI:w>0, %w1
> +   ldr<HI_ONLY:sizel>\t%w0, %1
> +   fmov\t%<HI_ONLY:Vetype>0, %w1
> +   ldr\t%<HI_ONLY:Vetype>0, %1
> +   umov\t%w0, %1.<HI_ONLY:Vetype>[0]
> +   fmov\t%<HI_ONLY:Vetype>0, %<HI_ONLY:Vetype>1"
> +  [(set_attr "type" "mov_reg,load_4,f_mcr,f_loads,f_mrc,fmov")
> +   (set_attr "arch" "*,*,fp16,fp,fp,fp16")]
>  )
> 
> -(define_insn "*zero_extendqihi2_aarch64"
> -  [(set (match_operand:HI 0 "register_operand" "=r,r")
> -	(zero_extend:HI (match_operand:QI 1 "nonimmediate_operand"
> "r,m")))]
> +(define_insn "zero_extend<QI_ONLY:mode><SD_HSDI:mode>2"
> +  [(set (match_operand:SD_HSDI 0 "register_operand" "=r,r,w,r,w")
> +        (zero_extend:SD_HSDI
> +	  (match_operand:QI_ONLY 1 "nonimmediate_operand"
> "r,m,m,w,w")))]
>    ""
>    "@
> -   and\t%w0, %w1, 255
> -   ldrb\t%w0, %1"
> -  [(set_attr "type" "logic_imm,load_4")]
> +   uxt<QI_ONLY:size>\t%<SD_HSDI:w>0, %w1
> +   ldr<QI_ONLY:sizel>\t%w0, %1
> +   ldr\t%<QI_ONLY:Vetype>0, %1
> +   umov\t%w0, %1.<QI_ONLY:Vetype>[0]
> +   dup\t%<QI_ONLY:Vetype>0, %1.<QI_ONLY:Vetype>[0]"
> +  [(set_attr "type" "mov_reg,load_4,f_loads,f_mrc,fmov")
> +   (set_attr "arch" "*,*,fp,fp,fp")]
>  )
> 
>  ;; -------------------------------------------------------------------
> @@ -5029,15 +5001,15 @@ (define_insn "*and<mode>_compare0"
>    [(set_attr "type" "alus_imm")]
>  )
> 
> -(define_insn "*ands<GPI:mode>_compare0"
> +(define_insn "*ands<SD_HSDI:mode>_compare0"
>    [(set (reg:CC_NZ CC_REGNUM)
>  	(compare:CC_NZ
> -	 (zero_extend:GPI (match_operand:SHORT 1 "register_operand"
> "r"))
> +	 (zero_extend:SD_HSDI (match_operand:ALLX 1 "register_operand"
> "r"))
>  	 (const_int 0)))
> -   (set (match_operand:GPI 0 "register_operand" "=r")
> -	(zero_extend:GPI (match_dup 1)))]
> +   (set (match_operand:SD_HSDI 0 "register_operand" "=r")
> +	(zero_extend:SD_HSDI (match_dup 1)))]
>    ""
> -  "ands\\t%<GPI:w>0, %<GPI:w>1, <short_mask>"
> +  "ands\\t%<SD_HSDI:w>0, %<SD_HSDI:w>1, <ALLX:short_mask>"
>    [(set_attr "type" "alus_imm")]
>  )
> 
> diff --git a/gcc/config/aarch64/iterators.md
> b/gcc/config/aarch64/iterators.md index
> 1df09f7fe2eb35aed96113476541e0faa5393551..e904407b2169e589b7007ff966
> b2d9347a6d0fd2 100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -41,6 +41,8 @@ (define_mode_iterator SHORT [QI HI])  ;; Iterators for
> single modes, for "@" patterns.
>  (define_mode_iterator SI_ONLY [SI])
>  (define_mode_iterator DI_ONLY [DI])
> +(define_mode_iterator HI_ONLY [HI])
> +(define_mode_iterator QI_ONLY [QI])
> 
>  ;; Iterator for all integer modes (up to 64-bit)  (define_mode_iterator ALLI
> [QI HI SI DI]) @@ -1033,7 +1035,7 @@ (define_mode_attr w2 [(HF "x") (SF
> "x") (DF "w")])  ;; For width of fp registers in fcvt instruction
> (define_mode_attr fpw [(DI "s") (SI "d")])
> 
> -(define_mode_attr short_mask [(HI "65535") (QI "255")])
> +(define_mode_attr short_mask [(SI "0xffffffff") (HI "0xffff") (QI
> +"0xff")])
> 
>  ;; For constraints used in scalar immediate vector moves  (define_mode_attr
> hq [(HI "h") (QI "q")]) diff --git a/gcc/testsuite/gcc.target/aarch64/ands_3.c
> b/gcc/testsuite/gcc.target/aarch64/ands_3.c
> index
> 42cb7f0f0bc86a4aceb09851c31eb2e888d93403..421aa5cea7a51ad810cc9c5653
> a149cb21bb871c 100644
> --- a/gcc/testsuite/gcc.target/aarch64/ands_3.c
> +++ b/gcc/testsuite/gcc.target/aarch64/ands_3.c
> @@ -9,4 +9,4 @@ f9 (unsigned char x, int y)
>    return x;
>  }
> 
> -/* { dg-final { scan-assembler "ands\t(x|w)\[0-9\]+,\[ \t\]*(x|w)\[0-9\]+,\[
> \t\]*255" } } */
> +/* { dg-final { scan-assembler "ands\t(x|w)\[0-9\]+,\[
> +\t\]*(x|w)\[0-9\]+,\[ \t\]*0xff" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> b/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> index
> 8e35e0b574d49913b43c7d8d4f4ba75f127f42e9..03288976b3397cdbe0e822f94
> f2a6448d9fa9a52 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> @@ -51,7 +51,6 @@ TEST_ALL (VEC_PERM)
>  /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
>  /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 6 } } */
>  /* { dg-final { scan-assembler-not {\tldr} } } */
> -/* { dg-final { scan-assembler-times {\tstr} 2 } } */
> -/* { dg-final { scan-assembler-times {\tstr\th[0-9]+} 2 } } */
> +/* { dg-final { scan-assembler-times {\tins\tv[0-9]+\.h\[1\],
> +v[0-9]+\.h\[0\]} 1 } } */
> 
>  /* { dg-final { scan-assembler-not {\tuqdec} } } */ diff --git
> a/gcc/testsuite/gcc.target/aarch64/tst_5.c
> b/gcc/testsuite/gcc.target/aarch64/tst_5.c
> index
> 0de40a6c47a7d63c1b7a81aeba438a096c0041b8..19034cd74ed07ea4d670c25d
> 9ab3d1cff805a483 100644
> --- a/gcc/testsuite/gcc.target/aarch64/tst_5.c
> +++ b/gcc/testsuite/gcc.target/aarch64/tst_5.c
> @@ -4,7 +4,7 @@
>  int
>  f255 (int x)
>  {
> -  if (x & 255)
> +  if (x & 0xff)
>      return 1;
>    return x;
>  }
> @@ -12,10 +12,10 @@ f255 (int x)
>  int
>  f65535 (int x)
>  {
> -  if (x & 65535)
> +  if (x & 0xffff)
>      return 1;
>    return x;
>  }
> 
> -/* { dg-final { scan-assembler "tst\t(x|w)\[0-9\]+,\[ \t\]*255" } } */
> -/* { dg-final { scan-assembler "tst\t(x|w)\[0-9\]+,\[ \t\]*65535" } } */
> +/* { dg-final { scan-assembler "tst\t(x|w)\[0-9\]+,\[ \t\]*0xff" } } */
> +/* { dg-final { scan-assembler "tst\t(x|w)\[0-9\]+,\[ \t\]*0xffff" } }
> +*/
> diff --git a/gcc/testsuite/gcc.target/aarch64/tst_6.c
> b/gcc/testsuite/gcc.target/aarch64/tst_6.c
> index
> f15ec114c391fed79cc43b7740fde83fb3d4ea53..1c047cfae214b60e5bf003e678
> 1a277202fcc588 100644
> --- a/gcc/testsuite/gcc.target/aarch64/tst_6.c
> +++ b/gcc/testsuite/gcc.target/aarch64/tst_6.c
> @@ -7,4 +7,4 @@ foo (long x)
>     return ((short) x != 0) ? x : 1;
>  }
> 
> -/* { dg-final { scan-assembler "tst\t(x|w)\[0-9\]+,\[ \t\]*65535" } } */
> +/* { dg-final { scan-assembler "tst\t(x|w)\[0-9\]+,\[ \t\]*0xffff" } }
> +*/
> 
> 
> 
> 
> --

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 5/8]AArch64 aarch64: Make existing V2HF be usable.
  2022-11-11 14:39     ` Tamar Christina
  2022-11-22 16:01       ` Tamar Christina
@ 2022-12-06 10:28       ` Richard Sandiford
  2022-12-06 10:58         ` Tamar Christina
  1 sibling, 1 reply; 50+ messages in thread
From: Richard Sandiford @ 2022-12-06 10:28 UTC (permalink / raw)
  To: Tamar Christina
  Cc: gcc-patches, nd, Richard Earnshaw, Marcus Shawcroft, Kyrylo Tkachov

Tamar Christina <Tamar.Christina@arm.com> writes:
> Hi,
>
>
>> This name might cause confusion with the SVE iterators, where FULL means
>> "every bit of the register is used".  How about something like VMOVE
>> instead?
>> 
>> With this change, I guess VALL_F16 represents "The set of all modes for
>> which the vld1 intrinsics are provided" and VMOVE or whatever is "All
>> Advanced SIMD modes suitable for moving, loading, and storing".
>> That is, VMOVE extends VALL_F16 with modes that are not manifested via
>> intrinsics.
>> 
>
> Done.
>
>> Where is the 2h used, and is it valid syntax in that context?
>> 
>> Same for later instances of 2h.
>
> They are, but they weren't meant to be in this patch.  They belong in a separate FP16 series that
> I won't get to finish for GCC 13 due to not being able to finish writing all the tests.  I have moved them
> to that patch series though.
>
> While the addp patch series has been killed, this patch is still good standalone and improves codegen
> as shown in the updated testcase.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 	* config/aarch64/aarch64-simd.md (*aarch64_simd_movv2hf): New.
> 	(mov<mode>, movmisalign<mode>, aarch64_dup_lane<mode>,
> 	aarch64_store_lane0<mode>, aarch64_simd_vec_set<mode>,
> 	@aarch64_simd_vec_copy_lane<mode>, vec_set<mode>,
> 	reduc_<optab>_scal_<mode>, reduc_<fmaxmin>_scal_<mode>,
> 	aarch64_reduc_<optab>_internal<mode>, aarch64_get_lane<mode>,
> 	vec_init<mode><Vel>, vec_extract<mode><Vel>): Support V2HF.
> 	(aarch64_simd_dupv2hf): New.
> 	* config/aarch64/aarch64.cc (aarch64_classify_vector_mode):
> 	Add E_V2HFmode.
> 	* config/aarch64/iterators.md (VHSDF_P): New.
> 	(V2F, VMOVE, nunits, Vtype, Vmtype, Vetype, stype, VEL,
> 	Vel, q, vp): Add V2HF.
> 	* config/arm/types.md (neon_fp_reduc_add_h): New.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.target/aarch64/sve/slp_1.c: Update testcase.
>
> --- inline copy of patch ---
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index f4152160084d6b6f34bd69f0ba6386c1ab50f77e..487a31010245accec28e779661e6c2d578fca4b7 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -19,10 +19,10 @@
>  ;; <http://www.gnu.org/licenses/>.
>  
>  (define_expand "mov<mode>"
> -  [(set (match_operand:VALL_F16 0 "nonimmediate_operand")
> -	(match_operand:VALL_F16 1 "general_operand"))]
> +  [(set (match_operand:VMOVE 0 "nonimmediate_operand")
> +	(match_operand:VMOVE 1 "general_operand"))]
>    "TARGET_SIMD"
> -  "
> +{
>    /* Force the operand into a register if it is not an
>       immediate whose use can be replaced with xzr.
>       If the mode is 16 bytes wide, then we will be doing
> @@ -46,12 +46,11 @@ (define_expand "mov<mode>"
>        aarch64_expand_vector_init (operands[0], operands[1]);
>        DONE;
>      }
> -  "
> -)
> +})
>  
>  (define_expand "movmisalign<mode>"
> -  [(set (match_operand:VALL_F16 0 "nonimmediate_operand")
> -        (match_operand:VALL_F16 1 "general_operand"))]
> +  [(set (match_operand:VMOVE 0 "nonimmediate_operand")
> +        (match_operand:VMOVE 1 "general_operand"))]
>    "TARGET_SIMD && !STRICT_ALIGNMENT"
>  {
>    /* This pattern is not permitted to fail during expansion: if both arguments
> @@ -73,6 +72,16 @@ (define_insn "aarch64_simd_dup<mode>"
>    [(set_attr "type" "neon_dup<q>, neon_from_gp<q>")]
>  )
>  
> +(define_insn "aarch64_simd_dupv2hf"
> +  [(set (match_operand:V2HF 0 "register_operand" "=w")
> +	(vec_duplicate:V2HF
> +	  (match_operand:HF 1 "register_operand" "0")))]

Seems like this should be "w" rather than "0", since SLI is a
two-register instruction.

> +  "TARGET_SIMD"
> +  "@
> +   sli\\t%d0, %d1, 16"
> +  [(set_attr "type" "neon_shift_imm")]
> +)
> +
>  (define_insn "aarch64_simd_dup<mode>"
>    [(set (match_operand:VDQF_F16 0 "register_operand" "=w,w")
>  	(vec_duplicate:VDQF_F16
> @@ -85,10 +94,10 @@ (define_insn "aarch64_simd_dup<mode>"
>  )
>  
>  (define_insn "aarch64_dup_lane<mode>"
> -  [(set (match_operand:VALL_F16 0 "register_operand" "=w")
> -	(vec_duplicate:VALL_F16
> +  [(set (match_operand:VMOVE 0 "register_operand" "=w")
> +	(vec_duplicate:VMOVE
>  	  (vec_select:<VEL>
> -	    (match_operand:VALL_F16 1 "register_operand" "w")
> +	    (match_operand:VMOVE 1 "register_operand" "w")
>  	    (parallel [(match_operand:SI 2 "immediate_operand" "i")])
>            )))]
>    "TARGET_SIMD"
> @@ -142,6 +151,29 @@ (define_insn "*aarch64_simd_mov<VDMOV:mode>"
>  		     mov_reg, neon_move<q>")]
>  )
>  
> +(define_insn "*aarch64_simd_movv2hf"
> +  [(set (match_operand:V2HF 0 "nonimmediate_operand"
> +		"=w, m,  m,  w, ?r, ?w, ?r, w, w")
> +	(match_operand:V2HF 1 "general_operand"
> +		"m,  Dz, w,  w,  w,  r,  r, Dz, Dn"))]
> +  "TARGET_SIMD_F16INST
> +   && (register_operand (operands[0], V2HFmode)
> +       || aarch64_simd_reg_or_zero (operands[1], V2HFmode))"
> +   "@
> +    ldr\\t%s0, %1
> +    str\\twzr, %0
> +    str\\t%s1, %0
> +    mov\\t%0.2s[0], %1.2s[0]
> +    umov\\t%w0, %1.s[0]
> +    fmov\\t%s0, %1

Should be %w1 instead.

> +    mov\\t%0, %1

I guess this one works with either % (X registers) or %w.  Might still
be better to use %w anyway, so that it looks less like an oversight.

> +    movi\\t%d0, 0
> +    * return aarch64_output_simd_mov_immediate (operands[1], 32);"
> +  [(set_attr "type" "neon_load1_1reg, store_8, neon_store1_1reg,\
> +		     neon_logic, neon_to_gp, f_mcr,\
> +		     mov_reg, neon_move, neon_move")]
> +)
> +
>  (define_insn "*aarch64_simd_mov<VQMOV:mode>"
>    [(set (match_operand:VQMOV 0 "nonimmediate_operand"
>  		"=w, Umn,  m,  w, ?r, ?w, ?r, w")
> @@ -182,7 +214,7 @@ (define_insn "*aarch64_simd_mov<VQMOV:mode>"
>  
>  (define_insn "aarch64_store_lane0<mode>"
>    [(set (match_operand:<VEL> 0 "memory_operand" "=m")
> -	(vec_select:<VEL> (match_operand:VALL_F16 1 "register_operand" "w")
> +	(vec_select:<VEL> (match_operand:VMOVE 1 "register_operand" "w")
>  			(parallel [(match_operand 2 "const_int_operand" "n")])))]
>    "TARGET_SIMD
>     && ENDIAN_LANE_N (<nunits>, INTVAL (operands[2])) == 0"
> @@ -1035,11 +1067,11 @@ (define_insn "one_cmpl<mode>2"
>  )
>  
>  (define_insn "aarch64_simd_vec_set<mode>"
> -  [(set (match_operand:VALL_F16 0 "register_operand" "=w,w,w")
> -	(vec_merge:VALL_F16
> -	    (vec_duplicate:VALL_F16
> +  [(set (match_operand:VMOVE 0 "register_operand" "=w,w,w")
> +	(vec_merge:VMOVE
> +	    (vec_duplicate:VMOVE
>  		(match_operand:<VEL> 1 "aarch64_simd_nonimmediate_operand" "w,?r,Utv"))
> -	    (match_operand:VALL_F16 3 "register_operand" "0,0,0")
> +	    (match_operand:VMOVE 3 "register_operand" "0,0,0")
>  	    (match_operand:SI 2 "immediate_operand" "i,i,i")))]
>    "TARGET_SIMD"
>    {
> @@ -1061,14 +1093,14 @@ (define_insn "aarch64_simd_vec_set<mode>"
>  )
>  
>  (define_insn "@aarch64_simd_vec_copy_lane<mode>"
> -  [(set (match_operand:VALL_F16 0 "register_operand" "=w")
> -	(vec_merge:VALL_F16
> -	    (vec_duplicate:VALL_F16
> +  [(set (match_operand:VMOVE 0 "register_operand" "=w")
> +	(vec_merge:VMOVE
> +	    (vec_duplicate:VMOVE
>  	      (vec_select:<VEL>
> -		(match_operand:VALL_F16 3 "register_operand" "w")
> +		(match_operand:VMOVE 3 "register_operand" "w")
>  		(parallel
>  		  [(match_operand:SI 4 "immediate_operand" "i")])))
> -	    (match_operand:VALL_F16 1 "register_operand" "0")
> +	    (match_operand:VMOVE 1 "register_operand" "0")
>  	    (match_operand:SI 2 "immediate_operand" "i")))]
>    "TARGET_SIMD"
>    {
> @@ -1376,7 +1408,7 @@ (define_insn "vec_shr_<mode>"
>  )
>  
>  (define_expand "vec_set<mode>"
> -  [(match_operand:VALL_F16 0 "register_operand")
> +  [(match_operand:VMOVE 0 "register_operand")
>     (match_operand:<VEL> 1 "aarch64_simd_nonimmediate_operand")
>     (match_operand:SI 2 "immediate_operand")]
>    "TARGET_SIMD"
> @@ -3495,7 +3527,7 @@ (define_insn "popcount<mode>2"
>  ;; gimple_fold'd to the IFN_REDUC_(MAX|MIN) function.  (This is FP smax/smin).
>  (define_expand "reduc_<optab>_scal_<mode>"
>    [(match_operand:<VEL> 0 "register_operand")
> -   (unspec:<VEL> [(match_operand:VHSDF 1 "register_operand")]
> +   (unspec:<VEL> [(match_operand:VHSDF_P 1 "register_operand")]
>  		 FMAXMINV)]
>    "TARGET_SIMD"
>    {
> @@ -3510,7 +3542,7 @@ (define_expand "reduc_<optab>_scal_<mode>"
>  
>  (define_expand "reduc_<fmaxmin>_scal_<mode>"
>    [(match_operand:<VEL> 0 "register_operand")
> -   (unspec:<VEL> [(match_operand:VHSDF 1 "register_operand")]
> +   (unspec:<VEL> [(match_operand:VHSDF_P 1 "register_operand")]
>  		 FMAXMINNMV)]
>    "TARGET_SIMD"
>    {
> @@ -3554,8 +3586,8 @@ (define_insn "aarch64_reduc_<optab>_internalv2si"
>  )
>  
>  (define_insn "aarch64_reduc_<optab>_internal<mode>"
> - [(set (match_operand:VHSDF 0 "register_operand" "=w")
> -       (unspec:VHSDF [(match_operand:VHSDF 1 "register_operand" "w")]
> + [(set (match_operand:VHSDF_P 0 "register_operand" "=w")
> +       (unspec:VHSDF_P [(match_operand:VHSDF_P 1 "register_operand" "w")]
>  		      FMAXMINV))]
>   "TARGET_SIMD"
>   "<maxmin_uns_op><vp>\\t%<Vetype>0, %1.<Vtype>"
> @@ -4200,7 +4232,7 @@ (define_insn "*aarch64_get_lane_zero_extend<GPI:mode><VDQQH:mode>"
>  (define_insn_and_split "aarch64_get_lane<mode>"
>    [(set (match_operand:<VEL> 0 "aarch64_simd_nonimmediate_operand" "=?r, w, Utv")
>  	(vec_select:<VEL>
> -	  (match_operand:VALL_F16 1 "register_operand" "w, w, w")
> +	  (match_operand:VMOVE 1 "register_operand" "w, w, w")
>  	  (parallel [(match_operand:SI 2 "immediate_operand" "i, i, i")])))]
>    "TARGET_SIMD"
>    {
> @@ -7981,7 +8013,7 @@ (define_expand "aarch64_st1<VALL_F16:mode>"
>  ;; Standard pattern name vec_init<mode><Vel>.
>  
>  (define_expand "vec_init<mode><Vel>"
> -  [(match_operand:VALL_F16 0 "register_operand")
> +  [(match_operand:VMOVE 0 "register_operand")
>     (match_operand 1 "" "")]
>    "TARGET_SIMD"
>  {
> @@ -8060,7 +8092,7 @@ (define_insn "aarch64_urecpe<mode>"
>  
>  (define_expand "vec_extract<mode><Vel>"
>    [(match_operand:<VEL> 0 "aarch64_simd_nonimmediate_operand")
> -   (match_operand:VALL_F16 1 "register_operand")
> +   (match_operand:VMOVE 1 "register_operand")
>     (match_operand:SI 2 "immediate_operand")]
>    "TARGET_SIMD"
>  {
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 84dbe2f4ea7d03b424602ed98a34e7824217dc91..35671cb86e374f9ded21d0e4944c63bc2cbc0901 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -3566,6 +3566,7 @@ aarch64_classify_vector_mode (machine_mode mode)
>      case E_V8BFmode:
>      case E_V4SFmode:
>      case E_V2DFmode:
> +    case E_V2HFmode:
>        return TARGET_SIMD ? VEC_ADVSIMD : 0;
>  
>      default:
> diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
> index 37d8161a33b1c399d80be82afa67613a087389d4..dfcf86a440e316c2abdbcc646363d39e458d1a91 100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -160,6 +160,10 @@ (define_mode_iterator VDQF [V2SF V4SF V2DF])
>  (define_mode_iterator VHSDF [(V4HF "TARGET_SIMD_F16INST")
>  			     (V8HF "TARGET_SIMD_F16INST")
>  			     V2SF V4SF V2DF])
> +;; Advanced SIMD Float modes suitable for pairwise operations.
> +(define_mode_iterator VHSDF_P [(V4HF "TARGET_SIMD_F16INST")
> +			       (V8HF "TARGET_SIMD_F16INST")
> +			       V2SF V4SF V2DF (V2HF "TARGET_SIMD_F16INST")])

Maybe "reduction or pairwise operations"?  Otherwise it isn't obvious
why V4HF, V8HF and V4SF are included.

>  
>  ;; Advanced SIMD Float modes, and DF.
>  (define_mode_iterator VDQF_DF [V2SF V4SF V2DF DF])
> @@ -188,15 +192,23 @@ (define_mode_iterator VDQF_COND [V2SF V2SI V4SF V4SI V2DF V2DI])
>  (define_mode_iterator VALLF [V2SF V4SF V2DF SF DF])
>  
>  ;; Advanced SIMD Float modes with 2 elements.
> -(define_mode_iterator V2F [V2SF V2DF])
> +(define_mode_iterator V2F [V2SF V2DF V2HF])
>  
>  ;; All Advanced SIMD modes on which we support any arithmetic operations.
>  (define_mode_iterator VALL [V8QI V16QI V4HI V8HI V2SI V4SI V2DI V2SF V4SF V2DF])
>  
> -;; All Advanced SIMD modes suitable for moving, loading, and storing.
> +;; All Advanced SIMD modes suitable for moving, loading, and storing
> +;; except V2HF.

I'd prefer:

;; The set of all modes for which vld1 intrinsics are provided.

otherwise it isn't clear why V2HF is a special case.

>  (define_mode_iterator VALL_F16 [V8QI V16QI V4HI V8HI V2SI V4SI V2DI
>  				V4HF V8HF V4BF V8BF V2SF V4SF V2DF])
>  
> +;; All Advanced SIMD modes suitable for moving, loading, and storing
> +;; including V2HF
> +(define_mode_iterator VMOVE [V8QI V16QI V4HI V8HI V2SI V4SI V2DI
> +			     V4HF V8HF V4BF V8BF V2SF V4SF V2DF
> +			     (V2HF "TARGET_SIMD_F16INST")])
> +
> +
>  ;; The VALL_F16 modes except the 128-bit 2-element ones.
>  (define_mode_iterator VALL_F16_NO_V2Q [V8QI V16QI V4HI V8HI V2SI V4SI
>  				V4HF V8HF V2SF V4SF])
> @@ -1076,7 +1088,7 @@ (define_mode_attr nunits [(V8QI "8") (V16QI "16")
>  			  (V2SF "2") (V4SF "4")
>  			  (V1DF "1") (V2DF "2")
>  			  (DI "1") (DF "1")
> -			  (V8DI "8")])
> +			  (V8DI "8") (V2HF "2")])
>  
>  ;; Map a mode to the number of bits in it, if the size of the mode
>  ;; is constant.
> @@ -1090,6 +1102,7 @@ (define_mode_attr s [(HF "h") (SF "s") (DF "d") (SI "s") (DI "d")])
>  
>  ;; Give the length suffix letter for a sign- or zero-extension.
>  (define_mode_attr size [(QI "b") (HI "h") (SI "w")])
> +(define_mode_attr sizel [(QI "b") (HI "h") (SI "")])
>  
>  ;; Give the number of bits in the mode
>  (define_mode_attr sizen [(QI "8") (HI "16") (SI "32") (DI "64")])

Looks like this isn't used in the patch, so could be dropped.

OK with those changes, thanks.

Richard

> @@ -1193,7 +1206,7 @@ (define_mode_attr Vmntype [(V8HI ".8b") (V4SI ".4h")
>  (define_mode_attr Vetype [(V8QI "b") (V16QI "b")
>  			  (V4HI "h") (V8HI  "h")
>  			  (V2SI "s") (V4SI  "s")
> -			  (V2DI "d")
> +			  (V2DI "d") (V2HF  "h")
>  			  (V4HF "h") (V8HF  "h")
>  			  (V2SF "s") (V4SF  "s")
>  			  (V2DF "d")
> @@ -1285,7 +1298,7 @@ (define_mode_attr Vcwtype [(VNx16QI "b") (VNx8QI "h") (VNx4QI "w") (VNx2QI "d")
>  ;; more accurately.
>  (define_mode_attr stype [(V8QI "b") (V16QI "b") (V4HI "s") (V8HI "s")
>  			 (V2SI "s") (V4SI "s") (V2DI "d") (V4HF "s")
> -			 (V8HF "s") (V2SF "s") (V4SF "s") (V2DF "d")
> +			 (V8HF "s") (V2SF "s") (V4SF "s") (V2DF "d") (V2HF "s")
>  			 (HF "s") (SF "s") (DF "d") (QI "b") (HI "s")
>  			 (SI "s") (DI "d")])
>  
> @@ -1360,8 +1373,8 @@ (define_mode_attr VEL [(V8QI  "QI") (V16QI "QI")
>  		       (V4HF "HF") (V8HF  "HF")
>  		       (V2SF "SF") (V4SF  "SF")
>  		       (DF   "DF") (V2DF  "DF")
> -		       (SI   "SI") (HI    "HI")
> -		       (QI   "QI")
> +		       (SI   "SI") (V2HF  "HF")
> +		       (QI   "QI") (HI    "HI")
>  		       (V4BF "BF") (V8BF "BF")
>  		       (VNx16QI "QI") (VNx8QI "QI") (VNx4QI "QI") (VNx2QI "QI")
>  		       (VNx8HI "HI") (VNx4HI "HI") (VNx2HI "HI")
> @@ -1381,7 +1394,7 @@ (define_mode_attr Vel [(V8QI "qi") (V16QI "qi")
>  		       (V2SF "sf") (V4SF "sf")
>  		       (V2DF "df") (DF   "df")
>  		       (SI   "si") (HI   "hi")
> -		       (QI   "qi")
> +		       (QI   "qi") (V2HF "hf")
>  		       (V4BF "bf") (V8BF "bf")
>  		       (VNx16QI "qi") (VNx8QI "qi") (VNx4QI "qi") (VNx2QI "qi")
>  		       (VNx8HI "hi") (VNx4HI "hi") (VNx2HI "hi")
> @@ -1866,7 +1879,7 @@ (define_mode_attr q [(V8QI "") (V16QI "_q")
>  		     (V4HF "") (V8HF "_q")
>  		     (V4BF "") (V8BF "_q")
>  		     (V2SF "") (V4SF  "_q")
> -			       (V2DF  "_q")
> +		     (V2HF "") (V2DF  "_q")
>  		     (QI "") (HI "") (SI "") (DI "") (HF "") (SF "") (DF "")
>  		     (V2x8QI "") (V2x16QI "_q")
>  		     (V2x4HI "") (V2x8HI "_q")
> @@ -1905,6 +1918,7 @@ (define_mode_attr vp [(V8QI "v") (V16QI "v")
>  		      (V2SI "p") (V4SI  "v")
>  		      (V2DI "p") (V2DF  "p")
>  		      (V2SF "p") (V4SF  "v")
> +		      (V2HF "p")
>  		      (V4HF "v") (V8HF  "v")])
>  
>  (define_mode_attr vsi2qi [(V2SI "v8qi") (V4SI "v16qi")
> diff --git a/gcc/config/arm/types.md b/gcc/config/arm/types.md
> index 7d0504bdd944e9c0d1b545b0b66a9a1adc808714..3cfbc7a93cca1bea4925853e51d0a147c5722247 100644
> --- a/gcc/config/arm/types.md
> +++ b/gcc/config/arm/types.md
> @@ -483,6 +483,7 @@ (define_attr "autodetect_type"
>  ; neon_fp_minmax_s_q
>  ; neon_fp_minmax_d
>  ; neon_fp_minmax_d_q
> +; neon_fp_reduc_add_h
>  ; neon_fp_reduc_add_s
>  ; neon_fp_reduc_add_s_q
>  ; neon_fp_reduc_add_d
> @@ -1033,6 +1034,7 @@ (define_attr "type"
>    neon_fp_minmax_d,\
>    neon_fp_minmax_d_q,\
>  \
> +  neon_fp_reduc_add_h,\
>    neon_fp_reduc_add_s,\
>    neon_fp_reduc_add_s_q,\
>    neon_fp_reduc_add_d,\
> @@ -1257,8 +1259,8 @@ (define_attr "is_neon_type" "yes,no"
>            neon_fp_compare_d, neon_fp_compare_d_q, neon_fp_minmax_s,\
>            neon_fp_minmax_s_q, neon_fp_minmax_d, neon_fp_minmax_d_q,\
>            neon_fp_neg_s, neon_fp_neg_s_q, neon_fp_neg_d, neon_fp_neg_d_q,\
> -          neon_fp_reduc_add_s, neon_fp_reduc_add_s_q, neon_fp_reduc_add_d,\
> -          neon_fp_reduc_add_d_q, neon_fp_reduc_minmax_s,
> +          neon_fp_reduc_add_h, neon_fp_reduc_add_s, neon_fp_reduc_add_s_q,\
> +          neon_fp_reduc_add_d, neon_fp_reduc_add_d_q, neon_fp_reduc_minmax_s,\
>            neon_fp_reduc_minmax_s_q, neon_fp_reduc_minmax_d,\
>            neon_fp_reduc_minmax_d_q,\
>            neon_fp_cvt_narrow_s_q, neon_fp_cvt_narrow_d_q,\
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> index 07d71a63414b1066ea431e287286ad048515711a..e6021c5a42748701e5326a5c387a39a0bbadc9e5 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> @@ -30,11 +30,9 @@ vec_slp_##TYPE (TYPE *restrict a, TYPE b, TYPE c, int n)	\
>  TEST_ALL (VEC_PERM)
>  
>  /* We should use one DUP for each of the 8-, 16- and 32-bit types,
> -   although we currently use LD1RW for _Float16.  We should use two
> -   DUPs for each of the three 64-bit types.  */
> +   We should use two DUPs for each of the three 64-bit types.  */
>  /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.h, [hw]} 2 } } */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.s, [sw]} 2 } } */
> -/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 1 } } */
> +/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.s, [sw]} 3 } } */
>  /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, [dx]} 9 } } */
>  /* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 3 } } */
>  /* { dg-final { scan-assembler-not {\tzip2\t} } } */
> @@ -53,7 +51,7 @@ TEST_ALL (VEC_PERM)
>  /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
>  /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 6 } } */
>  /* { dg-final { scan-assembler-not {\tldr} } } */
> -/* { dg-final { scan-assembler-times {\tstr} 2 } } */
> -/* { dg-final { scan-assembler-times {\tstr\th[0-9]+} 2 } } */
> +/* { dg-final { scan-assembler-not {\tstr} } } */
> +/* { dg-final { scan-assembler-not {\tstr\th[0-9]+} } } */
>  
>  /* { dg-final { scan-assembler-not {\tuqdec} } } */

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH 5/8]AArch64 aarch64: Make existing V2HF be usable.
  2022-12-06 10:28       ` Richard Sandiford
@ 2022-12-06 10:58         ` Tamar Christina
  2022-12-06 11:05           ` Richard Sandiford
  0 siblings, 1 reply; 50+ messages in thread
From: Tamar Christina @ 2022-12-06 10:58 UTC (permalink / raw)
  To: Richard Sandiford
  Cc: gcc-patches, nd, Richard Earnshaw, Marcus Shawcroft, Kyrylo Tkachov

> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Tuesday, December 6, 2022 10:28 AM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
> <Richard.Earnshaw@arm.com>; Marcus Shawcroft
> <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
> Subject: Re: [PATCH 5/8]AArch64 aarch64: Make existing V2HF be usable.
> 
> Tamar Christina <Tamar.Christina@arm.com> writes:
> > Hi,
> >
> >
> >> This name might cause confusion with the SVE iterators, where FULL
> >> means "every bit of the register is used".  How about something like
> >> VMOVE instead?
> >>
> >> With this change, I guess VALL_F16 represents "The set of all modes
> >> for which the vld1 intrinsics are provided" and VMOVE or whatever is
> >> "All Advanced SIMD modes suitable for moving, loading, and storing".
> >> That is, VMOVE extends VALL_F16 with modes that are not manifested
> >> via intrinsics.
> >>
> >
> > Done.
> >
> >> Where is the 2h used, and is it valid syntax in that context?
> >>
> >> Same for later instances of 2h.
> >
> > They are, but they weren't meant to be in this patch.  They belong in
> > a separate FP16 series that I won't get to finish for GCC 13 due to not
> > being able to finish writing all the tests.  I have moved them to that
> > patch series though.
> >
> > While the addp patch series has been killed, this patch is still good
> > standalone and improves codegen as shown in the updated testcase.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > 	* config/aarch64/aarch64-simd.md (*aarch64_simd_movv2hf): New.
> > 	(mov<mode>, movmisalign<mode>, aarch64_dup_lane<mode>,
> > 	aarch64_store_lane0<mode>, aarch64_simd_vec_set<mode>,
> > 	@aarch64_simd_vec_copy_lane<mode>, vec_set<mode>,
> > 	reduc_<optab>_scal_<mode>, reduc_<fmaxmin>_scal_<mode>,
> > 	aarch64_reduc_<optab>_internal<mode>,
> aarch64_get_lane<mode>,
> > 	vec_init<mode><Vel>, vec_extract<mode><Vel>): Support V2HF.
> > 	(aarch64_simd_dupv2hf): New.
> > 	* config/aarch64/aarch64.cc (aarch64_classify_vector_mode):
> > 	Add E_V2HFmode.
> > 	* config/aarch64/iterators.md (VHSDF_P): New.
> > 	(V2F, VMOVE, nunits, Vtype, Vmtype, Vetype, stype, VEL,
> > 	Vel, q, vp): Add V2HF.
> > 	* config/arm/types.md (neon_fp_reduc_add_h): New.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 	* gcc.target/aarch64/sve/slp_1.c: Update testcase.
> >
> > --- inline copy of patch ---
> >
> > diff --git a/gcc/config/aarch64/aarch64-simd.md
> > b/gcc/config/aarch64/aarch64-simd.md
> > index
> >
> f4152160084d6b6f34bd69f0ba6386c1ab50f77e..487a31010245accec28e779661
> e6
> > c2d578fca4b7 100644
> > --- a/gcc/config/aarch64/aarch64-simd.md
> > +++ b/gcc/config/aarch64/aarch64-simd.md
> > @@ -19,10 +19,10 @@
> >  ;; <http://www.gnu.org/licenses/>.
> >
> >  (define_expand "mov<mode>"
> > -  [(set (match_operand:VALL_F16 0 "nonimmediate_operand")
> > -	(match_operand:VALL_F16 1 "general_operand"))]
> > +  [(set (match_operand:VMOVE 0 "nonimmediate_operand")
> > +	(match_operand:VMOVE 1 "general_operand"))]
> >    "TARGET_SIMD"
> > -  "
> > +{
> >    /* Force the operand into a register if it is not an
> >       immediate whose use can be replaced with xzr.
> >       If the mode is 16 bytes wide, then we will be doing @@ -46,12
> > +46,11 @@ (define_expand "mov<mode>"
> >        aarch64_expand_vector_init (operands[0], operands[1]);
> >        DONE;
> >      }
> > -  "
> > -)
> > +})
> >
> >  (define_expand "movmisalign<mode>"
> > -  [(set (match_operand:VALL_F16 0 "nonimmediate_operand")
> > -        (match_operand:VALL_F16 1 "general_operand"))]
> > +  [(set (match_operand:VMOVE 0 "nonimmediate_operand")
> > +        (match_operand:VMOVE 1 "general_operand"))]
> >    "TARGET_SIMD && !STRICT_ALIGNMENT"
> >  {
> >    /* This pattern is not permitted to fail during expansion: if both
> > arguments @@ -73,6 +72,16 @@ (define_insn
> "aarch64_simd_dup<mode>"
> >    [(set_attr "type" "neon_dup<q>, neon_from_gp<q>")]
> >  )
> >
> > +(define_insn "aarch64_simd_dupv2hf"
> > +  [(set (match_operand:V2HF 0 "register_operand" "=w")
> > +	(vec_duplicate:V2HF
> > +	  (match_operand:HF 1 "register_operand" "0")))]
> 
> Seems like this should be "w" rather than "0", since SLI is a two-register
> instruction.

Yes, but for a dup it's only valid when the same register is used, i.e. it has
to write into the original src register.
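
A small C model of the SLI semantics (purely illustrative, not part of the
patch) shows why the operands have to be tied:

  /* Models "sli d0, d1, #16" on the low 32 bits of a D register holding a
     V2HF: the low 16 bits of the destination are preserved and the source
     is shifted left by 16 and inserted above them.  */
  unsigned int
  sli_16 (unsigned int dst, unsigned int src)
  {
    return (dst & 0xffffu) | (src << 16);
  }

  /* sli_16 (x, x) leaves both 16-bit lanes equal to the low half of x,
     i.e. a vec_duplicate of lane 0; with a different destination register
     lane 0 would be whatever that register happened to contain, which is
     why operand 1 is tied to operand 0 with the "0" constraint.  */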

Thanks,
Tamar

> 
> > +  "TARGET_SIMD"
> > +  "@
> > +   sli\\t%d0, %d1, 16"
> > +  [(set_attr "type" "neon_shift_imm")]
> > +)
> > +
> >  (define_insn "aarch64_simd_dup<mode>"
> >    [(set (match_operand:VDQF_F16 0 "register_operand" "=w,w")
> >  	(vec_duplicate:VDQF_F16
> > @@ -85,10 +94,10 @@ (define_insn "aarch64_simd_dup<mode>"
> >  )
> >
> >  (define_insn "aarch64_dup_lane<mode>"
> > -  [(set (match_operand:VALL_F16 0 "register_operand" "=w")
> > -	(vec_duplicate:VALL_F16
> > +  [(set (match_operand:VMOVE 0 "register_operand" "=w")
> > +	(vec_duplicate:VMOVE
> >  	  (vec_select:<VEL>
> > -	    (match_operand:VALL_F16 1 "register_operand" "w")
> > +	    (match_operand:VMOVE 1 "register_operand" "w")
> >  	    (parallel [(match_operand:SI 2 "immediate_operand" "i")])
> >            )))]
> >    "TARGET_SIMD"
> > @@ -142,6 +151,29 @@ (define_insn
> "*aarch64_simd_mov<VDMOV:mode>"
> >  		     mov_reg, neon_move<q>")]
> >  )
> >
> > +(define_insn "*aarch64_simd_movv2hf"
> > +  [(set (match_operand:V2HF 0 "nonimmediate_operand"
> > +		"=w, m,  m,  w, ?r, ?w, ?r, w, w")
> > +	(match_operand:V2HF 1 "general_operand"
> > +		"m,  Dz, w,  w,  w,  r,  r, Dz, Dn"))]
> > +  "TARGET_SIMD_F16INST
> > +   && (register_operand (operands[0], V2HFmode)
> > +       || aarch64_simd_reg_or_zero (operands[1], V2HFmode))"
> > +   "@
> > +    ldr\\t%s0, %1
> > +    str\\twzr, %0
> > +    str\\t%s1, %0
> > +    mov\\t%0.2s[0], %1.2s[0]
> > +    umov\\t%w0, %1.s[0]
> > +    fmov\\t%s0, %1
> 
> Should be %w1 instead.
> 
> > +    mov\\t%0, %1
> 
> I guess this one works with either % (X registers) or %w.  Might still be better
> to use %w anyway, so that it looks less like an oversight.
> 
> > +    movi\\t%d0, 0
> > +    * return aarch64_output_simd_mov_immediate (operands[1], 32);"
> > +  [(set_attr "type" "neon_load1_1reg, store_8, neon_store1_1reg,\
> > +		     neon_logic, neon_to_gp, f_mcr,\
> > +		     mov_reg, neon_move, neon_move")]
> > +)
> > +
> >  (define_insn "*aarch64_simd_mov<VQMOV:mode>"
> >    [(set (match_operand:VQMOV 0 "nonimmediate_operand"
> >  		"=w, Umn,  m,  w, ?r, ?w, ?r, w")
> > @@ -182,7 +214,7 @@ (define_insn
> "*aarch64_simd_mov<VQMOV:mode>"
> >
> >  (define_insn "aarch64_store_lane0<mode>"
> >    [(set (match_operand:<VEL> 0 "memory_operand" "=m")
> > -	(vec_select:<VEL> (match_operand:VALL_F16 1 "register_operand"
> "w")
> > +	(vec_select:<VEL> (match_operand:VMOVE 1 "register_operand"
> "w")
> >  			(parallel [(match_operand 2 "const_int_operand"
> "n")])))]
> >    "TARGET_SIMD
> >     && ENDIAN_LANE_N (<nunits>, INTVAL (operands[2])) == 0"
> > @@ -1035,11 +1067,11 @@ (define_insn "one_cmpl<mode>2"
> >  )
> >
> >  (define_insn "aarch64_simd_vec_set<mode>"
> > -  [(set (match_operand:VALL_F16 0 "register_operand" "=w,w,w")
> > -	(vec_merge:VALL_F16
> > -	    (vec_duplicate:VALL_F16
> > +  [(set (match_operand:VMOVE 0 "register_operand" "=w,w,w")
> > +	(vec_merge:VMOVE
> > +	    (vec_duplicate:VMOVE
> >  		(match_operand:<VEL> 1
> "aarch64_simd_nonimmediate_operand" "w,?r,Utv"))
> > -	    (match_operand:VALL_F16 3 "register_operand" "0,0,0")
> > +	    (match_operand:VMOVE 3 "register_operand" "0,0,0")
> >  	    (match_operand:SI 2 "immediate_operand" "i,i,i")))]
> >    "TARGET_SIMD"
> >    {
> > @@ -1061,14 +1093,14 @@ (define_insn "aarch64_simd_vec_set<mode>"
> >  )
> >
> >  (define_insn "@aarch64_simd_vec_copy_lane<mode>"
> > -  [(set (match_operand:VALL_F16 0 "register_operand" "=w")
> > -	(vec_merge:VALL_F16
> > -	    (vec_duplicate:VALL_F16
> > +  [(set (match_operand:VMOVE 0 "register_operand" "=w")
> > +	(vec_merge:VMOVE
> > +	    (vec_duplicate:VMOVE
> >  	      (vec_select:<VEL>
> > -		(match_operand:VALL_F16 3 "register_operand" "w")
> > +		(match_operand:VMOVE 3 "register_operand" "w")
> >  		(parallel
> >  		  [(match_operand:SI 4 "immediate_operand" "i")])))
> > -	    (match_operand:VALL_F16 1 "register_operand" "0")
> > +	    (match_operand:VMOVE 1 "register_operand" "0")
> >  	    (match_operand:SI 2 "immediate_operand" "i")))]
> >    "TARGET_SIMD"
> >    {
> > @@ -1376,7 +1408,7 @@ (define_insn "vec_shr_<mode>"
> >  )
> >
> >  (define_expand "vec_set<mode>"
> > -  [(match_operand:VALL_F16 0 "register_operand")
> > +  [(match_operand:VMOVE 0 "register_operand")
> >     (match_operand:<VEL> 1 "aarch64_simd_nonimmediate_operand")
> >     (match_operand:SI 2 "immediate_operand")]
> >    "TARGET_SIMD"
> > @@ -3495,7 +3527,7 @@ (define_insn "popcount<mode>2"
> >  ;; gimple_fold'd to the IFN_REDUC_(MAX|MIN) function.  (This is FP
> smax/smin).
> >  (define_expand "reduc_<optab>_scal_<mode>"
> >    [(match_operand:<VEL> 0 "register_operand")
> > -   (unspec:<VEL> [(match_operand:VHSDF 1 "register_operand")]
> > +   (unspec:<VEL> [(match_operand:VHSDF_P 1 "register_operand")]
> >  		 FMAXMINV)]
> >    "TARGET_SIMD"
> >    {
> > @@ -3510,7 +3542,7 @@ (define_expand "reduc_<optab>_scal_<mode>"
> >
> >  (define_expand "reduc_<fmaxmin>_scal_<mode>"
> >    [(match_operand:<VEL> 0 "register_operand")
> > -   (unspec:<VEL> [(match_operand:VHSDF 1 "register_operand")]
> > +   (unspec:<VEL> [(match_operand:VHSDF_P 1 "register_operand")]
> >  		 FMAXMINNMV)]
> >    "TARGET_SIMD"
> >    {
> > @@ -3554,8 +3586,8 @@ (define_insn
> "aarch64_reduc_<optab>_internalv2si"
> >  )
> >
> >  (define_insn "aarch64_reduc_<optab>_internal<mode>"
> > - [(set (match_operand:VHSDF 0 "register_operand" "=w")
> > -       (unspec:VHSDF [(match_operand:VHSDF 1 "register_operand" "w")]
> > + [(set (match_operand:VHSDF_P 0 "register_operand" "=w")
> > +       (unspec:VHSDF_P [(match_operand:VHSDF_P 1 "register_operand"
> > + "w")]
> >  		      FMAXMINV))]
> >   "TARGET_SIMD"
> >   "<maxmin_uns_op><vp>\\t%<Vetype>0, %1.<Vtype>"
> > @@ -4200,7 +4232,7 @@ (define_insn
> "*aarch64_get_lane_zero_extend<GPI:mode><VDQQH:mode>"
> >  (define_insn_and_split "aarch64_get_lane<mode>"
> >    [(set (match_operand:<VEL> 0 "aarch64_simd_nonimmediate_operand"
> "=?r, w, Utv")
> >  	(vec_select:<VEL>
> > -	  (match_operand:VALL_F16 1 "register_operand" "w, w, w")
> > +	  (match_operand:VMOVE 1 "register_operand" "w, w, w")
> >  	  (parallel [(match_operand:SI 2 "immediate_operand" "i, i, i")])))]
> >    "TARGET_SIMD"
> >    {
> > @@ -7981,7 +8013,7 @@ (define_expand "aarch64_st1<VALL_F16:mode>"
> >  ;; Standard pattern name vec_init<mode><Vel>.
> >
> >  (define_expand "vec_init<mode><Vel>"
> > -  [(match_operand:VALL_F16 0 "register_operand")
> > +  [(match_operand:VMOVE 0 "register_operand")
> >     (match_operand 1 "" "")]
> >    "TARGET_SIMD"
> >  {
> > @@ -8060,7 +8092,7 @@ (define_insn "aarch64_urecpe<mode>"
> >
> >  (define_expand "vec_extract<mode><Vel>"
> >    [(match_operand:<VEL> 0 "aarch64_simd_nonimmediate_operand")
> > -   (match_operand:VALL_F16 1 "register_operand")
> > +   (match_operand:VMOVE 1 "register_operand")
> >     (match_operand:SI 2 "immediate_operand")]
> >    "TARGET_SIMD"
> >  {
> > diff --git a/gcc/config/aarch64/aarch64.cc
> > b/gcc/config/aarch64/aarch64.cc index
> >
> 84dbe2f4ea7d03b424602ed98a34e7824217dc91..35671cb86e374f9ded21d0e4
> 944c
> > 63bc2cbc0901 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -3566,6 +3566,7 @@ aarch64_classify_vector_mode (machine_mode
> mode)
> >      case E_V8BFmode:
> >      case E_V4SFmode:
> >      case E_V2DFmode:
> > +    case E_V2HFmode:
> >        return TARGET_SIMD ? VEC_ADVSIMD : 0;
> >
> >      default:
> > diff --git a/gcc/config/aarch64/iterators.md
> > b/gcc/config/aarch64/iterators.md index
> >
> 37d8161a33b1c399d80be82afa67613a087389d4..dfcf86a440e316c2abdbcc6463
> 63
> > d39e458d1a91 100644
> > --- a/gcc/config/aarch64/iterators.md
> > +++ b/gcc/config/aarch64/iterators.md
> > @@ -160,6 +160,10 @@ (define_mode_iterator VDQF [V2SF V4SF V2DF])
> > (define_mode_iterator VHSDF [(V4HF "TARGET_SIMD_F16INST")
> >  			     (V8HF "TARGET_SIMD_F16INST")
> >  			     V2SF V4SF V2DF])
> > +;; Advanced SIMD Float modes suitable for pairwise operations.
> > +(define_mode_iterator VHSDF_P [(V4HF "TARGET_SIMD_F16INST")
> > +			       (V8HF "TARGET_SIMD_F16INST")
> > +			       V2SF V4SF V2DF (V2HF
> "TARGET_SIMD_F16INST")])
> 
> Maybe "reduction or pairwise operations"?  Otherwise it isn't obvious why
> V4HF, V8HF and V4SF are included.
> 
> >
> >  ;; Advanced SIMD Float modes, and DF.
> >  (define_mode_iterator VDQF_DF [V2SF V4SF V2DF DF]) @@ -188,15
> +192,23
> > @@ (define_mode_iterator VDQF_COND [V2SF V2SI V4SF V4SI V2DF
> V2DI])
> > (define_mode_iterator VALLF [V2SF V4SF V2DF SF DF])
> >
> >  ;; Advanced SIMD Float modes with 2 elements.
> > -(define_mode_iterator V2F [V2SF V2DF])
> > +(define_mode_iterator V2F [V2SF V2DF V2HF])
> >
> >  ;; All Advanced SIMD modes on which we support any arithmetic
> operations.
> >  (define_mode_iterator VALL [V8QI V16QI V4HI V8HI V2SI V4SI V2DI V2SF
> > V4SF V2DF])
> >
> > -;; All Advanced SIMD modes suitable for moving, loading, and storing.
> > +;; All Advanced SIMD modes suitable for moving, loading, and storing
> > +;; except V2HF.
> 
> I'd prefer:
> 
> ;; The set of all modes for which vld1 intrinsics are provided.
> 
> otherwise it isn't clear why V2HF is a special case.
> 
> >  (define_mode_iterator VALL_F16 [V8QI V16QI V4HI V8HI V2SI V4SI V2DI
> >  				V4HF V8HF V4BF V8BF V2SF V4SF V2DF])
> >
> > +;; All Advanced SIMD modes suitable for moving, loading, and storing
> > +;; including V2HF (define_mode_iterator VMOVE [V8QI V16QI V4HI V8HI
> > +V2SI V4SI V2DI
> > +			     V4HF V8HF V4BF V8BF V2SF V4SF V2DF
> > +			     (V2HF "TARGET_SIMD_F16INST")])
> > +
> > +
> >  ;; The VALL_F16 modes except the 128-bit 2-element ones.
> >  (define_mode_iterator VALL_F16_NO_V2Q [V8QI V16QI V4HI V8HI V2SI
> V4SI
> >  				V4HF V8HF V2SF V4SF])
> > @@ -1076,7 +1088,7 @@ (define_mode_attr nunits [(V8QI "8") (V16QI
> "16")
> >  			  (V2SF "2") (V4SF "4")
> >  			  (V1DF "1") (V2DF "2")
> >  			  (DI "1") (DF "1")
> > -			  (V8DI "8")])
> > +			  (V8DI "8") (V2HF "2")])
> >
> >  ;; Map a mode to the number of bits in it, if the size of the mode
> > ;; is constant.
> > @@ -1090,6 +1102,7 @@ (define_mode_attr s [(HF "h") (SF "s") (DF "d")
> > (SI "s") (DI "d")])
> >
> >  ;; Give the length suffix letter for a sign- or zero-extension.
> >  (define_mode_attr size [(QI "b") (HI "h") (SI "w")])
> > +(define_mode_attr sizel [(QI "b") (HI "h") (SI "")])
> >
> >  ;; Give the number of bits in the mode  (define_mode_attr sizen [(QI
> > "8") (HI "16") (SI "32") (DI "64")])
> 
> Looks like this isn't used in the patch, so could be dropped.
> 
> OK with those changes, thanks.
> 
> Richard
> 
> > @@ -1193,7 +1206,7 @@ (define_mode_attr Vmntype [(V8HI ".8b") (V4SI
> > ".4h")  (define_mode_attr Vetype [(V8QI "b") (V16QI "b")
> >  			  (V4HI "h") (V8HI  "h")
> >  			  (V2SI "s") (V4SI  "s")
> > -			  (V2DI "d")
> > +			  (V2DI "d") (V2HF  "h")
> >  			  (V4HF "h") (V8HF  "h")
> >  			  (V2SF "s") (V4SF  "s")
> >  			  (V2DF "d")
> > @@ -1285,7 +1298,7 @@ (define_mode_attr Vcwtype [(VNx16QI "b")
> (VNx8QI
> > "h") (VNx4QI "w") (VNx2QI "d")  ;; more accurately.
> >  (define_mode_attr stype [(V8QI "b") (V16QI "b") (V4HI "s") (V8HI "s")
> >  			 (V2SI "s") (V4SI "s") (V2DI "d") (V4HF "s")
> > -			 (V8HF "s") (V2SF "s") (V4SF "s") (V2DF "d")
> > +			 (V8HF "s") (V2SF "s") (V4SF "s") (V2DF "d") (V2HF
> "s")
> >  			 (HF "s") (SF "s") (DF "d") (QI "b") (HI "s")
> >  			 (SI "s") (DI "d")])
> >
> > @@ -1360,8 +1373,8 @@ (define_mode_attr VEL [(V8QI  "QI") (V16QI "QI")
> >  		       (V4HF "HF") (V8HF  "HF")
> >  		       (V2SF "SF") (V4SF  "SF")
> >  		       (DF   "DF") (V2DF  "DF")
> > -		       (SI   "SI") (HI    "HI")
> > -		       (QI   "QI")
> > +		       (SI   "SI") (V2HF  "HF")
> > +		       (QI   "QI") (HI    "HI")
> >  		       (V4BF "BF") (V8BF "BF")
> >  		       (VNx16QI "QI") (VNx8QI "QI") (VNx4QI "QI") (VNx2QI
> "QI")
> >  		       (VNx8HI "HI") (VNx4HI "HI") (VNx2HI "HI") @@ -1381,7
> +1394,7
> > @@ (define_mode_attr Vel [(V8QI "qi") (V16QI "qi")
> >  		       (V2SF "sf") (V4SF "sf")
> >  		       (V2DF "df") (DF   "df")
> >  		       (SI   "si") (HI   "hi")
> > -		       (QI   "qi")
> > +		       (QI   "qi") (V2HF "hf")
> >  		       (V4BF "bf") (V8BF "bf")
> >  		       (VNx16QI "qi") (VNx8QI "qi") (VNx4QI "qi") (VNx2QI "qi")
> >  		       (VNx8HI "hi") (VNx4HI "hi") (VNx2HI "hi") @@ -1866,7
> +1879,7
> > @@ (define_mode_attr q [(V8QI "") (V16QI "_q")
> >  		     (V4HF "") (V8HF "_q")
> >  		     (V4BF "") (V8BF "_q")
> >  		     (V2SF "") (V4SF  "_q")
> > -			       (V2DF  "_q")
> > +		     (V2HF "") (V2DF  "_q")
> >  		     (QI "") (HI "") (SI "") (DI "") (HF "") (SF "") (DF "")
> >  		     (V2x8QI "") (V2x16QI "_q")
> >  		     (V2x4HI "") (V2x8HI "_q")
> > @@ -1905,6 +1918,7 @@ (define_mode_attr vp [(V8QI "v") (V16QI "v")
> >  		      (V2SI "p") (V4SI  "v")
> >  		      (V2DI "p") (V2DF  "p")
> >  		      (V2SF "p") (V4SF  "v")
> > +		      (V2HF "p")
> >  		      (V4HF "v") (V8HF  "v")])
> >
> >  (define_mode_attr vsi2qi [(V2SI "v8qi") (V4SI "v16qi") diff --git
> > a/gcc/config/arm/types.md b/gcc/config/arm/types.md index
> >
> 7d0504bdd944e9c0d1b545b0b66a9a1adc808714..3cfbc7a93cca1bea4925853e5
> 1d0
> > a147c5722247 100644
> > --- a/gcc/config/arm/types.md
> > +++ b/gcc/config/arm/types.md
> > @@ -483,6 +483,7 @@ (define_attr "autodetect_type"
> >  ; neon_fp_minmax_s_q
> >  ; neon_fp_minmax_d
> >  ; neon_fp_minmax_d_q
> > +; neon_fp_reduc_add_h
> >  ; neon_fp_reduc_add_s
> >  ; neon_fp_reduc_add_s_q
> >  ; neon_fp_reduc_add_d
> > @@ -1033,6 +1034,7 @@ (define_attr "type"
> >    neon_fp_minmax_d,\
> >    neon_fp_minmax_d_q,\
> >  \
> > +  neon_fp_reduc_add_h,\
> >    neon_fp_reduc_add_s,\
> >    neon_fp_reduc_add_s_q,\
> >    neon_fp_reduc_add_d,\
> > @@ -1257,8 +1259,8 @@ (define_attr "is_neon_type" "yes,no"
> >            neon_fp_compare_d, neon_fp_compare_d_q, neon_fp_minmax_s,\
> >            neon_fp_minmax_s_q, neon_fp_minmax_d,
> neon_fp_minmax_d_q,\
> >            neon_fp_neg_s, neon_fp_neg_s_q, neon_fp_neg_d,
> neon_fp_neg_d_q,\
> > -          neon_fp_reduc_add_s, neon_fp_reduc_add_s_q,
> neon_fp_reduc_add_d,\
> > -          neon_fp_reduc_add_d_q, neon_fp_reduc_minmax_s,
> > +          neon_fp_reduc_add_h, neon_fp_reduc_add_s,
> neon_fp_reduc_add_s_q,\
> > +          neon_fp_reduc_add_d, neon_fp_reduc_add_d_q,
> > + neon_fp_reduc_minmax_s,\
> >            neon_fp_reduc_minmax_s_q, neon_fp_reduc_minmax_d,\
> >            neon_fp_reduc_minmax_d_q,\
> >            neon_fp_cvt_narrow_s_q, neon_fp_cvt_narrow_d_q,\ diff --git
> > a/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> > b/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> > index
> >
> 07d71a63414b1066ea431e287286ad048515711a..e6021c5a42748701e5326a5c3
> 87a
> > 39a0bbadc9e5 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> > @@ -30,11 +30,9 @@ vec_slp_##TYPE (TYPE *restrict a, TYPE b, TYPE c, int
> n)	\
> >  TEST_ALL (VEC_PERM)
> >
> >  /* We should use one DUP for each of the 8-, 16- and 32-bit types,
> > -   although we currently use LD1RW for _Float16.  We should use two
> > -   DUPs for each of the three 64-bit types.  */
> > +   We should use two DUPs for each of the three 64-bit types.  */
> >  /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.h, [hw]} 2 } }
> > */
> > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.s, [sw]} 2 } }
> > */
> > -/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 1 } } */
> > +/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.s, [sw]} 3 } }
> > +*/
> >  /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, [dx]} 9 } }
> > */
> >  /* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d,
> > z[0-9]+\.d\n} 3 } } */
> >  /* { dg-final { scan-assembler-not {\tzip2\t} } } */ @@ -53,7 +51,7
> > @@ TEST_ALL (VEC_PERM)
> >  /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
> >  /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 6 } } */
> >  /* { dg-final { scan-assembler-not {\tldr} } } */
> > -/* { dg-final { scan-assembler-times {\tstr} 2 } } */
> > -/* { dg-final { scan-assembler-times {\tstr\th[0-9]+} 2 } } */
> > +/* { dg-final { scan-assembler-not {\tstr} } } */
> > +/* { dg-final { scan-assembler-not {\tstr\th[0-9]+} } } */
> >
> >  /* { dg-final { scan-assembler-not {\tuqdec} } } */

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 5/8]AArch64 aarch64: Make existing V2HF be usable.
  2022-12-06 10:58         ` Tamar Christina
@ 2022-12-06 11:05           ` Richard Sandiford
  0 siblings, 0 replies; 50+ messages in thread
From: Richard Sandiford @ 2022-12-06 11:05 UTC (permalink / raw)
  To: Tamar Christina
  Cc: gcc-patches, nd, Richard Earnshaw, Marcus Shawcroft, Kyrylo Tkachov

Tamar Christina <Tamar.Christina@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandiford@arm.com>
>> Sent: Tuesday, December 6, 2022 10:28 AM
>> To: Tamar Christina <Tamar.Christina@arm.com>
>> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
>> <Richard.Earnshaw@arm.com>; Marcus Shawcroft
>> <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
>> Subject: Re: [PATCH 5/8]AArch64 aarch64: Make existing V2HF be usable.
>> 
>> Tamar Christina <Tamar.Christina@arm.com> writes:
>> > Hi,
>> >
>> >
>> >> This name might cause confusion with the SVE iterators, where FULL
>> >> means "every bit of the register is used".  How about something like
>> >> VMOVE instead?
>> >>
>> >> With this change, I guess VALL_F16 represents "The set of all modes
>> >> for which the vld1 intrinsics are provided" and VMOVE or whatever is
>> >> "All Advanced SIMD modes suitable for moving, loading, and storing".
>> >> That is, VMOVE extends VALL_F16 with modes that are not manifested
>> >> via intrinsics.
>> >>
>> >
>> > Done.
>> >
>> >> Where is the 2h used, and is it valid syntax in that context?
>> >>
>> >> Same for later instances of 2h.
>> >
>> > They are, but they weren't meant to be in this patch.  They belong in
>> > a separate FP16 series that I won't get to finish for GCC 13 due to not
>> > being able to finish writing all the tests.  I have moved them to that
>> > patch series though.
>> >
>> > While the addp patch series has been killed, this patch is still good
>> > standalone and improves codegen as shown in the updated testcase.
>> >
>> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>> >
>> > Ok for master?
>> >
>> > Thanks,
>> > Tamar
>> >
>> > gcc/ChangeLog:
>> >
>> > 	* config/aarch64/aarch64-simd.md (*aarch64_simd_movv2hf): New.
>> > 	(mov<mode>, movmisalign<mode>, aarch64_dup_lane<mode>,
>> > 	aarch64_store_lane0<mode>, aarch64_simd_vec_set<mode>,
>> > 	@aarch64_simd_vec_copy_lane<mode>, vec_set<mode>,
>> > 	reduc_<optab>_scal_<mode>, reduc_<fmaxmin>_scal_<mode>,
>> > 	aarch64_reduc_<optab>_internal<mode>,
>> aarch64_get_lane<mode>,
>> > 	vec_init<mode><Vel>, vec_extract<mode><Vel>): Support V2HF.
>> > 	(aarch64_simd_dupv2hf): New.
>> > 	* config/aarch64/aarch64.cc (aarch64_classify_vector_mode):
>> > 	Add E_V2HFmode.
>> > 	* config/aarch64/iterators.md (VHSDF_P): New.
>> > 	(V2F, VMOVE, nunits, Vtype, Vmtype, Vetype, stype, VEL,
>> > 	Vel, q, vp): Add V2HF.
>> > 	* config/arm/types.md (neon_fp_reduc_add_h): New.
>> >
>> > gcc/testsuite/ChangeLog:
>> >
>> > 	* gcc.target/aarch64/sve/slp_1.c: Update testcase.
>> >
>> > --- inline copy of patch ---
>> >
>> > diff --git a/gcc/config/aarch64/aarch64-simd.md
>> > b/gcc/config/aarch64/aarch64-simd.md
>> > index
>> >
>> f4152160084d6b6f34bd69f0ba6386c1ab50f77e..487a31010245accec28e779661
>> e6
>> > c2d578fca4b7 100644
>> > --- a/gcc/config/aarch64/aarch64-simd.md
>> > +++ b/gcc/config/aarch64/aarch64-simd.md
>> > @@ -19,10 +19,10 @@
>> >  ;; <http://www.gnu.org/licenses/>.
>> >
>> >  (define_expand "mov<mode>"
>> > -  [(set (match_operand:VALL_F16 0 "nonimmediate_operand")
>> > -	(match_operand:VALL_F16 1 "general_operand"))]
>> > +  [(set (match_operand:VMOVE 0 "nonimmediate_operand")
>> > +	(match_operand:VMOVE 1 "general_operand"))]
>> >    "TARGET_SIMD"
>> > -  "
>> > +{
>> >    /* Force the operand into a register if it is not an
>> >       immediate whose use can be replaced with xzr.
>> >       If the mode is 16 bytes wide, then we will be doing @@ -46,12
>> > +46,11 @@ (define_expand "mov<mode>"
>> >        aarch64_expand_vector_init (operands[0], operands[1]);
>> >        DONE;
>> >      }
>> > -  "
>> > -)
>> > +})
>> >
>> >  (define_expand "movmisalign<mode>"
>> > -  [(set (match_operand:VALL_F16 0 "nonimmediate_operand")
>> > -        (match_operand:VALL_F16 1 "general_operand"))]
>> > +  [(set (match_operand:VMOVE 0 "nonimmediate_operand")
>> > +        (match_operand:VMOVE 1 "general_operand"))]
>> >    "TARGET_SIMD && !STRICT_ALIGNMENT"
>> >  {
>> >    /* This pattern is not permitted to fail during expansion: if both
>> > arguments @@ -73,6 +72,16 @@ (define_insn
>> "aarch64_simd_dup<mode>"
>> >    [(set_attr "type" "neon_dup<q>, neon_from_gp<q>")]
>> >  )
>> >
>> > +(define_insn "aarch64_simd_dupv2hf"
>> > +  [(set (match_operand:V2HF 0 "register_operand" "=w")
>> > +	(vec_duplicate:V2HF
>> > +	  (match_operand:HF 1 "register_operand" "0")))]
>> 
>> Seems like this should be "w" rather than "0", since SLI is a two-register
>> instruction.
>
> Yes, but for a dup it's only valid when the same register is used, i.e. it has
> to write into the original src register.

Ah, right.  In that case it might be better to use %d0 for the source
operand:

  For operands to match in a particular case usually means that they
  are identical-looking RTL expressions.  But in a few special cases
  specific kinds of dissimilarity are allowed.  For example, @code{*x}
  as an input operand will match @code{*x++} as an output operand.
  For proper results in such cases, the output template should always
  use the output-operand's number when printing the operand.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/8]AArch64: Consolidate zero and sign extension patterns and add missing ones.
  2022-10-31 11:59 ` [PATCH 7/8]AArch64: Consolidate zero and sign extension patterns and add missing ones Tamar Christina
  2022-11-30  4:28   ` Tamar Christina
@ 2022-12-06 15:59   ` Richard Sandiford
  1 sibling, 0 replies; 50+ messages in thread
From: Richard Sandiford @ 2022-12-06 15:59 UTC (permalink / raw)
  To: Tamar Christina via Gcc-patches
  Cc: Tamar Christina, Richard.Earnshaw, nd, Marcus.Shawcroft

Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> Hi All,
>
> The target has various zero and sign extension patterns.  These however live in
> various locations around the MD file and almost all of them are split
> differently.  Due to the various patterns we also ended up missing valid
> extensions.  For instance smov is almost never generated.
>
> This change tries to make this more manageable by consolidating the patterns as
> much as possible and, in doing so, fixes the missing alternatives.
>
> There were also some duplicate patterns.  Note that the
> zero_extend<*_ONLY:mode><SD_HSDI:mode>2 patterns are nearly identical; however,
> QImode lacks an alternative that the others have, so I have left them as
> 3 different patterns next to each other.
>
> In a lot of cases the wrong iterator was used, leaving out cases that should
> exist.
>
> I've also changed the masks used for zero extensions to hex instead of decimal,
> as it's clearer what they do that way and it aligns better with the output of
> other compilers.
>
> This leaves the bulk of the extensions in just 3 patterns.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 	* config/aarch64/aarch64-simd.md
> 	(*aarch64_get_lane_zero_extend<GPI:mode><VDQQH:mode>): Changed to ...
> 	(*aarch64_get_lane_zero_extend<GPI:mode><VDQV_L:mode>): ... This.
> 	(*aarch64_get_lane_extenddi<VS:mode>): New.
> 	* config/aarch64/aarch64.md (<optab>sidi2, *extendsidi2_aarch64,
> 	<optab>qihi2, *extendqihi2_aarch64, *zero_extendsidi2_aarch64): Remove
> 	duplicate patterns.
> 	(<ANY_EXTEND:optab><SHORT:mode><GPI:mode>2,
> 	*extend<SHORT:mode><GPI:mode>2_aarch64): Remove, consolidate
> 	into ...
> 	(extend<ALLX:mode><SD_HSDI:mode>2): ... This.
> 	(*zero_extendqihi2_aarch64,
> 	*zero_extend<SHORT:mode><GPI:mode>2_aarch64): Remove, consolidate into
> 	...
> 	(zero_extend<SI_ONLY:mode><SD_HSDI:mode>2,
> 	zero_extend<HI_ONLY:mode><SD_HSDI:mode>2,
> 	zero_extend<QI_ONLY:mode><SD_HSDI:mode>2):
> 	(*ands<GPI:mode>_compare0): Renamed to ...
> 	(*ands<SD_HSDI:mode>_compare0): ... This.
> 	* config/aarch64/iterators.md (HI_ONLY, QI_ONLY): New.
> 	(short_mask): Use hex rather than dec and add SI.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.target/aarch64/ands_3.c: Update codegen.
> 	* gcc.target/aarch64/sve/slp_1.c: Likewise.
> 	* gcc.target/aarch64/tst_5.c: Likewise.
> 	* gcc.target/aarch64/tst_6.c: Likewise.
>
> --- inline copy of patch -- 
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index 8a84a8560e982b8155b18541f5504801b3330124..d0b37c4dd48aeafd3d87c90dc3270e71af5a72b9 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -4237,19 +4237,34 @@ (define_insn "*aarch64_get_lane_extend<GPI:mode><VDQQH:mode>"
>    [(set_attr "type" "neon_to_gp<VDQQH:q>")]
>  )
>  
> -(define_insn "*aarch64_get_lane_zero_extend<GPI:mode><VDQQH:mode>"
> +(define_insn "*aarch64_get_lane_extenddi<VS:mode>"
> +  [(set (match_operand:DI 0 "register_operand" "=r")
> +	(sign_extend:DI
> +	  (vec_select:<VS:VEL>
> +	    (match_operand:VS 1 "register_operand" "w")
> +	    (parallel [(match_operand:SI 2 "immediate_operand" "i")]))))]
> +  "TARGET_SIMD"
> +  {
> +    operands[2] = aarch64_endian_lane_rtx (<VS:MODE>mode,
> +					   INTVAL (operands[2]));
> +    return "smov\\t%x0, %1.<VS:Vetype>[%2]";
> +  }
> +  [(set_attr "type" "neon_to_gp<VS:q>")]
> +)
> +
> +(define_insn "*aarch64_get_lane_zero_extend<GPI:mode><VDQV_L:mode>"
>    [(set (match_operand:GPI 0 "register_operand" "=r")
>  	(zero_extend:GPI
> -	  (vec_select:<VDQQH:VEL>
> -	    (match_operand:VDQQH 1 "register_operand" "w")
> +	  (vec_select:<VDQV_L:VEL>
> +	    (match_operand:VDQV_L 1 "register_operand" "w")
>  	    (parallel [(match_operand:SI 2 "immediate_operand" "i")]))))]
>    "TARGET_SIMD"
>    {
> -    operands[2] = aarch64_endian_lane_rtx (<VDQQH:MODE>mode,
> +    operands[2] = aarch64_endian_lane_rtx (<VDQV_L:MODE>mode,
>  					   INTVAL (operands[2]));
> -    return "umov\\t%w0, %1.<VDQQH:Vetype>[%2]";
> +    return "umov\\t%w0, %1.<VDQV_L:Vetype>[%2]";
>    }
> -  [(set_attr "type" "neon_to_gp<VDQQH:q>")]
> +  [(set_attr "type" "neon_to_gp<VDQV_L:q>")]
>  )

Do you have any tests for the extra SI sign-extends?
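
(For reference, a test along these lines -- hypothetical, not taken from the
series -- is the sort of thing that would exercise the SI-to-DI lane
sign-extend, which ought to be able to use a single smov:)

  typedef int v2si __attribute__ ((vector_size (8)));

  long long
  sext_lane1 (v2si x)
  {
    return x[1];  /* Sign-extend lane 1 of a 2x32-bit vector to 64 bits.  */
  }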

I think it'd be better to use a consistent style here: either have a single
pattern for all source modes (like you do with the zero_extends) or have
a separate extend-to-DI-only pattern for SI inputs (like you do with the
sign_extends).

If we go with the single-pattern approach, then as per the reviews
in other patches that came after this patch was posted, it'd be good
to compile out the invalid extend-SI-to-SI cases, e.g. using a condition
based on <elem_bits> or whatever (extended to Advanced SIMD modes).

Same comments for the other patterns: it would be good to compile out
invalid cases.  E.g. in particular:

> -(define_insn "*zero_extend<SHORT:mode><GPI:mode>2_aarch64"
> -  [(set (match_operand:GPI 0 "register_operand" "=r,r,w,r")
> -        (zero_extend:GPI (match_operand:SHORT 1 "nonimmediate_operand" "r,m,m,w")))]
> +(define_insn "zero_extend<SI_ONLY:mode><SD_HSDI:mode>2"
> +  [(set (match_operand:SD_HSDI 0 "register_operand" "=r,r,w,w,r,w")
> +        (zero_extend:SD_HSDI
> +	  (match_operand:SI_ONLY 1 "nonimmediate_operand" "r,m,r,m,w,w")))]

It doesn't really make conceptual sense to define SI extensions to HI or SI.
This can just be a single pattern, with no iterators.  It might be easier
to write the HI and QI iterators in the same style.
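
(Illustrative only, not from the patch: a C-level summary of the widening
cases the consolidated patterns need to cover -- SI only ever widens to DI,
so an SI-to-SI "extension" is just a copy.)

  /* Zero extensions: and/uxtb, and/uxth, uxtw (or free via a w-reg write).  */
  unsigned long long zext_qi (unsigned char x)  { return x; }
  unsigned long long zext_hi (unsigned short x) { return x; }
  unsigned long long zext_si (unsigned int x)   { return x; }

  /* Sign extensions: sxtb, sxth, sxtw.  */
  long long sext_qi (signed char x) { return x; }
  long long sext_hi (short x)       { return x; }
  long long sext_si (int x)         { return x; }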

I guess one reason you might have done this is because a later patch
added "@" to the names, but it looked like that use could be handled with
paradoxical subregs instead.  Even if we do want to generate extensions
directly in future, it's probably better to use the optabs interface,
since it coerces the operands to the predicates.

Thanks,
Richard

>    ""
>    "@
> -   and\t%<GPI:w>0, %<GPI:w>1, <SHORT:short_mask>
> -   ldr<SHORT:size>\t%w0, %1
> -   ldr\t%<SHORT:size>0, %1
> -   umov\t%w0, %1.<SHORT:size>[0]"
> -  [(set_attr "type" "logic_imm,load_4,f_loads,neon_to_gp")
> -   (set_attr "arch" "*,*,fp,fp")]
> -)
> -
> -(define_expand "<optab>qihi2"
> -  [(set (match_operand:HI 0 "register_operand")
> -        (ANY_EXTEND:HI (match_operand:QI 1 "nonimmediate_operand")))]
> -  ""
> +   uxt<SI_ONLY:size>\t%<SD_HSDI:w>0, %w1
> +   ldr<SI_ONLY:sizel>\t%w0, %1
> +   fmov\t%<SI_ONLY:Vetype>0, %w1
> +   ldr\t%<SI_ONLY:Vetype>0, %1
> +   fmov\t%w0, %<SI_ONLY:Vetype>1
> +   fmov\t%<SI_ONLY:Vetype>0, %<SI_ONLY:Vetype>1"
> +  [(set_attr "type" "mov_reg,load_4,f_mcr,f_loads,f_mrc,fmov")
> +   (set_attr "arch" "*,*,fp,fp,fp,fp")]
>  )
>  
> -(define_insn "*extendqihi2_aarch64"
> -  [(set (match_operand:HI 0 "register_operand" "=r,r")
> -	(sign_extend:HI (match_operand:QI 1 "nonimmediate_operand" "r,m")))]
> +(define_insn "zero_extend<HI_ONLY:mode><SD_HSDI:mode>2"
> +  [(set (match_operand:SD_HSDI 0 "register_operand" "=r,r,w,w,r,w")
> +        (zero_extend:SD_HSDI
> +	  (match_operand:HI_ONLY 1 "nonimmediate_operand" "r,m,r,m,w,w")))]
>    ""
>    "@
> -   sxtb\t%w0, %w1
> -   ldrsb\t%w0, %1"
> -  [(set_attr "type" "extend,load_4")]
> +   uxt<HI_ONLY:size>\t%<SD_HSDI:w>0, %w1
> +   ldr<HI_ONLY:sizel>\t%w0, %1
> +   fmov\t%<HI_ONLY:Vetype>0, %w1
> +   ldr\t%<HI_ONLY:Vetype>0, %1
> +   umov\t%w0, %1.<HI_ONLY:Vetype>[0]
> +   fmov\t%<HI_ONLY:Vetype>0, %<HI_ONLY:Vetype>1"
> +  [(set_attr "type" "mov_reg,load_4,f_mcr,f_loads,f_mrc,fmov")
> +   (set_attr "arch" "*,*,fp16,fp,fp,fp16")]
>  )
>  
> -(define_insn "*zero_extendqihi2_aarch64"
> -  [(set (match_operand:HI 0 "register_operand" "=r,r")
> -	(zero_extend:HI (match_operand:QI 1 "nonimmediate_operand" "r,m")))]
> +(define_insn "zero_extend<QI_ONLY:mode><SD_HSDI:mode>2"
> +  [(set (match_operand:SD_HSDI 0 "register_operand" "=r,r,w,r,w")
> +        (zero_extend:SD_HSDI
> +	  (match_operand:QI_ONLY 1 "nonimmediate_operand" "r,m,m,w,w")))]
>    ""
>    "@
> -   and\t%w0, %w1, 255
> -   ldrb\t%w0, %1"
> -  [(set_attr "type" "logic_imm,load_4")]
> +   uxt<QI_ONLY:size>\t%<SD_HSDI:w>0, %w1
> +   ldr<QI_ONLY:sizel>\t%w0, %1
> +   ldr\t%<QI_ONLY:Vetype>0, %1
> +   umov\t%w0, %1.<QI_ONLY:Vetype>[0]
> +   dup\t%<QI_ONLY:Vetype>0, %1.<QI_ONLY:Vetype>[0]"
> +  [(set_attr "type" "mov_reg,load_4,f_loads,f_mrc,fmov")
> +   (set_attr "arch" "*,*,fp,fp,fp")]
>  )
>  
>  ;; -------------------------------------------------------------------
> @@ -5029,15 +5001,15 @@ (define_insn "*and<mode>_compare0"
>    [(set_attr "type" "alus_imm")]
>  )
>  
> -(define_insn "*ands<GPI:mode>_compare0"
> +(define_insn "*ands<SD_HSDI:mode>_compare0"
>    [(set (reg:CC_NZ CC_REGNUM)
>  	(compare:CC_NZ
> -	 (zero_extend:GPI (match_operand:SHORT 1 "register_operand" "r"))
> +	 (zero_extend:SD_HSDI (match_operand:ALLX 1 "register_operand" "r"))
>  	 (const_int 0)))
> -   (set (match_operand:GPI 0 "register_operand" "=r")
> -	(zero_extend:GPI (match_dup 1)))]
> +   (set (match_operand:SD_HSDI 0 "register_operand" "=r")
> +	(zero_extend:SD_HSDI (match_dup 1)))]
>    ""
> -  "ands\\t%<GPI:w>0, %<GPI:w>1, <short_mask>"
> +  "ands\\t%<SD_HSDI:w>0, %<SD_HSDI:w>1, <ALLX:short_mask>"
>    [(set_attr "type" "alus_imm")]
>  )
>  
> diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
> index 1df09f7fe2eb35aed96113476541e0faa5393551..e904407b2169e589b7007ff966b2d9347a6d0fd2 100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -41,6 +41,8 @@ (define_mode_iterator SHORT [QI HI])
>  ;; Iterators for single modes, for "@" patterns.
>  (define_mode_iterator SI_ONLY [SI])
>  (define_mode_iterator DI_ONLY [DI])
> +(define_mode_iterator HI_ONLY [HI])
> +(define_mode_iterator QI_ONLY [QI])
>  
>  ;; Iterator for all integer modes (up to 64-bit)
>  (define_mode_iterator ALLI [QI HI SI DI])
> @@ -1033,7 +1035,7 @@ (define_mode_attr w2 [(HF "x") (SF "x") (DF "w")])
>  ;; For width of fp registers in fcvt instruction
>  (define_mode_attr fpw [(DI "s") (SI "d")])
>  
> -(define_mode_attr short_mask [(HI "65535") (QI "255")])
> +(define_mode_attr short_mask [(SI "0xffffffff") (HI "0xffff") (QI "0xff")])
>  
>  ;; For constraints used in scalar immediate vector moves
>  (define_mode_attr hq [(HI "h") (QI "q")])
> diff --git a/gcc/testsuite/gcc.target/aarch64/ands_3.c b/gcc/testsuite/gcc.target/aarch64/ands_3.c
> index 42cb7f0f0bc86a4aceb09851c31eb2e888d93403..421aa5cea7a51ad810cc9c5653a149cb21bb871c 100644
> --- a/gcc/testsuite/gcc.target/aarch64/ands_3.c
> +++ b/gcc/testsuite/gcc.target/aarch64/ands_3.c
> @@ -9,4 +9,4 @@ f9 (unsigned char x, int y)
>    return x;
>  }
>  
> -/* { dg-final { scan-assembler "ands\t(x|w)\[0-9\]+,\[ \t\]*(x|w)\[0-9\]+,\[ \t\]*255" } } */
> +/* { dg-final { scan-assembler "ands\t(x|w)\[0-9\]+,\[ \t\]*(x|w)\[0-9\]+,\[ \t\]*0xff" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> index 8e35e0b574d49913b43c7d8d4f4ba75f127f42e9..03288976b3397cdbe0e822f94f2a6448d9fa9a52 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> @@ -51,7 +51,6 @@ TEST_ALL (VEC_PERM)
>  /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
>  /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 6 } } */
>  /* { dg-final { scan-assembler-not {\tldr} } } */
> -/* { dg-final { scan-assembler-times {\tstr} 2 } } */
> -/* { dg-final { scan-assembler-times {\tstr\th[0-9]+} 2 } } */
> +/* { dg-final { scan-assembler-times {\tins\tv[0-9]+\.h\[1\], v[0-9]+\.h\[0\]} 1 } } */
>  
>  /* { dg-final { scan-assembler-not {\tuqdec} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/tst_5.c b/gcc/testsuite/gcc.target/aarch64/tst_5.c
> index 0de40a6c47a7d63c1b7a81aeba438a096c0041b8..19034cd74ed07ea4d670c25d9ab3d1cff805a483 100644
> --- a/gcc/testsuite/gcc.target/aarch64/tst_5.c
> +++ b/gcc/testsuite/gcc.target/aarch64/tst_5.c
> @@ -4,7 +4,7 @@
>  int
>  f255 (int x)
>  {
> -  if (x & 255)
> +  if (x & 0xff)
>      return 1;
>    return x;
>  }
> @@ -12,10 +12,10 @@ f255 (int x)
>  int
>  f65535 (int x)
>  {
> -  if (x & 65535)
> +  if (x & 0xffff)
>      return 1;
>    return x;
>  }
>  
> -/* { dg-final { scan-assembler "tst\t(x|w)\[0-9\]+,\[ \t\]*255" } } */
> -/* { dg-final { scan-assembler "tst\t(x|w)\[0-9\]+,\[ \t\]*65535" } } */
> +/* { dg-final { scan-assembler "tst\t(x|w)\[0-9\]+,\[ \t\]*0xff" } } */
> +/* { dg-final { scan-assembler "tst\t(x|w)\[0-9\]+,\[ \t\]*0xffff" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/tst_6.c b/gcc/testsuite/gcc.target/aarch64/tst_6.c
> index f15ec114c391fed79cc43b7740fde83fb3d4ea53..1c047cfae214b60e5bf003e6781a277202fcc588 100644
> --- a/gcc/testsuite/gcc.target/aarch64/tst_6.c
> +++ b/gcc/testsuite/gcc.target/aarch64/tst_6.c
> @@ -7,4 +7,4 @@ foo (long x)
>     return ((short) x != 0) ? x : 1;
>  }
>  
> -/* { dg-final { scan-assembler "tst\t(x|w)\[0-9\]+,\[ \t\]*65535" } } */
> +/* { dg-final { scan-assembler "tst\t(x|w)\[0-9\]+,\[ \t\]*0xffff" } } */

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2022-12-06 15:59 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-31 11:56 [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs Tamar Christina
2022-10-31 11:57 ` [PATCH 2/8]middle-end: Recognize scalar widening reductions Tamar Christina
2022-10-31 21:42   ` Jeff Law
2022-11-07 13:21   ` Richard Biener
2022-10-31 11:57 ` [PATCH 3/8]middle-end: Support extractions of subvectors from arbitrary element position inside a vector Tamar Christina
2022-10-31 21:44   ` Jeff Law
2022-11-01 14:25   ` Richard Sandiford
2022-11-11 14:33     ` Tamar Christina
2022-11-15  8:35       ` Hongtao Liu
2022-11-15  8:51         ` Tamar Christina
2022-11-15  9:37           ` Hongtao Liu
2022-11-15 10:00             ` Tamar Christina
2022-11-15 17:39               ` Richard Sandiford
2022-11-17  8:04                 ` Hongtao Liu
2022-11-17  9:39                   ` Richard Sandiford
2022-11-17 10:20                     ` Hongtao Liu
2022-11-17 13:59                       ` Richard Sandiford
2022-11-18  2:31                         ` Hongtao Liu
2022-11-18  9:16                           ` Richard Sandiford
2022-10-31 11:58 ` [PATCH 4/8]AArch64 aarch64: Implement widening reduction patterns Tamar Christina
2022-11-01 14:41   ` Richard Sandiford
2022-10-31 11:58 ` [PATCH 5/8]AArch64 aarch64: Make existing V2HF be usable Tamar Christina
2022-11-01 14:58   ` Richard Sandiford
2022-11-01 15:11     ` Tamar Christina
2022-11-11 14:39     ` Tamar Christina
2022-11-22 16:01       ` Tamar Christina
2022-11-30  4:26         ` Tamar Christina
2022-12-06 10:28       ` Richard Sandiford
2022-12-06 10:58         ` Tamar Christina
2022-12-06 11:05           ` Richard Sandiford
2022-10-31 11:59 ` [PATCH 6/8]AArch64: Add peephole and scheduling logic for pairwise operations that appear late in RTL Tamar Christina
2022-10-31 11:59 ` [PATCH 7/8]AArch64: Consolidate zero and sign extension patterns and add missing ones Tamar Christina
2022-11-30  4:28   ` Tamar Christina
2022-12-06 15:59   ` Richard Sandiford
2022-10-31 12:00 ` [PATCH 8/8]AArch64: Have reload not choose to do add on the scalar side if both values exist on the SIMD side Tamar Christina
2022-11-01 15:04   ` Richard Sandiford
2022-11-01 15:20     ` Tamar Christina
2022-10-31 21:41 ` [PATCH 1/8]middle-end: Recognize scalar reductions from bitfields and array_refs Jeff Law
2022-11-05 11:32 ` Richard Biener
2022-11-07  7:16   ` Tamar Christina
2022-11-07 10:17     ` Richard Biener
2022-11-07 11:00       ` Tamar Christina
2022-11-07 11:22         ` Richard Biener
2022-11-07 11:56           ` Tamar Christina
2022-11-22 10:36             ` Richard Sandiford
2022-11-22 10:58               ` Richard Biener
2022-11-22 11:02                 ` Tamar Christina
2022-11-22 11:06                   ` Richard Sandiford
2022-11-22 11:08                     ` Richard Biener
2022-11-22 14:33                       ` Jeff Law

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).