public inbox for gcc-patches@gcc.gnu.org
* [PATCH 0/14+2][Vectorizer] Made reductions endianness-neutral, fixes PR/61114
@ 2014-09-18 11:41 Alan Lawrence
  2014-09-18 11:45 ` [PATCH 1/14][AArch64] Temporarily remove aarch64_gimple_fold_builtin code for reduction operations Alan Lawrence
                   ` (16 more replies)
  0 siblings, 17 replies; 52+ messages in thread
From: Alan Lawrence @ 2014-09-18 11:41 UTC (permalink / raw)
  To: gcc-patches

The end goal here is to remove this code from tree-vect-loop.c 
(vect_create_epilog_for_reduction):

       if (BYTES_BIG_ENDIAN)
         bitpos = size_binop (MULT_EXPR,
                              bitsize_int (TYPE_VECTOR_SUBPARTS (vectype) - 1),
                              TYPE_SIZE (scalar_type));
       else

as this is the root cause of PR/61114 (see testcase there, failing on all 
bigendian targets supporting reduc_[us]plus_optab). Quoting Richard Biener, "all 
code conditional on BYTES/WORDS_BIG_ENDIAN in tree-vect* is suspicious". The 
code snippet above is used on two paths:

(Path 1) (patches 1-6) Reductions using REDUC_(PLUS|MIN|MAX)_EXPR = 
reduc_[us](plus|min|max)_optab.
The optab is documented as "the scalar result is stored in the least significant 
bits of operand 0", but the tree code as "the first element in the vector 
holding the result of the reduction of all elements of the operand". This 
mismatch means that when the tree code is folded, the code snippet above reads 
the result from the wrong end of the vector.
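
As an aside (not part of any patch here), a small standalone C program shows why
"the least significant bits" of a vector correspond to element 0 on a
little-endian target but element N-1 on a big-endian one; this is exactly what
the BYTES_BIG_ENDIAN bitpos computation quoted above accounts for:

#include <stdio.h>
#include <stdint.h>

int main (void)
{
  /* Model a V4HI vector overlaid on a 64-bit value; element 0 is at the
     lowest address, as for GCC vectors.  */
  union { uint64_t wide; uint16_t lane[4]; } u = { .wide = 0 };
  u.lane[0] = 1;
  if (u.wide == 1)
    printf ("little-endian: element 0 holds the least significant bits\n");
  else
    printf ("big-endian: element 3 holds the least significant bits\n");
  return 0;
}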

The strategy (as per https://gcc.gnu.org/ml/gcc-patches/2014-08/msg00041.html) 
is to define new tree codes and optabs that produce scalar results directly; 
this seems better than tying (the element of the vector into which the result is 
placed) to (the endianness of the target), and avoids generating extra moves on 
current bigendian targets. However, the previous optabs are retained for now as 
a migration strategy so as not to break existing backends; moving individual 
platforms over will follow.
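
To illustrate the difference (rough sketches, not exact GIMPLE dumps): today the
epilogue is essentially

      v_out2 = REDUC_PLUS_EXPR <v_out1>;
      s_out3 = BIT_FIELD_REF <v_out2, bitsize, bitpos>;

with bitpos computed endian-dependently as in the snippet quoted above; with the
new scalar-producing tree codes it becomes simply

      s_out3 = REDUC_PLUS_EXPR <v_out1>;

and any lane extraction needed for the legacy vector-returning optabs is done at
expand time instead.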

A complication here is on AArch64, where we directly generate REDUC_PLUS_EXPRs 
from intrinsics in gimple_fold_builtin; I temporarily remove this folding in 
order to decouple the midend and AArch64 backend.

(Path 2) (patches 7-13) Reductions using whole-vector-shifts, i.e. 
VEC_RSHIFT_EXPR and vec_shr_optab. Here the tree code as well as the optab is 
defined in an endianness-dependent way, leading to significant complication in 
fold-const.c. (Moreover, the "equivalent" vec_shl_optab is never used!). Few 
platforms appear to handle vec_shr_optab (and fewer bigendian - I see only 
PowerPC and MIPS), so it seems pertinent to change the existing optab to be 
endianness-neutral.
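
For reference, the whole-vector-shift strategy reduces a vector roughly as
follows (illustrative pseudo-code paraphrasing the "reduce using vector shifts"
case in vect_create_epilog_for_reduction; shift amounts shown in elements for
clarity):

      for (offset = nelements / 2; offset >= 1; offset /= 2)
        {
          va' = VEC_RSHIFT_EXPR <va, offset>;   /* whole-vector shift */
          va  = va OP va';
        }
      s = BIT_FIELD_REF <va, bitsize, bitpos>;  /* bitpos endian-dependent */

so both which way the elements move and the lane finally read back currently
depend on the target's endianness.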

Patch 10 defines vec_shr for AArch64, for the old specification; patch 13 
updates that implementation to fit the new endianness-neutral specification, 
serving as a guide for other existing backends. Patches/RFCs 15 and 16 are 
equivalents for MIPS and PowerPC; I haven't tested these but hope they act as 
useful pointers for the port maintainers.

Finally patch 14 cleans up the affected part of tree-vect-loop.c 
(vect_create_epilog_for_reduction).

--Alan


* [PATCH 1/14][AArch64] Temporarily remove aarch64_gimple_fold_builtin code for reduction operations
  2014-09-18 11:41 [PATCH 0/14+2][Vectorizer] Made reductions endianness-neutral, fixes PR/61114 Alan Lawrence
@ 2014-09-18 11:45 ` Alan Lawrence
  2014-09-24  9:41   ` Marcus Shawcroft
  2014-09-18 11:51 ` [PATCH 2/14][Vectorizer] Make REDUC_xxx_EXPR tree codes produce a scalar result Alan Lawrence
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 52+ messages in thread
From: Alan Lawrence @ 2014-09-18 11:45 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 971 bytes --]

The gimple folding ties the AArch64 backend to the midend's tree representation 
via the Neon intrinsics. This code enables constant folding of the Neon reduction 
intrinsics, which improves performance, but it is not necessary for correctness. 
By temporarily removing it (here), we can change the midend representation 
independently of the AArch64 backend and intrinsics.
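
For example (illustrative GIMPLE, not a dump from a real build), the removed
hook rewrote a call such as

      _5 = __builtin_aarch64_reduc_splus_v4si (x_3);

into

      _5 = REDUC_PLUS_EXPR <x_3>;

so that later midend passes could see, and constant-fold, the reduction.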

However, I'm leaving the code in place, as a later patch will bring it all back 
in a very similar form (but enabled for bigendian).

Bootstrapped on aarch64-none-linux; tested aarch64.exp on aarch64-none-elf and 
aarch64_be-none-elf. (The removed code was already disabled for bigendian; and 
this is solely a __builtin-folding mechanism, i.e. used only for Neon/ACLE 
intrinsics.)

gcc/ChangeLog:
	* config/aarch64/aarch64.c (TARGET_GIMPLE_FOLD_BUILTIN): Comment out.
	* config/aarch64/aarch64-builtins.c (aarch64_gimple_fold_builtin):
	Remove using preprocessor directives.

[-- Attachment #2: 1_aarch64_disable_.gimple_fold.patch --]
[-- Type: text/x-patch; name=1_aarch64_disable_.gimple_fold.patch, Size: 1550 bytes --]

diff --git a/gcc/config/aarch64/aarch64-builtins.c b/gcc/config/aarch64/aarch64-builtins.c
index 5217f4a5f39224dbf8029542ad33790ef2c191be..15eb7c686d95b1d66cbd514500ec29ba074eaa3f 100644
--- a/gcc/config/aarch64/aarch64-builtins.c
+++ b/gcc/config/aarch64/aarch64-builtins.c
@@ -1333,6 +1333,9 @@ aarch64_fold_builtin (tree fndecl, int n_args ATTRIBUTE_UNUSED, tree *args,
   return NULL_TREE;
 }
 
+/* Handling of reduction operations temporarily removed so as to decouple
+   changes to tree codes from AArch64 NEON Intrinsics.  */
+#if 0
 bool
 aarch64_gimple_fold_builtin (gimple_stmt_iterator *gsi)
 {
@@ -1404,6 +1407,7 @@ aarch64_gimple_fold_builtin (gimple_stmt_iterator *gsi)
 
   return changed;
 }
+#endif
 
 void
 aarch64_atomic_assign_expand_fenv (tree *hold, tree *clear, tree *update)
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index e7946fc0b70ced70a4e98caa0a33121f29242aad..9197ec038b7d40a601c886b846113c50a29cf5e2 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -9925,8 +9925,8 @@ aarch64_expand_movmem (rtx *operands)
 #undef TARGET_FRAME_POINTER_REQUIRED
 #define TARGET_FRAME_POINTER_REQUIRED aarch64_frame_pointer_required
 
-#undef TARGET_GIMPLE_FOLD_BUILTIN
-#define TARGET_GIMPLE_FOLD_BUILTIN aarch64_gimple_fold_builtin
+//#undef TARGET_GIMPLE_FOLD_BUILTIN
+//#define TARGET_GIMPLE_FOLD_BUILTIN aarch64_gimple_fold_builtin
 
 #undef TARGET_GIMPLIFY_VA_ARG_EXPR
 #define TARGET_GIMPLIFY_VA_ARG_EXPR aarch64_gimplify_va_arg_expr


* [PATCH 2/14][Vectorizer] Make REDUC_xxx_EXPR tree codes produce a scalar result
  2014-09-18 11:41 [PATCH 0/14+2][Vectorizer] Made reductions endianness-neutral, fixes PR/61114 Alan Lawrence
  2014-09-18 11:45 ` [PATCH 1/14][AArch64] Temporarily remove aarch64_gimple_fold_builtin code for reduction operations Alan Lawrence
@ 2014-09-18 11:51 ` Alan Lawrence
  2014-09-22 10:34   ` Richard Biener
  2014-09-18 11:54 ` [PATCH 3/14] Add new optabs for reducing vectors to scalars Alan Lawrence
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 52+ messages in thread
From: Alan Lawrence @ 2014-09-18 11:51 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2595 bytes --]

This fixes PR/61114 by redefining the REDUC_{MIN,MAX,PLUS}_EXPR tree codes.

These are presently documented as producing a vector with the result in element 
0, and this is inconsistent with their use in tree-vect-loop.c (which on 
bigendian targets pulls the bits out of the wrong end of the vector result). 
This leads to bugs on bigendian targets - see also 
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61114.

I discounted "fixing" the vectorizer (to read from element 0) and then making 
bigendian targets (whose architectural insn produces the result in lane N-1) 
permute the result vector, as optimization of vectors in RTL seems unlikely to 
remove such a permute and would lead to a performance regression.

Instead it seems more natural for the tree code to produce a scalar result 
(producing a vector with the result in lane 0 has already caused confusion, e.g. 
https://gcc.gnu.org/ml/gcc-patches/2012-10/msg01100.html).

However, this patch preserves the meaning of the optab (producing a result in 
lane 0 on little-endian architectures or N-1 on bigendian), thus generally 
avoiding the need to change backends. Hence, expr.c extracts the 
endianness-dependent element from the optab result to give the scalar expected 
by the tree code (e.g. for a V4SImode reduction on a big-endian target, the 
optab still yields a V4SI with the result in element 3, and expr.c pulls that 
element out as the SImode result).

Previously posted as an RFC 
https://gcc.gnu.org/ml/gcc-patches/2014-08/msg00041.html , now with an extra 
VIEW_CONVERT_EXPR if the types of the reduction/result do not match.

Testing:
	x86_64-none-linux-gnu: bootstrap, check-gcc, check-g++
	aarch64-none-linux-gnu: bootstrap
	aarch64-none-elf:  check-gcc, check-g++
	arm-none-eabi: check-gcc

	aarch64_be-none-elf: check-gcc, showing
	FAIL->PASS: gcc.dg/vect/no-scevccp-outer-7.c execution test
	FAIL->PASS: gcc.dg/vect/no-scevccp-outer-13.c execution test
	Passes the (previously-failing) reduced testcase on
	 	https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61114

	I have also assembler/stage-1 tested that testcase on PowerPC, where it is also fixed.

gcc/ChangeLog:

	* expr.c (expand_expr_real_2): For REDUC_{MIN,MAX,PLUS}_EXPR, add
	extract_bit_field around optab result.

	* fold-const.c (fold_unary_loc): For REDUC_{MIN,MAX,PLUS}_EXPR, produce
	scalar not vector.

	* tree-cfg.c (verify_gimple_assign_unary): Check result vs operand type
	for REDUC_{MIN,MAX,PLUS}_EXPR.

	* tree-vect-loop.c (vect_analyze_loop): Update comment.
	(vect_create_epilog_for_reduction): For direct vector reduction, use
	result of tree code directly without extract_bit_field.

	* tree.def (REDUC_MAX_EXPR, REDUC_MIN_EXPR, REDUC_PLUS_EXPR): Update
	comment.

[-- Attachment #2: 2_make_reduc_tree_codes_scalar.patch --]
[-- Type: text/x-patch; name=2_make_reduc_tree_codes_scalar.patch, Size: 6827 bytes --]

diff --git a/gcc/expr.c b/gcc/expr.c
index 58b87ba7ed7eee156b9730b61679af946694e8df..a293c06489f09586ed56dff1381467401687be45 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -9020,7 +9020,17 @@ expand_expr_real_2 (sepops ops, rtx target, enum machine_mode tmode,
       {
         op0 = expand_normal (treeop0);
         this_optab = optab_for_tree_code (code, type, optab_default);
-        temp = expand_unop (mode, this_optab, op0, target, unsignedp);
+        enum machine_mode vec_mode = TYPE_MODE (TREE_TYPE (treeop0));
+        temp = expand_unop (vec_mode, this_optab, op0, NULL_RTX, unsignedp);
+        gcc_assert (temp);
+        /* The tree code produces a scalar result, but (somewhat by convention)
+           the optab produces a vector with the result in element 0 if
+           little-endian, or element N-1 if big-endian.  So pull the scalar
+           result out of that element.  */
+        int index = BYTES_BIG_ENDIAN ? GET_MODE_NUNITS (vec_mode) - 1 : 0;
+        int bitsize = GET_MODE_BITSIZE (GET_MODE_INNER (vec_mode));
+        temp = extract_bit_field (temp, bitsize, bitsize * index, unsignedp,
+				  target, mode, mode);
         gcc_assert (temp);
         return temp;
       }
diff --git a/gcc/fold-const.c b/gcc/fold-const.c
index d44476972158b125aecd8c4a5c8d6176ad3b0e5c..b8baa94d37a74ebb824e2a4d03f2a10befcdf749 100644
--- a/gcc/fold-const.c
+++ b/gcc/fold-const.c
@@ -8475,12 +8475,13 @@ fold_unary_loc (location_t loc, enum tree_code code, tree type, tree op0)
     case REDUC_MAX_EXPR:
     case REDUC_PLUS_EXPR:
       {
-	unsigned int nelts = TYPE_VECTOR_SUBPARTS (type), i;
+	unsigned int nelts, i;
 	tree *elts;
 	enum tree_code subcode;
 
 	if (TREE_CODE (op0) != VECTOR_CST)
 	  return NULL_TREE;
+        nelts = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0));
 
 	elts = XALLOCAVEC (tree, nelts);
 	if (!vec_cst_ctor_to_array (op0, elts))
@@ -8499,10 +8500,9 @@ fold_unary_loc (location_t loc, enum tree_code code, tree type, tree op0)
 	    elts[0] = const_binop (subcode, elts[0], elts[i]);
 	    if (elts[0] == NULL_TREE || !CONSTANT_CLASS_P (elts[0]))
 	      return NULL_TREE;
-	    elts[i] = build_zero_cst (TREE_TYPE (type));
 	  }
 
-	return build_vector (type, elts);
+	return elts[0];
       }
 
     default:
diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c
index 9d1de01021cfda296c3fe53c9212c3aa6bd627c5..49986cc40758bb5998e395c727142e75f7d6e9f4 100644
--- a/gcc/tree-cfg.c
+++ b/gcc/tree-cfg.c
@@ -3527,12 +3527,21 @@ verify_gimple_assign_unary (gimple stmt)
 
         return false;
       }
-
-    case VEC_UNPACK_HI_EXPR:
-    case VEC_UNPACK_LO_EXPR:
     case REDUC_MAX_EXPR:
     case REDUC_MIN_EXPR:
     case REDUC_PLUS_EXPR:
+      if (!VECTOR_TYPE_P (rhs1_type)
+	  || !useless_type_conversion_p (lhs_type, TREE_TYPE (rhs1_type)))
+        {
+	  error ("reduction should convert from vector to element type");
+	  debug_generic_expr (lhs_type);
+	  debug_generic_expr (rhs1_type);
+	  return true;
+	}
+      return false;
+
+    case VEC_UNPACK_HI_EXPR:
+    case VEC_UNPACK_LO_EXPR:
     case VEC_UNPACK_FLOAT_HI_EXPR:
     case VEC_UNPACK_FLOAT_LO_EXPR:
       /* FIXME.  */
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 7e013f3b549a07bd44789bd4d3e3701eec7c51dc..36f51977845bf5ce451564ccd1eb8ad5f39ac8de 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -1892,9 +1892,9 @@ vect_analyze_loop (struct loop *loop)
 
    Output:
    REDUC_CODE - the corresponding tree-code to be used to reduce the
-      vector of partial results into a single scalar result (which
-      will also reside in a vector) or ERROR_MARK if the operation is
-      a supported reduction operation, but does not have such tree-code.
+      vector of partial results into a single scalar result, or ERROR_MARK
+      if the operation is a supported reduction operation, but does not have
+      such a tree-code.
 
    Return FALSE if CODE currently cannot be vectorized as reduction.  */
 
@@ -4167,6 +4167,7 @@ vect_create_epilog_for_reduction (vec<tree> vect_defs, gimple stmt,
   if (reduc_code != ERROR_MARK && !slp_reduc)
     {
       tree tmp;
+      tree vec_elem_type;
 
       /*** Case 1:  Create:
            v_out2 = reduc_expr <v_out1>  */
@@ -4175,14 +4176,26 @@ vect_create_epilog_for_reduction (vec<tree> vect_defs, gimple stmt,
         dump_printf_loc (MSG_NOTE, vect_location,
 			 "Reduce using direct vector reduction.\n");
 
-      vec_dest = vect_create_destination_var (scalar_dest, vectype);
-      tmp = build1 (reduc_code, vectype, new_phi_result);
-      epilog_stmt = gimple_build_assign (vec_dest, tmp);
-      new_temp = make_ssa_name (vec_dest, epilog_stmt);
+      vec_elem_type = TREE_TYPE (TREE_TYPE (new_phi_result));
+      if (!useless_type_conversion_p (scalar_type, vec_elem_type))
+	{
+          tree tmp_dest =
+	      vect_create_destination_var (scalar_dest, vec_elem_type);
+	  tmp = build1 (reduc_code, vec_elem_type, new_phi_result);
+	  epilog_stmt = gimple_build_assign (tmp_dest, tmp);
+	  new_temp = make_ssa_name (tmp_dest, epilog_stmt);
+	  gimple_assign_set_lhs (epilog_stmt, new_temp);
+	  gsi_insert_before (&exit_gsi, epilog_stmt, GSI_SAME_STMT);
+
+	  tmp = build1 (VIEW_CONVERT_EXPR, scalar_type, new_temp);
+	}
+      else
+	tmp = build1 (reduc_code, scalar_type, new_phi_result);
+      epilog_stmt = gimple_build_assign (new_scalar_dest, tmp);
+      new_temp = make_ssa_name (new_scalar_dest, epilog_stmt);
       gimple_assign_set_lhs (epilog_stmt, new_temp);
       gsi_insert_before (&exit_gsi, epilog_stmt, GSI_SAME_STMT);
-
-      extract_scalar_result = true;
+      scalar_results.safe_push (new_temp);
     }
   else
     {
diff --git a/gcc/tree.def b/gcc/tree.def
index 84ffe93aa6fdc827f18ca81225bca007d50b50f6..e9af52e554babb100d49ea14f47c805cd5024949 100644
--- a/gcc/tree.def
+++ b/gcc/tree.def
@@ -1157,10 +1157,9 @@ DEFTREECODE (TRANSACTION_EXPR, "transaction_expr", tcc_expression, 1)
    result (e.g. summing the elements of the vector, finding the minimum over
    the vector elements, etc).
    Operand 0 is a vector.
-   The expression returns a vector of the same type, with the first
-   element in the vector holding the result of the reduction of all elements
-   of the operand.  The content of the other elements in the returned vector
-   is undefined.  */
+   The expression returns a scalar, with type the same as the elements of the
+   vector, holding the result of the reduction of all elements of the operand.
+   */
 DEFTREECODE (REDUC_MAX_EXPR, "reduc_max_expr", tcc_unary, 1)
 DEFTREECODE (REDUC_MIN_EXPR, "reduc_min_expr", tcc_unary, 1)
 DEFTREECODE (REDUC_PLUS_EXPR, "reduc_plus_expr", tcc_unary, 1)


* [PATCH 3/14] Add new optabs for reducing vectors to scalars
  2014-09-18 11:41 [PATCH 0/14+2][Vectorizer] Made reductions endianness-neutral, fixes PR/61114 Alan Lawrence
  2014-09-18 11:45 ` [PATCH 1/14][AArch64] Temporarily remove aarch64_gimple_fold_builtin code for reduction operations Alan Lawrence
  2014-09-18 11:51 ` [PATCH 2/14][Vectorizer] Make REDUC_xxx_EXPR tree codes produce a scalar result Alan Lawrence
@ 2014-09-18 11:54 ` Alan Lawrence
  2014-09-22 10:40   ` Richard Biener
  2014-09-18 11:59 ` [PATCH 4/14][AArch64] Use new reduc_plus_scal optabs, inc. for __builtins Alan Lawrence
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 52+ messages in thread
From: Alan Lawrence @ 2014-09-18 11:54 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1396 bytes --]

These match their corresponding tree codes, by taking a vector and returning a 
scalar; this is more architecturally neutral than the (somewhat loosely defined) 
previous optab that took a vector and returned a vector with the result in the 
least significant bits (i.e. element 0 for little-endian or N-1 for bigendian). 
However, the old optabs are preserved so as not to break existing backends; 
clients therefore check for both the old and the new optabs.

Bootstrap, check-gcc and check-g++ on x86_64-none-linux-gnu.
aarch64.exp + vect.exp on aarch64{,_be}-none-elf.
(of course at this point in the series all these are using the old optab + 
migration path.)

gcc/ChangeLog:

	* doc/md.texi (Standard Names): Add reduc_(plus,[us](min|max))_scal
	optabs, and note in reduc_[us](plus|min|max) to prefer the former.

	* expr.c (expand_expr_real_2): Use reduc_..._scal if available, fall
	back to old reduc_... + BIT_FIELD_REF only if not.

	* optabs.c (optab_for_tree_code): For REDUC_(MAX,MIN,PLUS)_EXPR,
	return the reduce-to-scalar (reduc_..._scal) optab.
	(scalar_reduc_to_vector): New.

	* optabs.def (reduc_smax_scal_optab, reduc_smin_scal_optab,
	reduc_plus_scal_optab, reduc_umax_scal_optab, reduc_umin_scal_optab):
	New.

	* optabs.h (scalar_reduc_to_vector): Declare.

	* tree-vect-loop.c (vectorizable_reduction): Look for optabs reducing
	to either scalar or vector.

[-- Attachment #2: 3_add_new_optabs.patch --]
[-- Type: text/x-patch; name=3_add_new_optabs.patch, Size: 10874 bytes --]

diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index dd7861188afb8afd01971f9f75f0e32da9f9f826..3f5fd6f0e3ac3fcc30f6c961e3e2709a35f4d413 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -4811,29 +4811,48 @@ it is unspecified which of the two operands is returned as the result.
 @cindex @code{reduc_smax_@var{m}} instruction pattern
 @item @samp{reduc_smin_@var{m}}, @samp{reduc_smax_@var{m}}
 Find the signed minimum/maximum of the elements of a vector. The vector is
-operand 1, and the scalar result is stored in the least significant bits of
+operand 1, and the result is stored in the least significant bits of
 operand 0 (also a vector). The output and input vector should have the same
-modes.
+modes. These are legacy optabs, and platforms should prefer to implement
+@samp{reduc_smin_scal_@var{m}} and @samp{reduc_smax_scal_@var{m}}.
 
 @cindex @code{reduc_umin_@var{m}} instruction pattern
 @cindex @code{reduc_umax_@var{m}} instruction pattern
 @item @samp{reduc_umin_@var{m}}, @samp{reduc_umax_@var{m}}
 Find the unsigned minimum/maximum of the elements of a vector. The vector is
-operand 1, and the scalar result is stored in the least significant bits of
+operand 1, and the result is stored in the least significant bits of
 operand 0 (also a vector). The output and input vector should have the same
-modes.
+modes. These are legacy optabs, and platforms should prefer to implement
+@samp{reduc_umin_scal_@var{m}} and @samp{reduc_umax_scal_@var{m}}.
 
 @cindex @code{reduc_splus_@var{m}} instruction pattern
-@item @samp{reduc_splus_@var{m}}
-Compute the sum of the signed elements of a vector. The vector is operand 1,
-and the scalar result is stored in the least significant bits of operand 0
-(also a vector). The output and input vector should have the same modes.
-
 @cindex @code{reduc_uplus_@var{m}} instruction pattern
-@item @samp{reduc_uplus_@var{m}}
-Compute the sum of the unsigned elements of a vector. The vector is operand 1,
-and the scalar result is stored in the least significant bits of operand 0
+@item @samp{reduc_splus_@var{m}}, @samp{reduc_uplus_@var{m}}
+Compute the sum of the signed/unsigned elements of a vector. The vector is
+operand 1, and the result is stored in the least significant bits of operand 0
 (also a vector). The output and input vector should have the same modes.
+These are legacy optabs, and platforms should prefer to implement
+@samp{reduc_plus_scal_@var{m}}.
+
+@cindex @code{reduc_smin_scal_@var{m}} instruction pattern
+@cindex @code{reduc_smax_scal_@var{m}} instruction pattern
+@item @samp{reduc_smin_scal_@var{m}}, @samp{reduc_smax_scal_@var{m}}
+Find the signed minimum/maximum of the elements of a vector. The vector is
+operand 1, and operand 0 is the scalar result, with mode equal to the mode of
+the elements of the input vector.
+
+@cindex @code{reduc_umin_scal_@var{m}} instruction pattern
+@cindex @code{reduc_umax_scal_@var{m}} instruction pattern
+@item @samp{reduc_umin_scal_@var{m}}, @samp{reduc_umax_scal_@var{m}}
+Find the unsigned minimum/maximum of the elements of a vector. The vector is
+operand 1, and operand 0 is the scalar result, with mode equal to the mode of
+the elements of the input vector.
+
+@cindex @code{reduc_plus_scal_@var{m}} instruction pattern
+@item @samp{reduc_plus_scal_@var{m}}
+Compute the sum of the elements of a vector. The vector is operand 1, and
+operand 0 is the scalar result, with mode equal to the mode of the elements of
+the input vector.
 
 @cindex @code{sdot_prod@var{m}} instruction pattern
 @item @samp{sdot_prod@var{m}}
diff --git a/gcc/expr.c b/gcc/expr.c
index a293c06489f09586ed56dff1381467401687be45..11930ca121e4e1f3807261a2e5b0ca4f6723176d 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -9018,21 +9018,39 @@ expand_expr_real_2 (sepops ops, rtx target, enum machine_mode tmode,
     case REDUC_MIN_EXPR:
     case REDUC_PLUS_EXPR:
       {
-        op0 = expand_normal (treeop0);
-        this_optab = optab_for_tree_code (code, type, optab_default);
-        enum machine_mode vec_mode = TYPE_MODE (TREE_TYPE (treeop0));
-        temp = expand_unop (vec_mode, this_optab, op0, NULL_RTX, unsignedp);
-        gcc_assert (temp);
-        /* The tree code produces a scalar result, but (somewhat by convention)
-           the optab produces a vector with the result in element 0 if
-           little-endian, or element N-1 if big-endian.  So pull the scalar
-           result out of that element.  */
-        int index = BYTES_BIG_ENDIAN ? GET_MODE_NUNITS (vec_mode) - 1 : 0;
-        int bitsize = GET_MODE_BITSIZE (GET_MODE_INNER (vec_mode));
-        temp = extract_bit_field (temp, bitsize, bitsize * index, unsignedp,
-				  target, mode, mode);
-        gcc_assert (temp);
-        return temp;
+	op0 = expand_normal (treeop0);
+	enum machine_mode vec_mode = GET_MODE (op0);
+	this_optab = optab_for_tree_code (code, type, optab_default);
+
+	if (optab_handler (this_optab, vec_mode) != CODE_FOR_nothing)
+	  {
+	    struct expand_operand ops[2];
+	    enum insn_code icode = optab_handler (this_optab, vec_mode);
+
+	    create_output_operand (&ops[0], target, mode);
+	    create_input_operand (&ops[1], op0, vec_mode);
+	    if (maybe_expand_insn (icode, 2, ops))
+	      {
+		target = ops[0].value;
+		if (GET_MODE (target) != mode)
+		  return gen_lowpart (tmode, target);
+		return target;
+	      }
+	  }
+	/* Fall back to optab with vector result, and then extract scalar.  */
+	this_optab = scalar_reduc_to_vector (this_optab, type);
+	temp = expand_unop (vec_mode, this_optab, op0, NULL_RTX, unsignedp);
+	gcc_assert (temp);
+	/* The tree code produces a scalar result, but (somewhat by convention)
+	   the optab produces a vector with the result in element 0 if
+	   little-endian, or element N-1 if big-endian.  So pull the scalar
+	   result out of that element.  */
+	int index = BYTES_BIG_ENDIAN ? GET_MODE_NUNITS (vec_mode) - 1 : 0;
+	int bitsize = GET_MODE_BITSIZE (GET_MODE_INNER (vec_mode));
+	temp = extract_bit_field (temp, bitsize, bitsize * index, unsignedp,
+	                          target, mode, mode);
+	gcc_assert (temp);
+	return temp;
       }
 
     case VEC_LSHIFT_EXPR:
diff --git a/gcc/optabs.c b/gcc/optabs.c
index d6412ec42d7c4908a74d098e80d7038e068ca557..e422bcce18d06a39b26547b510c35858efc2303e 100644
--- a/gcc/optabs.c
+++ b/gcc/optabs.c
@@ -505,13 +505,15 @@ optab_for_tree_code (enum tree_code code, const_tree type,
       return fma_optab;
 
     case REDUC_MAX_EXPR:
-      return TYPE_UNSIGNED (type) ? reduc_umax_optab : reduc_smax_optab;
+      return TYPE_UNSIGNED (type)
+	     ? reduc_umax_scal_optab : reduc_smax_scal_optab;
 
     case REDUC_MIN_EXPR:
-      return TYPE_UNSIGNED (type) ? reduc_umin_optab : reduc_smin_optab;
+      return TYPE_UNSIGNED (type)
+	     ? reduc_umin_scal_optab : reduc_smin_scal_optab;
 
     case REDUC_PLUS_EXPR:
-      return TYPE_UNSIGNED (type) ? reduc_uplus_optab : reduc_splus_optab;
+      return reduc_plus_scal_optab;
 
     case VEC_LSHIFT_EXPR:
       return vec_shl_optab;
@@ -607,7 +609,22 @@ optab_for_tree_code (enum tree_code code, const_tree type,
       return unknown_optab;
     }
 }
-\f
+
+optab
+scalar_reduc_to_vector (optab unoptab, const_tree type)
+{
+  switch (unoptab)
+    {
+    case reduc_plus_scal_optab:
+      return TYPE_UNSIGNED (type) ? reduc_uplus_optab : reduc_splus_optab;
+
+    case reduc_smin_scal_optab: return reduc_smin_optab;
+    case reduc_umin_scal_optab: return reduc_umin_optab;
+    case reduc_smax_scal_optab: return reduc_smax_optab;
+    case reduc_umax_scal_optab: return reduc_umax_optab;
+    default: return unknown_optab;
+    }
+}
 
 /* Expand vector widening operations.
 
diff --git a/gcc/optabs.def b/gcc/optabs.def
index b75547006585267d9f5b4f17ba972ba388852cf5..131ea048b012b073345be3b426d4ac8f33061809 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -243,12 +243,20 @@ OPTAB_D (sin_optab, "sin$a2")
 OPTAB_D (sincos_optab, "sincos$a3")
 OPTAB_D (tan_optab, "tan$a2")
 
+/* Vector reduction to a scalar.  */
+OPTAB_D (reduc_smax_scal_optab, "reduc_smax_scal_$a")
+OPTAB_D (reduc_smin_scal_optab, "reduc_smin_scal_$a")
+OPTAB_D (reduc_plus_scal_optab, "reduc_plus_scal_$a")
+OPTAB_D (reduc_umax_scal_optab, "reduc_umax_scal_$a")
+OPTAB_D (reduc_umin_scal_optab, "reduc_umin_scal_$a")
+/* (Old) Vector reduction, returning a vector with the result in one lane.  */
 OPTAB_D (reduc_smax_optab, "reduc_smax_$a")
 OPTAB_D (reduc_smin_optab, "reduc_smin_$a")
 OPTAB_D (reduc_splus_optab, "reduc_splus_$a")
 OPTAB_D (reduc_umax_optab, "reduc_umax_$a")
 OPTAB_D (reduc_umin_optab, "reduc_umin_$a")
 OPTAB_D (reduc_uplus_optab, "reduc_uplus_$a")
+
 OPTAB_D (sdot_prod_optab, "sdot_prod$I$a")
 OPTAB_D (ssum_widen_optab, "widen_ssum$I$a3")
 OPTAB_D (udot_prod_optab, "udot_prod$I$a")
diff --git a/gcc/optabs.h b/gcc/optabs.h
index 089b15a6fcd261bb15c898f185a157f1257284ba..d9f4900620a13d74fc3dfb1bac9bcb34416012de 100644
--- a/gcc/optabs.h
+++ b/gcc/optabs.h
@@ -162,6 +162,11 @@ enum optab_subtype
    vector shifts and rotates */
 extern optab optab_for_tree_code (enum tree_code, const_tree, enum optab_subtype);
 
+/* Given optab UNOPTAB that reduces a vector to a scalar, find instead the old
+   optab that produces a vector with the reduction result in one element,
+   for a tree with type TYPE.  */
+extern optab scalar_reduc_to_vector (optab unoptab, const_tree type);
+
 /* The various uses that a comparison can have; used by can_compare_p:
    jumps, conditional moves, store flag operations.  */
 enum can_compare_purpose
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 36f51977845bf5ce451564ccd1eb8ad5f39ac8de..d0a29d312bfd9a7eb552d937e3c64cf9b30d558a 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -5101,15 +5101,17 @@ vectorizable_reduction (gimple stmt, gimple_stmt_iterator *gsi,
 
           epilog_reduc_code = ERROR_MARK;
         }
-
-      if (reduc_optab
-          && optab_handler (reduc_optab, vec_mode) == CODE_FOR_nothing)
+      else if (optab_handler (reduc_optab, vec_mode) == CODE_FOR_nothing)
         {
-          if (dump_enabled_p ())
-	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			     "reduc op not supported by target.\n");
+          optab = scalar_reduc_to_vector (reduc_optab, vectype_out);
+          if (optab_handler (optab, vec_mode) == CODE_FOR_nothing)
+            {
+              if (dump_enabled_p ())
+	        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+				 "reduc op not supported by target.\n");
 
-          epilog_reduc_code = ERROR_MARK;
+	      epilog_reduc_code = ERROR_MARK;
+	    }
         }
     }
   else


* [PATCH 4/14][AArch64] Use new reduc_plus_scal optabs, inc. for __builtins
  2014-09-18 11:41 [PATCH 0/14+2][Vectorizer] Made reductions endianness-neutral, fixes PR/61114 Alan Lawrence
                   ` (2 preceding siblings ...)
  2014-09-18 11:54 ` [PATCH 3/14] Add new optabs for reducing vectors to scalars Alan Lawrence
@ 2014-09-18 11:59 ` Alan Lawrence
  2014-09-24  9:44   ` Marcus Shawcroft
  2014-09-18 12:02 ` [PATCH 5/14][AArch64] Use new reduc_[us](min|max)_scal optabs, inc. for builtins Alan Lawrence
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 52+ messages in thread
From: Alan Lawrence @ 2014-09-18 11:59 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1561 bytes --]

This migrates AArch64 over to the new optab for 'plus' reductions, i.e. the 
define_expands now produce scalars by generating a MOV to a GPR. Effectively, 
this moves the vget_lane that used to sit inside every arm_neon.h intrinsic into 
the define_expand.

Tested: aarch64.exp vect.exp on aarch64-none-elf and aarch64_be-none-elf (full 
check-gcc on next patch for reduc_min/max)

gcc/ChangeLog:

	* config/aarch64/aarch64-simd-builtins.def
	(reduc_splus_<mode>/VDQF, reduc_uplus_<mode>/VDQF, reduc_splus_v4sf):
	Remove.
	(reduc_plus_scal_<mode>, reduc_plus_scal_v4sf): New.

	* config/aarch64/aarch64-simd.md (reduc_<sur>plus_mode): Remove.
	(reduc_splus_<mode>, reduc_uplus_<mode>, reduc_plus_scal_<mode>): New.

	(reduc_<sur>plus_mode): Change SUADDV -> UNSPEC_ADDV, rename to...
	(aarch64_reduc_plus_internal<mode>): ...this.

	(reduc_<sur>plus_v2si): Change SUADDV -> UNSPEC_ADDV, rename to...
	(aarch64_reduc_plus_internalv2si): ...this.

	(reduc_splus_<mode>/V2F): Rename to...
	(aarch64_reduc_plus_internal<mode>): ...this.

	* config/aarch64/iterators.md
	(UNSPEC_SADDV, UNSPEC_UADDV, SUADDV): Remove.
	(UNSPEC_ADDV): New.
	(sur): Remove elements for UNSPEC_SADDV and UNSPEC_UADDV.

	* config/aarch64/arm_neon.h (vaddv_s8, vaddv_s16, vaddv_s32, vaddv_u8,
	vaddv_u16, vaddv_u32, vaddvq_s8, vaddvq_s16, vaddvq_s32, vaddvq_s64,
	vaddvq_u8, vaddvq_u16, vaddvq_u32, vaddvq_u64, vaddv_f32, vaddvq_f32,
	vaddvq_f64): Change __builtin_aarch64_reduc_[us]plus_... to
	__builtin_aarch64_reduc_plus_scal, remove vget_lane wrapper.

[-- Attachment #2: 4_aarch64_reduc_plus.patch --]
[-- Type: text/x-patch; name=4_aarch64_reduc_plus.patch, Size: 11159 bytes --]

diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def b/gcc/config/aarch64/aarch64-simd-builtins.def
index 4f3bd12c8447e7125dfeba3f06536cdf9acc2440..ae4ab42e3e3df7de4e4b2c5e46a1476a2ed64175 100644
--- a/gcc/config/aarch64/aarch64-simd-builtins.def
+++ b/gcc/config/aarch64/aarch64-simd-builtins.def
@@ -248,9 +248,8 @@
   BUILTIN_VSDQ_I_DI (BINOP, cmgtu, 0)
   BUILTIN_VSDQ_I_DI (BINOP, cmtst, 0)
 
-  /* Implemented by reduc_<sur>plus_<mode>.  */
-  BUILTIN_VALL (UNOP, reduc_splus_, 10)
-  BUILTIN_VDQ (UNOP, reduc_uplus_, 10)
+  /* Implemented by aarch64_reduc_plus_<mode>.  */
+  BUILTIN_VALL (UNOP, reduc_plus_scal_, 10)
 
   /* Implemented by reduc_<maxmin_uns>_<mode>.  */
   BUILTIN_VDQIF (UNOP, reduc_smax_, 10)
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index f5fa4aebe4cafe1430b31ca3a89ec5f3698d23bd..23b89584d9ba1d88ff49bfa28d210b325e7dea7f 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1719,25 +1719,74 @@
 
 ;; 'across lanes' add.
 
-(define_insn "reduc_<sur>plus_<mode>"
+(define_expand "reduc_splus_<mode>"
+  [(match_operand:VALL 0 "register_operand" "=w")
+   (match_operand:VALL 1 "register_operand" "w")]
+  "TARGET_SIMD"
+  {
+    /* Old optab/standard name, should not be used since we are providing
+       newer reduc_plus_scal_<mode>.  */
+    gcc_unreachable ();
+  }
+)
+
+(define_expand "reduc_uplus_<mode>"
+  [(match_operand:VALL 0 "register_operand" "=w")
+   (match_operand:VALL 1 "register_operand" "w")]
+  "TARGET_SIMD"
+  {
+    /* Old optab/standard name, should not be used since we are providing
+       newer reduc_plus_scal_<mode>.  */
+    gcc_unreachable ();
+  }
+)
+
+(define_expand "reduc_plus_scal_<mode>"
+  [(match_operand:<VEL> 0 "register_operand" "=w")
+   (unspec:VDQ [(match_operand:VDQ 1 "register_operand" "w")]
+	       UNSPEC_ADDV)]
+  "TARGET_SIMD"
+  {
+    rtx elt = GEN_INT (ENDIAN_LANE_N (<MODE>mode, 0));
+    rtx scratch = gen_reg_rtx (<MODE>mode);
+    emit_insn (gen_aarch64_reduc_plus_internal<mode> (scratch, operands[1]));
+    emit_insn (gen_aarch64_get_lane<mode> (operands[0], scratch, elt));
+    DONE;
+  }
+)
+
+(define_expand "reduc_plus_scal_<mode>"
+  [(match_operand:<VEL> 0 "register_operand" "=w")
+   (match_operand:V2F 1 "register_operand" "w")]
+  "TARGET_SIMD"
+  {
+    rtx elt = GEN_INT (ENDIAN_LANE_N (<MODE>mode, 0));
+    rtx scratch = gen_reg_rtx (<MODE>mode);
+    emit_insn (gen_aarch64_reduc_plus_internal<mode> (scratch, operands[1]));
+    emit_insn (gen_aarch64_get_lane<mode> (operands[0], scratch, elt));
+    DONE;
+  }
+)
+
+(define_insn "aarch64_reduc_plus_internal<mode>"
  [(set (match_operand:VDQV 0 "register_operand" "=w")
        (unspec:VDQV [(match_operand:VDQV 1 "register_operand" "w")]
-		    SUADDV))]
+		    UNSPEC_ADDV))]
  "TARGET_SIMD"
  "add<VDQV:vp>\\t%<Vetype>0, %1.<Vtype>"
   [(set_attr "type" "neon_reduc_add<q>")]
 )
 
-(define_insn "reduc_<sur>plus_v2si"
+(define_insn "aarch64_reduc_plus_internalv2si"
  [(set (match_operand:V2SI 0 "register_operand" "=w")
        (unspec:V2SI [(match_operand:V2SI 1 "register_operand" "w")]
-		    SUADDV))]
+		    UNSPEC_ADDV))]
  "TARGET_SIMD"
  "addp\\t%0.2s, %1.2s, %1.2s"
   [(set_attr "type" "neon_reduc_add")]
 )
 
-(define_insn "reduc_splus_<mode>"
+(define_insn "aarch64_reduc_plus_internal<mode>"
  [(set (match_operand:V2F 0 "register_operand" "=w")
        (unspec:V2F [(match_operand:V2F 1 "register_operand" "w")]
 		   UNSPEC_FADDV))]
@@ -1755,14 +1804,17 @@
   [(set_attr "type" "neon_fp_reduc_add_s_q")]
 )
 
-(define_expand "reduc_splus_v4sf"
- [(set (match_operand:V4SF 0 "register_operand")
+(define_expand "reduc_plus_scal_v4sf"
+ [(set (match_operand:SF 0 "register_operand")
        (unspec:V4SF [(match_operand:V4SF 1 "register_operand")]
 		    UNSPEC_FADDV))]
  "TARGET_SIMD"
 {
-  emit_insn (gen_aarch64_addpv4sf (operands[0], operands[1]));
-  emit_insn (gen_aarch64_addpv4sf (operands[0], operands[0]));
+  rtx elt = GEN_INT (ENDIAN_LANE_N (V4SFmode, 0));
+  rtx scratch = gen_reg_rtx (V4SFmode);
+  emit_insn (gen_aarch64_addpv4sf (scratch, operands[1]));
+  emit_insn (gen_aarch64_addpv4sf (scratch, scratch));
+  emit_insn (gen_aarch64_get_lanev4sf (operands[0], scratch, elt));
   DONE;
 })
 
diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index 0a86172ccb9aa9ab026f4aa020fd4418098e0923..734788e1c0fc81f6bf7efc126b357a74c22692f5 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -13456,121 +13456,103 @@ vaddd_u64 (uint64_t __a, uint64_t __b)
 __extension__ static __inline int8_t __attribute__ ((__always_inline__))
 vaddv_s8 (int8x8_t __a)
 {
-  return vget_lane_s8 (__builtin_aarch64_reduc_splus_v8qi (__a), 0);
+  return __builtin_aarch64_reduc_plus_scal_v8qi (__a);
 }
 
 __extension__ static __inline int16_t __attribute__ ((__always_inline__))
 vaddv_s16 (int16x4_t __a)
 {
-  return vget_lane_s16 (__builtin_aarch64_reduc_splus_v4hi (__a), 0);
+  return __builtin_aarch64_reduc_plus_scal_v4hi (__a);
 }
 
 __extension__ static __inline int32_t __attribute__ ((__always_inline__))
 vaddv_s32 (int32x2_t __a)
 {
-  return vget_lane_s32 (__builtin_aarch64_reduc_splus_v2si (__a), 0);
+  return __builtin_aarch64_reduc_plus_scal_v2si (__a);
 }
 
 __extension__ static __inline uint8_t __attribute__ ((__always_inline__))
 vaddv_u8 (uint8x8_t __a)
 {
-  return vget_lane_u8 ((uint8x8_t)
-		__builtin_aarch64_reduc_uplus_v8qi ((int8x8_t) __a),
-		0);
+  return (uint8_t) __builtin_aarch64_reduc_plus_scal_v8qi ((int8x8_t) __a);
 }
 
 __extension__ static __inline uint16_t __attribute__ ((__always_inline__))
 vaddv_u16 (uint16x4_t __a)
 {
-  return vget_lane_u16 ((uint16x4_t)
-		__builtin_aarch64_reduc_uplus_v4hi ((int16x4_t) __a),
-		0);
+  return (uint16_t) __builtin_aarch64_reduc_plus_scal_v4hi ((int16x4_t) __a);
 }
 
 __extension__ static __inline uint32_t __attribute__ ((__always_inline__))
 vaddv_u32 (uint32x2_t __a)
 {
-  return vget_lane_u32 ((uint32x2_t)
-		__builtin_aarch64_reduc_uplus_v2si ((int32x2_t) __a),
-		0);
+  return (int32_t) __builtin_aarch64_reduc_plus_scal_v2si ((int32x2_t) __a);
 }
 
 __extension__ static __inline int8_t __attribute__ ((__always_inline__))
 vaddvq_s8 (int8x16_t __a)
 {
-  return vgetq_lane_s8 (__builtin_aarch64_reduc_splus_v16qi (__a),
-			0);
+  return __builtin_aarch64_reduc_plus_scal_v16qi (__a);
 }
 
 __extension__ static __inline int16_t __attribute__ ((__always_inline__))
 vaddvq_s16 (int16x8_t __a)
 {
-  return vgetq_lane_s16 (__builtin_aarch64_reduc_splus_v8hi (__a), 0);
+  return __builtin_aarch64_reduc_plus_scal_v8hi (__a);
 }
 
 __extension__ static __inline int32_t __attribute__ ((__always_inline__))
 vaddvq_s32 (int32x4_t __a)
 {
-  return vgetq_lane_s32 (__builtin_aarch64_reduc_splus_v4si (__a), 0);
+  return __builtin_aarch64_reduc_plus_scal_v4si (__a);
 }
 
 __extension__ static __inline int64_t __attribute__ ((__always_inline__))
 vaddvq_s64 (int64x2_t __a)
 {
-  return vgetq_lane_s64 (__builtin_aarch64_reduc_splus_v2di (__a), 0);
+  return __builtin_aarch64_reduc_plus_scal_v2di (__a);
 }
 
 __extension__ static __inline uint8_t __attribute__ ((__always_inline__))
 vaddvq_u8 (uint8x16_t __a)
 {
-  return vgetq_lane_u8 ((uint8x16_t)
-		__builtin_aarch64_reduc_uplus_v16qi ((int8x16_t) __a),
-		0);
+  return (uint8_t) __builtin_aarch64_reduc_plus_scal_v16qi ((int8x16_t) __a);
 }
 
 __extension__ static __inline uint16_t __attribute__ ((__always_inline__))
 vaddvq_u16 (uint16x8_t __a)
 {
-  return vgetq_lane_u16 ((uint16x8_t)
-		__builtin_aarch64_reduc_uplus_v8hi ((int16x8_t) __a),
-		0);
+  return (uint16_t) __builtin_aarch64_reduc_plus_scal_v8hi ((int16x8_t) __a);
 }
 
 __extension__ static __inline uint32_t __attribute__ ((__always_inline__))
 vaddvq_u32 (uint32x4_t __a)
 {
-  return vgetq_lane_u32 ((uint32x4_t)
-		__builtin_aarch64_reduc_uplus_v4si ((int32x4_t) __a),
-		0);
+  return (uint32_t) __builtin_aarch64_reduc_plus_scal_v4si ((int32x4_t) __a);
 }
 
 __extension__ static __inline uint64_t __attribute__ ((__always_inline__))
 vaddvq_u64 (uint64x2_t __a)
 {
-  return vgetq_lane_u64 ((uint64x2_t)
-		__builtin_aarch64_reduc_uplus_v2di ((int64x2_t) __a),
-		0);
+  return (uint64_t) __builtin_aarch64_reduc_plus_scal_v2di ((int64x2_t) __a);
 }
 
 __extension__ static __inline float32_t __attribute__ ((__always_inline__))
 vaddv_f32 (float32x2_t __a)
 {
-  float32x2_t __t = __builtin_aarch64_reduc_splus_v2sf (__a);
-  return vget_lane_f32 (__t, 0);
+  return __builtin_aarch64_reduc_plus_scal_v2sf (__a);
 }
 
 __extension__ static __inline float32_t __attribute__ ((__always_inline__))
 vaddvq_f32 (float32x4_t __a)
 {
-  float32x4_t __t = __builtin_aarch64_reduc_splus_v4sf (__a);
-  return vgetq_lane_f32 (__t, 0);
+  return __builtin_aarch64_reduc_plus_scal_v4sf (__a);
 }
 
 __extension__ static __inline float64_t __attribute__ ((__always_inline__))
 vaddvq_f64 (float64x2_t __a)
 {
-  float64x2_t __t = __builtin_aarch64_reduc_splus_v2df (__a);
-  return vgetq_lane_f64 (__t, 0);
+  return __builtin_aarch64_reduc_plus_scal_v2df (__a);
 }
 
 /* vbsl  */
@@ -19234,7 +19216,7 @@ vpadd_u32 (uint32x2_t __a, uint32x2_t __b)
 __extension__ static __inline float64_t __attribute__ ((__always_inline__))
 vpaddd_f64 (float64x2_t __a)
 {
-  return vgetq_lane_f64 (__builtin_aarch64_reduc_splus_v2df (__a), 0);
+  return __builtin_aarch64_reduc_plus_scal_v2df (__a);
 }
 
 __extension__ static __inline int64_t __attribute__ ((__always_inline__))
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index 3203c3da7e293d566d1ea329856cbef8fb73a825..f738c298252736716077238d7c23478195481468 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -207,8 +207,7 @@
     UNSPEC_FMINNMV	; Used in aarch64-simd.md.
     UNSPEC_FMINV	; Used in aarch64-simd.md.
     UNSPEC_FADDV	; Used in aarch64-simd.md.
-    UNSPEC_SADDV	; Used in aarch64-simd.md.
-    UNSPEC_UADDV	; Used in aarch64-simd.md.
+    UNSPEC_ADDV		; Used in aarch64-simd.md.
     UNSPEC_SMAXV	; Used in aarch64-simd.md.
     UNSPEC_SMINV	; Used in aarch64-simd.md.
     UNSPEC_UMAXV	; Used in aarch64-simd.md.
@@ -845,8 +844,6 @@
 (define_int_iterator FMAXMINV [UNSPEC_FMAXV UNSPEC_FMINV
 			       UNSPEC_FMAXNMV UNSPEC_FMINNMV])
 
-(define_int_iterator SUADDV [UNSPEC_SADDV UNSPEC_UADDV])
-
 (define_int_iterator HADDSUB [UNSPEC_SHADD UNSPEC_UHADD
 			      UNSPEC_SRHADD UNSPEC_URHADD
 			      UNSPEC_SHSUB UNSPEC_UHSUB
@@ -951,7 +948,6 @@
 		      (UNSPEC_SUBHN2 "") (UNSPEC_RSUBHN2 "r")
 		      (UNSPEC_SQXTN "s") (UNSPEC_UQXTN "u")
 		      (UNSPEC_USQADD "us") (UNSPEC_SUQADD "su")
-		      (UNSPEC_SADDV "s") (UNSPEC_UADDV "u")
 		      (UNSPEC_SSLI  "s") (UNSPEC_USLI  "u")
 		      (UNSPEC_SSRI  "s") (UNSPEC_USRI  "u")
 		      (UNSPEC_USRA  "u") (UNSPEC_SSRA  "s")


* [PATCH 5/14][AArch64] Use new reduc_[us](min|max)_scal optabs, inc. for builtins
  2014-09-18 11:41 [PATCH 0/14+2][Vectorizer] Made reductions endianness-neutral, fixes PR/61114 Alan Lawrence
                   ` (3 preceding siblings ...)
  2014-09-18 11:59 ` [PATCH 4/14][AArch64] Use new reduc_plus_scal optabs, inc. for __builtins Alan Lawrence
@ 2014-09-18 12:02 ` Alan Lawrence
  2014-09-24  9:47   ` Marcus Shawcroft
  2014-09-18 12:05 ` [PATCH 6/14][AArch64] Restore gimple_folding of reduction intrinsics Alan Lawrence
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 52+ messages in thread
From: Alan Lawrence @ 2014-09-18 12:02 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1613 bytes --]

Similarly to the previous patch (r/2205), this migrates AArch64 to the new 
reduce-to-scalar optabs for min and max. For consistency we apply the same 
treatment to the smax_nan and smin_nan patterns (used for __builtins), even 
though reduc_smin_nan_scal (etc.) is not a standard name.

Tested: check-gcc on aarch64-none-elf and aarch64_be-none-elf.

gcc/ChangeLog:

	* config/aarch64/aarch64-simd-builtins.def (reduc_smax_, reduc_smin_,
	reduc_umax_, reduc_umin_, reduc_smax_nan_, reduc_smin_nan_): Remove.
	(reduc_smax_scal_, reduc_smin_scal_, reduc_umax_scal_,
	reduc_umin_scal_, reduc_smax_nan_scal_, reduc_smin_nan_scal_): New.

	* config/aarch64/aarch64-simd.md
	(reduc_<maxmin_uns>_<mode>): Rename VDQV_S variant to...
	(reduc_<maxmin_uns>_internal<mode>): ...this.
	(reduc_<maxmin_uns>_<mode>): New (VDQ_BHSI).
	(reduc_<maxmin_uns>_scal_<mode>): New (*2).

	(reduc_<maxmin_uns>_v2si): Combine with below, renaming...
	(reduc_<maxmin_uns>_<mode>): Combine V2F with above, renaming...
	(reduc_<maxmin_uns>_internal_<mode>): ...to this (VDQF).

	* config/aarch64/arm_neon.h (vmaxv_f32, vmaxv_s8, vmaxv_s16,
	vmaxv_s32, vmaxv_u8, vmaxv_u16, vmaxv_u32, vmaxvq_f32, vmaxvq_f64,
	vmaxvq_s8, vmaxvq_s16, vmaxvq_s32, vmaxvq_u8, vmaxvq_u16, vmaxvq_u32,
	vmaxnmv_f32, vmaxnmvq_f32, vmaxnmvq_f64, vminv_f32, vminv_s8,
	vminv_s16, vminv_s32, vminv_u8, vminv_u16, vminv_u32, vminvq_f32,
	vminvq_f64, vminvq_s8, vminvq_s16, vminvq_s32, vminvq_u8, vminvq_u16,
	vminvq_u32, vminnmv_f32, vminnmvq_f32, vminnmvq_f64): Update to use
	__builtin_aarch64_reduc_..._scal; remove vget_lane wrapper.

[-- Attachment #2: 5_aarch64_reduc_minmax.patch --]
[-- Type: text/x-patch; name=5_aarch64_reduc_minmax.patch, Size: 15085 bytes --]

diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def b/gcc/config/aarch64/aarch64-simd-builtins.def
index ae4ab42e3e3df7de4e4b2c5e46a1476a2ed64175..e213b9ce3adfc0c4c50b4dc34f4f1b995d5e8042 100644
--- a/gcc/config/aarch64/aarch64-simd-builtins.def
+++ b/gcc/config/aarch64/aarch64-simd-builtins.def
@@ -251,13 +251,13 @@
   /* Implemented by aarch64_reduc_plus_<mode>.  */
   BUILTIN_VALL (UNOP, reduc_plus_scal_, 10)
 
-  /* Implemented by reduc_<maxmin_uns>_<mode>.  */
-  BUILTIN_VDQIF (UNOP, reduc_smax_, 10)
-  BUILTIN_VDQIF (UNOP, reduc_smin_, 10)
-  BUILTIN_VDQ_BHSI (UNOP, reduc_umax_, 10)
-  BUILTIN_VDQ_BHSI (UNOP, reduc_umin_, 10)
-  BUILTIN_VDQF (UNOP, reduc_smax_nan_, 10)
-  BUILTIN_VDQF (UNOP, reduc_smin_nan_, 10)
+  /* Implemented by reduc_<maxmin_uns>_scal_<mode> (producing scalar).  */
+  BUILTIN_VDQIF (UNOP, reduc_smax_scal_, 10)
+  BUILTIN_VDQIF (UNOP, reduc_smin_scal_, 10)
+  BUILTIN_VDQ_BHSI (UNOPU, reduc_umax_scal_, 10)
+  BUILTIN_VDQ_BHSI (UNOPU, reduc_umin_scal_, 10)
+  BUILTIN_VDQF (UNOP, reduc_smax_nan_scal_, 10)
+  BUILTIN_VDQF (UNOP, reduc_smin_nan_scal_, 10)
 
   /* Implemented by <maxmin><mode>3.
      smax variants map to fmaxnm,
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 23b89584d9ba1d88ff49bfa28d210b325e7dea7f..d4a745be59897b4cb2a0de23adb56b5d79203592 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1828,7 +1828,64 @@
 
 ;; 'across lanes' max and min ops.
 
-(define_insn "reduc_<maxmin_uns>_<mode>"
+(define_expand "reduc_<maxmin_uns>_<mode>"
+  [(match_operand:VDQ_BHSI 0 "register_operand")
+   (unspec:VDQ_BHSI [(match_operand:VDQ_BHSI 1 "register_operand")]
+		MAXMINV)]
+  "TARGET_SIMD"
+  {
+    /* Old optab/standard name, should not be used since we are providing
+    newer reduc_..._scal_<mode>.  */
+    gcc_unreachable ();
+  }
+)
+
+(define_expand "reduc_<maxmin_uns>_<mode>"
+  [(match_operand:VDQF 0 "register_operand")
+   (unspec:VDQF [(match_operand:VDQF 1 "register_operand")]
+		FMAXMINV)]
+  "TARGET_SIMD"
+  {
+    /* Old optab/standard name, should not be used since we are providing
+    newer reduc_..._scal_<mode>.  */
+    gcc_unreachable ();
+  }
+)
+
+;; Template for outputting a scalar, so we can create __builtins which can be
+;; gimple_fold'd to the REDUC_(MAX|MIN)_EXPR tree code.  (This is FP smax/smin).
+(define_expand "reduc_<maxmin_uns>_scal_<mode>"
+  [(match_operand:<VEL> 0 "register_operand")
+   (unspec:VDQF [(match_operand:VDQF 1 "register_operand")]
+		FMAXMINV)]
+  "TARGET_SIMD"
+  {
+    rtx elt = GEN_INT (ENDIAN_LANE_N (<MODE>mode, 0));
+    rtx scratch = gen_reg_rtx (<MODE>mode);
+    emit_insn (gen_aarch64_reduc_<maxmin_uns>_internal<mode> (scratch,
+							      operands[1]));
+    emit_insn (gen_aarch64_get_lane<mode> (operands[0], scratch, elt));
+    DONE;
+  }
+)
+
+;; Likewise for integer cases, signed and unsigned.
+(define_expand "reduc_<maxmin_uns>_scal_<mode>"
+  [(match_operand:<VEL> 0 "register_operand")
+   (unspec:VDQ_BHSI [(match_operand:VDQ_BHSI 1 "register_operand")]
+		    MAXMINV)]
+  "TARGET_SIMD"
+  {
+    rtx elt = GEN_INT (ENDIAN_LANE_N (<MODE>mode, 0));
+    rtx scratch = gen_reg_rtx (<MODE>mode);
+    emit_insn (gen_aarch64_reduc_<maxmin_uns>_internal<mode> (scratch,
+							      operands[1]));
+    emit_insn (gen_aarch64_get_lane<mode> (operands[0], scratch, elt));
+    DONE;
+  }
+)
+
+(define_insn "aarch64_reduc_<maxmin_uns>_internal<mode>"
  [(set (match_operand:VDQV_S 0 "register_operand" "=w")
        (unspec:VDQV_S [(match_operand:VDQV_S 1 "register_operand" "w")]
 		    MAXMINV))]
@@ -1837,7 +1894,7 @@
   [(set_attr "type" "neon_reduc_minmax<q>")]
 )
 
-(define_insn "reduc_<maxmin_uns>_v2si"
+(define_insn "aarch64_reduc_<maxmin_uns>_internalv2si"
  [(set (match_operand:V2SI 0 "register_operand" "=w")
        (unspec:V2SI [(match_operand:V2SI 1 "register_operand" "w")]
 		    MAXMINV))]
@@ -1846,24 +1903,15 @@
   [(set_attr "type" "neon_reduc_minmax")]
 )
 
-(define_insn "reduc_<maxmin_uns>_<mode>"
- [(set (match_operand:V2F 0 "register_operand" "=w")
-       (unspec:V2F [(match_operand:V2F 1 "register_operand" "w")]
+(define_insn "aarch64_reduc_<maxmin_uns>_internal<mode>"
+ [(set (match_operand:VDQF 0 "register_operand" "=w")
+       (unspec:VDQF [(match_operand:VDQF 1 "register_operand" "w")]
 		    FMAXMINV))]
  "TARGET_SIMD"
- "<maxmin_uns_op>p\\t%<Vetype>0, %1.<Vtype>"
+ "<maxmin_uns_op><vp>\\t%<Vetype>0, %1.<Vtype>"
   [(set_attr "type" "neon_fp_reduc_minmax_<Vetype><q>")]
 )
 
-(define_insn "reduc_<maxmin_uns>_v4sf"
- [(set (match_operand:V4SF 0 "register_operand" "=w")
-       (unspec:V4SF [(match_operand:V4SF 1 "register_operand" "w")]
-		    FMAXMINV))]
- "TARGET_SIMD"
- "<maxmin_uns_op>v\\t%s0, %1.4s"
-  [(set_attr "type" "neon_fp_reduc_minmax_s_q")]
-)
-
 ;; aarch64_simd_bsl may compile to any of bsl/bif/bit depending on register
 ;; allocation.
 ;; Operand 1 is the mask, operands 2 and 3 are the bitfields from which
diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index 734788e1c0fc81f6bf7efc126b357a74c22692f5..35be8a0ba913461552e9cc1e740dffb6f6c95bd4 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -18047,106 +18047,91 @@ vmaxnmq_f64 (float64x2_t __a, float64x2_t __b)
 __extension__ static __inline float32_t __attribute__ ((__always_inline__))
 vmaxv_f32 (float32x2_t __a)
 {
-  return vget_lane_f32 (__builtin_aarch64_reduc_smax_nan_v2sf (__a),
-			0);
+  return __builtin_aarch64_reduc_smax_nan_scal_v2sf (__a);
 }
 
 __extension__ static __inline int8_t __attribute__ ((__always_inline__))
 vmaxv_s8 (int8x8_t __a)
 {
-  return vget_lane_s8 (__builtin_aarch64_reduc_smax_v8qi (__a), 0);
+  return __builtin_aarch64_reduc_smax_scal_v8qi (__a);
 }
 
 __extension__ static __inline int16_t __attribute__ ((__always_inline__))
 vmaxv_s16 (int16x4_t __a)
 {
-  return vget_lane_s16 (__builtin_aarch64_reduc_smax_v4hi (__a), 0);
+  return __builtin_aarch64_reduc_smax_scal_v4hi (__a);
 }
 
 __extension__ static __inline int32_t __attribute__ ((__always_inline__))
 vmaxv_s32 (int32x2_t __a)
 {
-  return vget_lane_s32 (__builtin_aarch64_reduc_smax_v2si (__a), 0);
+  return __builtin_aarch64_reduc_smax_scal_v2si (__a);
 }
 
 __extension__ static __inline uint8_t __attribute__ ((__always_inline__))
 vmaxv_u8 (uint8x8_t __a)
 {
-  return vget_lane_u8 ((uint8x8_t)
-		__builtin_aarch64_reduc_umax_v8qi ((int8x8_t) __a),
-		0);
+  return __builtin_aarch64_reduc_umax_scal_v8qi_uu (__a);
 }
 
 __extension__ static __inline uint16_t __attribute__ ((__always_inline__))
 vmaxv_u16 (uint16x4_t __a)
 {
-  return vget_lane_u16 ((uint16x4_t)
-		__builtin_aarch64_reduc_umax_v4hi ((int16x4_t) __a),
-		0);
+  return __builtin_aarch64_reduc_umax_scal_v4hi_uu (__a);
 }
 
 __extension__ static __inline uint32_t __attribute__ ((__always_inline__))
 vmaxv_u32 (uint32x2_t __a)
 {
-  return vget_lane_u32 ((uint32x2_t)
-		__builtin_aarch64_reduc_umax_v2si ((int32x2_t) __a),
-		0);
+  return __builtin_aarch64_reduc_umax_scal_v2si_uu (__a);
 }
 
 __extension__ static __inline float32_t __attribute__ ((__always_inline__))
 vmaxvq_f32 (float32x4_t __a)
 {
-  return vgetq_lane_f32 (__builtin_aarch64_reduc_smax_nan_v4sf (__a),
-			 0);
+  return __builtin_aarch64_reduc_smax_nan_scal_v4sf (__a);
 }
 
 __extension__ static __inline float64_t __attribute__ ((__always_inline__))
 vmaxvq_f64 (float64x2_t __a)
 {
-  return vgetq_lane_f64 (__builtin_aarch64_reduc_smax_nan_v2df (__a),
-			 0);
+  return __builtin_aarch64_reduc_smax_nan_scal_v2df (__a);
 }
 
 __extension__ static __inline int8_t __attribute__ ((__always_inline__))
 vmaxvq_s8 (int8x16_t __a)
 {
-  return vgetq_lane_s8 (__builtin_aarch64_reduc_smax_v16qi (__a), 0);
+  return __builtin_aarch64_reduc_smax_scal_v16qi (__a);
 }
 
 __extension__ static __inline int16_t __attribute__ ((__always_inline__))
 vmaxvq_s16 (int16x8_t __a)
 {
-  return vgetq_lane_s16 (__builtin_aarch64_reduc_smax_v8hi (__a), 0);
+  return __builtin_aarch64_reduc_smax_scal_v8hi (__a);
 }
 
 __extension__ static __inline int32_t __attribute__ ((__always_inline__))
 vmaxvq_s32 (int32x4_t __a)
 {
-  return vgetq_lane_s32 (__builtin_aarch64_reduc_smax_v4si (__a), 0);
+  return __builtin_aarch64_reduc_smax_scal_v4si (__a);
 }
 
 __extension__ static __inline uint8_t __attribute__ ((__always_inline__))
 vmaxvq_u8 (uint8x16_t __a)
 {
-  return vgetq_lane_u8 ((uint8x16_t)
-		__builtin_aarch64_reduc_umax_v16qi ((int8x16_t) __a),
-		0);
+  return __builtin_aarch64_reduc_umax_scal_v16qi_uu (__a);
 }
 
 __extension__ static __inline uint16_t __attribute__ ((__always_inline__))
 vmaxvq_u16 (uint16x8_t __a)
 {
-  return vgetq_lane_u16 ((uint16x8_t)
-		__builtin_aarch64_reduc_umax_v8hi ((int16x8_t) __a),
-		0);
+  return __builtin_aarch64_reduc_umax_scal_v8hi_uu (__a);
 }
 
 __extension__ static __inline uint32_t __attribute__ ((__always_inline__))
 vmaxvq_u32 (uint32x4_t __a)
 {
-  return vgetq_lane_u32 ((uint32x4_t)
-		__builtin_aarch64_reduc_umax_v4si ((int32x4_t) __a),
-		0);
+  return __builtin_aarch64_reduc_umax_scal_v4si_uu (__a);
 }
 
 /* vmaxnmv  */
@@ -18154,20 +18139,19 @@ vmaxvq_u32 (uint32x4_t __a)
 __extension__ static __inline float32_t __attribute__ ((__always_inline__))
 vmaxnmv_f32 (float32x2_t __a)
 {
-  return vget_lane_f32 (__builtin_aarch64_reduc_smax_v2sf (__a),
-			0);
+  return __builtin_aarch64_reduc_smax_scal_v2sf (__a);
 }
 
 __extension__ static __inline float32_t __attribute__ ((__always_inline__))
 vmaxnmvq_f32 (float32x4_t __a)
 {
-  return vgetq_lane_f32 (__builtin_aarch64_reduc_smax_v4sf (__a), 0);
+  return __builtin_aarch64_reduc_smax_scal_v4sf (__a);
 }
 
 __extension__ static __inline float64_t __attribute__ ((__always_inline__))
 vmaxnmvq_f64 (float64x2_t __a)
 {
-  return vgetq_lane_f64 (__builtin_aarch64_reduc_smax_v2df (__a), 0);
+  return __builtin_aarch64_reduc_smax_scal_v2df (__a);
 }
 
 /* vmin  */
@@ -18293,107 +18277,91 @@ vminnmq_f64 (float64x2_t __a, float64x2_t __b)
 __extension__ static __inline float32_t __attribute__ ((__always_inline__))
 vminv_f32 (float32x2_t __a)
 {
-  return vget_lane_f32 (__builtin_aarch64_reduc_smin_nan_v2sf (__a),
-			0);
+  return __builtin_aarch64_reduc_smin_nan_scal_v2sf (__a);
 }
 
 __extension__ static __inline int8_t __attribute__ ((__always_inline__))
 vminv_s8 (int8x8_t __a)
 {
-  return vget_lane_s8 (__builtin_aarch64_reduc_smin_v8qi (__a),
-		       0);
+  return __builtin_aarch64_reduc_smin_scal_v8qi (__a);
 }
 
 __extension__ static __inline int16_t __attribute__ ((__always_inline__))
 vminv_s16 (int16x4_t __a)
 {
-  return vget_lane_s16 (__builtin_aarch64_reduc_smin_v4hi (__a), 0);
+  return __builtin_aarch64_reduc_smin_scal_v4hi (__a);
 }
 
 __extension__ static __inline int32_t __attribute__ ((__always_inline__))
 vminv_s32 (int32x2_t __a)
 {
-  return vget_lane_s32 (__builtin_aarch64_reduc_smin_v2si (__a), 0);
+  return __builtin_aarch64_reduc_smin_scal_v2si (__a);
 }
 
 __extension__ static __inline uint8_t __attribute__ ((__always_inline__))
 vminv_u8 (uint8x8_t __a)
 {
-  return vget_lane_u8 ((uint8x8_t)
-		__builtin_aarch64_reduc_umin_v8qi ((int8x8_t) __a),
-		0);
+  return __builtin_aarch64_reduc_umin_scal_v8qi_uu (__a);
 }
 
 __extension__ static __inline uint16_t __attribute__ ((__always_inline__))
 vminv_u16 (uint16x4_t __a)
 {
-  return vget_lane_u16 ((uint16x4_t)
-		__builtin_aarch64_reduc_umin_v4hi ((int16x4_t) __a),
-		0);
+  return __builtin_aarch64_reduc_umin_scal_v4hi_uu (__a);
 }
 
 __extension__ static __inline uint32_t __attribute__ ((__always_inline__))
 vminv_u32 (uint32x2_t __a)
 {
-  return vget_lane_u32 ((uint32x2_t)
-		__builtin_aarch64_reduc_umin_v2si ((int32x2_t) __a),
-		0);
+  return __builtin_aarch64_reduc_umin_scal_v2si_uu (__a);
 }
 
 __extension__ static __inline float32_t __attribute__ ((__always_inline__))
 vminvq_f32 (float32x4_t __a)
 {
-  return vgetq_lane_f32 (__builtin_aarch64_reduc_smin_nan_v4sf (__a),
-			 0);
+  return __builtin_aarch64_reduc_smin_nan_scal_v4sf (__a);
 }
 
 __extension__ static __inline float64_t __attribute__ ((__always_inline__))
 vminvq_f64 (float64x2_t __a)
 {
-  return vgetq_lane_f64 (__builtin_aarch64_reduc_smin_nan_v2df (__a),
-			 0);
+  return __builtin_aarch64_reduc_smin_nan_scal_v2df (__a);
 }
 
 __extension__ static __inline int8_t __attribute__ ((__always_inline__))
 vminvq_s8 (int8x16_t __a)
 {
-  return vgetq_lane_s8 (__builtin_aarch64_reduc_smin_v16qi (__a), 0);
+  return __builtin_aarch64_reduc_smin_scal_v16qi (__a);
 }
 
 __extension__ static __inline int16_t __attribute__ ((__always_inline__))
 vminvq_s16 (int16x8_t __a)
 {
-  return vgetq_lane_s16 (__builtin_aarch64_reduc_smin_v8hi (__a), 0);
+  return __builtin_aarch64_reduc_smin_scal_v8hi (__a);
 }
 
 __extension__ static __inline int32_t __attribute__ ((__always_inline__))
 vminvq_s32 (int32x4_t __a)
 {
-  return vgetq_lane_s32 (__builtin_aarch64_reduc_smin_v4si (__a), 0);
+  return __builtin_aarch64_reduc_smin_scal_v4si (__a);
 }
 
 __extension__ static __inline uint8_t __attribute__ ((__always_inline__))
 vminvq_u8 (uint8x16_t __a)
 {
-  return vgetq_lane_u8 ((uint8x16_t)
-		__builtin_aarch64_reduc_umin_v16qi ((int8x16_t) __a),
-		0);
+  return __builtin_aarch64_reduc_umin_scal_v16qi_uu (__a);
 }
 
 __extension__ static __inline uint16_t __attribute__ ((__always_inline__))
 vminvq_u16 (uint16x8_t __a)
 {
-  return vgetq_lane_u16 ((uint16x8_t)
-		__builtin_aarch64_reduc_umin_v8hi ((int16x8_t) __a),
-		0);
+  return __builtin_aarch64_reduc_umin_scal_v8hi_uu (__a);
 }
 
 __extension__ static __inline uint32_t __attribute__ ((__always_inline__))
 vminvq_u32 (uint32x4_t __a)
 {
-  return vgetq_lane_u32 ((uint32x4_t)
-		__builtin_aarch64_reduc_umin_v4si ((int32x4_t) __a),
-		0);
+  return __builtin_aarch64_reduc_umin_scal_v4si_uu (__a);
 }
 
 /* vminnmv  */
@@ -18401,19 +18369,19 @@ vminvq_u32 (uint32x4_t __a)
 __extension__ static __inline float32_t __attribute__ ((__always_inline__))
 vminnmv_f32 (float32x2_t __a)
 {
-  return vget_lane_f32 (__builtin_aarch64_reduc_smin_v2sf (__a), 0);
+  return __builtin_aarch64_reduc_smin_scal_v2sf (__a);
 }
 
 __extension__ static __inline float32_t __attribute__ ((__always_inline__))
 vminnmvq_f32 (float32x4_t __a)
 {
-  return vgetq_lane_f32 (__builtin_aarch64_reduc_smin_v4sf (__a), 0);
+  return __builtin_aarch64_reduc_smin_scal_v4sf (__a);
 }
 
 __extension__ static __inline float64_t __attribute__ ((__always_inline__))
 vminnmvq_f64 (float64x2_t __a)
 {
-  return vgetq_lane_f64 (__builtin_aarch64_reduc_smin_v2df (__a), 0);
+  return __builtin_aarch64_reduc_smin_scal_v2df (__a);
 }
 
 /* vmla */

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 6/14][AArch64] Restore gimple_folding of reduction intrinsics
  2014-09-18 11:41 [PATCH 0/14+2][Vectorizer] Made reductions endianness-neutral, fixes PR/61114 Alan Lawrence
                   ` (4 preceding siblings ...)
  2014-09-18 12:02 ` [PATCH 5/14][AArch64] Use new reduc_[us](min|max)_scal optabs, inc. for builtins Alan Lawrence
@ 2014-09-18 12:05 ` Alan Lawrence
  2014-09-24  9:48   ` Marcus Shawcroft
  2014-09-18 12:19 ` [PATCH 7/14][Testsuite] Add tests of reductions using whole-vector-shifts (multiplication) Alan Lawrence
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 52+ messages in thread
From: Alan Lawrence @ 2014-09-18 12:05 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 463 bytes --]

This restores the gimple folding of the NEON reduction intrinsics that was 
removed in the first patch; with the new _scal builtins the folding is now 
correct for bigendian too.
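
As a rough illustration (a hypothetical example of my own, not part of the 
patch): with the folding restored, an intrinsic call like the one below should 
become a REDUC_MAX_EXPR in gimple on both aarch64-none-elf and 
aarch64_be-none-elf (max_of_lanes is an invented name):

#include <arm_neon.h>

int32_t
max_of_lanes (int32x4_t v)
{
  /* Expected to be folded to a REDUC_MAX_EXPR by
     aarch64_gimple_fold_builtin rather than remaining a builtin call.  */
  return vmaxvq_s32 (v);
}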

bootstrapped on aarch64-none-linux-gnu.
check-gcc on aarch64-none-elf and aarch64_be-none-elf.

gcc/ChangeLog:

	* config/aarch64/aarch64.c (TARGET_GIMPLE_FOLD_BUILTIN): Define again.
	* config/aarch64/aarch64-builtins.c (aarch64_gimple_fold_builtin):
	Restore, enable for bigendian, update to use __builtin..._scal...

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 6_aarch64_restore_gimple_fold.patch --]
[-- Type: text/x-patch; name=6_aarch64_restore_gimple_fold.patch, Size: 3694 bytes --]

diff --git a/gcc/config/aarch64/aarch64-builtins.c b/gcc/config/aarch64/aarch64-builtins.c
index 15eb7c686d95b1d66cbd514500ec29ba074eaa3f..0432d3aa1a515a15b051ba89afec7c0306cb5803 100644
--- a/gcc/config/aarch64/aarch64-builtins.c
+++ b/gcc/config/aarch64/aarch64-builtins.c
@@ -1333,9 +1333,6 @@ aarch64_fold_builtin (tree fndecl, int n_args ATTRIBUTE_UNUSED, tree *args,
   return NULL_TREE;
 }
 
-/* Handling of reduction operations temporarily removed so as to decouple
-   changes to tree codes from AArch64 NEON Intrinsics.  */
-#if 0
 bool
 aarch64_gimple_fold_builtin (gimple_stmt_iterator *gsi)
 {
@@ -1345,19 +1342,6 @@ aarch64_gimple_fold_builtin (gimple_stmt_iterator *gsi)
   tree fndecl;
   gimple new_stmt = NULL;
 
-  /* The operations folded below are reduction operations.  These are
-     defined to leave their result in the 0'th element (from the perspective
-     of GCC).  The architectural instruction we are folding will leave the
-     result in the 0'th element (from the perspective of the architecture).
-     For big-endian systems, these perspectives are not aligned.
-
-     It is therefore wrong to perform this fold on big-endian.  There
-     are some tricks we could play with shuffling, but the mid-end is
-     inconsistent in the way it treats reduction operations, so we will
-     end up in difficulty.  Until we fix the ambiguity - just bail out.  */
-  if (BYTES_BIG_ENDIAN)
-    return false;
-
   if (call)
     {
       fndecl = gimple_call_fndecl (stmt);
@@ -1369,23 +1353,28 @@ aarch64_gimple_fold_builtin (gimple_stmt_iterator *gsi)
 			? gimple_call_arg_ptr (stmt, 0)
 			: &error_mark_node);
 
+	  /* We use gimple's REDUC_(PLUS|MIN|MAX)_EXPRs for float, signed int
+	     and unsigned int; it will distinguish according to the types of
+	     the arguments to the __builtin.  */
 	  switch (fcode)
 	    {
-	      BUILTIN_VALL (UNOP, reduc_splus_, 10)
-		new_stmt = gimple_build_assign_with_ops (
+	      BUILTIN_VALL (UNOP, reduc_plus_scal_, 10)
+	        new_stmt = gimple_build_assign_with_ops (
 						REDUC_PLUS_EXPR,
 						gimple_call_lhs (stmt),
 						args[0],
 						NULL_TREE);
 		break;
-	      BUILTIN_VDQIF (UNOP, reduc_smax_, 10)
+	      BUILTIN_VDQIF (UNOP, reduc_smax_scal_, 10)
+	      BUILTIN_VDQ_BHSI (UNOPU, reduc_umax_scal_, 10)
 		new_stmt = gimple_build_assign_with_ops (
 						REDUC_MAX_EXPR,
 						gimple_call_lhs (stmt),
 						args[0],
 						NULL_TREE);
 		break;
-	      BUILTIN_VDQIF (UNOP, reduc_smin_, 10)
+	      BUILTIN_VDQIF (UNOP, reduc_smin_scal_, 10)
+	      BUILTIN_VDQ_BHSI (UNOPU, reduc_umin_scal_, 10)
 		new_stmt = gimple_build_assign_with_ops (
 						REDUC_MIN_EXPR,
 						gimple_call_lhs (stmt),
@@ -1407,7 +1396,6 @@ aarch64_gimple_fold_builtin (gimple_stmt_iterator *gsi)
 
   return changed;
 }
-#endif
 
 void
 aarch64_atomic_assign_expand_fenv (tree *hold, tree *clear, tree *update)
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 9197ec038b7d40a601c886b846113c50a29cf5e2..e7946fc0b70ced70a4e98caa0a33121f29242aad 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -9925,8 +9925,8 @@ aarch64_expand_movmem (rtx *operands)
 #undef TARGET_FRAME_POINTER_REQUIRED
 #define TARGET_FRAME_POINTER_REQUIRED aarch64_frame_pointer_required
 
-//#undef TARGET_GIMPLE_FOLD_BUILTIN
-//#define TARGET_GIMPLE_FOLD_BUILTIN aarch64_gimple_fold_builtin
+#undef TARGET_GIMPLE_FOLD_BUILTIN
+#define TARGET_GIMPLE_FOLD_BUILTIN aarch64_gimple_fold_builtin
 
 #undef TARGET_GIMPLIFY_VA_ARG_EXPR
 #define TARGET_GIMPLIFY_VA_ARG_EXPR aarch64_gimplify_va_arg_expr

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 7/14][Testsuite] Add tests of reductions using whole-vector-shifts (multiplication)
  2014-09-18 11:41 [PATCH 0/14+2][Vectorizer] Made reductions endianness-neutral, fixes PR/61114 Alan Lawrence
                   ` (5 preceding siblings ...)
  2014-09-18 12:05 ` [PATCH 6/14][AArch64] Restore gimple_folding of reduction intrinsics Alan Lawrence
@ 2014-09-18 12:19 ` Alan Lawrence
  2014-09-22 10:41   ` Richard Biener
  2014-09-18 12:25 ` [PATCH 8/14][Testsuite] Add tests of reductions using whole-vector-shifts (ior) Alan Lawrence
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 52+ messages in thread
From: Alan Lawrence @ 2014-09-18 12:19 UTC (permalink / raw)
  To: gcc-patches; +Cc: David Edelsohn, Aldy Hernandez

[-- Attachment #1: Type: text/plain, Size: 1725 bytes --]

For reduction operations (e.g. multiply) that don't have such a tree code, or 
where the target platform doesn't define an optab handler for the tree code, we 
can perform the reduction using a series of log2(N) whole-vector shifts (where 
N = #elements in the vector), using the VEC_RSHIFT_EXPR (whole-vector-shift) 
tree code, provided the platform handles the vec_shr_optab.
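
As a rough illustration, here is a standalone scalar model of that shift-based 
reduction for a 16-element multiply (written for this description only; it is 
not the vectorizer's actual output):

#include <stdio.h>

int
main (void)
{
  unsigned char v[16], tmp[16];
  int i, shift;

  for (i = 0; i < 16; i++)
    v[i] = i + i + 1;

  for (shift = 8; shift >= 1; shift /= 2)
    {
      /* Whole-vector shift towards element 0, filling with zeros.  */
      for (i = 0; i < 16; i++)
        tmp[i] = (i + shift < 16) ? v[i + shift] : 0;
      /* Elementwise multiply; only the low lanes remain meaningful.  */
      for (i = 0; i < 16; i++)
        v[i] = (unsigned char) (v[i] * tmp[i]);
    }

  printf ("%d\n", v[0]);   /* Prints 33, matching vect-reduc-mul_1.c below.  */
  return 0;
}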

First stage is to add some tests of non-(min/max/plus) reductions; here, 
multiplies. The first is designed to be non-foldable, so we make sure the 
architectural instructions line up with what the tree codes specify. The second 
is designed to be easily constant-propagated, to test the (currently 
endianness-dependent) constant folding code.

In lib/target-supports.exp, I've defined a new 
check_effective_target_whole_vector_shift, intended to be true for platforms 
with the vec_shr optab. However, I've not managed to make this test pass on 
PowerPC - even with -maltivec, -fdump-tree-vect-details reports that the target 
does not support vector multiplication - so I've omitted PowerPC from 
whole_vector_shift. This doesn't feel right; suggestions from PowerPC 
maintainers are welcome.

Tests passing on arm-none-eabi and x86_64-none-linux-gnu;
also verified the scan-tree-dump part works on ia64-none-linux-gnu (by compiling 
to assembly only).
(Tests are not run on AArch64, because we have no vec_shr_optab at this point; 
PowerPC, as above; or MIPS, as check_effective_target_vect_int_mult yields 0.)

gcc/testsuite/ChangeLog:

	* lib/target-supports.exp (check_effective_target_whole_vector_shift):
	New.

	* gcc.dg/vect/vect-reduc-mul_1.c: New test.
	* gcc.dg/vect/vect-reduc-mul_2.c: New test.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 7_test_shift_reduc_mul.patch --]
[-- Type: text/x-patch; name=7_test_shift_reduc_mul.patch, Size: 3088 bytes --]

diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-mul_1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-mul_1.c
new file mode 100644
index 0000000000000000000000000000000000000000..44f026ff9b561bcf314224c44d51bdd19448851b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-mul_1.c
@@ -0,0 +1,36 @@
+/* { dg-require-effective-target vect_int_mult } */
+/* { dg-require-effective-target whole_vector_shift } */
+
+/* Write a reduction loop to be reduced using vector shifts.  */
+
+extern void abort(void);
+
+unsigned char in[16];
+
+int
+main (unsigned char argc, char **argv)
+{
+  unsigned char i = 0;
+  unsigned char sum = 1;
+
+  for (i = 0; i < 16; i++)
+    in[i] = i + i + 1;
+
+  /* Prevent constant propagation of the entire loop below.  */
+  asm volatile ("" : : : "memory");
+
+  for (i = 0; i < 16; i++)
+    sum *= in[i];
+
+  if (sum != 33)
+    {
+      __builtin_printf("Failed %d\n", sum);
+      abort();
+    }
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "Reduce using vector shifts" "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-mul_2.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-mul_2.c
new file mode 100644
index 0000000000000000000000000000000000000000..414fba7a5c96c4dd89030682492edb57ebba3b16
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-mul_2.c
@@ -0,0 +1,32 @@
+/* { dg-require-effective-target vect_int_mult } */
+/* { dg-require-effective-target whole_vector_shift } */
+
+/* Write a reduction loop to be reduced using vector shifts and folded.  */
+
+extern void abort(void);
+
+int
+main (unsigned char argc, char **argv)
+{
+  unsigned char in[16];
+  unsigned char i = 0;
+  unsigned char sum = 1;
+
+  for (i = 0; i < 16; i++)
+    in[i] = i + i + 1;
+
+  for (i = 0; i < 16; i++)
+    sum *= in[i];
+
+  if (sum != 33)
+    {
+      __builtin_printf("Failed %d\n", sum);
+      abort();
+    }
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "Reduce using vector shifts" "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index fa5137ea472e1773be60759caad32bbc7ab4c551..0f4bebd533c9268adfcd4ed250f06fca825c92b1 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -3320,6 +3320,22 @@ proc check_effective_target_vect_shift { } {
     return $et_vect_shift_saved
 }
 
+proc check_effective_target_whole_vector_shift { } {
+    if { [istarget x86_64-*-*]
+	 || [istarget ia64-*-*]
+	 || ([check_effective_target_arm32]
+	     && [check_effective_target_arm_little_endian])
+	 || ([istarget mips*-*-*]
+	     && [check_effective_target_mips_loongson]) } {
+	set answer 1
+    } else {
+	set answer 0
+    }
+
+    verbose "check_effective_target_vect_long: returning $answer" 2
+    return $answer
+}
+
 # Return 1 if the target supports vector bswap operations.
 
 proc check_effective_target_vect_bswap { } {

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 8/14][Testsuite] Add tests of reductions using whole-vector-shifts (ior)
  2014-09-18 11:41 [PATCH 0/14+2][Vectorizer] Made reductions endianness-neutral, fixes PR/61114 Alan Lawrence
                   ` (6 preceding siblings ...)
  2014-09-18 12:19 ` [PATCH 7/14][Testsuite] Add tests of reductions using whole-vector-shifts (multiplication) Alan Lawrence
@ 2014-09-18 12:25 ` Alan Lawrence
  2014-09-22 10:42   ` Richard Biener
  2014-09-18 12:27 ` [PATCH 9/14] Enforce whole-vector-shifts to always be by a whole number of elements Alan Lawrence
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 52+ messages in thread
From: Alan Lawrence @ 2014-09-18 12:25 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 804 bytes --]

These are like the previous patch, but using | rather than *, since I was 
unable to get the previous tests to pass on PowerPC and MIPS.

I note there is no inherent need for a vector operation here - a bitwise OR 
across a word, followed by a "reduction via shifts" using scalar (not vector) 
ops, would be all that's necessary. However, GCC doesn't exploit this 
possibility at present, and I have no plans to add it myself; a sketch of that 
scalar alternative is below.
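
For illustration only (hypothetical code that GCC does not generate today; 
or_reduce is an invented name):

#include <stdint.h>
#include <string.h>

unsigned char
or_reduce (const unsigned char in[16])
{
  uint64_t a, b;

  /* OR the 16 bytes together as two 64-bit words...  */
  memcpy (&a, in, 8);
  memcpy (&b, in + 8, 8);
  a |= b;

  /* ...then fold the remaining word with scalar shifts.  */
  a |= a >> 32;
  a |= a >> 16;
  a |= a >> 8;
  return (unsigned char) a;
}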

Passing on x86_64-linux-gnu, aarch64-none-elf, aarch64_be-none-elf, arm-none-eabi.
The 'scan-tree-dump' part passes on mips64 and powerpc (although the latter is 
disabled because check_effective_target_whole_vector_shift gives 0, as per the 
previous patch).

gcc/testsuite/ChangeLog:

	* gcc.dg/vect/vect-reduc-or_1.c: New test.
	* gcc.dg/vect/vect-reduc-or_2.c: Likewise.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 8_test_shift_reduc_ior.patch --]
[-- Type: text/x-patch; name=8_test_shift_reduc_ior.patch, Size: 2105 bytes --]

diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-or_1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-or_1.c
new file mode 100644
index 0000000000000000000000000000000000000000..4e1a8577ce21aad539fca7cf07700b99575dfab0
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-or_1.c
@@ -0,0 +1,35 @@
+/* { dg-require-effective-target whole_vector_shift } */
+
+/* Write a reduction loop to be reduced using vector shifts.  */
+
+extern void abort(void);
+
+unsigned char in[16] __attribute__((__aligned__(16)));
+
+int
+main (unsigned char argc, char **argv)
+{
+  unsigned char i = 0;
+  unsigned char sum = 1;
+
+  for (i = 0; i < 16; i++)
+    in[i] = (i + i + 1) & 0xfd;
+
+  /* Prevent constant propagation of the entire loop below.  */
+  asm volatile ("" : : : "memory");
+
+  for (i = 0; i < 16; i++)
+    sum |= in[i];
+
+  if (sum != 29)
+    {
+      __builtin_printf("Failed %d\n", sum);
+      abort();
+    }
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "Reduce using vector shifts" "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-or_2.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-or_2.c
new file mode 100644
index 0000000000000000000000000000000000000000..e25467e59221adc09cbe0bb7548842902a4bf6da
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-or_2.c
@@ -0,0 +1,31 @@
+/* { dg-require-effective-target whole_vector_shift } */
+
+/* Write a reduction loop to be reduced using vector shifts and folded.  */
+
+extern void abort(void);
+
+int
+main (unsigned char argc, char **argv)
+{
+  unsigned char in[16] __attribute__((aligned(16)));
+  unsigned char i = 0;
+  unsigned char sum = 1;
+
+  for (i = 0; i < 16; i++)
+    in[i] = (i + i + 1) & 0xfd;
+
+  for (i = 0; i < 16; i++)
+    sum |= in[i];
+
+  if (sum != 29)
+    {
+      __builtin_printf("Failed %d\n", sum);
+      abort();
+    }
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "Reduce using vector shifts" "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 9/14] Enforce whole-vector-shifts to always be by a whole number of elements
  2014-09-18 11:41 [PATCH 0/14+2][Vectorizer] Made reductions endianness-neutral, fixes PR/61114 Alan Lawrence
                   ` (7 preceding siblings ...)
  2014-09-18 12:25 ` [PATCH 8/14][Testsuite] Add tests of reductions using whole-vector-shifts (ior) Alan Lawrence
@ 2014-09-18 12:27 ` Alan Lawrence
  2014-09-22 10:50   ` Richard Biener
  2014-09-18 12:34 ` [PATCH 10/14][AArch64] Implement vec_shr optab Alan Lawrence
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 52+ messages in thread
From: Alan Lawrence @ 2014-09-18 12:27 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 887 bytes --]

The VEC_RSHIFT_EXPR is only ever used by the vectorizer in tree-vect-loop.c 
(vect_create_epilog_for_reduction), to shift the vector by a whole number of 
elements. The tree code currently allows more general shift amounts, but only 
for vectors with integral elements. That extra generality only causes pain and 
difficulty for backends (particularly for backends with different 
endiannesses), and enforcing the whole-number-of-elements restriction for 
integral element types too does no harm. For example, a shift of a V16QI 
vector must now be by a multiple of 8 bits (the element size), even though its 
elements are integral.

bootstrapped on aarch64-none-linux-gnu and x86_64-none-linux-gnu
check-gcc on aarch64-none-elf and x86_64-none-linux-gnu

gcc/ChangeLog:

	* tree-cfg.c (verify_gimple_assign_binary): For VEC_RSHIFT_EXPR (and
	VEC_LSHIFT_EXPR), require shifts to be by a whole number of elements
	for all types, rather than only non-integral types.

	* tree.def (VEC_LSHIFT_EXPR, VEC_RSHIFT_EXPR): Update comment.

	* doc/md.texi (vec_shl_m, vec_shr_m): Update comment.


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 9_vec_shift_by_whole_elem.patch --]
[-- Type: text/x-patch; name=9_vec_shift_by_whole_elem.patch, Size: 2249 bytes --]

diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 3f5fd6f0e3ac3fcc30f6c961e3e2709a35f4d413..a78aea2f3f6e35b0d89719a42d734e62a2f5bd65 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -4888,7 +4888,8 @@ of a wider mode.)
 @item @samp{vec_shl_@var{m}}, @samp{vec_shr_@var{m}}
 Whole vector left/right shift in bits.
 Operand 1 is a vector to be shifted.
-Operand 2 is an integer shift amount in bits.
+Operand 2 is an integer shift amount in bits, which must be a multiple of the
+element size.
 Operand 0 is where the resulting shifted vector is stored.
 The output and input vectors should have the same modes.
 
diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c
index 49986cc40758bb5998e395c727142e75f7d6e9f4..1ea2e256b09b25331810a57a9c35e5cc875d0404 100644
--- a/gcc/tree-cfg.c
+++ b/gcc/tree-cfg.c
@@ -3667,14 +3667,11 @@ verify_gimple_assign_binary (gimple stmt)
 	    debug_generic_expr (rhs2_type);
 	    return true;
 	  }
-	/* For shifting a vector of non-integral components we
-	   only allow shifting by a constant multiple of the element size.  */
-	if (!INTEGRAL_TYPE_P (TREE_TYPE (rhs1_type))
-	    && (TREE_CODE (rhs2) != INTEGER_CST
-		|| !div_if_zero_remainder (rhs2,
-					   TYPE_SIZE (TREE_TYPE (rhs1_type)))))
+	/* All shifts must be by a constant multiple of the element size.  */
+	if (TREE_CODE (rhs2) != INTEGER_CST
+	    || !div_if_zero_remainder (rhs2, TYPE_SIZE (TREE_TYPE (rhs1_type))))
 	  {
-	    error ("non-element sized vector shift of floating point vector");
+	    error ("non-element sized vector shift");
 	    return true;
 	  }
 
diff --git a/gcc/tree.def b/gcc/tree.def
index e9af52e554babb100d49ea14f47c805cd5024949..5406ffe67c53ff3f12920ca8c965cf0740a079c2 100644
--- a/gcc/tree.def
+++ b/gcc/tree.def
@@ -1240,7 +1240,8 @@ DEFTREECODE (FMA_EXPR, "fma_expr", tcc_expression, 3)
 
 /* Whole vector left/right shift in bits.
    Operand 0 is a vector to be shifted.
-   Operand 1 is an integer shift amount in bits.  */
+   Operand 1 is an integer shift amount in bits, which must be a multiple of the
+   element size.  */
 DEFTREECODE (VEC_LSHIFT_EXPR, "vec_lshift_expr", tcc_binary, 2)
 DEFTREECODE (VEC_RSHIFT_EXPR, "vec_rshift_expr", tcc_binary, 2)
 \f

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 10/14][AArch64] Implement vec_shr optab
  2014-09-18 11:41 [PATCH 0/14+2][Vectorizer] Made reductions endianness-neutral, fixes PR/61114 Alan Lawrence
                   ` (8 preceding siblings ...)
  2014-09-18 12:27 ` [PATCH 9/14] Enforce whole-vector-shifts to always be by a whole number of elements Alan Lawrence
@ 2014-09-18 12:34 ` Alan Lawrence
  2014-09-18 12:35 ` [PATCH 11/14] Remove VEC_LSHIFT_EXPR and vec_shl_optab Alan Lawrence
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 52+ messages in thread
From: Alan Lawrence @ 2014-09-18 12:34 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2250 bytes --]

This allows reduction of non-(plus|min|max) operations using log_2(N) shifts 
rather than N vec_extracts; e.g. for the example code below

int
main (unsigned char argc, char **argv)
{
   unsigned char in[16] = { 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31 };
   unsigned char i = 0;
   unsigned char sum = 1;

   /* Prevent constant propagation of the entire loop below.  */
   asm volatile ("" : : : "memory");

   for (i = 0; i < 16; i++)
     sum *= in[i];

   if (sum != 33)
       __builtin_printf("Failed %d\n", sum);
}

(a simplified, less-general version of vect-reduc-mul_1.c) this gives

main:
         ldr     q0, .LC0
         sub     sp, sp, #16
         str     q0, [sp]
         ldr     q1, [sp]
         movi    v0.4s, 0
         ext     v2.16b, v1.16b, v0.16b, #8
         mul     v1.16b, v1.16b, v2.16b
         ext     v2.16b, v1.16b, v0.16b, #4
         mul     v1.16b, v2.16b, v1.16b
         ext     v2.16b, v1.16b, v0.16b, #2
         mul     v1.16b, v2.16b, v1.16b
         ext     v0.16b, v1.16b, v0.16b, #1
         mul     v0.16b, v0.16b, v1.16b
         umov    w1, v0.b[0]
         cmp     w1, 33
         beq     .L2
         ...

rather than previously:

main:
         ldr     q0, .LC0
         sub     sp, sp, #16
         str     q0, [sp]
         ldr     d1, [sp]
         ldr     d0, [sp, 8]
         mul     v0.8b, v0.8b, v1.8b
         umov    w0, v0.b[1]
         umov    w3, v0.b[0]
         umov    w2, v0.b[2]
         umov    w7, v0.b[3]
         umov    w6, v0.b[4]
         mul     w3, w0, w3
         umov    w5, v0.b[5]
         umov    w4, v0.b[6]
         umov    w1, v0.b[7]
         mul     w3, w3, w2
         mul     w2, w3, w7
         mul     w2, w2, w6
         mul     w0, w2, w5
         mul     w0, w0, w4
         mul     w1, w0, w1
         uxtb    w1, w1
         cmp     w1, 33
         beq     .L2
         ...


Tested check-gcc on aarch64-none-elf and aarch64_be-none-elf. (Including new 
tests from previous patches.)

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (vec_shr_<mode>): New define_insn and
	define_expand.

gcc/testsuite/ChangeLog:

	* lib/target-supports.exp (check_effective_target_whole_vector_shift):
	Add aarch64*-*-*.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 10_aarch64_vec_shr.patch --]
[-- Type: text/x-patch; name=10_aarch64_vec_shr.patch, Size: 2323 bytes --]

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index d4a745be59897b4cb2a0de23adb56b5d79203592..3fcf809113d73b37a95653b8c2be432478d2bc1e 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -770,6 +770,45 @@
   }
 )
 
+;; For 64-bit modes we use ushl/r, as this does not require a SIMD zero.
+(define_insn "vec_shr_<mode>"
+  [(set (match_operand:VD 0 "register_operand" "=w")
+        (lshiftrt:VD (match_operand:VD 1 "register_operand" "w")
+		     (match_operand:SI 2 "immediate_operand" "i")))]
+  "TARGET_SIMD"
+  "ushr %d0, %d1, %2"
+  [(set_attr "type" "neon_shift_imm")]
+)
+
+(define_expand "vec_shr_<mode>"
+  [(set (match_operand:VQ 0 "register_operand" "=w")
+        (lshiftrt:VQ (match_operand:VQ 1 "register_operand" "w")
+		      (match_operand:SI 2 "immediate_operand" "i")))]
+  "TARGET_SIMD"
+{
+  HOST_WIDE_INT num_bits = INTVAL (operands[2]);
+  HOST_WIDE_INT elem_bits = GET_MODE_BITSIZE (GET_MODE_INNER (<MODE>mode));
+  rtx zero_reg = force_reg (<MODE>mode, CONST0_RTX (<MODE>mode));
+
+  gcc_assert (GET_MODE_BITSIZE (<MODE>mode) == 128);
+  gcc_assert (num_bits % elem_bits == 0);
+
+  if (num_bits == 0)
+    {
+      emit_move_insn (operands[0], operands[1]);
+      DONE;
+    }
+  else if (num_bits == 128)
+    {
+      emit_move_insn (operands[0], CONST0_RTX (<MODE>mode));
+      DONE;
+    }
+
+  emit_insn (gen_aarch64_ext<mode> (operands[0], operands[1], zero_reg,
+		      GEN_INT (num_bits / elem_bits)));
+  DONE;
+})
+
 (define_insn "aarch64_simd_vec_setv2di"
   [(set (match_operand:V2DI 0 "register_operand" "=w,w")
         (vec_merge:V2DI
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index 5e40f5fcdfc95e41e804075bb5daa7030eb9bc66..720cc345bf6a76470cc85116d7b3365be07caa97 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -3323,6 +3323,7 @@ proc check_effective_target_vect_shift { } {
 proc check_effective_target_whole_vector_shift { } {
     if { [istarget x86_64-*-*]
 	 || [istarget ia64-*-*]
+	 || [istarget aarch64*-*-*]
 	 || ([check_effective_target_arm32]
 	     && [check_effective_target_arm_little_endian])
 	 || ([istarget mips*-*-*]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 11/14] Remove VEC_LSHIFT_EXPR and vec_shl_optab
  2014-09-18 11:41 [PATCH 0/14+2][Vectorizer] Made reductions endianness-neutral, fixes PR/61114 Alan Lawrence
                   ` (9 preceding siblings ...)
  2014-09-18 12:34 ` [PATCH 10/14][AArch64] Implement vec_shr optab Alan Lawrence
@ 2014-09-18 12:35 ` Alan Lawrence
  2014-09-22 10:52   ` Richard Biener
  2014-09-18 12:43 ` [PATCH 12/14][Vectorizer] Redefine VEC_RSHIFT_EXPR and vec_shr_optab as endianness-neutral Alan Lawrence
                   ` (5 subsequent siblings)
  16 siblings, 1 reply; 52+ messages in thread
From: Alan Lawrence @ 2014-09-18 12:35 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1129 bytes --]

The VEC_LSHIFT_EXPR tree code, and the corresponding vec_shl_optab, seem to have 
been added for completeness, providing a counterpart to VEC_RSHIFT_EXPR and 
vec_shr_optab. However, whereas VEC_RSHIFT_EXPRs are generated (only) by the 
vectorizer, VEC_LSHIFT_EXPR expressions are not generated at all, so there seems 
little point in maintaining it.

Bootstrapped on x86_64-unknown-linux-gnu.
aarch64.exp+vect.exp on aarch64-none-elf and aarch64_be-none-elf.

gcc/ChangeLog:

	* expr.c (expand_expr_real_2): Remove code handling VEC_LSHIFT_EXPR.
	* fold-const.c (const_binop): Likewise.
	* cfgexpand.c (expand_debug_expr): Likewise.
	* tree-inline.c (estimate_operator_cost): Likewise.
	* tree-pretty-print.c (dump_generic_node, op_code_prio,
	op_symbol_code): Likewise.
	* tree-cfg.c (verify_gimple_assign_binary): Likewise.
	* tree-vect-generic.c (expand_vector_operations_1): Likewise.
	* optabs.c (optab_for_tree_code): Likewise.
	(expand_vec_shift_expr): Likewise, update comment.
	* tree.def: Delete VEC_LSHIFT_EXPR, remove comment.
	* optabs.h (expand_vec_shift_expr): Remove comment re. VEC_LSHIFT_EXPR.
	* optabs.def: Remove vec_shl_optab.
	* doc/md.texi: Remove references to vec_shl_m.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 11_remove_vec_lshift.patch --]
[-- Type: text/x-patch; name=11_remove_vec_lshift.patch, Size: 8812 bytes --]

diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
index f6da5d632f441544fdacafc266e9cf17083a825a..6b46b08538c01190215a174773dfcb1109134873 100644
--- a/gcc/cfgexpand.c
+++ b/gcc/cfgexpand.c
@@ -4592,7 +4592,6 @@ expand_debug_expr (tree exp)
     case REDUC_MIN_EXPR:
     case REDUC_PLUS_EXPR:
     case VEC_COND_EXPR:
-    case VEC_LSHIFT_EXPR:
     case VEC_PACK_FIX_TRUNC_EXPR:
     case VEC_PACK_SAT_EXPR:
     case VEC_PACK_TRUNC_EXPR:
diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index a78aea2f3f6e35b0d89719a42d734e62a2f5bd65..f94e0f62c622d43e2df0d0619fb1eba74c415165 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -4883,10 +4883,9 @@ operand 1. Add operand 1 to operand 2 and place the widened result in
 operand 0. (This is used express accumulation of elements into an accumulator
 of a wider mode.)
 
-@cindex @code{vec_shl_@var{m}} instruction pattern
 @cindex @code{vec_shr_@var{m}} instruction pattern
-@item @samp{vec_shl_@var{m}}, @samp{vec_shr_@var{m}}
-Whole vector left/right shift in bits.
+@item @samp{vec_shr_@var{m}}
+Whole vector right shift in bits.
 Operand 1 is a vector to be shifted.
 Operand 2 is an integer shift amount in bits, which must be a multiple of the
 element size.
diff --git a/gcc/expr.c b/gcc/expr.c
index 11930ca121e4e1f3807261a2e5b0ca4f6723176d..30ea87af3ef102d7071c6c29db37df875af316f5 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -9053,7 +9053,6 @@ expand_expr_real_2 (sepops ops, rtx target, enum machine_mode tmode,
 	return temp;
       }
 
-    case VEC_LSHIFT_EXPR:
     case VEC_RSHIFT_EXPR:
       {
 	target = expand_vec_shift_expr (ops, target);
diff --git a/gcc/fold-const.c b/gcc/fold-const.c
index b8baa94d37a74ebb824e2a4d03f2a10befcdf749..bd4ba5f0c64c710df9fa36d4059f7b08e949fae0 100644
--- a/gcc/fold-const.c
+++ b/gcc/fold-const.c
@@ -1406,8 +1406,7 @@ const_binop (enum tree_code code, tree arg1, tree arg2)
       int count = TYPE_VECTOR_SUBPARTS (type), i;
       tree *elts = XALLOCAVEC (tree, count);
 
-      if (code == VEC_LSHIFT_EXPR
-	  || code == VEC_RSHIFT_EXPR)
+      if (code == VEC_RSHIFT_EXPR)
 	{
 	  if (!tree_fits_uhwi_p (arg2))
 	    return NULL_TREE;
@@ -1419,11 +1418,10 @@ const_binop (enum tree_code code, tree arg1, tree arg2)
 	  if (shiftc >= outerc || (shiftc % innerc) != 0)
 	    return NULL_TREE;
 	  int offset = shiftc / innerc;
-	  /* The direction of VEC_[LR]SHIFT_EXPR is endian dependent.
-	     For reductions, compiler emits VEC_RSHIFT_EXPR always,
-	     for !BYTES_BIG_ENDIAN picks first vector element, but
-	     for BYTES_BIG_ENDIAN last element from the vector.  */
-	  if ((code == VEC_RSHIFT_EXPR) ^ (!BYTES_BIG_ENDIAN))
+	  /* The direction of VEC_RSHIFT_EXPR is endian dependent.
+	     For reductions, if !BYTES_BIG_ENDIAN then compiler picks first
+	     vector element, but last element if BYTES_BIG_ENDIAN.  */
+	  if (BYTES_BIG_ENDIAN)
 	    offset = -offset;
 	  tree zero = build_zero_cst (TREE_TYPE (type));
 	  for (i = 0; i < count; i++)
diff --git a/gcc/optabs.c b/gcc/optabs.c
index e422bcce18d06a39b26547b510c35858efc2303e..9c5b5daa6f2b51bda5ba92fcd61534f1dd55e646 100644
--- a/gcc/optabs.c
+++ b/gcc/optabs.c
@@ -515,9 +515,6 @@ optab_for_tree_code (enum tree_code code, const_tree type,
     case REDUC_PLUS_EXPR:
       return reduc_plus_scal_optab;
 
-    case VEC_LSHIFT_EXPR:
-      return vec_shl_optab;
-
     case VEC_RSHIFT_EXPR:
       return vec_shr_optab;
 
@@ -765,7 +762,7 @@ force_expand_binop (enum machine_mode mode, optab binoptab,
   return true;
 }
 
-/* Generate insns for VEC_LSHIFT_EXPR, VEC_RSHIFT_EXPR.  */
+/* Generate insns for VEC_RSHIFT_EXPR.  */
 
 rtx
 expand_vec_shift_expr (sepops ops, rtx target)
@@ -776,21 +773,10 @@ expand_vec_shift_expr (sepops ops, rtx target)
   enum machine_mode mode = TYPE_MODE (ops->type);
   tree vec_oprnd = ops->op0;
   tree shift_oprnd = ops->op1;
-  optab shift_optab;
 
-  switch (ops->code)
-    {
-      case VEC_RSHIFT_EXPR:
-	shift_optab = vec_shr_optab;
-	break;
-      case VEC_LSHIFT_EXPR:
-	shift_optab = vec_shl_optab;
-	break;
-      default:
-	gcc_unreachable ();
-    }
+  gcc_assert (ops->code == VEC_RSHIFT_EXPR);
 
-  icode = optab_handler (shift_optab, mode);
+  icode = optab_handler (vec_shr_optab, mode);
   gcc_assert (icode != CODE_FOR_nothing);
 
   rtx_op1 = expand_normal (vec_oprnd);
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 131ea048b012b073345be3b426d4ac8f33061809..a07e1639ed680ad49765cfe7b2df020df06f4e29 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -276,7 +276,6 @@ OPTAB_D (vec_perm_const_optab, "vec_perm_const$a")
 OPTAB_D (vec_perm_optab, "vec_perm$a")
 OPTAB_D (vec_realign_load_optab, "vec_realign_load_$a")
 OPTAB_D (vec_set_optab, "vec_set$a")
-OPTAB_D (vec_shl_optab, "vec_shl_$a")
 OPTAB_D (vec_shr_optab, "vec_shr_$a")
 OPTAB_D (vec_unpacks_float_hi_optab, "vec_unpacks_float_hi_$a")
 OPTAB_D (vec_unpacks_float_lo_optab, "vec_unpacks_float_lo_$a")
diff --git a/gcc/optabs.h b/gcc/optabs.h
index d9f4900620a13d74fc3dfb1bac9bcb34416012de..1085047721ed1350866de3b1c981531a3095d93e 100644
--- a/gcc/optabs.h
+++ b/gcc/optabs.h
@@ -240,7 +240,7 @@ bool expand_vec_cond_expr_p (tree, tree);
 
 /* Generate code for VEC_COND_EXPR.  */
 extern rtx expand_vec_cond_expr (tree, tree, tree, tree, rtx);
-/* Generate code for VEC_LSHIFT_EXPR and VEC_RSHIFT_EXPR.  */
+/* Generate code for VEC_RSHIFT_EXPR.  */
 extern rtx expand_vec_shift_expr (sepops, rtx);
 
 /* Return true if target supports vector operations for VEC_PERM_EXPR.  */
diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c
index 1ea2e256b09b25331810a57a9c35e5cc875d0404..7b73090f26b001db400c436dae8a250c0d06a6dc 100644
--- a/gcc/tree-cfg.c
+++ b/gcc/tree-cfg.c
@@ -3648,7 +3648,6 @@ verify_gimple_assign_binary (gimple stmt)
 	return false;
       }
 
-    case VEC_LSHIFT_EXPR:
     case VEC_RSHIFT_EXPR:
       {
 	if (TREE_CODE (rhs1_type) != VECTOR_TYPE
diff --git a/gcc/tree-inline.c b/gcc/tree-inline.c
index b6ecaa4b25a9a7f907ace67332ae6b1540189c4c..ca5a676a2c5b93c6e7adfdc6b5e3d96847c797dc 100644
--- a/gcc/tree-inline.c
+++ b/gcc/tree-inline.c
@@ -3689,7 +3689,6 @@ estimate_operator_cost (enum tree_code code, eni_weights *weights,
     case RSHIFT_EXPR:
     case LROTATE_EXPR:
     case RROTATE_EXPR:
-    case VEC_LSHIFT_EXPR:
     case VEC_RSHIFT_EXPR:
 
     case BIT_IOR_EXPR:
diff --git a/gcc/tree-pretty-print.c b/gcc/tree-pretty-print.c
index aee03319cf0bb8fa06fb420d111461b036749164..2d18d56115bcc0873f88973865cf98164c466491 100644
--- a/gcc/tree-pretty-print.c
+++ b/gcc/tree-pretty-print.c
@@ -1836,7 +1836,6 @@ dump_generic_node (pretty_printer *buffer, tree node, int spc, int flags,
     case RSHIFT_EXPR:
     case LROTATE_EXPR:
     case RROTATE_EXPR:
-    case VEC_LSHIFT_EXPR:
     case VEC_RSHIFT_EXPR:
     case WIDEN_LSHIFT_EXPR:
     case BIT_IOR_EXPR:
@@ -3001,7 +3000,6 @@ op_code_prio (enum tree_code code)
     case REDUC_MAX_EXPR:
     case REDUC_MIN_EXPR:
     case REDUC_PLUS_EXPR:
-    case VEC_LSHIFT_EXPR:
     case VEC_RSHIFT_EXPR:
     case VEC_UNPACK_HI_EXPR:
     case VEC_UNPACK_LO_EXPR:
@@ -3112,9 +3110,6 @@ op_symbol_code (enum tree_code code)
     case RROTATE_EXPR:
       return "r>>";
 
-    case VEC_LSHIFT_EXPR:
-      return "v<<";
-
     case VEC_RSHIFT_EXPR:
       return "v>>";
 
diff --git a/gcc/tree-vect-generic.c b/gcc/tree-vect-generic.c
index f631c99cc0b2a5af244033162f86bb409575d494..b9a0aea6ddd58f972e05fbd506981a10687b349b 100644
--- a/gcc/tree-vect-generic.c
+++ b/gcc/tree-vect-generic.c
@@ -1595,7 +1595,7 @@ expand_vector_operations_1 (gimple_stmt_iterator *gsi)
   if (compute_type == type)
     return;
 
-  gcc_assert (code != VEC_LSHIFT_EXPR && code != VEC_RSHIFT_EXPR);
+  gcc_assert (code != VEC_RSHIFT_EXPR);
   new_rhs = expand_vector_operation (gsi, type, compute_type, stmt, code);
 
   /* Leave expression untouched for later expansion.  */
diff --git a/gcc/tree.def b/gcc/tree.def
index 5406ffe67c53ff3f12920ca8c965cf0740a079c2..ff56bfc18bc00e8dac2dfc072fd4fa878a0f2a04 100644
--- a/gcc/tree.def
+++ b/gcc/tree.def
@@ -1238,11 +1238,10 @@ DEFTREECODE (WIDEN_LSHIFT_EXPR, "widen_lshift_expr", tcc_binary, 2)
    before adding operand three.  */
 DEFTREECODE (FMA_EXPR, "fma_expr", tcc_expression, 3)
 
-/* Whole vector left/right shift in bits.
+/* Whole vector right shift in bits.
    Operand 0 is a vector to be shifted.
    Operand 1 is an integer shift amount in bits, which must be a multiple of the
    element size.  */
-DEFTREECODE (VEC_LSHIFT_EXPR, "vec_lshift_expr", tcc_binary, 2)
 DEFTREECODE (VEC_RSHIFT_EXPR, "vec_rshift_expr", tcc_binary, 2)
 \f
 /* Widening vector multiplication.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 12/14][Vectorizer] Redefine VEC_RSHIFT_EXPR and vec_shr_optab as endianness-neutral
  2014-09-18 11:41 [PATCH 0/14+2][Vectorizer] Made reductions endianness-neutral, fixes PR/61114 Alan Lawrence
                   ` (10 preceding siblings ...)
  2014-09-18 12:35 ` [PATCH 11/14] Remove VEC_LSHIFT_EXPR and vec_shl_optab Alan Lawrence
@ 2014-09-18 12:43 ` Alan Lawrence
  2014-09-18 13:12   ` David Edelsohn
  2014-09-22 10:58   ` Richard Biener
  2014-09-18 12:45 ` [PATCH 13/14][AArch64_be] Fix vec_shr pattern to correctly implement endianness-neutral optab Alan Lawrence
                   ` (4 subsequent siblings)
  16 siblings, 2 replies; 52+ messages in thread
From: Alan Lawrence @ 2014-09-18 12:43 UTC (permalink / raw)
  To: gcc-patches
  Cc: David Edelsohn, Aldy Hernandez, Steve Ellcey, Eric Christopher

[-- Attachment #1: Type: text/plain, Size: 1084 bytes --]

The direction of VEC_RSHIFT_EXPR has been endian-dependent, contrary to the 
general principle that tree-level semantics should not depend on the target. 
This patch updates fold-const and the vectorizer (the only place where such 
expressions are created), such that VEC_RSHIFT_EXPR always shifts towards 
element 0.
The tree code still maps directly onto the vec_shr_optab, and so this patch 
*will break any bigendian platform defining the vec_shr optab*.
--> For AArch64_be, patch follows next in series;
--> For PowerPC, I think patch/rfc 15 should fix, please inspect;
--> For MIPS, I think patch/rfc 16 should fix, please inspect.

gcc/ChangeLog:

	* fold-const.c (const_binop): Assume VEC_RSHIFT_EXPR always shifts
	towards element 0.

	* tree-vect-loop.c (vect_create_epilog_for_reduction): Always extract
	the result of a reduction with vector shifts from element 0.

	* tree.def (VEC_RSHIFT_EXPR): Comment shift direction.

	* doc/md.texi (vec_shr_m): Document shift direction.

Testing Done:

Bootstrap and check-gcc on x86_64-none-linux-gnu; check-gcc on aarch64-none-elf.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 12_vec_rshift_endian_neutral.patch --]
[-- Type: text/x-patch; name=12_vec_rshift_endian_neutral.patch, Size: 3428 bytes --]

diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index f94e0f62c622d43e2df0d0619fb1eba74c415165..a2e8f297fbdd69dfec23e6e0769a21917b06b5c7 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -4885,7 +4885,7 @@ of a wider mode.)
 
 @cindex @code{vec_shr_@var{m}} instruction pattern
 @item @samp{vec_shr_@var{m}}
-Whole vector right shift in bits.
+Whole vector right shift in bits, i.e. towards element 0.
 Operand 1 is a vector to be shifted.
 Operand 2 is an integer shift amount in bits, which must be a multiple of the
 element size.
diff --git a/gcc/fold-const.c b/gcc/fold-const.c
index bd4ba5f0c64c710df9fa36d4059f7b08e949fae0..2a4fafa1b0634edd7a56f2484dec3a51a4699222 100644
--- a/gcc/fold-const.c
+++ b/gcc/fold-const.c
@@ -1418,15 +1418,10 @@ const_binop (enum tree_code code, tree arg1, tree arg2)
 	  if (shiftc >= outerc || (shiftc % innerc) != 0)
 	    return NULL_TREE;
 	  int offset = shiftc / innerc;
-	  /* The direction of VEC_RSHIFT_EXPR is endian dependent.
-	     For reductions, if !BYTES_BIG_ENDIAN then compiler picks first
-	     vector element, but last element if BYTES_BIG_ENDIAN.  */
-	  if (BYTES_BIG_ENDIAN)
-	    offset = -offset;
 	  tree zero = build_zero_cst (TREE_TYPE (type));
 	  for (i = 0; i < count; i++)
 	    {
-	      if (i + offset < 0 || i + offset >= count)
+	      if (i + offset >= count)
 		elts[i] = zero;
 	      else
 		elts[i] = VECTOR_CST_ELT (arg1, i + offset);
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index d0a29d312bfd9a7eb552d937e3c64cf9b30d558a..016e2c1fc839fc4d1c97caaa38064fb8bbb510d8 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -3860,7 +3860,7 @@ vect_create_epilog_for_reduction (vec<tree> vect_defs, gimple stmt,
   gimple epilog_stmt = NULL;
   enum tree_code code = gimple_assign_rhs_code (stmt);
   gimple exit_phi;
-  tree bitsize, bitpos;
+  tree bitsize;
   tree adjustment_def = NULL;
   tree vec_initial_def = NULL;
   tree reduction_op, expr, def;
@@ -4371,14 +4371,8 @@ vect_create_epilog_for_reduction (vec<tree> vect_defs, gimple stmt,
         dump_printf_loc (MSG_NOTE, vect_location,
 			 "extract scalar result\n");
 
-      if (BYTES_BIG_ENDIAN)
-        bitpos = size_binop (MULT_EXPR,
-                             bitsize_int (TYPE_VECTOR_SUBPARTS (vectype) - 1),
-                             TYPE_SIZE (scalar_type));
-      else
-        bitpos = bitsize_zero_node;
-
-      rhs = build3 (BIT_FIELD_REF, scalar_type, new_temp, bitsize, bitpos);
+      rhs = build3 (BIT_FIELD_REF, scalar_type,
+		    new_temp, bitsize, bitsize_zero_node);
       epilog_stmt = gimple_build_assign (new_scalar_dest, rhs);
       new_temp = make_ssa_name (new_scalar_dest, epilog_stmt);
       gimple_assign_set_lhs (epilog_stmt, new_temp);
diff --git a/gcc/tree.def b/gcc/tree.def
index ff56bfc18bc00e8dac2dfc072fd4fa878a0f2a04..90bc27fde303e1606baac858738a7a86a517573b 100644
--- a/gcc/tree.def
+++ b/gcc/tree.def
@@ -1238,7 +1238,7 @@ DEFTREECODE (WIDEN_LSHIFT_EXPR, "widen_lshift_expr", tcc_binary, 2)
    before adding operand three.  */
 DEFTREECODE (FMA_EXPR, "fma_expr", tcc_expression, 3)
 
-/* Whole vector right shift in bits.
+/* Whole vector right shift in bits, i.e. towards element 0.
    Operand 0 is a vector to be shifted.
    Operand 1 is an integer shift amount in bits, which must be a multiple of the
    element size.  */

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 13/14][AArch64_be] Fix vec_shr pattern to correctly implement endianness-neutral optab
  2014-09-18 11:41 [PATCH 0/14+2][Vectorizer] Made reductions endianness-neutral, fixes PR/61114 Alan Lawrence
                   ` (11 preceding siblings ...)
  2014-09-18 12:43 ` [PATCH 12/14][Vectorizer] Redefine VEC_RSHIFT_EXPR and vec_shr_optab as endianness-neutral Alan Lawrence
@ 2014-09-18 12:45 ` Alan Lawrence
  2014-09-22 10:52   ` Richard Biener
  2014-09-18 12:48 ` [PATCH 14/14][Vectorizer] Tidy up vect_create_epilog / use_scalar_result Alan Lawrence
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 52+ messages in thread
From: Alan Lawrence @ 2014-09-18 12:45 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 448 bytes --]

The previous patch broke aarch64_be by redefining VEC_RSHIFT_EXPR / 
vec_shr_optab to always shift the vector towards GCC's element 0. This patch 
updates the aarch64 vec_shr patterns so that big-endian does exactly that.

check-gcc on aarch64-none-elf (no changes) and aarch64_be-none-elf (fixes all 
regressions produced by the previous patch, i.e. no regressions relative to 
before vec_shr was redefined).


gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (vec_shr_<mode>): Fix define_insn and
	define_expand for big-endian.



[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 13_aarch64_vec_shr_neutral.patch --]
[-- Type: text/x-patch; name=13_aarch64_vec_shr_neutral.patch, Size: 968 bytes --]

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 3fcf809113d73b37a95653b8c2be432478d2bc1e..e45eddbda7528cfbb4b0953b2c9934c5408d2f6d 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -776,7 +776,12 @@
         (lshiftrt:VD (match_operand:VD 1 "register_operand" "w")
 		     (match_operand:SI 2 "immediate_operand" "i")))]
   "TARGET_SIMD"
-  "ushr %d0, %d1, %2"
+  {
+    if (BYTES_BIG_ENDIAN)
+      return "ushl %d0, %d1, %2";
+    else
+      return "ushr %d0, %d1, %2";
+  }
   [(set_attr "type" "neon_shift_imm")]
 )
 
@@ -804,6 +809,14 @@
       DONE;
     }
 
+  if (BYTES_BIG_ENDIAN)
+    {
+      rtx temp = operands[1];
+      operands[1] = zero_reg;
+      zero_reg = temp;
+      num_bits = 128 - num_bits;
+    }
+
   emit_insn (gen_aarch64_ext<mode> (operands[0], operands[1], zero_reg,
 		      GEN_INT (num_bits / elem_bits)));
   DONE;

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 14/14][Vectorizer] Tidy up vect_create_epilog / use_scalar_result
  2014-09-18 11:41 [PATCH 0/14+2][Vectorizer] Made reductions endianness-neutral, fixes PR/61114 Alan Lawrence
                   ` (12 preceding siblings ...)
  2014-09-18 12:45 ` [PATCH 13/14][AArch64_be] Fix vec_shr pattern to correctly implement endianness-neutral optab Alan Lawrence
@ 2014-09-18 12:48 ` Alan Lawrence
  2014-09-22 10:53   ` Richard Biener
  2014-09-18 12:58 ` [PATCH/RFC 15 / 14+2][RS6000] Remove vec_shl and (hopefully) fix vec_shr Alan Lawrence
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 52+ messages in thread
From: Alan Lawrence @ 2014-09-18 12:48 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 439 bytes --]

Following the earlier patches, vect_create_epilog_for_reduction contains 
exactly one case where extract_scalar_result == true. Hence, move the code 
guarded by 'if (extract_scalar_result)' to that one place, and tidy up / remove 
some variables.

bootstrapped on x86_64-none-linux-gnu + check-gcc + check-g++.

gcc/ChangeLog:

	* tree-vect-loop.c (vect_create_epilog_for_reduction): Move code for
	'if (extract_scalar_result)' to the only place that it is true.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 14_tidy_vect_create_epilog.patch --]
[-- Type: text/x-patch; name=14_tidy_vect_create_epilog.patch, Size: 2808 bytes --]

diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 016e2c1fc839fc4d1c97caaa38064fb8bbb510d8..62b279e4d29d1fdfbfbd4e606fc8be9d608d3707 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -3867,7 +3867,6 @@ vect_create_epilog_for_reduction (vec<tree> vect_defs, gimple stmt,
   tree orig_name, scalar_result;
   imm_use_iterator imm_iter, phi_imm_iter;
   use_operand_p use_p, phi_use_p;
-  bool extract_scalar_result = false;
   gimple use_stmt, orig_stmt, reduction_phi = NULL;
   bool nested_in_vect_loop = false;
   auto_vec<gimple> new_phis;
@@ -4235,6 +4234,8 @@ vect_create_epilog_for_reduction (vec<tree> vect_defs, gimple stmt,
                   Create:  va = vop <va, va'>
                 }  */
 
+          tree rhs;
+
           if (dump_enabled_p ())
             dump_printf_loc (MSG_NOTE, vect_location,
 			     "Reduce using vector shifts\n");
@@ -4260,7 +4261,20 @@ vect_create_epilog_for_reduction (vec<tree> vect_defs, gimple stmt,
               gsi_insert_before (&exit_gsi, epilog_stmt, GSI_SAME_STMT);
             }
 
-          extract_scalar_result = true;
+	  /* 2.4  Extract the final scalar result.  Create:
+	     s_out3 = extract_field <v_out2, bitpos>  */
+
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "extract scalar result\n");
+
+	  rhs = build3 (BIT_FIELD_REF, scalar_type, new_temp,
+			bitsize, bitsize_zero_node);
+	  epilog_stmt = gimple_build_assign (new_scalar_dest, rhs);
+	  new_temp = make_ssa_name (new_scalar_dest, epilog_stmt);
+	  gimple_assign_set_lhs (epilog_stmt, new_temp);
+	  gsi_insert_before (&exit_gsi, epilog_stmt, GSI_SAME_STMT);
+	  scalar_results.safe_push (new_temp);
         }
       else
         {
@@ -4355,30 +4369,8 @@ vect_create_epilog_for_reduction (vec<tree> vect_defs, gimple stmt,
           else
             /* Not SLP - we have one scalar to keep in SCALAR_RESULTS.  */
             scalar_results.safe_push (new_temp);
-
-          extract_scalar_result = false;
         }
     }
-
-  /* 2.4  Extract the final scalar result.  Create:
-          s_out3 = extract_field <v_out2, bitpos>  */
-
-  if (extract_scalar_result)
-    {
-      tree rhs;
-
-      if (dump_enabled_p ())
-        dump_printf_loc (MSG_NOTE, vect_location,
-			 "extract scalar result\n");
-
-      rhs = build3 (BIT_FIELD_REF, scalar_type,
-		    new_temp, bitsize, bitsize_zero_node);
-      epilog_stmt = gimple_build_assign (new_scalar_dest, rhs);
-      new_temp = make_ssa_name (new_scalar_dest, epilog_stmt);
-      gimple_assign_set_lhs (epilog_stmt, new_temp);
-      gsi_insert_before (&exit_gsi, epilog_stmt, GSI_SAME_STMT);
-      scalar_results.safe_push (new_temp);
-    }
   
 vect_finalize_reduction:
 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH/RFC 15 / 14+2][RS6000] Remove vec_shl and (hopefully) fix vec_shr
  2014-09-18 11:41 [PATCH 0/14+2][Vectorizer] Made reductions endianness-neutral, fixes PR/61114 Alan Lawrence
                   ` (13 preceding siblings ...)
  2014-09-18 12:48 ` [PATCH 14/14][Vectorizer] Tidy up vect_create_epilog / use_scalar_result Alan Lawrence
@ 2014-09-18 12:58 ` Alan Lawrence
  2014-09-23 12:50   ` David Edelsohn
  2014-09-18 13:02 ` [PATCH 16 / 14+2][MIPS] " Alan Lawrence
  2014-09-22 11:21 ` [PATCH 0/14+2][Vectorizer] Made reductions endianness-neutral, fixes PR/61114 Richard Biener
  16 siblings, 1 reply; 52+ messages in thread
From: Alan Lawrence @ 2014-09-18 12:58 UTC (permalink / raw)
  To: gcc-patches; +Cc: David Edelsohn, Aldy Hernandez

[-- Attachment #1: Type: text/plain, Size: 831 bytes --]

Patch 12 of 14 (https://gcc.gnu.org/ml/gcc-patches/2014-09/msg01475.html) will 
break bigendian targets implementing vec_shr. This is a PowerPC parallel of 
patch 13 of 14 (https://gcc.gnu.org/ml/gcc-patches/2014-09/msg01477.html) for 
AArch64. I've checked that I can build a stage 1 compiler for powerpc-none-eabi 
and that the assembly output looks plausible, but no further than that.

In fact I find that BYTES_BIG_ENDIAN is defined to true on powerpcle-none-eabi 
as well as powerpc-none-eabi (and also on ppc64-none-elf, but to false on 
ppc64le-none-elf), so I'm not quite sure how your backend works in this regard; 
nonetheless I hope this is a helpful starting point, even if not definitive.

gcc/ChangeLog:

	* config/rs6000/vector.md (vec_shl_<mode>): Remove.
	(vec_shr_<mode>): Reverse shift if BYTES_BIG_ENDIAN.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 15_powerpc.patch --]
[-- Type: text/x-patch; name=15_powerpc.patch, Size: 2513 bytes --]

diff --git a/gcc/config/rs6000/vector.md b/gcc/config/rs6000/vector.md
index edbb83161d142b1a562735635fe90ef65b09fbbf..8bc010eb26526e2997d02ea7aef655e60eca8707 100644
--- a/gcc/config/rs6000/vector.md
+++ b/gcc/config/rs6000/vector.md
@@ -972,53 +972,11 @@
  "VECTOR_MEM_VSX_P (<MODE>mode) && TARGET_ALLOW_MOVMISALIGN"
  "")
 
-\f
-;; Vector shift left in bits.  Currently supported ony for shift
-;; amounts that can be expressed as byte shifts (divisible by 8).
-;; General shift amounts can be supported using vslo + vsl. We're
-;; not expecting to see these yet (the vectorizer currently
-;; generates only shifts divisible by byte_size).
-(define_expand "vec_shl_<mode>"
-  [(match_operand:VEC_L 0 "vlogical_operand" "")
-   (match_operand:VEC_L 1 "vlogical_operand" "")
-   (match_operand:QI 2 "reg_or_short_operand" "")]
-  "TARGET_ALTIVEC"
-  "
-{
-  rtx bitshift = operands[2];
-  rtx shift;
-  rtx insn;
-  HOST_WIDE_INT bitshift_val;
-  HOST_WIDE_INT byteshift_val;
-
-  if (! CONSTANT_P (bitshift))
-    FAIL;
-  bitshift_val = INTVAL (bitshift);
-  if (bitshift_val & 0x7)
-    FAIL;
-  byteshift_val = bitshift_val >> 3;
-  if (TARGET_VSX && (byteshift_val & 0x3) == 0)
-    {
-      shift = gen_rtx_CONST_INT (QImode, byteshift_val >> 2);
-      insn = gen_vsx_xxsldwi_<mode> (operands[0], operands[1], operands[1],
-				     shift);
-    }
-  else
-    {
-      shift = gen_rtx_CONST_INT (QImode, byteshift_val);
-      insn = gen_altivec_vsldoi_<mode> (operands[0], operands[1], operands[1],
-					shift);
-    }
-
-  emit_insn (insn);
-  DONE;
-}")
-
 ;; Vector shift right in bits. Currently supported ony for shift
 ;; amounts that can be expressed as byte shifts (divisible by 8).
 ;; General shift amounts can be supported using vsro + vsr. We're
 ;; not expecting to see these yet (the vectorizer currently
-;; generates only shifts divisible by byte_size).
+;; generates only shifts by a whole number of vector elements).
 (define_expand "vec_shr_<mode>"
   [(match_operand:VEC_L 0 "vlogical_operand" "")
    (match_operand:VEC_L 1 "vlogical_operand" "")
@@ -1037,7 +995,9 @@
   bitshift_val = INTVAL (bitshift);
   if (bitshift_val & 0x7)
     FAIL;
-  byteshift_val = 16 - (bitshift_val >> 3);
+  byteshift_val = (bitshift_val >> 3);
+  if (!BYTES_BIG_ENDIAN)
+    byteshift_val = 16 - byteshift_val;
   if (TARGET_VSX && (byteshift_val & 0x3) == 0)
     {
       shift = gen_rtx_CONST_INT (QImode, byteshift_val >> 2);

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 16 / 14+2][MIPS] Remove vec_shl and (hopefully) fix vec_shr
  2014-09-18 11:41 [PATCH 0/14+2][Vectorizer] Made reductions endianness-neutral, fixes PR/61114 Alan Lawrence
                   ` (14 preceding siblings ...)
  2014-09-18 12:58 ` [PATCH/RFC 15 / 14+2][RS6000] Remove vec_shl and (hopefully) fix vec_shr Alan Lawrence
@ 2014-09-18 13:02 ` Alan Lawrence
  2014-09-22 11:21 ` [PATCH 0/14+2][Vectorizer] Made reductions endianness-neutral, fixes PR/61114 Richard Biener
  16 siblings, 0 replies; 52+ messages in thread
From: Alan Lawrence @ 2014-09-18 13:02 UTC (permalink / raw)
  To: gcc-patches; +Cc: Steve Ellcey, Eric Christopher

[-- Attachment #1: Type: text/plain, Size: 772 bytes --]

Patch 12 of 14 (https://gcc.gnu.org/ml/gcc-patches/2014-09/msg01475.html) will
break bigendian targets implementing vec_shr. This is a MIPS parallel of
patch 13 of 14 (https://gcc.gnu.org/ml/gcc-patches/2014-09/msg01477.html) for
AArch64; the idea is that vec_shr should be unaffected on little-endian, but 
reversed (to be the same as the old vec_shl) if big-endian.

Manual inspection of the assembler output suggests this does the right thing 
on mips and mips64, but I haven't been able to run any testcases, so this is 
not definitive. I hope it is nonetheless helpful as a starting point!

gcc/ChangeLog:

	* config/mips/loongson.md (unspec): Remove UNSPEC_LOONGSON_DSLL.
	(vec_shl_<mode>): Remove.
	(vec_shr_<mode>): Reverse shift if BYTES_BIG_ENDIAN.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 16_mips.patch --]
[-- Type: text/x-patch; name=16_mips.patch, Size: 1420 bytes --]

diff --git a/gcc/config/mips/loongson.md b/gcc/config/mips/loongson.md
index 474033d1e2c244d3b70ad5ed630ab9f29d5fd5f6..dcba23440a5cb8cf0f2063ee15fbcf9d2a579714 100644
--- a/gcc/config/mips/loongson.md
+++ b/gcc/config/mips/loongson.md
@@ -39,7 +39,6 @@
   UNSPEC_LOONGSON_PUNPCKL
   UNSPEC_LOONGSON_PADDD
   UNSPEC_LOONGSON_PSUBD
-  UNSPEC_LOONGSON_DSLL
   UNSPEC_LOONGSON_DSRL
 ])
 
@@ -834,22 +833,18 @@
 })
 
 ;; Whole vector shifts, used for reduction epilogues.
-(define_insn "vec_shl_<mode>"
-  [(set (match_operand:VWHBDI 0 "register_operand" "=f")
-        (unspec:VWHBDI [(match_operand:VWHBDI 1 "register_operand" "f")
-                        (match_operand:SI 2 "register_operand" "f")]
-                       UNSPEC_LOONGSON_DSLL))]
-  "TARGET_HARD_FLOAT && TARGET_LOONGSON_VECTORS"
-  "dsll\t%0,%1,%2"
-  [(set_attr "type" "fcvt")])
-
 (define_insn "vec_shr_<mode>"
   [(set (match_operand:VWHBDI 0 "register_operand" "=f")
         (unspec:VWHBDI [(match_operand:VWHBDI 1 "register_operand" "f")
                         (match_operand:SI 2 "register_operand" "f")]
                        UNSPEC_LOONGSON_DSRL))]
   "TARGET_HARD_FLOAT && TARGET_LOONGSON_VECTORS"
-  "dsrl\t%0,%1,%2"
+  {
+    if (BYTES_BIG_ENDIAN)
+      return "dsll\t%0,%1,%2";
+    else
+      return "dsrl\t%0,%1,%2";
+  }
   [(set_attr "type" "fcvt")])
 
 (define_expand "reduc_uplus_<mode>"

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 12/14][Vectorizer] Redefine VEC_RSHIFT_EXPR and vec_shr_optab as endianness-neutral
  2014-09-18 12:43 ` [PATCH 12/14][Vectorizer] Redefine VEC_RSHIFT_EXPR and vec_shr_optab as endianness-neutral Alan Lawrence
@ 2014-09-18 13:12   ` David Edelsohn
  2014-09-22 13:27     ` Bill Schmidt
  2014-09-22 10:58   ` Richard Biener
  1 sibling, 1 reply; 52+ messages in thread
From: David Edelsohn @ 2014-09-18 13:12 UTC (permalink / raw)
  To: Alan Lawrence, William J. Schmidt
  Cc: gcc-patches, Aldy Hernandez, Steve Ellcey, Eric Christopher

On Thu, Sep 18, 2014 at 8:42 AM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> The direction of VEC_RSHIFT_EXPR has been endian-dependent, contrary to the
> general principles of tree. This patch updates fold-const and the vectorizer
> (the only place where such expressions are created), such that
> VEC_RSHIFT_EXPR always shifts towards element 0.
>
> The tree code still maps directly onto the vec_shr_optab, and so this patch
> *will break any bigendian platform defining the vec_shr optab*.
> --> For AArch64_be, patch follows next in series;
> --> For PowerPC, I think patch/rfc 15 should fix, please inspect;
> --> For MIPS, I think patch/rfc 16 should fix, please inspect.
>
> gcc/ChangeLog:
>
>         * fold-const.c (const_binop): VEC_RSHIFT_EXPR always shifts towards
>         element 0.
>
>         * tree-vect-loop.c (vect_create_epilog_for_reduction): always
> extract
>         the result of a reduction with vector shifts from element 0.
>
>         * tree.def (VEC_RSHIFT_EXPR, VEC_LSHIFT_EXPR): Comment shift
> direction.
>
>         * doc/md.texi (vec_shr_m, vec_shl_m): Document shift direction.
>
> Testing Done:
>
> Bootstrap and check-gcc on x86_64-none-linux-gnu; check-gcc on
> aarch64-none-elf.

Why wasn't this tested on the PowerLinux system in the GCC Compile Farm?

Also, Bill Schmidt can help check the PPC parts of the patches.

Thanks, David

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/14][Vectorizer] Make REDUC_xxx_EXPR tree codes produce a scalar result
  2014-09-18 11:51 ` [PATCH 2/14][Vectorizer] Make REDUC_xxx_EXPR tree codes produce a scalar result Alan Lawrence
@ 2014-09-22 10:34   ` Richard Biener
  2014-09-22 13:23     ` Alan Lawrence
  2014-09-24 15:02     ` Alan Lawrence
  0 siblings, 2 replies; 52+ messages in thread
From: Richard Biener @ 2014-09-22 10:34 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: gcc-patches

On Thu, Sep 18, 2014 at 1:50 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> This fixes PR/61114 by redefining the REDUC_{MIN,MAX,PLUS}_EXPR tree codes.
>
> These are presently documented as producing a vector with the result in
> element 0, and this is inconsistent with their use in tree-vect-loop.c
> (which on bigendian targets pulls the bits out of the wrong end of the
> vector result). This leads to bugs on bigendian targets - see also
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61114.
>
> I discounted "fixing" the vectorizer (to read from element 0) and then
> making bigendian targets (whose architectural insn produces the result in
> lane N-1) permute the result vector, as optimization of vectors in RTL seems
> unlikely to remove such a permute and would lead to a performance
> regression.
>
> Instead it seems more natural for the tree code to produce a scalar result
> (producing a vector with the result in lane 0 has already caused confusion,
> e.g. https://gcc.gnu.org/ml/gcc-patches/2012-10/msg01100.html).
>
> However, this patch preserves the meaning of the optab (producing a result
> in lane 0 on little-endian architectures or N-1 on bigendian), thus
> generally avoiding the need to change backends. Thus, expr.c extracts an
> endianness-dependent element from the optab result to give the result
> expected for the tree code.
>
> Previously posted as an RFC
> https://gcc.gnu.org/ml/gcc-patches/2014-08/msg00041.html , now with an extra
> VIEW_CONVERT_EXPR if the types of the reduction/result do not match.

Huh.  Does that ever happen?  Please use a NOP_EXPR instead of
a VIEW_CONVERT_EXPR.

Ok with that change.

Thanks,
Richard.

> Testing:
>         x86_86-none-linux-gnu: bootstrap, check-gcc, check-g++
>         aarch64-none-linux-gnu: bootstrap
>         aarch64-none-elf:  check-gcc, check-g++
>         arm-none-eabi: check-gcc
>
>         aarch64_be-none-elf: check-gcc, showing
>         FAIL->PASS: gcc.dg/vect/no-scevccp-outer-7.c execution test
>         FAIL->PASS: gcc.dg/vect/no-scevccp-outer-13.c execution test
>         Passes the (previously-failing) reduced testcase on
>                 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61114
>
>         Have also assembler/stage-1 tested that testcase on PowerPC, also
> fixed.

> gcc/ChangeLog:
>
>         * expr.c (expand_expr_real_2): For REDUC_{MIN,MAX,PLUS}_EXPR, add
>         extract_bit_field around optab result.
>
>         * fold-const.c (fold_unary_loc): For REDUC_{MIN,MAX,PLUS}_EXPR,
> produce
>         scalar not vector.
>
>         * tree-cfg.c (verify_gimple_assign_unary): Check result vs operand
> type
>         for REDUC_{MIN,MAX,PLUS}_EXPR.
>
>         * tree-vect-loop.c (vect_analyze_loop): Update comment.
>         (vect_create_epilog_for_reduction): For direct vector reduction, use
>         result of tree code directly without extract_bit_field.
>
>         * tree.def (REDUC_MAX_EXPR, REDUC_MIN_EXPR, REDUC_PLUS_EXPR): Update
>         comment.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 3/14] Add new optabs for reducing vectors to scalars
  2014-09-18 11:54 ` [PATCH 3/14] Add new optabs for reducing vectors to scalars Alan Lawrence
@ 2014-09-22 10:40   ` Richard Biener
  2014-09-22 13:26     ` Alan Lawrence
  0 siblings, 1 reply; 52+ messages in thread
From: Richard Biener @ 2014-09-22 10:40 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: gcc-patches

On Thu, Sep 18, 2014 at 1:54 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> These match their corresponding tree codes, by taking a vector and returning
> a scalar; this is more architecturally neutral than the (somewhat loosely
> defined) previous optab that took a vector and returned a vector with the
> result in the least significant bits (i.e. element 0 for little-endian or
> N-1 for bigendian). However, the old optabs are preserved so as not to break
> existing backends, so clients check for both old + new optabs.
>
> Bootstrap, check-gcc and check-g++ on x86_64-none-linux-gnu.
> aarch64.exp + vect.exp on aarch64{,_be}-none-elf.
> (of course at this point in the series all these are using the old optab +
> migration path.)

scalar_reduc_to_vector misses a comment.

I wonder if at the end we wouldn't transition all backends and then
renaming reduc_*_scal_optab back to reduc_*_optab makes sense.

The optabs have only one mode - I wouldn't be surprised if an ISA
invents for example v4si -> di reduction?  So do we want to make
reduc_plus_scal_optab a little bit more future proof (maybe there
is already an ISA that supports this kind of reduction?).

Otherwise the patch looks good to me.

Thanks,
Richard.

> gcc/ChangeLog:
>
>         * doc/md.texi (Standard Names): Add reduc_(plus,[us](min|max))_scal
>         optabs, and note in reduc_[us](plus|min|max) to prefer the former.
>
>         * expr.c (expand_expr_real_2): Use reduc_..._scal if available, fall
>         back to old reduc_... + BIT_FIELD_REF only if not.
>
>         * optabs.c (optab_for_tree_code): for REDUC_(MAX,MIN,PLUS)_EXPR,
>         return the reduce-to-scalar (reduc_..._scal) optab.
>         (scalar_reduc_to_vector): New.
>
>         * optabs.def (reduc_smax_scal_optab, reduc_smin_scal_optab,
>         reduc_plus_scal_optab, reduc_umax_scal_optab,
> reduc_umin_scal_optab):
>         New.
>
>         * optabs.h (scalar_reduc_to_vector): Declare.
>
>         * tree-vect-loop.c (vectorizable_reduction): Look for optabs
> reducing
>         to either scalar or vector.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 7/14][Testsuite] Add tests of reductions using whole-vector-shifts (multiplication)
  2014-09-18 12:19 ` [PATCH 7/14][Testsuite] Add tests of reductions using whole-vector-shifts (multiplication) Alan Lawrence
@ 2014-09-22 10:41   ` Richard Biener
  0 siblings, 0 replies; 52+ messages in thread
From: Richard Biener @ 2014-09-22 10:41 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: gcc-patches, David Edelsohn, Aldy Hernandez

On Thu, Sep 18, 2014 at 2:19 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> For reduction operations (e.g. multiply) that don't have such a tree code,
> or where the target platform doesn't define an optab handler for the tree
> code, we can perform the reduction using a series of log(N) shifts (where N
> = #elements in vector), using the VEC_RSHIFT_EXPR=whole-vector-shift tree
> code (if the platform handles the vec_shr_optab).
>
> First stage is to add some tests of non-(min/max/plus) reductions; here,
> multiplies. The first is designed to be non-foldable, so we make sure the
> architectural instructions line up with what the tree codes specify. The
> second is designed to be easily constant-propagated, to test the (currently
> endianness-dependent) constant folding code.
>
> In lib/target-supports.exp, I've defined a new
> check_effective_target_whole_vector_shift, which I intended to define to
> true for platforms with the vec_shr optab. However, I've not managed to make
> this test pass on PowerPC - even with -maltivec, -fdump-tree-vect-details
> gives me a message about the target not supporting vector multiplication -
> so I've omitted PowerPC from the whole_vector_shift. This doesn't feel
> right, suggestions welcomed from PowerPC maintainers?
>
> Tests passing on arm-none-eabi and x86_64-none-linux-gnu;
> also verified the scan-tree-dump part works on ia64-none-linux-gnu (by
> compiling to assembly only).
> (Tests are not run on AArch64, because we have no vec_shr_optab at this
> point; PowerPC, as above; or MIPS, as check_effective_target_vect_int_mult
> yields 0.)
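
For reference, the log(N)-shifts epilogue described above corresponds roughly to
the GNU C sketch below (illustrative only: __builtin_shuffle stands in for the
whole-vector shift, a V4SI input is assumed, and the values shifted in do not
matter for the result):

typedef int v4si __attribute__ ((vector_size (16)));

int
reduc_plus_by_shifts (v4si v)
{
  const v4si zero = { 0, 0, 0, 0 };
  /* Shift by two elements towards element 0, then accumulate.  */
  v += __builtin_shuffle (v, zero, (v4si) { 2, 3, 4, 5 });
  /* Shift by one element and accumulate again; element 0 now holds the sum.  */
  v += __builtin_shuffle (v, zero, (v4si) { 1, 2, 3, 4 });
  return v[0];
}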

Ok.

Thanks,
Richard.

> gcc/testsuite/ChangeLog:
>
>         * lib/target-supports.exp
> (check_effective_target_whole_vector_shift):
>         New.
>
>         * gcc.dg/vect/vect-reduc-mul_1.c: New test.
>         * gcc.dg/vect/vect-reduc-mul_2.c: New test.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 8/14][Testsuite] Add tests of reductions using whole-vector-shifts (ior)
  2014-09-18 12:25 ` [PATCH 8/14][Testsuite] Add tests of reductions using whole-vector-shifts (ior) Alan Lawrence
@ 2014-09-22 10:42   ` Richard Biener
  0 siblings, 0 replies; 52+ messages in thread
From: Richard Biener @ 2014-09-22 10:42 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: gcc-patches

On Thu, Sep 18, 2014 at 2:25 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> These are like the previous patch, but using | rather than * - I was unable
> to get the previous test to pass on PowerPC and MIPS.
>
> I note there is no inherent vector operation here - a bitwise OR across a
> word, and a "reduction via shifts" using scalar (not vector) ops would be
> all that's necessary. However, GCC doesn't exploit this possibility at
> present, and I don't have any plans at present to add such myself.
>
> Passing on x86_64-linux-gnu, aarch64-none-elf, aarch64_be-none-elf,
> arm-none-eabi.
> The 'scan-tree-dump' part passes on mips64 and powerpc (although the latter
> is disabled as check_effective_target_whole_vector_shift gives 0, as per
> previous patch)
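
The scalar "reduction via shifts" mentioned above would amount to something like
the sketch below (illustrative only - it ORs together all the bytes of a 64-bit
word using plain scalar shifts, with no vector operations at all):

unsigned char
or_reduce_u64 (unsigned long long x)
{
  x |= x >> 32;
  x |= x >> 16;
  x |= x >> 8;
  return (unsigned char) x;
}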

Ok.

Thanks,
Richard.

> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/vect/vect-reduc-or_1.c: New test.
>         * gcc.dg/vect/vect-reduc-or_2.c: Likewise.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 9/14] Enforce whole-vector-shifts to always be by a whole number of elements
  2014-09-18 12:27 ` [PATCH 9/14] Enforce whole-vector-shifts to always be by a whole number of elements Alan Lawrence
@ 2014-09-22 10:50   ` Richard Biener
  0 siblings, 0 replies; 52+ messages in thread
From: Richard Biener @ 2014-09-22 10:50 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: gcc-patches

On Thu, Sep 18, 2014 at 2:27 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> The VEC_RSHIFT_EXPR is only ever used by the vectorizer in tree-vect-loop.c
> (vect_create_epilog_for_reduction), to shift the vector by a whole number of
> elements. The tree code allows more general shifts but only for integral
> types. This only causes pain and difficulty for backends (particularly for
> backends with different endiannesses), and enforcing that restriction for
> integral types too does no harm.
>
> bootstrapped on aarch64-none-linux-gnu and x86-64-none-linux-gnu
> check-gcc on aarch64-none-elf and x86_64-none-linux-gnu

Hmm, but then (coming from the tree / gimple level) all shifts can
be expressed with a VEC_PERM_EXPR.  And of course a general
whole-vector shift could be expressed using a VIEW_CONVERT_EXPR
to a 1-element integer vector and a regular [RL]SHIFT_EXPR and then
converting back.
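
At the C level that view-convert-and-shift idea looks roughly like the sketch
below (illustrative only; it assumes a target with __int128 and that the scalar
view numbers lanes in little-endian order):

typedef int v4si __attribute__ ((vector_size (16)));

v4si
shift_one_element (v4si x)
{
  unsigned __int128 t;
  __builtin_memcpy (&t, &x, sizeof t);   /* "view convert" to a 128-bit scalar */
  t >>= 32;                              /* shift by one 32-bit element */
  __builtin_memcpy (&x, &t, sizeof x);   /* and convert back to a vector */
  return x;
}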

So it seems to me that the vectorizer should instead emit a
VEC_PERM_EXPR (making sure the backends or the generic
vec_perm expansion code in optabs.c handles the whole-vector-shift
case in an optimal way).

The current VEC_RSHIFT_EXPR description lacks information
on what is shifted in btw (always zeros? the most significant bit (endian
dependent?!)).

So - can we instead remove VEC_[LR]SHIFT_EXPR?  Seems that
VEC_LSHIFT_EXPR is unused anyway, and thus vec_shl_optabs
as well.

Thanks,
Richard.

> gcc/ChangeLog:
>
>         * tree-cfg.c (verify_gimple_assign_binary): for VEC_RSHIFT_EXPR (and
>         VEC_LSHIFT_EXPR), require shifts to be by a whole number of elements
>         for all types, rather than only non-integral types.
>
>         * tree.def (VEC_LSHIFT_EXPR, VEC_RSHIFT_EXPR): Update comment.
>
>         * doc/md.texi (vec_shl_m, vec_shr_m): Update comment.
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 13/14][AArch64_be] Fix vec_shr pattern to correctly implement endianness-neutral optab
  2014-09-18 12:45 ` [PATCH 13/14][AArch64_be] Fix vec_shr pattern to correctly implement endianness-neutral optab Alan Lawrence
@ 2014-09-22 10:52   ` Richard Biener
  0 siblings, 0 replies; 52+ messages in thread
From: Richard Biener @ 2014-09-22 10:52 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: gcc-patches

On Thu, Sep 18, 2014 at 2:45 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> The previous patch broke aarch64_be by redefining VEC_RSHIFT_EXPR /
> vec_shr_optab to always shift the vector towards gcc's element 0. This fixes
> aarch64_be to do that.
>
> check-gcc on aarch64-none-elf (no changes) and aarch64_be-none-elf (fixes
> all regressions produced by previous patch, i.e. no regressions from before
> redefining vec_shr).

Using vector permutes would have avoided this I guess?

Richard.

>
> gcc/ChangeLog:
>
>         * config/aarch64/aarch64-simd.md (vec_shr_<mode> *2): Fix bigendian.
>
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 11/14] Remove VEC_LSHIFT_EXPR and vec_shl_optab
  2014-09-18 12:35 ` [PATCH 11/14] Remove VEC_LSHIFT_EXPR and vec_shl_optab Alan Lawrence
@ 2014-09-22 10:52   ` Richard Biener
  2014-10-27 18:45     ` Alan Lawrence
  0 siblings, 1 reply; 52+ messages in thread
From: Richard Biener @ 2014-09-22 10:52 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: gcc-patches

On Thu, Sep 18, 2014 at 2:35 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> The VEC_LSHIFT_EXPR tree code, and the corresponding vec_shl_optab, seem to
> have been added for completeness, providing a counterpart to VEC_RSHIFT_EXPR
> and vec_shr_optab. However, whereas VEC_RSHIFT_EXPRs are generated (only) by
> the vectorizer, VEC_LSHIFT_EXPR expressions are not generated at all, so
> there seems little point in maintaining it.
>
> Bootstrapped on x86_64-unknown-linux-gnu.
> aarch64.exp+vect.exp on aarch64-none-elf and aarch64_be-none-elf.

Ah, there it is ;)

Ok.

Thanks,
Richard.

> gcc/ChangeLog:
>
>         * expr.c (expand_expr_real_2): Remove code handling VEC_LSHIFT_EXPR.
>         * fold-const.c (const_binop): Likewise.
>         * cfgexpand.c (expand_debug_expr): Likewise.
>         * tree-inline.c (estimate_operator_cost, dump_generic_node,
>         op_code_prio, op_symbol_code): Likewise.
>         * tree-vect-generic.c (expand_vector_operations_1): Likewise.
>         * optabs.c (optab_for_tree_code): Likewise.
>         (expand_vec_shift_expr): Likewise, update comment.
>         * tree.def: Delete VEC_LSHIFT_EXPR, remove comment.
>         * optabs.h (expand_vec_shift_expr): Remove comment re.
> VEC_LSHIFT_EXPR.
>         * optabs.def: Remove vec_shl_optab.
>         * doc/md.texi: Remove references to vec_shr_m.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 14/14][Vectorizer] Tidy up vect_create_epilog / use_scalar_result
  2014-09-18 12:48 ` [PATCH 14/14][Vectorizer] Tidy up vect_create_epilog / use_scalar_result Alan Lawrence
@ 2014-09-22 10:53   ` Richard Biener
  2014-11-14 17:29     ` PUSHED: " Alan Lawrence
  0 siblings, 1 reply; 52+ messages in thread
From: Richard Biener @ 2014-09-22 10:53 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: gcc-patches

On Thu, Sep 18, 2014 at 2:48 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> Following earlier patches, vect_create_epilog_for_reduction contains exactly
> one case where extract_scalar_result==true. Hence, move the code 'if
> (extract_scalar_result)' there, and tidy-up/remove some variables.
>
> bootstrapped on x86_64-none-linux-gnu + check-gcc + check-g++.

Ok.

Thanks,
Richard.

> gcc/ChangeLog:
>
>         * tree-vect-loop.c (vect_create_epilog_for_reduction): Move code for
>         'if (extract_scalar_result)' to the only place that it is true.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 12/14][Vectorizer] Redefine VEC_RSHIFT_EXPR and vec_shr_optab as endianness-neutral
  2014-09-18 12:43 ` [PATCH 12/14][Vectorizer] Redefine VEC_RSHIFT_EXPR and vec_shr_optab as endianness-neutral Alan Lawrence
  2014-09-18 13:12   ` David Edelsohn
@ 2014-09-22 10:58   ` Richard Biener
  1 sibling, 0 replies; 52+ messages in thread
From: Richard Biener @ 2014-09-22 10:58 UTC (permalink / raw)
  To: Alan Lawrence
  Cc: gcc-patches, David Edelsohn, Aldy Hernandez, Steve Ellcey,
	Eric Christopher

On Thu, Sep 18, 2014 at 2:42 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> The direction of VEC_RSHIFT_EXPR has been endian-dependent, contrary to the
> general principles of tree. This patch updates fold-const and the vectorizer
> (the only place where such expressions are created), such that
> VEC_RSHIFT_EXPR always shifts towards element 0.
>
> The tree code still maps directly onto the vec_shr_optab, and so this patch
> *will break any bigendian platform defining the vec_shr optab*.
> --> For AArch64_be, patch follows next in series;
> --> For PowerPC, I think patch/rfc 15 should fix, please inspect;
> --> For MIPS, I think patch/rfc 16 should fix, please inspect.
>
> gcc/ChangeLog:
>
>         * fold-const.c (const_binop): VEC_RSHIFT_EXPR always shifts towards
>         element 0.
>
>         * tree-vect-loop.c (vect_create_epilog_for_reduction): always
> extract
>         the result of a reduction with vector shifts from element 0.
>
>         * tree.def (VEC_RSHIFT_EXPR, VEC_LSHIFT_EXPR): Comment shift
> direction.
>
>         * doc/md.texi (vec_shr_m, vec_shl_m): Document shift direction.
>
> Testing Done:
>
> Bootstrap and check-gcc on x86_64-none-linux-gnu; check-gcc on
> aarch64-none-elf.

As said elsewhere I'd like the vectorizer to use VEC_PERM_EXPRs
and the generic vec_perm expansion machinery handle the
case where the permute can be expressed using the vec_shr_optab.
You'd have, for a 1-element shift of V4SI x, VEC_PERM <x, { 0, 0, 0, 0
}, {4, 3, 2, 1 }>

I'd say that if the target says it can handle the constant permute just fine
then use the vec_perm_const expansion path.
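
In GNU C terms such a whole-element shift is just a constant two-operand shuffle,
e.g. the sketch below (illustrative only; the exact selector depends on the
lane-numbering convention adopted):

typedef int v4si __attribute__ ((vector_size (16)));

v4si
shift_towards_element_0 (v4si x)
{
  const v4si zero = { 0, 0, 0, 0 };
  /* Element 0 takes x[1], element 1 takes x[2], ..., element 3 takes a zero.  */
  return __builtin_shuffle (x, zero, (v4si) { 1, 2, 3, 4 });
}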

Richard.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/14+2][Vectorizer] Made reductions endianness-neutral, fixes PR/61114
  2014-09-18 11:41 [PATCH 0/14+2][Vectorizer] Made reductions endianness-neutral, fixes PR/61114 Alan Lawrence
                   ` (15 preceding siblings ...)
  2014-09-18 13:02 ` [PATCH 16 / 14+2][MIPS] " Alan Lawrence
@ 2014-09-22 11:21 ` Richard Biener
  2014-09-22 11:26   ` Richard Biener
                     ` (2 more replies)
  16 siblings, 3 replies; 52+ messages in thread
From: Richard Biener @ 2014-09-22 11:21 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: gcc-patches

On Thu, Sep 18, 2014 at 1:41 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> The end goal here is to remove this code from tree-vect-loop.c
> (vect_create_epilog_for_reduction):
>
>       if (BYTES_BIG_ENDIAN)
>         bitpos = size_binop (MULT_EXPR,
>                              bitsize_int (TYPE_VECTOR_SUBPARTS (vectype) -
> 1),
>                              TYPE_SIZE (scalar_type));
>       else
>
> as this is the root cause of PR/61114 (see testcase there, failing on all
> bigendian targets supporting reduc_[us]plus_optab). Quoting Richard Biener,
> "all code conditional on BYTES/WORDS_BIG_ENDIAN in tree-vect* is
> suspicious". The code snippet above is used on two paths:
>
> (Path 1) (patches 1-6) Reductions using REDUC_(PLUS|MIN|MAX)_EXPR =
> reduc_[us](plus|min|max)_optab.
> The optab is documented as "the scalar result is stored in the least
> significant bits of operand 0", but the tree code as "the first element in
> the vector holding the result of the reduction of all elements of the
> operand". This mismatch means that when the tree code is folded, the code
> snippet above reads the result from the wrong end of the vector.
>
> The strategy (as per
> https://gcc.gnu.org/ml/gcc-patches/2014-08/msg00041.html) is to define new
> tree codes and optabs that produce scalar results directly; this seems
> better than tying (the element of the vector into which the result is
> placed) to (the endianness of the target), and avoids generating extra moves
> on current bigendian targets. However, the previous optabs are retained for
> now as a migration strategy so as not to break existing backends; moving
> individual platforms over will follow.
>
> A complication here is on AArch64, where we directly generate
> REDUC_PLUS_EXPRs from intrinsics in gimple_fold_builtin; I temporarily
> remove this folding in order to decouple the midend and AArch64 backend.

Sounds fine.  I hope we can transition all backends for 5.0 and remove
the vector variant optabs (maybe renaming the scalar ones).

> (Path 2) (patches 7-13) Reductions using whole-vector-shifts, i.e.
> VEC_RSHIFT_EXPR and vec_shr_optab. Here the tree code as well as the optab
> is defined in an endianness-dependent way, leading to significant
> complication in fold-const.c. (Moreover, the "equivalent" vec_shl_optab is
> never used!). Few platforms appear to handle vec_shr_optab (and fewer
> bigendian - I see only PowerPC and MIPS), so it seems pertinent to change
> the existing optab to be endianness-neutral.
>
> Patch 10 defines vec_shr for AArch64, for the old specification; patch 13
> updates that implementation to fit the new endianness-neutral specification,
> serving as a guide for other existing backends. Patches/RFCs 15 and 16 are
> equivalents for MIPS and PowerPC; I haven't tested these but hope they act
> as useful pointers for the port maintainers.
>
> Finally patch 14 cleans up the affected part of tree-vect-loop.c
> (vect_create_epilog_for_reduction).

As said during the individual patches review I'd like the vectorizer to
use a VEC_PERM_EXPR instead of VEC_RSHIFT_EXPR (with
only whole-element amounts).  This means we can remove
VEC_RSHIFT_EXPR.  It also means that if the backend defines
vec_perm_const (which it really should) it can handle the special
permutes that boil down to a possibly more efficient vector shift
there (a good optimization anyway).  Until it does that all backends
would at least create correct code (with the endian dependent
vec_shr removed).

Richard.

> --Alan
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/14+2][Vectorizer] Made reductions endianness-neutral, fixes PR/61114
  2014-09-22 11:21 ` [PATCH 0/14+2][Vectorizer] Made reductions endianness-neutral, fixes PR/61114 Richard Biener
@ 2014-09-22 11:26   ` Richard Biener
  2014-10-06 17:31   ` Alan Lawrence
       [not found]   ` <5432D1A5.6080208@arm.com>
  2 siblings, 0 replies; 52+ messages in thread
From: Richard Biener @ 2014-09-22 11:26 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: gcc-patches

On Mon, Sep 22, 2014 at 1:21 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Thu, Sep 18, 2014 at 1:41 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
>> The end goal here is to remove this code from tree-vect-loop.c
>> (vect_create_epilog_for_reduction):
>>
>>       if (BYTES_BIG_ENDIAN)
>>         bitpos = size_binop (MULT_EXPR,
>>                              bitsize_int (TYPE_VECTOR_SUBPARTS (vectype) -
>> 1),
>>                              TYPE_SIZE (scalar_type));
>>       else
>>
>> as this is the root cause of PR/61114 (see testcase there, failing on all
>> bigendian targets supporting reduc_[us]plus_optab). Quoting Richard Biener,
>> "all code conditional on BYTES/WORDS_BIG_ENDIAN in tree-vect* is
>> suspicious". The code snippet above is used on two paths:
>>
>> (Path 1) (patches 1-6) Reductions using REDUC_(PLUS|MIN|MAX)_EXPR =
>> reduc_[us](plus|min|max)_optab.
>> The optab is documented as "the scalar result is stored in the least
>> significant bits of operand 0", but the tree code as "the first element in
>> the vector holding the result of the reduction of all elements of the
>> operand". This mismatch means that when the tree code is folded, the code
>> snippet above reads the result from the wrong end of the vector.
>>
>> The strategy (as per
>> https://gcc.gnu.org/ml/gcc-patches/2014-08/msg00041.html) is to define new
>> tree codes and optabs that produce scalar results directly; this seems
>> better than tying (the element of the vector into which the result is
>> placed) to (the endianness of the target), and avoids generating extra moves
>> on current bigendian targets. However, the previous optabs are retained for
>> now as a migration strategy so as not to break existing backends; moving
>> individual platforms over will follow.
>>
>> A complication here is on AArch64, where we directly generate
>> REDUC_PLUS_EXPRs from intrinsics in gimple_fold_builtin; I temporarily
>> remove this folding in order to decouple the midend and AArch64 backend.
>
> Sounds fine.  I hope we can transition all backends for 5.0 and remove
> the vector variant optabs (maybe renaming the scalar ones).
>
>> (Path 2) (patches 7-13) Reductions using whole-vector-shifts, i.e.
>> VEC_RSHIFT_EXPR and vec_shr_optab. Here the tree code as well as the optab
>> is defined in an endianness-dependent way, leading to significant
>> complication in fold-const.c. (Moreover, the "equivalent" vec_shl_optab is
>> never used!). Few platforms appear to handle vec_shr_optab (and fewer
>> bigendian - I see only PowerPC and MIPS), so it seems pertinent to change
>> the existing optab to be endianness-neutral.
>>
>> Patch 10 defines vec_shr for AArch64, for the old specification; patch 13
>> updates that implementation to fit the new endianness-neutral specification,
>> serving as a guide for other existing backends. Patches/RFCs 15 and 16 are
>> equivalents for MIPS and PowerPC; I haven't tested these but hope they act
>> as useful pointers for the port maintainers.
>>
>> Finally patch 14 cleans up the affected part of tree-vect-loop.c
>> (vect_create_epilog_for_reduction).
>
> As said during the individual patches review I'd like the vectorizer to
> use a VEC_PERM_EXPR instead of VEC_RSHIFT_EXPR (with
> only whole-element amounts).  This means we can remove
> VEC_RSHIFT_EXPR.  It also means that if the backend defines
> vec_perm_const (which it really should) it can handle the special
> permutes that boil down to a possibly more efficient vector shift
> there (a good optimization anyway).  Until it does that all backends
> would at least create correct code (with the endian dependent
> vec_shr removed).

It seems only Alpha completely lacks vec_perm_const but implements
vec_shr.

Richard.

> Richard.
>
>> --Alan
>>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/14][Vectorizer] Make REDUC_xxx_EXPR tree codes produce a scalar result
  2014-09-22 10:34   ` Richard Biener
@ 2014-09-22 13:23     ` Alan Lawrence
  2014-09-24 15:02     ` Alan Lawrence
  1 sibling, 0 replies; 52+ messages in thread
From: Alan Lawrence @ 2014-09-22 13:23 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

Richard Biener wrote:
> 
> Huh.  Does that ever happen?  Please use a NOP_EXPR instead of
> a VIEW_CONVERT_EXPR.

Yes, the testcase is gcc.target/i386/pr51235.c, which performs black magic***
with void *. (This testcase otherwise fails the verify_gimple_assign_unary check
in tree-cfg.c.)  However, the test also passes with your suggestion of NOP_EXPR,
so that's good by me.

***that is, computes the minimum

--Alan

> 
> Ok with that change.
> 
> Thanks,
> Richard.
> 
>> Testing:
>>         x86_64-none-linux-gnu: bootstrap, check-gcc, check-g++
>>         aarch64-none-linux-gnu: bootstrap
>>         aarch64-none-elf:  check-gcc, check-g++
>>         arm-none-eabi: check-gcc
>>
>>         aarch64_be-none-elf: check-gcc, showing
>>         FAIL->PASS: gcc.dg/vect/no-scevccp-outer-7.c execution test
>>         FAIL->PASS: gcc.dg/vect/no-scevccp-outer-13.c execution test
>>         Passes the (previously-failing) reduced testcase on
>>                 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61114
>>
>>         Have also assembler/stage-1 tested that testcase on PowerPC, also
>> fixed.
> 
>> gcc/ChangeLog:
>>
>>         * expr.c (expand_expr_real_2): For REDUC_{MIN,MAX,PLUS}_EXPR, add
>>         extract_bit_field around optab result.
>>
>>         * fold-const.c (fold_unary_loc): For REDUC_{MIN,MAX,PLUS}_EXPR,
>> produce
>>         scalar not vector.
>>
>>         * tree-cfg.c (verify_gimple_assign_unary): Check result vs operand
>> type
>>         for REDUC_{MIN,MAX,PLUS}_EXPR.
>>
>>         * tree-vect-loop.c (vect_analyze_loop): Update comment.
>>         (vect_create_epilog_for_reduction): For direct vector reduction, use
>>         result of tree code directly without extract_bit_field.
>>
>>         * tree.def (REDUC_MAX_EXPR, REDUC_MIN_EXPR, REDUC_PLUS_EXPR): Update
>>         comment.
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 3/14] Add new optabs for reducing vectors to scalars
  2014-09-22 10:40   ` Richard Biener
@ 2014-09-22 13:26     ` Alan Lawrence
  2014-09-22 13:38       ` Richard Biener
  0 siblings, 1 reply; 52+ messages in thread
From: Alan Lawrence @ 2014-09-22 13:26 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

Richard Biener wrote:
> 
> scalar_reduc_to_vector misses a comment.

Ok to reuse the comment in optabs.h in optabs.c also?

> I wonder if at the end we wouldn't transition all backends and then
> renaming reduc_*_scal_optab back to reduc_*_optab makes sense.

Yes, that sounds like a plan, the _scal is a bit of a mouthful.

> The optabs have only one mode - I wouldn't be surprised if an ISA
> invents for example v4si -> di reduction?  So do we want to make
> reduc_plus_scal_optab a little bit more future proof (maybe there
> is already an ISA that supports this kind of reduction?).

That sounds like a plausible thing for an ISA to do, indeed. However given these 
names are only used by the autovectorizer rather than directly, the question is 
what the corresponding source code looks like, and/or what changes to the 
autovectorizer we might have to make to (look for code to) exploit such an 
instruction. At this point I could go for a 
reduc_{plus,min_max}_scal_<mode><mode> which reduces from the first vector mode 
to the second scalar mode, and then make the vectorizer look only for cases 
where the second mode was the element type of the first; but I'm not sure I want 
to do anything more complicated than that at this stage. (However, indeed it 
would leave the possibility open for the future.)
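
For concreteness, the sort of source a v4si -> di reduction would serve is a
simple widening accumulation like the sketch below (illustrative only; as noted,
the vectorizer would still need changes to look for it):

long long
sum_widen (const int *a, int n)
{
  long long sum = 0;
  for (int i = 0; i < n; i++)
    sum += a[i];               /* 32-bit elements, 64-bit accumulator */
  return sum;
}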

--Alan

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 12/14][Vectorizer] Redefine VEC_RSHIFT_EXPR and vec_shr_optab as endianness-neutral
  2014-09-18 13:12   ` David Edelsohn
@ 2014-09-22 13:27     ` Bill Schmidt
  0 siblings, 0 replies; 52+ messages in thread
From: Bill Schmidt @ 2014-09-22 13:27 UTC (permalink / raw)
  To: Alan Lawrence
  Cc: David Edelsohn, gcc-patches, Aldy Hernandez, Steve Ellcey,
	Eric Christopher

On Thu, 2014-09-18 at 09:12 -0400, David Edelsohn wrote:
> On Thu, Sep 18, 2014 at 8:42 AM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> > The direction of VEC_RSHIFT_EXPR has been endian-dependent, contrary to the
> > general principles of tree. This patch updates fold-const and the vectorizer
> > (the only place where such expressions are created), such that
> > VEC_RSHIFT_EXPR always shifts towards element 0.
> >
> > The tree code still maps directly onto the vec_shr_optab, and so this patch
> > *will break any bigendian platform defining the vec_shr optab*.
> > --> For AArch64_be, patch follows next in series;
> > --> For PowerPC, I think patch/rfc 15 should fix, please inspect;
> > --> For MIPS, I think patch/rfc 16 should fix, please inspect.
> >
> > gcc/ChangeLog:
> >
> >         * fold-const.c (const_binop): VEC_RSHIFT_EXPR always shifts towards
> >         element 0.
> >
> >         * tree-vect-loop.c (vect_create_epilog_for_reduction): always
> > extract
> >         the result of a reduction with vector shifts from element 0.
> >
> >         * tree.def (VEC_RSHIFT_EXPR, VEC_LSHIFT_EXPR): Comment shift
> > direction.
> >
> >         * doc/md.texi (vec_shr_m, vec_shl_m): Document shift direction.
> >
> > Testing Done:
> >
> > Bootstrap and check-gcc on x86_64-none-linux-gnu; check-gcc on
> > aarch64-none-elf.
> 
> Why wasn't this tested on the PowerLinux system in the GCC Compile Farm?
> 
> Also, Bill Schmidt can help check the PPC parts fo the patches.

Sorry for the late response; I just returned from vacation.  I think
that patch 15 looks reasonable on the surface, but would be more
comfortable if it had been tested.  I would echo David's suggestion that
you please test this on gcc110 in the compile farm to avoid surprises.
Given the similarity between vec_shl_<mode> and vec_shr_<mode> I am ok
with removing the former; it won't be difficult to re-create it later if
needed.

Please add some of the language you used above about VEC_RSHIFT_EXPR as
commentary for vec_shr_<mode> in vector.md, as right-shifting towards
element zero is not an obvious concept on a BE machine.

Thanks,
Bill

> 
> Thanks, David
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 3/14] Add new optabs for reducing vectors to scalars
  2014-09-22 13:26     ` Alan Lawrence
@ 2014-09-22 13:38       ` Richard Biener
  2014-09-25 14:33         ` [PATCH/RFC v2 " Alan Lawrence
  0 siblings, 1 reply; 52+ messages in thread
From: Richard Biener @ 2014-09-22 13:38 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: gcc-patches

On Mon, Sep 22, 2014 at 3:26 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> Richard Biener wrote:
>>
>>
>> scalar_reduc_to_vector misses a comment.
>
>
> Ok to reuse the comment in optabs.h in optabs.c also?

Sure.

>> I wonder if at the end we wouldn't transition all backends and then
>> renaming reduc_*_scal_optab back to reduc_*_optab makes sense.
>
>
> Yes, that sounds like a plan, the _scal is a bit of a mouthful.
>
>> The optabs have only one mode - I wouldn't be surprised if an ISA
>> invents for example v4si -> di reduction?  So do we want to make
>> reduc_plus_scal_optab a little bit more future proof (maybe there
>> is already an ISA that supports this kind of reduction?).
>
>
> That sounds like a plausible thing for an ISA to do, indeed. However given
> these names are only used by the autovectorizer rather than directly, the
> question is what the corresponding source code looks like, and/or what
> changes to the autovectorizer we might have to make to (look for code to)
> exploit such an instruction.

Ah, indeed.  Would be sth like a REDUC_WIDEN_SUM_EXPR or so.

> At this point I could go for a
> reduc_{plus,min_max}_scal_<mode><mode> which reduces from the first vector
> mode to the second scalar mode, and then make the vectorizer look only for
> cases where the second mode was the element type of the first; but I'm not
> sure I want to do anything more complicated than that at this stage.
> (However, indeed it would leave the possibility open for the future.)

Yeah, agreed.  For the min/max case a widen variant isn't useful anyway.

Thanks,
Richard.

> --Alan
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH/RFC 15 / 14+2][RS6000] Remove vec_shl and (hopefully) fix vec_shr
  2014-09-18 12:58 ` [PATCH/RFC 15 / 14+2][RS6000] Remove vec_shl and (hopefully) fix vec_shr Alan Lawrence
@ 2014-09-23 12:50   ` David Edelsohn
  0 siblings, 0 replies; 52+ messages in thread
From: David Edelsohn @ 2014-09-23 12:50 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: gcc-patches, Aldy Hernandez

On Thu, Sep 18, 2014 at 8:57 AM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> Patch 12 of 14 (https://gcc.gnu.org/ml/gcc-patches/2014-09/msg01475.html)
> will break bigendian targets implementing vec_shr. This is a PowerPC
> parallel of patch 13 of 14
> (https://gcc.gnu.org/ml/gcc-patches/2014-09/msg01477.html) for AArch64. I've
> checked I can build a stage 1 compiler for powerpc-none-eabi and that the
> assembly output looks plausible but no further than that.
>
> In fact I find BYTES_BIG_ENDIAN is defined to true on powerpcle-none-eabi as
> well as powerpc-none-eabi (and also on ppc64-none-elf, but to false on
> ppc64le-none-elf), so I'm not quite sure how your backend works in this
> regard - nonetheless I hope this is a helpful starting point even if not
> definitive.
>
> gcc/ChangeLog:
>
>         * config/rs6000/vector.md (vec_shl_<mode>): Remove.
>         (vec_shr_<mode>): Reverse shift if BYTES_BIG_ENDIAN.

This patch is okay if no regressions on a PowerLinux system (either
you or Segher can test on the GCC Compile Farm).

Thanks, David

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/14][AArch64] Temporarily remove aarch64_gimple_fold_builtin code for reduction operations
  2014-09-18 11:45 ` [PATCH 1/14][AArch64] Temporarily remove aarch64_gimple_fold_builtin code for reduction operations Alan Lawrence
@ 2014-09-24  9:41   ` Marcus Shawcroft
  0 siblings, 0 replies; 52+ messages in thread
From: Marcus Shawcroft @ 2014-09-24  9:41 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: gcc-patches

On 18 September 2014 12:45, Alan Lawrence <alan.lawrence@arm.com> wrote:
> The gimple folding ties the AArch64 backend to the tree representation of
> the midend via the neon intrinsics. This code enables constant folding of
> Neon intrinsics reduction ops, so improves performance, but is not necessary
> for correctness. By temporarily removing it (here), we can then change the
> midend representation independently of the AArch64 backend + intrinsics.
>
> However, I'm leaving the code in place, as a later patch will bring it all
> back in a very similar form (but enabled for bigendian).
>
> Bootstrapped on aarch64-none-linux; tested aarch64.exp on aarch64-none-elf
> and aarch64_be-none-elf. (The removed code was already disabled for
> bigendian; and this is solely a __builtin-folding mechanism, i.e. used only
> for Neon/ACLE intrinsics.)
>
> gcc/ChangeLog:
>         * config/aarch64/aarch64.c (TARGET_GIMPLE_FOLD_BUILTIN): Comment
> out.
>         * config/aarch64/aarch64-builtins.c (aarch64_gimple_fold_builtin):
>         Remove using preprocessor directives.

OK /Marcus

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/14][AArch64] Use new reduc_plus_scal optabs, inc. for __builtins
  2014-09-18 11:59 ` [PATCH 4/14][AArch64] Use new reduc_plus_scal optabs, inc. for __builtins Alan Lawrence
@ 2014-09-24  9:44   ` Marcus Shawcroft
  0 siblings, 0 replies; 52+ messages in thread
From: Marcus Shawcroft @ 2014-09-24  9:44 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: gcc-patches

On 18 September 2014 12:59, Alan Lawrence <alan.lawrence@arm.com> wrote:
> This migrates AArch64 over to the new optab for 'plus' reductions, i.e. so
> the define_expands produce scalars by generating a MOV to a GPR.
> Effectively, this moves the vget_lane inside every arm_neon.h intrinsic,
> into the inside of the define_expand.
>
> Tested: aarch64.exp vect.exp on aarch64-none-elf and aarch64_be-none-elf
> (full check-gcc on next patch for reduc_min/max)
>

+(define_expand "reduc_splus_<mode>"
+

Can't we just drop the define_expands for the old optabs altogether?

/Marcus

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 5/14][AArch64] Use new reduc_[us](min|max)_scal optabs, inc. for builtins
  2014-09-18 12:02 ` [PATCH 5/14][AArch64] Use new reduc_[us](min|max)_scal optabs, inc. for builtins Alan Lawrence
@ 2014-09-24  9:47   ` Marcus Shawcroft
  0 siblings, 0 replies; 52+ messages in thread
From: Marcus Shawcroft @ 2014-09-24  9:47 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: gcc-patches

On 18 September 2014 13:02, Alan Lawrence <alan.lawrence@arm.com> wrote:
> Similarly to the previous patch (r/2205), this migrates AArch64 to the new
> reduce-to-scalar optabs for min and max. For consistency we apply the same
> treatment to the smax_nan and smin_nan patterns (used for __builtins), even
> though reduc_smin_nan_scal (etc.) is not a standard name.
>
> Tested: check-gcc on aarch64-none-elf and aarch64_be-none-elf.
>
> gcc/ChangeLog:
>
>         * config/aarch64/aarch64-simd-builtins.def (reduc_smax_,
> reduc_smin_,
>         reduc_umax_, reduc_umin_, reduc_smax_nan_, reduc_smin_nan_): Remove.
>         (reduc_smax_scal_, reduc_smin_scal_, reduc_umax_scal_,
>         reduc_umin_scal_, reduc_smax_nan_scal_, reduc_smin_nan_scal_): New.
>
>         * config/aarch64/aarch64-simd.md
>         (reduc_<maxmin_uns>_<mode>): Rename VDQV_S variant to...
>         (reduc_<maxmin_uns>_internal<mode>): ...this.
>         (reduc_<maxmin_uns>_<mode>): New (VDQ_BHSI).
>         (reduc_<maxmin_uns>_scal_<mode>): New (*2).
>
>         (reduc_<maxmin_uns>_v2si): Combine with below, renaming...
>         (reduc_<maxmin_uns>_<mode>): Combine V2F with above, renaming...
>         (reduc_<maxmin_uns>_internal_<mode>): ...to this (VDQF).
>
>         * config/aarch64/arm_neon.h (vmaxv_f32, vmaxv_s8, vmaxv_s16,
>         vmaxv_s32, vmaxv_u8, vmaxv_u16, vmaxv_u32, vmaxvq_f32, vmaxvq_f64,
>         vmaxvq_s8, vmaxvq_s16, vmaxvq_s32, vmaxvq_u8, vmaxvq_u16,
> vmaxvq_u32,
>         vmaxnmv_f32, vmaxnmvq_f32, vmaxnmvq_f64, vminv_f32, vminv_s8,
>         vminv_s16, vminv_s32, vminv_u8, vminv_u16, vminv_u32, vminvq_f32,
>         vminvq_f64, vminvq_s8, vminvq_s16, vminvq_s32, vminvq_u8,
> vminvq_u16,
>         vminvq_u32, vminnmv_f32, vminnmvq_f32, vminnmvq_f64): Update to use
>         __builtin_aarch64_reduc_..._scal; remove vget_lane wrapper.

If we don't need the old optabs, I think it would be better to drop those
define_expands; otherwise OK.
/Marcus

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 6/14][AArch64] Restore gimple_folding of reduction intrinsics
  2014-09-18 12:05 ` [PATCH 6/14][AArch64] Restore gimple_folding of reduction intrinsics Alan Lawrence
@ 2014-09-24  9:48   ` Marcus Shawcroft
  0 siblings, 0 replies; 52+ messages in thread
From: Marcus Shawcroft @ 2014-09-24  9:48 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: gcc-patches

On 18 September 2014 13:05, Alan Lawrence <alan.lawrence@arm.com> wrote:
> This gives us back the constant-folding of the neon-intrinsics that was
> removed in the first patch, but is now OK for bigendian too.
>
> bootstrapped on aarch64-none-linux-gnu.
> check-gcc on aarch64-none-elf and aarch64_be-none-elf.
>
> gcc/ChangeLog:
>
>         * config/aarch64/aarch64.c (TARGET_GIMPLE_FOLD_BUILTIN): Define
> again.
>         * config/aarch64/aarch64-builtins.c (aarch64_gimple_fold_builtin):
>         Restore, enable for bigendian, update to use __builtin..._scal...

OK /Marcus

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/14][Vectorizer] Make REDUC_xxx_EXPR tree codes produce a scalar result
  2014-09-22 10:34   ` Richard Biener
  2014-09-22 13:23     ` Alan Lawrence
@ 2014-09-24 15:02     ` Alan Lawrence
  2014-09-24 18:08       ` Segher Boessenkool
  1 sibling, 1 reply; 52+ messages in thread
From: Alan Lawrence @ 2014-09-24 15:02 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 3862 bytes --]

So it looks like patches 1-6 (reduc_foo) are relatively close to final, and 
given these fix PR/61114, I'm gonna try to land these while working on a respin 
of the second half (vec_shr)...(summary: yes I like the vec_perm idea too, but 
the devil is in the detail!)

However my CompileFarm account is still pending, so to that end, if you were 
able to test patch 2/14 (attached inc. Richie's s/VIEW_CONVERT_EXPR/NOP_EXPR/) 
on the CompileFarm PowerPC machine, that'd be great, many thanks indeed. It 
should apply on its own without patch 1. I'll aim to get an alternative patch 3 
back to the list shortly, and follow up with .md updates to the various backends.

Cheers, Alan


Richard Biener wrote:
> On Thu, Sep 18, 2014 at 1:50 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
>> This fixes PR/61114 by redefining the REDUC_{MIN,MAX,PLUS}_EXPR tree codes.
>>
>> These are presently documented as producing a vector with the result in
>> element 0, and this is inconsistent with their use in tree-vect-loop.c
>> (which on bigendian targets pulls the bits out of the wrong end of the
>> vector result). This leads to bugs on bigendian targets - see also
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61114.
>>
>> I discounted "fixing" the vectorizer (to read from element 0) and then
>> making bigendian targets (whose architectural insn produces the result in
>> lane N-1) permute the result vector, as optimization of vectors in RTL seems
>> unlikely to remove such a permute and would lead to a performance
>> regression.
>>
>> Instead it seems more natural for the tree code to produce a scalar result
>> (producing a vector with the result in lane 0 has already caused confusion,
>> e.g. https://gcc.gnu.org/ml/gcc-patches/2012-10/msg01100.html).
>>
>> However, this patch preserves the meaning of the optab (producing a result
>> in lane 0 on little-endian architectures or N-1 on bigendian), thus
>> generally avoiding the need to change backends. Thus, expr.c extracts an
>> endianness-dependent element from the optab result to give the result
>> expected for the tree code.
>>
>> Previously posted as an RFC
>> https://gcc.gnu.org/ml/gcc-patches/2014-08/msg00041.html , now with an extra
>> VIEW_CONVERT_EXPR if the types of the reduction/result do not match.
> 
> Huh.  Does that ever happen?  Please use a NOP_EXPR instead of
> a VIEW_CONVERT_EXPR.
> 
> Ok with that change.
> 
> Thanks,
> Richard.
> 
>> Testing:
>>         x86_64-none-linux-gnu: bootstrap, check-gcc, check-g++
>>         aarch64-none-linux-gnu: bootstrap
>>         aarch64-none-elf:  check-gcc, check-g++
>>         arm-none-eabi: check-gcc
>>
>>         aarch64_be-none-elf: check-gcc, showing
>>         FAIL->PASS: gcc.dg/vect/no-scevccp-outer-7.c execution test
>>         FAIL->PASS: gcc.dg/vect/no-scevccp-outer-13.c execution test
>>         Passes the (previously-failing) reduced testcase on
>>                 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61114
>>
>>         Have also assembler/stage-1 tested that testcase on PowerPC, also
>> fixed.
> 
>> gcc/ChangeLog:
>>
>>         * expr.c (expand_expr_real_2): For REDUC_{MIN,MAX,PLUS}_EXPR, add
>>         extract_bit_field around optab result.
>>
>>         * fold-const.c (fold_unary_loc): For REDUC_{MIN,MAX,PLUS}_EXPR,
>> produce
>>         scalar not vector.
>>
>>         * tree-cfg.c (verify_gimple_assign_unary): Check result vs operand
>> type
>>         for REDUC_{MIN,MAX,PLUS}_EXPR.
>>
>>         * tree-vect-loop.c (vect_analyze_loop): Update comment.
>>         (vect_create_epilog_for_reduction): For direct vector reduction, use
>>         result of tree code directly without extract_bit_field.
>>
>>         * tree.def (REDUC_MAX_EXPR, REDUC_MIN_EXPR, REDUC_PLUS_EXPR): Update
>>         comment.
> 

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 2_make_reduc_tree_codes_scalar.patch --]
[-- Type: text/x-patch; name=2_make_reduc_tree_codes_scalar.patch, Size: 6705 bytes --]

commit a7b173d5efc6f08589b04fffeec9b3942b6282a0
Author: Alan Lawrence <alan.lawrence@arm.com>
Date:   Tue Jul 29 11:46:01 2014 +0100

    Make tree codes produce scalar, with NOP_EXPRs. (tree-vect-loop.c mess)

diff --git a/gcc/expr.c b/gcc/expr.c
index a6233f3..c792028 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -9044,7 +9044,17 @@ expand_expr_real_2 (sepops ops, rtx target, enum machine_mode tmode,
       {
         op0 = expand_normal (treeop0);
         this_optab = optab_for_tree_code (code, type, optab_default);
-        temp = expand_unop (mode, this_optab, op0, target, unsignedp);
+        enum machine_mode vec_mode = TYPE_MODE (TREE_TYPE (treeop0));
+        temp = expand_unop (vec_mode, this_optab, op0, NULL_RTX, unsignedp);
+        gcc_assert (temp);
+        /* The tree code produces a scalar result, but (somewhat by convention)
+           the optab produces a vector with the result in element 0 if
+           little-endian, or element N-1 if big-endian.  So pull the scalar
+           result out of that element.  */
+        int index = BYTES_BIG_ENDIAN ? GET_MODE_NUNITS (vec_mode) - 1 : 0;
+        int bitsize = GET_MODE_BITSIZE (GET_MODE_INNER (vec_mode));
+        temp = extract_bit_field (temp, bitsize, bitsize * index, unsignedp,
+				  target, mode, mode);
         gcc_assert (temp);
         return temp;
       }
diff --git a/gcc/fold-const.c b/gcc/fold-const.c
index d1b59a1..1773585 100644
--- a/gcc/fold-const.c
+++ b/gcc/fold-const.c
@@ -8246,12 +8246,13 @@ fold_unary_loc (location_t loc, enum tree_code code, tree type, tree op0)
     case REDUC_MAX_EXPR:
     case REDUC_PLUS_EXPR:
       {
-	unsigned int nelts = TYPE_VECTOR_SUBPARTS (type), i;
+	unsigned int nelts, i;
 	tree *elts;
 	enum tree_code subcode;
 
 	if (TREE_CODE (op0) != VECTOR_CST)
 	  return NULL_TREE;
+        nelts = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0));
 
 	elts = XALLOCAVEC (tree, nelts);
 	if (!vec_cst_ctor_to_array (op0, elts))
@@ -8270,10 +8271,9 @@ fold_unary_loc (location_t loc, enum tree_code code, tree type, tree op0)
 	    elts[0] = const_binop (subcode, elts[0], elts[i]);
 	    if (elts[0] == NULL_TREE || !CONSTANT_CLASS_P (elts[0]))
 	      return NULL_TREE;
-	    elts[i] = build_zero_cst (TREE_TYPE (type));
 	  }
 
-	return build_vector (type, elts);
+	return elts[0];
       }
 
     default:
diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c
index e89d76a..6c6ff18 100644
--- a/gcc/tree-cfg.c
+++ b/gcc/tree-cfg.c
@@ -3539,12 +3539,21 @@ verify_gimple_assign_unary (gimple stmt)
 
         return false;
       }
-
-    case VEC_UNPACK_HI_EXPR:
-    case VEC_UNPACK_LO_EXPR:
     case REDUC_MAX_EXPR:
     case REDUC_MIN_EXPR:
     case REDUC_PLUS_EXPR:
+      if (!VECTOR_TYPE_P (rhs1_type)
+	  || !useless_type_conversion_p (lhs_type, TREE_TYPE (rhs1_type)))
+        {
+	  error ("reduction should convert from vector to element type");
+	  debug_generic_expr (lhs_type);
+	  debug_generic_expr (rhs1_type);
+	  return true;
+	}
+      return false;
+
+    case VEC_UNPACK_HI_EXPR:
+    case VEC_UNPACK_LO_EXPR:
     case VEC_UNPACK_FLOAT_HI_EXPR:
     case VEC_UNPACK_FLOAT_LO_EXPR:
       /* FIXME.  */
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index fd1166f..8d97e17 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -1892,9 +1892,9 @@ vect_analyze_loop (struct loop *loop)
 
    Output:
    REDUC_CODE - the corresponding tree-code to be used to reduce the
-      vector of partial results into a single scalar result (which
-      will also reside in a vector) or ERROR_MARK if the operation is
-      a supported reduction operation, but does not have such tree-code.
+      vector of partial results into a single scalar result, or ERROR_MARK
+      if the operation is a supported reduction operation, but does not have
+      such a tree-code.
 
    Return FALSE if CODE currently cannot be vectorized as reduction.  */
 
@@ -4168,6 +4168,7 @@ vect_create_epilog_for_reduction (vec<tree> vect_defs, gimple stmt,
   if (reduc_code != ERROR_MARK && !slp_reduc)
     {
       tree tmp;
+      tree vec_elem_type;
 
       /*** Case 1:  Create:
            v_out2 = reduc_expr <v_out1>  */
@@ -4176,14 +4177,26 @@ vect_create_epilog_for_reduction (vec<tree> vect_defs, gimple stmt,
         dump_printf_loc (MSG_NOTE, vect_location,
 			 "Reduce using direct vector reduction.\n");
 
-      vec_dest = vect_create_destination_var (scalar_dest, vectype);
-      tmp = build1 (reduc_code, vectype, new_phi_result);
-      epilog_stmt = gimple_build_assign (vec_dest, tmp);
-      new_temp = make_ssa_name (vec_dest, epilog_stmt);
+      vec_elem_type = TREE_TYPE (TREE_TYPE (new_phi_result));
+      if (!useless_type_conversion_p (scalar_type, vec_elem_type))
+	{
+          tree tmp_dest =
+	      vect_create_destination_var (scalar_dest, vec_elem_type);
+	  tmp = build1 (reduc_code, vec_elem_type, new_phi_result);
+	  epilog_stmt = gimple_build_assign (tmp_dest, tmp);
+	  new_temp = make_ssa_name (tmp_dest, epilog_stmt);
+	  gimple_assign_set_lhs (epilog_stmt, new_temp);
+	  gsi_insert_before (&exit_gsi, epilog_stmt, GSI_SAME_STMT);
+
+	  tmp = build1 (NOP_EXPR, scalar_type, new_temp);
+	}
+      else
+	tmp = build1 (reduc_code, scalar_type, new_phi_result);
+      epilog_stmt = gimple_build_assign (new_scalar_dest, tmp);
+      new_temp = make_ssa_name (new_scalar_dest, epilog_stmt);
       gimple_assign_set_lhs (epilog_stmt, new_temp);
       gsi_insert_before (&exit_gsi, epilog_stmt, GSI_SAME_STMT);
-
-      extract_scalar_result = true;
+      scalar_results.safe_push (new_temp);
     }
   else
     {
diff --git a/gcc/tree.def b/gcc/tree.def
index bd39e4b..c830e4b 100644
--- a/gcc/tree.def
+++ b/gcc/tree.def
@@ -1161,10 +1161,9 @@ DEFTREECODE (TRANSACTION_EXPR, "transaction_expr", tcc_expression, 1)
    result (e.g. summing the elements of the vector, finding the minimum over
    the vector elements, etc).
    Operand 0 is a vector.
-   The expression returns a vector of the same type, with the first
-   element in the vector holding the result of the reduction of all elements
-   of the operand.  The content of the other elements in the returned vector
-   is undefined.  */
+   The expression returns a scalar, with type the same as the elements of the
+   vector, holding the result of the reduction of all elements of the operand.
+   */
 DEFTREECODE (REDUC_MAX_EXPR, "reduc_max_expr", tcc_unary, 1)
 DEFTREECODE (REDUC_MIN_EXPR, "reduc_min_expr", tcc_unary, 1)
 DEFTREECODE (REDUC_PLUS_EXPR, "reduc_plus_expr", tcc_unary, 1)

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/14][Vectorizer] Make REDUC_xxx_EXPR tree codes produce a scalar result
  2014-09-24 15:02     ` Alan Lawrence
@ 2014-09-24 18:08       ` Segher Boessenkool
  2014-09-25 16:07         ` Alan Lawrence
  0 siblings, 1 reply; 52+ messages in thread
From: Segher Boessenkool @ 2014-09-24 18:08 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: gcc-patches, Richard Biener

On Wed, Sep 24, 2014 at 04:02:11PM +0100, Alan Lawrence wrote:
> However my CompileFarm account is still pending, so to that end, if you 
> were able to test patch 2/14 (attached inc. Richie's 
> s/VIEW_CONVERT_EXPR/NOP_EXPR/) on the CompileFarm PowerPC machine, that'd 
> be great, many thanks indeed. It should apply on its own without patch 1. 

Patch 2/14 on its own has no regressions on gcc110 (powerpc64-linux,
c,c++,fortran, -m64,-m32,-m32/-mpowerpc64,-m64/-mlra).

Cheers,


Segher

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH/RFC v2 3/14] Add new optabs for reducing vectors to scalars
  2014-09-22 13:38       ` Richard Biener
@ 2014-09-25 14:33         ` Alan Lawrence
  2014-09-25 15:31           ` Richard Biener
  0 siblings, 1 reply; 52+ messages in thread
From: Alan Lawrence @ 2014-09-25 14:33 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 3138 bytes --]

Ok, so I've tried making the reduc_plus optab take two modes: that of the vector 
to reduce, and that of the result, thus allowing platforms to provide a widening 
reduction. However, I'm keeping reduc_[us](min|max)_optab with only a single 
mode, as widening makes no sense there.
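
To spell out the naming (this just follows from the reduc_plus_scal_$a$b 
OPTAB_CD entry in the attached patch, and is purely illustrative - no port 
implements it yet): a port summing the elements of a V8QI into an SI result 
would provide a pattern named reduc_plus_scal_v8qisi, while the plain 
non-widening V8QI case would be reduc_plus_scal_v8qiqi.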

I've not gone as far as making the vectorizer use any such widening reduction, 
however: as previously stated, I'm not really sure what the input source code 
for that even looks like (maybe in a language other than C?). If we wanted to do 
a non-widening reduction using such an instruction (by discarding the extra 
bits), it strikes me that the platform can/should provide a non-widening optab 
for that case...

Testing: bootstrapped on x86_64 linux + check-gcc; cross-tested aarch64-none-elf 
check-gcc; cross-tested aarch64_be-none-elf aarch64.exp + vect.exp.

So, my feeling is that the extra complexity here doesn't really buy us anything; 
and that if we do want to support / use widening reductions in the future, we 
should do so with a separate reduc_plus_widen... optab, and stick with the 
original patch/formulation for now. (In other words: this patch is a guide to 
how I think a dual-mode reduc_plus_optab would look, but I don't honestly like it!).

If you agree, I shall transplant the comments on scalar_reduc_to_vector from 
this patch into the original, and then post that revised version?


Cheers, Alan

Richard Biener wrote:
> On Mon, Sep 22, 2014 at 3:26 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
>> Richard Biener wrote:
>>>
>>> scalar_reduc_to_vector misses a comment.
>>
>> Ok to reuse the comment in optabs.h in optabs.c also?
> 
> Sure.
> 
>>> I wonder if at the end we wouldn't transition all backends and then
>>> renaming reduc_*_scal_optab back to reduc_*_optab makes sense.
>>
>> Yes, that sounds like a plan, the _scal is a bit of a mouthful.
>>
>>> The optabs have only one mode - I wouldn't be surprised if an ISA
>>> invents for example v4si -> di reduction?  So do we want to make
>>> reduc_plus_scal_optab a little bit more future proof (maybe there
>>> is already an ISA that supports this kind of reduction?).
>>
>> That sounds like a plausible thing for an ISA to do, indeed. However given
>> these names are only used by the autovectorizer rather than directly, the
>> question is what the corresponding source code looks like, and/or what
>> changes to the autovectorizer we might have to make to (look for code to)
>> exploit such an instruction.
> 
> Ah, indeed.  Would be sth like a REDUC_WIDEN_SUM_EXPR or so.
> 
>> At this point I could go for a
>> reduc_{plus,min_max}_scal_<mode><mode> which reduces from the first vector
>> mode to the second scalar mode, and then make the vectorizer look only for
>> cases where the second mode was the element type of the first; but I'm not
>> sure I want to do anything more complicated than that at this stage.
>> (However, indeed it would leave the possibility open for the future.)
> 
> Yeah, agreed.  For the min/max case a widen variant isn't useful anyway.
> 
> Thanks,
> Richard.
> 
>> --Alan
>>
> 

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 3b_add_new_optabs.patch --]
[-- Type: text/x-patch; name=3b_add_new_optabs.patch, Size: 11250 bytes --]

diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 80e8bd6a079b8bf77ef396643aaba512cf83b317..0a9381fc3a26cdaad02e6f837b94c7738daa3a7f 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -4783,29 +4783,49 @@ it is unspecified which of the two operands is returned as the result.
 @cindex @code{reduc_smax_@var{m}} instruction pattern
 @item @samp{reduc_smin_@var{m}}, @samp{reduc_smax_@var{m}}
 Find the signed minimum/maximum of the elements of a vector. The vector is
-operand 1, and the scalar result is stored in the least significant bits of
+operand 1, and the result is stored in the least significant bits of
 operand 0 (also a vector). The output and input vector should have the same
-modes.
+modes. These are legacy optabs, and platforms should prefer to implement
+@samp{reduc_smin_scal_@var{m}} and @samp{reduc_smax_scal_@var{m}}.
 
 @cindex @code{reduc_umin_@var{m}} instruction pattern
 @cindex @code{reduc_umax_@var{m}} instruction pattern
 @item @samp{reduc_umin_@var{m}}, @samp{reduc_umax_@var{m}}
 Find the unsigned minimum/maximum of the elements of a vector. The vector is
-operand 1, and the scalar result is stored in the least significant bits of
+operand 1, and the result is stored in the least significant bits of
 operand 0 (also a vector). The output and input vector should have the same
-modes.
+modes. These are legacy optabs, and platforms should prefer to implement
+@samp{reduc_umin_scal_@var{m}} and @samp{reduc_umax_scal_@var{m}}.
 
 @cindex @code{reduc_splus_@var{m}} instruction pattern
-@item @samp{reduc_splus_@var{m}}
-Compute the sum of the signed elements of a vector. The vector is operand 1,
-and the scalar result is stored in the least significant bits of operand 0
-(also a vector). The output and input vector should have the same modes.
-
 @cindex @code{reduc_uplus_@var{m}} instruction pattern
-@item @samp{reduc_uplus_@var{m}}
-Compute the sum of the unsigned elements of a vector. The vector is operand 1,
-and the scalar result is stored in the least significant bits of operand 0
+@item @samp{reduc_splus_@var{m}}, @samp{reduc_uplus_@var{m}}
+Compute the sum of the signed/unsigned elements of a vector. The vector is
+operand 1, and the result is stored in the least significant bits of operand 0
 (also a vector). The output and input vector should have the same modes.
+These are legacy optabs, and platforms should prefer to implement
+@samp{reduc_plus_scal_@var{m}@var{n}}.
+
+@cindex @code{reduc_smin_scal_@var{m}} instruction pattern
+@cindex @code{reduc_smax_scal_@var{m}} instruction pattern
+@item @samp{reduc_smin_scal_@var{m}}, @samp{reduc_smax_scal_@var{m}}
+Find the signed minimum/maximum of the elements of a vector. The vector is
+operand 1, and operand 0 is the scalar result, with mode equal to the mode of
+the elements of the input vector.
+
+@cindex @code{reduc_umin_scal_@var{m}} instruction pattern
+@cindex @code{reduc_umax_scal_@var{m}} instruction pattern
+@item @samp{reduc_umin_scal_@var{m}}, @samp{reduc_umax_scal_@var{m}}
+Find the unsigned minimum/maximum of the elements of a vector. The vector is
+operand 1, and operand 0 is the scalar result, with mode equal to the mode of
+the elements of the input vector.
+
+@cindex @code{reduc_plus_scal_@var{m}@var{n}} instruction pattern
+@item @samp{reduc_plus_scal_@var{m}@var{n}}
+Compute the sum of the elements of a vector. The vector, of mode @var{m}, is
+operand 1, and operand 0 is the scalar result, of mode @var{n}. Note that at
+present the vectorizer only looks for patterns where @var{n} is the mode of the
+elements of @var{m}.
 
 @cindex @code{sdot_prod@var{m}} instruction pattern
 @item @samp{sdot_prod@var{m}}
diff --git a/gcc/expr.c b/gcc/expr.c
index c7920282416747ab41afcd47179d4ed92d8fbc23..4bd5a3f248c7de487586abbae677770359098ecb 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -9045,6 +9045,23 @@ expand_expr_real_2 (sepops ops, rtx target, enum machine_mode tmode,
         op0 = expand_normal (treeop0);
         this_optab = optab_for_tree_code (code, type, optab_default);
         enum machine_mode vec_mode = TYPE_MODE (TREE_TYPE (treeop0));
+	enum insn_code icode = reduction_optab_handler (this_optab, vec_mode);
+	if (icode != CODE_FOR_nothing)
+	  {
+	    struct expand_operand ops[2];
+
+	    create_output_operand (&ops[0], target, mode);
+	    create_input_operand (&ops[1], op0, vec_mode);
+	    if (maybe_expand_insn (icode, 2, ops))
+	      {
+		target = ops[0].value;
+		if (GET_MODE (target) != mode)
+		  return gen_lowpart (tmode, target);
+		return target;
+	      }
+	  }
+	/* Fall back to optab with vector result, and then extract scalar.  */
+	this_optab = scalar_reduc_to_vector (this_optab, type);
         temp = expand_unop (vec_mode, this_optab, op0, NULL_RTX, unsignedp);
         gcc_assert (temp);
         /* The tree code produces a scalar result, but (somewhat by convention)
diff --git a/gcc/optabs.c b/gcc/optabs.c
index 605615d7458e794995dfd27d7fdf39e37baa910a..722fc1230b119fd78b1cb2074f96f56d24982fbb 100644
--- a/gcc/optabs.c
+++ b/gcc/optabs.c
@@ -506,13 +506,15 @@ optab_for_tree_code (enum tree_code code, const_tree type,
       return fma_optab;
 
     case REDUC_MAX_EXPR:
-      return TYPE_UNSIGNED (type) ? reduc_umax_optab : reduc_smax_optab;
+      return TYPE_UNSIGNED (type)
+	     ? reduc_umax_scal_optab : reduc_smax_scal_optab;
 
     case REDUC_MIN_EXPR:
-      return TYPE_UNSIGNED (type) ? reduc_umin_optab : reduc_smin_optab;
+      return TYPE_UNSIGNED (type)
+	     ? reduc_umin_scal_optab : reduc_smin_scal_optab;
 
     case REDUC_PLUS_EXPR:
-      return TYPE_UNSIGNED (type) ? reduc_uplus_optab : reduc_splus_optab;
+      return reduc_plus_scal_optab;
 
     case VEC_LSHIFT_EXPR:
       return vec_shl_optab;
@@ -608,7 +610,49 @@ optab_for_tree_code (enum tree_code code, const_tree type,
       return unknown_optab;
     }
 }
-\f
+
+/* Given optab UNOPTAB that reduces a vector to a scalar, find instead the old
+   optab that produces a vector with the reduction result in one element,
+   for a tree with type TYPE.  */
+
+optab
+scalar_reduc_to_vector (optab unoptab, const_tree type)
+{
+  switch (unoptab)
+    {
+    case reduc_plus_scal_optab:
+      return TYPE_UNSIGNED (type) ? reduc_uplus_optab : reduc_splus_optab;
+
+    case reduc_smin_scal_optab: return reduc_smin_optab;
+    case reduc_umin_scal_optab: return reduc_umin_optab;
+    case reduc_smax_scal_optab: return reduc_smax_optab;
+    case reduc_umax_scal_optab: return reduc_umax_optab;
+    default: return unknown_optab;
+    }
+}
+
+/* Given reduction optab OPTAB, find the handler that reduces a vector of mode
+   VEC_MODE to a scalar of mode the same as the vector elements.  */
+
+insn_code
+reduction_optab_handler (optab optab, enum machine_mode vec_mode)
+{
+  gcc_assert (VECTOR_MODE_P (vec_mode));
+  switch (optab)
+    {
+    case reduc_plus_scal_optab:
+      /* Optab allows for the scalar result to be different/wider than the
+         mode of the vector elements. However we don't yet exploit this.  */
+      return convert_optab_handler (optab, vec_mode, GET_MODE_INNER (vec_mode));
+    case reduc_smin_scal_optab:
+    case reduc_umin_scal_optab:
+    case reduc_smax_scal_optab:
+    case reduc_umax_scal_optab:
+      return optab_handler (optab, vec_mode);
+    default:
+      return CODE_FOR_nothing;
+    }
+}
 
 /* Expand vector widening operations.
 
diff --git a/gcc/optabs.def b/gcc/optabs.def
index b75547006585267d9f5b4f17ba972ba388852cf5..26eea26df73f416319afe1c7f9ac74f5c8ef48df 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -61,6 +61,9 @@ OPTAB_CD(vec_load_lanes_optab, "vec_load_lanes$a$b")
 OPTAB_CD(vec_store_lanes_optab, "vec_store_lanes$a$b")
 OPTAB_CD(vcond_optab, "vcond$a$b")
 OPTAB_CD(vcondu_optab, "vcondu$a$b")
+/* Vector reduction to a scalar, possibly widening.  The second mode is for the
+   result, usually (but possibly wider than) the mode of the input's elements.  */
+OPTAB_CD (reduc_plus_scal_optab, "reduc_plus_scal_$a$b")
 
 OPTAB_NL(add_optab, "add$P$a3", PLUS, "add", '3', gen_int_fp_fixed_libfunc)
 OPTAB_NX(add_optab, "add$F$a3")
@@ -243,12 +246,19 @@ OPTAB_D (sin_optab, "sin$a2")
 OPTAB_D (sincos_optab, "sincos$a3")
 OPTAB_D (tan_optab, "tan$a2")
 
+/* Vector reduction to a scalar.  */
+OPTAB_D (reduc_smax_scal_optab, "reduc_smax_scal_$a")
+OPTAB_D (reduc_smin_scal_optab, "reduc_smin_scal_$a")
+OPTAB_D (reduc_umax_scal_optab, "reduc_umax_scal_$a")
+OPTAB_D (reduc_umin_scal_optab, "reduc_umin_scal_$a")
+/* (Old) Vector reduction, returning a vector with the result in one lane.  */
 OPTAB_D (reduc_smax_optab, "reduc_smax_$a")
 OPTAB_D (reduc_smin_optab, "reduc_smin_$a")
 OPTAB_D (reduc_splus_optab, "reduc_splus_$a")
 OPTAB_D (reduc_umax_optab, "reduc_umax_$a")
 OPTAB_D (reduc_umin_optab, "reduc_umin_$a")
 OPTAB_D (reduc_uplus_optab, "reduc_uplus_$a")
+
 OPTAB_D (sdot_prod_optab, "sdot_prod$I$a")
 OPTAB_D (ssum_widen_optab, "widen_ssum$I$a3")
 OPTAB_D (udot_prod_optab, "udot_prod$I$a")
diff --git a/gcc/optabs.h b/gcc/optabs.h
index 089b15a6fcd261bb15c898f185a157f1257284ba..10d080ef9347fc6e2b7d92d099a7b51a6b7eb1a0 100644
--- a/gcc/optabs.h
+++ b/gcc/optabs.h
@@ -162,6 +162,15 @@ enum optab_subtype
    vector shifts and rotates */
 extern optab optab_for_tree_code (enum tree_code, const_tree, enum optab_subtype);
 
+/* Given an optab that reduces a vector to a scalar, find instead the old
+   optab that produces a vector with the reduction result in one element,
+   for a tree with the specified type.  */
+extern optab scalar_reduc_to_vector (optab, const_tree type);
+
+/* Given an optab that reduces a vector to a scalar, find the handler for the
+   specified vector mode.  */
+extern insn_code reduction_optab_handler (optab, enum machine_mode);
+
 /* The various uses that a comparison can have; used by can_compare_p:
    jumps, conditional moves, store flag operations.  */
 enum can_compare_purpose
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 8d97e176446f0f6963ecc443f1db0a84ebf2b169..89036e76ae2835bb22f2f3a51d20f1288e26f6db 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -5102,16 +5102,21 @@ vectorizable_reduction (gimple stmt, gimple_stmt_iterator *gsi,
 
           epilog_reduc_code = ERROR_MARK;
         }
-
-      if (reduc_optab
-          && optab_handler (reduc_optab, vec_mode) == CODE_FOR_nothing)
+      else
         {
-          if (dump_enabled_p ())
-	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			     "reduc op not supported by target.\n");
+	  if (!reduction_optab_handler (reduc_optab, vec_mode))
+	    {
+	      optab = scalar_reduc_to_vector (reduc_optab, vectype_out);
+	      if (optab_handler (optab, vec_mode) == CODE_FOR_nothing)
+		{
+		  if (dump_enabled_p ())
+		    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+				     "reduc op not supported by target.\n");
 
-          epilog_reduc_code = ERROR_MARK;
-        }
+		  epilog_reduc_code = ERROR_MARK;
+		}
+	    }
+	}
     }
   else
     {

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH/RFC v2 3/14] Add new optabs for reducing vectors to scalars
  2014-09-25 14:33         ` [PATCH/RFC v2 " Alan Lawrence
@ 2014-09-25 15:31           ` Richard Biener
  2014-09-25 16:12             ` Alan Lawrence
  0 siblings, 1 reply; 52+ messages in thread
From: Richard Biener @ 2014-09-25 15:31 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: gcc-patches

On Thu, Sep 25, 2014 at 4:32 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> Ok, so, I've tried making reduc_plus optab take two modes: that of the
> vector to reduce, and the result; thus allowing platforms to provide a
> widening reduction. However, I'm keeping reduc_[us](min|max)_optab with only
> a single mode, as widening makes no sense there.
>
> I've not gone as far as making the vectorizer use any such a widening
> reduction, however: as previously stated, I'm not really sure what the input
> source code for that even looks like (maybe in a language other than C?). If
> we wanted to do a non-widening reduction using such an instruction (by
> discarding the extra bits), strikes me the platform can/should provide a
> non-widening optab for that case...

I expect it to apply to sth like

int foo (char *in, int n)
{
   int res = 0;
   for (int i = 0; i < n; ++i)
     res += *in;
   return res;
}

where you'd see

  temc = *in;
  tem = (int)temc;
  res += tem;

we probably handle this by widening the chars to ints and unrolling
the loop enough to make that work (thus for n == 16 it would maybe
fail to vectorize?).  It should be more efficient to pattern-detect
this as widening reduction.
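
(For the 2x-wider case, if memory serves, the loop-body side of this is already 
catered for: the pattern recogniser can rewrite the widen-then-add sequence into 
something like

  res_1 = WIDEN_SUM_EXPR <temc, res_0>;

which maps onto the widen_ssum/widen_usum patterns.  A widening reduction 
optab/tree code would be the natural counterpart of that for the epilogue.)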

> Testing: bootstrapped on x86_64 linux + check-gcc; cross-tested
> aarch64-none-elf check-gcc; cross-tested aarch64_be-none-elf aarch64.exp +
> vect.exp.
>
> So, my feeling is that the extra complexity here doesn't really buy us
> anything; and that if we do want to support / use widening reductions in the
> future, we should do so with a separate, reduc_plus_widen... optab, and
> stick with the original patch/formulation for now. (In other words: this
> patch is a guide to how I think a dual-mode reduc_plus_optab looks, but I
> don't honestly like it!).
>
> If you agree, I shall transplant the comments on scalar_reduc_to_vector from
> this patch into the original, and then post that revised version?

I agree.  We can come back once a target implements such widening
reduction.

Richard.

>
> Cheers, Alan
>
>
> Richard Biener wrote:
>>
>> On Mon, Sep 22, 2014 at 3:26 PM, Alan Lawrence <alan.lawrence@arm.com>
>> wrote:
>>>
>>> Richard Biener wrote:
>>>>
>>>>
>>>> scalar_reduc_to_vector misses a comment.
>>>
>>>
>>> Ok to reuse the comment in optabs.h in optabs.c also?
>>
>>
>> Sure.
>>
>>>> I wonder if at the end we wouldn't transition all backends and then
>>>> renaming reduc_*_scal_optab back to reduc_*_optab makes sense.
>>>
>>>
>>> Yes, that sounds like a plan, the _scal is a bit of a mouthful.
>>>
>>>> The optabs have only one mode - I wouldn't be surprised if an ISA
>>>> invents for example v4si -> di reduction?  So do we want to make
>>>> reduc_plus_scal_optab a little bit more future proof (maybe there
>>>> is already an ISA that supports this kind of reduction?).
>>>
>>>
>>> That sounds like a plausible thing for an ISA to do, indeed. However
>>> given
>>> these names are only used by the autovectorizer rather than directly, the
>>> question is what the corresponding source code looks like, and/or what
>>> changes to the autovectorizer we might have to make to (look for code to)
>>> exploit such an instruction.
>>
>>
>> Ah, indeed.  Would be sth like a REDUC_WIDEN_SUM_EXPR or so.
>>
>>> At this point I could go for a
>>> reduc_{plus,min_max}_scal_<mode><mode> which reduces from the first
>>> vector
>>> mode to the second scalar mode, and then make the vectorizer look only
>>> for
>>> cases where the second mode was the element type of the first; but I'm
>>> not
>>> sure I want to do anything more complicated than that at this stage.
>>> (However, indeed it would leave the possibility open for the future.)
>>
>>
>> Yeah, agreed.  For the min/max case a widen variant isn't useful anyway.
>>
>> Thanks,
>> Richard.
>>
>>> --Alan
>>>
>>
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/14][Vectorizer] Make REDUC_xxx_EXPR tree codes produce a scalar result
  2014-09-24 18:08       ` Segher Boessenkool
@ 2014-09-25 16:07         ` Alan Lawrence
  0 siblings, 0 replies; 52+ messages in thread
From: Alan Lawrence @ 2014-09-25 16:07 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches

Many thanks indeed! :)

--Alan

Segher Boessenkool wrote:
> On Wed, Sep 24, 2014 at 04:02:11PM +0100, Alan Lawrence wrote:
>> However my CompileFarm account is still pending, so to that end, if you 
>> were able to test patch 2/14 (attached inc. Richie's 
>> s/VIEW_CONVERT_EXPR/NOP_EXPR/) on the CompileFarm PowerPC machine, that'd 
>> be great, many thanks indeed. It should apply on its own without patch 1. 
> 
> Patch 2/14 on its own has no regressions on gcc110 (powerpc64-linux,
> c,c++,fortran, -m64,-m32,-m32/-mpowerpc64,-m64/-mlra).
> 
> Cheers,
> 
> 
> Segher
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH/RFC v2 3/14] Add new optabs for reducing vectors to scalars
  2014-09-25 15:31           ` Richard Biener
@ 2014-09-25 16:12             ` Alan Lawrence
  2014-09-25 19:20               ` Segher Boessenkool
  0 siblings, 1 reply; 52+ messages in thread
From: Alan Lawrence @ 2014-09-25 16:12 UTC (permalink / raw)
  To: Richard Biener, Segher Boessenkool; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 4494 bytes --]

Well, even with that C source, you'd need to be careful to ensure that the 
vectorized loop never went round more than once; otherwise the additions within 
the loop would be performed in 8 bits, differing from the final (widening) reduction...
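
To make that concrete (a standalone sketch, not a testcase from the PR):

#include <stdio.h>

int
main (void)
{
  unsigned char in[32];
  for (int i = 0; i < 32; i++)
    in[i] = 100;

  int wide = 0;              /* what a true widening reduction computes */
  unsigned char narrow = 0;  /* what 8-bit in-loop additions would compute */
  for (int i = 0; i < 32; i++)
    {
      wide += in[i];
      narrow += in[i];       /* wraps modulo 256 */
    }

  /* Prints "3200 vs 128": once the loop covers more than one vector's worth
     of elements, the 8-bit partial sums have already lost information that
     widening only at the final reduction cannot recover.  */
  printf ("%d vs %d\n", wide, (int) narrow);
  return 0;
}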

So: original patch with updated commenting attached... Segher, is there any 
chance you could test this on powerpc too? (In combination with patch 2/14, 
which will need to be applied first; you can skip patch 1, and >=4.)

--Alan

Richard Biener wrote:
> On Thu, Sep 25, 2014 at 4:32 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
>> Ok, so, I've tried making reduc_plus optab take two modes: that of the
>> vector to reduce, and the result; thus allowing platforms to provide a
>> widening reduction. However, I'm keeping reduc_[us](min|max)_optab with only
>> a single mode, as widening makes no sense there.
>>
>> I've not gone as far as making the vectorizer use any such a widening
>> reduction, however: as previously stated, I'm not really sure what the input
>> source code for that even looks like (maybe in a language other than C?). If
>> we wanted to do a non-widening reduction using such an instruction (by
>> discarding the extra bits), strikes me the platform can/should provide a
>> non-widening optab for that case...
> 
> I expect it to apply to sth like
> 
> int foo (char *in, int n)
> {
>    int res = 0;
>    for (int i = 0; i < n; ++i)
>      res += *in;
>    return res;
> }
> 
> where you'd see
> 
>   temc = *in;
>   tem = (int)temc;
>   res += tem;
> 
> we probably handle this by widening the chars to ints and unrolling
> the loop enough to make that work (thus for n == 16 it would maybe
> fail to vectorize?).  It should be more efficient to pattern-detect
> this as widening reduction.
> 
>> Testing: bootstrapped on x86_64 linux + check-gcc; cross-tested
>> aarch64-none-elf check-gcc; cross-tested aarch64_be-none-elf aarch64.exp +
>> vect.exp.
>>
>> So, my feeling is that the extra complexity here doesn't really buy us
>> anything; and that if we do want to support / use widening reductions in the
>> future, we should do so with a separate, reduc_plus_widen... optab, and
>> stick with the original patch/formulation for now. (In other words: this
>> patch is a guide to how I think a dual-mode reduc_plus_optab looks, but I
>> don't honestly like it!).
>>
>> If you agree, I shall transplant the comments on scalar_reduc_to_vector from
>> this patch into the original, and then post that revised version?
> 
> I agree.  We can come back once a target implements such widening
> reduction.
> 
> Richard.
> 
>> Cheers, Alan
>>
>>
>> Richard Biener wrote:
>>> On Mon, Sep 22, 2014 at 3:26 PM, Alan Lawrence <alan.lawrence@arm.com>
>>> wrote:
>>>> Richard Biener wrote:
>>>>>
>>>>> scalar_reduc_to_vector misses a comment.
>>>>
>>>> Ok to reuse the comment in optabs.h in optabs.c also?
>>>
>>> Sure.
>>>
>>>>> I wonder if at the end we wouldn't transition all backends and then
>>>>> renaming reduc_*_scal_optab back to reduc_*_optab makes sense.
>>>>
>>>> Yes, that sounds like a plan, the _scal is a bit of a mouthful.
>>>>
>>>>> The optabs have only one mode - I wouldn't be surprised if an ISA
>>>>> invents for example v4si -> di reduction?  So do we want to make
>>>>> reduc_plus_scal_optab a little bit more future proof (maybe there
>>>>> is already an ISA that supports this kind of reduction?).
>>>>
>>>> That sounds like a plausible thing for an ISA to do, indeed. However
>>>> given
>>>> these names are only used by the autovectorizer rather than directly, the
>>>> question is what the corresponding source code looks like, and/or what
>>>> changes to the autovectorizer we might have to make to (look for code to)
>>>> exploit such an instruction.
>>>
>>> Ah, indeed.  Would be sth like a REDUC_WIDEN_SUM_EXPR or so.
>>>
>>>> At this point I could go for a
>>>> reduc_{plus,min_max}_scal_<mode><mode> which reduces from the first
>>>> vector
>>>> mode to the second scalar mode, and then make the vectorizer look only
>>>> for
>>>> cases where the second mode was the element type of the first; but I'm
>>>> not
>>>> sure I want to do anything more complicated than that at this stage.
>>>> (However, indeed it would leave the possibility open for the future.)
>>>
>>> Yeah, agreed.  For the min/max case a widen variant isn't useful anyway.
>>>
>>> Thanks,
>>> Richard.
>>>
>>>> --Alan
>>>>
> 

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 3_add_new_optabs.patch --]
[-- Type: text/x-patch; name=3_add_new_optabs.patch, Size: 9276 bytes --]

diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 80e8bd6..84e5261 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -4783,29 +4783,48 @@ it is unspecified which of the two operands is returned as the result.
 @cindex @code{reduc_smax_@var{m}} instruction pattern
 @item @samp{reduc_smin_@var{m}}, @samp{reduc_smax_@var{m}}
 Find the signed minimum/maximum of the elements of a vector. The vector is
-operand 1, and the scalar result is stored in the least significant bits of
+operand 1, and the result is stored in the least significant bits of
 operand 0 (also a vector). The output and input vector should have the same
-modes.
+modes. These are legacy optabs, and platforms should prefer to implement
+@samp{reduc_smin_scal_@var{m}} and @samp{reduc_smax_scal_@var{m}}.
 
 @cindex @code{reduc_umin_@var{m}} instruction pattern
 @cindex @code{reduc_umax_@var{m}} instruction pattern
 @item @samp{reduc_umin_@var{m}}, @samp{reduc_umax_@var{m}}
 Find the unsigned minimum/maximum of the elements of a vector. The vector is
-operand 1, and the scalar result is stored in the least significant bits of
+operand 1, and the result is stored in the least significant bits of
 operand 0 (also a vector). The output and input vector should have the same
-modes.
+modes. These are legacy optabs, and platforms should prefer to implement
+@samp{reduc_umin_scal_@var{m}} and @samp{reduc_umax_scal_@var{m}}.
 
 @cindex @code{reduc_splus_@var{m}} instruction pattern
-@item @samp{reduc_splus_@var{m}}
-Compute the sum of the signed elements of a vector. The vector is operand 1,
-and the scalar result is stored in the least significant bits of operand 0
-(also a vector). The output and input vector should have the same modes.
-
 @cindex @code{reduc_uplus_@var{m}} instruction pattern
-@item @samp{reduc_uplus_@var{m}}
-Compute the sum of the unsigned elements of a vector. The vector is operand 1,
-and the scalar result is stored in the least significant bits of operand 0
+@item @samp{reduc_splus_@var{m}}, @samp{reduc_uplus_@var{m}}
+Compute the sum of the signed/unsigned elements of a vector. The vector is
+operand 1, and the result is stored in the least significant bits of operand 0
 (also a vector). The output and input vector should have the same modes.
+These are legacy optabs, and platforms should prefer to implement
+@samp{reduc_plus_scal_@var{m}}.
+
+@cindex @code{reduc_smin_scal_@var{m}} instruction pattern
+@cindex @code{reduc_smax_scal_@var{m}} instruction pattern
+@item @samp{reduc_smin_scal_@var{m}}, @samp{reduc_smax_scal_@var{m}}
+Find the signed minimum/maximum of the elements of a vector. The vector is
+operand 1, and operand 0 is the scalar result, with mode equal to the mode of
+the elements of the input vector.
+
+@cindex @code{reduc_umin_scal_@var{m}} instruction pattern
+@cindex @code{reduc_umax_scal_@var{m}} instruction pattern
+@item @samp{reduc_umin_scal_@var{m}}, @samp{reduc_umax_scal_@var{m}}
+Find the unsigned minimum/maximum of the elements of a vector. The vector is
+operand 1, and operand 0 is the scalar result, with mode equal to the mode of
+the elements of the input vector.
+
+@cindex @code{reduc_plus_scal_@var{m}} instruction pattern
+@item @samp{reduc_plus_scal_@var{m}}
+Compute the sum of the elements of a vector. The vector is operand 1, and
+operand 0 is the scalar result, with mode equal to the mode of the elements of
+the input vector.
 
 @cindex @code{sdot_prod@var{m}} instruction pattern
 @item @samp{sdot_prod@var{m}}
diff --git a/gcc/expr.c b/gcc/expr.c
index c792028..3763614 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -9045,6 +9045,24 @@ expand_expr_real_2 (sepops ops, rtx target, enum machine_mode tmode,
         op0 = expand_normal (treeop0);
         this_optab = optab_for_tree_code (code, type, optab_default);
         enum machine_mode vec_mode = TYPE_MODE (TREE_TYPE (treeop0));
+
+	if (optab_handler (this_optab, vec_mode) != CODE_FOR_nothing)
+	  {
+	    struct expand_operand ops[2];
+	    enum insn_code icode = optab_handler (this_optab, vec_mode);
+
+	    create_output_operand (&ops[0], target, mode);
+	    create_input_operand (&ops[1], op0, vec_mode);
+	    if (maybe_expand_insn (icode, 2, ops))
+	      {
+		target = ops[0].value;
+		if (GET_MODE (target) != mode)
+		  return gen_lowpart (tmode, target);
+		return target;
+	      }
+	  }
+	/* Fall back to optab with vector result, and then extract scalar.  */
+	this_optab = scalar_reduc_to_vector (this_optab, type);
         temp = expand_unop (vec_mode, this_optab, op0, NULL_RTX, unsignedp);
         gcc_assert (temp);
         /* The tree code produces a scalar result, but (somewhat by convention)
diff --git a/gcc/optabs.c b/gcc/optabs.c
index 605615d..cfb408c 100644
--- a/gcc/optabs.c
+++ b/gcc/optabs.c
@@ -506,13 +506,15 @@ optab_for_tree_code (enum tree_code code, const_tree type,
       return fma_optab;
 
     case REDUC_MAX_EXPR:
-      return TYPE_UNSIGNED (type) ? reduc_umax_optab : reduc_smax_optab;
+      return TYPE_UNSIGNED (type)
+	     ? reduc_umax_scal_optab : reduc_smax_scal_optab;
 
     case REDUC_MIN_EXPR:
-      return TYPE_UNSIGNED (type) ? reduc_umin_optab : reduc_smin_optab;
+      return TYPE_UNSIGNED (type)
+	     ? reduc_umin_scal_optab : reduc_smin_scal_optab;
 
     case REDUC_PLUS_EXPR:
-      return TYPE_UNSIGNED (type) ? reduc_uplus_optab : reduc_splus_optab;
+      return reduc_plus_scal_optab;
 
     case VEC_LSHIFT_EXPR:
       return vec_shl_optab;
@@ -608,7 +610,26 @@ optab_for_tree_code (enum tree_code code, const_tree type,
       return unknown_optab;
     }
 }
-\f
+
+/* Given optab UNOPTAB that reduces a vector to a scalar, find instead the old
+   optab that produces a vector with the reduction result in one element,
+   for a tree with type TYPE.  */
+
+optab
+scalar_reduc_to_vector (optab unoptab, const_tree type)
+{
+  switch (unoptab)
+    {
+    case reduc_plus_scal_optab:
+      return TYPE_UNSIGNED (type) ? reduc_uplus_optab : reduc_splus_optab;
+
+    case reduc_smin_scal_optab: return reduc_smin_optab;
+    case reduc_umin_scal_optab: return reduc_umin_optab;
+    case reduc_smax_scal_optab: return reduc_smax_optab;
+    case reduc_umax_scal_optab: return reduc_umax_optab;
+    default: return unknown_optab;
+    }
+}
 
 /* Expand vector widening operations.
 
diff --git a/gcc/optabs.def b/gcc/optabs.def
index b755470..131ea04 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -243,12 +243,20 @@ OPTAB_D (sin_optab, "sin$a2")
 OPTAB_D (sincos_optab, "sincos$a3")
 OPTAB_D (tan_optab, "tan$a2")
 
+/* Vector reduction to a scalar.  */
+OPTAB_D (reduc_smax_scal_optab, "reduc_smax_scal_$a")
+OPTAB_D (reduc_smin_scal_optab, "reduc_smin_scal_$a")
+OPTAB_D (reduc_plus_scal_optab, "reduc_plus_scal_$a")
+OPTAB_D (reduc_umax_scal_optab, "reduc_umax_scal_$a")
+OPTAB_D (reduc_umin_scal_optab, "reduc_umin_scal_$a")
+/* (Old) Vector reduction, returning a vector with the result in one lane.  */
 OPTAB_D (reduc_smax_optab, "reduc_smax_$a")
 OPTAB_D (reduc_smin_optab, "reduc_smin_$a")
 OPTAB_D (reduc_splus_optab, "reduc_splus_$a")
 OPTAB_D (reduc_umax_optab, "reduc_umax_$a")
 OPTAB_D (reduc_umin_optab, "reduc_umin_$a")
 OPTAB_D (reduc_uplus_optab, "reduc_uplus_$a")
+
 OPTAB_D (sdot_prod_optab, "sdot_prod$I$a")
 OPTAB_D (ssum_widen_optab, "widen_ssum$I$a3")
 OPTAB_D (udot_prod_optab, "udot_prod$I$a")
diff --git a/gcc/optabs.h b/gcc/optabs.h
index 089b15a..1f8e44d 100644
--- a/gcc/optabs.h
+++ b/gcc/optabs.h
@@ -162,6 +162,11 @@ enum optab_subtype
    vector shifts and rotates */
 extern optab optab_for_tree_code (enum tree_code, const_tree, enum optab_subtype);
 
+/* Given an optab that reduces a vector to a scalar, find instead the old
+   optab that produces a vector with the reduction result in one element,
+   for a tree with the specified type.  */
+extern optab scalar_reduc_to_vector (optab, const_tree type);
+
 /* The various uses that a comparison can have; used by can_compare_p:
    jumps, conditional moves, store flag operations.  */
 enum can_compare_purpose
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 8d97e17..e8b108d 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -5102,15 +5102,17 @@ vectorizable_reduction (gimple stmt, gimple_stmt_iterator *gsi,
 
           epilog_reduc_code = ERROR_MARK;
         }
-
-      if (reduc_optab
-          && optab_handler (reduc_optab, vec_mode) == CODE_FOR_nothing)
+      else if (optab_handler (reduc_optab, vec_mode) == CODE_FOR_nothing)
         {
-          if (dump_enabled_p ())
-	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			     "reduc op not supported by target.\n");
+          optab = scalar_reduc_to_vector (reduc_optab, vectype_out);
+          if (optab_handler (optab, vec_mode) == CODE_FOR_nothing)
+            {
+              if (dump_enabled_p ())
+	        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+				 "reduc op not supported by target.\n");
 
-          epilog_reduc_code = ERROR_MARK;
+	      epilog_reduc_code = ERROR_MARK;
+	    }
         }
     }
   else

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH/RFC v2 3/14] Add new optabs for reducing vectors to scalars
  2014-09-25 16:12             ` Alan Lawrence
@ 2014-09-25 19:20               ` Segher Boessenkool
  0 siblings, 0 replies; 52+ messages in thread
From: Segher Boessenkool @ 2014-09-25 19:20 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: Richard Biener, gcc-patches

On Thu, Sep 25, 2014 at 05:12:24PM +0100, Alan Lawrence wrote:
> So: original patch with updated commenting attached...Segher, is there any 
> chance you could test this on powerpc too? (in combination with patch 2/14, 
> which will need to be applied first; you can skip patch 1, and >=4.)

2+3/14, tested as before, on powerpc64-linux; no regressions.

Cheers,


Segher

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/14+2][Vectorizer] Made reductions endianness-neutral, fixes PR/61114
  2014-09-22 11:21 ` [PATCH 0/14+2][Vectorizer] Made reductions endianness-neutral, fixes PR/61114 Richard Biener
  2014-09-22 11:26   ` Richard Biener
@ 2014-10-06 17:31   ` Alan Lawrence
       [not found]   ` <5432D1A5.6080208@arm.com>
  2 siblings, 0 replies; 52+ messages in thread
From: Alan Lawrence @ 2014-10-06 17:31 UTC (permalink / raw)
  To: Richard Biener, Marcus Shawcroft; +Cc: gcc-patches

Ok, so unless there are objections, I plan to commit patches 1, 2, 4, 5, and 6,
which have been previously approved, in that sequence. (Of those, all bar patch
2 are AArch64 only.) I think this is better than maintaining an ever-expanding
patch series.

Then I'll get to work on migrating all backends to the new _scal_ optab (and
removing the vector optab). Certainly I'd like to replace vec_shr/l with
vec_perm_expr too, but I'm conscious that the end of stage 1 is approaching!

--Alan


Richard Biener wrote:
> On Thu, Sep 18, 2014 at 1:41 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
>> The end goal here is to remove this code from tree-vect-loop.c
>> (vect_create_epilog_for_reduction):
>>
>>       if (BYTES_BIG_ENDIAN)
>>         bitpos = size_binop (MULT_EXPR,
>>                              bitsize_int (TYPE_VECTOR_SUBPARTS (vectype) -
>> 1),
>>                              TYPE_SIZE (scalar_type));
>>       else
>>
>> as this is the root cause of PR/61114 (see testcase there, failing on all
>> bigendian targets supporting reduc_[us]plus_optab). Quoting Richard Biener,
>> "all code conditional on BYTES/WORDS_BIG_ENDIAN in tree-vect* is
>> suspicious". The code snippet above is used on two paths:
>>
>> (Path 1) (patches 1-6) Reductions using REDUC_(PLUS|MIN|MAX)_EXPR =
>> reduc_[us](plus|min|max)_optab.
>> The optab is documented as "the scalar result is stored in the least
>> significant bits of operand 0", but the tree code as "the first element in
>> the vector holding the result of the reduction of all elements of the
>> operand". This mismatch means that when the tree code is folded, the code
>> snippet above reads the result from the wrong end of the vector.
>>
>> The strategy (as per
>> https://gcc.gnu.org/ml/gcc-patches/2014-08/msg00041.html) is to define new
>> tree codes and optabs that produce scalar results directly; this seems
>> better than tying (the element of the vector into which the result is
>> placed) to (the endianness of the target), and avoids generating extra moves
>> on current bigendian targets. However, the previous optabs are retained for
>> now as a migration strategy so as not to break existing backends; moving
>> individual platforms over will follow.
>>
>> A complication here is on AArch64, where we directly generate
>> REDUC_PLUS_EXPRs from intrinsics in gimple_fold_builtin; I temporarily
>> remove this folding in order to decouple the midend and AArch64 backend.
> 
> Sounds fine.  I hope we can transition all backends for 5.0 and remove
> the vector variant optabs (maybe renaming the scalar ones).
> 
>> (Path 2) (patches 7-13) Reductions using whole-vector-shifts, i.e.
>> VEC_RSHIFT_EXPR and vec_shr_optab. Here the tree code as well as the optab
>> is defined in an endianness-dependent way, leading to significant
>> complication in fold-const.c. (Moreover, the "equivalent" vec_shl_optab is
>> never used!). Few platforms appear to handle vec_shr_optab (and fewer
>> bigendian - I see only PowerPC and MIPS), so it seems pertinent to change
>> the existing optab to be endianness-neutral.
>>
>> Patch 10 defines vec_shr for AArch64, for the old specification; patch 13
>> updates that implementation to fit the new endianness-neutral specification,
>> serving as a guide for other existing backends. Patches/RFCs 15 and 16 are
>> equivalents for MIPS and PowerPC; I haven't tested these but hope they act
>> as useful pointers for the port maintainers.
>>
>> Finally patch 14 cleans up the affected part of tree-vect-loop.c
>> (vect_create_epilog_for_reduction).
> 
> As said during the individual patches review I'd like the vectorizer to
> use a VEC_PERM_EXPR instead of VEC_RSHIFT_EXPR (with
> only whole-element amounts).  This means we can remove
> VEC_RSHIFT_EXPR.  It also means that if the backend defines
> vec_perm_const (which it really should) it can handle the special
> permutes that boil down to a possibly more efficient vector shift
> there (a good optimization anyway).  Until it does that all backends
> would at least create correct code (with the endian dependent
> vec_shr removed).
> 
> Richard.
> 
>> --Alan
>>
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/14+2][Vectorizer] Made reductions endianness-neutral, fixes PR/61114
       [not found]   ` <5432D1A5.6080208@arm.com>
@ 2014-10-07  7:45     ` Richard Biener
  2014-10-07  7:46       ` Richard Biener
       [not found]       ` <5436C138.50208@arm.com>
  0 siblings, 2 replies; 52+ messages in thread
From: Richard Biener @ 2014-10-07  7:45 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: Marcus Shawcroft, gcc-patches

On Mon, Oct 6, 2014 at 7:30 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> Ok, so unless there are objections, I plan to commit patches 1, 2, 4, 5, and
> 6,
> which have been previously approved, in that sequence. (Of those, all bar
> patch
> 2 are AArch64 only.) I think this is better than maintaining an
> ever-expanding
> patch series.

Agreed.

> Then I'll get to work on migrating all backends to the new _scal_ optab (and
> removing the vector optab). Certainly I'd like to replace vec_shr/l with
> vec_perm_expr too, but I'm conscious that the end of stage 1 is approaching!

I suppose we all are.  It will last until end of October at least
(stage1 of gcc 4.9
ended Nov 22nd, certainly a bit late).

I do expect we will continue merging already developed / posted stuff through
stage3 (as usual).

That said, it would be really nice to get rid of VEC_RSHIFT_EXPR.

Thanks,
Richard.

> --Alan
>
>
>
>
> Richard Biener wrote:
>>
>> On Thu, Sep 18, 2014 at 1:41 PM, Alan Lawrence <alan.lawrence@arm.com>
>> wrote:
>>>
>>> The end goal here is to remove this code from tree-vect-loop.c
>>> (vect_create_epilog_for_reduction):
>>>
>>>       if (BYTES_BIG_ENDIAN)
>>>         bitpos = size_binop (MULT_EXPR,
>>>                              bitsize_int (TYPE_VECTOR_SUBPARTS (vectype)
>>> -
>>> 1),
>>>                              TYPE_SIZE (scalar_type));
>>>       else
>>>
>>> as this is the root cause of PR/61114 (see testcase there, failing on all
>>> bigendian targets supporting reduc_[us]plus_optab). Quoting Richard
>>> Biener,
>>> "all code conditional on BYTES/WORDS_BIG_ENDIAN in tree-vect* is
>>> suspicious". The code snippet above is used on two paths:
>>>
>>> (Path 1) (patches 1-6) Reductions using REDUC_(PLUS|MIN|MAX)_EXPR =
>>> reduc_[us](plus|min|max)_optab.
>>> The optab is documented as "the scalar result is stored in the least
>>> significant bits of operand 0", but the tree code as "the first element
>>> in
>>> the vector holding the result of the reduction of all elements of the
>>> operand". This mismatch means that when the tree code is folded, the code
>>> snippet above reads the result from the wrong end of the vector.
>>>
>>> The strategy (as per
>>> https://gcc.gnu.org/ml/gcc-patches/2014-08/msg00041.html) is to define
>>> new
>>> tree codes and optabs that produce scalar results directly; this seems
>>> better than tying (the element of the vector into which the result is
>>> placed) to (the endianness of the target), and avoids generating extra
>>> moves
>>> on current bigendian targets. However, the previous optabs are retained
>>> for
>>> now as a migration strategy so as not to break existing backends; moving
>>> individual platforms over will follow.
>>>
>>> A complication here is on AArch64, where we directly generate
>>> REDUC_PLUS_EXPRs from intrinsics in gimple_fold_builtin; I temporarily
>>> remove this folding in order to decouple the midend and AArch64 backend.
>>
>>
>> Sounds fine.  I hope we can transition all backends for 5.0 and remove
>> the vector variant optabs (maybe renaming the scalar ones).
>>
>>> (Path 2) (patches 7-13) Reductions using whole-vector-shifts, i.e.
>>> VEC_RSHIFT_EXPR and vec_shr_optab. Here the tree code as well as the
>>> optab
>>> is defined in an endianness-dependent way, leading to significant
>>> complication in fold-const.c. (Moreover, the "equivalent" vec_shl_optab
>>> is
>>> never used!). Few platforms appear to handle vec_shr_optab (and fewer
>>> bigendian - I see only PowerPC and MIPS), so it seems pertinent to change
>>> the existing optab to be endianness-neutral.
>>>
>>> Patch 10 defines vec_shr for AArch64, for the old specification; patch 13
>>> updates that implementation to fit the new endianness-neutral
>>> specification,
>>> serving as a guide for other existing backends. Patches/RFCs 15 and 16
>>> are
>>> equivalents for MIPS and PowerPC; I haven't tested these but hope they
>>> act
>>> as useful pointers for the port maintainers.
>>>
>>> Finally patch 14 cleans up the affected part of tree-vect-loop.c
>>> (vect_create_epilog_for_reduction).
>>
>>
>> As said during the individual patches review I'd like the vectorizer to
>> use a VEC_PERM_EXPR instead of VEC_RSHIFT_EXPR (with
>> only whole-element amounts).  This means we can remove
>> VEC_RSHIFT_EXPR.  It also means that if the backend defines
>> vec_perm_const (which it really should) it can handle the special
>> permutes that boil down to a possibly more efficient vector shift
>> there (a good optimization anyway).  Until it does that all backends
>> would at least create correct code (with the endian dependent
>> vec_shr removed).
>>
>> Richard.
>>
>>> --Alan
>>>
>>
>
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/14+2][Vectorizer] Made reductions endianness-neutral, fixes PR/61114
  2014-10-07  7:45     ` Richard Biener
@ 2014-10-07  7:46       ` Richard Biener
       [not found]       ` <5436C138.50208@arm.com>
  1 sibling, 0 replies; 52+ messages in thread
From: Richard Biener @ 2014-10-07  7:46 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: Marcus Shawcroft, gcc-patches

On Tue, Oct 7, 2014 at 9:45 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Mon, Oct 6, 2014 at 7:30 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
>> Ok, so unless there are objections, I plan to commit patches 1, 2, 4, 5, and
>> 6,
>> which have been previously approved, in that sequence. (Of those, all bar
>> patch
>> 2 are AArch64 only.) I think this is better than maintaining an
>> ever-expanding
>> patch series.
>
> Agreed.
>
>> Then I'll get to work on migrating all backends to the new _scal_ optab (and
>> removing the vector optab). Certainly I'd like to replace vec_shr/l with
>> vec_perm_expr too, but I'm conscious that the end of stage 1 is approaching!
>
> I suppose we all are.  It will last until end of October at least
> (stage1 of gcc 4.9
> ended Nov 22nd, certainly a bit late).
>
> I do expect we will continue merging already developed / posted stuff through
> stage3 (as usual).
>
> That said, it would be really nice to get rid of VEC_RSHIFT_EXPR.

And you can fix performance regressions you introduce (badly handled
VEC_PERM) until the GCC 5 release happens (and even after that).
Heh.  Easy way out ;)

Richard.

> Thanks,
> Richard.
>
>> --Alan
>>
>>
>>
>>
>> Richard Biener wrote:
>>>
>>> On Thu, Sep 18, 2014 at 1:41 PM, Alan Lawrence <alan.lawrence@arm.com>
>>> wrote:
>>>>
>>>> The end goal here is to remove this code from tree-vect-loop.c
>>>> (vect_create_epilog_for_reduction):
>>>>
>>>>       if (BYTES_BIG_ENDIAN)
>>>>         bitpos = size_binop (MULT_EXPR,
>>>>                              bitsize_int (TYPE_VECTOR_SUBPARTS (vectype)
>>>> -
>>>> 1),
>>>>                              TYPE_SIZE (scalar_type));
>>>>       else
>>>>
>>>> as this is the root cause of PR/61114 (see testcase there, failing on all
>>>> bigendian targets supporting reduc_[us]plus_optab). Quoting Richard
>>>> Biener,
>>>> "all code conditional on BYTES/WORDS_BIG_ENDIAN in tree-vect* is
>>>> suspicious". The code snippet above is used on two paths:
>>>>
>>>> (Path 1) (patches 1-6) Reductions using REDUC_(PLUS|MIN|MAX)_EXPR =
>>>> reduc_[us](plus|min|max)_optab.
>>>> The optab is documented as "the scalar result is stored in the least
>>>> significant bits of operand 0", but the tree code as "the first element
>>>> in
>>>> the vector holding the result of the reduction of all elements of the
>>>> operand". This mismatch means that when the tree code is folded, the code
>>>> snippet above reads the result from the wrong end of the vector.
>>>>
>>>> The strategy (as per
>>>> https://gcc.gnu.org/ml/gcc-patches/2014-08/msg00041.html) is to define
>>>> new
>>>> tree codes and optabs that produce scalar results directly; this seems
>>>> better than tying (the element of the vector into which the result is
>>>> placed) to (the endianness of the target), and avoids generating extra
>>>> moves
>>>> on current bigendian targets. However, the previous optabs are retained
>>>> for
>>>> now as a migration strategy so as not to break existing backends; moving
>>>> individual platforms over will follow.
>>>>
>>>> A complication here is on AArch64, where we directly generate
>>>> REDUC_PLUS_EXPRs from intrinsics in gimple_fold_builtin; I temporarily
>>>> remove this folding in order to decouple the midend and AArch64 backend.
>>>
>>>
>>> Sounds fine.  I hope we can transition all backends for 5.0 and remove
>>> the vector variant optabs (maybe renaming the scalar ones).
>>>
>>>> (Path 2) (patches 7-13) Reductions using whole-vector-shifts, i.e.
>>>> VEC_RSHIFT_EXPR and vec_shr_optab. Here the tree code as well as the
>>>> optab
>>>> is defined in an endianness-dependent way, leading to significant
>>>> complication in fold-const.c. (Moreover, the "equivalent" vec_shl_optab
>>>> is
>>>> never used!). Few platforms appear to handle vec_shr_optab (and fewer
>>>> bigendian - I see only PowerPC and MIPS), so it seems pertinent to change
>>>> the existing optab to be endianness-neutral.
>>>>
>>>> Patch 10 defines vec_shr for AArch64, for the old specification; patch 13
>>>> updates that implementation to fit the new endianness-neutral
>>>> specification,
>>>> serving as a guide for other existing backends. Patches/RFCs 15 and 16
>>>> are
>>>> equivalents for MIPS and PowerPC; I haven't tested these but hope they
>>>> act
>>>> as useful pointers for the port maintainers.
>>>>
>>>> Finally patch 14 cleans up the affected part of tree-vect-loop.c
>>>> (vect_create_epilog_for_reduction).
>>>
>>>
>>> As said during the individual patches review I'd like the vectorizer to
>>> use a VEC_PERM_EXPR instead of VEC_RSHIFT_EXPR (with
>>> only whole-element amounts).  This means we can remove
>>> VEC_RSHIFT_EXPR.  It also means that if the backend defines
>>> vec_perm_const (which it really should) it can handle the special
>>> permutes that boil down to a possibly more efficient vector shift
>>> there (a good optimization anyway).  Until it does that all backends
>>> would at least create correct code (with the endian dependent
>>> vec_shr removed).
>>>
>>> Richard.
>>>
>>>> --Alan
>>>>
>>>
>>
>>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/14+2][Vectorizer] Made reductions endianness-neutral, fixes PR/61114
       [not found]       ` <5436C138.50208@arm.com>
@ 2014-10-09 17:13         ` Alan Lawrence
  0 siblings, 0 replies; 52+ messages in thread
From: Alan Lawrence @ 2014-10-09 17:13 UTC (permalink / raw)
  To: Richard Biener
  Cc: Marcus Shawcroft, gcc-patches, Catherine Moore, Eric Christopher,
	Matthew Fortune

Ok....well, I see a path forward, somewhere there....

However (bah), I can't push that subset of patches - I came back from a week's
holiday and misremembered - the AArch64 changes depend upon the introduction of
the _scal_ optabs, not just the tree changes :( .

I'll try to post optab migration patches next week for x86, rs6000 (mostly, I
haven't figured out paired.md yet), ARM, and IA64 (fwiw; it has only v2sf
reductions). Having looked at MIPS/Loongson I'm feeling a bit bewildered and not
sure how to proceed, so I think I must ask the MIPS maintainers (CC'd) for
assistance: how can one add a vec_extract, to produce a scalar result, to the
end of each reduc_ optab?
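
(To be concrete about the shape I have in mind - a generic sketch, not MIPS
code, and every gen_* name below is only a stand-in for whatever patterns the
port actually provides - the body of a new reduc_plus_scal_<mode> expander
would do little more than:

  rtx vec_res = gen_reg_rtx (V4SImode);   /* e.g. for the V4SI case */
  /* Reuse the port's existing vector-result reduction...  */
  emit_insn (gen_reduc_splus_v4si (vec_res, operands[1]));
  /* ...then extract the element holding the result; lane 0 here is purely
     illustrative, a big-endian port may need the other end of the vector.  */
  emit_insn (gen_vec_extractv4si (operands[0], vec_res, const0_rtx));
  DONE;

but I don't know the Loongson patterns well enough to see how to wire that up.)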

--Alan


Richard Biener wrote:
 > > On Mon, Oct 6, 2014 at 7:30 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
 >> >> Ok, so unless there are objections, I plan to commit patches 1, 2, 4, 5, and
 >> >> 6,
 >> >> which have been previously approved, in that sequence. (Of those, all bar
 >> >> patch
 >> >> 2 are AArch64 only.) I think this is better than maintaining an
 >> >> ever-expanding
 >> >> patch series.
 > >
 > > Agreed.
 > >
 >> >> Then I'll get to work on migrating all backends to the new _scal_ optab (and
 >> >> removing the vector optab). Certainly I'd like to replace vec_shr/l with
 >> >> vec_perm_expr too, but I'm conscious that the end of stage 1 is approaching!
 > >
 > > I suppose we all are.  It will last until end of October at least
 > > (stage1 of gcc 4.9
 > > ended Nov 22nd, certainly a bit late).
 > >
 > > I do expect we will continue merging already developed / posted stuff through
 > > stage3 (as usual).
 > >
 > > That said, it would be really nice to get rid of VEC_RSHIFT_EXPR.
 > >
 > > Thanks,
 > > Richard.
 > >
 >> >> --Alan
 >> >>
 >> >>
 >> >>
 >> >>
 >> >> Richard Biener wrote:
 >>> >>> On Thu, Sep 18, 2014 at 1:41 PM, Alan Lawrence <alan.lawrence@arm.com>
 >>> >>> wrote:
 >>>> >>>> The end goal here is to remove this code from tree-vect-loop.c
 >>>> >>>> (vect_create_epilog_for_reduction):
 >>>> >>>>
 >>>> >>>>       if (BYTES_BIG_ENDIAN)
 >>>> >>>>         bitpos = size_binop (MULT_EXPR,
 >>>> >>>>                              bitsize_int (TYPE_VECTOR_SUBPARTS (vectype)
 >>>> >>>> -
 >>>> >>>> 1),
 >>>> >>>>                              TYPE_SIZE (scalar_type));
 >>>> >>>>       else
 >>>> >>>>
 >>>> >>>> as this is the root cause of PR/61114 (see testcase there, failing on all
 >>>> >>>> bigendian targets supporting reduc_[us]plus_optab). Quoting Richard
 >>>> >>>> Biener,
 >>>> >>>> "all code conditional on BYTES/WORDS_BIG_ENDIAN in tree-vect* is
 >>>> >>>> suspicious". The code snippet above is used on two paths:
 >>>> >>>>
 >>>> >>>> (Path 1) (patches 1-6) Reductions using REDUC_(PLUS|MIN|MAX)_EXPR =
 >>>> >>>> reduc_[us](plus|min|max)_optab.
 >>>> >>>> The optab is documented as "the scalar result is stored in the least
 >>>> >>>> significant bits of operand 0", but the tree code as "the first element
 >>>> >>>> in
 >>>> >>>> the vector holding the result of the reduction of all elements of the
 >>>> >>>> operand". This mismatch means that when the tree code is folded, the code
 >>>> >>>> snippet above reads the result from the wrong end of the vector.
 >>>> >>>>
 >>>> >>>> The strategy (as per
 >>>> >>>> https://gcc.gnu.org/ml/gcc-patches/2014-08/msg00041.html) is to define
 >>>> >>>> new
 >>>> >>>> tree codes and optabs that produce scalar results directly; this seems
 >>>> >>>> better than tying (the element of the vector into which the result is
 >>>> >>>> placed) to (the endianness of the target), and avoids generating extra
 >>>> >>>> moves
 >>>> >>>> on current bigendian targets. However, the previous optabs are retained
 >>>> >>>> for
 >>>> >>>> now as a migration strategy so as not to break existing backends; moving
 >>>> >>>> individual platforms over will follow.
 >>>> >>>>
 >>>> >>>> A complication here is on AArch64, where we directly generate
 >>>> >>>> REDUC_PLUS_EXPRs from intrinsics in gimple_fold_builtin; I temporarily
 >>>> >>>> remove this folding in order to decouple the midend and AArch64 backend.
 >>> >>>
 >>> >>> Sounds fine.  I hope we can transition all backends for 5.0 and remove
 >>> >>> the vector variant optabs (maybe renaming the scalar ones).
 >>> >>>
 >>>> >>>> (Path 2) (patches 7-13) Reductions using whole-vector-shifts, i.e.
 >>>> >>>> VEC_RSHIFT_EXPR and vec_shr_optab. Here the tree code as well as the
 >>>> >>>> optab
 >>>> >>>> is defined in an endianness-dependent way, leading to significant
 >>>> >>>> complication in fold-const.c. (Moreover, the "equivalent" vec_shl_optab
 >>>> >>>> is
 >>>> >>>> never used!). Few platforms appear to handle vec_shr_optab (and fewer
 >>>> >>>> bigendian - I see only PowerPC and MIPS), so it seems pertinent to change
 >>>> >>>> the existing optab to be endianness-neutral.
 >>>> >>>>
 >>>> >>>> Patch 10 defines vec_shr for AArch64, for the old specification; patch 13
 >>>> >>>> updates that implementation to fit the new endianness-neutral
 >>>> >>>> specification,
 >>>> >>>> serving as a guide for other existing backends. Patches/RFCs 15 and 16
 >>>> >>>> are
 >>>> >>>> equivalents for MIPS and PowerPC; I haven't tested these but hope they
 >>>> >>>> act
 >>>> >>>> as useful pointers for the port maintainers.
 >>>> >>>>
 >>>> >>>> Finally patch 14 cleans up the affected part of tree-vect-loop.c
 >>>> >>>> (vect_create_epilog_for_reduction).
 >>> >>>
 >>> >>> As said during the individual patches review I'd like the vectorizer to
 >>> >>> use a VEC_PERM_EXPR instead of VEC_RSHIFT_EXPR (with
 >>> >>> only whole-element amounts).  This means we can remove
 >>> >>> VEC_RSHIFT_EXPR.  It also means that if the backend defines
 >>> >>> vec_perm_const (which it really should) it can handle the special
 >>> >>> permutes that boil down to a possibly more efficient vector shift
 >>> >>> there (a good optimization anyway).  Until it does that all backends
 >>> >>> would at least create correct code (with the endian dependent
 >>> >>> vec_shr removed).
 >>> >>>
 >>> >>> Richard.
 >>> >>>
 >>>> >>>> --Alan
 >>>> >>>>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 11/14] Remove VEC_LSHIFT_EXPR and vec_shl_optab
  2014-09-22 10:52   ` Richard Biener
@ 2014-10-27 18:45     ` Alan Lawrence
  2014-10-27 20:24       ` Richard Biener
  0 siblings, 1 reply; 52+ messages in thread
From: Alan Lawrence @ 2014-10-27 18:45 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

Ok, I've now pushed the previously-approved first half of this, and am now 
looking at replacing VEC_RSHIFT_EXPR with a VEC_PERM_EXPR. However: does it seem 
reasonable to push this patch 11 (removing VEC_LSHIFT_EXPR and vec_shl_optab) 
out-of-sequence? The patch applies almost-cleanly, there is just a one-line 
conflict with a change to a comment from the previous patch (which I'm skipping)...

Cheers, Alan

Richard Biener wrote:
> On Thu, Sep 18, 2014 at 2:35 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
>> The VEC_LSHIFT_EXPR tree code, and the corresponding vec_shl_optab, seem to
>> have been added for completeness, providing a counterpart to VEC_RSHIFT_EXPR
>> and vec_shr_optab. However, whereas VEC_RSHIFT_EXPRs are generated (only) by
>> the vectorizer, VEC_LSHIFT_EXPR expressions are not generated at all, so
>> there seems little point in maintaining it.
>>
>> Bootstrapped on x86_64-unknown-linux-gnu.
>> aarch64.exp+vect.exp on aarch64-none-elf and aarch64_be-none-elf.
> 
> Ah, there it is ;)
> 
> Ok.
> 
> Thanks,
> Richard.
> 
>> gcc/ChangeLog:
>>
>>         * expr.c (expand_expr_real_2): Remove code handling VEC_LSHIFT_EXPR.
>>         * fold-const.c (const_binop): Likewise.
>>         * cfgexpand.c (expand_debug_expr): Likewise.
>>         * tree-inline.c (estimate_operator_cost, dump_generic_node,
>>         op_code_prio, op_symbol_code): Likewise.
>>         * tree-vect-generic.c (expand_vector_operations_1): Likewise.
>>         * optabs.c (optab_for_tree_code): Likewise.
>>         (expand_vec_shift_expr): Likewise, update comment.
>>         * tree.def: Delete VEC_LSHIFT_EXPR, remove comment.
>>         * optabs.h (expand_vec_shift_expr): Remove comment re.
>> VEC_LSHIFT_EXPR.
>>         * optabs.def: Remove vec_shl_optab.
>>         * doc/md.texi: Remove references to vec_shl_m.
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 11/14] Remove VEC_LSHIFT_EXPR and vec_shl_optab
  2014-10-27 18:45     ` Alan Lawrence
@ 2014-10-27 20:24       ` Richard Biener
  0 siblings, 0 replies; 52+ messages in thread
From: Richard Biener @ 2014-10-27 20:24 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: gcc-patches

On October 27, 2014 7:43:05 PM CET, Alan Lawrence <alan.lawrence@arm.com> wrote:
>Ok, I've now pushed the previously-approved first half of this, and am
>now 
>looking at replacing VEC_RSHIFT_EXPR with a VEC_PERM_EXPR. However:
>does it seem 
>reasonable to push this patch 11 (removing VEC_LSHIFT_EXPR and
>vec_shl_optab) 
>out-of-sequence? The patch applies almost-cleanly, there is just a
>one-line 
>conflict with a change to a comment from the previous patch (which I'm
>skipping)...

Sure - please go ahead!

Thanks,
Richard.

>Cheers, Alan
>
>Richard Biener wrote:
>> On Thu, Sep 18, 2014 at 2:35 PM, Alan Lawrence
><alan.lawrence@arm.com> wrote:
>>> The VEC_LSHIFT_EXPR tree code, and the corresponding vec_shl_optab,
>seem to
>>> have been added for completeness, providing a counterpart to
>VEC_RSHIFT_EXPR
>>> and vec_shr_optab. However, whereas VEC_RSHIFT_EXPRs are generated
>(only) by
>>> the vectorizer, VEC_LSHIFT_EXPR expressions are not generated at
>all, so
>>> there seems little point in maintaining it.
>>>
>>> Bootstrapped on x86_64-unknown-linux-gnu.
>>> aarch64.exp+vect.exp on aarch64-none-elf and aarch64_be-none-elf.
>> 
>> Ah, there it is ;)
>> 
>> Ok.
>> 
>> Thanks,
>> Richard.
>> 
>>> gcc/ChangeLog:
>>>
>>>         * expr.c (expand_expr_real_2): Remove code handling
>VEC_LSHIFT_EXPR.
>>>         * fold-const.c (const_binop): Likewise.
>>>         * cfgexpand.c (expand_debug_expr): Likewise.
>>>         * tree-inline.c (estimate_operator_cost, dump_generic_node,
>>>         op_code_prio, op_symbol_code): Likewise.
>>>         * tree-vect-generic.c (expand_vector_operations_1):
>Likewise.
>>>         * optabs.c (optab_for_tree_code): Likewise.
>>>         (expand_vec_shift_expr): Likewise, update comment.
>>>         * tree.def: Delete VEC_LSHIFT_EXPR, remove comment.
>>>         * optabs.h (expand_vec_shift_expr): Remove comment re.
>>> VEC_LSHIFT_EXPR.
>>>         * optabs.def: Remove vec_shl_optab.
>>>         * doc/md.texi: Remove references to vec_shl_m.
>> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* PUSHED: [PATCH 14/14][Vectorizer] Tidy up vect_create_epilog / use_scalar_result
  2014-09-22 10:53   ` Richard Biener
@ 2014-11-14 17:29     ` Alan Lawrence
  0 siblings, 0 replies; 52+ messages in thread
From: Alan Lawrence @ 2014-11-14 17:29 UTC (permalink / raw)
  To: gcc-patches; +Cc: Richard Biener

After recent updates, tree-vect-loop.c is in the same state as when this cleanup 
patch was first written and approved, so I've just pushed it as r/217580.

Cheers,
Alan

Richard Biener wrote:
> On Thu, Sep 18, 2014 at 2:48 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
>> Following earlier patches, vect_create_epilog_for_reduction contains exactly
>> one case where extract_scalar_result==true. Hence, move the code 'if
>> (extract_scalar_result)' there, and tidy-up/remove some variables.
>>
>> bootstrapped on x86_64-none-linux-gnu + check-gcc + check-g++.
> 
> Ok.
> 
> Thanks,
> Richard.
> 
>> gcc/ChangeLog:
>>
>>         * tree-vect-loop.c (vect_create_epilog_for_reduction): Move code for
>>         'if (extract_scalar_result)' to the only place that it is true.
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2014-11-14 17:21 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-18 11:41 [PATCH 0/14+2][Vectorizer] Made reductions endianness-neutral, fixes PR/61114 Alan Lawrence
2014-09-18 11:45 ` [PATCH 1/14][AArch64] Temporarily remove aarch64_gimple_fold_builtin code for reduction operations Alan Lawrence
2014-09-24  9:41   ` Marcus Shawcroft
2014-09-18 11:51 ` [PATCH 2/14][Vectorizer] Make REDUC_xxx_EXPR tree codes produce a scalar result Alan Lawrence
2014-09-22 10:34   ` Richard Biener
2014-09-22 13:23     ` Alan Lawrence
2014-09-24 15:02     ` Alan Lawrence
2014-09-24 18:08       ` Segher Boessenkool
2014-09-25 16:07         ` Alan Lawrence
2014-09-18 11:54 ` [PATCH 3/14] Add new optabs for reducing vectors to scalars Alan Lawrence
2014-09-22 10:40   ` Richard Biener
2014-09-22 13:26     ` Alan Lawrence
2014-09-22 13:38       ` Richard Biener
2014-09-25 14:33         ` [PATCH/RFC v2 " Alan Lawrence
2014-09-25 15:31           ` Richard Biener
2014-09-25 16:12             ` Alan Lawrence
2014-09-25 19:20               ` Segher Boessenkool
2014-09-18 11:59 ` [PATCH 4/14][AArch64] Use new reduc_plus_scal optabs, inc. for __builtins Alan Lawrence
2014-09-24  9:44   ` Marcus Shawcroft
2014-09-18 12:02 ` [PATCH 5/14][AArch64] Use new reduc_[us](min|max)_scal optabs, inc. for builtins Alan Lawrence
2014-09-24  9:47   ` Marcus Shawcroft
2014-09-18 12:05 ` [PATCH 6/14][AArch64] Restore gimple_folding of reduction intrinsics Alan Lawrence
2014-09-24  9:48   ` Marcus Shawcroft
2014-09-18 12:19 ` [PATCH 7/14][Testsuite] Add tests of reductions using whole-vector-shifts (multiplication) Alan Lawrence
2014-09-22 10:41   ` Richard Biener
2014-09-18 12:25 ` [PATCH 8/14][Testsuite] Add tests of reductions using whole-vector-shifts (ior) Alan Lawrence
2014-09-22 10:42   ` Richard Biener
2014-09-18 12:27 ` [PATCH 9/14] Enforce whole-vector-shifts to always be by a whole number of elements Alan Lawrence
2014-09-22 10:50   ` Richard Biener
2014-09-18 12:34 ` [PATCH 10/14][AArch64] Implement vec_shr optab Alan Lawrence
2014-09-18 12:35 ` [PATCH 11/14] Remove VEC_LSHIFT_EXPR and vec_shl_optab Alan Lawrence
2014-09-22 10:52   ` Richard Biener
2014-10-27 18:45     ` Alan Lawrence
2014-10-27 20:24       ` Richard Biener
2014-09-18 12:43 ` [PATCH 12/14][Vectorizer] Redefine VEC_RSHIFT_EXPR and vec_shr_optab as endianness-neutral Alan Lawrence
2014-09-18 13:12   ` David Edelsohn
2014-09-22 13:27     ` Bill Schmidt
2014-09-22 10:58   ` Richard Biener
2014-09-18 12:45 ` [PATCH 13/14][AArch64_be] Fix vec_shr pattern to correctly implement endianness-neutral optab Alan Lawrence
2014-09-22 10:52   ` Richard Biener
2014-09-18 12:48 ` [PATCH 14/14][Vectorizer] Tidy up vect_create_epilog / use_scalar_result Alan Lawrence
2014-09-22 10:53   ` Richard Biener
2014-11-14 17:29     ` PUSHED: " Alan Lawrence
2014-09-18 12:58 ` [PATCH/RFC 15 / 14+2][RS6000] Remove vec_shl and (hopefully) fix vec_shr Alan Lawrence
2014-09-23 12:50   ` David Edelsohn
2014-09-18 13:02 ` [PATCH 16 / 14+2][MIPS] " Alan Lawrence
2014-09-22 11:21 ` [PATCH 0/14+2][Vectorizer] Made reductions endianness-neutral, fixes PR/61114 Richard Biener
2014-09-22 11:26   ` Richard Biener
2014-10-06 17:31   ` Alan Lawrence
     [not found]   ` <5432D1A5.6080208@arm.com>
2014-10-07  7:45     ` Richard Biener
2014-10-07  7:46       ` Richard Biener
     [not found]       ` <5436C138.50208@arm.com>
2014-10-09 17:13         ` Alan Lawrence
