public inbox for gcc-patches@gcc.gnu.org
* [PATCH v2 0-6/11] Fix PR/61114, make direct vector reductions endianness-neutral
@ 2014-10-24 11:57 Alan Lawrence
  2014-10-24 11:58 ` [PATCH 7/11][ARM] Migrate to new reduc_plus_scal_optab Alan Lawrence
                   ` (6 more replies)
  0 siblings, 7 replies; 27+ messages in thread
From: Alan Lawrence @ 2014-10-24 11:57 UTC (permalink / raw)
  To: gcc-patches; +Cc: Richard Biener, David Edelsohn

[-- Attachment #1: Type: text/plain, Size: 2492 bytes --]

This is the first half of my previous patch series 
(https://gcc.gnu.org/ml/gcc-patches/2014-09/msg01456.html), that is the part 
making the REDUC_..._EXPR tree codes endian-neutral, and adding a new 
reduce-to-scalar optab in place of the endianness-dependent 
reduc_[us](plus|min|max)_optab.
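
As a rough illustration of the interface change (hypothetical C using a GNU
vector type; the two declarations stand in for the optab expansions and are
not real builtins):

  typedef int v4si __attribute__ ((vector_size (16)));

  /* Legacy interface: the result is a vector, with the reduction value in
     lane 0 on little-endian but lane N-1 on big-endian -- hence the
     BYTES_BIG_ENDIAN special-casing in the midend.  */
  v4si reduc_splus_v4si (v4si x);

  /* New interface: the result is the scalar itself, so lane numbering
     never enters into it.  */
  int reduc_plus_scal_v4si (v4si x);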

I'm leaving the vec_shr portion out of this patch series, as the link between 
the two halves is only the end goal of removing an "if (BYTES_BIG_ENDIAN)" from 
tree-vect-loop.c; this series removes that from one code path so can stand alone.

Patches 1-6 are as previously posted apart from rebasing and removing the 
old/poisoned AArch64 patterns as per maintainer's request. Patches 1, 2, 4, 5 
and 6 have already been approved; patch 3 was discussed somewhat but I think we 
decided against most of the ideas raised; I have added a comment to 
scalar_reduc_to_vector. I now reread Richie's "Otherwise the patch looks good to 
me" and wonder if I should have taken that as an approval but I didn't read it 
that way at the time...???

Patches 7-11 migrate ARM, x86, IA64 (I think), and mostly PowerPC, to 
the new reduc_(plus|[us](min|max))_scal_optab. I have not managed to work out 
how to do the same for MIPS (specifically what I need to add to 
mips_expand_vec_reduc), and have had no response from the maintainers, so am 
leaving that for now. Also I haven't migrated (or worked out how to target) 
rs6000/paired.md, help would be most welcome.


The suggestion was then to "complete" the migration, by removing the old optabs. 
There are a few options here and I'll follow up with appropriate patches 
according to feedback received. I see options:

(1) just delete the old optabs (and the migration code). This would 
performance-regress the MIPS backend, but should not break it, although one 
should really do *something* with the then-unused reduc_[us](plus|min|max)_optab 
in config/mips/loongson.md.

(2) also renaming reduc_..._scal_optab back to reduc_..._optab; would break the 
MIPS backend if something were not done with its existing patterns.

(2a) Alternatively I could just use a different new name, e.g. reduce_...., 
reduct_...., vec_reduc_..., anything that's less of a mouthful than 
reduc_..._scal. Whilst being only very slightly different from the current 
reduc_... might be confusing, so might changing the meaning of the optab, and 
its signature, while keeping the existing name; so I am open to suggestions.

Cheers, Alan

[-- Attachment #2: 1_aarch64_disable_gimple_fold.patch --]
[-- Type: text/x-patch; name=1_aarch64_disable_gimple_fold.patch, Size: 1592 bytes --]

commit 9819291c17610dcdcca19a3d9ea3a4260df0577e
Author: Alan Lawrence <alan.lawrence@arm.com>
Date:   Thu Aug 21 13:05:43 2014 +0100

    Temporarily remove gimple_fold

diff --git a/gcc/config/aarch64/aarch64-builtins.c b/gcc/config/aarch64/aarch64-builtins.c
index 3dba1b2..a49da89 100644
--- a/gcc/config/aarch64/aarch64-builtins.c
+++ b/gcc/config/aarch64/aarch64-builtins.c
@@ -1188,6 +1188,9 @@ aarch64_fold_builtin (tree fndecl, int n_args ATTRIBUTE_UNUSED, tree *args,
   return NULL_TREE;
 }
 
+/* Handling of reduction operations temporarily removed so as to decouple
+   changes to tree codes from AArch64 NEON Intrinsics.  */
+#if 0
 bool
 aarch64_gimple_fold_builtin (gimple_stmt_iterator *gsi)
 {
@@ -1259,6 +1262,7 @@ aarch64_gimple_fold_builtin (gimple_stmt_iterator *gsi)
 
   return changed;
 }
+#endif
 
 void
 aarch64_atomic_assign_expand_fenv (tree *hold, tree *clear, tree *update)
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index db5ff59..27d82f3 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -10015,8 +10015,8 @@ aarch64_asan_shadow_offset (void)
 #undef TARGET_FRAME_POINTER_REQUIRED
 #define TARGET_FRAME_POINTER_REQUIRED aarch64_frame_pointer_required
 
-#undef TARGET_GIMPLE_FOLD_BUILTIN
-#define TARGET_GIMPLE_FOLD_BUILTIN aarch64_gimple_fold_builtin
+//#undef TARGET_GIMPLE_FOLD_BUILTIN
+//#define TARGET_GIMPLE_FOLD_BUILTIN aarch64_gimple_fold_builtin
 
 #undef TARGET_GIMPLIFY_VA_ARG_EXPR
 #define TARGET_GIMPLIFY_VA_ARG_EXPR aarch64_gimplify_va_arg_expr

[-- Attachment #3: 2_make_reduc_tree_codes_scalar.patch --]
[-- Type: text/x-patch; name=2_make_reduc_tree_codes_scalar.patch, Size: 6705 bytes --]

commit bf6d5d32c552ce1c6ccd890f501db4f39291088f
Author: Alan Lawrence <alan.lawrence@arm.com>
Date:   Tue Jul 29 11:46:01 2014 +0100

    Make tree codes produce scalar, with NOP_EXPRs. (tree-vect-loop.c mess)
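
    (Illustrative, not from the commit: the tree codes change from
    vector-to-vector to vector-to-scalar, e.g. for V4SI

        /* before: result is a vector with the sum in one lane  */
        v4si_result = REDUC_PLUS_EXPR <v4si_operand>;
        /* after: result is the element type itself  */
        int_result  = REDUC_PLUS_EXPR <v4si_operand>;

    with a NOP_EXPR conversion added where the scalar type is not the
    vector element type.)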

diff --git a/gcc/expr.c b/gcc/expr.c
index a6233f3..c792028 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -9044,7 +9044,17 @@ expand_expr_real_2 (sepops ops, rtx target, enum machine_mode tmode,
       {
         op0 = expand_normal (treeop0);
         this_optab = optab_for_tree_code (code, type, optab_default);
-        temp = expand_unop (mode, this_optab, op0, target, unsignedp);
+        enum machine_mode vec_mode = TYPE_MODE (TREE_TYPE (treeop0));
+        temp = expand_unop (vec_mode, this_optab, op0, NULL_RTX, unsignedp);
+        gcc_assert (temp);
+        /* The tree code produces a scalar result, but (somewhat by convention)
+           the optab produces a vector with the result in element 0 if
+           little-endian, or element N-1 if big-endian.  So pull the scalar
+           result out of that element.  */
+        int index = BYTES_BIG_ENDIAN ? GET_MODE_NUNITS (vec_mode) - 1 : 0;
+        int bitsize = GET_MODE_BITSIZE (GET_MODE_INNER (vec_mode));
+        temp = extract_bit_field (temp, bitsize, bitsize * index, unsignedp,
+				  target, mode, mode);
         gcc_assert (temp);
         return temp;
       }
diff --git a/gcc/fold-const.c b/gcc/fold-const.c
index 9f1bc09..9bb86f9 100644
--- a/gcc/fold-const.c
+++ b/gcc/fold-const.c
@@ -8246,12 +8246,13 @@ fold_unary_loc (location_t loc, enum tree_code code, tree type, tree op0)
     case REDUC_MAX_EXPR:
     case REDUC_PLUS_EXPR:
       {
-	unsigned int nelts = TYPE_VECTOR_SUBPARTS (type), i;
+	unsigned int nelts, i;
 	tree *elts;
 	enum tree_code subcode;
 
 	if (TREE_CODE (op0) != VECTOR_CST)
 	  return NULL_TREE;
+        nelts = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0));
 
 	elts = XALLOCAVEC (tree, nelts);
 	if (!vec_cst_ctor_to_array (op0, elts))
@@ -8270,10 +8271,9 @@ fold_unary_loc (location_t loc, enum tree_code code, tree type, tree op0)
 	    elts[0] = const_binop (subcode, elts[0], elts[i]);
 	    if (elts[0] == NULL_TREE || !CONSTANT_CLASS_P (elts[0]))
 	      return NULL_TREE;
-	    elts[i] = build_zero_cst (TREE_TYPE (type));
 	  }
 
-	return build_vector (type, elts);
+	return elts[0];
       }
 
     default:
diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c
index cdab639..14fff81 100644
--- a/gcc/tree-cfg.c
+++ b/gcc/tree-cfg.c
@@ -3539,12 +3539,21 @@ verify_gimple_assign_unary (gimple stmt)
 
         return false;
       }
-
-    case VEC_UNPACK_HI_EXPR:
-    case VEC_UNPACK_LO_EXPR:
     case REDUC_MAX_EXPR:
     case REDUC_MIN_EXPR:
     case REDUC_PLUS_EXPR:
+      if (!VECTOR_TYPE_P (rhs1_type)
+	  || !useless_type_conversion_p (lhs_type, TREE_TYPE (rhs1_type)))
+        {
+	  error ("reduction should convert from vector to element type");
+	  debug_generic_expr (lhs_type);
+	  debug_generic_expr (rhs1_type);
+	  return true;
+	}
+      return false;
+
+    case VEC_UNPACK_HI_EXPR:
+    case VEC_UNPACK_LO_EXPR:
     case VEC_UNPACK_FLOAT_HI_EXPR:
     case VEC_UNPACK_FLOAT_LO_EXPR:
       /* FIXME.  */
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index fd1166f..8d97e17 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -1892,9 +1892,9 @@ vect_analyze_loop (struct loop *loop)
 
    Output:
    REDUC_CODE - the corresponding tree-code to be used to reduce the
-      vector of partial results into a single scalar result (which
-      will also reside in a vector) or ERROR_MARK if the operation is
-      a supported reduction operation, but does not have such tree-code.
+      vector of partial results into a single scalar result, or ERROR_MARK
+      if the operation is a supported reduction operation, but does not have
+      such a tree-code.
 
    Return FALSE if CODE currently cannot be vectorized as reduction.  */
 
@@ -4168,6 +4168,7 @@ vect_create_epilog_for_reduction (vec<tree> vect_defs, gimple stmt,
   if (reduc_code != ERROR_MARK && !slp_reduc)
     {
       tree tmp;
+      tree vec_elem_type;
 
       /*** Case 1:  Create:
            v_out2 = reduc_expr <v_out1>  */
@@ -4176,14 +4177,26 @@ vect_create_epilog_for_reduction (vec<tree> vect_defs, gimple stmt,
         dump_printf_loc (MSG_NOTE, vect_location,
 			 "Reduce using direct vector reduction.\n");
 
-      vec_dest = vect_create_destination_var (scalar_dest, vectype);
-      tmp = build1 (reduc_code, vectype, new_phi_result);
-      epilog_stmt = gimple_build_assign (vec_dest, tmp);
-      new_temp = make_ssa_name (vec_dest, epilog_stmt);
+      vec_elem_type = TREE_TYPE (TREE_TYPE (new_phi_result));
+      if (!useless_type_conversion_p (scalar_type, vec_elem_type))
+	{
+          tree tmp_dest =
+	      vect_create_destination_var (scalar_dest, vec_elem_type);
+	  tmp = build1 (reduc_code, vec_elem_type, new_phi_result);
+	  epilog_stmt = gimple_build_assign (tmp_dest, tmp);
+	  new_temp = make_ssa_name (tmp_dest, epilog_stmt);
+	  gimple_assign_set_lhs (epilog_stmt, new_temp);
+	  gsi_insert_before (&exit_gsi, epilog_stmt, GSI_SAME_STMT);
+
+	  tmp = build1 (NOP_EXPR, scalar_type, new_temp);
+	}
+      else
+	tmp = build1 (reduc_code, scalar_type, new_phi_result);
+      epilog_stmt = gimple_build_assign (new_scalar_dest, tmp);
+      new_temp = make_ssa_name (new_scalar_dest, epilog_stmt);
       gimple_assign_set_lhs (epilog_stmt, new_temp);
       gsi_insert_before (&exit_gsi, epilog_stmt, GSI_SAME_STMT);
-
-      extract_scalar_result = true;
+      scalar_results.safe_push (new_temp);
     }
   else
     {
diff --git a/gcc/tree.def b/gcc/tree.def
index bd39e4b..c830e4b 100644
--- a/gcc/tree.def
+++ b/gcc/tree.def
@@ -1161,10 +1161,9 @@ DEFTREECODE (TRANSACTION_EXPR, "transaction_expr", tcc_expression, 1)
    result (e.g. summing the elements of the vector, finding the minimum over
    the vector elements, etc).
    Operand 0 is a vector.
-   The expression returns a vector of the same type, with the first
-   element in the vector holding the result of the reduction of all elements
-   of the operand.  The content of the other elements in the returned vector
-   is undefined.  */
+   The expression returns a scalar, with type the same as the elements of the
+   vector, holding the result of the reduction of all elements of the operand.
+   */
 DEFTREECODE (REDUC_MAX_EXPR, "reduc_max_expr", tcc_unary, 1)
 DEFTREECODE (REDUC_MIN_EXPR, "reduc_min_expr", tcc_unary, 1)
 DEFTREECODE (REDUC_PLUS_EXPR, "reduc_plus_expr", tcc_unary, 1)

[-- Attachment #4: 3_add_new_optabs.patch --]
[-- Type: text/x-patch; name=3_add_new_optabs.patch, Size: 9500 bytes --]

commit ab17754b481aa163cbdda053c0cb7ce65e1aa8e9
Author: Alan Lawrence <alan.lawrence@arm.com>
Date:   Thu Sep 25 16:58:13 2014 +0100

    Add new optabs (2nd posted version: comment scalar_reduc_to_vector; revert sp)
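
    (Illustrative sketch of the control flow added below, using the
    function and optab names from the patch: the midend first tries the
    new scalar-producing optab, then falls back to the legacy
    vector-producing one and extracts the scalar from the
    endianness-dependent lane.

        optab ot = optab_for_tree_code (REDUC_PLUS_EXPR, type, optab_default);
        /* ot is now reduc_plus_scal_optab.  */
        if (optab_handler (ot, vec_mode) == CODE_FOR_nothing)
          /* Legacy fallback: reduc_splus_optab or reduc_uplus_optab.  */
          ot = scalar_reduc_to_vector (ot, type);
    )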

diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index ccebc70..e7347cc 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -4717,29 +4717,48 @@ it is unspecified which of the two operands is returned as the result.
 @cindex @code{reduc_smax_@var{m}} instruction pattern
 @item @samp{reduc_smin_@var{m}}, @samp{reduc_smax_@var{m}}
 Find the signed minimum/maximum of the elements of a vector. The vector is
-operand 1, and the scalar result is stored in the least significant bits of
+operand 1, and the result is stored in the least significant bits of
 operand 0 (also a vector). The output and input vector should have the same
-modes.
+modes. These are legacy optabs, and platforms should prefer to implement
+@samp{reduc_smin_scal_@var{m}} and @samp{reduc_smax_scal_@var{m}}.
 
 @cindex @code{reduc_umin_@var{m}} instruction pattern
 @cindex @code{reduc_umax_@var{m}} instruction pattern
 @item @samp{reduc_umin_@var{m}}, @samp{reduc_umax_@var{m}}
 Find the unsigned minimum/maximum of the elements of a vector. The vector is
-operand 1, and the scalar result is stored in the least significant bits of
+operand 1, and the result is stored in the least significant bits of
 operand 0 (also a vector). The output and input vector should have the same
-modes.
+modes. These are legacy optabs, and platforms should prefer to implement
+@samp{reduc_umin_scal_@var{m}} and @samp{reduc_umax_scal_@var{m}}.
 
 @cindex @code{reduc_splus_@var{m}} instruction pattern
-@item @samp{reduc_splus_@var{m}}
-Compute the sum of the signed elements of a vector. The vector is operand 1,
-and the scalar result is stored in the least significant bits of operand 0
-(also a vector). The output and input vector should have the same modes.
-
 @cindex @code{reduc_uplus_@var{m}} instruction pattern
-@item @samp{reduc_uplus_@var{m}}
-Compute the sum of the unsigned elements of a vector. The vector is operand 1,
-and the scalar result is stored in the least significant bits of operand 0
+@item @samp{reduc_splus_@var{m}}, @samp{reduc_uplus_@var{m}}
+Compute the sum of the signed/unsigned elements of a vector. The vector is
+operand 1, and the result is stored in the least significant bits of operand 0
 (also a vector). The output and input vector should have the same modes.
+These are legacy optabs, and platforms should prefer to implement
+@samp{reduc_plus_scal_@var{m}}.
+
+@cindex @code{reduc_smin_scal_@var{m}} instruction pattern
+@cindex @code{reduc_smax_scal_@var{m}} instruction pattern
+@item @samp{reduc_smin_scal_@var{m}}, @samp{reduc_smax_scal_@var{m}}
+Find the signed minimum/maximum of the elements of a vector. The vector is
+operand 1, and operand 0 is the scalar result, with mode equal to the mode of
+the elements of the input vector.
+
+@cindex @code{reduc_umin_scal_@var{m}} instruction pattern
+@cindex @code{reduc_umax_scal_@var{m}} instruction pattern
+@item @samp{reduc_umin_scal_@var{m}}, @samp{reduc_umax_scal_@var{m}}
+Find the unsigned minimum/maximum of the elements of a vector. The vector is
+operand 1, and operand 0 is the scalar result, with mode equal to the mode of
+the elements of the input vector.
+
+@cindex @code{reduc_plus_scal_@var{m}} instruction pattern
+@item @samp{reduc_plus_scal_@var{m}}
+Compute the sum of the elements of a vector. The vector is operand 1, and
+operand 0 is the scalar result, with mode equal to the mode of the elements of
+the input vector.
 
 @cindex @code{sdot_prod@var{m}} instruction pattern
 @item @samp{sdot_prod@var{m}}
diff --git a/gcc/expr.c b/gcc/expr.c
index c792028..3763614 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -9045,6 +9045,24 @@ expand_expr_real_2 (sepops ops, rtx target, enum machine_mode tmode,
         op0 = expand_normal (treeop0);
         this_optab = optab_for_tree_code (code, type, optab_default);
         enum machine_mode vec_mode = TYPE_MODE (TREE_TYPE (treeop0));
+
+	if (optab_handler (this_optab, vec_mode) != CODE_FOR_nothing)
+	  {
+	    struct expand_operand ops[2];
+	    enum insn_code icode = optab_handler (this_optab, vec_mode);
+
+	    create_output_operand (&ops[0], target, mode);
+	    create_input_operand (&ops[1], op0, vec_mode);
+	    if (maybe_expand_insn (icode, 2, ops))
+	      {
+		target = ops[0].value;
+		if (GET_MODE (target) != mode)
+		  return gen_lowpart (tmode, target);
+		return target;
+	      }
+	  }
+	/* Fall back to optab with vector result, and then extract scalar.  */
+	this_optab = scalar_reduc_to_vector (this_optab, type);
         temp = expand_unop (vec_mode, this_optab, op0, NULL_RTX, unsignedp);
         gcc_assert (temp);
         /* The tree code produces a scalar result, but (somewhat by convention)
diff --git a/gcc/optabs.c b/gcc/optabs.c
index d55a6bb..f0547e5 100644
--- a/gcc/optabs.c
+++ b/gcc/optabs.c
@@ -500,13 +500,15 @@ optab_for_tree_code (enum tree_code code, const_tree type,
       return fma_optab;
 
     case REDUC_MAX_EXPR:
-      return TYPE_UNSIGNED (type) ? reduc_umax_optab : reduc_smax_optab;
+      return TYPE_UNSIGNED (type)
+	     ? reduc_umax_scal_optab : reduc_smax_scal_optab;
 
     case REDUC_MIN_EXPR:
-      return TYPE_UNSIGNED (type) ? reduc_umin_optab : reduc_smin_optab;
+      return TYPE_UNSIGNED (type)
+	     ? reduc_umin_scal_optab : reduc_smin_scal_optab;
 
     case REDUC_PLUS_EXPR:
-      return TYPE_UNSIGNED (type) ? reduc_uplus_optab : reduc_splus_optab;
+      return reduc_plus_scal_optab;
 
     case VEC_LSHIFT_EXPR:
       return vec_shl_optab;
@@ -602,7 +604,26 @@ optab_for_tree_code (enum tree_code code, const_tree type,
       return unknown_optab;
     }
 }
-\f
+
+/* Given optab UNOPTAB that reduces a vector to a scalar, find instead the old
+   optab that produces a vector with the reduction result in one element,
+   for a tree with type TYPE.  */
+
+optab
+scalar_reduc_to_vector (optab unoptab, const_tree type)
+{
+  switch (unoptab)
+    {
+    case reduc_plus_scal_optab:
+      return TYPE_UNSIGNED (type) ? reduc_uplus_optab : reduc_splus_optab;
+
+    case reduc_smin_scal_optab: return reduc_smin_optab;
+    case reduc_umin_scal_optab: return reduc_umin_optab;
+    case reduc_smax_scal_optab: return reduc_smax_optab;
+    case reduc_umax_scal_optab: return reduc_umax_optab;
+    default: return unknown_optab;
+    }
+}
 
 /* Expand vector widening operations.
 
diff --git a/gcc/optabs.def b/gcc/optabs.def
index b755470..131ea04 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -243,12 +243,20 @@ OPTAB_D (sin_optab, "sin$a2")
 OPTAB_D (sincos_optab, "sincos$a3")
 OPTAB_D (tan_optab, "tan$a2")
 
+/* Vector reduction to a scalar.  */
+OPTAB_D (reduc_smax_scal_optab, "reduc_smax_scal_$a")
+OPTAB_D (reduc_smin_scal_optab, "reduc_smin_scal_$a")
+OPTAB_D (reduc_plus_scal_optab, "reduc_plus_scal_$a")
+OPTAB_D (reduc_umax_scal_optab, "reduc_umax_scal_$a")
+OPTAB_D (reduc_umin_scal_optab, "reduc_umin_scal_$a")
+/* (Old) Vector reduction, returning a vector with the result in one lane.  */
 OPTAB_D (reduc_smax_optab, "reduc_smax_$a")
 OPTAB_D (reduc_smin_optab, "reduc_smin_$a")
 OPTAB_D (reduc_splus_optab, "reduc_splus_$a")
 OPTAB_D (reduc_umax_optab, "reduc_umax_$a")
 OPTAB_D (reduc_umin_optab, "reduc_umin_$a")
 OPTAB_D (reduc_uplus_optab, "reduc_uplus_$a")
+
 OPTAB_D (sdot_prod_optab, "sdot_prod$I$a")
 OPTAB_D (ssum_widen_optab, "widen_ssum$I$a3")
 OPTAB_D (udot_prod_optab, "udot_prod$I$a")
diff --git a/gcc/optabs.h b/gcc/optabs.h
index 089b15a..1f8e44d 100644
--- a/gcc/optabs.h
+++ b/gcc/optabs.h
@@ -162,6 +162,11 @@ enum optab_subtype
    vector shifts and rotates */
 extern optab optab_for_tree_code (enum tree_code, const_tree, enum optab_subtype);
 
+/* Given an optab that reduces a vector to a scalar, find instead the old
+   optab that produces a vector with the reduction result in one element,
+   for a tree with the specified type.  */
+extern optab scalar_reduc_to_vector (optab, const_tree type);
+
 /* The various uses that a comparison can have; used by can_compare_p:
    jumps, conditional moves, store flag operations.  */
 enum can_compare_purpose
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 8d97e17..e8b108d 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -5102,15 +5102,17 @@ vectorizable_reduction (gimple stmt, gimple_stmt_iterator *gsi,
 
           epilog_reduc_code = ERROR_MARK;
         }
-
-      if (reduc_optab
-          && optab_handler (reduc_optab, vec_mode) == CODE_FOR_nothing)
+      else if (optab_handler (reduc_optab, vec_mode) == CODE_FOR_nothing)
         {
-          if (dump_enabled_p ())
-	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			     "reduc op not supported by target.\n");
+          optab = scalar_reduc_to_vector (reduc_optab, vectype_out);
+          if (optab_handler (optab, vec_mode) == CODE_FOR_nothing)
+            {
+              if (dump_enabled_p ())
+	        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+				 "reduc op not supported by target.\n");
 
-          epilog_reduc_code = ERROR_MARK;
+	      epilog_reduc_code = ERROR_MARK;
+	    }
         }
     }
   else

[-- Attachment #5: 4_aarch64_reduc_plus.patch --]
[-- Type: text/x-patch; name=4_aarch64_reduc_plus.patch, Size: 10470 bytes --]

commit 0a87420deca7f94d4e4f637ab89606b8d14a7775
Author: Alan Lawrence <alan.lawrence@arm.com>
Date:   Wed Aug 13 17:23:31 2014 +0100

    AArch64: use new reduc_plus_scal_optab, inc. for __builtins; remove old.
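
    (ENDIAN_LANE_N, used throughout the expanders below, converts an
    architectural lane number into GCC's lane numbering; a sketch of its
    intent, assuming the usual AArch64 definition:

        #define ENDIAN_LANE_N(mode, n) \
          (BYTES_BIG_ENDIAN ? GET_MODE_NUNITS (mode) - 1 - (n) : (n))

    so requesting lane 0 always fetches the lane the reduction
    instruction architecturally writes, on either endianness.)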

diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def b/gcc/config/aarch64/aarch64-simd-builtins.def
index 2367436..7e77b2e 100644
--- a/gcc/config/aarch64/aarch64-simd-builtins.def
+++ b/gcc/config/aarch64/aarch64-simd-builtins.def
@@ -214,9 +214,8 @@
   BUILTIN_VSDQ_I (SHIFTIMM, sqshl_n, 0)
   BUILTIN_VSDQ_I (USHIFTIMM, uqshl_n, 0)
 
-  /* Implemented by reduc_<sur>plus_<mode>.  */
-  BUILTIN_VALL (UNOP, reduc_splus_, 10)
-  BUILTIN_VDQ (UNOP, reduc_uplus_, 10)
+  /* Implemented by aarch64_reduc_plus_<mode>.  */
+  BUILTIN_VALL (UNOP, reduc_plus_scal_, 10)
 
   /* Implemented by reduc_<maxmin_uns>_<mode>.  */
   BUILTIN_VDQIF (UNOP, reduc_smax_, 10)
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index cab26a3..a5b9f3d 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1760,25 +1760,52 @@
 
 ;; 'across lanes' add.
 
-(define_insn "reduc_<sur>plus_<mode>"
+(define_expand "reduc_plus_scal_<mode>"
+  [(match_operand:<VEL> 0 "register_operand" "=w")
+   (unspec:VDQ [(match_operand:VDQ 1 "register_operand" "w")]
+	       UNSPEC_ADDV)]
+  "TARGET_SIMD"
+  {
+    rtx elt = GEN_INT (ENDIAN_LANE_N (<MODE>mode, 0));
+    rtx scratch = gen_reg_rtx (<MODE>mode);
+    emit_insn (gen_aarch64_reduc_plus_internal<mode> (scratch, operands[1]));
+    emit_insn (gen_aarch64_get_lane<mode> (operands[0], scratch, elt));
+    DONE;
+  }
+)
+
+(define_expand "reduc_plus_scal_<mode>"
+  [(match_operand:<VEL> 0 "register_operand" "=w")
+   (match_operand:V2F 1 "register_operand" "w")]
+  "TARGET_SIMD"
+  {
+    rtx elt = GEN_INT (ENDIAN_LANE_N (<MODE>mode, 0));
+    rtx scratch = gen_reg_rtx (<MODE>mode);
+    emit_insn (gen_aarch64_reduc_plus_internal<mode> (scratch, operands[1]));
+    emit_insn (gen_aarch64_get_lane<mode> (operands[0], scratch, elt));
+    DONE;
+  }
+)
+
+(define_insn "aarch64_reduc_plus_internal<mode>"
  [(set (match_operand:VDQV 0 "register_operand" "=w")
        (unspec:VDQV [(match_operand:VDQV 1 "register_operand" "w")]
-		    SUADDV))]
+		    UNSPEC_ADDV))]
  "TARGET_SIMD"
  "add<VDQV:vp>\\t%<Vetype>0, %1.<Vtype>"
   [(set_attr "type" "neon_reduc_add<q>")]
 )
 
-(define_insn "reduc_<sur>plus_v2si"
+(define_insn "aarch64_reduc_plus_internalv2si"
  [(set (match_operand:V2SI 0 "register_operand" "=w")
        (unspec:V2SI [(match_operand:V2SI 1 "register_operand" "w")]
-		    SUADDV))]
+		    UNSPEC_ADDV))]
  "TARGET_SIMD"
  "addp\\t%0.2s, %1.2s, %1.2s"
   [(set_attr "type" "neon_reduc_add")]
 )
 
-(define_insn "reduc_splus_<mode>"
+(define_insn "aarch64_reduc_plus_internal<mode>"
  [(set (match_operand:V2F 0 "register_operand" "=w")
        (unspec:V2F [(match_operand:V2F 1 "register_operand" "w")]
 		   UNSPEC_FADDV))]
@@ -1796,14 +1823,17 @@
   [(set_attr "type" "neon_fp_reduc_add_s_q")]
 )
 
-(define_expand "reduc_splus_v4sf"
- [(set (match_operand:V4SF 0 "register_operand")
+(define_expand "reduc_plus_scal_v4sf"
+ [(set (match_operand:SF 0 "register_operand")
        (unspec:V4SF [(match_operand:V4SF 1 "register_operand")]
 		    UNSPEC_FADDV))]
  "TARGET_SIMD"
 {
-  emit_insn (gen_aarch64_addpv4sf (operands[0], operands[1]));
-  emit_insn (gen_aarch64_addpv4sf (operands[0], operands[0]));
+  rtx elt = GEN_INT (ENDIAN_LANE_N (V4SFmode, 0));
+  rtx scratch = gen_reg_rtx (V4SFmode);
+  emit_insn (gen_aarch64_addpv4sf (scratch, operands[1]));
+  emit_insn (gen_aarch64_addpv4sf (scratch, scratch));
+  emit_insn (gen_aarch64_get_lanev4sf (operands[0], scratch, elt));
   DONE;
 })
 
diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index 9b1873f..d17e7fe 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -13209,121 +13209,103 @@ vaddd_u64 (uint64_t __a, uint64_t __b)
 __extension__ static __inline int8_t __attribute__ ((__always_inline__))
 vaddv_s8 (int8x8_t __a)
 {
-  return vget_lane_s8 (__builtin_aarch64_reduc_splus_v8qi (__a), 0);
+  return __builtin_aarch64_reduc_plus_scal_v8qi (__a);
 }
 
 __extension__ static __inline int16_t __attribute__ ((__always_inline__))
 vaddv_s16 (int16x4_t __a)
 {
-  return vget_lane_s16 (__builtin_aarch64_reduc_splus_v4hi (__a), 0);
+  return __builtin_aarch64_reduc_plus_scal_v4hi (__a);
 }
 
 __extension__ static __inline int32_t __attribute__ ((__always_inline__))
 vaddv_s32 (int32x2_t __a)
 {
-  return vget_lane_s32 (__builtin_aarch64_reduc_splus_v2si (__a), 0);
+  return __builtin_aarch64_reduc_plus_scal_v2si (__a);
 }
 
 __extension__ static __inline uint8_t __attribute__ ((__always_inline__))
 vaddv_u8 (uint8x8_t __a)
 {
-  return vget_lane_u8 ((uint8x8_t)
-		__builtin_aarch64_reduc_uplus_v8qi ((int8x8_t) __a),
-		0);
+  return (uint8_t) __builtin_aarch64_reduc_plus_scal_v8qi ((int8x8_t) __a);
 }
 
 __extension__ static __inline uint16_t __attribute__ ((__always_inline__))
 vaddv_u16 (uint16x4_t __a)
 {
-  return vget_lane_u16 ((uint16x4_t)
-		__builtin_aarch64_reduc_uplus_v4hi ((int16x4_t) __a),
-		0);
+  return (uint16_t) __builtin_aarch64_reduc_plus_scal_v4hi ((int16x4_t) __a);
 }
 
 __extension__ static __inline uint32_t __attribute__ ((__always_inline__))
 vaddv_u32 (uint32x2_t __a)
 {
-  return vget_lane_u32 ((uint32x2_t)
-		__builtin_aarch64_reduc_uplus_v2si ((int32x2_t) __a),
-		0);
+  return (int32_t) __builtin_aarch64_reduc_plus_scal_v2si ((int32x2_t) __a);
 }
 
 __extension__ static __inline int8_t __attribute__ ((__always_inline__))
 vaddvq_s8 (int8x16_t __a)
 {
-  return vgetq_lane_s8 (__builtin_aarch64_reduc_splus_v16qi (__a),
-			0);
+  return __builtin_aarch64_reduc_plus_scal_v16qi (__a);
 }
 
 __extension__ static __inline int16_t __attribute__ ((__always_inline__))
 vaddvq_s16 (int16x8_t __a)
 {
-  return vgetq_lane_s16 (__builtin_aarch64_reduc_splus_v8hi (__a), 0);
+  return __builtin_aarch64_reduc_plus_scal_v8hi (__a);
 }
 
 __extension__ static __inline int32_t __attribute__ ((__always_inline__))
 vaddvq_s32 (int32x4_t __a)
 {
-  return vgetq_lane_s32 (__builtin_aarch64_reduc_splus_v4si (__a), 0);
+  return __builtin_aarch64_reduc_plus_scal_v4si (__a);
 }
 
 __extension__ static __inline int64_t __attribute__ ((__always_inline__))
 vaddvq_s64 (int64x2_t __a)
 {
-  return vgetq_lane_s64 (__builtin_aarch64_reduc_splus_v2di (__a), 0);
+  return __builtin_aarch64_reduc_plus_scal_v2di (__a);
 }
 
 __extension__ static __inline uint8_t __attribute__ ((__always_inline__))
 vaddvq_u8 (uint8x16_t __a)
 {
-  return vgetq_lane_u8 ((uint8x16_t)
-		__builtin_aarch64_reduc_uplus_v16qi ((int8x16_t) __a),
-		0);
+  return (uint8_t) __builtin_aarch64_reduc_plus_scal_v16qi ((int8x16_t) __a);
 }
 
 __extension__ static __inline uint16_t __attribute__ ((__always_inline__))
 vaddvq_u16 (uint16x8_t __a)
 {
-  return vgetq_lane_u16 ((uint16x8_t)
-		__builtin_aarch64_reduc_uplus_v8hi ((int16x8_t) __a),
-		0);
+  return (uint16_t) __builtin_aarch64_reduc_plus_scal_v8hi ((int16x8_t) __a);
 }
 
 __extension__ static __inline uint32_t __attribute__ ((__always_inline__))
 vaddvq_u32 (uint32x4_t __a)
 {
-  return vgetq_lane_u32 ((uint32x4_t)
-		__builtin_aarch64_reduc_uplus_v4si ((int32x4_t) __a),
-		0);
+  return (uint32_t) __builtin_aarch64_reduc_plus_scal_v4si ((int32x4_t) __a);
 }
 
 __extension__ static __inline uint64_t __attribute__ ((__always_inline__))
 vaddvq_u64 (uint64x2_t __a)
 {
-  return vgetq_lane_u64 ((uint64x2_t)
-		__builtin_aarch64_reduc_uplus_v2di ((int64x2_t) __a),
-		0);
+  return (uint64_t) __builtin_aarch64_reduc_plus_scal_v2di ((int64x2_t) __a);
 }
 
 __extension__ static __inline float32_t __attribute__ ((__always_inline__))
 vaddv_f32 (float32x2_t __a)
 {
-  float32x2_t __t = __builtin_aarch64_reduc_splus_v2sf (__a);
-  return vget_lane_f32 (__t, 0);
+  return __builtin_aarch64_reduc_plus_scal_v2sf (__a);
 }
 
 __extension__ static __inline float32_t __attribute__ ((__always_inline__))
 vaddvq_f32 (float32x4_t __a)
 {
-  float32x4_t __t = __builtin_aarch64_reduc_splus_v4sf (__a);
-  return vgetq_lane_f32 (__t, 0);
+  return __builtin_aarch64_reduc_plus_scal_v4sf (__a);
 }
 
 __extension__ static __inline float64_t __attribute__ ((__always_inline__))
 vaddvq_f64 (float64x2_t __a)
 {
-  float64x2_t __t = __builtin_aarch64_reduc_splus_v2df (__a);
-  return vgetq_lane_f64 (__t, 0);
+  return __builtin_aarch64_reduc_plus_scal_v2df (__a);
 }
 
 /* vbsl  */
@@ -18875,7 +18857,7 @@ vpadd_u32 (uint32x2_t __a, uint32x2_t __b)
 __extension__ static __inline float64_t __attribute__ ((__always_inline__))
 vpaddd_f64 (float64x2_t __a)
 {
-  return vgetq_lane_f64 (__builtin_aarch64_reduc_splus_v2df (__a), 0);
+  return __builtin_aarch64_reduc_plus_scal_v2df (__a);
 }
 
 __extension__ static __inline int64_t __attribute__ ((__always_inline__))
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index efd006f..74c71fc 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -213,8 +213,7 @@
     UNSPEC_FMINNMV	; Used in aarch64-simd.md.
     UNSPEC_FMINV	; Used in aarch64-simd.md.
     UNSPEC_FADDV	; Used in aarch64-simd.md.
-    UNSPEC_SADDV	; Used in aarch64-simd.md.
-    UNSPEC_UADDV	; Used in aarch64-simd.md.
+    UNSPEC_ADDV		; Used in aarch64-simd.md.
     UNSPEC_SMAXV	; Used in aarch64-simd.md.
     UNSPEC_SMINV	; Used in aarch64-simd.md.
     UNSPEC_UMAXV	; Used in aarch64-simd.md.
@@ -859,8 +858,6 @@
 (define_int_iterator FMAXMINV [UNSPEC_FMAXV UNSPEC_FMINV
 			       UNSPEC_FMAXNMV UNSPEC_FMINNMV])
 
-(define_int_iterator SUADDV [UNSPEC_SADDV UNSPEC_UADDV])
-
 (define_int_iterator HADDSUB [UNSPEC_SHADD UNSPEC_UHADD
 			      UNSPEC_SRHADD UNSPEC_URHADD
 			      UNSPEC_SHSUB UNSPEC_UHSUB
@@ -965,7 +962,6 @@
 		      (UNSPEC_SUBHN2 "") (UNSPEC_RSUBHN2 "r")
 		      (UNSPEC_SQXTN "s") (UNSPEC_UQXTN "u")
 		      (UNSPEC_USQADD "us") (UNSPEC_SUQADD "su")
-		      (UNSPEC_SADDV "s") (UNSPEC_UADDV "u")
 		      (UNSPEC_SSLI  "s") (UNSPEC_USLI  "u")
 		      (UNSPEC_SSRI  "s") (UNSPEC_USRI  "u")
 		      (UNSPEC_USRA  "u") (UNSPEC_SSRA  "s")

[-- Attachment #6: 5_aarch64_reduc_minmax.patch --]
[-- Type: text/x-patch; name=5_aarch64_reduc_minmax.patch, Size: 14448 bytes --]

commit c15a1d55498eade13cef550b7635e5791bb48d79
Author: Alan Lawrence <alan.lawrence@arm.com>
Date:   Wed Aug 13 17:23:58 2014 +0100

    AArch64: reduc_[us](min|max)_scal_optab, inc _nan variant and __builtins.
    
    Also combine V2F and V4SF variants.
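
    (A note on the builtin names below: the UNOPU qualifier gives the
    unsigned builtins an unsigned-to-unsigned signature -- hence the _uu
    suffix on the generated names -- so arm_neon.h can drop its casts
    through the signed types, e.g.

        return __builtin_aarch64_reduc_umax_scal_v8qi_uu (__a);

    rather than casting __a to int8x8_t and the result back to uint8_t.)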

diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def b/gcc/config/aarch64/aarch64-simd-builtins.def
index 7e77b2e..4bedb4a 100644
--- a/gcc/config/aarch64/aarch64-simd-builtins.def
+++ b/gcc/config/aarch64/aarch64-simd-builtins.def
@@ -217,13 +217,13 @@
   /* Implemented by aarch64_reduc_plus_<mode>.  */
   BUILTIN_VALL (UNOP, reduc_plus_scal_, 10)
 
-  /* Implemented by reduc_<maxmin_uns>_<mode>.  */
-  BUILTIN_VDQIF (UNOP, reduc_smax_, 10)
-  BUILTIN_VDQIF (UNOP, reduc_smin_, 10)
-  BUILTIN_VDQ_BHSI (UNOP, reduc_umax_, 10)
-  BUILTIN_VDQ_BHSI (UNOP, reduc_umin_, 10)
-  BUILTIN_VDQF (UNOP, reduc_smax_nan_, 10)
-  BUILTIN_VDQF (UNOP, reduc_smin_nan_, 10)
+  /* Implemented by reduc_<maxmin_uns>_scal_<mode> (producing scalar).  */
+  BUILTIN_VDQIF (UNOP, reduc_smax_scal_, 10)
+  BUILTIN_VDQIF (UNOP, reduc_smin_scal_, 10)
+  BUILTIN_VDQ_BHSI (UNOPU, reduc_umax_scal_, 10)
+  BUILTIN_VDQ_BHSI (UNOPU, reduc_umin_scal_, 10)
+  BUILTIN_VDQF (UNOP, reduc_smax_nan_scal_, 10)
+  BUILTIN_VDQF (UNOP, reduc_smin_nan_scal_, 10)
 
   /* Implemented by <maxmin><mode>3.
      smax variants map to fmaxnm,
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index a5b9f3d..c17e1ee 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1847,7 +1847,40 @@
 
 ;; 'across lanes' max and min ops.
 
-(define_insn "reduc_<maxmin_uns>_<mode>"
+;; Template for outputting a scalar, so we can create __builtins which can be
+;; gimple_fold'd to the REDUC_(MAX|MIN)_EXPR tree code.  (This is FP smax/smin).
+(define_expand "reduc_<maxmin_uns>_scal_<mode>"
+  [(match_operand:<VEL> 0 "register_operand")
+   (unspec:VDQF [(match_operand:VDQF 1 "register_operand")]
+		FMAXMINV)]
+  "TARGET_SIMD"
+  {
+    rtx elt = GEN_INT (ENDIAN_LANE_N (<MODE>mode, 0));
+    rtx scratch = gen_reg_rtx (<MODE>mode);
+    emit_insn (gen_aarch64_reduc_<maxmin_uns>_internal<mode> (scratch,
+							      operands[1]));
+    emit_insn (gen_aarch64_get_lane<mode> (operands[0], scratch, elt));
+    DONE;
+  }
+)
+
+;; Likewise for integer cases, signed and unsigned.
+(define_expand "reduc_<maxmin_uns>_scal_<mode>"
+  [(match_operand:<VEL> 0 "register_operand")
+   (unspec:VDQ_BHSI [(match_operand:VDQ_BHSI 1 "register_operand")]
+		    MAXMINV)]
+  "TARGET_SIMD"
+  {
+    rtx elt = GEN_INT (ENDIAN_LANE_N (<MODE>mode, 0));
+    rtx scratch = gen_reg_rtx (<MODE>mode);
+    emit_insn (gen_aarch64_reduc_<maxmin_uns>_internal<mode> (scratch,
+							      operands[1]));
+    emit_insn (gen_aarch64_get_lane<mode> (operands[0], scratch, elt));
+    DONE;
+  }
+)
+
+(define_insn "aarch64_reduc_<maxmin_uns>_internal<mode>"
  [(set (match_operand:VDQV_S 0 "register_operand" "=w")
        (unspec:VDQV_S [(match_operand:VDQV_S 1 "register_operand" "w")]
 		    MAXMINV))]
@@ -1856,7 +1889,7 @@
   [(set_attr "type" "neon_reduc_minmax<q>")]
 )
 
-(define_insn "reduc_<maxmin_uns>_v2si"
+(define_insn "aarch64_reduc_<maxmin_uns>_internalv2si"
  [(set (match_operand:V2SI 0 "register_operand" "=w")
        (unspec:V2SI [(match_operand:V2SI 1 "register_operand" "w")]
 		    MAXMINV))]
@@ -1865,24 +1898,15 @@
   [(set_attr "type" "neon_reduc_minmax")]
 )
 
-(define_insn "reduc_<maxmin_uns>_<mode>"
- [(set (match_operand:V2F 0 "register_operand" "=w")
-       (unspec:V2F [(match_operand:V2F 1 "register_operand" "w")]
+(define_insn "aarch64_reduc_<maxmin_uns>_internal<mode>"
+ [(set (match_operand:VDQF 0 "register_operand" "=w")
+       (unspec:VDQF [(match_operand:VDQF 1 "register_operand" "w")]
 		    FMAXMINV))]
  "TARGET_SIMD"
- "<maxmin_uns_op>p\\t%<Vetype>0, %1.<Vtype>"
+ "<maxmin_uns_op><vp>\\t%<Vetype>0, %1.<Vtype>"
   [(set_attr "type" "neon_fp_reduc_minmax_<Vetype><q>")]
 )
 
-(define_insn "reduc_<maxmin_uns>_v4sf"
- [(set (match_operand:V4SF 0 "register_operand" "=w")
-       (unspec:V4SF [(match_operand:V4SF 1 "register_operand" "w")]
-		    FMAXMINV))]
- "TARGET_SIMD"
- "<maxmin_uns_op>v\\t%s0, %1.4s"
-  [(set_attr "type" "neon_fp_reduc_minmax_s_q")]
-)
-
 ;; aarch64_simd_bsl may compile to any of bsl/bif/bit depending on register
 ;; allocation.
 ;; Operand 1 is the mask, operands 2 and 3 are the bitfields from which
diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index d17e7fe..9b0ff30 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -17688,106 +17688,91 @@ vmaxnmq_f64 (float64x2_t __a, float64x2_t __b)
 __extension__ static __inline float32_t __attribute__ ((__always_inline__))
 vmaxv_f32 (float32x2_t __a)
 {
-  return vget_lane_f32 (__builtin_aarch64_reduc_smax_nan_v2sf (__a),
-			0);
+  return __builtin_aarch64_reduc_smax_nan_scal_v2sf (__a);
 }
 
 __extension__ static __inline int8_t __attribute__ ((__always_inline__))
 vmaxv_s8 (int8x8_t __a)
 {
-  return vget_lane_s8 (__builtin_aarch64_reduc_smax_v8qi (__a), 0);
+  return __builtin_aarch64_reduc_smax_scal_v8qi (__a);
 }
 
 __extension__ static __inline int16_t __attribute__ ((__always_inline__))
 vmaxv_s16 (int16x4_t __a)
 {
-  return vget_lane_s16 (__builtin_aarch64_reduc_smax_v4hi (__a), 0);
+  return __builtin_aarch64_reduc_smax_scal_v4hi (__a);
 }
 
 __extension__ static __inline int32_t __attribute__ ((__always_inline__))
 vmaxv_s32 (int32x2_t __a)
 {
-  return vget_lane_s32 (__builtin_aarch64_reduc_smax_v2si (__a), 0);
+  return __builtin_aarch64_reduc_smax_scal_v2si (__a);
 }
 
 __extension__ static __inline uint8_t __attribute__ ((__always_inline__))
 vmaxv_u8 (uint8x8_t __a)
 {
-  return vget_lane_u8 ((uint8x8_t)
-		__builtin_aarch64_reduc_umax_v8qi ((int8x8_t) __a),
-		0);
+  return __builtin_aarch64_reduc_umax_scal_v8qi_uu (__a);
 }
 
 __extension__ static __inline uint16_t __attribute__ ((__always_inline__))
 vmaxv_u16 (uint16x4_t __a)
 {
-  return vget_lane_u16 ((uint16x4_t)
-		__builtin_aarch64_reduc_umax_v4hi ((int16x4_t) __a),
-		0);
+  return __builtin_aarch64_reduc_umax_scal_v4hi_uu (__a);
 }
 
 __extension__ static __inline uint32_t __attribute__ ((__always_inline__))
 vmaxv_u32 (uint32x2_t __a)
 {
-  return vget_lane_u32 ((uint32x2_t)
-		__builtin_aarch64_reduc_umax_v2si ((int32x2_t) __a),
-		0);
+  return __builtin_aarch64_reduc_umax_scal_v2si_uu (__a);
 }
 
 __extension__ static __inline float32_t __attribute__ ((__always_inline__))
 vmaxvq_f32 (float32x4_t __a)
 {
-  return vgetq_lane_f32 (__builtin_aarch64_reduc_smax_nan_v4sf (__a),
-			 0);
+  return __builtin_aarch64_reduc_smax_nan_scal_v4sf (__a);
 }
 
 __extension__ static __inline float64_t __attribute__ ((__always_inline__))
 vmaxvq_f64 (float64x2_t __a)
 {
-  return vgetq_lane_f64 (__builtin_aarch64_reduc_smax_nan_v2df (__a),
-			 0);
+  return __builtin_aarch64_reduc_smax_nan_scal_v2df (__a);
 }
 
 __extension__ static __inline int8_t __attribute__ ((__always_inline__))
 vmaxvq_s8 (int8x16_t __a)
 {
-  return vgetq_lane_s8 (__builtin_aarch64_reduc_smax_v16qi (__a), 0);
+  return __builtin_aarch64_reduc_smax_scal_v16qi (__a);
 }
 
 __extension__ static __inline int16_t __attribute__ ((__always_inline__))
 vmaxvq_s16 (int16x8_t __a)
 {
-  return vgetq_lane_s16 (__builtin_aarch64_reduc_smax_v8hi (__a), 0);
+  return __builtin_aarch64_reduc_smax_scal_v8hi (__a);
 }
 
 __extension__ static __inline int32_t __attribute__ ((__always_inline__))
 vmaxvq_s32 (int32x4_t __a)
 {
-  return vgetq_lane_s32 (__builtin_aarch64_reduc_smax_v4si (__a), 0);
+  return __builtin_aarch64_reduc_smax_scal_v4si (__a);
 }
 
 __extension__ static __inline uint8_t __attribute__ ((__always_inline__))
 vmaxvq_u8 (uint8x16_t __a)
 {
-  return vgetq_lane_u8 ((uint8x16_t)
-		__builtin_aarch64_reduc_umax_v16qi ((int8x16_t) __a),
-		0);
+  return __builtin_aarch64_reduc_umax_scal_v16qi_uu (__a);
 }
 
 __extension__ static __inline uint16_t __attribute__ ((__always_inline__))
 vmaxvq_u16 (uint16x8_t __a)
 {
-  return vgetq_lane_u16 ((uint16x8_t)
-		__builtin_aarch64_reduc_umax_v8hi ((int16x8_t) __a),
-		0);
+  return __builtin_aarch64_reduc_umax_scal_v8hi_uu (__a);
 }
 
 __extension__ static __inline uint32_t __attribute__ ((__always_inline__))
 vmaxvq_u32 (uint32x4_t __a)
 {
-  return vgetq_lane_u32 ((uint32x4_t)
-		__builtin_aarch64_reduc_umax_v4si ((int32x4_t) __a),
-		0);
+  return __builtin_aarch64_reduc_umax_scal_v4si_uu (__a);
 }
 
 /* vmaxnmv  */
@@ -17795,20 +17780,19 @@ vmaxvq_u32 (uint32x4_t __a)
 __extension__ static __inline float32_t __attribute__ ((__always_inline__))
 vmaxnmv_f32 (float32x2_t __a)
 {
-  return vget_lane_f32 (__builtin_aarch64_reduc_smax_v2sf (__a),
-			0);
+  return __builtin_aarch64_reduc_smax_scal_v2sf (__a);
 }
 
 __extension__ static __inline float32_t __attribute__ ((__always_inline__))
 vmaxnmvq_f32 (float32x4_t __a)
 {
-  return vgetq_lane_f32 (__builtin_aarch64_reduc_smax_v4sf (__a), 0);
+  return __builtin_aarch64_reduc_smax_scal_v4sf (__a);
 }
 
 __extension__ static __inline float64_t __attribute__ ((__always_inline__))
 vmaxnmvq_f64 (float64x2_t __a)
 {
-  return vgetq_lane_f64 (__builtin_aarch64_reduc_smax_v2df (__a), 0);
+  return __builtin_aarch64_reduc_smax_scal_v2df (__a);
 }
 
 /* vmin  */
@@ -17934,107 +17918,91 @@ vminnmq_f64 (float64x2_t __a, float64x2_t __b)
 __extension__ static __inline float32_t __attribute__ ((__always_inline__))
 vminv_f32 (float32x2_t __a)
 {
-  return vget_lane_f32 (__builtin_aarch64_reduc_smin_nan_v2sf (__a),
-			0);
+  return __builtin_aarch64_reduc_smin_nan_scal_v2sf (__a);
 }
 
 __extension__ static __inline int8_t __attribute__ ((__always_inline__))
 vminv_s8 (int8x8_t __a)
 {
-  return vget_lane_s8 (__builtin_aarch64_reduc_smin_v8qi (__a),
-		       0);
+  return __builtin_aarch64_reduc_smin_scal_v8qi (__a);
 }
 
 __extension__ static __inline int16_t __attribute__ ((__always_inline__))
 vminv_s16 (int16x4_t __a)
 {
-  return vget_lane_s16 (__builtin_aarch64_reduc_smin_v4hi (__a), 0);
+  return __builtin_aarch64_reduc_smin_scal_v4hi (__a);
 }
 
 __extension__ static __inline int32_t __attribute__ ((__always_inline__))
 vminv_s32 (int32x2_t __a)
 {
-  return vget_lane_s32 (__builtin_aarch64_reduc_smin_v2si (__a), 0);
+  return __builtin_aarch64_reduc_smin_scal_v2si (__a);
 }
 
 __extension__ static __inline uint8_t __attribute__ ((__always_inline__))
 vminv_u8 (uint8x8_t __a)
 {
-  return vget_lane_u8 ((uint8x8_t)
-		__builtin_aarch64_reduc_umin_v8qi ((int8x8_t) __a),
-		0);
+  return __builtin_aarch64_reduc_umin_scal_v8qi_uu (__a);
 }
 
 __extension__ static __inline uint16_t __attribute__ ((__always_inline__))
 vminv_u16 (uint16x4_t __a)
 {
-  return vget_lane_u16 ((uint16x4_t)
-		__builtin_aarch64_reduc_umin_v4hi ((int16x4_t) __a),
-		0);
+  return __builtin_aarch64_reduc_umin_scal_v4hi_uu (__a);
 }
 
 __extension__ static __inline uint32_t __attribute__ ((__always_inline__))
 vminv_u32 (uint32x2_t __a)
 {
-  return vget_lane_u32 ((uint32x2_t)
-		__builtin_aarch64_reduc_umin_v2si ((int32x2_t) __a),
-		0);
+  return __builtin_aarch64_reduc_umin_scal_v2si_uu (__a);
 }
 
 __extension__ static __inline float32_t __attribute__ ((__always_inline__))
 vminvq_f32 (float32x4_t __a)
 {
-  return vgetq_lane_f32 (__builtin_aarch64_reduc_smin_nan_v4sf (__a),
-			 0);
+  return __builtin_aarch64_reduc_smin_nan_scal_v4sf (__a);
 }
 
 __extension__ static __inline float64_t __attribute__ ((__always_inline__))
 vminvq_f64 (float64x2_t __a)
 {
-  return vgetq_lane_f64 (__builtin_aarch64_reduc_smin_nan_v2df (__a),
-			 0);
+  return __builtin_aarch64_reduc_smin_nan_scal_v2df (__a);
 }
 
 __extension__ static __inline int8_t __attribute__ ((__always_inline__))
 vminvq_s8 (int8x16_t __a)
 {
-  return vgetq_lane_s8 (__builtin_aarch64_reduc_smin_v16qi (__a), 0);
+  return __builtin_aarch64_reduc_smin_scal_v16qi (__a);
 }
 
 __extension__ static __inline int16_t __attribute__ ((__always_inline__))
 vminvq_s16 (int16x8_t __a)
 {
-  return vgetq_lane_s16 (__builtin_aarch64_reduc_smin_v8hi (__a), 0);
+  return __builtin_aarch64_reduc_smin_scal_v8hi (__a);
 }
 
 __extension__ static __inline int32_t __attribute__ ((__always_inline__))
 vminvq_s32 (int32x4_t __a)
 {
-  return vgetq_lane_s32 (__builtin_aarch64_reduc_smin_v4si (__a), 0);
+  return __builtin_aarch64_reduc_smin_scal_v4si (__a);
 }
 
 __extension__ static __inline uint8_t __attribute__ ((__always_inline__))
 vminvq_u8 (uint8x16_t __a)
 {
-  return vgetq_lane_u8 ((uint8x16_t)
-		__builtin_aarch64_reduc_umin_v16qi ((int8x16_t) __a),
-		0);
+  return __builtin_aarch64_reduc_umin_scal_v16qi_uu (__a);
 }
 
 __extension__ static __inline uint16_t __attribute__ ((__always_inline__))
 vminvq_u16 (uint16x8_t __a)
 {
-  return vgetq_lane_u16 ((uint16x8_t)
-		__builtin_aarch64_reduc_umin_v8hi ((int16x8_t) __a),
-		0);
+  return __builtin_aarch64_reduc_umin_scal_v8hi_uu (__a);
 }
 
 __extension__ static __inline uint32_t __attribute__ ((__always_inline__))
 vminvq_u32 (uint32x4_t __a)
 {
-  return vgetq_lane_u32 ((uint32x4_t)
-		__builtin_aarch64_reduc_umin_v4si ((int32x4_t) __a),
-		0);
+  return __builtin_aarch64_reduc_umin_scal_v4si_uu (__a);
 }
 
 /* vminnmv  */
@@ -18042,19 +18010,19 @@ vminvq_u32 (uint32x4_t __a)
 __extension__ static __inline float32_t __attribute__ ((__always_inline__))
 vminnmv_f32 (float32x2_t __a)
 {
-  return vget_lane_f32 (__builtin_aarch64_reduc_smin_v2sf (__a), 0);
+  return __builtin_aarch64_reduc_smin_scal_v2sf (__a);
 }
 
 __extension__ static __inline float32_t __attribute__ ((__always_inline__))
 vminnmvq_f32 (float32x4_t __a)
 {
-  return vgetq_lane_f32 (__builtin_aarch64_reduc_smin_v4sf (__a), 0);
+  return __builtin_aarch64_reduc_smin_scal_v4sf (__a);
 }
 
 __extension__ static __inline float64_t __attribute__ ((__always_inline__))
 vminnmvq_f64 (float64x2_t __a)
 {
-  return vgetq_lane_f64 (__builtin_aarch64_reduc_smin_v2df (__a), 0);
+  return __builtin_aarch64_reduc_smin_scal_v2df (__a);
 }
 
 /* vmla */

[-- Attachment #7: 6_aarch64_restore_gimple_fold.patch --]
[-- Type: text/x-patch; name=6_aarch64_restore_gimple_fold.patch, Size: 3755 bytes --]

commit 846d5932041e04bbf386efbc739aee9749051bc7
Author: Alan Lawrence <alan.lawrence@arm.com>
Date:   Wed Aug 13 17:25:13 2014 +0100

    AArch64: Reintroduce gimple_fold for min+max+plus
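
    (Illustrative effect on hypothetical user code:

        int32_t f (int32x4_t v) { return vaddvq_s32 (v); }

    the __builtin_aarch64_reduc_plus_scal_v4si call inside vaddvq_s32 is
    folded to a REDUC_PLUS_EXPR assignment; since that tree code is now
    endianness-neutral, the BYTES_BIG_ENDIAN bail-out below can be
    removed.)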

diff --git a/gcc/config/aarch64/aarch64-builtins.c b/gcc/config/aarch64/aarch64-builtins.c
index a49da89..283469b 100644
--- a/gcc/config/aarch64/aarch64-builtins.c
+++ b/gcc/config/aarch64/aarch64-builtins.c
@@ -1188,9 +1188,6 @@ aarch64_fold_builtin (tree fndecl, int n_args ATTRIBUTE_UNUSED, tree *args,
   return NULL_TREE;
 }
 
-/* Handling of reduction operations temporarily removed so as to decouple
-   changes to tree codes from AArch64 NEON Intrinsics.  */
-#if 0
 bool
 aarch64_gimple_fold_builtin (gimple_stmt_iterator *gsi)
 {
@@ -1200,19 +1197,6 @@ aarch64_gimple_fold_builtin (gimple_stmt_iterator *gsi)
   tree fndecl;
   gimple new_stmt = NULL;
 
-  /* The operations folded below are reduction operations.  These are
-     defined to leave their result in the 0'th element (from the perspective
-     of GCC).  The architectural instruction we are folding will leave the
-     result in the 0'th element (from the perspective of the architecture).
-     For big-endian systems, these perspectives are not aligned.
-
-     It is therefore wrong to perform this fold on big-endian.  There
-     are some tricks we could play with shuffling, but the mid-end is
-     inconsistent in the way it treats reduction operations, so we will
-     end up in difficulty.  Until we fix the ambiguity - just bail out.  */
-  if (BYTES_BIG_ENDIAN)
-    return false;
-
   if (call)
     {
       fndecl = gimple_call_fndecl (stmt);
@@ -1224,23 +1208,28 @@ aarch64_gimple_fold_builtin (gimple_stmt_iterator *gsi)
 			? gimple_call_arg_ptr (stmt, 0)
 			: &error_mark_node);
 
+	  /* We use gimple's REDUC_(PLUS|MIN|MAX)_EXPRs for float, signed int
+	     and unsigned int; it will distinguish according to the types of
+	     the arguments to the __builtin.  */
 	  switch (fcode)
 	    {
-	      BUILTIN_VALL (UNOP, reduc_splus_, 10)
-		new_stmt = gimple_build_assign_with_ops (
+	      BUILTIN_VALL (UNOP, reduc_plus_scal_, 10)
+	        new_stmt = gimple_build_assign_with_ops (
 						REDUC_PLUS_EXPR,
 						gimple_call_lhs (stmt),
 						args[0],
 						NULL_TREE);
 		break;
-	      BUILTIN_VDQIF (UNOP, reduc_smax_, 10)
+	      BUILTIN_VDQIF (UNOP, reduc_smax_scal_, 10)
+	      BUILTIN_VDQ_BHSI (UNOPU, reduc_umax_scal_, 10)
 		new_stmt = gimple_build_assign_with_ops (
 						REDUC_MAX_EXPR,
 						gimple_call_lhs (stmt),
 						args[0],
 						NULL_TREE);
 		break;
-	      BUILTIN_VDQIF (UNOP, reduc_smin_, 10)
+	      BUILTIN_VDQIF (UNOP, reduc_smin_scal_, 10)
+	      BUILTIN_VDQ_BHSI (UNOPU, reduc_umin_scal_, 10)
 		new_stmt = gimple_build_assign_with_ops (
 						REDUC_MIN_EXPR,
 						gimple_call_lhs (stmt),
@@ -1262,7 +1251,6 @@ aarch64_gimple_fold_builtin (gimple_stmt_iterator *gsi)
 
   return changed;
 }
-#endif
 
 void
 aarch64_atomic_assign_expand_fenv (tree *hold, tree *clear, tree *update)
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 27d82f3..db5ff59 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -10015,8 +10015,8 @@ aarch64_asan_shadow_offset (void)
 #undef TARGET_FRAME_POINTER_REQUIRED
 #define TARGET_FRAME_POINTER_REQUIRED aarch64_frame_pointer_required
 
-//#undef TARGET_GIMPLE_FOLD_BUILTIN
-//#define TARGET_GIMPLE_FOLD_BUILTIN aarch64_gimple_fold_builtin
+#undef TARGET_GIMPLE_FOLD_BUILTIN
+#define TARGET_GIMPLE_FOLD_BUILTIN aarch64_gimple_fold_builtin
 
 #undef TARGET_GIMPLIFY_VA_ARG_EXPR
 #define TARGET_GIMPLIFY_VA_ARG_EXPR aarch64_gimplify_va_arg_expr


* [PATCH 7/11][ARM] Migrate to new reduc_plus_scal_optab
  2014-10-24 11:57 [PATCH v2 0-6/11] Fix PR/61114, make direct vector reductions endianness-neutral Alan Lawrence
@ 2014-10-24 11:58 ` Alan Lawrence
  2014-11-03 17:32   ` Ramana Radhakrishnan
  2014-10-24 12:01 ` [PATCH v2 0-6/11] Fix PR/61114, make direct vector reductions endianness-neutral Richard Biener
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 27+ messages in thread
From: Alan Lawrence @ 2014-10-24 11:58 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 362 bytes --]

This migrates ARM from reduc_splus_optab and reduc_uplus_optab to a single 
reduc_plus_scal_optab.

Tested, in combination with next patch:
bootstrap on arm-none-linux-gnueabihf
cross-tested check-gcc on arm-none-eabi.

gcc/ChangeLog:

	* config/arm/neon.md (reduc_plus_*): Rename to...
	(reduc_plus_scal_*): ...this; reduce to temp and extract scalar result.
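
For intuition (a sketch, not from the patch): neon_pairwise_reduce applies
vpadd repeatedly, so for a 4-lane vector {a,b,c,d}

  /* vpadd v,v   ->  {a+b, c+d, a+b, c+d}
     vpadd again ->  {a+b+c+d, a+b+c+d, a+b+c+d, a+b+c+d}  */

every lane ends up holding the sum, which is why the new expander can
extract lane 0 regardless of endianness.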

[-- Attachment #2: 7_arm_reduc_plus.patch --]
[-- Type: text/x-patch; name=7_arm_reduc_plus.patch, Size: 3026 bytes --]

commit 22e60bd46f2a591f5357a543d76b19ed89f401ed
Author: Alan Lawrence <alan.lawrence@arm.com>
Date:   Thu Aug 28 16:12:24 2014 +0100

    ARM reduc_plus_scal, V_elem not V_ext, rm old reduc_[us]plus, emit the extract!

diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index 41cf913..d13fe5d 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -1349,33 +1349,47 @@
 
 ;; Reduction operations
 
-(define_expand "reduc_splus_<mode>"
-  [(match_operand:VD 0 "s_register_operand" "")
+(define_expand "reduc_plus_scal_<mode>"
+  [(match_operand:<V_elem> 0 "nonimmediate_operand" "")
    (match_operand:VD 1 "s_register_operand" "")]
   "TARGET_NEON && (!<Is_float_mode> || flag_unsafe_math_optimizations)"
 {
-  neon_pairwise_reduce (operands[0], operands[1], <MODE>mode,
+  rtx vec = gen_reg_rtx (<MODE>mode);
+  neon_pairwise_reduce (vec, operands[1], <MODE>mode,
 			&gen_neon_vpadd_internal<mode>);
+  /* The same result is actually computed into every element.  */
+  emit_insn (gen_vec_extract<mode> (operands[0], vec, const0_rtx));
   DONE;
 })
 
-(define_expand "reduc_splus_<mode>"
-  [(match_operand:VQ 0 "s_register_operand" "")
+(define_expand "reduc_plus_scal_<mode>"
+  [(match_operand:<V_elem> 0 "nonimmediate_operand" "")
    (match_operand:VQ 1 "s_register_operand" "")]
   "TARGET_NEON && (!<Is_float_mode> || flag_unsafe_math_optimizations)
    && !BYTES_BIG_ENDIAN"
 {
   rtx step1 = gen_reg_rtx (<V_HALF>mode);
-  rtx res_d = gen_reg_rtx (<V_HALF>mode);
 
   emit_insn (gen_quad_halves_plus<mode> (step1, operands[1]));
-  emit_insn (gen_reduc_splus_<V_half> (res_d, step1));
-  emit_insn (gen_move_lo_quad_<mode> (operands[0], res_d));
+  emit_insn (gen_reduc_plus_scal_<V_half> (operands[0], step1));
+
+  DONE;
+})
+
+(define_expand "reduc_plus_scal_v2di"
+  [(match_operand:DI 0 "nonimmediate_operand" "=w")
+   (match_operand:V2DI 1 "s_register_operand" "")]
+  "TARGET_NEON && !BYTES_BIG_ENDIAN"
+{
+  rtx vec = gen_reg_rtx (V2DImode);
+
+  emit_insn (gen_arm_reduc_plus_internal_v2di (vec, operands[1]));
+  emit_insn (gen_vec_extractv2di (operands[0], vec, const0_rtx));
 
   DONE;
 })
 
-(define_insn "reduc_splus_v2di"
+(define_insn "arm_reduc_plus_internal_v2di"
   [(set (match_operand:V2DI 0 "s_register_operand" "=w")
 	(unspec:V2DI [(match_operand:V2DI 1 "s_register_operand" "w")]
 		     UNSPEC_VPADD))]
@@ -1384,17 +1398,6 @@
   [(set_attr "type" "neon_add_q")]
 )
 
-;; NEON does not distinguish between signed and unsigned addition except on
-;; widening operations.
-(define_expand "reduc_uplus_<mode>"
-  [(match_operand:VDQI 0 "s_register_operand" "")
-   (match_operand:VDQI 1 "s_register_operand" "")]
-  "TARGET_NEON && (<Is_d_reg> || !BYTES_BIG_ENDIAN)"
-{
-  emit_insn (gen_reduc_splus_<mode> (operands[0], operands[1]));
-  DONE;
-})
-
 (define_expand "reduc_smin_<mode>"
   [(match_operand:VD 0 "s_register_operand" "")
    (match_operand:VD 1 "s_register_operand" "")]


* Re: [PATCH v2 0-6/11] Fix PR/61114, make direct vector reductions endianness-neutral
  2014-10-24 11:57 [PATCH v2 0-6/11] Fix PR/61114, make direct vector reductions endianness-neutral Alan Lawrence
  2014-10-24 11:58 ` [PATCH 7/11][ARM] Migrate to new reduc_plus_scal_optab Alan Lawrence
@ 2014-10-24 12:01 ` Richard Biener
  2014-10-24 12:05 ` [PATCH 8/11][ARM] Migrate to new reduc_[us](min|max)_scal_optab Alan Lawrence
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 27+ messages in thread
From: Richard Biener @ 2014-10-24 12:01 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: gcc-patches, David Edelsohn

On Fri, 24 Oct 2014, Alan Lawrence wrote:

> This is the first half of my previous patch series
> (https://gcc.gnu.org/ml/gcc-patches/2014-09/msg01456.html), that is the part
> making the REDUC_..._EXPR tree codes endian-neutral, and adding a new
> reduce-to-scalar optab in place of the endianness-dependent
> reduc_[us](plus|min|max)_optab.
> 
> I'm leaving the vec_shr portion out of this patch series, as the link between
> the two halves is only the end goal of removing an "if (BYTES_BIG_ENDIAN)"
> from tree-vect-loop.c; this series removes that from one code path so can
> stand alone.
> 
> Patches 1-6 are as previously posted apart from rebasing and removing the
> old/poisoned AArch64 patterns as per maintainer's request. Patches 1, 2, 4, 5
> and 6 have already been approved; patch 3 was discussed somewhat but I think
> we decided against most of the ideas raised; I have added a comment to
> scalar_reduc_to_vector. I now reread Richie's "Otherwise the patch looks good
> to me" and wonder if I should have taken that as an approval but I didn't read
> it that way at the time...???

Yes, it was an approval ;)

> Patches 7-11 migrate ARM, x86, IA64 (I think), and mostly PowerPC, to
> the new reduc_(plus|[us](min|max))_scal_optab. I have not managed to work out
> how to do the same for MIPS (specifically what I need to add to
> mips_expand_vec_reduc), and have had no response from the maintainers, so am
> leaving that for now. Also I haven't migrated (or worked out how to target)
> rs6000/paired.md, help would be most welcome.
> 
> 
> The suggestion was then to "complete" the migration, by removing the old
> optabs. There are a few options here and I'll follow up with appropriate
> patches according to feedback received. I see options:
> 
> (1) just delete the old optabs (and the migration code). This would
> performance-regress the MIPS backend, but should not break it, although one
> should really do *something* with the then-unused
> reduc_[us](plus|min|max)_optab in config/mips/loongson.md.
>
> (2) also renaming reduc_..._scal_optab back to reduc_..._optab; would break
> the MIPS backend if something were not done with its existing patterns.
> 
> (2a) Alternatively I could just use a different new name, e.g. reduce_....,
> reduct_...., vec_reduc_..., anything that's less of a mouthful than
> reduc_..._scal. Whilst being only very slightly different from the current
> reduc_... might be confusing, so might changing the meaning of the optab, and
> its signature, while keeping the existing name; so I am open to suggestions.

I definitely prefer (2).

Thanks,
Richard.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 8/11][ARM] Migrate to new reduc_[us](min|max)_scal_optab
  2014-10-24 11:57 [PATCH v2 0-6/11] Fix PR/61114, make direct vector reductions endianness-neutral Alan Lawrence
  2014-10-24 11:58 ` [PATCH 7/11][ARM] Migrate to new reduc_plus_scal_optab Alan Lawrence
  2014-10-24 12:01 ` [PATCH v2 0-6/11] Fix PR/61114, make direct vector reductions endianness-neutral Richard Biener
@ 2014-10-24 12:05 ` Alan Lawrence
  2014-11-04 11:08   ` Ramana Radhakrishnan
  2014-10-24 12:06 ` [PATCH 9/11][i386] Migrate reduction optabs to reduc_..._scal Alan Lawrence
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 27+ messages in thread
From: Alan Lawrence @ 2014-10-24 12:05 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 599 bytes --]

Similarly to last patch.

Tested, in combination with previous patch:
bootstrap on arm-none-linux-gnueabihf
cross-tested check-gcc on arm-none-eabi.

gcc/ChangeLog:

	* config/arm/neon.md (reduc_smin_<mode> *2): Rename to...
	(reduc_smin_scal_<mode> *2): ...this; extract scalar result.
	(reduc_smax_<mode> *2): Rename to...
	(reduc_smax_scal_<mode> *2): ...this; extract scalar result.
	(reduc_umin_<mode> *2): Rename to...
	(reduc_umin_scal_<mode> *2): ...this; extract scalar result.
	(reduc_umax_<mode> *2): Rename to...
	(reduc_umax_scal_<mode> *2): ...this; extract scalar result.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 8_arm_reduc_minmax.patch --]
[-- Type: text/x-patch; name=8_arm_reduc_minmax.patch, Size: 5679 bytes --]

commit 537c31561933f8054a2289198f35b19cf5c4196e
Author: Alan Lawrence <alan.lawrence@arm.com>
Date:   Thu Aug 28 16:49:24 2014 +0100

    ARM reduc_[us](min|max)_scal, V_elem not V_ext, rm old non-_scal version.

diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index d13fe5d..19e1ba0 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -1398,104 +1398,109 @@
   [(set_attr "type" "neon_add_q")]
 )
 
-(define_expand "reduc_smin_<mode>"
-  [(match_operand:VD 0 "s_register_operand" "")
+(define_expand "reduc_smin_scal_<mode>"
+  [(match_operand:<V_elem> 0 "nonimmediate_operand" "")
    (match_operand:VD 1 "s_register_operand" "")]
   "TARGET_NEON && (!<Is_float_mode> || flag_unsafe_math_optimizations)"
 {
-  neon_pairwise_reduce (operands[0], operands[1], <MODE>mode,
+  rtx vec = gen_reg_rtx (<MODE>mode);
+
+  neon_pairwise_reduce (vec, operands[1], <MODE>mode,
 			&gen_neon_vpsmin<mode>);
+  /* The result is computed into every element of the vector.  */
+  emit_insn (gen_vec_extract<mode> (operands[0], vec, const0_rtx));
   DONE;
 })
 
-(define_expand "reduc_smin_<mode>"
-  [(match_operand:VQ 0 "s_register_operand" "")
+(define_expand "reduc_smin_scal_<mode>"
+  [(match_operand:<V_elem> 0 "nonimmediate_operand" "")
    (match_operand:VQ 1 "s_register_operand" "")]
   "TARGET_NEON && (!<Is_float_mode> || flag_unsafe_math_optimizations)
    && !BYTES_BIG_ENDIAN"
 {
   rtx step1 = gen_reg_rtx (<V_HALF>mode);
-  rtx res_d = gen_reg_rtx (<V_HALF>mode);
 
   emit_insn (gen_quad_halves_smin<mode> (step1, operands[1]));
-  emit_insn (gen_reduc_smin_<V_half> (res_d, step1));
-  emit_insn (gen_move_lo_quad_<mode> (operands[0], res_d));
+  emit_insn (gen_reduc_smin_scal_<V_half> (operands[0], step1));
 
   DONE;
 })
 
-(define_expand "reduc_smax_<mode>"
-  [(match_operand:VD 0 "s_register_operand" "")
+(define_expand "reduc_smax_scal_<mode>"
+  [(match_operand:<V_elem> 0 "nonimmediate_operand" "")
    (match_operand:VD 1 "s_register_operand" "")]
   "TARGET_NEON && (!<Is_float_mode> || flag_unsafe_math_optimizations)"
 {
-  neon_pairwise_reduce (operands[0], operands[1], <MODE>mode,
+  rtx vec = gen_reg_rtx (<MODE>mode);
+  neon_pairwise_reduce (vec, operands[1], <MODE>mode,
 			&gen_neon_vpsmax<mode>);
+  /* The result is computed into every element of the vector.  */
+  emit_insn (gen_vec_extract<mode> (operands[0], vec, const0_rtx));
   DONE;
 })
 
-(define_expand "reduc_smax_<mode>"
-  [(match_operand:VQ 0 "s_register_operand" "")
+(define_expand "reduc_smax_scal_<mode>"
+  [(match_operand:<V_elem> 0 "nonimmediate_operand" "")
    (match_operand:VQ 1 "s_register_operand" "")]
   "TARGET_NEON && (!<Is_float_mode> || flag_unsafe_math_optimizations)
    && !BYTES_BIG_ENDIAN"
 {
   rtx step1 = gen_reg_rtx (<V_HALF>mode);
-  rtx res_d = gen_reg_rtx (<V_HALF>mode);
 
   emit_insn (gen_quad_halves_smax<mode> (step1, operands[1]));
-  emit_insn (gen_reduc_smax_<V_half> (res_d, step1));
-  emit_insn (gen_move_lo_quad_<mode> (operands[0], res_d));
+  emit_insn (gen_reduc_smax_scal_<V_half> (operands[0], step1));
 
   DONE;
 })
 
-(define_expand "reduc_umin_<mode>"
-  [(match_operand:VDI 0 "s_register_operand" "")
+(define_expand "reduc_umin_scal_<mode>"
+  [(match_operand:<V_elem> 0 "nonimmediate_operand" "")
    (match_operand:VDI 1 "s_register_operand" "")]
   "TARGET_NEON"
 {
-  neon_pairwise_reduce (operands[0], operands[1], <MODE>mode,
+  rtx vec = gen_reg_rtx (<MODE>mode);
+  neon_pairwise_reduce (vec, operands[1], <MODE>mode,
 			&gen_neon_vpumin<mode>);
+  /* The result is computed into every element of the vector.  */
+  emit_insn (gen_vec_extract<mode> (operands[0], vec, const0_rtx));
   DONE;
 })
 
-(define_expand "reduc_umin_<mode>"
-  [(match_operand:VQI 0 "s_register_operand" "")
+(define_expand "reduc_umin_scal_<mode>"
+  [(match_operand:<V_elem> 0 "nonimmediate_operand" "")
    (match_operand:VQI 1 "s_register_operand" "")]
   "TARGET_NEON && !BYTES_BIG_ENDIAN"
 {
   rtx step1 = gen_reg_rtx (<V_HALF>mode);
-  rtx res_d = gen_reg_rtx (<V_HALF>mode);
 
   emit_insn (gen_quad_halves_umin<mode> (step1, operands[1]));
-  emit_insn (gen_reduc_umin_<V_half> (res_d, step1));
-  emit_insn (gen_move_lo_quad_<mode> (operands[0], res_d));
+  emit_insn (gen_reduc_umin_scal_<V_half> (operands[0], step1));
 
   DONE;
 })
 
-(define_expand "reduc_umax_<mode>"
-  [(match_operand:VDI 0 "s_register_operand" "")
+(define_expand "reduc_umax_scal_<mode>"
+  [(match_operand:<V_elem> 0 "nonimmediate_operand" "")
    (match_operand:VDI 1 "s_register_operand" "")]
   "TARGET_NEON"
 {
-  neon_pairwise_reduce (operands[0], operands[1], <MODE>mode,
+  rtx vec = gen_reg_rtx (<MODE>mode);
+  neon_pairwise_reduce (vec, operands[1], <MODE>mode,
 			&gen_neon_vpumax<mode>);
+  /* The result is computed into every element of the vector.  */
+  emit_insn (gen_vec_extract<mode> (operands[0], vec, const0_rtx));
   DONE;
 })
 
-(define_expand "reduc_umax_<mode>"
-  [(match_operand:VQI 0 "s_register_operand" "")
+(define_expand "reduc_umax_scal_<mode>"
+  [(match_operand:<V_elem> 0 "nonimmediate_operand" "")
    (match_operand:VQI 1 "s_register_operand" "")]
   "TARGET_NEON && !BYTES_BIG_ENDIAN"
 {
   rtx step1 = gen_reg_rtx (<V_HALF>mode);
-  rtx res_d = gen_reg_rtx (<V_HALF>mode);
 
   emit_insn (gen_quad_halves_umax<mode> (step1, operands[1]));
-  emit_insn (gen_reduc_umax_<V_half> (res_d, step1));
-  emit_insn (gen_move_lo_quad_<mode> (operands[0], res_d));
+  emit_insn (gen_reduc_umax_scal_<V_half> (operands[0], step1));
 
   DONE;
 })

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 9/11][i386] Migrate reduction optabs to reduc_..._scal
  2014-10-24 11:57 [PATCH v2 0-6/11] Fix PR/61114, make direct vector reductions endianness-neutral Alan Lawrence
                   ` (2 preceding siblings ...)
  2014-10-24 12:05 ` [PATCH 8/11][ARM] Migrate to new reduc_[us](min|max)_scal_optab Alan Lawrence
@ 2014-10-24 12:06 ` Alan Lawrence
  2014-10-24 12:07 ` [PATCH 10/11][RS6000] " Alan Lawrence
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 27+ messages in thread
From: Alan Lawrence @ 2014-10-24 12:06 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 649 bytes --]

Bootstrapped and check-gcc on x86_64-none-linux-gnu.

gcc/ChangeLog:

	* config/i386/i386.c (ix86_expand_reduc): Extract result into scalar.
	* config/i386/sse.md (reduc_splus_v8df, reduc_<code>_<mode> * 3,
	reduc_umin_v8hi): Rename to...
	(reduc_plus_scal_v8df, reduc_<code>_scal_<mode> * 3,
	reduc_umin_scal_v8hi): ...these, changing result mode to scalar.

	(reduc_splus_v4df, reduc_splus_v2df, reduc_splus_v16sf,
	reduc_splus_v8sf, reduc_splus_v4sf): Rename to...
	(reduc_plus_scal_v4df, reduc_plus_scal_v2df, reduc_plus_scal_v16sf,
	reduc_plus_scal_v8sf, reduc_plus_scal_v4sf): ...these, adding
	gen_vec_extract for scalar result.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 9_x86.patch --]
[-- Type: text/x-patch; name=9_x86.patch, Size: 6610 bytes --]

commit 80b0d10a78b2f3e86325f373e99e9cf71e42e622
Author: Alan Lawrence <alan.lawrence@arm.com>
Date:   Tue Oct 7 13:25:08 2014 +0100

    i386

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 4c4a6eb..670a5f5 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -41211,12 +41211,12 @@ emit_reduc_half (rtx dest, rtx src, int i)
 }
 
 /* Expand a vector reduction.  FN is the binary pattern to reduce;
-   DEST is the destination; IN is the input vector.  */
+   DEST is the (scalar) destination; IN is the input vector.  */
 
 void
 ix86_expand_reduc (rtx (*fn) (rtx, rtx, rtx), rtx dest, rtx in)
 {
-  rtx half, dst, vec = in;
+  rtx half, dst = NULL_RTX, vec = in;
   enum machine_mode mode = GET_MODE (in);
   int i;
 
@@ -41225,23 +41225,21 @@ ix86_expand_reduc (rtx (*fn) (rtx, rtx, rtx), rtx dest, rtx in)
       && mode == V8HImode
       && fn == gen_uminv8hi3)
     {
-      emit_insn (gen_sse4_1_phminposuw (dest, in));
-      return;
+      dst = gen_reg_rtx (mode);
+      emit_insn (gen_sse4_1_phminposuw (dst, in));
     }
-
-  for (i = GET_MODE_BITSIZE (mode);
-       i > GET_MODE_BITSIZE (GET_MODE_INNER (mode));
-       i >>= 1)
-    {
+  else
+    for (i = GET_MODE_BITSIZE (mode);
+	  i > GET_MODE_BITSIZE (GET_MODE_INNER (mode));
+	  i >>= 1)
+      {
       half = gen_reg_rtx (mode);
       emit_reduc_half (half, vec, i);
-      if (i == GET_MODE_BITSIZE (GET_MODE_INNER (mode)) * 2)
-	dst = dest;
-      else
-	dst = gen_reg_rtx (mode);
+      dst = gen_reg_rtx (mode);
       emit_insn (fn (dst, half, vec));
       vec = dst;
     }
+  ix86_expand_vector_extract (false, dest, dst, 0);
 }
 \f
 /* Target hook for scalar_mode_supported_p.  */
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index e7646d7..e4e0b95 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -2238,8 +2238,8 @@
    (set_attr "prefix_rep" "1,*")
    (set_attr "mode" "V4SF")])
 
-(define_expand "reduc_splus_v8df"
-  [(match_operand:V8DF 0 "register_operand")
+(define_expand "reduc_plus_scal_v8df"
+  [(match_operand:DF 0 "register_operand")
    (match_operand:V8DF 1 "register_operand")]
   "TARGET_AVX512F"
 {
@@ -2247,30 +2247,35 @@
   DONE;
 })
 
-(define_expand "reduc_splus_v4df"
-  [(match_operand:V4DF 0 "register_operand")
+(define_expand "reduc_plus_scal_v4df"
+  [(match_operand:DF 0 "register_operand")
    (match_operand:V4DF 1 "register_operand")]
   "TARGET_AVX"
 {
   rtx tmp = gen_reg_rtx (V4DFmode);
   rtx tmp2 = gen_reg_rtx (V4DFmode);
+  rtx tmp3 = gen_reg_rtx (V4DFmode);
+  
   emit_insn (gen_avx_haddv4df3 (tmp, operands[1], operands[1]));
   emit_insn (gen_avx_vperm2f128v4df3 (tmp2, tmp, tmp, GEN_INT (1)));
-  emit_insn (gen_addv4df3 (operands[0], tmp, tmp2));
+  emit_insn (gen_addv4df3 (tmp3, tmp, tmp2));
+  emit_insn (gen_vec_extractv4df (operands[0], tmp3, GEN_INT (1)));
   DONE;
 })
 
-(define_expand "reduc_splus_v2df"
-  [(match_operand:V2DF 0 "register_operand")
+(define_expand "reduc_plus_scal_v2df"
+  [(match_operand:DF 0 "register_operand")
    (match_operand:V2DF 1 "register_operand")]
   "TARGET_SSE3"
 {
-  emit_insn (gen_sse3_haddv2df3 (operands[0], operands[1], operands[1]));
+  rtx tmp = gen_reg_rtx (V2DFmode);
+  emit_insn (gen_sse3_haddv2df3 (tmp, operands[1], operands[1]));
+  emit_insn (gen_vec_extractv2df (operands[0], tmp, GEN_INT (0)));
   DONE;
 })
 
-(define_expand "reduc_splus_v16sf"
-  [(match_operand:V16SF 0 "register_operand")
+(define_expand "reduc_plus_scal_v16sf"
+  [(match_operand:SF 0 "register_operand")
    (match_operand:V16SF 1 "register_operand")]
   "TARGET_AVX512F"
 {
@@ -2278,30 +2283,35 @@
   DONE;
 })
 
-(define_expand "reduc_splus_v8sf"
-  [(match_operand:V8SF 0 "register_operand")
+(define_expand "reduc_plus_scal_v8sf"
+  [(match_operand:SF 0 "register_operand")
    (match_operand:V8SF 1 "register_operand")]
   "TARGET_AVX"
 {
   rtx tmp = gen_reg_rtx (V8SFmode);
   rtx tmp2 = gen_reg_rtx (V8SFmode);
+  rtx tmp3 = gen_reg_rtx (V8SFmode);
+  
   emit_insn (gen_avx_haddv8sf3 (tmp, operands[1], operands[1]));
   emit_insn (gen_avx_haddv8sf3 (tmp2, tmp, tmp));
   emit_insn (gen_avx_vperm2f128v8sf3 (tmp, tmp2, tmp2, GEN_INT (1)));
-  emit_insn (gen_addv8sf3 (operands[0], tmp, tmp2));
+  emit_insn (gen_addv8sf3 (tmp3, tmp, tmp2));
+  emit_insn (gen_vec_extractv8sf (operands[0], tmp3, GEN_INT (0)));
   DONE;
 })
 
-(define_expand "reduc_splus_v4sf"
-  [(match_operand:V4SF 0 "register_operand")
+(define_expand "reduc_plus_scal_v4sf"
+  [(match_operand:SF 0 "register_operand")
    (match_operand:V4SF 1 "register_operand")]
   "TARGET_SSE"
 {
   if (TARGET_SSE3)
     {
       rtx tmp = gen_reg_rtx (V4SFmode);
+      rtx tmp2 = gen_reg_rtx (V4SFmode);
       emit_insn (gen_sse3_haddv4sf3 (tmp, operands[1], operands[1]));
-      emit_insn (gen_sse3_haddv4sf3 (operands[0], tmp, tmp));
+      emit_insn (gen_sse3_haddv4sf3 (tmp2, tmp, tmp));
+      emit_insn (gen_vec_extractv4sf (operands[0], tmp2, GEN_INT (0)));
     }
   else
     ix86_expand_reduc (gen_addv4sf3, operands[0], operands[1]);
@@ -2317,9 +2327,9 @@
    (V8DI "TARGET_AVX512F") (V16SF "TARGET_AVX512F")
    (V8DF "TARGET_AVX512F")])
 
-(define_expand "reduc_<code>_<mode>"
+(define_expand "reduc_<code>_scal_<mode>"
   [(smaxmin:REDUC_SMINMAX_MODE
-     (match_operand:REDUC_SMINMAX_MODE 0 "register_operand")
+     (match_operand:<ssescalarmode> 0 "register_operand")
      (match_operand:REDUC_SMINMAX_MODE 1 "register_operand"))]
   ""
 {
@@ -2327,9 +2337,9 @@
   DONE;
 })
 
-(define_expand "reduc_<code>_<mode>"
+(define_expand "reduc_<code>_scal_<mode>"
   [(umaxmin:VI48_512
-     (match_operand:VI48_512 0 "register_operand")
+     (match_operand:<ssescalarmode> 0 "register_operand")
      (match_operand:VI48_512 1 "register_operand"))]
   "TARGET_AVX512F"
 {
@@ -2337,9 +2347,9 @@
   DONE;
 })
 
-(define_expand "reduc_<code>_<mode>"
+(define_expand "reduc_<code>_scal_<mode>"
   [(umaxmin:VI_256
-     (match_operand:VI_256 0 "register_operand")
+     (match_operand:<ssescalarmode> 0 "register_operand")
      (match_operand:VI_256 1 "register_operand"))]
   "TARGET_AVX2"
 {
@@ -2347,9 +2357,9 @@
   DONE;
 })
 
-(define_expand "reduc_umin_v8hi"
+(define_expand "reduc_umin_scal_v8hi"
   [(umin:V8HI
-     (match_operand:V8HI 0 "register_operand")
+     (match_operand:HI 0 "register_operand")
      (match_operand:V8HI 1 "register_operand"))]
   "TARGET_SSE4_1"
 {

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 10/11][RS6000] Migrate reduction optabs to reduc_..._scal
  2014-10-24 11:57 [PATCH v2 0-6/11] Fix PR/61114, make direct vector reductions endianness-neutral Alan Lawrence
                   ` (3 preceding siblings ...)
  2014-10-24 12:06 ` [PATCH 9/11][i386] Migrate reduction optabs to reduc_..._scal Alan Lawrence
@ 2014-10-24 12:07 ` Alan Lawrence
  2014-10-24 12:14   ` Alan Lawrence
                     ` (2 more replies)
  2014-10-24 12:12 ` [Protopatch 11/11][IA64] Migrate to reduc_(plus|min|max)_scal_v2df optab Alan Lawrence
  2014-10-24 15:19 ` [PATCH v2 0-6/11] Fix PR/61114, make direct vector reductions endianness-neutral Matthew Fortune
  6 siblings, 3 replies; 27+ messages in thread
From: Alan Lawrence @ 2014-10-24 12:07 UTC (permalink / raw)
  To: gcc-patches; +Cc: David Edelsohn, Segher Boessenkool

This migrates the reduction patterns in altivec.md and vector.md to the new 
names. I've not touched paired.md as I wasn't really sure how to fix that (how 
do I vec_extractv2sf?); moreover, the testing I did didn't seem to exercise any
of those patterns (iow: I'm not sure what would be an appropriate target machine?).

I note the reduc_uplus_v16qi (which I've removed, as unsigned and signed 
addition should be equivalent) differed from reduc_splus_v16qi in using 
gen_altivec_vsum4ubs rather than gen_altivec_vsum4sbs.  Testcases 
gcc.dg/vect/{slp-24-big-array.c,slp-24.c,vect-reduc-1char-big-array.c,vect-reduc-1char.c}
thus produce assembly which differs from previously (only) in that "vsum4ubs" 
becomes "vsum4sbs". These tests are still passing so I assume this is OK.

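For the record, the "unsigned and signed addition should be equivalent"
reasoning is just that wrapping addition on a two's-complement machine is
bit-identical whatever the signedness of its operands.  A minimal
illustration (mine, not part of the patch):

  #include <stdio.h>

  int
  main (void)
  {
    signed char s = -1;     /* bit pattern 0xff */
    unsigned char u = 255;  /* same bit pattern  */
    /* Both sums wrap to the byte 0xfe; only the interpretation of the
       bits differs, not the bits themselves.  */
    printf ("%02x %02x\n",
            (unsigned char) (s + s), (unsigned char) (u + u));
    return 0;
  }
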
The combining of signed and unsigned addition also improves 
gcc.dg/vect/{vect-outer-4i.c,vect-reduc-1short.c,vect-reduc-dot-u8b.c,vect-reduc-pattern-1c-big-array.c,vect-reduc-pattern-1c.c} 
: these are now reduced using direct vector reduction, rather than with shifts 
as previously (because there was only a reduc_splus rather than the reduc_uplus 
these tests looked for).

((Side note: the RTL changes to vector.md are to match the combine patterns in 
vsx.md; now that we no longer depend upon combine to generate those patterns 
(as the optab outputs them directly), one might wish to remove the smaller 
pattern from vsx.md, and/or simplify the RTL. I theorize that a reduction of a 
two-element vector is just adding the first element to the second, so maybe to 
something like

   [(parallel [(set (match_operand:DF 0 "vfloat_operand" "")
		   (VEC_reduc:V2DF
		    (vec_select:DF
		     (match_operand:V2DF 1 "vfloat_operand" "")
		     (parallel [(const_int 1)]))
		    (vec_select:DF
		     (match_dup 1)
		     (parallel [(const_int 0)]))))
	      (clobber (match_scratch:V2DF 2 ""))])]

but I think it's best for me to leave that to the port maintainers.))

Bootstrapped and check-gcc on powerpc64-none-linux-gnu (gcc110.fsffrance.org, 
with thanks to the GCC Compile Farm).

gcc/ChangeLog:

	* config/rs6000/altivec.md (reduc_splus_<mode>): Rename to...
	(reduc_plus_scal_<mode>): ...this, and rs6000_expand_vector_extract.
	(reduc_uplus_v16qi): Remove.

	* config/rs6000/vector.md (VEC_reduc_name): change "splus" to "plus"
	(reduc_<VEC_reduc_name>_v2df): Rename to...
	(reduc_<VEC_reduc_name>_scal_v2df): ...this, wrap VEC_reduc in a
	vec_select of element 1.
	(reduc_<VEC_reduc_name>_v4sf): Rename to...
	(reduc_<VEC_reduc_name>_scal_v4sf): ...this, wrap VEC_reduc in a
	vec_select of element 3, add scratch register.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Protopatch 11/11][IA64] Migrate to reduc_(plus|min|max)_scal_v2df optab
  2014-10-24 11:57 [PATCH v2 0-6/11] Fix PR/61114, make direct vector reductions endianness-neutral Alan Lawrence
                   ` (4 preceding siblings ...)
  2014-10-24 12:07 ` [PATCH 10/11][RS6000] " Alan Lawrence
@ 2014-10-24 12:12 ` Alan Lawrence
  2014-10-24 12:50   ` Alan Lawrence
  2014-10-24 15:19 ` [PATCH v2 0-6/11] Fix PR/61114, make direct vector reductions endianness-neutral Matthew Fortune
  6 siblings, 1 reply; 27+ messages in thread
From: Alan Lawrence @ 2014-10-24 12:12 UTC (permalink / raw)
  To: gcc-patches; +Cc: Steve Ellcey

This is an attempt to migrate IA64 to the newer optabs; however, I found that
none of the tests in gcc.dg/vect seemed to touch any of the affected patterns,
so this is only really tested by building a stage-1 compiler.

gcc/ChangeLog:

	* config/ia64/vect.md (reduc_splus_v2sf): Rename to...
	(reduc_plus_scal_v2sf): ...this, add a vec_extractv2sf.
	(reduc_smin_v2sf): Rename to...
	(reduc_smin_scal_v2sf): ...this, add a vec_extractv2sf.
	(reduc_smax_v2sf): Rename to...
	(reduc_smax_scal_v2sf): ...this, add a vec_extractv2sf.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 10/11][RS6000] Migrate reduction optabs to reduc_..._scal
  2014-10-24 12:07 ` [PATCH 10/11][RS6000] " Alan Lawrence
@ 2014-10-24 12:14   ` Alan Lawrence
  2014-10-25  0:08   ` David Edelsohn
  2014-11-10 22:39   ` Michael Meissner
  2 siblings, 0 replies; 27+ messages in thread
From: Alan Lawrence @ 2014-10-24 12:14 UTC (permalink / raw)
  To: gcc-patches, David Edelsohn, Segher Boessenkool

[-- Attachment #1: Type: text/plain, Size: 16 bytes --]

Ooops, attached.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 10_rs6000.patch --]
[-- Type: text/x-patch; name=10_rs6000.patch, Size: 4322 bytes --]

commit e48d59399722ce8316d4b1b4f28b40d87b1193fa
Author: Alan Lawrence <alan.lawrence@arm.com>
Date:   Tue Oct 7 15:28:47 2014 +0100

    PowerPC v2 (but not paired.md)

diff --git a/gcc/config/rs6000/altivec.md b/gcc/config/rs6000/altivec.md
index 02ea142..92bb5d0 100644
--- a/gcc/config/rs6000/altivec.md
+++ b/gcc/config/rs6000/altivec.md
@@ -2596,35 +2596,22 @@
   operands[3] = gen_reg_rtx (GET_MODE (operands[0]));
 })
 
-(define_expand "reduc_splus_<mode>"
-  [(set (match_operand:VIshort 0 "register_operand" "=v")
+(define_expand "reduc_plus_scal_<mode>"
+  [(set (match_operand:<VI_scalar> 0 "register_operand" "=v")
         (unspec:VIshort [(match_operand:VIshort 1 "register_operand" "v")]
 			UNSPEC_REDUC_PLUS))]
   "TARGET_ALTIVEC"
 {
   rtx vzero = gen_reg_rtx (V4SImode);
   rtx vtmp1 = gen_reg_rtx (V4SImode);
-  rtx dest = gen_lowpart (V4SImode, operands[0]);
+  rtx vtmp2 = gen_reg_rtx (<MODE>mode);
+  rtx dest = gen_lowpart (V4SImode, vtmp2);
+  HOST_WIDE_INT last_elem = GET_MODE_NUNITS (<MODE>mode) - 1;
 
   emit_insn (gen_altivec_vspltisw (vzero, const0_rtx));
   emit_insn (gen_altivec_vsum4s<VI_char>s (vtmp1, operands[1], vzero));
   emit_insn (gen_altivec_vsumsws_direct (dest, vtmp1, vzero));
-  DONE;
-})
-
-(define_expand "reduc_uplus_v16qi"
-  [(set (match_operand:V16QI 0 "register_operand" "=v")
-        (unspec:V16QI [(match_operand:V16QI 1 "register_operand" "v")]
-		      UNSPEC_REDUC_PLUS))]
-  "TARGET_ALTIVEC"
-{
-  rtx vzero = gen_reg_rtx (V4SImode);
-  rtx vtmp1 = gen_reg_rtx (V4SImode);
-  rtx dest = gen_lowpart (V4SImode, operands[0]);
-
-  emit_insn (gen_altivec_vspltisw (vzero, const0_rtx));
-  emit_insn (gen_altivec_vsum4ubs (vtmp1, operands[1], vzero));
-  emit_insn (gen_altivec_vsumsws_direct (dest, vtmp1, vzero));
+  rs6000_expand_vector_extract (operands[0], vtmp2, last_elem);
   DONE;
 })
 
diff --git a/gcc/config/rs6000/vector.md b/gcc/config/rs6000/vector.md
index 237724e..54b18aa 100644
--- a/gcc/config/rs6000/vector.md
+++ b/gcc/config/rs6000/vector.md
@@ -81,7 +81,7 @@
 ;; Vector reduction code iterators
 (define_code_iterator VEC_reduc [plus smin smax])
 
-(define_code_attr VEC_reduc_name [(plus "splus")
+(define_code_attr VEC_reduc_name [(plus "plus")
 				  (smin "smin")
 				  (smax "smax")])
 
@@ -1077,18 +1077,20 @@
 \f
 ;; Vector reduction expanders for VSX
 
-(define_expand "reduc_<VEC_reduc_name>_v2df"
-  [(parallel [(set (match_operand:V2DF 0 "vfloat_operand" "")
-		   (VEC_reduc:V2DF
-		    (vec_concat:V2DF
-		     (vec_select:DF
-		      (match_operand:V2DF 1 "vfloat_operand" "")
-		      (parallel [(const_int 1)]))
-		     (vec_select:DF
-		      (match_dup 1)
-		      (parallel [(const_int 0)])))
-		    (match_dup 1)))
-	      (clobber (match_scratch:V2DF 2 ""))])]
+(define_expand "reduc_<VEC_reduc_name>_scal_v2df"
+  [(parallel [(set (match_operand:DF 0 "vfloat_operand" "")
+		   (vec_select:DF
+		    (VEC_reduc:V2DF
+		     (vec_concat:V2DF
+		      (vec_select:DF
+		       (match_operand:V2DF 1 "vfloat_operand" "")
+		       (parallel [(const_int 1)]))
+		      (vec_select:DF
+		       (match_dup 1)
+		       (parallel [(const_int 0)])))
+		     (match_dup 1))
+		    (parallel [(const_int 1)])))
+	      (clobber (match_scratch:DF 2 ""))])]
   "VECTOR_UNIT_VSX_P (V2DFmode)"
   "")
 
@@ -1099,13 +1101,16 @@
 ; is to allow us to use a code iterator, but not completely list all of the
 ; vector rotates, etc. to prevent canonicalization
 
-(define_expand "reduc_<VEC_reduc_name>_v4sf"
-  [(parallel [(set (match_operand:V4SF 0 "vfloat_operand" "")
-		   (VEC_reduc:V4SF
-		    (unspec:V4SF [(const_int 0)] UNSPEC_REDUC)
-		    (match_operand:V4SF 1 "vfloat_operand" "")))
+(define_expand "reduc_<VEC_reduc_name>_scal_v4sf"
+  [(parallel [(set (match_operand:SF 0 "vfloat_operand" "")
+		   (vec_select:SF
+		    (VEC_reduc:V4SF
+		     (unspec:V4SF [(const_int 0)] UNSPEC_REDUC)
+		     (match_operand:V4SF 1 "vfloat_operand" ""))
+		    (parallel [(const_int 3)])))
 	      (clobber (match_scratch:V4SF 2 ""))
-	      (clobber (match_scratch:V4SF 3 ""))])]
+	      (clobber (match_scratch:V4SF 3 ""))
+	      (clobber (match_scratch:V4SF 4 ""))])]
   "VECTOR_UNIT_VSX_P (V4SFmode)"
   "")
 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Protopatch 11/11][IA64] Migrate to reduc_(plus|min|max)_scal_v2df optab
  2014-10-24 12:12 ` [Protopatch 11/11][IA64] Migrate to reduc_(plus|min|max)_scal_v2df optab Alan Lawrence
@ 2014-10-24 12:50   ` Alan Lawrence
  0 siblings, 0 replies; 27+ messages in thread
From: Alan Lawrence @ 2014-10-24 12:50 UTC (permalink / raw)
  To: gcc-patches; +Cc: Steve Ellcey

[-- Attachment #1: Type: text/plain, Size: 16 bytes --]

Ooops, attached.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 11_ia64.patch --]
[-- Type: text/x-patch; name=11_ia64.patch, Size: 2529 bytes --]

commit 56296417b9f6795e541b1101dce6e6ac1789de9a
Author: Alan Lawrence <alan.lawrence@arm.com>
Date:   Wed Oct 8 15:58:27 2014 +0100

    IA64 (?!)

diff --git a/gcc/config/ia64/vect.md b/gcc/config/ia64/vect.md
index e3ce292..45f4156 100644
--- a/gcc/config/ia64/vect.md
+++ b/gcc/config/ia64/vect.md
@@ -1217,45 +1217,54 @@
   "fpmin %0 = %1, %2"
   [(set_attr "itanium_class" "fmisc")])
 
-(define_expand "reduc_splus_v2sf"
-  [(match_operand:V2SF 0 "fr_register_operand" "")
+(define_expand "reduc_plus_scal_v2sf"
+  [(match_operand:SF 0 "fr_register_operand" "")
    (match_operand:V2SF 1 "fr_register_operand" "")]
   ""
 {
   rtx tmp = gen_reg_rtx (V2SFmode);
+  rtx tmp2 = gen_reg_rtx (V2SFmode);
+
   if (TARGET_BIG_ENDIAN)
     emit_insn (gen_fswap (tmp, CONST0_RTX (V2SFmode), operands[1]));
   else
     emit_insn (gen_fswap (tmp, operands[1], CONST0_RTX (V2SFmode)));
-  emit_insn (gen_addv2sf3 (operands[0], operands[1], tmp));
+  emit_insn (gen_addv2sf3 (tmp2, operands[1], tmp));
+  emit_insn (gen_vec_extractv2sf (operands[0], tmp2, GEN_INT (0)));
   DONE;
 })
 
-(define_expand "reduc_smax_v2sf"
-  [(match_operand:V2SF 0 "fr_register_operand" "")
+(define_expand "reduc_smax_scal_v2sf"
+  [(match_operand:SF 0 "fr_register_operand" "")
    (match_operand:V2SF 1 "fr_register_operand" "")]
   ""
 {
   rtx tmp = gen_reg_rtx (V2SFmode);
+  rtx tmp2 = gen_reg_rtx (V2SFmode);
+
   if (TARGET_BIG_ENDIAN)
     emit_insn (gen_fswap (tmp, CONST0_RTX (V2SFmode), operands[1]));
   else
     emit_insn (gen_fswap (tmp, operands[1], CONST0_RTX (V2SFmode)));
-  emit_insn (gen_smaxv2sf3 (operands[0], operands[1], tmp));
+  emit_insn (gen_smaxv2sf3 (tmp2, operands[1], tmp));
+  emit_insn (gen_vec_extractv2sf (operands[0], tmp2, GEN_INT (0)));
   DONE;
 })
 
-(define_expand "reduc_smin_v2sf"
-  [(match_operand:V2SF 0 "fr_register_operand" "")
+(define_expand "reduc_smin_scal_v2sf"
+  [(match_operand:SF 0 "fr_register_operand" "")
    (match_operand:V2SF 1 "fr_register_operand" "")]
   ""
 {
   rtx tmp = gen_reg_rtx (V2SFmode);
+  rtx tmp2 = gen_reg_rtx (V2SFmode);
+
   if (TARGET_BIG_ENDIAN)
     emit_insn (gen_fswap (tmp, CONST0_RTX (V2SFmode), operands[1]));
   else
     emit_insn (gen_fswap (tmp, operands[1], CONST0_RTX (V2SFmode)));
-  emit_insn (gen_sminv2sf3 (operands[0], operands[1], tmp));
+  emit_insn (gen_sminv2sf3 (tmp2, operands[1], tmp));
+  emit_insn (gen_vec_extractv2sf (operands[0], tmp2, GEN_INT (0)));
   DONE;
 })
 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH v2 0-6/11] Fix PR/61114, make direct vector reductions endianness-neutral
  2014-10-24 11:57 [PATCH v2 0-6/11] Fix PR/61114, make direct vector reductions endianness-neutral Alan Lawrence
                   ` (5 preceding siblings ...)
  2014-10-24 12:12 ` [Protopatch 11/11][IA64] Migrate to reduc_(plus|min|max)_scal_v2df optab Alan Lawrence
@ 2014-10-24 15:19 ` Matthew Fortune
  2014-10-27 11:48   ` Richard Biener
  6 siblings, 1 reply; 27+ messages in thread
From: Matthew Fortune @ 2014-10-24 15:19 UTC (permalink / raw)
  To: Alan Lawrence, gcc-patches; +Cc: Richard Biener, David Edelsohn

Alan Lawrence <alan.lawrence@arm.com> writes:
> Patches 7-11 migrate ARM, x86, IA64 (I think), and mostly PowerPC,
> to
> the new reduc_(plus|[us](min|max))_scal_optab. I have not managed to work
> out
> how to do the same for MIPS (specifically what I need to add to
> mips_expand_vec_reduc), and have had no response from the maintainers, so
> am

Sorry, I was looking at this but failed to send an email saying so. The lack
of vec_extract appears to be the stumbling point here so at the very least
we need to add a naïve version of that I believe.
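
Something as simple as spilling the vector to the stack and reloading one
element would probably do for a first cut.  An untested sketch of a generic
expander body (the iterator names here are placeholders, and big-endian
element numbering would need checking):

  /* Naive vec_extract: store the whole vector, reload one element.  */
  rtx mem = assign_stack_temp (<MODE>mode, GET_MODE_SIZE (<MODE>mode));
  emit_move_insn (mem, operands[1]);
  mem = adjust_address (mem, <ELT>mode,
                        INTVAL (operands[2]) * GET_MODE_SIZE (<ELT>mode));
  emit_move_insn (operands[0], mem);
  DONE;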

> (2) also renaming reduc_..._scal_optab back to reduc_..._optab; would
> break the
> MIPS backend if something were not done with its existing patterns.

I suspect we can deal with this in time to make a rename OK.

One thing occurred to me about this change in general which is that on the
whole the reduction to a scalar seems good for an epilogue but is there
a problem if the result is then replicated across a vector for further
processing? I.e. a vector is reduced to a scalar, which moves the value
from a SIMD register to a GP register (because scalar modes are not
supported in SIMD registers generally) and then gets moved back to a
SIMD register to form part of a new vector? Would you expect the
redundant moves to get eliminated?

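For instance, a contrived sketch of the kind of code I have in mind:

  /* The reduction result s is immediately needed in vector form again,
     so a SIMD->GP->SIMD round trip would be pure overhead.  */
  void
  scale_by_sum (double *a, double *b, int n)
  {
    double s = 0.0;
    for (int i = 0; i < n; i++)
      s += a[i];              /* vectorized; epilogue reduces to s */
    for (int i = 0; i < n; i++)
      b[i] = a[i] * s;        /* s is splatted back into a vector */
  }
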
Thanks,
Matthew

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 10/11][RS6000] Migrate reduction optabs to reduc_..._scal
  2014-10-24 12:07 ` [PATCH 10/11][RS6000] " Alan Lawrence
  2014-10-24 12:14   ` Alan Lawrence
@ 2014-10-25  0:08   ` David Edelsohn
  2014-11-03 17:51     ` Bill Schmidt
  2014-11-10 22:39   ` Michael Meissner
  2 siblings, 1 reply; 27+ messages in thread
From: David Edelsohn @ 2014-10-25  0:08 UTC (permalink / raw)
  To: Alan Lawrence, William J. Schmidt; +Cc: gcc-patches, Segher Boessenkool

On Fri, Oct 24, 2014 at 8:06 AM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> This migrates the reduction patterns in altivec.md and vector.md to the new
> names. I've not touched paired.md as I wasn't really sure how to fix that
> (how do I vec_extractv2sf ?), moreover the testing I did didn't seem to
> exercise any of those patterns (iow: I'm not sure what would be an
> appropriate target machine?).
>
> I note the reduc_uplus_v16qi (which I've removed, as unsigned and signed
> addition should be equivalent) differed from reduc_splus_v16qi in using
> gen_altivec_vsum4ubs rather than gen_altivec_vsum4sbs.  Testcases
> gcc.dg/vect/{slp-24-big-array.c,slp-24.c,vect-reduc-1char-big-array.c,vect-reduc-1char.c}
> thus produce assembly which differs from previously (only) in that
> "vsum4ubs" becomes "vsum4sbs". These tests are still passing so I assume
> this is OK.
>
> The combining of signed and unsigned addition also improves
> gcc.dg/vect/{vect-outer-4i.c,vect-reduc-1short.c,vect-reduc-dot-u8b.c,vect-reduc-pattern-1c-big-array.c,vect-reduc-pattern-1c.c}
> : these are now reduced using direct vector reduction, rather than with
> shifts as previously (because there was only a reduc_splus rather than the
> reduc_uplus these tests looked for).
>
> ((Side note: the RTL changes to vector.md are to match the combine patterns
> in vsx.md; now that we now longer depend upon combine to generate those
> patterns (as the optab outputs them directly), one might wish to remove the
> smaller pattern from vsx.md, and/or simplify the RTL. I theorize that a
> reduction of a two-element vector is just adding the first element to the
> second, so maybe to something like
>
>   [(parallel [(set (match_operand:DF 0 "vfloat_operand" "")
>                    (VEC_reduc:V2DF
>                     (vec_select:DF
>                      (match_operand:V2DF 1 "vfloat_operand" "")
>                      (parallel [(const_int 1)]))
>                     (vec_select:DF
>                      (match_dup 1)
>                      (parallel [(const_int 0)]))))
>               (clobber (match_scratch:V2DF 2 ""))])]
>
> but I think it's best for me to leave that to the port maintainers.))
>
> Bootstrapped and check-gcc on powerpc64-none-linux-gnu
> (gcc110.fsffrance.org, with thanks to the GCC Compile Farm).
>
> gcc/ChangeLog:
>
>         * config/rs6000/altivec.md (reduc_splus_<mode>): Rename to...
>         (reduc_plus_scal_<mode>): ...this, and rs6000_expand_vector_extract.
>         (reduc_uplus_v16qi): Remove.
>
>         * config/rs6000/vector.md (VEC_reduc_name): change "splus" to "plus"
>         (reduc_<VEC_reduc_name>_v2df): Rename to...
>         (reduc_<VEC_reduc_name>_scal_v2df): ...this, wrap VEC_reduc in a
>         vec_select of element 1.
>         (reduc_<VEC_reduc_name>_v4sf): Rename to...
>         (reduc_<VEC_reduc_name>_scal_v4sf): ...this, wrap VEC_reduc in a
>         vec_select of element 3, add scratch register.
>

This needs some input from Bill, but he will be busy with a conference
this week.

- David

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 0-6/11] Fix PR/61114, make direct vector reductions endianness-neutral
  2014-10-24 15:19 ` [PATCH v2 0-6/11] Fix PR/61114, make direct vector reductions endianness-neutral Matthew Fortune
@ 2014-10-27 11:48   ` Richard Biener
  0 siblings, 0 replies; 27+ messages in thread
From: Richard Biener @ 2014-10-27 11:48 UTC (permalink / raw)
  To: Matthew Fortune
  Cc: Alan Lawrence, gcc-patches, Richard Biener, David Edelsohn

On Fri, Oct 24, 2014 at 5:17 PM, Matthew Fortune
<Matthew.Fortune@imgtec.com> wrote:
> Alan Lawrence <alan.lawrence@arm.com> writes:
>> Patches 7-11 migrate ARM, x86, IA64 (I think), and mostly PowerPC,
>> to
>> the new reduc_(plus|[us](min|max))_scal_optab. I have not managed to work
>> out
>> how to do the same for MIPS (specifically what I need to add to
>> mips_expand_vec_reduc), and have had no response from the maintainers, so
>> am
>
> Sorry, I was looking at this but failed to send an email saying so. The lack
> of vec_extract appears to be the stumbling point here so at the very least
> we need to add a naïve version of that I believe.
>
>> (2) also renaming reduc_..._scal_optab back to reduc_..._optab; would
>> break the
>> MIPS backend if something were not done with its existing patterns.
>
> I suspect we can deal with this in time to make a rename OK.
>
> One thing occurred to me about this change in general which is that on the
> whole the reduction to a scalar seems good for an epilogue but is there
> a problem if the result is then replicated across a vector for further
> processing? I.e. a vector is reduced to a scalar, which moves the value
> from a SIMD register to a GP register (because scalar modes are not
> supported in SIMD registers generally) and then gets moved back to a
> SIMD register to form part of a new vector? Would you expect the
> redundant moves to get eliminated?

Combine should be able to do this if you help it (it of course depends
on what your actual processor instruction doing the reduction does).

Richard.

> Thanks,
> Matthew

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 7/11][ARM] Migrate to new reduc_plus_scal_optab
  2014-10-24 11:58 ` [PATCH 7/11][ARM] Migrate to new reduc_plus_scal_optab Alan Lawrence
@ 2014-11-03 17:32   ` Ramana Radhakrishnan
  0 siblings, 0 replies; 27+ messages in thread
From: Ramana Radhakrishnan @ 2014-11-03 17:32 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: gcc-patches

On Fri, Oct 24, 2014 at 12:57 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> This migrates ARM from reduc_splus_optab and reduc_uplus optab to a single
> reduc_plus_optab.
>
> Tested, in combination with next patch:
> bootstrap on arm-none-linux-gnueabihf
> cross-tested check-gcc on arm-none-eabi.
>

Ok.

Ramana
> gcc/ChangeLog:
>
>         config/arm/neon.md (reduc_plus_*): Rename to...
>         (reduc_plus_scal_*): ...this; reduce to temp and extract scalar
> result.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 10/11][RS6000] Migrate reduction optabs to reduc_..._scal
  2014-10-25  0:08   ` David Edelsohn
@ 2014-11-03 17:51     ` Bill Schmidt
  2014-11-06 16:44       ` Alan Lawrence
  0 siblings, 1 reply; 27+ messages in thread
From: Bill Schmidt @ 2014-11-03 17:51 UTC (permalink / raw)
  To: David Edelsohn
  Cc: Alan Lawrence, gcc-patches, Segher Boessenkool, Michael Meissner

On Fri, 2014-10-24 at 19:49 -0400, David Edelsohn wrote:
> On Fri, Oct 24, 2014 at 8:06 AM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> > This migrates the reduction patterns in altivec.md and vector.md to the new
> > names. I've not touched paired.md as I wasn't really sure how to fix that
> > (how do I vec_extractv2sf ?), moreover the testing I did didn't seem to
> > exercise any of those patterns (iow: I'm not sure what would be an
> > appropriate target machine?).
> >
> > I note the reduc_uplus_v16qi (which I've removed, as unsigned and signed
> > addition should be equivalent) differed from reduc_splus_v16qi in using
> > gen_altivec_vsum4ubs rather than gen_altivec_vsum4sbs.  Testcases
> > gcc.dg/vect/{slp-24-big-array.c,slp-24.c,vect-reduc-1char-big-array.c,vect-reduc-1char.c}
> > thus produce assembly which differs from previously (only) in that
> > "vsum4ubs" becomes "vsum4sbs". These tests are still passing so I assume
> > this is OK.

Given that the only 32-bit quantity being added here is zero, the
difference in saturation points for vsum4ubs and vsum4sbs won't come
into play, so I agree this should be fine.

I would like to ask Mike Meissner to look over the changes to the
reduction patterns in vector.md.  He wrote those and is more familiar
with that piece than I am.  On the surface I don't see any problems, but
I could miss something subtle.

Otherwise I'm ok with the patch.

Thanks,
Bill

(p.s. Sorry for the delay on reviewing this.  As David noted, I was
traveling, and I ended up having no access to my mail for most of the
week due to an IT snafu.)


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 8/11][ARM] Migrate to new reduc_[us](min|max)_scal_optab
  2014-10-24 12:05 ` [PATCH 8/11][ARM] Migrate to new reduc_[us](min|max)_scal_optab Alan Lawrence
@ 2014-11-04 11:08   ` Ramana Radhakrishnan
  0 siblings, 0 replies; 27+ messages in thread
From: Ramana Radhakrishnan @ 2014-11-04 11:08 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: gcc-patches

On Fri, Oct 24, 2014 at 1:01 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> Similarly to last patch.
>
> Tested, in combination with previous patch:
> bootstrap on arm-none-linux-gnueabihf
> cross-tested check-gcc on arm-none-eabi.
>
> gcc/ChangeLog:
>
>         config/arm/neon.md (reduc_smin_<mode> *2): Rename to...
>         (reduc_smin_scal_<mode> *2): ...this; extract scalar result.
>         (reduc_smax_<mode> *2): Rename to...
>         (reduc_smax_scal_<mode> *2): ...this; extract scalar result.
>         (reduc_umin_<mode> *2): Rename to...
>         (reduc_umin_scal_<mode> *2): ...this; extract scalar result.
>         (reduc_umax_<mode> *2): Rename to...
>         (reduc_umax_scal_<mode> *2): ...this; extract scalar result.

Ok.

Ramana

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 10/11][RS6000] Migrate reduction optabs to reduc_..._scal
  2014-11-03 17:51     ` Bill Schmidt
@ 2014-11-06 16:44       ` Alan Lawrence
  2014-11-06 18:57         ` Bill Schmidt
  0 siblings, 1 reply; 27+ messages in thread
From: Alan Lawrence @ 2014-11-06 16:44 UTC (permalink / raw)
  To: Bill Schmidt
  Cc: David Edelsohn, gcc-patches, Segher Boessenkool, Michael Meissner

Hmmm. I am a little surprised by your mention of "saturation points" as I would 
not expect any variety of reduc_plus to be a saturating operation???

A.

Bill Schmidt wrote:
> On Fri, 2014-10-24 at 19:49 -0400, David Edelsohn wrote:
>> On Fri, Oct 24, 2014 at 8:06 AM, Alan Lawrence <alan.lawrence@arm.com> wrote:
>>> This migrates the reduction patterns in altivec.md and vector.md to the new
>>> names. I've not touched paired.md as I wasn't really sure how to fix that
>>> (how do I vec_extractv2sf ?), moreover the testing I did didn't seem to
>>> exercise any of those patterns (iow: I'm not sure what would be an
>>> appropriate target machine?).
>>>
>>> I note the reduc_uplus_v16qi (which I've removed, as unsigned and signed
>>> addition should be equivalent) differed from reduc_splus_v16qi in using
>>> gen_altivec_vsum4ubs rather than gen_altivec_vsum4sbs.  Testcases
>>> gcc.dg/vect/{slp-24-big-array.c,slp-24.c,vect-reduc-1char-big-array.c,vect-reduc-1char.c}
>>> thus produce assembly which differs from previously (only) in that
>>> "vsum4ubs" becomes "vsum4sbs". These tests are still passing so I assume
>>> this is OK.
> 
> Given that the only 32-bit quantity being added here is zero, the
> difference in saturation points for vsum4ubs and vsum4sbs won't come
> into play, so I agree this should be fine.
> 
> I would like to ask Mike Meissner to look over the changes to the
> reduction patterns in vector.md.  He wrote those and is more familiar
> with that piece than I am.  On the surface I don't see any problems, but
> I could miss something subtle.
> 
> Otherwise I'm ok with the patch.
> 
> Thanks,
> Bill
> 
> (p.s. Sorry for the delay on reviewing this.  As David noted, I was
> traveling, and I ended up having no access to my mail for most of the
> week due to an IT snafu.)


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 10/11][RS6000] Migrate reduction optabs to reduc_..._scal
  2014-11-06 16:44       ` Alan Lawrence
@ 2014-11-06 18:57         ` Bill Schmidt
  2014-11-07 10:09           ` Alan Lawrence
  0 siblings, 1 reply; 27+ messages in thread
From: Bill Schmidt @ 2014-11-06 18:57 UTC (permalink / raw)
  To: Alan Lawrence
  Cc: David Edelsohn, gcc-patches, Segher Boessenkool, Michael Meissner

On Thu, 2014-11-06 at 16:44 +0000, Alan Lawrence wrote:
> Hmmm. I am a little surprised by your mention of "saturation points" as I would 
> not expect any variety of reduc_plus to be a saturating operation???

I wouldn't either, but the underlying vsum4ubs and vsum4sbs instructions
used in these patterns do both a reduction and an add to another value.
If that other value is large enough this can trigger a saturation event.
However, the patterns use vzero for this other value, so it's not
possible to approach the saturation cutoff for either instruction since
the reductions are being done on byte values.  (Each word in the vector
result is the sum of the corresponding four byte values in the vector
source, added to the other value, which here is zero.)

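In rough C, the semantics are something like the sketch below; this is an
illustrative model only (the function and the element ordering are made up
for the example), not the actual instruction definition:

  #include <limits.h>

  /* Model of vsum4sbs: each 32-bit word of OUT is the saturating sum of
     four signed bytes of A plus the corresponding word of B.  With B the
     zero vector, four byte values per word stay far from the bounds.  */
  void
  vsum4sbs_model (int out[4], const signed char a[16], const int b[4])
  {
    for (int w = 0; w < 4; w++)
      {
        long long acc = b[w];
        for (int i = 0; i < 4; i++)
          acc += a[4 * w + i];
        out[w] = acc > INT_MAX ? INT_MAX
                 : acc < INT_MIN ? INT_MIN : (int) acc;
      }
  }
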
Thanks,
Bill



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 10/11][RS6000] Migrate reduction optabs to reduc_..._scal
  2014-11-06 18:57         ` Bill Schmidt
@ 2014-11-07 10:09           ` Alan Lawrence
  0 siblings, 0 replies; 27+ messages in thread
From: Alan Lawrence @ 2014-11-07 10:09 UTC (permalink / raw)
  To: Bill Schmidt; +Cc: gcc-patches

Ah I see now! Thank you for explaining that bit; I was a bit puzzled when I saw
it, but it makes sense now!

Cheers, Alan



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 10/11][RS6000] Migrate reduction optabs to reduc_..._scal
  2014-10-24 12:07 ` [PATCH 10/11][RS6000] " Alan Lawrence
  2014-10-24 12:14   ` Alan Lawrence
  2014-10-25  0:08   ` David Edelsohn
@ 2014-11-10 22:39   ` Michael Meissner
  2014-11-11  7:10     ` Segher Boessenkool
  2 siblings, 1 reply; 27+ messages in thread
From: Michael Meissner @ 2014-11-10 22:39 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: gcc-patches, David Edelsohn, Segher Boessenkool

On Fri, Oct 24, 2014 at 01:06:41PM +0100, Alan Lawrence wrote:
> This migrates the reduction patterns in altivec.md and vector.md to
> the new names. I've not touched paired.md as I wasn't really sure
> how to fix that (how do I vec_extractv2sf ?), moreover the testing I
> did didn't seem to exercise any of those patterns (iow: I'm not sure
> what would be an appropriate target machine?).
> 
> I note the reduc_uplus_v16qi (which I've removed, as unsigned and
> signed addition should be equivalent) differed from
> reduc_splus_v16qi in using gen_altivec_vsum4ubs rather than
> gen_altivec_vsum4sbs.  Testcases gcc.dg/vect/{slp-24-big-array.c,slp-24.c,vect-reduc-1char-big-array.c,vect-reduc-1char.c}
> thus produce assembly which differs from previously (only) in that
> "vsum4ubs" becomes "vsum4sbs". These tests are still passing so I
> assume this is OK.
> 
> The combining of signed and unsigned addition also improves gcc.dg/vect/{vect-outer-4i.c,vect-reduc-1short.c,vect-reduc-dot-u8b.c,vect-reduc-pattern-1c-big-array.c,vect-reduc-pattern-1c.c}
> : these are now reduced using direct vector reduction, rather than
> with shifts as previously (because there was only a reduc_splus
> rather than the reduc_uplus these tests looked for).

I checked the integer vector add reductions, and it seems to generate the same
value with old/new code, and I like eliminating the vector shift.

> ((Side note: the RTL changes to vector.md are to match the combine
> patterns in vsx.md; now that we no longer depend upon combine to
> generate those patterns (as the optab outputs them directly), one
> might wish to remove the smaller pattern from vsx.md, and/or
> simplify the RTL. I theorize that a reduction of a two-element
> vector is just adding the first element to the second, so maybe to
> something like
> 
>   [(parallel [(set (match_operand:DF 0 "vfloat_operand" "")
> 		   (VEC_reduc:V2DF
> 		    (vec_select:DF
> 		     (match_operand:V2DF 1 "vfloat_operand" "")
> 		     (parallel [(const_int 1)]))
> 		    (vec_select:DF
> 		     (match_dup 1)
> 		     (parallel [(const_int 0)]))))
> 	      (clobber (match_scratch:V2DF 2 ""))])]
> 
> but I think it's best for me to leave that to the port maintainers.))
> 
> Bootstrapped and check-gcc on powerpc64-none-linux-gnu
> (gcc110.fsffrance.org, with thanks to the GCC Compile Farm).

However, the double pattern is completely broken.  This cannot go in.

Consider this source:

#include <stdio.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#ifndef TYPE
#define TYPE double
#endif

#ifndef OTYPE
#define OTYPE TYPE
#endif

#ifndef SIZE
#define SIZE 1024
#endif

#ifndef ALIGN
#define ALIGN 32
#endif

TYPE a[SIZE] __attribute__((__aligned__(ALIGN)));

OTYPE sum (void) __attribute__((__noinline__));

OTYPE
sum (void)
{
  size_t i;
  OTYPE s = (OTYPE) 0;

  for (i = 0; i < SIZE; i++)
    s += a[i];

  return s;
}

If I compile with today's trunk and -mcpu=power8 -ffast-math -O3, I get code
that I expect (though it could use xxpermdi instead of xxsldwi):

sum:
	.quad	.L.sum,.TOC.@tocbase,0
	.previous
	.type	sum, @function
.L.sum:
	li 10,512
	addis 9,2,.LC1@toc@ha		# gpr load fusion, type long
	ld 9,.LC1@toc@l(9)
	xxlxor 0,0,0
	mtctr 10
	.p2align 4,,15
.L2:
	lxvd2x 12,0,9
	addi 9,9,16
	xvadddp 0,0,12
	bdnz .L2
	xxsldwi 12,0,0,2
	xvadddp 1,12,0
	xxpermdi 1,1,1,2
	blr
	.long 0

However, the code produced by the patches gives:

sum:
	.quad	.L.sum,.TOC.@tocbase,0
	.previous
	.type	sum, @function
.L.sum:
	xxlxor 0,0,0
	addi 10,1,-16
	li 8,512
	addis 9,2,.LC1@toc@ha		# gpr load fusion, type long
	ld 9,.LC1@toc@l(9)
	mtctr 8
	stxvd2x 0,0,10
	.p2align 5,,31
.L2:
	addi 10,1,-16
	lxvd2x 0,0,9
	addi 9,9,16
	lxvd2x 12,0,10
	xvadddp 12,12,0
	stxvd2x 12,0,10
	bdnz .L2
	lfd 0,-16(1)
	xxpermdi 1,12,12,2
	fadd 1,0,1
	blr
	.long 0

It is unacceptable to have the inner loop do a load, vector add, and
store on every iteration.

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meissner@linux.vnet.ibm.com, phone: +1 (978) 899-4797


* Re: [PATCH 10/11][RS6000] Migrate reduction optabs to reduc_..._scal
  2014-11-10 22:39   ` Michael Meissner
@ 2014-11-11  7:10     ` Segher Boessenkool
  2014-11-12  1:54       ` Michael Meissner
  2014-11-12 12:32       ` Alan Lawrence
  0 siblings, 2 replies; 27+ messages in thread
From: Segher Boessenkool @ 2014-11-11  7:10 UTC (permalink / raw)
  To: Michael Meissner, Alan Lawrence, gcc-patches, David Edelsohn

On Mon, Nov 10, 2014 at 05:36:24PM -0500, Michael Meissner wrote:
> However, the double pattern is completely broken.  This cannot go in.

[snip]

> It is unacceptable to have the inner loop do a load, vector add, and
> store on every iteration.

Before the patch, the final reduction used *vsx_reduc_splus_v2df; after
the patch, it is *vsx_reduc_plus_v2df_scalar.  The former does a vector
add, the latter a float add.  And it uses the same pseudoregister for the
accumulator throughout.  IRA decides a register is more expensive than
memory for this, I suppose because it wants both V2DF and DF?  It doesn't
seem to like the subreg very much.
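
The subreg in question is a DF lowpart of the V2DF accumulator;
schematically (my sketch of the above, with 'accum' a stand-in pseudo,
not RTL taken from any dump in this thread):

/* Read the DF result as the low half of the V2DF accumulator.  This is
   the kind of V2DF->DF mode change the register allocator rejects.  */
rtx scalar = simplify_gen_subreg (DFmode, accum, V2DFmode, 0);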

The new code does look nicer otherwise :-)


Segher


* Re: [PATCH 10/11][RS6000] Migrate reduction optabs to reduc_..._scal
  2014-11-11  7:10     ` Segher Boessenkool
@ 2014-11-12  1:54       ` Michael Meissner
  2014-11-12  9:26         ` Segher Boessenkool
  2014-11-12 12:32       ` Alan Lawrence
  1 sibling, 1 reply; 27+ messages in thread
From: Michael Meissner @ 2014-11-12  1:54 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Michael Meissner, Alan Lawrence, gcc-patches, David Edelsohn

On Tue, Nov 11, 2014 at 01:10:01AM -0600, Segher Boessenkool wrote:
> On Mon, Nov 10, 2014 at 05:36:24PM -0500, Michael Meissner wrote:
> > However, the double pattern is completely broken.  This cannot go in.
> 
> [snip]
> 
> > It is unacceptable to have the inner loop do a load, vector add, and
> > store on every iteration.
> 
> Before the patch, the final reduction used *vsx_reduc_splus_v2df; after
> the patch, it is *vsx_reduc_plus_v2df_scalar.  The former does a vector
> add, the latter a float add.  And it uses the same pseudoregister for the
> accumulator throughout.  IRA decides a register is more expensive than
> memory for this, I suppose because it wants both V2DF and DF?  It doesn't
> seem to like the subreg very much.

I haven't looked into it in detail (I've been a little busy with the upper regs
patch), but I suspect the problem is that 128-bit and 64-bit types cannot
overlap (i.e. rs6000_cannot_change_mode_class returns true).  This is due to
the fact that scalars in VSX registers occupy the upper 64 bits, which does
not match the compiler's notion that they should be in the bottom 64 bits.
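
For illustration, the check being described amounts to something like the
following (a simplified sketch of the hook's logic as explained above, not
the actual rs6000_cannot_change_mode_class implementation, which handles
more cases):

/* Sketch: forbid reinterpreting a 128-bit vector mode as a 64-bit
   scalar mode (and vice versa) in the VSX registers, because the
   scalar lives in the upper 64 bits of the register, not where
   subreg semantics would place it.  */
static bool
sketch_cannot_change_mode_class (enum machine_mode from,
				 enum machine_mode to,
				 enum reg_class rclass)
{
  if (reg_classes_intersect_p (rclass, VSX_REGS)
      && GET_MODE_SIZE (from) != GET_MODE_SIZE (to))
    return true;	/* e.g. V2DF <-> DF */
  return false;
}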

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meissner@linux.vnet.ibm.com, phone: +1 (978) 899-4797


* Re: [PATCH 10/11][RS6000] Migrate reduction optabs to reduc_..._scal
  2014-11-12  1:54       ` Michael Meissner
@ 2014-11-12  9:26         ` Segher Boessenkool
  2014-11-12 19:20           ` Michael Meissner
  0 siblings, 1 reply; 27+ messages in thread
From: Segher Boessenkool @ 2014-11-12  9:26 UTC (permalink / raw)
  To: Michael Meissner, Alan Lawrence, gcc-patches, David Edelsohn

On Tue, Nov 11, 2014 at 08:27:22PM -0500, Michael Meissner wrote:
> > Before the patch, the final reduction used *vsx_reduc_splus_v2df; after
> > the patch, it is *vsx_reduc_plus_v2df_scalar.  The former does a vector
> > add, the latter a float add.  And it uses the same pseudoregister for the
> > accumulator throughout.  IRA decides a register is more expensive than
> > memory for this, I suppose because it wants both V2DF and DF?  It doesn't
> > seem to like the subreg very much.
> 
> > I haven't looked into it in detail (I've been a little busy with the upper regs
> > patch), but I suspect the problem is that 128-bit and 64-bit types cannot
> > overlap (i.e. rs6000_cannot_change_mode_class returns true).  This is due to
> > the fact that scalars in VSX registers occupy the upper 64 bits, which does
> > not match the compiler's notion that they should be in the bottom 64 bits.

You suspect correctly.  Hacking around that in cannot_change_mode_class
doesn't help; subreg_get_info disallows it next.

Changing the pattern so it does two extracts instead of an extract and
a subreg works (you get an fmr for the high part though; register
allocation doesn't know dest=src is free here).
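
At the expander level, the two-extract variant being described would look
roughly like this (a sketch for the plus case only; it assumes the
vsx_extract_v2df and adddf3 gen functions seen elsewhere in this thread,
and is not the actual fix):

/* Reduce a V2DF by extracting both lanes and adding them as scalars,
   avoiding any DF subreg of a V2DF value altogether.  */
rtx hi = gen_reg_rtx (DFmode);
rtx lo = gen_reg_rtx (DFmode);
emit_insn (gen_vsx_extract_v2df (hi, operands[1], const0_rtx));
emit_insn (gen_vsx_extract_v2df (lo, operands[1], const1_rtx));
emit_insn (gen_adddf3 (operands[0], hi, lo));
DONE;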

_Should_ the subreg thing work?  Or should the patterns be fixed?


Segher


* Re: [PATCH 10/11][RS6000] Migrate reduction optabs to reduc_..._scal
  2014-11-11  7:10     ` Segher Boessenkool
  2014-11-12  1:54       ` Michael Meissner
@ 2014-11-12 12:32       ` Alan Lawrence
  2014-11-12 18:53         ` Alan Lawrence
  1 sibling, 1 reply; 27+ messages in thread
From: Alan Lawrence @ 2014-11-12 12:32 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: Michael Meissner, gcc-patches, David Edelsohn

[-- Attachment #1: Type: text/plain, Size: 4197 bytes --]

So I'm no expert on RS6000 here, but following on from Segher's observation
about the change in pattern: the difference in 'expand' is exactly that a
vsx_reduc_splus_v2df followed by a vec_extract to DF becomes a
vsx_reduc_splus_v2df_scalar - the insn I expected the combiner to produce by
combining the two previous insns.

However, inspecting the logs from -fdump-rtl-combine-all, *without* my patch, 
when the combiner tries to put those two together, I see:

Trying 30 -> 31:
Failed to match this instruction:
(set (reg:DF 179 [ stmp_s_5.7D.2196 ])
     (vec_select:DF (plus:V2DF (vec_select:V2DF (reg:V2DF 173 [ vect_s_5.6D.2195 ])
                 (parallel [
                         (const_int 1 [0x1])
                         (const_int 0 [0])
                     ]))
             (reg:V2DF 173 [ vect_s_5.6D.2195 ]))
         (parallel [
                 (const_int 1 [0x1])
             ])))

That is, it looks like combine_simplify_rtx has transformed the (vec_concat 
(vec_select ... 1) (vec_select ... 0)) from the vsx_reduc_plus_v2df insn, into a 
single vec_select, which does not match the vsx_reduc_plus_v2df_scalar insn.

So despite the comment (in vsx.md):

;; Combiner patterns with the vector reduction patterns that knows we can get
;; to the top element of the V2DF array without doing an extract.

It looks like the code generation prior to my patch, which was considered
better, came about because the combiner didn't actually use the pattern?

In that case whilst you may want to dig into register allocation, 
cannot_change_mode_class, etc., for other reasons, I think the best fix for 
migrating to reduc_plus_scal... is simply to avoid using the "Combiner" patterns 
and just emit two insns, the old pattern followed by a vec_extract. The attached 
snippet does this (I won't call it a patch yet, and it applies on top of the 
previous patch - I went the route of calling the two gen functions rather than 
copying their RTL sequences, but could do the latter if that were 
preferable???), and restores code generation to the original form on your 
example above; it bootstraps OK but I'm still running check-gcc on the Compile 
Farm...

However, again on your example above, I note that if I *remove* the 
reduc_plus_scal_v2df pattern altogether, I get:

.sum:
         li 10,512        # 52   *movdi_internal64/4     [length = 4]
         ld 9,.LC2@toc(2)         # 20   *movdi_internal64/2     [length = 4]
         xxlxor 0,0,0     # 17   *vsx_movv2df/12 [length = 4]
         mtctr 10         # 48   *movdi_internal64/11    [length = 4]
         .align 4
.L2:
         lxvd2x 12,0,9    # 23   *vsx_movv2df/2  [length = 4]
         addi 9,9,16      # 25   *adddi3_internal1/2     [length = 4]
         xvadddp 0,0,12   # 24   *vsx_addv2df3/1 [length = 4]
         bdnz .L2         # 47   *ctrdi_internal1/1      [length = 4]
         xxsldwi 12,0,0,2         # 30   vsx_xxsldwi_v2df        [length = 4]
         xvadddp 1,0,12   # 31   *vsx_addv2df3/1 [length = 4]
         nop      # 37   *vsx_extract_v2df_internal2/1   [length = 4]
         blr      # 55   return  [length = 4]

this is presumably using gcc's scalar reduction code, but (to my untrained eye 
on powerpc!) it looks even better than the first form above (the same in the 
loop, and in the reduction, an xxpermdi is replaced by a nop !)...

--Alan


Segher Boessenkool wrote:
> On Mon, Nov 10, 2014 at 05:36:24PM -0500, Michael Meissner wrote:
>> However, the double pattern is completely broken.  This cannot go in.
> 
> [snip]
> 
>> It is unacceptable to have the inner loop do a load, vector add, and
>> store on every iteration.
> 
> Before the patch, the final reduction used *vsx_reduc_splus_v2df; after
> the patch, it is *vsx_reduc_plus_v2df_scalar.  The former does a vector
> add, the latter a float add.  And it uses the same pseudoregister for the
> accumulator throughout.  IRA decides a register is more expensive than
> memory for this, I suppose because it wants both V2DF and DF?  It doesn't
> seem to like the subreg very much.
> 
> The new code does look nicer otherwise :-)
> 
> 
> Segher
> 

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: emit_sep.patch --]
[-- Type: text/x-patch; name=emit_sep.patch, Size: 1613 bytes --]

diff --git a/gcc/config/rs6000/vector.md b/gcc/config/rs6000/vector.md
index 54b18aa..18a7f08 100644
--- a/gcc/config/rs6000/vector.md
+++ b/gcc/config/rs6000/vector.md
@@ -1078,21 +1078,15 @@
 ;; Vector reduction expanders for VSX
 
 (define_expand "reduc_<VEC_reduc_name>_scal_v2df"
-  [(parallel [(set (match_operand:DF 0 "vfloat_operand" "")
-		   (vec_select:DF
-		    (VEC_reduc:V2DF
-		     (vec_concat:V2DF
-		      (vec_select:DF
-		       (match_operand:V2DF 1 "vfloat_operand" "")
-		       (parallel [(const_int 1)]))
-		      (vec_select:DF
-		       (match_dup 1)
-		       (parallel [(const_int 0)])))
-		     (match_dup 1))
-		    (parallel [(const_int 1)])))
-	      (clobber (match_scratch:DF 2 ""))])]
+  [(match_operand:DF 0 "register_operand" "")
+   (VEC_reduc:V2DF (match_operand:V2DF 1 "vfloat_operand" "") (const_int 0))]
   "VECTOR_UNIT_VSX_P (V2DFmode)"
-  "")
+  {
+    rtx vec = gen_reg_rtx (V2DFmode);
+    emit_insn (gen_vsx_reduc_<VEC_reduc_name>_v2df (vec, operand1));
+    emit_insn (gen_vsx_extract_v2df (operand0, vec, const1_rtx));
+    DONE;
+  })
 
 ; The (VEC_reduc:V4SF
 ;	(op1)
diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
index 7aa0f12..8df6a45 100644
--- a/gcc/config/rs6000/vsx.md
+++ b/gcc/config/rs6000/vsx.md
@@ -2150,7 +2150,7 @@
 \f
 ;; Vector reduction insns and splitters
 
-(define_insn_and_split "*vsx_reduc_<VEC_reduc_name>_v2df"
+(define_insn_and_split "vsx_reduc_<VEC_reduc_name>_v2df"
   [(set (match_operand:V2DF 0 "vfloat_operand" "=&wd,&?wa,wd,?wa")
 	(VEC_reduc:V2DF
 	 (vec_concat:V2DF


* Re: [PATCH 10/11][RS6000] Migrate reduction optabs to reduc_..._scal
  2014-11-12 12:32       ` Alan Lawrence
@ 2014-11-12 18:53         ` Alan Lawrence
  2014-12-11 15:59           ` Ping: " Alan Lawrence
  0 siblings, 1 reply; 27+ messages in thread
From: Alan Lawrence @ 2014-11-12 18:53 UTC (permalink / raw)
  Cc: Segher Boessenkool, Michael Meissner, gcc-patches, David Edelsohn

Have run check-gcc on gcc110.fsffrance.org (powerpc64-unknown-linux-gnu) using
this snippet on top of the original patch; no regressions.

Alan Lawrence wrote:
> So I'm no expert on RS6000 here, but following on from Segher's observation
> about the change in pattern: the difference in 'expand' is exactly that a
> vsx_reduc_splus_v2df followed by a vec_extract to DF becomes a
> vsx_reduc_splus_v2df_scalar - the insn I expected the combiner to produce by
> combining the two previous insns.
> 
> However, inspecting the logs from -fdump-rtl-combine-all, *without* my patch, 
> when the combiner tries to put those two together, I see:
> 
> Trying 30 -> 31:
> Failed to match this instruction:
> (set (reg:DF 179 [ stmp_s_5.7D.2196 ])
>      (vec_select:DF (plus:V2DF (vec_select:V2DF (reg:V2DF 173 [ vect_s_5.6D.2195 ])
>                  (parallel [
>                          (const_int 1 [0x1])
>                          (const_int 0 [0])
>                      ]))
>              (reg:V2DF 173 [ vect_s_5.6D.2195 ]))
>          (parallel [
>                  (const_int 1 [0x1])
>              ])))
> 
> That is, it looks like combine_simplify_rtx has transformed the (vec_concat 
> (vec_select ... 1) (vec_select ... 0)) from the vsx_reduc_plus_v2df insn, into a 
> single vec_select, which does not match the vsx_reduc_plus_v2df_scalar insn.
> 
> So despite the comment (in vsx.md):
> 
> ;; Combiner patterns with the vector reduction patterns that knows we can get
> ;; to the top element of the V2DF array without doing an extract.
> 
> It looks like the code generation prior to my patch, which was considered
> better, came about because the combiner didn't actually use the pattern?
> 
> In that case whilst you may want to dig into register allocation, 
> cannot_change_mode_class, etc., for other reasons, I think the best fix for 
> migrating to reduc_plus_scal... is simply to avoid using the "Combiner" patterns 
> and just emit two insns, the old pattern followed by a vec_extract. The attached 
> snippet does this (I won't call it a patch yet, and it applies on top of the 
> previous patch - I went the route of calling the two gen functions rather than 
> copying their RTL sequences, but could do the latter if that were 
> preferable???), and restores code generation to the original form on your 
> example above; it bootstraps OK but I'm still running check-gcc on the Compile 
> Farm...
> 
> However, again on your example above, I note that if I *remove* the 
> reduc_plus_scal_v2df pattern altogether, I get:
> 
> .sum:
>          li 10,512        # 52   *movdi_internal64/4     [length = 4]
>          ld 9,.LC2@toc(2)         # 20   *movdi_internal64/2     [length = 4]
>          xxlxor 0,0,0     # 17   *vsx_movv2df/12 [length = 4]
>          mtctr 10         # 48   *movdi_internal64/11    [length = 4]
>          .align 4
> .L2:
>          lxvd2x 12,0,9    # 23   *vsx_movv2df/2  [length = 4]
>          addi 9,9,16      # 25   *adddi3_internal1/2     [length = 4]
>          xvadddp 0,0,12   # 24   *vsx_addv2df3/1 [length = 4]
>          bdnz .L2         # 47   *ctrdi_internal1/1      [length = 4]
>          xxsldwi 12,0,0,2         # 30   vsx_xxsldwi_v2df        [length = 4]
>          xvadddp 1,0,12   # 31   *vsx_addv2df3/1 [length = 4]
>          nop      # 37   *vsx_extract_v2df_internal2/1   [length = 4]
>          blr      # 55   return  [length = 4]
> 
> this is presumably using gcc's scalar reduction code, but (to my untrained eye 
> on powerpc!) it looks even better than the first form above (the same in the 
> loop, and in the reduction, an xxpermdi is replaced by a nop !)...
> 
> --Alan
> 
> 
> Segher Boessenkool wrote:
>> On Mon, Nov 10, 2014 at 05:36:24PM -0500, Michael Meissner wrote:
>>> However, the double pattern is completely broken.  This cannot go in.
>> [snip]
>>
>>> It is unacceptable to have the inner loop do a load, vector add, and
>>> store on every iteration.
>> Before the patch, the final reduction used *vsx_reduc_splus_v2df; after
>> the patch, it is *vsx_reduc_plus_v2df_scalar.  The former does a vector
>> add, the latter a float add.  And it uses the same pseudoregister for the
>> accumulator throughout.  IRA decides a register is more expensive than
>> memory for this, I suppose because it wants both V2DF and DF?  It doesn't
>> seem to like the subreg very much.
>>
>> The new code does look nicer otherwise :-)
>>
>>
>> Segher



* Re: [PATCH 10/11][RS6000] Migrate reduction optabs to reduc_..._scal
  2014-11-12  9:26         ` Segher Boessenkool
@ 2014-11-12 19:20           ` Michael Meissner
  0 siblings, 0 replies; 27+ messages in thread
From: Michael Meissner @ 2014-11-12 19:20 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Michael Meissner, Alan Lawrence, gcc-patches, David Edelsohn

On Wed, Nov 12, 2014 at 03:26:35AM -0600, Segher Boessenkool wrote:
> On Tue, Nov 11, 2014 at 08:27:22PM -0500, Michael Meissner wrote:
> > > Before the patch, the final reduction used *vsx_reduc_splus_v2df; after
> > > the patch, it is *vsx_reduc_plus_v2df_scalar.  The former does a vector
> > > add, the latter a float add.  And it uses the same pseudoregister for the
> > > accumulator throughout.  IRA decides a register is more expensive than
> > > memory for this, I suppose because it wants both V2DF and DF?  It doesn't
> > > seem to like the subreg very much.
> > 
> > I haven't looked into it in detail (I've been a little busy with the upper regs
> > patch), but I suspect the problem is that 128-bit and 64-bit types cannot
> > overlap (i.e. rs6000_cannot_change_mode_class returns true).  This is due to
> > the fact that scalars in VSX registers occupy the upper 64 bits, which does
> > not match the compiler's notion that they should be in the bottom 64 bits.
> 
> You suspect correctly.  Hacking around that in cannot_change_mode_class
> doesn't help; subreg_get_info disallows it next.
> 
> Changing the pattern so it does two extracts instead of an extract and
> a subreg works (you get an fmr for the high part though; register
> allocation doesn't know dest=src is free here).
> 
> _Should_ the subreg thing work?  Or should the patterns be fixed?

As I said, we cannot allow CANNOT_CHANGE_MODE_CLASS to return false for this
case, because the hardware just does not agree with what GCC believes is the
natural placement for smaller values inside of larger register fields.  I
suspect even if you add new target support macros to fix it, it will be a game
of whack-a-mole to find all of the places where there are hidden assumptions in
the compiler about subreg ordering.

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meissner@linux.vnet.ibm.com, phone: +1 (978) 899-4797


* Ping: Re: [PATCH 10/11][RS6000] Migrate reduction optabs to reduc_..._scal
  2014-11-12 18:53         ` Alan Lawrence
@ 2014-12-11 15:59           ` Alan Lawrence
  2014-12-11 18:37             ` Alan Lawrence
  0 siblings, 1 reply; 27+ messages in thread
From: Alan Lawrence @ 2014-12-11 15:59 UTC (permalink / raw)
  To: Michael Meissner, Segher Boessenkool; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 6041 bytes --]

So I'm afraid I'm not going to get involved in a discussion about 
CANNOT_CHANGE_MODE_CLASS on RS6000, and what you might want to do there - sorry, 
but I don't think I can really contribute anything there. However, I *am* trying 
to migrate all platforms off the old reduc_xxx optabs to the new version 
producing a scalar.

Hence, can I ping the attached patch (which is just a simple combination of the 
previously-posted patch + snippet)? No regressions on gcc112.fsffrance.org.

This works in exactly the same way as the old code path, with a second insn to 
pull the scalar result out of the reduction, just as the expander would have 
done (or the bitfieldref before that), and avoiding the v2df combine pattern 
(again, as previously).

gcc/ChangeLog:

     * config/rs6000/altivec.md (reduc_splus_<mode>): Rename to...
     (reduc_plus_scal_<mode>): ...this, add rs6000_expand_vector_extract.
     (reduc_uplus_v16qi): Remove.

     * config/rs6000/vector.md (VEC_reduc_name): Change "splus" to "plus".
     (reduc_<VEC_reduc_name>_v2df): Remove.
     (reduc_<VEC_reduc_name>_scal_v2df): New.
     (reduc_<VEC_reduc_name>_v4sf): Rename to...
     (reduc_<VEC_reduc_name>_scal_v4sf): ...this, wrap VEC_reduc in a
     vec_select of element 3, add scratch register.




> Have run check-gcc on gcc110.fsffrance.org (powerpc64-unknown-linux-gnu) using this snippet on top of the original patch; no regressions.
> 
> 
> Alan Lawrence wrote:
> 
>     So I'm no expert on RS6000 here, but following on from Segher's observation about the change in pattern: the difference in 'expand' is exactly that a vsx_reduc_splus_v2df followed by a vec_extract to DF becomes a vsx_reduc_splus_v2df_scalar - the insn I expected the combiner to produce by combining the two previous insns.
> 
> 
>     However, inspecting the logs from -fdump-rtl-combine-all, *without* my patch, when the combiner tries to put those two together, I see:
> 
> 
>     Trying 30 -> 31:
>     Failed to match this instruction:
>     (set (reg:DF 179 [ stmp_s_5.7D.2196 ])
>          (vec_select:DF (plus:V2DF (vec_select:V2DF (reg:V2DF 173 [ vect_s_5.6D.2195 ])
>                      (parallel [
>                              (const_int 1 [0x1])
>                              (const_int 0 [0])
>                          ]))
>                  (reg:V2DF 173 [ vect_s_5.6D.2195 ]))
>              (parallel [
>                      (const_int 1 [0x1])
>                  ])))
> 
>     That is, it looks like combine_simplify_rtx has transformed the (vec_concat (vec_select ... 1) (vec_select ... 0)) from the vsx_reduc_plus_v2df insn, into a single vec_select, which does not match the vsx_reduc_plus_v2df_scalar insn.
> 
> 
>     So despite the comment (in vsx.md):
> 
>     ;; Combiner patterns with the vector reduction patterns that knows we can get
>     ;; to the top element of the V2DF array without doing an extract.
> 
>     It looks like the code generation prior to my patch, which was considered better, came about because the combiner didn't actually use the pattern?
> 
> 
>     In that case whilst you may want to dig into register allocation, cannot_change_mode_class, etc., for other reasons, I think the best fix for migrating to reduc_plus_scal... is simply to avoid using the "Combiner" patterns and just emit two insns, the old pattern followed by a vec_extract. The attached snippet does this (I won't call it a patch yet, and it applies on top of the previous patch - I went the route of calling the two gen functions rather than copying their RTL sequences, but could do the latter if that were preferable???), and restores code generation to the original form on your example above; it bootstraps OK but I'm still running check-gcc on the Compile Farm...
> 
> 
>     However, again on your example above, I note that if I *remove* the reduc_plus_scal_v2df pattern altogether, I get:
> 
> 
>     .sum:
>              li 10,512        # 52   *movdi_internal64/4     [length = 4]
>              ld 9,.LC2@toc(2)         # 20   *movdi_internal64/2     [length = 4]
>              xxlxor 0,0,0     # 17   *vsx_movv2df/12 [length = 4]
>              mtctr 10         # 48   *movdi_internal64/11    [length = 4]
>              .align 4
>     .L2:
>              lxvd2x 12,0,9    # 23   *vsx_movv2df/2  [length = 4]
>              addi 9,9,16      # 25   *adddi3_internal1/2     [length = 4]
>              xvadddp 0,0,12   # 24   *vsx_addv2df3/1 [length = 4]
>              bdnz .L2         # 47   *ctrdi_internal1/1      [length = 4]
>              xxsldwi 12,0,0,2         # 30   vsx_xxsldwi_v2df        [length = 4]
>              xvadddp 1,0,12   # 31   *vsx_addv2df3/1 [length = 4]
>              nop      # 37   *vsx_extract_v2df_internal2/1   [length = 4]
>              blr      # 55   return  [length = 4]
> 
>     this is presumably using gcc's scalar reduction code, but (to my untrained eye on powerpc!) it looks even better than the first form above (the same in the loop, and in the reduction, an xxpermdi is replaced by a nop !)...
> 
> 
>     --Alan
> 
> 
>     Segher Boessenkool wrote:
> 
>         On Mon, Nov 10, 2014 at 05:36:24PM -0500, Michael Meissner wrote:
> 
>             However, the double pattern is completely broken.  This cannot go in.
> 
>         [snip]
> 
>             It is unacceptable to have the inner loop do a load, vector add, and
>             store on every iteration.
> 
>         Before the patch, the final reduction used *vsx_reduc_splus_v2df; after
>         the patch, it is *vsx_reduc_plus_v2df_scalar.  The former does a vector
>         add, the latter a float add.  And it uses the same pseudoregister for the
>         accumulator throughout.  IRA decides a register is more expensive than
>         memory for this, I suppose because it wants both V2DF and DF?  It doesn't
>         seem to like the subreg very much.
> 
>         The new code does look nicer otherwise :-)
> 
> 
>         Segher

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: vec_rs6000_combined.patch --]
[-- Type: text/x-patch; name=vec_rs6000_combined.patch, Size: 4721 bytes --]

diff --git a/gcc/config/rs6000/altivec.md b/gcc/config/rs6000/altivec.md
index d46ef191409211a0a9d213b57fb4e657cd5a3cb4..b79f7aa477ec700743ae7c5734672571e395b79d 100644
--- a/gcc/config/rs6000/altivec.md
+++ b/gcc/config/rs6000/altivec.md
@@ -2596,35 +2596,22 @@
   operands[3] = gen_reg_rtx (GET_MODE (operands[0]));
 })
 
-(define_expand "reduc_splus_<mode>"
-  [(set (match_operand:VIshort 0 "register_operand" "=v")
+(define_expand "reduc_plus_scal_<mode>"
+  [(set (match_operand:<VI_scalar> 0 "register_operand" "=v")
         (unspec:VIshort [(match_operand:VIshort 1 "register_operand" "v")]
 			UNSPEC_REDUC_PLUS))]
   "TARGET_ALTIVEC"
 {
   rtx vzero = gen_reg_rtx (V4SImode);
   rtx vtmp1 = gen_reg_rtx (V4SImode);
-  rtx dest = gen_lowpart (V4SImode, operands[0]);
+  rtx vtmp2 = gen_reg_rtx (<MODE>mode);
+  rtx dest = gen_lowpart (V4SImode, vtmp2);
+  HOST_WIDE_INT last_elem = GET_MODE_NUNITS (<MODE>mode) - 1;
 
   emit_insn (gen_altivec_vspltisw (vzero, const0_rtx));
   emit_insn (gen_altivec_vsum4s<VI_char>s (vtmp1, operands[1], vzero));
   emit_insn (gen_altivec_vsumsws_direct (dest, vtmp1, vzero));
-  DONE;
-})
-
-(define_expand "reduc_uplus_v16qi"
-  [(set (match_operand:V16QI 0 "register_operand" "=v")
-        (unspec:V16QI [(match_operand:V16QI 1 "register_operand" "v")]
-		      UNSPEC_REDUC_PLUS))]
-  "TARGET_ALTIVEC"
-{
-  rtx vzero = gen_reg_rtx (V4SImode);
-  rtx vtmp1 = gen_reg_rtx (V4SImode);
-  rtx dest = gen_lowpart (V4SImode, operands[0]);
-
-  emit_insn (gen_altivec_vspltisw (vzero, const0_rtx));
-  emit_insn (gen_altivec_vsum4ubs (vtmp1, operands[1], vzero));
-  emit_insn (gen_altivec_vsumsws_direct (dest, vtmp1, vzero));
+  rs6000_expand_vector_extract (operands[0], vtmp2, last_elem);
   DONE;
 })
 
diff --git a/gcc/config/rs6000/vector.md b/gcc/config/rs6000/vector.md
index e2946bd6e312e909471253fc2d75a4b25e050f82..cc01a96cb681a61b0708b34ffa4339529f3c1b9e 100644
--- a/gcc/config/rs6000/vector.md
+++ b/gcc/config/rs6000/vector.md
@@ -77,7 +77,7 @@
 ;; Vector reduction code iterators
 (define_code_iterator VEC_reduc [plus smin smax])
 
-(define_code_attr VEC_reduc_name [(plus "splus")
+(define_code_attr VEC_reduc_name [(plus "plus")
 				  (smin "smin")
 				  (smax "smax")])
 
@@ -990,20 +990,16 @@
 \f
 ;; Vector reduction expanders for VSX
 
-(define_expand "reduc_<VEC_reduc_name>_v2df"
-  [(parallel [(set (match_operand:V2DF 0 "vfloat_operand" "")
-		   (VEC_reduc:V2DF
-		    (vec_concat:V2DF
-		     (vec_select:DF
-		      (match_operand:V2DF 1 "vfloat_operand" "")
-		      (parallel [(const_int 1)]))
-		     (vec_select:DF
-		      (match_dup 1)
-		      (parallel [(const_int 0)])))
-		    (match_dup 1)))
-	      (clobber (match_scratch:V2DF 2 ""))])]
+(define_expand "reduc_<VEC_reduc_name>_scal_v2df"
+  [(match_operand:DF 0 "register_operand" "")
+   (VEC_reduc:V2DF (match_operand:V2DF 1 "vfloat_operand" "") (const_int 0))]
   "VECTOR_UNIT_VSX_P (V2DFmode)"
-  "")
+  {
+    rtx vec = gen_reg_rtx (V2DFmode);
+    emit_insn (gen_vsx_reduc_<VEC_reduc_name>_v2df (vec, operand1));
+    emit_insn (gen_vsx_extract_v2df (operand0, vec, const1_rtx));
+    DONE;
+  })
 
 ; The (VEC_reduc:V4SF
 ;	(op1)
@@ -1012,13 +1008,16 @@
 ; is to allow us to use a code iterator, but not completely list all of the
 ; vector rotates, etc. to prevent canonicalization
 
-(define_expand "reduc_<VEC_reduc_name>_v4sf"
-  [(parallel [(set (match_operand:V4SF 0 "vfloat_operand" "")
-		   (VEC_reduc:V4SF
-		    (unspec:V4SF [(const_int 0)] UNSPEC_REDUC)
-		    (match_operand:V4SF 1 "vfloat_operand" "")))
+(define_expand "reduc_<VEC_reduc_name>_scal_v4sf"
+  [(parallel [(set (match_operand:SF 0 "vfloat_operand" "")
+		   (vec_select:SF
+		    (VEC_reduc:V4SF
+		     (unspec:V4SF [(const_int 0)] UNSPEC_REDUC)
+		     (match_operand:V4SF 1 "vfloat_operand" ""))
+		    (parallel [(const_int 3)])))
 	      (clobber (match_scratch:V4SF 2 ""))
-	      (clobber (match_scratch:V4SF 3 ""))])]
+	      (clobber (match_scratch:V4SF 3 ""))
+	      (clobber (match_scratch:V4SF 4 ""))])]
   "VECTOR_UNIT_VSX_P (V4SFmode)"
   "")
 
diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
index 27d464e07f7b77166047dd9ba41966aef411c029..2a30039c2bf415ab92fbd0b806787a74b15d7dcb 100644
--- a/gcc/config/rs6000/vsx.md
+++ b/gcc/config/rs6000/vsx.md
@@ -2150,7 +2150,7 @@
 \f
 ;; Vector reduction insns and splitters
 
-(define_insn_and_split "*vsx_reduc_<VEC_reduc_name>_v2df"
+(define_insn_and_split "vsx_reduc_<VEC_reduc_name>_v2df"
   [(set (match_operand:V2DF 0 "vfloat_operand" "=&wd,&?wa,wd,?wa")
 	(VEC_reduc:V2DF
 	 (vec_concat:V2DF


* Re: Ping: Re: [PATCH 10/11][RS6000] Migrate reduction optabs to reduc_..._scal
  2014-12-11 15:59           ` Ping: " Alan Lawrence
@ 2014-12-11 18:37             ` Alan Lawrence
  0 siblings, 0 replies; 27+ messages in thread
From: Alan Lawrence @ 2014-12-11 18:37 UTC (permalink / raw)
  To: Michael Meissner; +Cc: Segher Boessenkool, gcc-patches

Sorry - it works exactly as the current optab/expander *in the v2df case*, but 
is the same as the previous version of the patch in the other cases.

--Alan

Alan Lawrence wrote:
> So I'm afraid I'm not going to get involved in a discussion about 
> CANNOT_CHANGE_MODE_CLASS on RS6000, and what you might want to do there - sorry, 
> but I don't think I can really contribute anything there. However, I *am* trying 
> to migrate all platforms off the old reduc_xxx optabs to the new version 
> producing a scalar.
> 
> Hence, can I ping the attached patch (which is just a simple combination of the 
> previously-posted patch + snippet)? No regressions on gcc112.fsffrance.org.
> 
> This works in exactly the same way as the old code path, with a second insn to 
> pull the scalar result out of the reduction, just as the expander would have 
> done (or the bitfieldref before that), and avoiding the v2df combine pattern 
> (again, as previously).
> 
> gcc/ChangeLog:
> 
>      * config/rs6000/altivec.md (reduc_splus_<mode>): Rename to...
>      (reduc_plus_scal_<mode>): ...this, add rs6000_expand_vector_extract.
>      (reduc_uplus_v16qi): Remove.
> 
>      * config/rs6000/vector.md (VEC_reduc_name): Change "splus" to "plus".
>      (reduc_<VEC_reduc_name>_v2df): Remove.
>      (reduc_<VEC_reduc_name>_scal_v2df): New.
>      (reduc_<VEC_reduc_name>_v4sf): Rename to...
>      (reduc_<VEC_reduc_name>_scal_v4sf): ...this, wrap VEC_reduc in a
>      vec_select of element 3, add scratch register.
> 
> 
> 
> 
>> Have run check-gcc on gcc110.fsffrance.org (powerpc64-unknown-linux-gnu) using this snippet on top of the original patch; no regressions.
>>
>>
>> Alan Lawrence wrote:
>>
>>     So I'm no expert on RS6000 here, but following on from Segher's observation about the change in pattern: the difference in 'expand' is exactly that a vsx_reduc_splus_v2df followed by a vec_extract to DF becomes a vsx_reduc_splus_v2df_scalar - the insn I expected the combiner to produce by combining the two previous insns.
>>
>>
>>     However, inspecting the logs from -fdump-rtl-combine-all, *without* my patch, when the combiner tries to put those two together, I see:
>>
>>
>>     Trying 30 -> 31:
>>     Failed to match this instruction:
>>     (set (reg:DF 179 [ stmp_s_5.7D.2196 ])
>>          (vec_select:DF (plus:V2DF (vec_select:V2DF (reg:V2DF 173 [ vect_s_5.6D.2195 ])
>>                      (parallel [
>>                              (const_int 1 [0x1])
>>                              (const_int 0 [0])
>>                          ]))
>>                  (reg:V2DF 173 [ vect_s_5.6D.2195 ]))
>>              (parallel [
>>                      (const_int 1 [0x1])
>>                  ])))
>>
>>     That is, it looks like combine_simplify_rtx has transformed the (vec_concat (vec_select ... 1) (vec_select ... 0)) from the vsx_reduc_plus_v2df insn, into a single vec_select, which does not match the vsx_reduc_plus_v2df_scalar insn.
>>
>>
>>     So despite the comment (in vsx.md):
>>
>>     ;; Combiner patterns with the vector reduction patterns that knows we can get
>>     ;; to the top element of the V2DF array without doing an extract.
>>
>>     It looks like the code generation prior to my patch, which was considered better, came about because the combiner didn't actually use the pattern?
>>
>>
>>     In that case whilst you may want to dig into register allocation, cannot_change_mode_class, etc., for other reasons, I think the best fix for migrating to reduc_plus_scal... is simply to avoid using the "Combiner" patterns and just emit two insns, the old pattern followed by a vec_extract. The attached snippet does this (I won't call it a patch yet, and it applies on top of the previous patch - I went the route of calling the two gen functions rather than copying their RTL sequences, but could do the latter if that were preferable???), and restores code generation to the original form on your example above; it bootstraps OK but I'm still running check-gcc on the Compile Farm...
>>
>>
>>     However, again on your example above, I note that if I *remove* the reduc_plus_scal_v2df pattern altogether, I get:
>>
>>
>>     .sum:
>>              li 10,512        # 52   *movdi_internal64/4     [length = 4]
>>              ld 9,.LC2@toc(2)         # 20   *movdi_internal64/2     [length = 4]
>>              xxlxor 0,0,0     # 17   *vsx_movv2df/12 [length = 4]
>>              mtctr 10         # 48   *movdi_internal64/11    [length = 4]
>>              .align 4
>>     .L2:
>>              lxvd2x 12,0,9    # 23   *vsx_movv2df/2  [length = 4]
>>              addi 9,9,16      # 25   *adddi3_internal1/2     [length = 4]
>>              xvadddp 0,0,12   # 24   *vsx_addv2df3/1 [length = 4]
>>              bdnz .L2         # 47   *ctrdi_internal1/1      [length = 4]
>>              xxsldwi 12,0,0,2         # 30   vsx_xxsldwi_v2df        [length = 4]
>>              xvadddp 1,0,12   # 31   *vsx_addv2df3/1 [length = 4]
>>              nop      # 37   *vsx_extract_v2df_internal2/1   [length = 4]
>>              blr      # 55   return  [length = 4]
>>
>>     this is presumably using gcc's scalar reduction code, but (to my untrained eye on powerpc!) it looks even better than the first form above (the same in the loop, and in the reduction, an xxpermdi is replaced by a nop !)...
>>
>>
>>     --Alan
>>
>>
>>     Segher Boessenkool wrote:
>>
>>         On Mon, Nov 10, 2014 at 05:36:24PM -0500, Michael Meissner wrote:
>>
>>             However, the double pattern is completely broken.  This cannot go in.
>>
>>         [snip]
>>
>>             It is unacceptable to have the inner loop do a load, vector add, and
>>             store on every iteration.
>>
>>         Before the patch, the final reduction used *vsx_reduc_splus_v2df; after
>>         the patch, it is *vsx_reduc_plus_v2df_scalar.  The former does a vector
>>         add, the latter a float add.  And it uses the same pseudoregister for the
>>         accumulator throughout.  IRA decides a register is more expensive than
>>         memory for this, I suppose because it wants both V2DF and DF?  It doesn't
>>         seem to like the subreg very much.
>>
>>         The new code does look nicer otherwise :-)
>>
>>
>>         Segher



end of thread

Thread overview: 27+ messages
2014-10-24 11:57 [PATCH v2 0-6/11] Fix PR/61114, make direct vector reductions endianness-neutral Alan Lawrence
2014-10-24 11:58 ` [PATCH 7/11][ARM] Migrate to new reduc_plus_scal_optab Alan Lawrence
2014-11-03 17:32   ` Ramana Radhakrishnan
2014-10-24 12:01 ` [PATCH v2 0-6/11] Fix PR/61114, make direct vector reductions endianness-neutral Richard Biener
2014-10-24 12:05 ` [PATCH 8/11][ARM] Migrate to new reduc_[us](min|max)_scal_optab Alan Lawrence
2014-11-04 11:08   ` Ramana Radhakrishnan
2014-10-24 12:06 ` [PATCH 9/11][i386] Migrate reduction optabs to reduc_..._scal Alan Lawrence
2014-10-24 12:07 ` [PATCH 10/11][RS6000] " Alan Lawrence
2014-10-24 12:14   ` Alan Lawrence
2014-10-25  0:08   ` David Edelsohn
2014-11-03 17:51     ` Bill Schmidt
2014-11-06 16:44       ` Alan Lawrence
2014-11-06 18:57         ` Bill Schmidt
2014-11-07 10:09           ` Alan Lawrence
2014-11-10 22:39   ` Michael Meissner
2014-11-11  7:10     ` Segher Boessenkool
2014-11-12  1:54       ` Michael Meissner
2014-11-12  9:26         ` Segher Boessenkool
2014-11-12 19:20           ` Michael Meissner
2014-11-12 12:32       ` Alan Lawrence
2014-11-12 18:53         ` Alan Lawrence
2014-12-11 15:59           ` Ping: " Alan Lawrence
2014-12-11 18:37             ` Alan Lawrence
2014-10-24 12:12 ` [Protopatch 11/11][IA64] Migrate to reduc_(plus|min|max)_scal_v2df optab Alan Lawrence
2014-10-24 12:50   ` Alan Lawrence
2014-10-24 15:19 ` [PATCH v2 0-6/11] Fix PR/61114, make direct vector reductions endianness-neutral Matthew Fortune
2014-10-27 11:48   ` Richard Biener
