public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH GCC]New vectorization pattern turning cond_expr into max/min and plus/minus
@ 2016-10-11 15:03 Bin Cheng
  2016-10-12  8:12 ` Richard Biener
  0 siblings, 1 reply; 6+ messages in thread
From: Bin Cheng @ 2016-10-11 15:03 UTC (permalink / raw)
  To: gcc-patches; +Cc: nd

[-- Attachment #1: Type: text/plain, Size: 1495 bytes --]

Hi,
Given below test case,
int foo (unsigned short a[], unsigned int x)
{
  unsigned int i;
  for (i = 0; i < 1000; i++)
    {
      x = a[i];
      a[i] = (unsigned short)(x >= 32768 ? x - 32768 : 0);
    }
  return x;
}

it now can be vectorized on AArch64, but generated assembly is way from optimal:
.L4:
	ldr	q4, [x3, x1]
	add	w2, w2, 1
	cmp	w2, w0
	ushll	v1.4s, v4.4h, 0
	ushll2	v0.4s, v4.8h, 0
	add	v3.4s, v1.4s, v6.4s
	add	v2.4s, v0.4s, v6.4s
	cmhi	v1.4s, v1.4s, v5.4s
	cmhi	v0.4s, v0.4s, v5.4s
	and	v1.16b, v3.16b, v1.16b
	and	v0.16b, v2.16b, v0.16b
	xtn	v2.4h, v1.4s
	xtn2	v2.8h, v0.4s
	str	q2, [x3, x1]
	add	x1, x1, 16
	bcc	.L4

The vectorized loop has 15 instructions, which can be greatly simplified by turning cond_expr into max_expr, as below:
.L4:
	ldr	q1, [x3, x1]
	add	w2, w2, 1
	cmp	w2, w0
	umax	v0.8h, v1.8h, v2.8h
	add	v0.8h, v0.8h, v2.8h
	str	q0, [x3, x1]
	add	x1, x1, 16
	bcc	.L4

This patch addresses the issue by adding new vectorization pattern.
Bootstrap and test on x86_64 and AArch64.  Is it OK?

Thanks,
bin

2016-10-11  Bin Cheng  <bin.cheng@arm.com>

	* tree-vect-patterns.c (vect_recog_min_max_modify_pattern): New.
	(vect_vect_recog_func_ptrs): New element for above pattern.
	* tree-vectorizer.h (NUM_PATTERNS): Increase by 1.

gcc/testsuite/ChangeLog
2016-10-11  Bin Cheng  <bin.cheng@arm.com>

	* gcc.dg/vect/vect-umax-modify-pattern.c: New test.
	* gcc.dg/vect/vect-umin-modify-pattern.c: New test.

[-- Attachment #2: umin-max-modify-pattern-20160924.txt --]
[-- Type: text/plain, Size: 12497 bytes --]

diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
index 7e6e45d..b502498 100644
--- a/gcc/tree-vect-patterns.c
+++ b/gcc/tree-vect-patterns.c
@@ -65,6 +65,8 @@ static gimple *vect_recog_divmod_pattern (vec<gimple *> *,
 static gimple *vect_recog_mult_pattern (vec<gimple *> *,
 				       tree *, tree *);
 
+static gimple *vect_recog_min_max_modify_pattern (vec<gimple *> *,
+						  tree *, tree *);
 static gimple *vect_recog_mixed_size_cond_pattern (vec<gimple *> *,
 						  tree *, tree *);
 static gimple *vect_recog_bool_pattern (vec<gimple *> *, tree *, tree *);
@@ -91,6 +93,7 @@ static vect_recog_func vect_vect_recog_func_ptrs[NUM_PATTERNS] = {
       { vect_recog_vector_vector_shift_pattern, "vector_vector_shift" },
       {	vect_recog_divmod_pattern, "divmod" },
       {	vect_recog_mult_pattern, "mult" },
+      {	vect_recog_min_max_modify_pattern, "min_max_modify" },
       {	vect_recog_mixed_size_cond_pattern, "mixed_size_cond" },
       {	vect_recog_bool_pattern, "bool" },
       {	vect_recog_mask_conversion_pattern, "mask_conversion" }
@@ -2964,6 +2967,262 @@ vect_recog_divmod_pattern (vec<gimple *> *stmts,
   return pattern_stmt;
 }
 
+/* Function vect_recog_min_max_modify_pattern
+
+   Try to find the following pattern:
+
+     type x_t, iftmp1_t, iftmp2_t;
+     TYPE a_T;
+   loop:
+     S1  a_T = (TYPE)x_t;
+     S2  iftmp1_t = x_t - CONST1;
+     S3  iftmp2_t = (a_T > (TYPE)CONST2) ? iftmp1_t : (CONST2 - CONST1);
+
+   Or
+
+   loop:
+     S1  a_T = (TYPE)x_t;
+     S2  iftmp1_t = x_t + CONST1;
+     S3  iftmp2_t = (a_T < (TYPE)CONST2) ? iftmp1_t : (CONST2 + CONST1);
+
+   where both type 'TYPE' and 'type' are unsigned integral types and
+   TYPE is larger than type in prevision.  All constant operands need
+   to fit into 'type'.
+
+   Input:
+
+   * LAST_STMT: A stmt from which the pattern search begins.
+
+   Output:
+
+   * TYPE_IN: The type of the input arguments to the pattern.
+
+   * TYPE_OUT: The type of the output of this pattern.
+
+   * Return value: A new stmt that will be used to replace the pattern,
+     as well as an additional def_stmt is generated.
+
+	patt_5 = MAX_EXPR <x_t, CONST2>;
+	iftmp2_t = patt_5 - CONST1;
+
+     Or
+
+	patt_5 = MIN_EXPR <x_t, CONST2>;
+	iftmp2_t = patt_5 + CONST1;  */
+
+static gimple *
+vect_recog_min_max_modify_pattern (vec<gimple *> *stmts, tree *type_in,
+				   tree *type_out)
+{
+  gimple *last_stmt = (*stmts)[0];
+  stmt_vec_info stmt_vinfo = vinfo_for_stmt (last_stmt), def_stmt_info;
+  gimple *pattern_stmt, *def_stmt;
+  vec_info *vinfo = stmt_vinfo->vinfo;
+  enum tree_code code, p_code;
+  tree op0 = NULL_TREE, op1, p_op1, orig_type;
+  bool promotion;
+
+  if (!is_gimple_assign (last_stmt)
+      || gimple_assign_rhs_code (last_stmt) != COND_EXPR
+      || STMT_VINFO_DEF_TYPE (stmt_vinfo) != vect_internal_def)
+    return NULL;
+
+  tree cond_expr = gimple_assign_rhs1 (last_stmt);
+  if (!COMPARISON_CLASS_P (cond_expr))
+    return NULL;
+
+  enum tree_code cond_code = TREE_CODE (cond_expr);
+  if (cond_code != LE_EXPR && cond_code != LT_EXPR
+      && cond_code != GE_EXPR && cond_code != GT_EXPR)
+    return NULL;
+
+  tree then_clause = gimple_assign_rhs2 (last_stmt);
+  tree else_clause = gimple_assign_rhs3 (last_stmt);
+  tree type = TREE_TYPE (then_clause);
+  if (!INTEGRAL_TYPE_P (type) || !TYPE_OVERFLOW_WRAPS (type))
+    return NULL;
+
+  tree vectype = get_vectype_for_scalar_type (type);
+  if (vectype == NULL_TREE)
+    return NULL;
+
+  /* Make sure the constant operand appears in else clause.  */
+  if (TREE_CODE (then_clause) == INTEGER_CST)
+    {
+      cond_code = invert_tree_comparison (cond_code, false);
+      std::swap (then_clause, else_clause);
+    }
+  if (TREE_CODE (then_clause) != SSA_NAME
+      || TREE_CODE (else_clause) != INTEGER_CST)
+    return NULL;
+
+  tree cond_op0 = TREE_OPERAND (cond_expr, 0);
+  tree cond_op1 = TREE_OPERAND (cond_expr, 1);
+  tree cond_type = TREE_TYPE (cond_op0);
+  if (!INTEGRAL_TYPE_P (cond_type) || TREE_CODE (cond_op1) != INTEGER_CST)
+    return NULL;
+
+  /* Handle simple boundary cases.  Below code snippet:
+
+       type x_t;
+       signed type a_T;
+
+       a_T = (signed type)x_t;
+       r = (a_t OP 0) ? then_clause : else_clause;
+
+     in which OP is either "<" or ">=".  It equals to:
+
+       type x_t;
+       signed type a_T;
+
+       a_T = (signed type)x_t;
+       r = (x_t OP TYPE_MAX_VALUE (type)) ? then_clause : else_clause;
+
+     in which OP is ">" or "<=".  */
+  if (tree_nop_conversion_p (type, cond_type)
+      && TYPE_UNSIGNED (type) && !TYPE_UNSIGNED (cond_type)
+      && TREE_CODE (cond_op0) == SSA_NAME && integer_zerop (cond_op1)
+      && type_conversion_p (cond_op0, last_stmt, false,
+			    &orig_type, &def_stmt, &promotion)
+      && orig_type && types_compatible_p (type, orig_type))
+    {
+      if (cond_code == LT_EXPR)
+	cond_code = GT_EXPR;
+      else if (cond_code == GE_EXPR)
+	cond_code = LE_EXPR;
+      else
+	return NULL;
+
+      cond_op0 = gimple_assign_rhs1 (def_stmt);
+      cond_op1 = TYPE_MAX_VALUE (cond_type);
+      cond_type = orig_type;
+    }
+
+  /* Check if MAX_EXPR/MIN_EXPR is supported.  Do it here so we can early
+     out if operator is not supported.  */
+  if (cond_code == GE_EXPR || cond_code == GT_EXPR)
+    code = MAX_EXPR;
+  else if (cond_code == LE_EXPR || cond_code == LT_EXPR)
+    code = MIN_EXPR;
+  else
+    return NULL;
+  optab optab = optab_for_tree_code (code, vectype, optab_default);
+  if (!optab || optab_handler (optab, TYPE_MODE (vectype)) == CODE_FOR_nothing)
+    return NULL;
+
+  /* Check if cond_op0 is converted from type.  */
+  if (TYPE_PRECISION (cond_type) > TYPE_PRECISION (type))
+    {
+      if (!int_fits_type_p (cond_op1, type))
+	return NULL;
+      else
+	cond_op1 = fold_convert (type, cond_op1);
+
+      orig_type = NULL_TREE;
+      /* If the conversion is folded in cond_expr.  */
+      if (CONVERT_EXPR_CODE_P (TREE_CODE (cond_op0)))
+	{
+	  cond_op0 = TREE_OPERAND (cond_op0, 0);
+	  orig_type = TREE_TYPE (cond_op0);
+	}
+      /* Or not.  */
+      else if (type_conversion_p (cond_op0, last_stmt, false,
+				  &orig_type, &def_stmt, &promotion))
+	{
+	  cond_op0 = gimple_assign_rhs1 (def_stmt);
+	}
+
+      if (!orig_type || !types_compatible_p (type, orig_type))
+	return NULL;
+    }
+  /* Only unsigned type supported so far.  */
+  if (!TYPE_UNSIGNED (type) || !TYPE_UNSIGNED (cond_type))
+    return NULL;
+
+  def_stmt = SSA_NAME_DEF_STMT (then_clause);
+  if (gimple_code (def_stmt) != GIMPLE_ASSIGN)
+    return NULL;
+  enum tree_code then_code = gimple_assign_rhs_code (def_stmt);
+  if (then_code != PLUS_EXPR && then_code != MINUS_EXPR)
+    return NULL;
+
+  tree then_op0 = gimple_assign_rhs1 (def_stmt);
+  tree then_op1 = gimple_assign_rhs2 (def_stmt);
+  if (TREE_CODE (then_op1) != INTEGER_CST
+      || !operand_equal_p (cond_op0, then_op0, 0))
+    return NULL;
+
+  /* MAX_EXPR case.  */
+  if (cond_code == GE_EXPR || cond_code == GT_EXPR)
+    {
+      if (then_code == PLUS_EXPR)
+	then_op1 = fold_build1 (NEGATE_EXPR, type, then_op1);
+
+      op1 = fold_build2 (PLUS_EXPR, type, else_clause, then_op1);
+      if (wi::to_widest (cond_op1) != wi::to_widest (op1)
+	  /* Boundary cases need more care, for example:
+
+	       cond_op0 > (CONST2 - 1) ? then_clause : CONST2 - CONST1
+
+	     equals to:
+
+	       cond_op0 >= (CONST2) ? then_clause : CONST2 - CONST1
+
+	     In these cases, cond_op0 and cond_op1 have the same value,
+	     as well as then_clause and else_clause.  It doesn't matter
+	     which value is taken.  */
+	  && !(cond_code == GT_EXPR
+	       && (wi::to_widest (cond_op1) + 1) == wi::to_widest (op1))
+	  && !(cond_code == GE_EXPR
+	       && (wi::to_widest (cond_op1) - 1) == wi::to_widest (op1)))
+	return NULL;
+
+      op0 = then_op0;
+      code = MAX_EXPR;
+      p_code = MINUS_EXPR;
+      p_op1 = then_op1;
+    }
+  /* MIN_EXPR case.  */
+  if (cond_code == LE_EXPR || cond_code == LT_EXPR)
+    {
+      if (then_code == MINUS_EXPR)
+	then_op1 = fold_build1 (NEGATE_EXPR, type, then_op1);
+
+      op1 = fold_build2 (MINUS_EXPR, type, else_clause, then_op1);
+      if (wi::to_widest (cond_op1) != wi::to_widest (op1)
+	  && !(cond_code == LT_EXPR
+	       && (wi::to_widest (cond_op1) - 1) == wi::to_widest (op1))
+	  && !(cond_code == LE_EXPR
+	       && (wi::to_widest (cond_op1) + 1) == wi::to_widest (op1)))
+	return NULL;
+
+      op0 = then_op0;
+      code = MIN_EXPR;
+      p_code = PLUS_EXPR;
+      p_op1 = then_op1;
+    }
+  gcc_assert (op0 != NULL_TREE);
+
+  def_stmt = gimple_build_assign (vect_recog_temp_ssa_var (type, NULL),
+				  code, op0, op1);
+  pattern_stmt = gimple_build_assign (vect_recog_temp_ssa_var (type, NULL),
+				      p_code, gimple_assign_lhs (def_stmt),
+				      p_op1);
+
+  new_pattern_def_seq (stmt_vinfo, def_stmt);
+  def_stmt_info = new_stmt_vec_info (def_stmt, vinfo);
+  set_vinfo_for_stmt (def_stmt, def_stmt_info);
+  STMT_VINFO_VECTYPE (def_stmt_info) = vectype;
+  *type_in = vectype;
+  *type_out = vectype;
+
+  if (dump_enabled_p ())
+    dump_printf_loc (MSG_NOTE, vect_location,
+		     "vect_recog_min_max_modify_pattern: detected:\n");
+
+  return pattern_stmt;
+}
+
 /* Function vect_recog_mixed_size_cond_pattern
 
    Try to find the following pattern:
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 29ef676..e8f93ea 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1180,7 +1180,7 @@ extern bool is_simple_and_all_uses_invariant (gimple *, loop_vec_info);
    Additional pattern recognition functions can (and will) be added
    in the future.  */
 typedef gimple *(* vect_recog_func_ptr) (vec<gimple *> *, tree *, tree *);
-#define NUM_PATTERNS 14
+#define NUM_PATTERNS 15
 void vect_pattern_recog (vec_info *);
 
 /* In tree-vectorizer.c.  */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-umax-modify-pattern.c b/gcc/testsuite/gcc.dg/vect/vect-umax-modify-pattern.c
new file mode 100644
index 0000000..87ac36b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-umax-modify-pattern.c
@@ -0,0 +1,48 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+int foo1 (unsigned short a[], unsigned int x)
+{
+  unsigned int i;
+  for (i = 0; i < 1000; i++)
+    {
+      x = a[i];
+      a[i] = (unsigned short)(x >= 32768 ? x - 32768 : 0);
+    }
+  return x;
+}
+
+int foo2 (unsigned short a[], unsigned int x)
+{
+  unsigned int i;
+  for (i = 0; i < 1000; i++)
+    {
+      x = a[i];
+      a[i] = (unsigned short)(x > 32768 ? x - 32768 : 0);
+    }
+  return x;
+}
+
+int foo3 (unsigned short a[], unsigned int x)
+{
+  unsigned int i;
+  for (i = 0; i < 1000; i++)
+    {
+      x = a[i];
+      a[i] = (unsigned short)(x > 1000 ? x - 1000 : 0);
+    }
+  return x;
+}
+
+int foo4 (unsigned short a[], unsigned int x)
+{
+  unsigned int i;
+  for (i = 0; i < 1000; i++)
+    {
+      x = a[i];
+      a[i] = (unsigned short)(x >= 2 ? x - 32768 : 32770);
+    }
+  return x;
+}
+/* { dg-final { scan-tree-dump-times "MAX_EXPR " 4 "optimized" { xfail vect_no_int_min_max } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-umin-modify-pattern.c b/gcc/testsuite/gcc.dg/vect/vect-umin-modify-pattern.c
new file mode 100644
index 0000000..85f8bbc
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-umin-modify-pattern.c
@@ -0,0 +1,49 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+int foo1 (unsigned short a[], unsigned int x)
+{
+  unsigned int i;
+  for (i = 0; i < 1000; i++)
+    {
+      x = a[i];
+      a[i] = (unsigned short)(x <= 32768 ? x + 32768 : 0);
+    }
+  return x;
+}
+
+int foo2 (unsigned short a[], unsigned int x)
+{
+  unsigned int i;
+  for (i = 0; i < 1000; i++)
+    {
+      x = a[i];
+      a[i] = (unsigned short)(x < 32768 ? x + 32768 : 0);
+    }
+  return x;
+}
+
+int foo3 (unsigned short a[], unsigned int x)
+{
+  unsigned int i;
+  for (i = 0; i < 1000; i++)
+    {
+      x = a[i];
+      a[i] = (unsigned short)(x < 1000 ? x - 1000 : 0);
+    }
+  return x;
+}
+
+int foo4 (unsigned short a[], unsigned int x)
+{
+  unsigned int i;
+  for (i = 0; i < 1000; i++)
+    {
+      x = a[i];
+      a[i] = (unsigned short)(x <= 2 ? x + 999 : 1001);
+    }
+  return x;
+}
+
+/* { dg-final { scan-tree-dump-times "MIN_EXPR " 4 "optimized" { xfail vect_no_int_min_max } } } */

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH GCC]New vectorization pattern turning cond_expr into max/min and plus/minus
  2016-10-11 15:03 [PATCH GCC]New vectorization pattern turning cond_expr into max/min and plus/minus Bin Cheng
@ 2016-10-12  8:12 ` Richard Biener
  2016-10-12  8:30   ` Bin.Cheng
  0 siblings, 1 reply; 6+ messages in thread
From: Richard Biener @ 2016-10-12  8:12 UTC (permalink / raw)
  To: Bin Cheng; +Cc: gcc-patches, nd

On Tue, Oct 11, 2016 at 5:03 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
> Hi,
> Given below test case,
> int foo (unsigned short a[], unsigned int x)
> {
>   unsigned int i;
>   for (i = 0; i < 1000; i++)
>     {
>       x = a[i];
>       a[i] = (unsigned short)(x >= 32768 ? x - 32768 : 0);
>     }
>   return x;
> }
>
> it now can be vectorized on AArch64, but generated assembly is way from optimal:
> .L4:
>         ldr     q4, [x3, x1]
>         add     w2, w2, 1
>         cmp     w2, w0
>         ushll   v1.4s, v4.4h, 0
>         ushll2  v0.4s, v4.8h, 0
>         add     v3.4s, v1.4s, v6.4s
>         add     v2.4s, v0.4s, v6.4s
>         cmhi    v1.4s, v1.4s, v5.4s
>         cmhi    v0.4s, v0.4s, v5.4s
>         and     v1.16b, v3.16b, v1.16b
>         and     v0.16b, v2.16b, v0.16b
>         xtn     v2.4h, v1.4s
>         xtn2    v2.8h, v0.4s
>         str     q2, [x3, x1]
>         add     x1, x1, 16
>         bcc     .L4
>
> The vectorized loop has 15 instructions, which can be greatly simplified by turning cond_expr into max_expr, as below:
> .L4:
>         ldr     q1, [x3, x1]
>         add     w2, w2, 1
>         cmp     w2, w0
>         umax    v0.8h, v1.8h, v2.8h
>         add     v0.8h, v0.8h, v2.8h
>         str     q0, [x3, x1]
>         add     x1, x1, 16
>         bcc     .L4
>
> This patch addresses the issue by adding new vectorization pattern.
> Bootstrap and test on x86_64 and AArch64.  Is it OK?

So the COND_EXPRs are generated this way by if-conversion, right?  I
believe that
the MAX/MIN_EXPR form is always preferrable and thus it looks like if-conversion
might want to either directly generate it or make sure to fold the
introduced stmts
(and have a match.pd pattern catching this).

Richard.

> Thanks,
> bin
>
> 2016-10-11  Bin Cheng  <bin.cheng@arm.com>
>
>         * tree-vect-patterns.c (vect_recog_min_max_modify_pattern): New.
>         (vect_vect_recog_func_ptrs): New element for above pattern.
>         * tree-vectorizer.h (NUM_PATTERNS): Increase by 1.
>
> gcc/testsuite/ChangeLog
> 2016-10-11  Bin Cheng  <bin.cheng@arm.com>
>
>         * gcc.dg/vect/vect-umax-modify-pattern.c: New test.
>         * gcc.dg/vect/vect-umin-modify-pattern.c: New test.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH GCC]New vectorization pattern turning cond_expr into max/min and plus/minus
  2016-10-12  8:12 ` Richard Biener
@ 2016-10-12  8:30   ` Bin.Cheng
  2016-10-12  8:50     ` Richard Biener
  0 siblings, 1 reply; 6+ messages in thread
From: Bin.Cheng @ 2016-10-12  8:30 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

On Wed, Oct 12, 2016 at 9:12 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Tue, Oct 11, 2016 at 5:03 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>> Hi,
>> Given below test case,
>> int foo (unsigned short a[], unsigned int x)
>> {
>>   unsigned int i;
>>   for (i = 0; i < 1000; i++)
>>     {
>>       x = a[i];
>>       a[i] = (unsigned short)(x >= 32768 ? x - 32768 : 0);
>>     }
>>   return x;
>> }
>>
>> it now can be vectorized on AArch64, but generated assembly is way from optimal:
>> .L4:
>>         ldr     q4, [x3, x1]
>>         add     w2, w2, 1
>>         cmp     w2, w0
>>         ushll   v1.4s, v4.4h, 0
>>         ushll2  v0.4s, v4.8h, 0
>>         add     v3.4s, v1.4s, v6.4s
>>         add     v2.4s, v0.4s, v6.4s
>>         cmhi    v1.4s, v1.4s, v5.4s
>>         cmhi    v0.4s, v0.4s, v5.4s
>>         and     v1.16b, v3.16b, v1.16b
>>         and     v0.16b, v2.16b, v0.16b
>>         xtn     v2.4h, v1.4s
>>         xtn2    v2.8h, v0.4s
>>         str     q2, [x3, x1]
>>         add     x1, x1, 16
>>         bcc     .L4
>>
>> The vectorized loop has 15 instructions, which can be greatly simplified by turning cond_expr into max_expr, as below:
>> .L4:
>>         ldr     q1, [x3, x1]
>>         add     w2, w2, 1
>>         cmp     w2, w0
>>         umax    v0.8h, v1.8h, v2.8h
>>         add     v0.8h, v0.8h, v2.8h
>>         str     q0, [x3, x1]
>>         add     x1, x1, 16
>>         bcc     .L4
>>
>> This patch addresses the issue by adding new vectorization pattern.
>> Bootstrap and test on x86_64 and AArch64.  Is it OK?
>
> So the COND_EXPRs are generated this way by if-conversion, right?  I
Though ?: is used in source code, yes, it is if-conv regenerated COND_EXPR.
> believe that
> the MAX/MIN_EXPR form is always preferrable and thus it looks like if-conversion
> might want to either directly generate it or make sure to fold the
> introduced stmts
> (and have a match.pd pattern catching this).
Hmm, I also noticed saturation cases which should be better
transformed before vectorization in scalar optimizers.  But this case
is a bit different because there is additional computation involved
other than type conversion.  We need to prove the computation can be
done in either large or small types.  It is quite specific case and I
don't see good (general) solution in if-conv.  Vect-pattern looks like
a natural place doing this.  I am also looking at general saturation
cases, but this one is different?

Thanks,
bin
>
> Richard.
>
>> Thanks,
>> bin
>>
>> 2016-10-11  Bin Cheng  <bin.cheng@arm.com>
>>
>>         * tree-vect-patterns.c (vect_recog_min_max_modify_pattern): New.
>>         (vect_vect_recog_func_ptrs): New element for above pattern.
>>         * tree-vectorizer.h (NUM_PATTERNS): Increase by 1.
>>
>> gcc/testsuite/ChangeLog
>> 2016-10-11  Bin Cheng  <bin.cheng@arm.com>
>>
>>         * gcc.dg/vect/vect-umax-modify-pattern.c: New test.
>>         * gcc.dg/vect/vect-umin-modify-pattern.c: New test.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH GCC]New vectorization pattern turning cond_expr into max/min and plus/minus
  2016-10-12  8:30   ` Bin.Cheng
@ 2016-10-12  8:50     ` Richard Biener
  2016-10-20 16:32       ` Bin.Cheng
  0 siblings, 1 reply; 6+ messages in thread
From: Richard Biener @ 2016-10-12  8:50 UTC (permalink / raw)
  To: Bin.Cheng; +Cc: gcc-patches

On Wed, Oct 12, 2016 at 10:29 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Wed, Oct 12, 2016 at 9:12 AM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Tue, Oct 11, 2016 at 5:03 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>> Hi,
>>> Given below test case,
>>> int foo (unsigned short a[], unsigned int x)
>>> {
>>>   unsigned int i;
>>>   for (i = 0; i < 1000; i++)
>>>     {
>>>       x = a[i];
>>>       a[i] = (unsigned short)(x >= 32768 ? x - 32768 : 0);
>>>     }
>>>   return x;
>>> }
>>>
>>> it now can be vectorized on AArch64, but generated assembly is way from optimal:
>>> .L4:
>>>         ldr     q4, [x3, x1]
>>>         add     w2, w2, 1
>>>         cmp     w2, w0
>>>         ushll   v1.4s, v4.4h, 0
>>>         ushll2  v0.4s, v4.8h, 0
>>>         add     v3.4s, v1.4s, v6.4s
>>>         add     v2.4s, v0.4s, v6.4s
>>>         cmhi    v1.4s, v1.4s, v5.4s
>>>         cmhi    v0.4s, v0.4s, v5.4s
>>>         and     v1.16b, v3.16b, v1.16b
>>>         and     v0.16b, v2.16b, v0.16b
>>>         xtn     v2.4h, v1.4s
>>>         xtn2    v2.8h, v0.4s
>>>         str     q2, [x3, x1]
>>>         add     x1, x1, 16
>>>         bcc     .L4
>>>
>>> The vectorized loop has 15 instructions, which can be greatly simplified by turning cond_expr into max_expr, as below:
>>> .L4:
>>>         ldr     q1, [x3, x1]
>>>         add     w2, w2, 1
>>>         cmp     w2, w0
>>>         umax    v0.8h, v1.8h, v2.8h
>>>         add     v0.8h, v0.8h, v2.8h
>>>         str     q0, [x3, x1]
>>>         add     x1, x1, 16
>>>         bcc     .L4
>>>
>>> This patch addresses the issue by adding new vectorization pattern.
>>> Bootstrap and test on x86_64 and AArch64.  Is it OK?
>>
>> So the COND_EXPRs are generated this way by if-conversion, right?  I
> Though ?: is used in source code, yes, it is if-conv regenerated COND_EXPR.
>> believe that
>> the MAX/MIN_EXPR form is always preferrable and thus it looks like if-conversion
>> might want to either directly generate it or make sure to fold the
>> introduced stmts
>> (and have a match.pd pattern catching this).
> Hmm, I also noticed saturation cases which should be better
> transformed before vectorization in scalar optimizers.  But this case
> is a bit different because there is additional computation involved
> other than type conversion.  We need to prove the computation can be
> done in either large or small types.  It is quite specific case and I
> don't see good (general) solution in if-conv.  Vect-pattern looks like
> a natural place doing this.  I am also looking at general saturation
> cases, but this one is different?

(vect-patterns should go away ...)

But as if-conversion results may also prevail for scalar code doing the
pattern in match.pd would be better - that is, "apply" the pattern
already during if-conversion.

Yes, if-conversion fails to fold the stmts it generates (it only uses
generic folding on the trees it builds - it can need some TLC here).

Richard.

> Thanks,
> bin
>>
>> Richard.
>>
>>> Thanks,
>>> bin
>>>
>>> 2016-10-11  Bin Cheng  <bin.cheng@arm.com>
>>>
>>>         * tree-vect-patterns.c (vect_recog_min_max_modify_pattern): New.
>>>         (vect_vect_recog_func_ptrs): New element for above pattern.
>>>         * tree-vectorizer.h (NUM_PATTERNS): Increase by 1.
>>>
>>> gcc/testsuite/ChangeLog
>>> 2016-10-11  Bin Cheng  <bin.cheng@arm.com>
>>>
>>>         * gcc.dg/vect/vect-umax-modify-pattern.c: New test.
>>>         * gcc.dg/vect/vect-umin-modify-pattern.c: New test.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH GCC]New vectorization pattern turning cond_expr into max/min and plus/minus
  2016-10-12  8:50     ` Richard Biener
@ 2016-10-20 16:32       ` Bin.Cheng
  2016-10-21  7:41         ` Richard Biener
  0 siblings, 1 reply; 6+ messages in thread
From: Bin.Cheng @ 2016-10-20 16:32 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

On Wed, Oct 12, 2016 at 9:50 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Wed, Oct 12, 2016 at 10:29 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Wed, Oct 12, 2016 at 9:12 AM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Tue, Oct 11, 2016 at 5:03 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>> Hi,
>>>> Given below test case,
>>>> int foo (unsigned short a[], unsigned int x)
>>>> {
>>>>   unsigned int i;
>>>>   for (i = 0; i < 1000; i++)
>>>>     {
>>>>       x = a[i];
>>>>       a[i] = (unsigned short)(x >= 32768 ? x - 32768 : 0);
>>>>     }
>>>>   return x;
>>>> }
>>>>
>>>> it now can be vectorized on AArch64, but generated assembly is way from optimal:
>>>> .L4:
>>>>         ldr     q4, [x3, x1]
>>>>         add     w2, w2, 1
>>>>         cmp     w2, w0
>>>>         ushll   v1.4s, v4.4h, 0
>>>>         ushll2  v0.4s, v4.8h, 0
>>>>         add     v3.4s, v1.4s, v6.4s
>>>>         add     v2.4s, v0.4s, v6.4s
>>>>         cmhi    v1.4s, v1.4s, v5.4s
>>>>         cmhi    v0.4s, v0.4s, v5.4s
>>>>         and     v1.16b, v3.16b, v1.16b
>>>>         and     v0.16b, v2.16b, v0.16b
>>>>         xtn     v2.4h, v1.4s
>>>>         xtn2    v2.8h, v0.4s
>>>>         str     q2, [x3, x1]
>>>>         add     x1, x1, 16
>>>>         bcc     .L4
>>>>
>>>> The vectorized loop has 15 instructions, which can be greatly simplified by turning cond_expr into max_expr, as below:
>>>> .L4:
>>>>         ldr     q1, [x3, x1]
>>>>         add     w2, w2, 1
>>>>         cmp     w2, w0
>>>>         umax    v0.8h, v1.8h, v2.8h
>>>>         add     v0.8h, v0.8h, v2.8h
>>>>         str     q0, [x3, x1]
>>>>         add     x1, x1, 16
>>>>         bcc     .L4
>>>>
>>>> This patch addresses the issue by adding new vectorization pattern.
>>>> Bootstrap and test on x86_64 and AArch64.  Is it OK?
>>>
>>> So the COND_EXPRs are generated this way by if-conversion, right?  I
>> Though ?: is used in source code, yes, it is if-conv regenerated COND_EXPR.
>>> believe that
>>> the MAX/MIN_EXPR form is always preferrable and thus it looks like if-conversion
>>> might want to either directly generate it or make sure to fold the
>>> introduced stmts
>>> (and have a match.pd pattern catching this).
>> Hmm, I also noticed saturation cases which should be better
>> transformed before vectorization in scalar optimizers.  But this case
>> is a bit different because there is additional computation involved
>> other than type conversion.  We need to prove the computation can be
>> done in either large or small types.  It is quite specific case and I
>> don't see good (general) solution in if-conv.  Vect-pattern looks like
>> a natural place doing this.  I am also looking at general saturation
>> cases, but this one is different?
>
> (vect-patterns should go away ...)
>
> But as if-conversion results may also prevail for scalar code doing the
> pattern in match.pd would be better - that is, "apply" the pattern
> already during if-conversion.
>
> Yes, if-conversion fails to fold the stmts it generates (it only uses
> generic folding on the trees it builds - it can need some TLC here).
Hi,
Sorry for being slow in replying, I looked into match.pd and can
transform simpler cond_expr into minmax expr successfully, but this
one is more complicated.  It transforms 3 gimple statements into 2
result statements, but result of match&simplify pattern is an
expression.  How should I write the pattern outputing two gimple
statement as result?  Hmm, now I see the transform looks more like
gimple combine...

Thanks,
bin

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH GCC]New vectorization pattern turning cond_expr into max/min and plus/minus
  2016-10-20 16:32       ` Bin.Cheng
@ 2016-10-21  7:41         ` Richard Biener
  0 siblings, 0 replies; 6+ messages in thread
From: Richard Biener @ 2016-10-21  7:41 UTC (permalink / raw)
  To: Bin.Cheng; +Cc: gcc-patches

On Thu, Oct 20, 2016 at 6:32 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Wed, Oct 12, 2016 at 9:50 AM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Wed, Oct 12, 2016 at 10:29 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>> On Wed, Oct 12, 2016 at 9:12 AM, Richard Biener
>>> <richard.guenther@gmail.com> wrote:
>>>> On Tue, Oct 11, 2016 at 5:03 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>>> Hi,
>>>>> Given below test case,
>>>>> int foo (unsigned short a[], unsigned int x)
>>>>> {
>>>>>   unsigned int i;
>>>>>   for (i = 0; i < 1000; i++)
>>>>>     {
>>>>>       x = a[i];
>>>>>       a[i] = (unsigned short)(x >= 32768 ? x - 32768 : 0);
>>>>>     }
>>>>>   return x;
>>>>> }
>>>>>
>>>>> it now can be vectorized on AArch64, but generated assembly is way from optimal:
>>>>> .L4:
>>>>>         ldr     q4, [x3, x1]
>>>>>         add     w2, w2, 1
>>>>>         cmp     w2, w0
>>>>>         ushll   v1.4s, v4.4h, 0
>>>>>         ushll2  v0.4s, v4.8h, 0
>>>>>         add     v3.4s, v1.4s, v6.4s
>>>>>         add     v2.4s, v0.4s, v6.4s
>>>>>         cmhi    v1.4s, v1.4s, v5.4s
>>>>>         cmhi    v0.4s, v0.4s, v5.4s
>>>>>         and     v1.16b, v3.16b, v1.16b
>>>>>         and     v0.16b, v2.16b, v0.16b
>>>>>         xtn     v2.4h, v1.4s
>>>>>         xtn2    v2.8h, v0.4s
>>>>>         str     q2, [x3, x1]
>>>>>         add     x1, x1, 16
>>>>>         bcc     .L4
>>>>>
>>>>> The vectorized loop has 15 instructions, which can be greatly simplified by turning cond_expr into max_expr, as below:
>>>>> .L4:
>>>>>         ldr     q1, [x3, x1]
>>>>>         add     w2, w2, 1
>>>>>         cmp     w2, w0
>>>>>         umax    v0.8h, v1.8h, v2.8h
>>>>>         add     v0.8h, v0.8h, v2.8h
>>>>>         str     q0, [x3, x1]
>>>>>         add     x1, x1, 16
>>>>>         bcc     .L4
>>>>>
>>>>> This patch addresses the issue by adding new vectorization pattern.
>>>>> Bootstrap and test on x86_64 and AArch64.  Is it OK?
>>>>
>>>> So the COND_EXPRs are generated this way by if-conversion, right?  I
>>> Though ?: is used in source code, yes, it is if-conv regenerated COND_EXPR.
>>>> believe that
>>>> the MAX/MIN_EXPR form is always preferrable and thus it looks like if-conversion
>>>> might want to either directly generate it or make sure to fold the
>>>> introduced stmts
>>>> (and have a match.pd pattern catching this).
>>> Hmm, I also noticed saturation cases which should be better
>>> transformed before vectorization in scalar optimizers.  But this case
>>> is a bit different because there is additional computation involved
>>> other than type conversion.  We need to prove the computation can be
>>> done in either large or small types.  It is quite specific case and I
>>> don't see good (general) solution in if-conv.  Vect-pattern looks like
>>> a natural place doing this.  I am also looking at general saturation
>>> cases, but this one is different?
>>
>> (vect-patterns should go away ...)
>>
>> But as if-conversion results may also prevail for scalar code doing the
>> pattern in match.pd would be better - that is, "apply" the pattern
>> already during if-conversion.
>>
>> Yes, if-conversion fails to fold the stmts it generates (it only uses
>> generic folding on the trees it builds - it can need some TLC here).
> Hi,
> Sorry for being slow in replying, I looked into match.pd and can
> transform simpler cond_expr into minmax expr successfully, but this
> one is more complicated.  It transforms 3 gimple statements into 2
> result statements, but result of match&simplify pattern is an
> expression.  How should I write the pattern outputing two gimple
> statement as result?  Hmm, now I see the transform looks more like
> gimple combine...

You simply write a larger expression.  match&simplify happily
creates two gimple stmts from a (plus (minus @1 @2) @3) result.

Richard.

> Thanks,
> bin

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2016-10-21  7:41 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-11 15:03 [PATCH GCC]New vectorization pattern turning cond_expr into max/min and plus/minus Bin Cheng
2016-10-12  8:12 ` Richard Biener
2016-10-12  8:30   ` Bin.Cheng
2016-10-12  8:50     ` Richard Biener
2016-10-20 16:32       ` Bin.Cheng
2016-10-21  7:41         ` Richard Biener

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).