public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
@ 2013-10-31 11:26 Uros Bizjak
  2013-11-01  2:04 ` Cong Hou
  0 siblings, 1 reply; 27+ messages in thread
From: Uros Bizjak @ 2013-10-31 11:26 UTC (permalink / raw)
  To: gcc-patches; +Cc: Cong Hou

Hello!

> SAD (Sum of Absolute Differences) is a common and important algorithm
> in image processing and other areas. SSE2 even introduced a new
> instruction PSADBW for it. A SAD loop can be greatly accelerated by
> this instruction after being vectorized. This patch introduces a new
> operation SAD_EXPR and a SAD pattern recognizer in the vectorizer.
>
> In order to express this new operation, a new expression SAD_EXPR is
> introduced in tree.def, and the corresponding entry in optabs is
> added. The patch also adds the "define_expand" for SSE2 and AVX2
> platforms for i386.

+(define_expand "sadv16qi"
+  [(match_operand:V4SI 0 "register_operand")
+   (match_operand:V16QI 1 "register_operand")
+   (match_operand:V16QI 2 "register_operand")
+   (match_operand:V4SI 3 "register_operand")]
+  "TARGET_SSE2"
+{
+  rtx t1 = gen_reg_rtx (V2DImode);
+  rtx t2 = gen_reg_rtx (V4SImode);
+  emit_insn (gen_sse2_psadbw (t1, operands[1], operands[2]));
+  convert_move (t2, t1, 0);
+  emit_insn (gen_rtx_SET (VOIDmode, operands[0],
+                          gen_rtx_PLUS (V4SImode, operands[3], t2)));
+  DONE;
+})

Please use generic expanders (expand_simple_binop) to generate plus
expression. Also, please use nonimmediate_operand predicate for
operand 2 and operand 3.

Please note that nonimmediate operands should be passed as the second
input operand to commutative operators, to match their insn pattern
layout.

Uros.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
  2013-10-31 11:26 [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer Uros Bizjak
@ 2013-11-01  2:04 ` Cong Hou
  2013-11-01  7:43   ` Uros Bizjak
  2013-11-01 10:17   ` James Greenhalgh
  0 siblings, 2 replies; 27+ messages in thread
From: Cong Hou @ 2013-11-01  2:04 UTC (permalink / raw)
  To: Uros Bizjak, ramana.gcc, Richard Biener; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 25324 bytes --]

Based on your comments, I have made the following modifications to the patch:

1. Now the SAD pattern does not require the first and second operands to
be unsigned, and two versions (signed/unsigned) of the SAD optab are
defined: usad_optab and ssad_optab.

2. Use expand_simple_binop instead of gen_rtx_PLUS to generate the
plus expression in sse.md. Also change the predicates of the second and
third operands to nonimmediate_operand.

3. Add documentation for SAD_EXPR.

4. Verify the operands of SAD_EXPR.

5. Add a new effective target, vect_usad_char, and use it in the test case.

The updated patch is pasted below.

Thank you!


Cong


diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 8a38316..d528307 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,23 @@
+2013-10-29  Cong Hou  <congh@google.com>
+
+ * tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD
+ pattern recognition.
+ (type_conversion_p): Set PROMOTION to true if the conversion is a type
+ promotion and to false otherwise.  Return true if the given expression
+ is a type conversion.
+ * tree-vectorizer.h: Adjust the number of patterns.
+ * tree.def: Add SAD_EXPR.
+ * optabs.def: Add sad_optab.
+ * cfgexpand.c (expand_debug_expr): Add SAD_EXPR case.
+ * expr.c (expand_expr_real_2): Likewise.
+ * gimple-pretty-print.c (dump_ternary_rhs): Likewise.
+ * gimple.c (get_gimple_rhs_num_ops): Likewise.
+ * optabs.c (optab_for_tree_code): Likewise.
+ * tree-cfg.c (verify_gimple_assign_ternary): Likewise.
+ * tree-inline.c (estimate_operator_cost): Likewise.
+ * tree-ssa-operands.c (get_expr_operands): Likewise.
+ * tree-vect-loop.c (get_initial_def_for_reduction): Likewise.
+ * config/i386/sse.md: Add SSE2 and AVX2 expand for SAD.
+
 2013-10-14  David Malcolm  <dmalcolm@redhat.com>

  * dumpfile.h (gcc::dump_manager): New class, to hold state
diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
index 7ed29f5..9ec761a 100644
--- a/gcc/cfgexpand.c
+++ b/gcc/cfgexpand.c
@@ -2730,6 +2730,7 @@ expand_debug_expr (tree exp)
  {
  case COND_EXPR:
  case DOT_PROD_EXPR:
+ case SAD_EXPR:
  case WIDEN_MULT_PLUS_EXPR:
  case WIDEN_MULT_MINUS_EXPR:
  case FMA_EXPR:
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index c3f6c94..5b97576 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -6052,6 +6052,40 @@
   DONE;
 })

+(define_expand "usadv16qi"
+  [(match_operand:V4SI 0 "register_operand")
+   (match_operand:V16QI 1 "register_operand")
+   (match_operand:V16QI 2 "nonimmediate_operand")
+   (match_operand:V4SI 3 "nonimmediate_operand")]
+  "TARGET_SSE2"
+{
+  rtx t1 = gen_reg_rtx (V2DImode);
+  rtx t2 = gen_reg_rtx (V4SImode);
+  emit_insn (gen_sse2_psadbw (t1, operands[1], operands[2]));
+  convert_move (t2, t1, 0);
+  emit_insn (gen_rtx_SET (VOIDmode, operands[0],
+                          expand_simple_binop (V4SImode, PLUS, t2, operands[3],
+                                               NULL, 0, OPTAB_DIRECT)));
+  DONE;
+})
+
+(define_expand "usadv32qi"
+  [(match_operand:V8SI 0 "register_operand")
+   (match_operand:V32QI 1 "register_operand")
+   (match_operand:V32QI 2 "nonimmediate_operand")
+   (match_operand:V8SI 3 "nonimmediate_operand")]
+  "TARGET_AVX2"
+{
+  rtx t1 = gen_reg_rtx (V4DImode);
+  rtx t2 = gen_reg_rtx (V8SImode);
+  emit_insn (gen_avx2_psadbw (t1, operands[1], operands[2]));
+  convert_move (t2, t1, 0);
+  emit_insn (gen_rtx_SET (VOIDmode, operands[0],
+                          expand_simple_binop (V8SImode, PLUS, t2, operands[3],
+                                               NULL, 0, OPTAB_DIRECT)));
+  DONE;
+})
+
 (define_insn "ashr<mode>3"
   [(set (match_operand:VI24_AVX2 0 "register_operand" "=x,x")
  (ashiftrt:VI24_AVX2
diff --git a/gcc/doc/generic.texi b/gcc/doc/generic.texi
index ccecd6e..381ee09 100644
--- a/gcc/doc/generic.texi
+++ b/gcc/doc/generic.texi
@@ -1707,6 +1707,7 @@ its sole argument yields the representation for @code{ap}.
 @tindex VEC_PACK_TRUNC_EXPR
 @tindex VEC_PACK_SAT_EXPR
 @tindex VEC_PACK_FIX_TRUNC_EXPR
+@tindex SAD_EXPR

 @table @code
 @item VEC_LSHIFT_EXPR
@@ -1787,6 +1788,15 @@ value, it is taken from the second operand. It should never evaluate to
 any other value currently, but optimizations should not rely on that
 property. In contrast with a @code{COND_EXPR}, all operands are always
 evaluated.
+
+@item SAD_EXPR
+This node represents the Sum of Absolute Differences operation.  The three
+operands must be vectors of integral types.  The first and second operands
+must have the same type.  The size of the vector elements of the third
+operand must be at least twice the size of the vector elements of the
+first and second operands.  The SAD is calculated between the first and
+second operands, added to the third operand, and the result is returned.
+
 @end table


diff --git a/gcc/expr.c b/gcc/expr.c
index 4975a64..1db8a49 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -9026,6 +9026,20 @@ expand_expr_real_2 (sepops ops, rtx target,
enum machine_mode tmode,
  return target;
       }

+      case SAD_EXPR:
+      {
+ tree oprnd0 = treeop0;
+ tree oprnd1 = treeop1;
+ tree oprnd2 = treeop2;
+ rtx op2;
+
+ expand_operands (oprnd0, oprnd1, NULL_RTX, &op0, &op1, EXPAND_NORMAL);
+ op2 = expand_normal (oprnd2);
+ target = expand_widen_pattern_expr (ops, op0, op1, op2,
+    target, unsignedp);
+ return target;
+      }
+
     case REALIGN_LOAD_EXPR:
       {
         tree oprnd0 = treeop0;
diff --git a/gcc/gimple-pretty-print.c b/gcc/gimple-pretty-print.c
index f0f8166..514ddd1 100644
--- a/gcc/gimple-pretty-print.c
+++ b/gcc/gimple-pretty-print.c
@@ -425,6 +425,16 @@ dump_ternary_rhs (pretty_printer *buffer, gimple gs, int spc, int flags)
       dump_generic_node (buffer, gimple_assign_rhs3 (gs), spc, flags, false);
       pp_greater (buffer);
       break;
+
+    case SAD_EXPR:
+      pp_string (buffer, "SAD_EXPR <");
+      dump_generic_node (buffer, gimple_assign_rhs1 (gs), spc, flags, false);
+      pp_string (buffer, ", ");
+      dump_generic_node (buffer, gimple_assign_rhs2 (gs), spc, flags, false);
+      pp_string (buffer, ", ");
+      dump_generic_node (buffer, gimple_assign_rhs3 (gs), spc, flags, false);
+      pp_greater (buffer);
+      break;

     case VEC_PERM_EXPR:
       pp_string (buffer, "VEC_PERM_EXPR <");
diff --git a/gcc/gimple.c b/gcc/gimple.c
index a12dd67..4975959 100644
--- a/gcc/gimple.c
+++ b/gcc/gimple.c
@@ -2562,6 +2562,7 @@ get_gimple_rhs_num_ops (enum tree_code code)
       || (SYM) == WIDEN_MULT_PLUS_EXPR    \
       || (SYM) == WIDEN_MULT_MINUS_EXPR    \
       || (SYM) == DOT_PROD_EXPR    \
+      || (SYM) == SAD_EXPR    \
       || (SYM) == REALIGN_LOAD_EXPR    \
       || (SYM) == VEC_COND_EXPR    \
       || (SYM) == VEC_PERM_EXPR                                             \
diff --git a/gcc/optabs.c b/gcc/optabs.c
index 06a626c..16e8f4f 100644
--- a/gcc/optabs.c
+++ b/gcc/optabs.c
@@ -462,6 +462,9 @@ optab_for_tree_code (enum tree_code code, const_tree type,
     case DOT_PROD_EXPR:
       return TYPE_UNSIGNED (type) ? udot_prod_optab : sdot_prod_optab;

+    case SAD_EXPR:
+      return TYPE_UNSIGNED (type) ? usad_optab : ssad_optab;
+
     case WIDEN_MULT_PLUS_EXPR:
       return (TYPE_UNSIGNED (type)
       ? (TYPE_SATURATING (type)
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 6b924ac..377763e 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -248,6 +248,8 @@ OPTAB_D (sdot_prod_optab, "sdot_prod$I$a")
 OPTAB_D (ssum_widen_optab, "widen_ssum$I$a3")
 OPTAB_D (udot_prod_optab, "udot_prod$I$a")
 OPTAB_D (usum_widen_optab, "widen_usum$I$a3")
+OPTAB_D (usad_optab, "usad$I$a")
+OPTAB_D (ssad_optab, "ssad$I$a")
 OPTAB_D (vec_extract_optab, "vec_extract$a")
 OPTAB_D (vec_init_optab, "vec_init$a")
 OPTAB_D (vec_pack_sfix_trunc_optab, "vec_pack_sfix_trunc_$a")
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 075d071..1698912 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@
+2013-10-29  Cong Hou  <congh@google.com>
+
+ * gcc.dg/vect/vect-reduc-sad.c: New.
+ * lib/target-supports.exp (check_effective_target_vect_usad_char): New.
+
 2013-10-14  Tobias Burnus  <burnus@net-b.de>

  PR fortran/58658
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c
new file mode 100644
index 0000000..15a625f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c
@@ -0,0 +1,54 @@
+/* { dg-require-effective-target vect_usad_char } */
+
+#include <stdarg.h>
+#include "tree-vect.h"
+
+#define N 64
+#define SAD N*N/2
+
+unsigned char X[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+unsigned char Y[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+
+/* Sum of absolute differences between arrays of unsigned char types.
+   Detected as a sad pattern.
+   Vectorized on targets that support sad for unsigned chars.  */
+
+__attribute__ ((noinline)) int
+foo (int len)
+{
+  int i;
+  int result = 0;
+
+  for (i = 0; i < len; i++)
+    result += abs (X[i] - Y[i]);
+
+  return result;
+}
+
+
+int
+main (void)
+{
+  int i;
+  int sad;
+
+  check_vect ();
+
+  for (i = 0; i < N; i++)
+    {
+      X[i] = i;
+      Y[i] = N - i;
+      __asm__ volatile ("");
+    }
+
+  sad = foo (N);
+  if (sad != SAD)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vect_recog_sad_pattern: detected" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index 7eb4dfe..01ee6f2 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -3672,6 +3672,26 @@ proc check_effective_target_vect_udot_hi { } {
     return $et_vect_udot_hi_saved
 }

+# Return 1 if the target plus current options supports a vector
+# sad operation of unsigned chars, 0 otherwise.
+#
+# This won't change for different subtargets so cache the result.
+
+proc check_effective_target_vect_usad_char { } {
+    global et_vect_usad_char
+
+    if [info exists et_vect_usad_char_saved] {
+        verbose "check_effective_target_vect_usad_char: using cached result" 2
+    } else {
+        set et_vect_usad_char_saved 0
+        if { ([istarget i?86-*-*]
+             || [istarget x86_64-*-*]) } {
+            set et_vect_usad_char_saved 1
+        }
+    }
+    verbose "check_effective_target_vect_usad_char: returning $et_vect_usad_char_saved" 2
+    return $et_vect_usad_char_saved
+}

 # Return 1 if the target plus current options supports a vector
 # demotion (packing) of shorts (to chars) and ints (to shorts)
diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c
index 8b66791..c8f3d33 100644
--- a/gcc/tree-cfg.c
+++ b/gcc/tree-cfg.c
@@ -3796,6 +3796,36 @@ verify_gimple_assign_ternary (gimple stmt)

       return false;

+    case SAD_EXPR:
+      if (!useless_type_conversion_p (rhs1_type, rhs2_type)
+  || !useless_type_conversion_p (lhs_type, rhs3_type)
+  || 2 * GET_MODE_BITSIZE (GET_MODE_INNER
+     (TYPE_MODE (TREE_TYPE (rhs1_type))))
+       > GET_MODE_BITSIZE (GET_MODE_INNER
+     (TYPE_MODE (TREE_TYPE (lhs_type)))))
+ {
+  error ("type mismatch in sad expression");
+  debug_generic_expr (lhs_type);
+  debug_generic_expr (rhs1_type);
+  debug_generic_expr (rhs2_type);
+  debug_generic_expr (rhs3_type);
+  return true;
+ }
+
+      if (TREE_CODE (rhs1_type) != VECTOR_TYPE
+  || TREE_CODE (rhs2_type) != VECTOR_TYPE
+  || TREE_CODE (rhs3_type) != VECTOR_TYPE)
+ {
+  error ("vector types expected in sad expression");
+  debug_generic_expr (lhs_type);
+  debug_generic_expr (rhs1_type);
+  debug_generic_expr (rhs2_type);
+  debug_generic_expr (rhs3_type);
+  return true;
+ }
+
+      return false;
+
     case DOT_PROD_EXPR:
     case REALIGN_LOAD_EXPR:
       /* FIXME.  */
diff --git a/gcc/tree-inline.c b/gcc/tree-inline.c
index 2221b9c..44261a3 100644
--- a/gcc/tree-inline.c
+++ b/gcc/tree-inline.c
@@ -3601,6 +3601,7 @@ estimate_operator_cost (enum tree_code code, eni_weights *weights,
     case WIDEN_SUM_EXPR:
     case WIDEN_MULT_EXPR:
     case DOT_PROD_EXPR:
+    case SAD_EXPR:
     case WIDEN_MULT_PLUS_EXPR:
     case WIDEN_MULT_MINUS_EXPR:
     case WIDEN_LSHIFT_EXPR:
diff --git a/gcc/tree-ssa-operands.c b/gcc/tree-ssa-operands.c
index 603f797..393efc3 100644
--- a/gcc/tree-ssa-operands.c
+++ b/gcc/tree-ssa-operands.c
@@ -854,6 +854,7 @@ get_expr_operands (gimple stmt, tree *expr_p, int flags)
       }

     case DOT_PROD_EXPR:
+    case SAD_EXPR:
     case REALIGN_LOAD_EXPR:
     case WIDEN_MULT_PLUS_EXPR:
     case WIDEN_MULT_MINUS_EXPR:
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 638b981..89aa8c7 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -3607,6 +3607,7 @@ get_initial_def_for_reduction (gimple stmt, tree init_val,
     {
       case WIDEN_SUM_EXPR:
       case DOT_PROD_EXPR:
+      case SAD_EXPR:
       case PLUS_EXPR:
       case MINUS_EXPR:
       case BIT_IOR_EXPR:
diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
index 0a4e812..58a6666 100644
--- a/gcc/tree-vect-patterns.c
+++ b/gcc/tree-vect-patterns.c
@@ -45,6 +45,8 @@ static gimple vect_recog_widen_mult_pattern (vec<gimple> *, tree *,
      tree *);
 static gimple vect_recog_dot_prod_pattern (vec<gimple> *, tree *,
    tree *);
+static gimple vect_recog_sad_pattern (vec<gimple> *, tree *,
+      tree *);
 static gimple vect_recog_pow_pattern (vec<gimple> *, tree *, tree *);
 static gimple vect_recog_over_widening_pattern (vec<gimple> *, tree *,
                                                  tree *);
@@ -62,6 +64,7 @@ static vect_recog_func_ptr vect_vect_recog_func_ptrs[NUM_PATTERNS] = {
  vect_recog_widen_mult_pattern,
  vect_recog_widen_sum_pattern,
  vect_recog_dot_prod_pattern,
+        vect_recog_sad_pattern,
  vect_recog_pow_pattern,
  vect_recog_widen_shift_pattern,
  vect_recog_over_widening_pattern,
@@ -140,9 +143,8 @@ vect_single_imm_use (gimple def_stmt)
 }

 /* Check whether NAME, an ssa-name used in USE_STMT,
-   is a result of a type promotion or demotion, such that:
+   is a result of a type promotion, such that:
      DEF_STMT: NAME = NOP (name0)
-   where the type of name0 (ORIG_TYPE) is smaller/bigger than the type of NAME.
    If CHECK_SIGN is TRUE, check that either both types are signed or both are
    unsigned.  */

@@ -189,10 +191,8 @@ type_conversion_p (tree name, gimple use_stmt, bool check_sign,

   if (TYPE_PRECISION (type) >= (TYPE_PRECISION (*orig_type) * 2))
     *promotion = true;
-  else if (TYPE_PRECISION (*orig_type) >= (TYPE_PRECISION (type) * 2))
-    *promotion = false;
   else
-    return false;
+    *promotion = false;

   if (!vect_is_simple_use (oprnd0, *def_stmt, loop_vinfo,
    bb_vinfo, &dummy_gimple, &dummy, &dt))
@@ -433,6 +433,240 @@ vect_recog_dot_prod_pattern (vec<gimple> *stmts, tree *type_in,
 }


+/* Function vect_recog_sad_pattern
+
+   Try to find the following Sum of Absolute Difference (SAD) pattern:
+
+     type x_t, y_t;
+     signed TYPE1 diff, abs_diff;
+     TYPE2 sum = init;
+   loop:
+     sum_0 = phi <init, sum_1>
+     S1  x_t = ...
+     S2  y_t = ...
+     S3  x_T = (TYPE1) x_t;
+     S4  y_T = (TYPE1) y_t;
+     S5  diff = x_T - y_T;
+     S6  abs_diff = ABS_EXPR <diff>;
+     [S7  abs_diff = (TYPE2) abs_diff;  #optional]
+     S8  sum_1 = abs_diff + sum_0;
+
+   where 'TYPE1' is at least double the size of type 'type', and 'TYPE2' is
+   the same size as 'TYPE1' or bigger.  This is a special case of a reduction
+   computation.
+
+   Input:
+
+   * STMTS: Contains a stmt from which the pattern search begins.  In the
+   example, when this function is called with S8, the pattern
+   {S3,S4,S5,S6,S7,S8} will be detected.
+
+   Output:
+
+   * TYPE_IN: The type of the input arguments to the pattern.
+
+   * TYPE_OUT: The type of the output of this pattern.
+
+   * Return value: A new stmt that will be used to replace the sequence of
+   stmts that constitute the pattern. In this case it will be:
+        SAD_EXPR <x_t, y_t, sum_0>
+  */
+
+static gimple
+vect_recog_sad_pattern (vec<gimple> *stmts, tree *type_in,
+     tree *type_out)
+{
+  gimple last_stmt = (*stmts)[0];
+  tree sad_oprnd0, sad_oprnd1;
+  stmt_vec_info stmt_vinfo = vinfo_for_stmt (last_stmt);
+  tree half_type;
+  loop_vec_info loop_info = STMT_VINFO_LOOP_VINFO (stmt_vinfo);
+  struct loop *loop;
+  bool promotion;
+
+  if (!loop_info)
+    return NULL;
+
+  loop = LOOP_VINFO_LOOP (loop_info);
+
+  if (!is_gimple_assign (last_stmt))
+    return NULL;
+
+  tree sum_type = gimple_expr_type (last_stmt);
+
+  /* Look for the following pattern
+          DX = (TYPE1) X;
+          DY = (TYPE1) Y;
+          DDIFF = DX - DY;
+          DAD = ABS_EXPR <DDIFF>;
+          DAD = (TYPE2) DAD;   #optional widening
+          sum_1 = DAD + sum_0;
+     In which
+     - DX is at least double the size of X
+     - DY is at least double the size of Y
+     - DX, DY, DDIFF, DAD all have the same type
+     - sum is the same size as DAD or bigger
+     - sum has been recognized as a reduction variable.
+
+     This is equivalent to:
+       DDIFF = X w- Y;          #widen sub
+       DAD = ABS_EXPR <DDIFF>;
+       sum_1 = DAD w+ sum_0;    #widen summation
+     or
+       DDIFF = X w- Y;          #widen sub
+       DAD = ABS_EXPR <DDIFF>;
+       sum_1 = DAD + sum_0;     #summation
+   */
+
+  /* Starting from LAST_STMT, follow the defs of its uses in search
+     of the above pattern.  */
+
+  if (gimple_assign_rhs_code (last_stmt) != PLUS_EXPR)
+    return NULL;
+
+  tree plus_oprnd0, plus_oprnd1;
+
+  if (STMT_VINFO_IN_PATTERN_P (stmt_vinfo))
+    {
+      /* Has been detected as widening-summation?  */
+
+      gimple stmt = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+      sum_type = gimple_expr_type (stmt);
+      if (gimple_assign_rhs_code (stmt) != WIDEN_SUM_EXPR)
+        return NULL;
+      plus_oprnd0 = gimple_assign_rhs1 (stmt);
+      plus_oprnd1 = gimple_assign_rhs2 (stmt);
+      half_type = TREE_TYPE (plus_oprnd0);
+    }
+  else
+    {
+      gimple def_stmt;
+
+      if (STMT_VINFO_DEF_TYPE (stmt_vinfo) != vect_reduction_def)
+        return NULL;
+      plus_oprnd0 = gimple_assign_rhs1 (last_stmt);
+      plus_oprnd1 = gimple_assign_rhs2 (last_stmt);
+      if (!types_compatible_p (TREE_TYPE (plus_oprnd0), sum_type)
+  || !types_compatible_p (TREE_TYPE (plus_oprnd1), sum_type))
+        return NULL;
+
+      /* The type conversion could be promotion, demotion,
+         or just signed -> unsigned.  */
+      if (type_conversion_p (plus_oprnd0, last_stmt, false,
+                             &half_type, &def_stmt, &promotion))
+        plus_oprnd0 = gimple_assign_rhs1 (def_stmt);
+      else
+        half_type = sum_type;
+    }
+
+  /* So far so good.  Since last_stmt was detected as a (summation) reduction,
+     we know that plus_oprnd1 is the reduction variable (defined by a loop-header
+     phi), and plus_oprnd0 is an ssa-name defined by a stmt in the loop body.
+     Then check that plus_oprnd0 is defined by an abs_expr  */
+
+  if (TREE_CODE (plus_oprnd0) != SSA_NAME)
+    return NULL;
+
+  tree abs_type = half_type;
+  gimple abs_stmt = SSA_NAME_DEF_STMT (plus_oprnd0);
+
+  /* It could not be the sad pattern if the abs_stmt is outside the loop.  */
+  if (!gimple_bb (abs_stmt) || !flow_bb_inside_loop_p (loop, gimple_bb (abs_stmt)))
+    return NULL;
+
+  /* FORNOW.  Can continue analyzing the def-use chain when this stmt is a phi
+     inside the loop (in case we are analyzing an outer-loop).  */
+  if (!is_gimple_assign (abs_stmt))
+    return NULL;
+
+  stmt_vec_info abs_stmt_vinfo = vinfo_for_stmt (abs_stmt);
+  gcc_assert (abs_stmt_vinfo);
+  if (STMT_VINFO_DEF_TYPE (abs_stmt_vinfo) != vect_internal_def)
+    return NULL;
+  if (gimple_assign_rhs_code (abs_stmt) != ABS_EXPR)
+    return NULL;
+
+  tree abs_oprnd = gimple_assign_rhs1 (abs_stmt);
+  if (!types_compatible_p (TREE_TYPE (abs_oprnd), abs_type))
+    return NULL;
+  if (TYPE_UNSIGNED (abs_type))
+    return NULL;
+
+  /* We then detect if the operand of abs_expr is defined by a minus_expr.  */
+
+  if (TREE_CODE (abs_oprnd) != SSA_NAME)
+    return NULL;
+
+  gimple diff_stmt = SSA_NAME_DEF_STMT (abs_oprnd);
+
+  /* It could not be the sad pattern if the diff_stmt is outside the loop.  */
+  if (!gimple_bb (diff_stmt)
+      || !flow_bb_inside_loop_p (loop, gimple_bb (diff_stmt)))
+    return NULL;
+
+  /* FORNOW.  Can continue analyzing the def-use chain when this stmt is a phi
+     inside the loop (in case we are analyzing an outer-loop).  */
+  if (!is_gimple_assign (diff_stmt))
+    return NULL;
+
+  stmt_vec_info diff_stmt_vinfo = vinfo_for_stmt (diff_stmt);
+  gcc_assert (diff_stmt_vinfo);
+  if (STMT_VINFO_DEF_TYPE (diff_stmt_vinfo) != vect_internal_def)
+    return NULL;
+  if (gimple_assign_rhs_code (diff_stmt) != MINUS_EXPR)
+    return NULL;
+
+  tree half_type0, half_type1;
+  gimple def_stmt;
+
+  tree minus_oprnd0 = gimple_assign_rhs1 (diff_stmt);
+  tree minus_oprnd1 = gimple_assign_rhs2 (diff_stmt);
+
+  if (!types_compatible_p (TREE_TYPE (minus_oprnd0), abs_type)
+      || !types_compatible_p (TREE_TYPE (minus_oprnd1), abs_type))
+    return NULL;
+  if (!type_conversion_p (minus_oprnd0, diff_stmt, false,
+                          &half_type0, &def_stmt, &promotion)
+      || !promotion)
+    return NULL;
+  sad_oprnd0 = gimple_assign_rhs1 (def_stmt);
+
+  if (!type_conversion_p (minus_oprnd1, diff_stmt, false,
+                          &half_type1, &def_stmt, &promotion)
+      || !promotion)
+    return NULL;
+  sad_oprnd1 = gimple_assign_rhs1 (def_stmt);
+
+  if (!types_compatible_p (half_type0, half_type1))
+    return NULL;
+  if (TYPE_PRECISION (abs_type) < TYPE_PRECISION (half_type0) * 2
+      || TYPE_PRECISION (sum_type) < TYPE_PRECISION (half_type0) * 2)
+    return NULL;
+
+  *type_in = TREE_TYPE (sad_oprnd0);
+  *type_out = sum_type;
+
+  /* Pattern detected. Create a stmt to be used to replace the pattern: */
+  tree var = vect_recog_temp_ssa_var (sum_type, NULL);
+  gimple pattern_stmt = gimple_build_assign_with_ops
+                          (SAD_EXPR, var, sad_oprnd0, sad_oprnd1, plus_oprnd1);
+
+  if (dump_enabled_p ())
+    {
+      dump_printf_loc (MSG_NOTE, vect_location,
+                       "vect_recog_sad_pattern: detected: ");
+      dump_gimple_stmt (MSG_NOTE, TDF_SLIM, pattern_stmt, 0);
+      dump_printf (MSG_NOTE, "\n");
+    }
+
+  /* We don't allow changing the order of the computation in the inner-loop
+     when doing outer-loop vectorization.  */
+  gcc_assert (!nested_in_vect_loop_p (loop, last_stmt));
+
+  return pattern_stmt;
+}
+
+
 /* Handle widening operation by a constant.  At the moment we support MULT_EXPR
    and LSHIFT_EXPR.

diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 8b7b345..0aac75b 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1044,7 +1044,7 @@ extern void vect_slp_transform_bb (basic_block);
    Additional pattern recognition functions can (and will) be added
    in the future.  */
 typedef gimple (* vect_recog_func_ptr) (vec<gimple> *, tree *, tree *);
-#define NUM_PATTERNS 11
+#define NUM_PATTERNS 12
 void vect_pattern_recog (loop_vec_info, bb_vec_info);

 /* In tree-vectorizer.c.  */
diff --git a/gcc/tree.def b/gcc/tree.def
index 88c850a..e15ee61 100644
--- a/gcc/tree.def
+++ b/gcc/tree.def
@@ -1155,6 +1155,12 @@ DEFTREECODE (DOT_PROD_EXPR, "dot_prod_expr", tcc_expression, 3)
    with the second argument.  */
 DEFTREECODE (WIDEN_SUM_EXPR, "widen_sum_expr", tcc_binary, 2)

+/* Widening sad (sum of absolute differences).
+   The first two arguments are of type t1 which should be integer.
+   The third argument and the result are of type t2, such that t2 is at least
+   twice the size of t1.  */
+DEFTREECODE (SAD_EXPR, "sad_expr", tcc_expression, 3)
+
 /* Widening multiplication.
    The two arguments are of type t1.
    The result is of type t2, such that t2 is at least twice

On Thu, Oct 31, 2013 at 3:34 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> Hello!
>
>> SAD (Sum of Absolute Differences) is a common and important algorithm
>> in image processing and other areas. SSE2 even introduced a new
>> instruction PSADBW for it. A SAD loop can be greatly accelerated by
>> this instruction after being vectorized. This patch introduces a new
>> operation SAD_EXPR and a SAD pattern recognizer in the vectorizer.
>>
>> In order to express this new operation, a new expression SAD_EXPR is
>> introduced in tree.def, and the corresponding entry in optabs is
>> added. The patch also adds the "define_expand" for SSE2 and AVX2
>> platforms for i386.
>
> +(define_expand "sadv16qi"
> +  [(match_operand:V4SI 0 "register_operand")
> +   (match_operand:V16QI 1 "register_operand")
> +   (match_operand:V16QI 2 "register_operand")
> +   (match_operand:V4SI 3 "register_operand")]
> +  "TARGET_SSE2"
> +{
> +  rtx t1 = gen_reg_rtx (V2DImode);
> +  rtx t2 = gen_reg_rtx (V4SImode);
> +  emit_insn (gen_sse2_psadbw (t1, operands[1], operands[2]));
> +  convert_move (t2, t1, 0);
> +  emit_insn (gen_rtx_SET (VOIDmode, operands[0],
> +                          gen_rtx_PLUS (V4SImode, operands[3], t2)));
> +  DONE;
> +})
>
> Please use generic expanders (expand_simple_binop) to generate plus
> expression. Also, please use nonimmediate_operand predicate for
> operand 2 and operand 3.
>
> Please note that nonimmediate operands should be passed as the second
> input operand to commutative operators, to match their insn pattern
> layout.
>
> Uros.

[-- Attachment #2: patch-sad.txt --]
[-- Type: text/plain, Size: 23279 bytes --]

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 8a38316..d528307 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,23 @@
+2013-10-29  Cong Hou  <congh@google.com>
+
+	* tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD
+	pattern recognition.
+	(type_conversion_p): Set PROMOTION to true if the conversion is a type
+	promotion and to false otherwise.  Return true if the given expression
+	is a type conversion.
+	* tree-vectorizer.h: Adjust the number of patterns.
+	* tree.def: Add SAD_EXPR.
+	* optabs.def: Add sad_optab.
+	* cfgexpand.c (expand_debug_expr): Add SAD_EXPR case.
+	* expr.c (expand_expr_real_2): Likewise.
+	* gimple-pretty-print.c (dump_ternary_rhs): Likewise.
+	* gimple.c (get_gimple_rhs_num_ops): Likewise.
+	* optabs.c (optab_for_tree_code): Likewise.
+	* tree-cfg.c (verify_gimple_assign_ternary): Likewise.
+	* tree-inline.c (estimate_operator_cost): Likewise.
+	* tree-ssa-operands.c (get_expr_operands): Likewise.
+	* tree-vect-loop.c (get_initial_def_for_reduction): Likewise.
+	* config/i386/sse.md: Add SSE2 and AVX2 expand for SAD.
+
 2013-10-14  David Malcolm  <dmalcolm@redhat.com>
 
 	* dumpfile.h (gcc::dump_manager): New class, to hold state
diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
index 7ed29f5..9ec761a 100644
--- a/gcc/cfgexpand.c
+++ b/gcc/cfgexpand.c
@@ -2730,6 +2730,7 @@ expand_debug_expr (tree exp)
 	{
 	case COND_EXPR:
 	case DOT_PROD_EXPR:
+	case SAD_EXPR:
 	case WIDEN_MULT_PLUS_EXPR:
 	case WIDEN_MULT_MINUS_EXPR:
 	case FMA_EXPR:
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index c3f6c94..5b97576 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -6052,6 +6052,40 @@
   DONE;
 })
 
+(define_expand "usadv16qi"
+  [(match_operand:V4SI 0 "register_operand")
+   (match_operand:V16QI 1 "register_operand")
+   (match_operand:V16QI 2 "nonimmediate_operand")
+   (match_operand:V4SI 3 "nonimmediate_operand")]
+  "TARGET_SSE2"
+{
+  rtx t1 = gen_reg_rtx (V2DImode);
+  rtx t2 = gen_reg_rtx (V4SImode);
+  emit_insn (gen_sse2_psadbw (t1, operands[1], operands[2]));
+  convert_move (t2, t1, 0);
+  emit_insn (gen_rtx_SET (VOIDmode, operands[0],
+			  expand_simple_binop (V4SImode, PLUS, t2, operands[3],
+					       NULL, 0, OPTAB_DIRECT)));
+  DONE;
+})
+
+(define_expand "usadv32qi"
+  [(match_operand:V8SI 0 "register_operand")
+   (match_operand:V32QI 1 "register_operand")
+   (match_operand:V32QI 2 "nonimmediate_operand")
+   (match_operand:V8SI 3 "nonimmediate_operand")]
+  "TARGET_AVX2"
+{
+  rtx t1 = gen_reg_rtx (V4DImode);
+  rtx t2 = gen_reg_rtx (V8SImode);
+  emit_insn (gen_avx2_psadbw (t1, operands[1], operands[2]));
+  convert_move (t2, t1, 0);
+  emit_insn (gen_rtx_SET (VOIDmode, operands[0],
+			  expand_simple_binop (V8SImode, PLUS, t2, operands[3],
+					       NULL, 0, OPTAB_DIRECT)));
+  DONE;
+})
+
 (define_insn "ashr<mode>3"
   [(set (match_operand:VI24_AVX2 0 "register_operand" "=x,x")
 	(ashiftrt:VI24_AVX2
diff --git a/gcc/doc/generic.texi b/gcc/doc/generic.texi
index ccecd6e..381ee09 100644
--- a/gcc/doc/generic.texi
+++ b/gcc/doc/generic.texi
@@ -1707,6 +1707,7 @@ its sole argument yields the representation for @code{ap}.
 @tindex VEC_PACK_TRUNC_EXPR
 @tindex VEC_PACK_SAT_EXPR
 @tindex VEC_PACK_FIX_TRUNC_EXPR
+@tindex SAD_EXPR
 
 @table @code
 @item VEC_LSHIFT_EXPR
@@ -1787,6 +1788,15 @@ value, it is taken from the second operand. It should never evaluate to
 any other value currently, but optimizations should not rely on that
 property. In contrast with a @code{COND_EXPR}, all operands are always
 evaluated.
+
+@item SAD_EXPR
+This node represents the Sum of Absolute Differences operation.  The three
+operands must be vectors of integral types.  The first and second operands
+must have the same type.  The size of the vector elements of the third
+operand must be at least twice the size of the vector elements of the
+first and second operands.  The SAD is calculated between the first and
+second operands, added to the third operand, and the result is returned.
+
 @end table
 
 
diff --git a/gcc/expr.c b/gcc/expr.c
index 4975a64..1db8a49 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -9026,6 +9026,20 @@ expand_expr_real_2 (sepops ops, rtx target, enum machine_mode tmode,
 	return target;
       }
 
+      case SAD_EXPR:
+      {
+	tree oprnd0 = treeop0;
+	tree oprnd1 = treeop1;
+	tree oprnd2 = treeop2;
+	rtx op2;
+
+	expand_operands (oprnd0, oprnd1, NULL_RTX, &op0, &op1, EXPAND_NORMAL);
+	op2 = expand_normal (oprnd2);
+	target = expand_widen_pattern_expr (ops, op0, op1, op2,
+					    target, unsignedp);
+	return target;
+      }
+
     case REALIGN_LOAD_EXPR:
       {
         tree oprnd0 = treeop0;
diff --git a/gcc/gimple-pretty-print.c b/gcc/gimple-pretty-print.c
index f0f8166..514ddd1 100644
--- a/gcc/gimple-pretty-print.c
+++ b/gcc/gimple-pretty-print.c
@@ -425,6 +425,16 @@ dump_ternary_rhs (pretty_printer *buffer, gimple gs, int spc, int flags)
       dump_generic_node (buffer, gimple_assign_rhs3 (gs), spc, flags, false);
       pp_greater (buffer);
       break;
+
+    case SAD_EXPR:
+      pp_string (buffer, "SAD_EXPR <");
+      dump_generic_node (buffer, gimple_assign_rhs1 (gs), spc, flags, false);
+      pp_string (buffer, ", ");
+      dump_generic_node (buffer, gimple_assign_rhs2 (gs), spc, flags, false);
+      pp_string (buffer, ", ");
+      dump_generic_node (buffer, gimple_assign_rhs3 (gs), spc, flags, false);
+      pp_greater (buffer);
+      break;
     
     case VEC_PERM_EXPR:
       pp_string (buffer, "VEC_PERM_EXPR <");
diff --git a/gcc/gimple.c b/gcc/gimple.c
index a12dd67..4975959 100644
--- a/gcc/gimple.c
+++ b/gcc/gimple.c
@@ -2562,6 +2562,7 @@ get_gimple_rhs_num_ops (enum tree_code code)
       || (SYM) == WIDEN_MULT_PLUS_EXPR					    \
       || (SYM) == WIDEN_MULT_MINUS_EXPR					    \
       || (SYM) == DOT_PROD_EXPR						    \
+      || (SYM) == SAD_EXPR						    \
       || (SYM) == REALIGN_LOAD_EXPR					    \
       || (SYM) == VEC_COND_EXPR						    \
       || (SYM) == VEC_PERM_EXPR                                             \
diff --git a/gcc/optabs.c b/gcc/optabs.c
index 06a626c..16e8f4f 100644
--- a/gcc/optabs.c
+++ b/gcc/optabs.c
@@ -462,6 +462,9 @@ optab_for_tree_code (enum tree_code code, const_tree type,
     case DOT_PROD_EXPR:
       return TYPE_UNSIGNED (type) ? udot_prod_optab : sdot_prod_optab;
 
+    case SAD_EXPR:
+      return TYPE_UNSIGNED (type) ? usad_optab : ssad_optab;
+
     case WIDEN_MULT_PLUS_EXPR:
       return (TYPE_UNSIGNED (type)
 	      ? (TYPE_SATURATING (type)
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 6b924ac..377763e 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -248,6 +248,8 @@ OPTAB_D (sdot_prod_optab, "sdot_prod$I$a")
 OPTAB_D (ssum_widen_optab, "widen_ssum$I$a3")
 OPTAB_D (udot_prod_optab, "udot_prod$I$a")
 OPTAB_D (usum_widen_optab, "widen_usum$I$a3")
+OPTAB_D (usad_optab, "usad$I$a")
+OPTAB_D (ssad_optab, "ssad$I$a")
 OPTAB_D (vec_extract_optab, "vec_extract$a")
 OPTAB_D (vec_init_optab, "vec_init$a")
 OPTAB_D (vec_pack_sfix_trunc_optab, "vec_pack_sfix_trunc_$a")
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 075d071..1698912 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@
+2013-10-29  Cong Hou  <congh@google.com>
+
+	* gcc.dg/vect/vect-reduc-sad.c: New.
+	* lib/target-supports.exp (check_effective_target_vect_usad_char): New.
+
 2013-10-14  Tobias Burnus  <burnus@net-b.de>
 
 	PR fortran/58658
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c
new file mode 100644
index 0000000..15a625f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c
@@ -0,0 +1,54 @@
+/* { dg-require-effective-target vect_usad_char } */
+
+#include <stdarg.h>
+#include "tree-vect.h"
+
+#define N 64
+#define SAD N*N/2
+
+unsigned char X[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+unsigned char Y[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+
+/* Sum of absolute differences between arrays of unsigned char types.
+   Detected as a sad pattern.
+   Vectorized on targets that support sad for unsigned chars.  */
+
+__attribute__ ((noinline)) int
+foo (int len)
+{
+  int i;
+  int result = 0;
+
+  for (i = 0; i < len; i++)
+    result += abs (X[i] - Y[i]);
+
+  return result;
+}
+
+
+int
+main (void)
+{
+  int i;
+  int sad;
+
+  check_vect ();
+
+  for (i = 0; i < N; i++)
+    {
+      X[i] = i;
+      Y[i] = N - i;
+      __asm__ volatile ("");
+    }
+
+  sad = foo (N);
+  if (sad != SAD)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vect_recog_sad_pattern: detected" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index 7eb4dfe..01ee6f2 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -3672,6 +3672,26 @@ proc check_effective_target_vect_udot_hi { } {
     return $et_vect_udot_hi_saved
 }
 
+# Return 1 if the target plus current options supports a vector
+# sad operation of unsigned chars, 0 otherwise.
+#
+# This won't change for different subtargets so cache the result.
+
+proc check_effective_target_vect_usad_char { } {
+    global et_vect_usad_char
+
+    if [info exists et_vect_usad_char_saved] {
+        verbose "check_effective_target_vect_usad_char: using cached result" 2
+    } else {
+        set et_vect_usad_char_saved 0
+        if { ([istarget i?86-*-*]
+             || [istarget x86_64-*-*]) } {
+            set et_vect_usad_char_saved 1
+        }
+    }
+    verbose "check_effective_target_vect_usad_char: returning $et_vect_usad_char_saved" 2
+    return $et_vect_usad_char_saved
+}
 
 # Return 1 if the target plus current options supports a vector
 # demotion (packing) of shorts (to chars) and ints (to shorts) 
diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c
index 8b66791..c8f3d33 100644
--- a/gcc/tree-cfg.c
+++ b/gcc/tree-cfg.c
@@ -3796,6 +3796,36 @@ verify_gimple_assign_ternary (gimple stmt)
 
       return false;
 
+    case SAD_EXPR:
+      if (!useless_type_conversion_p (rhs1_type, rhs2_type)
+	  || !useless_type_conversion_p (lhs_type, rhs3_type)
+	  || 2 * GET_MODE_BITSIZE (GET_MODE_INNER
+				     (TYPE_MODE (TREE_TYPE (rhs1_type))))
+	       > GET_MODE_BITSIZE (GET_MODE_INNER
+				     (TYPE_MODE (TREE_TYPE (lhs_type)))))
+	{
+	  error ("type mismatch in sad expression");
+	  debug_generic_expr (lhs_type);
+	  debug_generic_expr (rhs1_type);
+	  debug_generic_expr (rhs2_type);
+	  debug_generic_expr (rhs3_type);
+	  return true;
+	}
+
+      if (TREE_CODE (rhs1_type) != VECTOR_TYPE
+	  || TREE_CODE (rhs2_type) != VECTOR_TYPE
+	  || TREE_CODE (rhs3_type) != VECTOR_TYPE)
+	{
+	  error ("vector types expected in sad expression");
+	  debug_generic_expr (lhs_type);
+	  debug_generic_expr (rhs1_type);
+	  debug_generic_expr (rhs2_type);
+	  debug_generic_expr (rhs3_type);
+	  return true;
+	}
+
+      return false;
+
     case DOT_PROD_EXPR:
     case REALIGN_LOAD_EXPR:
       /* FIXME.  */
diff --git a/gcc/tree-inline.c b/gcc/tree-inline.c
index 2221b9c..44261a3 100644
--- a/gcc/tree-inline.c
+++ b/gcc/tree-inline.c
@@ -3601,6 +3601,7 @@ estimate_operator_cost (enum tree_code code, eni_weights *weights,
     case WIDEN_SUM_EXPR:
     case WIDEN_MULT_EXPR:
     case DOT_PROD_EXPR:
+    case SAD_EXPR:
     case WIDEN_MULT_PLUS_EXPR:
     case WIDEN_MULT_MINUS_EXPR:
     case WIDEN_LSHIFT_EXPR:
diff --git a/gcc/tree-ssa-operands.c b/gcc/tree-ssa-operands.c
index 603f797..393efc3 100644
--- a/gcc/tree-ssa-operands.c
+++ b/gcc/tree-ssa-operands.c
@@ -854,6 +854,7 @@ get_expr_operands (gimple stmt, tree *expr_p, int flags)
       }
 
     case DOT_PROD_EXPR:
+    case SAD_EXPR:
     case REALIGN_LOAD_EXPR:
     case WIDEN_MULT_PLUS_EXPR:
     case WIDEN_MULT_MINUS_EXPR:
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 638b981..89aa8c7 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -3607,6 +3607,7 @@ get_initial_def_for_reduction (gimple stmt, tree init_val,
     {
       case WIDEN_SUM_EXPR:
       case DOT_PROD_EXPR:
+      case SAD_EXPR:
       case PLUS_EXPR:
       case MINUS_EXPR:
       case BIT_IOR_EXPR:
diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
index 0a4e812..58a6666 100644
--- a/gcc/tree-vect-patterns.c
+++ b/gcc/tree-vect-patterns.c
@@ -45,6 +45,8 @@ static gimple vect_recog_widen_mult_pattern (vec<gimple> *, tree *,
 					     tree *);
 static gimple vect_recog_dot_prod_pattern (vec<gimple> *, tree *,
 					   tree *);
+static gimple vect_recog_sad_pattern (vec<gimple> *, tree *,
+				      tree *);
 static gimple vect_recog_pow_pattern (vec<gimple> *, tree *, tree *);
 static gimple vect_recog_over_widening_pattern (vec<gimple> *, tree *,
                                                  tree *);
@@ -62,6 +64,7 @@ static vect_recog_func_ptr vect_vect_recog_func_ptrs[NUM_PATTERNS] = {
 	vect_recog_widen_mult_pattern,
 	vect_recog_widen_sum_pattern,
 	vect_recog_dot_prod_pattern,
+        vect_recog_sad_pattern,
 	vect_recog_pow_pattern,
 	vect_recog_widen_shift_pattern,
 	vect_recog_over_widening_pattern,
@@ -140,9 +143,8 @@ vect_single_imm_use (gimple def_stmt)
 }
 
 /* Check whether NAME, an ssa-name used in USE_STMT,
-   is a result of a type promotion or demotion, such that:
+   is a result of a type promotion, such that:
      DEF_STMT: NAME = NOP (name0)
-   where the type of name0 (ORIG_TYPE) is smaller/bigger than the type of NAME.
    If CHECK_SIGN is TRUE, check that either both types are signed or both are
    unsigned.  */
 
@@ -189,10 +191,8 @@ type_conversion_p (tree name, gimple use_stmt, bool check_sign,
 
   if (TYPE_PRECISION (type) >= (TYPE_PRECISION (*orig_type) * 2))
     *promotion = true;
-  else if (TYPE_PRECISION (*orig_type) >= (TYPE_PRECISION (type) * 2))
-    *promotion = false;
   else
-    return false;
+    *promotion = false;
 
   if (!vect_is_simple_use (oprnd0, *def_stmt, loop_vinfo,
 			   bb_vinfo, &dummy_gimple, &dummy, &dt))
@@ -433,6 +433,240 @@ vect_recog_dot_prod_pattern (vec<gimple> *stmts, tree *type_in,
 }
 
 
+/* Function vect_recog_sad_pattern
+
+   Try to find the following Sum of Absolute Difference (SAD) pattern:
+
+     type x_t, y_t;
+     signed TYPE1 diff, abs_diff;
+     TYPE2 sum = init;
+   loop:
+     sum_0 = phi <init, sum_1>
+     S1  x_t = ...
+     S2  y_t = ...
+     S3  x_T = (TYPE1) x_t;
+     S4  y_T = (TYPE1) y_t;
+     S5  diff = x_T - y_T;
+     S6  abs_diff = ABS_EXPR <diff>;
+     [S7  abs_diff = (TYPE2) abs_diff;  #optional]
+     S8  sum_1 = abs_diff + sum_0;
+
+   where 'TYPE1' is at least double the size of type 'type', and 'TYPE2' is the
+   same size as 'TYPE1' or bigger.  This is a special case of a reduction
+   computation.
+
+   Input:
+
+   * STMTS: Contains a stmt from which the pattern search begins.  In the
+   example, when this function is called with S8, the pattern
+   {S3,S4,S5,S6,S7,S8} will be detected.
+
+   Output:
+
+   * TYPE_IN: The type of the input arguments to the pattern.
+
+   * TYPE_OUT: The type of the output of this pattern.
+
+   * Return value: A new stmt that will be used to replace the sequence of
+   stmts that constitute the pattern. In this case it will be:
+        SAD_EXPR <x_t, y_t, sum_0>
+  */
+
+static gimple
+vect_recog_sad_pattern (vec<gimple> *stmts, tree *type_in,
+			     tree *type_out)
+{
+  gimple last_stmt = (*stmts)[0];
+  tree sad_oprnd0, sad_oprnd1;
+  stmt_vec_info stmt_vinfo = vinfo_for_stmt (last_stmt);
+  tree half_type;
+  loop_vec_info loop_info = STMT_VINFO_LOOP_VINFO (stmt_vinfo);
+  struct loop *loop;
+  bool promotion;
+
+  if (!loop_info)
+    return NULL;
+
+  loop = LOOP_VINFO_LOOP (loop_info);
+
+  if (!is_gimple_assign (last_stmt))
+    return NULL;
+
+  tree sum_type = gimple_expr_type (last_stmt);
+
+  /* Look for the following pattern
+          DX = (TYPE1) X;
+          DY = (TYPE1) Y;
+          DDIFF = DX - DY;
+          DAD = ABS_EXPR <DDIFF>;
+          DAD = (TYPE2) DAD;       #optional widening conversion
+          sum_1 = DAD + sum_0;
+     In which
+     - DX is at least double the size of X
+     - DY is at least double the size of Y
+     - DX, DY, DDIFF, DAD all have the same type
+     - sum is the same size as DAD or bigger
+     - sum has been recognized as a reduction variable.
+
+     This is equivalent to:
+       DDIFF = X w- Y;          #widen sub
+       DAD = ABS_EXPR <DDIFF>;
+       sum_1 = DAD w+ sum_0;    #widen summation
+     or
+       DDIFF = X w- Y;          #widen sub
+       DAD = ABS_EXPR <DDIFF>;
+       sum_1 = DAD + sum_0;     #summation
+   */
+
+  /* Starting from LAST_STMT, follow the defs of its uses in search
+     of the above pattern.  */
+
+  if (gimple_assign_rhs_code (last_stmt) != PLUS_EXPR)
+    return NULL;
+
+  tree plus_oprnd0, plus_oprnd1;
+
+  if (STMT_VINFO_IN_PATTERN_P (stmt_vinfo))
+    {
+      /* Has been detected as widening-summation?  */
+
+      gimple stmt = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+      sum_type = gimple_expr_type (stmt);
+      if (gimple_assign_rhs_code (stmt) != WIDEN_SUM_EXPR)
+        return NULL;
+      plus_oprnd0 = gimple_assign_rhs1 (stmt);
+      plus_oprnd1 = gimple_assign_rhs2 (stmt);
+      half_type = TREE_TYPE (plus_oprnd0);
+    }
+  else
+    {
+      gimple def_stmt;
+
+      if (STMT_VINFO_DEF_TYPE (stmt_vinfo) != vect_reduction_def)
+        return NULL;
+      plus_oprnd0 = gimple_assign_rhs1 (last_stmt);
+      plus_oprnd1 = gimple_assign_rhs2 (last_stmt);
+      if (!types_compatible_p (TREE_TYPE (plus_oprnd0), sum_type)
+	  || !types_compatible_p (TREE_TYPE (plus_oprnd1), sum_type))
+        return NULL;
+
+      /* The type conversion could be promotion, demotion,
+         or just signed -> unsigned.  */
+      if (type_conversion_p (plus_oprnd0, last_stmt, false,
+                             &half_type, &def_stmt, &promotion))
+        plus_oprnd0 = gimple_assign_rhs1 (def_stmt);
+      else
+        half_type = sum_type;
+    }
+
+  /* So far so good.  Since last_stmt was detected as a (summation) reduction,
+     we know that plus_oprnd1 is the reduction variable (defined by a loop-header
+     phi), and plus_oprnd0 is an ssa-name defined by a stmt in the loop body.
+     Then check that plus_oprnd0 is defined by an abs_expr.  */
+
+  if (TREE_CODE (plus_oprnd0) != SSA_NAME)
+    return NULL;
+
+  tree abs_type = half_type;
+  gimple abs_stmt = SSA_NAME_DEF_STMT (plus_oprnd0);
+
+  /* It cannot be the SAD pattern if abs_stmt is outside the loop.  */
+  if (!gimple_bb (abs_stmt) || !flow_bb_inside_loop_p (loop, gimple_bb (abs_stmt)))
+    return NULL;
+
+  /* FORNOW.  Can continue analyzing the def-use chain when this stmt is in a phi
+     inside the loop (in case we are analyzing an outer-loop).  */
+  if (!is_gimple_assign (abs_stmt))
+    return NULL;
+
+  stmt_vec_info abs_stmt_vinfo = vinfo_for_stmt (abs_stmt);
+  gcc_assert (abs_stmt_vinfo);
+  if (STMT_VINFO_DEF_TYPE (abs_stmt_vinfo) != vect_internal_def)
+    return NULL;
+  if (gimple_assign_rhs_code (abs_stmt) != ABS_EXPR)
+    return NULL;
+
+  tree abs_oprnd = gimple_assign_rhs1 (abs_stmt);
+  if (!types_compatible_p (TREE_TYPE (abs_oprnd), abs_type))
+    return NULL;
+  if (TYPE_UNSIGNED (abs_type))
+    return NULL;
+
+  /* We then detect if the operand of abs_expr is defined by a minus_expr.  */
+
+  if (TREE_CODE (abs_oprnd) != SSA_NAME)
+    return NULL;
+
+  gimple diff_stmt = SSA_NAME_DEF_STMT (abs_oprnd);
+
+  /* It cannot be the SAD pattern if diff_stmt is outside the loop.  */
+  if (!gimple_bb (diff_stmt)
+      || !flow_bb_inside_loop_p (loop, gimple_bb (diff_stmt)))
+    return NULL;
+
+  /* FORNOW.  Can continue analyzing the def-use chain when this stmt is in a phi
+     inside the loop (in case we are analyzing an outer-loop).  */
+  if (!is_gimple_assign (diff_stmt))
+    return NULL;
+
+  stmt_vec_info diff_stmt_vinfo = vinfo_for_stmt (diff_stmt);
+  gcc_assert (diff_stmt_vinfo);
+  if (STMT_VINFO_DEF_TYPE (diff_stmt_vinfo) != vect_internal_def)
+    return NULL;
+  if (gimple_assign_rhs_code (diff_stmt) != MINUS_EXPR)
+    return NULL;
+
+  tree half_type0, half_type1;
+  gimple def_stmt;
+
+  tree minus_oprnd0 = gimple_assign_rhs1 (diff_stmt);
+  tree minus_oprnd1 = gimple_assign_rhs2 (diff_stmt);
+
+  if (!types_compatible_p (TREE_TYPE (minus_oprnd0), abs_type)
+      || !types_compatible_p (TREE_TYPE (minus_oprnd1), abs_type))
+    return NULL;
+  if (!type_conversion_p (minus_oprnd0, diff_stmt, false,
+                          &half_type0, &def_stmt, &promotion)
+      || !promotion)
+    return NULL;
+  sad_oprnd0 = gimple_assign_rhs1 (def_stmt);
+
+  if (!type_conversion_p (minus_oprnd1, diff_stmt, false,
+                          &half_type1, &def_stmt, &promotion)
+      || !promotion)
+    return NULL;
+  sad_oprnd1 = gimple_assign_rhs1 (def_stmt);
+
+  if (!types_compatible_p (half_type0, half_type1))
+    return NULL;
+  if (TYPE_PRECISION (abs_type) < TYPE_PRECISION (half_type0) * 2
+      || TYPE_PRECISION (sum_type) < TYPE_PRECISION (half_type0) * 2)
+    return NULL;
+
+  *type_in = TREE_TYPE (sad_oprnd0);
+  *type_out = sum_type;
+
+  /* Pattern detected. Create a stmt to be used to replace the pattern: */
+  tree var = vect_recog_temp_ssa_var (sum_type, NULL);
+  gimple pattern_stmt = gimple_build_assign_with_ops
+                          (SAD_EXPR, var, sad_oprnd0, sad_oprnd1, plus_oprnd1);
+
+  if (dump_enabled_p ())
+    {
+      dump_printf_loc (MSG_NOTE, vect_location,
+                       "vect_recog_sad_pattern: detected: ");
+      dump_gimple_stmt (MSG_NOTE, TDF_SLIM, pattern_stmt, 0);
+      dump_printf (MSG_NOTE, "\n");
+    }
+
+  /* We don't allow changing the order of the computation in the inner-loop
+     when doing outer-loop vectorization.  */
+  gcc_assert (!nested_in_vect_loop_p (loop, last_stmt));
+
+  return pattern_stmt;
+}
+
+
 /* Handle widening operation by a constant.  At the moment we support MULT_EXPR
    and LSHIFT_EXPR.
 
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 8b7b345..0aac75b 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1044,7 +1044,7 @@ extern void vect_slp_transform_bb (basic_block);
    Additional pattern recognition functions can (and will) be added
    in the future.  */
 typedef gimple (* vect_recog_func_ptr) (vec<gimple> *, tree *, tree *);
-#define NUM_PATTERNS 11
+#define NUM_PATTERNS 12
 void vect_pattern_recog (loop_vec_info, bb_vec_info);
 
 /* In tree-vectorizer.c.  */
diff --git a/gcc/tree.def b/gcc/tree.def
index 88c850a..e15ee61 100644
--- a/gcc/tree.def
+++ b/gcc/tree.def
@@ -1155,6 +1155,12 @@ DEFTREECODE (DOT_PROD_EXPR, "dot_prod_expr", tcc_expression, 3)
    with the second argument.  */
 DEFTREECODE (WIDEN_SUM_EXPR, "widen_sum_expr", tcc_binary, 2)
 
+/* Widening SAD (sum of absolute differences).
+   The first two arguments are of type t1, which should be an integral type.
+   The third argument and the result are of type t2, such that t2 is at least
+   twice the size of t1.  */
+DEFTREECODE (SAD_EXPR, "sad_expr", tcc_expression, 3)
+
 /* Widening multiplication.
    The two arguments are of type t1.
    The result is of type t2, such that t2 is at least twice

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
  2013-11-01  2:04 ` Cong Hou
@ 2013-11-01  7:43   ` Uros Bizjak
  2013-11-01 10:17   ` James Greenhalgh
  1 sibling, 0 replies; 27+ messages in thread
From: Uros Bizjak @ 2013-11-01  7:43 UTC (permalink / raw)
  To: Cong Hou; +Cc: ramana.gcc, Richard Biener, gcc-patches

On Fri, Nov 1, 2013 at 3:03 AM, Cong Hou <congh@google.com> wrote:
> According to your comments, I made the following modifications to this patch:
>
> 1. Now SAD pattern does not require the first and second operands to
> be unsigned. And two versions (signed/unsigned) of the SAD optabs are
> defined: usad_optab and ssad_optab.
>
> 2. Use expand_simple_binop instead of gen_rtx_PLUS to generate the
> plus expression in sse.md. Also change the type of the second/third
> operands to be nonimmediate_operand.
>
> 3. Add the document for SAD_EXPR.
>
> 4. Verify the operands of SAD_EXPR.
>
> 5. Create a new target: vect_usad_char, and use it in the test case.
>
> The updated patch is pasted below.

> +(define_expand "usadv16qi"
> +  [(match_operand:V4SI 0 "register_operand")
> +   (match_operand:V16QI 1 "register_operand")
> +   (match_operand:V16QI 2 "nonimmediate_operand")
> +   (match_operand:V4SI 3 "nonimmediate_operand")]
> +  "TARGET_SSE2"
> +{
> +  rtx t1 = gen_reg_rtx (V2DImode);
> +  rtx t2 = gen_reg_rtx (V4SImode);
> +  emit_insn (gen_sse2_psadbw (t1, operands[1], operands[2]));
> +  convert_move (t2, t1, 0);
> +  emit_insn (gen_rtx_SET (VOIDmode, operands[0],
> +  expand_simple_binop (V4SImode, PLUS, t2, operands[3],
> +       NULL, 0, OPTAB_DIRECT)));

It seems to me that generic expander won't bring any benefit there,
operands are already in correct form, so please change the last lines
simply to:

emit_insn (gen_addv4si3 (operands[0], t2, operands[3]));

> +  DONE;
> +})
> +
> +(define_expand "usadv32qi"
> +  [(match_operand:V8SI 0 "register_operand")
> +   (match_operand:V32QI 1 "register_operand")
> +   (match_operand:V32QI 2 "nonimmediate_operand")
> +   (match_operand:V8SI 3 "nonimmediate_operand")]
> +  "TARGET_AVX2"
> +{
> +  rtx t1 = gen_reg_rtx (V4DImode);
> +  rtx t2 = gen_reg_rtx (V8SImode);
> +  emit_insn (gen_avx2_psadbw (t1, operands[1], operands[2]));
> +  convert_move (t2, t1, 0);
> +  emit_insn (gen_rtx_SET (VOIDmode, operands[0],
> +  expand_simple_binop (V8SImode, PLUS, t2, operands[3],
> +       NULL, 0, OPTAB_DIRECT)));

Same here, using gen_addv8si3.

No need to repost the patch with this trivial change.

Sorry for the confusion,
Uros.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
  2013-11-01  2:04 ` Cong Hou
  2013-11-01  7:43   ` Uros Bizjak
@ 2013-11-01 10:17   ` James Greenhalgh
  2013-11-01 16:49     ` Cong Hou
  1 sibling, 1 reply; 27+ messages in thread
From: James Greenhalgh @ 2013-11-01 10:17 UTC (permalink / raw)
  To: Cong Hou; +Cc: Uros Bizjak, ramana.gcc, Richard Biener, gcc-patches

On Fri, Nov 01, 2013 at 02:03:52AM +0000, Cong Hou wrote:
> 3. Add the document for SAD_EXPR.

I think this patch should also document the new Standard Names usad and
ssad in doc/md.texi?

Your Changelog is missing the change to doc/generic.texi.

Thanks,
James

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
  2013-11-01 10:17   ` James Greenhalgh
@ 2013-11-01 16:49     ` Cong Hou
  2013-11-04 10:06       ` James Greenhalgh
  0 siblings, 1 reply; 27+ messages in thread
From: Cong Hou @ 2013-11-01 16:49 UTC (permalink / raw)
  To: James Greenhalgh; +Cc: Uros Bizjak, ramana.gcc, Richard Biener, gcc-patches


Update the patch according to Uros and James's comments.

Now OK to commit?


thanks,
Cong





diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 8a38316..f49de27 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,25 @@
+2013-10-29  Cong Hou  <congh@google.com>
+
+ * tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD
+ pattern recognition.
+ (type_conversion_p): PROMOTION is true if it's a type promotion
+ conversion, and false otherwise.  Return true if the given expression
+ is a type conversion one.
+ * tree-vectorizer.h: Adjust the number of patterns.
+ * tree.def: Add SAD_EXPR.
+ * optabs.def: Add sad_optab.
+ * cfgexpand.c (expand_debug_expr): Add SAD_EXPR case.
+ * expr.c (expand_expr_real_2): Likewise.
+ * gimple-pretty-print.c (dump_ternary_rhs): Likewise.
+ * gimple.c (get_gimple_rhs_num_ops): Likewise.
+ * optabs.c (optab_for_tree_code): Likewise.
+ * tree-cfg.c (estimate_operator_cost): Likewise.
+ * tree-ssa-operands.c (get_expr_operands): Likewise.
+ * tree-vect-loop.c (get_initial_def_for_reduction): Likewise.
+ * config/i386/sse.md: Add SSE2 and AVX2 expand for SAD.
+ * doc/generic.texi: Add document for SAD_EXPR.
+ * doc/md.texi: Add document for ssad and usad.
+
 2013-10-14  David Malcolm  <dmalcolm@redhat.com>

  * dumpfile.h (gcc::dump_manager): New class, to hold state
diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
index 7ed29f5..9ec761a 100644
--- a/gcc/cfgexpand.c
+++ b/gcc/cfgexpand.c
@@ -2730,6 +2730,7 @@ expand_debug_expr (tree exp)
  {
  case COND_EXPR:
  case DOT_PROD_EXPR:
+ case SAD_EXPR:
  case WIDEN_MULT_PLUS_EXPR:
  case WIDEN_MULT_MINUS_EXPR:
  case FMA_EXPR:
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index c3f6c94..ac9d182 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -6052,6 +6052,36 @@
   DONE;
 })

+(define_expand "usadv16qi"
+  [(match_operand:V4SI 0 "register_operand")
+   (match_operand:V16QI 1 "register_operand")
+   (match_operand:V16QI 2 "nonimmediate_operand")
+   (match_operand:V4SI 3 "nonimmediate_operand")]
+  "TARGET_SSE2"
+{
+  rtx t1 = gen_reg_rtx (V2DImode);
+  rtx t2 = gen_reg_rtx (V4SImode);
+  emit_insn (gen_sse2_psadbw (t1, operands[1], operands[2]));
+  convert_move (t2, t1, 0);
+  emit_insn (gen_addv4si3 (operands[0], t2, operands[3]));
+  DONE;
+})
+
+(define_expand "usadv32qi"
+  [(match_operand:V8SI 0 "register_operand")
+   (match_operand:V32QI 1 "register_operand")
+   (match_operand:V32QI 2 "nonimmediate_operand")
+   (match_operand:V8SI 3 "nonimmediate_operand")]
+  "TARGET_AVX2"
+{
+  rtx t1 = gen_reg_rtx (V4DImode);
+  rtx t2 = gen_reg_rtx (V8SImode);
+  emit_insn (gen_avx2_psadbw (t1, operands[1], operands[2]));
+  convert_move (t2, t1, 0);
+  emit_insn (gen_addv8si3 (operands[0], t2, operands[3]));
+  DONE;
+})
+
 (define_insn "ashr<mode>3"
   [(set (match_operand:VI24_AVX2 0 "register_operand" "=x,x")
  (ashiftrt:VI24_AVX2
diff --git a/gcc/doc/generic.texi b/gcc/doc/generic.texi
index ccecd6e..381ee09 100644
--- a/gcc/doc/generic.texi
+++ b/gcc/doc/generic.texi
@@ -1707,6 +1707,7 @@ its sole argument yields the representation for @code{ap}.
 @tindex VEC_PACK_TRUNC_EXPR
 @tindex VEC_PACK_SAT_EXPR
 @tindex VEC_PACK_FIX_TRUNC_EXPR
+@tindex SAD_EXPR

 @table @code
 @item VEC_LSHIFT_EXPR
@@ -1787,6 +1788,15 @@ value, it is taken from the second operand. It should never evaluate to
 any other value currently, but optimizations should not rely on that
 property. In contrast with a @code{COND_EXPR}, all operands are always
 evaluated.
+
+@item SAD_EXPR
+This node represents the Sum of Absolute Differences operation.  The three
+operands must be vectors of integral types.  The first and second operands
+must have the same type.  The size of the vector element of the third
+operand must be at least twice the size of the vector element of the
+first and second ones.  The SAD is calculated between the first and second
+operands, added to the third operand, and returned.
+
 @end table


diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 2a5a2e1..8f5d39a 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -4705,6 +4705,16 @@ wider mode, is computed and added to operand 3. Operand 3 is of a mode equal or
 wider than the mode of the product. The result is placed in operand 0, which
 is of the same mode as operand 3.

+@cindex @code{ssad@var{m}} instruction pattern
+@item @samp{ssad@var{m}}
+@cindex @code{usad@var{m}} instruction pattern
+@item @samp{usad@var{m}}
+Compute the sum of absolute differences of two signed/unsigned elements.
+Operand 1 and operand 2 are of the same mode. Their absolute difference, which
+is of a wider mode, is computed and added to operand 3. Operand 3 is of a mode
+equal or wider than the mode of the absolute difference. The result is placed
+in operand 0, which is of the same mode as operand 3.
+
 @cindex @code{ssum_widen@var{m3}} instruction pattern
 @item @samp{ssum_widen@var{m3}}
 @cindex @code{usum_widen@var{m3}} instruction pattern
diff --git a/gcc/expr.c b/gcc/expr.c
index 4975a64..1db8a49 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -9026,6 +9026,20 @@ expand_expr_real_2 (sepops ops, rtx target, enum machine_mode tmode,
  return target;
       }

+      case SAD_EXPR:
+      {
+ tree oprnd0 = treeop0;
+ tree oprnd1 = treeop1;
+ tree oprnd2 = treeop2;
+ rtx op2;
+
+ expand_operands (oprnd0, oprnd1, NULL_RTX, &op0, &op1, EXPAND_NORMAL);
+ op2 = expand_normal (oprnd2);
+ target = expand_widen_pattern_expr (ops, op0, op1, op2,
+    target, unsignedp);
+ return target;
+      }
+
     case REALIGN_LOAD_EXPR:
       {
         tree oprnd0 = treeop0;
diff --git a/gcc/gimple-pretty-print.c b/gcc/gimple-pretty-print.c
index f0f8166..514ddd1 100644
--- a/gcc/gimple-pretty-print.c
+++ b/gcc/gimple-pretty-print.c
@@ -425,6 +425,16 @@ dump_ternary_rhs (pretty_printer *buffer, gimple gs, int spc, int flags)
       dump_generic_node (buffer, gimple_assign_rhs3 (gs), spc, flags, false);
       pp_greater (buffer);
       break;
+
+    case SAD_EXPR:
+      pp_string (buffer, "SAD_EXPR <");
+      dump_generic_node (buffer, gimple_assign_rhs1 (gs), spc, flags, false);
+      pp_string (buffer, ", ");
+      dump_generic_node (buffer, gimple_assign_rhs2 (gs), spc, flags, false);
+      pp_string (buffer, ", ");
+      dump_generic_node (buffer, gimple_assign_rhs3 (gs), spc, flags, false);
+      pp_greater (buffer);
+      break;

     case VEC_PERM_EXPR:
       pp_string (buffer, "VEC_PERM_EXPR <");
diff --git a/gcc/gimple.c b/gcc/gimple.c
index a12dd67..4975959 100644
--- a/gcc/gimple.c
+++ b/gcc/gimple.c
@@ -2562,6 +2562,7 @@ get_gimple_rhs_num_ops (enum tree_code code)
       || (SYM) == WIDEN_MULT_PLUS_EXPR    \
       || (SYM) == WIDEN_MULT_MINUS_EXPR    \
       || (SYM) == DOT_PROD_EXPR    \
+      || (SYM) == SAD_EXPR    \
       || (SYM) == REALIGN_LOAD_EXPR    \
       || (SYM) == VEC_COND_EXPR    \
       || (SYM) == VEC_PERM_EXPR                                             \
diff --git a/gcc/optabs.c b/gcc/optabs.c
index 06a626c..16e8f4f 100644
--- a/gcc/optabs.c
+++ b/gcc/optabs.c
@@ -462,6 +462,9 @@ optab_for_tree_code (enum tree_code code, const_tree type,
     case DOT_PROD_EXPR:
       return TYPE_UNSIGNED (type) ? udot_prod_optab : sdot_prod_optab;

+    case SAD_EXPR:
+      return TYPE_UNSIGNED (type) ? usad_optab : ssad_optab;
+
     case WIDEN_MULT_PLUS_EXPR:
       return (TYPE_UNSIGNED (type)
       ? (TYPE_SATURATING (type)
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 6b924ac..377763e 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -248,6 +248,8 @@ OPTAB_D (sdot_prod_optab, "sdot_prod$I$a")
 OPTAB_D (ssum_widen_optab, "widen_ssum$I$a3")
 OPTAB_D (udot_prod_optab, "udot_prod$I$a")
 OPTAB_D (usum_widen_optab, "widen_usum$I$a3")
+OPTAB_D (usad_optab, "usad$I$a")
+OPTAB_D (ssad_optab, "ssad$I$a")
 OPTAB_D (vec_extract_optab, "vec_extract$a")
 OPTAB_D (vec_init_optab, "vec_init$a")
 OPTAB_D (vec_pack_sfix_trunc_optab, "vec_pack_sfix_trunc_$a")
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 075d071..1698912 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@
+2013-10-29  Cong Hou  <congh@google.com>
+
+ * gcc.dg/vect/vect-reduc-sad.c: New.
+ * lib/target-supports.exp (check_effective_target_vect_usad_char): New.
+
 2013-10-14  Tobias Burnus  <burnus@net-b.de>

  PR fortran/58658
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c
new file mode 100644
index 0000000..15a625f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c
@@ -0,0 +1,54 @@
+/* { dg-require-effective-target vect_usad_char } */
+
+#include <stdarg.h>
+#include "tree-vect.h"
+
+#define N 64
+#define SAD N*N/2
+
+unsigned char X[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+unsigned char Y[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+
+/* Sum of absolute differences between arrays of unsigned char types.
+   Detected as a sad pattern.
+   Vectorized on targets that support sad for unsigned chars.  */
+
+__attribute__ ((noinline)) int
+foo (int len)
+{
+  int i;
+  int result = 0;
+
+  for (i = 0; i < len; i++)
+    result += abs (X[i] - Y[i]);
+
+  return result;
+}
+
+
+int
+main (void)
+{
+  int i;
+  int sad;
+
+  check_vect ();
+
+  for (i = 0; i < N; i++)
+    {
+      X[i] = i;
+      Y[i] = N - i;
+      __asm__ volatile ("");
+    }
+
+  sad = foo (N);
+  if (sad != SAD)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vect_recog_sad_pattern: detected" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index 7eb4dfe..01ee6f2 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -3672,6 +3672,26 @@ proc check_effective_target_vect_udot_hi { } {
     return $et_vect_udot_hi_saved
 }

+# Return 1 if the target plus current options supports a vector
+# sad operation of unsigned chars, 0 otherwise.
+#
+# This won't change for different subtargets so cache the result.
+
+proc check_effective_target_vect_usad_char { } {
+    global et_vect_usad_char_saved
+
+    if [info exists et_vect_usad_char_saved] {
+        verbose "check_effective_target_vect_usad_char: using cached result" 2
+    } else {
+        set et_vect_usad_char_saved 0
+        if { ([istarget i?86-*-*]
+             || [istarget x86_64-*-*]) } {
+            set et_vect_usad_char_saved 1
+        }
+    }
+    verbose "check_effective_target_vect_usad_char: returning $et_vect_usad_char_saved" 2
+    return $et_vect_usad_char_saved
+}

 # Return 1 if the target plus current options supports a vector
 # demotion (packing) of shorts (to chars) and ints (to shorts)
diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c
index 8b66791..c8f3d33 100644
--- a/gcc/tree-cfg.c
+++ b/gcc/tree-cfg.c
@@ -3796,6 +3796,36 @@ verify_gimple_assign_ternary (gimple stmt)

       return false;

+    case SAD_EXPR:
+      if (!useless_type_conversion_p (rhs1_type, rhs2_type)
+  || !useless_type_conversion_p (lhs_type, rhs3_type)
+  || 2 * GET_MODE_BITSIZE (GET_MODE_INNER
+     (TYPE_MODE (TREE_TYPE (rhs1_type))))
+       > GET_MODE_BITSIZE (GET_MODE_INNER
+     (TYPE_MODE (TREE_TYPE (lhs_type)))))
+ {
+  error ("type mismatch in sad expression");
+  debug_generic_expr (lhs_type);
+  debug_generic_expr (rhs1_type);
+  debug_generic_expr (rhs2_type);
+  debug_generic_expr (rhs3_type);
+  return true;
+ }
+
+      if (TREE_CODE (rhs1_type) != VECTOR_TYPE
+  || TREE_CODE (rhs2_type) != VECTOR_TYPE
+  || TREE_CODE (rhs3_type) != VECTOR_TYPE)
+ {
+  error ("vector types expected in sad expression");
+  debug_generic_expr (lhs_type);
+  debug_generic_expr (rhs1_type);
+  debug_generic_expr (rhs2_type);
+  debug_generic_expr (rhs3_type);
+  return true;
+ }
+
+      return false;
+
     case DOT_PROD_EXPR:
     case REALIGN_LOAD_EXPR:
       /* FIXME.  */
diff --git a/gcc/tree-inline.c b/gcc/tree-inline.c
index 2221b9c..44261a3 100644
--- a/gcc/tree-inline.c
+++ b/gcc/tree-inline.c
@@ -3601,6 +3601,7 @@ estimate_operator_cost (enum tree_code code, eni_weights *weights,
     case WIDEN_SUM_EXPR:
     case WIDEN_MULT_EXPR:
     case DOT_PROD_EXPR:
+    case SAD_EXPR:
     case WIDEN_MULT_PLUS_EXPR:
     case WIDEN_MULT_MINUS_EXPR:
     case WIDEN_LSHIFT_EXPR:
diff --git a/gcc/tree-ssa-operands.c b/gcc/tree-ssa-operands.c
index 603f797..393efc3 100644
--- a/gcc/tree-ssa-operands.c
+++ b/gcc/tree-ssa-operands.c
@@ -854,6 +854,7 @@ get_expr_operands (gimple stmt, tree *expr_p, int flags)
       }

     case DOT_PROD_EXPR:
+    case SAD_EXPR:
     case REALIGN_LOAD_EXPR:
     case WIDEN_MULT_PLUS_EXPR:
     case WIDEN_MULT_MINUS_EXPR:
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 638b981..89aa8c7 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -3607,6 +3607,7 @@ get_initial_def_for_reduction (gimple stmt, tree init_val,
     {
       case WIDEN_SUM_EXPR:
       case DOT_PROD_EXPR:
+      case SAD_EXPR:
       case PLUS_EXPR:
       case MINUS_EXPR:
       case BIT_IOR_EXPR:
diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
index 0a4e812..58a6666 100644
--- a/gcc/tree-vect-patterns.c
+++ b/gcc/tree-vect-patterns.c
@@ -45,6 +45,8 @@ static gimple vect_recog_widen_mult_pattern (vec<gimple> *, tree *,
      tree *);
 static gimple vect_recog_dot_prod_pattern (vec<gimple> *, tree *,
    tree *);
+static gimple vect_recog_sad_pattern (vec<gimple> *, tree *,
+      tree *);
 static gimple vect_recog_pow_pattern (vec<gimple> *, tree *, tree *);
 static gimple vect_recog_over_widening_pattern (vec<gimple> *, tree *,
                                                  tree *);
@@ -62,6 +64,7 @@ static vect_recog_func_ptr vect_vect_recog_func_ptrs[NUM_PATTERNS] = {
  vect_recog_widen_mult_pattern,
  vect_recog_widen_sum_pattern,
  vect_recog_dot_prod_pattern,
+        vect_recog_sad_pattern,
  vect_recog_pow_pattern,
  vect_recog_widen_shift_pattern,
  vect_recog_over_widening_pattern,
@@ -140,9 +143,8 @@ vect_single_imm_use (gimple def_stmt)
 }

 /* Check whether NAME, an ssa-name used in USE_STMT,
-   is a result of a type promotion or demotion, such that:
+   is a result of a type promotion, such that:
      DEF_STMT: NAME = NOP (name0)
-   where the type of name0 (ORIG_TYPE) is smaller/bigger than the type of NAME.
    If CHECK_SIGN is TRUE, check that either both types are signed or both are
    unsigned.  */

@@ -189,10 +191,8 @@ type_conversion_p (tree name, gimple use_stmt, bool check_sign,

   if (TYPE_PRECISION (type) >= (TYPE_PRECISION (*orig_type) * 2))
     *promotion = true;
-  else if (TYPE_PRECISION (*orig_type) >= (TYPE_PRECISION (type) * 2))
-    *promotion = false;
   else
-    return false;
+    *promotion = false;

   if (!vect_is_simple_use (oprnd0, *def_stmt, loop_vinfo,
    bb_vinfo, &dummy_gimple, &dummy, &dt))
@@ -433,6 +433,240 @@ vect_recog_dot_prod_pattern (vec<gimple> *stmts, tree *type_in,
 }


+/* Function vect_recog_sad_pattern
+
+   Try to find the following Sum of Absolute Difference (SAD) pattern:
+
+     type x_t, y_t;
+     signed TYPE1 diff, abs_diff;
+     TYPE2 sum = init;
+   loop:
+     sum_0 = phi <init, sum_1>
+     S1  x_t = ...
+     S2  y_t = ...
+     S3  x_T = (TYPE1) x_t;
+     S4  y_T = (TYPE1) y_t;
+     S5  diff = x_T - y_T;
+     S6  abs_diff = ABS_EXPR <diff>;
+     [S7  abs_diff = (TYPE2) abs_diff;  #optional]
+     S8  sum_1 = abs_diff + sum_0;
+
+   where 'TYPE1' is at least double the size of type 'type', and 'TYPE2' is the
+   same size as 'TYPE1' or bigger.  This is a special case of a reduction
+   computation.
+
+   Input:
+
+   * STMTS: Contains a stmt from which the pattern search begins.  In the
+   example, when this function is called with S8, the pattern
+   {S3,S4,S5,S6,S7,S8} will be detected.
+
+   Output:
+
+   * TYPE_IN: The type of the input arguments to the pattern.
+
+   * TYPE_OUT: The type of the output of this pattern.
+
+   * Return value: A new stmt that will be used to replace the sequence of
+   stmts that constitute the pattern. In this case it will be:
+        SAD_EXPR <x_t, y_t, sum_0>
+  */
+
+static gimple
+vect_recog_sad_pattern (vec<gimple> *stmts, tree *type_in,
+     tree *type_out)
+{
+  gimple last_stmt = (*stmts)[0];
+  tree sad_oprnd0, sad_oprnd1;
+  stmt_vec_info stmt_vinfo = vinfo_for_stmt (last_stmt);
+  tree half_type;
+  loop_vec_info loop_info = STMT_VINFO_LOOP_VINFO (stmt_vinfo);
+  struct loop *loop;
+  bool promotion;
+
+  if (!loop_info)
+    return NULL;
+
+  loop = LOOP_VINFO_LOOP (loop_info);
+
+  if (!is_gimple_assign (last_stmt))
+    return NULL;
+
+  tree sum_type = gimple_expr_type (last_stmt);
+
+  /* Look for the following pattern
+          DX = (TYPE1) X;
+          DY = (TYPE1) Y;
+          DDIFF = DX - DY;
+          DAD = ABS_EXPR <DDIFF>;
+          DAD_w = (TYPE2) DAD;   #optional widening
+          sum_1 = DAD_w + sum_0;
+     In which
+     - DX is at least double the size of X
+     - DY is at least double the size of Y
+     - DX, DY, DDIFF, DAD all have the same type
+     - sum is the same size as DAD or bigger
+     - sum has been recognized as a reduction variable.
+
+     This is equivalent to:
+       DDIFF = X w- Y;          #widen sub
+       DAD = ABS_EXPR <DDIFF>;
+       sum_1 = DAD w+ sum_0;    #widen summation
+     or
+       DDIFF = X w- Y;          #widen sub
+       DAD = ABS_EXPR <DDIFF>;
+       sum_1 = DAD + sum_0;     #summation
+   */
+
+  /* Starting from LAST_STMT, follow the defs of its uses in search
+     of the above pattern.  */
+
+  if (gimple_assign_rhs_code (last_stmt) != PLUS_EXPR)
+    return NULL;
+
+  tree plus_oprnd0, plus_oprnd1;
+
+  if (STMT_VINFO_IN_PATTERN_P (stmt_vinfo))
+    {
+      /* Has been detected as widening-summation?  */
+
+      gimple stmt = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+      sum_type = gimple_expr_type (stmt);
+      if (gimple_assign_rhs_code (stmt) != WIDEN_SUM_EXPR)
+        return NULL;
+      plus_oprnd0 = gimple_assign_rhs1 (stmt);
+      plus_oprnd1 = gimple_assign_rhs2 (stmt);
+      half_type = TREE_TYPE (plus_oprnd0);
+    }
+  else
+    {
+      gimple def_stmt;
+
+      if (STMT_VINFO_DEF_TYPE (stmt_vinfo) != vect_reduction_def)
+        return NULL;
+      plus_oprnd0 = gimple_assign_rhs1 (last_stmt);
+      plus_oprnd1 = gimple_assign_rhs2 (last_stmt);
+      if (!types_compatible_p (TREE_TYPE (plus_oprnd0), sum_type)
+  || !types_compatible_p (TREE_TYPE (plus_oprnd1), sum_type))
+        return NULL;
+
+      /* The type conversion could be promotion, demotion,
+         or just signed -> unsigned.  */
+      if (type_conversion_p (plus_oprnd0, last_stmt, false,
+                             &half_type, &def_stmt, &promotion))
+        plus_oprnd0 = gimple_assign_rhs1 (def_stmt);
+      else
+        half_type = sum_type;
+    }
+
+  /* So far so good.  Since last_stmt was detected as a (summation) reduction,
+     we know that plus_oprnd1 is the reduction variable (defined by a loop-header
+     phi), and plus_oprnd0 is an ssa-name defined by a stmt in the loop body.
+     Then check that plus_oprnd0 is defined by an abs_expr.  */
+
+  if (TREE_CODE (plus_oprnd0) != SSA_NAME)
+    return NULL;
+
+  tree abs_type = half_type;
+  gimple abs_stmt = SSA_NAME_DEF_STMT (plus_oprnd0);
+
+  /* It cannot be the sad pattern if the abs_stmt is outside the loop.  */
+  if (!gimple_bb (abs_stmt) || !flow_bb_inside_loop_p (loop, gimple_bb (abs_stmt)))
+    return NULL;
+
+  /* FORNOW.  Can continue analyzing the def-use chain when this stmt is in a phi
+     inside the loop (in case we are analyzing an outer-loop).  */
+  if (!is_gimple_assign (abs_stmt))
+    return NULL;
+
+  stmt_vec_info abs_stmt_vinfo = vinfo_for_stmt (abs_stmt);
+  gcc_assert (abs_stmt_vinfo);
+  if (STMT_VINFO_DEF_TYPE (abs_stmt_vinfo) != vect_internal_def)
+    return NULL;
+  if (gimple_assign_rhs_code (abs_stmt) != ABS_EXPR)
+    return NULL;
+
+  tree abs_oprnd = gimple_assign_rhs1 (abs_stmt);
+  if (!types_compatible_p (TREE_TYPE (abs_oprnd), abs_type))
+    return NULL;
+  if (TYPE_UNSIGNED (abs_type))
+    return NULL;
+
+  /* We then detect if the operand of abs_expr is defined by a minus_expr.  */
+
+  if (TREE_CODE (abs_oprnd) != SSA_NAME)
+    return NULL;
+
+  gimple diff_stmt = SSA_NAME_DEF_STMT (abs_oprnd);
+
+  /* It cannot be the sad pattern if the diff_stmt is outside the loop.  */
+  if (!gimple_bb (diff_stmt)
+      || !flow_bb_inside_loop_p (loop, gimple_bb (diff_stmt)))
+    return NULL;
+
+  /* FORNOW.  Can continue analyzing the def-use chain when this stmt is in a phi
+     inside the loop (in case we are analyzing an outer-loop).  */
+  if (!is_gimple_assign (diff_stmt))
+    return NULL;
+
+  stmt_vec_info diff_stmt_vinfo = vinfo_for_stmt (diff_stmt);
+  gcc_assert (diff_stmt_vinfo);
+  if (STMT_VINFO_DEF_TYPE (diff_stmt_vinfo) != vect_internal_def)
+    return NULL;
+  if (gimple_assign_rhs_code (diff_stmt) != MINUS_EXPR)
+    return NULL;
+
+  tree half_type0, half_type1;
+  gimple def_stmt;
+
+  tree minus_oprnd0 = gimple_assign_rhs1 (diff_stmt);
+  tree minus_oprnd1 = gimple_assign_rhs2 (diff_stmt);
+
+  if (!types_compatible_p (TREE_TYPE (minus_oprnd0), abs_type)
+      || !types_compatible_p (TREE_TYPE (minus_oprnd1), abs_type))
+    return NULL;
+  if (!type_conversion_p (minus_oprnd0, diff_stmt, false,
+                          &half_type0, &def_stmt, &promotion)
+      || !promotion)
+    return NULL;
+  sad_oprnd0 = gimple_assign_rhs1 (def_stmt);
+
+  if (!type_conversion_p (minus_oprnd1, diff_stmt, false,
+                          &half_type1, &def_stmt, &promotion)
+      || !promotion)
+    return NULL;
+  sad_oprnd1 = gimple_assign_rhs1 (def_stmt);
+
+  if (!types_compatible_p (half_type0, half_type1))
+    return NULL;
+  if (TYPE_PRECISION (abs_type) < TYPE_PRECISION (half_type0) * 2
+      || TYPE_PRECISION (sum_type) < TYPE_PRECISION (half_type0) * 2)
+    return NULL;
+
+  *type_in = TREE_TYPE (sad_oprnd0);
+  *type_out = sum_type;
+
+  /* Pattern detected. Create a stmt to be used to replace the pattern: */
+  tree var = vect_recog_temp_ssa_var (sum_type, NULL);
+  gimple pattern_stmt = gimple_build_assign_with_ops
+                          (SAD_EXPR, var, sad_oprnd0, sad_oprnd1, plus_oprnd1);
+
+  if (dump_enabled_p ())
+    {
+      dump_printf_loc (MSG_NOTE, vect_location,
+                       "vect_recog_sad_pattern: detected: ");
+      dump_gimple_stmt (MSG_NOTE, TDF_SLIM, pattern_stmt, 0);
+      dump_printf (MSG_NOTE, "\n");
+    }
+
+  /* We don't allow changing the order of the computation in the inner-loop
+     when doing outer-loop vectorization.  */
+  gcc_assert (!nested_in_vect_loop_p (loop, last_stmt));
+
+  return pattern_stmt;
+}
+
+
 /* Handle widening operation by a constant.  At the moment we support MULT_EXPR
    and LSHIFT_EXPR.

diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 8b7b345..0aac75b 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1044,7 +1044,7 @@ extern void vect_slp_transform_bb (basic_block);
    Additional pattern recognition functions can (and will) be added
    in the future.  */
 typedef gimple (* vect_recog_func_ptr) (vec<gimple> *, tree *, tree *);
-#define NUM_PATTERNS 11
+#define NUM_PATTERNS 12
 void vect_pattern_recog (loop_vec_info, bb_vec_info);

 /* In tree-vectorizer.c.  */
diff --git a/gcc/tree.def b/gcc/tree.def
index 88c850a..e15ee61 100644
--- a/gcc/tree.def
+++ b/gcc/tree.def
@@ -1155,6 +1155,12 @@ DEFTREECODE (DOT_PROD_EXPR, "dot_prod_expr", tcc_expression, 3)
    with the second argument.  */
 DEFTREECODE (WIDEN_SUM_EXPR, "widen_sum_expr", tcc_binary, 2)

+/* Widening sad (sum of absolute differences).
+   The first two arguments are of type t1, which should be an integral type.
+   The third argument and the result are of type t2, such that t2 is at least
+   twice the size of t1.  */
+DEFTREECODE (SAD_EXPR, "sad_expr", tcc_expression, 3)
+
 /* Widening multiplication.
    The two arguments are of type t1.
    The result is of type t2, such that t2 is at least twice




On Fri, Nov 1, 2013 at 3:16 AM, James Greenhalgh
<james.greenhalgh@arm.com> wrote:
> On Fri, Nov 01, 2013 at 02:03:52AM +0000, Cong Hou wrote:
>> 3. Add the document for SAD_EXPR.
>
> I think this patch should also document the new Standard Names usad and
> ssad in doc/md.texi?
>
> Your Changelog is missing the change to doc/generic.texi.
>
> Thanks,
> James
>

[-- Attachment #2: patch-sad.txt --]
[-- Type: text/plain, Size: 24348 bytes --]

diff --git a/.gitignore b/.gitignore
index bda55a3..b5bdd3a 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,3 +1,4 @@
+*.swp
 *.diff
 *.patch
 *.orig
diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 8a38316..f49de27 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,26 @@
+2013-10-29  Cong Hou  <congh@google.com>
+
+	* tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD
+	pattern recognition.
+	(type_conversion_p): PROMOTION is true if it's a type promotion
+	conversion, and false otherwise.  Return true if the given expression
+	is a type conversion one.
+	* tree-vectorizer.h: Adjust the number of patterns.
+	* tree.def: Add SAD_EXPR.
+	* optabs.def: Add usad_optab and ssad_optab.
+	* cfgexpand.c (expand_debug_expr): Add SAD_EXPR case.
+	* expr.c (expand_expr_real_2): Likewise.
+	* gimple-pretty-print.c (dump_ternary_rhs): Likewise.
+	* gimple.c (get_gimple_rhs_num_ops): Likewise.
+	* optabs.c (optab_for_tree_code): Likewise.
+	* tree-cfg.c (verify_gimple_assign_ternary): Likewise.
+	* tree-inline.c (estimate_operator_cost): Likewise.
+	* tree-ssa-operands.c (get_expr_operands): Likewise.
+	* tree-vect-loop.c (get_initial_def_for_reduction): Likewise.
+	* config/i386/sse.md: Add SSE2 and AVX2 expanders for SAD.
+	* doc/generic.texi: Document SAD_EXPR.
+	* doc/md.texi: Document ssad and usad.
+
 2013-10-14  David Malcolm  <dmalcolm@redhat.com>
 
 	* dumpfile.h (gcc::dump_manager): New class, to hold state
diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
index 7ed29f5..9ec761a 100644
--- a/gcc/cfgexpand.c
+++ b/gcc/cfgexpand.c
@@ -2730,6 +2730,7 @@ expand_debug_expr (tree exp)
 	{
 	case COND_EXPR:
 	case DOT_PROD_EXPR:
+	case SAD_EXPR:
 	case WIDEN_MULT_PLUS_EXPR:
 	case WIDEN_MULT_MINUS_EXPR:
 	case FMA_EXPR:
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index c3f6c94..ac9d182 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -6052,6 +6052,36 @@
   DONE;
 })
 
+(define_expand "usadv16qi"
+  [(match_operand:V4SI 0 "register_operand")
+   (match_operand:V16QI 1 "register_operand")
+   (match_operand:V16QI 2 "nonimmediate_operand")
+   (match_operand:V4SI 3 "nonimmediate_operand")]
+  "TARGET_SSE2"
+{
+  rtx t1 = gen_reg_rtx (V2DImode);
+  rtx t2 = gen_reg_rtx (V4SImode);
+  emit_insn (gen_sse2_psadbw (t1, operands[1], operands[2]));
+  convert_move (t2, t1, 0);
+  emit_insn (gen_addv4si3 (operands[0], t2, operands[3]));
+  DONE;
+})
+
+(define_expand "usadv32qi"
+  [(match_operand:V8SI 0 "register_operand")
+   (match_operand:V32QI 1 "register_operand")
+   (match_operand:V32QI 2 "nonimmediate_operand")
+   (match_operand:V8SI 3 "nonimmediate_operand")]
+  "TARGET_AVX2"
+{
+  rtx t1 = gen_reg_rtx (V4DImode);
+  rtx t2 = gen_reg_rtx (V8SImode);
+  emit_insn (gen_avx2_psadbw (t1, operands[1], operands[2]));
+  convert_move (t2, t1, 0);
+  emit_insn (gen_addv8si3 (operands[0], t2, operands[3]));
+  DONE;
+})
+
 (define_insn "ashr<mode>3"
   [(set (match_operand:VI24_AVX2 0 "register_operand" "=x,x")
 	(ashiftrt:VI24_AVX2
diff --git a/gcc/doc/generic.texi b/gcc/doc/generic.texi
index ccecd6e..381ee09 100644
--- a/gcc/doc/generic.texi
+++ b/gcc/doc/generic.texi
@@ -1707,6 +1707,7 @@ its sole argument yields the representation for @code{ap}.
 @tindex VEC_PACK_TRUNC_EXPR
 @tindex VEC_PACK_SAT_EXPR
 @tindex VEC_PACK_FIX_TRUNC_EXPR
+@tindex SAD_EXPR
 
 @table @code
 @item VEC_LSHIFT_EXPR
@@ -1787,6 +1788,15 @@ value, it is taken from the second operand. It should never evaluate to
 any other value currently, but optimizations should not rely on that
 property. In contrast with a @code{COND_EXPR}, all operands are always
 evaluated.
+
+@item SAD_EXPR
+This node represents the Sum of Absolute Differences operation.  The three
+operands must be vectors of integral types.  The first and second operands
+must have the same type.  The size of the vector element of the third
+operand must be at least twice the size of the vector element of the first
+and second operands.  The SAD is calculated between the first and second
+operands, added to the third operand, and returned.
+
 @end table
 
 
diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 2a5a2e1..8f5d39a 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -4705,6 +4705,16 @@ wider mode, is computed and added to operand 3. Operand 3 is of a mode equal or
 wider than the mode of the product. The result is placed in operand 0, which
 is of the same mode as operand 3.
 
+@cindex @code{ssad@var{m}} instruction pattern
+@item @samp{ssad@var{m}}
+@cindex @code{usad@var{m}} instruction pattern
+@item @samp{usad@var{m}}
+Compute the sum of absolute differences of two signed/unsigned elements.
+Operand 1 and operand 2 are of the same mode.  Their absolute difference,
+which is of a wider mode, is computed and added to operand 3.  Operand 3 is
+of a mode equal to or wider than the mode of the absolute difference.  The
+result is placed in operand 0, which is of the same mode as operand 3.
+
 @cindex @code{ssum_widen@var{m3}} instruction pattern
 @item @samp{ssum_widen@var{m3}}
 @cindex @code{usum_widen@var{m3}} instruction pattern
diff --git a/gcc/expr.c b/gcc/expr.c
index 4975a64..1db8a49 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -9026,6 +9026,20 @@ expand_expr_real_2 (sepops ops, rtx target, enum machine_mode tmode,
 	return target;
       }
 
+      case SAD_EXPR:
+      {
+	tree oprnd0 = treeop0;
+	tree oprnd1 = treeop1;
+	tree oprnd2 = treeop2;
+	rtx op2;
+
+	expand_operands (oprnd0, oprnd1, NULL_RTX, &op0, &op1, EXPAND_NORMAL);
+	op2 = expand_normal (oprnd2);
+	target = expand_widen_pattern_expr (ops, op0, op1, op2,
+					    target, unsignedp);
+	return target;
+      }
+
     case REALIGN_LOAD_EXPR:
       {
         tree oprnd0 = treeop0;
diff --git a/gcc/gimple-pretty-print.c b/gcc/gimple-pretty-print.c
index f0f8166..514ddd1 100644
--- a/gcc/gimple-pretty-print.c
+++ b/gcc/gimple-pretty-print.c
@@ -425,6 +425,16 @@ dump_ternary_rhs (pretty_printer *buffer, gimple gs, int spc, int flags)
       dump_generic_node (buffer, gimple_assign_rhs3 (gs), spc, flags, false);
       pp_greater (buffer);
       break;
+
+    case SAD_EXPR:
+      pp_string (buffer, "SAD_EXPR <");
+      dump_generic_node (buffer, gimple_assign_rhs1 (gs), spc, flags, false);
+      pp_string (buffer, ", ");
+      dump_generic_node (buffer, gimple_assign_rhs2 (gs), spc, flags, false);
+      pp_string (buffer, ", ");
+      dump_generic_node (buffer, gimple_assign_rhs3 (gs), spc, flags, false);
+      pp_greater (buffer);
+      break;
     
     case VEC_PERM_EXPR:
       pp_string (buffer, "VEC_PERM_EXPR <");
diff --git a/gcc/gimple.c b/gcc/gimple.c
index a12dd67..4975959 100644
--- a/gcc/gimple.c
+++ b/gcc/gimple.c
@@ -2562,6 +2562,7 @@ get_gimple_rhs_num_ops (enum tree_code code)
       || (SYM) == WIDEN_MULT_PLUS_EXPR					    \
       || (SYM) == WIDEN_MULT_MINUS_EXPR					    \
       || (SYM) == DOT_PROD_EXPR						    \
+      || (SYM) == SAD_EXPR						    \
       || (SYM) == REALIGN_LOAD_EXPR					    \
       || (SYM) == VEC_COND_EXPR						    \
       || (SYM) == VEC_PERM_EXPR                                             \
diff --git a/gcc/optabs.c b/gcc/optabs.c
index 06a626c..16e8f4f 100644
--- a/gcc/optabs.c
+++ b/gcc/optabs.c
@@ -462,6 +462,9 @@ optab_for_tree_code (enum tree_code code, const_tree type,
     case DOT_PROD_EXPR:
       return TYPE_UNSIGNED (type) ? udot_prod_optab : sdot_prod_optab;
 
+    case SAD_EXPR:
+      return TYPE_UNSIGNED (type) ? usad_optab : ssad_optab;
+
     case WIDEN_MULT_PLUS_EXPR:
       return (TYPE_UNSIGNED (type)
 	      ? (TYPE_SATURATING (type)
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 6b924ac..377763e 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -248,6 +248,8 @@ OPTAB_D (sdot_prod_optab, "sdot_prod$I$a")
 OPTAB_D (ssum_widen_optab, "widen_ssum$I$a3")
 OPTAB_D (udot_prod_optab, "udot_prod$I$a")
 OPTAB_D (usum_widen_optab, "widen_usum$I$a3")
+OPTAB_D (usad_optab, "usad$I$a")
+OPTAB_D (ssad_optab, "ssad$I$a")
 OPTAB_D (vec_extract_optab, "vec_extract$a")
 OPTAB_D (vec_init_optab, "vec_init$a")
 OPTAB_D (vec_pack_sfix_trunc_optab, "vec_pack_sfix_trunc_$a")
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 075d071..1698912 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@
+2013-10-29  Cong Hou  <congh@google.com>
+
+	* gcc.dg/vect/vect-reduc-sad.c: New.
+	* lib/target-supports.exp (check_effective_target_vect_usad_char): New.
+
 2013-10-14  Tobias Burnus  <burnus@net-b.de>
 
 	PR fortran/58658
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c
new file mode 100644
index 0000000..15a625f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c
@@ -0,0 +1,54 @@
+/* { dg-require-effective-target vect_usad_char } */
+
+#include <stdarg.h>
+#include "tree-vect.h"
+
+#define N 64
+#define SAD N*N/2
+
+unsigned char X[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+unsigned char Y[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+
+/* Sum of absolute differences between arrays of unsigned char types.
+   Detected as a sad pattern.
+   Vectorized on targets that support sad for unsigned chars.  */
+
+__attribute__ ((noinline)) int
+foo (int len)
+{
+  int i;
+  int result = 0;
+
+  for (i = 0; i < len; i++)
+    result += abs (X[i] - Y[i]);
+
+  return result;
+}
+
+
+int
+main (void)
+{
+  int i;
+  int sad;
+
+  check_vect ();
+
+  for (i = 0; i < N; i++)
+    {
+      X[i] = i;
+      Y[i] = N - i;
+      __asm__ volatile ("");
+    }
+
+  sad = foo (N);
+  if (sad != SAD)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vect_recog_sad_pattern: detected" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index 7eb4dfe..01ee6f2 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -3672,6 +3672,26 @@ proc check_effective_target_vect_udot_hi { } {
     return $et_vect_udot_hi_saved
 }
 
+# Return 1 if the target plus current options supports a vector
+# sad operation of unsigned chars, 0 otherwise.
+#
+# This won't change for different subtargets so cache the result.
+
+proc check_effective_target_vect_usad_char { } {
+    global et_vect_usad_char_saved
+
+    if [info exists et_vect_usad_char_saved] {
+        verbose "check_effective_target_vect_usad_char: using cached result" 2
+    } else {
+        set et_vect_usad_char_saved 0
+        if { ([istarget i?86-*-*]
+             || [istarget x86_64-*-*]) } {
+            set et_vect_usad_char_saved 1
+        }
+    }
+    verbose "check_effective_target_vect_usad_char: returning $et_vect_usad_char_saved" 2
+    return $et_vect_usad_char_saved
+}
 
 # Return 1 if the target plus current options supports a vector
 # demotion (packing) of shorts (to chars) and ints (to shorts) 
diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c
index 8b66791..c8f3d33 100644
--- a/gcc/tree-cfg.c
+++ b/gcc/tree-cfg.c
@@ -3796,6 +3796,36 @@ verify_gimple_assign_ternary (gimple stmt)
 
       return false;
 
+    case SAD_EXPR:
+      if (!useless_type_conversion_p (rhs1_type, rhs2_type)
+	  || !useless_type_conversion_p (lhs_type, rhs3_type)
+	  || 2 * GET_MODE_BITSIZE (GET_MODE_INNER
+				     (TYPE_MODE (TREE_TYPE (rhs1_type))))
+	       > GET_MODE_BITSIZE (GET_MODE_INNER
+				     (TYPE_MODE (TREE_TYPE (lhs_type)))))
+	{
+	  error ("type mismatch in sad expression");
+	  debug_generic_expr (lhs_type);
+	  debug_generic_expr (rhs1_type);
+	  debug_generic_expr (rhs2_type);
+	  debug_generic_expr (rhs3_type);
+	  return true;
+	}
+
+      if (TREE_CODE (rhs1_type) != VECTOR_TYPE
+	  || TREE_CODE (rhs2_type) != VECTOR_TYPE
+	  || TREE_CODE (rhs3_type) != VECTOR_TYPE)
+	{
+	  error ("vector types expected in sad expression");
+	  debug_generic_expr (lhs_type);
+	  debug_generic_expr (rhs1_type);
+	  debug_generic_expr (rhs2_type);
+	  debug_generic_expr (rhs3_type);
+	  return true;
+	}
+
+      return false;
+
     case DOT_PROD_EXPR:
     case REALIGN_LOAD_EXPR:
       /* FIXME.  */
diff --git a/gcc/tree-inline.c b/gcc/tree-inline.c
index 2221b9c..44261a3 100644
--- a/gcc/tree-inline.c
+++ b/gcc/tree-inline.c
@@ -3601,6 +3601,7 @@ estimate_operator_cost (enum tree_code code, eni_weights *weights,
     case WIDEN_SUM_EXPR:
     case WIDEN_MULT_EXPR:
     case DOT_PROD_EXPR:
+    case SAD_EXPR:
     case WIDEN_MULT_PLUS_EXPR:
     case WIDEN_MULT_MINUS_EXPR:
     case WIDEN_LSHIFT_EXPR:
diff --git a/gcc/tree-ssa-operands.c b/gcc/tree-ssa-operands.c
index 603f797..393efc3 100644
--- a/gcc/tree-ssa-operands.c
+++ b/gcc/tree-ssa-operands.c
@@ -854,6 +854,7 @@ get_expr_operands (gimple stmt, tree *expr_p, int flags)
       }
 
     case DOT_PROD_EXPR:
+    case SAD_EXPR:
     case REALIGN_LOAD_EXPR:
     case WIDEN_MULT_PLUS_EXPR:
     case WIDEN_MULT_MINUS_EXPR:
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 638b981..89aa8c7 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -3607,6 +3607,7 @@ get_initial_def_for_reduction (gimple stmt, tree init_val,
     {
       case WIDEN_SUM_EXPR:
       case DOT_PROD_EXPR:
+      case SAD_EXPR:
       case PLUS_EXPR:
       case MINUS_EXPR:
       case BIT_IOR_EXPR:
diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
index 0a4e812..58a6666 100644
--- a/gcc/tree-vect-patterns.c
+++ b/gcc/tree-vect-patterns.c
@@ -45,6 +45,8 @@ static gimple vect_recog_widen_mult_pattern (vec<gimple> *, tree *,
 					     tree *);
 static gimple vect_recog_dot_prod_pattern (vec<gimple> *, tree *,
 					   tree *);
+static gimple vect_recog_sad_pattern (vec<gimple> *, tree *,
+				      tree *);
 static gimple vect_recog_pow_pattern (vec<gimple> *, tree *, tree *);
 static gimple vect_recog_over_widening_pattern (vec<gimple> *, tree *,
                                                  tree *);
@@ -62,6 +64,7 @@ static vect_recog_func_ptr vect_vect_recog_func_ptrs[NUM_PATTERNS] = {
 	vect_recog_widen_mult_pattern,
 	vect_recog_widen_sum_pattern,
 	vect_recog_dot_prod_pattern,
+	vect_recog_sad_pattern,
 	vect_recog_pow_pattern,
 	vect_recog_widen_shift_pattern,
 	vect_recog_over_widening_pattern,
@@ -140,9 +143,8 @@ vect_single_imm_use (gimple def_stmt)
 }
 
 /* Check whether NAME, an ssa-name used in USE_STMT,
-   is a result of a type promotion or demotion, such that:
+   is a result of a type promotion, such that:
      DEF_STMT: NAME = NOP (name0)
-   where the type of name0 (ORIG_TYPE) is smaller/bigger than the type of NAME.
    If CHECK_SIGN is TRUE, check that either both types are signed or both are
    unsigned.  */
 
@@ -189,10 +191,8 @@ type_conversion_p (tree name, gimple use_stmt, bool check_sign,
 
   if (TYPE_PRECISION (type) >= (TYPE_PRECISION (*orig_type) * 2))
     *promotion = true;
-  else if (TYPE_PRECISION (*orig_type) >= (TYPE_PRECISION (type) * 2))
-    *promotion = false;
   else
-    return false;
+    *promotion = false;
 
   if (!vect_is_simple_use (oprnd0, *def_stmt, loop_vinfo,
 			   bb_vinfo, &dummy_gimple, &dummy, &dt))
@@ -433,6 +433,240 @@ vect_recog_dot_prod_pattern (vec<gimple> *stmts, tree *type_in,
 }
 
 
+/* Function vect_recog_sad_pattern
+
+   Try to find the following Sum of Absolute Difference (SAD) pattern:
+
+     type x_t, y_t;
+     signed TYPE1 diff, abs_diff;
+     TYPE2 sum = init;
+   loop:
+     sum_0 = phi <init, sum_1>
+     S1  x_t = ...
+     S2  y_t = ...
+     S3  x_T = (TYPE1) x_t;
+     S4  y_T = (TYPE1) y_t;
+     S5  diff = x_T - y_T;
+     S6  abs_diff = ABS_EXPR <diff>;
+     [S7  abs_diff = (TYPE2) abs_diff;  #optional]
+     S8  sum_1 = abs_diff + sum_0;
+
+   where 'TYPE1' is at least double the size of type 'type', and 'TYPE2' is the
+   same size as 'TYPE1' or bigger.  This is a special case of a reduction
+   computation.
+
+   Input:
+
+   * STMTS: Contains a stmt from which the pattern search begins.  In the
+   example, when this function is called with S8, the pattern
+   {S3,S4,S5,S6,S7,S8} will be detected.
+
+   Output:
+
+   * TYPE_IN: The type of the input arguments to the pattern.
+
+   * TYPE_OUT: The type of the output of this pattern.
+
+   * Return value: A new stmt that will be used to replace the sequence of
+   stmts that constitute the pattern. In this case it will be:
+        SAD_EXPR <x_t, y_t, sum_0>
+  */
+
+static gimple
+vect_recog_sad_pattern (vec<gimple> *stmts, tree *type_in,
+			     tree *type_out)
+{
+  gimple last_stmt = (*stmts)[0];
+  tree sad_oprnd0, sad_oprnd1;
+  stmt_vec_info stmt_vinfo = vinfo_for_stmt (last_stmt);
+  tree half_type;
+  loop_vec_info loop_info = STMT_VINFO_LOOP_VINFO (stmt_vinfo);
+  struct loop *loop;
+  bool promotion;
+
+  if (!loop_info)
+    return NULL;
+
+  loop = LOOP_VINFO_LOOP (loop_info);
+
+  if (!is_gimple_assign (last_stmt))
+    return NULL;
+
+  tree sum_type = gimple_expr_type (last_stmt);
+
+  /* Look for the following pattern
+          DX = (TYPE1) X;
+          DY = (TYPE1) Y;
+          DDIFF = DX - DY;
+          DAD = ABS_EXPR <DDIFF>;
+          DAD = (TYPE2) DAD;
+          sum_1 = DAD + sum_0;
+     In which
+     - DX is at least double the size of X
+     - DY is at least double the size of Y
+     - DX, DY, DDIFF, DAD all have the same type
+     - sum is the same size as DAD or bigger
+     - sum has been recognized as a reduction variable.
+
+     This is equivalent to:
+       DDIFF = X w- Y;          #widen sub
+       DAD = ABS_EXPR <DDIFF>;
+       sum_1 = DAD w+ sum_0;    #widen summation
+     or
+       DDIFF = X w- Y;          #widen sub
+       DAD = ABS_EXPR <DDIFF>;
+       sum_1 = DAD + sum_0;     #summation
+   */
+
+  /* Starting from LAST_STMT, follow the defs of its uses in search
+     of the above pattern.  */
+
+  if (gimple_assign_rhs_code (last_stmt) != PLUS_EXPR)
+    return NULL;
+
+  tree plus_oprnd0, plus_oprnd1;
+
+  if (STMT_VINFO_IN_PATTERN_P (stmt_vinfo))
+    {
+      /* Has been detected as widening-summation?  */
+
+      gimple stmt = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+      sum_type = gimple_expr_type (stmt);
+      if (gimple_assign_rhs_code (stmt) != WIDEN_SUM_EXPR)
+        return NULL;
+      plus_oprnd0 = gimple_assign_rhs1 (stmt);
+      plus_oprnd1 = gimple_assign_rhs2 (stmt);
+      half_type = TREE_TYPE (plus_oprnd0);
+    }
+  else
+    {
+      gimple def_stmt;
+
+      if (STMT_VINFO_DEF_TYPE (stmt_vinfo) != vect_reduction_def)
+        return NULL;
+      plus_oprnd0 = gimple_assign_rhs1 (last_stmt);
+      plus_oprnd1 = gimple_assign_rhs2 (last_stmt);
+      if (!types_compatible_p (TREE_TYPE (plus_oprnd0), sum_type)
+	  || !types_compatible_p (TREE_TYPE (plus_oprnd1), sum_type))
+        return NULL;
+
+      /* The type conversion could be promotion, demotion,
+         or just signed -> unsigned.  */
+      if (type_conversion_p (plus_oprnd0, last_stmt, false,
+                             &half_type, &def_stmt, &promotion))
+        plus_oprnd0 = gimple_assign_rhs1 (def_stmt);
+      else
+        half_type = sum_type;
+    }
+
+  /* So far so good.  Since last_stmt was detected as a (summation) reduction,
+     we know that plus_oprnd1 is the reduction variable (defined by a loop-header
+     phi), and plus_oprnd0 is an ssa-name defined by a stmt in the loop body.
+     Then check that plus_oprnd0 is defined by an abs_expr  */
+
+  if (TREE_CODE (plus_oprnd0) != SSA_NAME)
+    return NULL;
+
+  tree abs_type = half_type;
+  gimple abs_stmt = SSA_NAME_DEF_STMT (plus_oprnd0);
+
+  /* It cannot be the SAD pattern if the abs_stmt is outside the loop.  */
+  if (!gimple_bb (abs_stmt) || !flow_bb_inside_loop_p (loop, gimple_bb (abs_stmt)))
+    return NULL;
+
+  /* FORNOW.  Can continue analyzing the def-use chain when this stmt is a
+     phi inside the loop (in case we are analyzing an outer-loop).  */
+  if (!is_gimple_assign (abs_stmt))
+    return NULL;
+
+  stmt_vec_info abs_stmt_vinfo = vinfo_for_stmt (abs_stmt);
+  gcc_assert (abs_stmt_vinfo);
+  if (STMT_VINFO_DEF_TYPE (abs_stmt_vinfo) != vect_internal_def)
+    return NULL;
+  if (gimple_assign_rhs_code (abs_stmt) != ABS_EXPR)
+    return NULL;
+
+  tree abs_oprnd = gimple_assign_rhs1 (abs_stmt);
+  if (!types_compatible_p (TREE_TYPE (abs_oprnd), abs_type))
+    return NULL;
+  if (TYPE_UNSIGNED (abs_type))
+    return NULL;
+
+  /* We then detect if the operand of abs_expr is defined by a minus_expr.  */
+
+  if (TREE_CODE (abs_oprnd) != SSA_NAME)
+    return NULL;
+
+  gimple diff_stmt = SSA_NAME_DEF_STMT (abs_oprnd);
+
+  /* It cannot be the SAD pattern if the diff_stmt is outside the loop.  */
+  if (!gimple_bb (diff_stmt)
+      || !flow_bb_inside_loop_p (loop, gimple_bb (diff_stmt)))
+    return NULL;
+
+  /* FORNOW.  Can continue analyzing the def-use chain when this stmt is a
+     phi inside the loop (in case we are analyzing an outer-loop).  */
+  if (!is_gimple_assign (diff_stmt))
+    return NULL;
+
+  stmt_vec_info diff_stmt_vinfo = vinfo_for_stmt (diff_stmt);
+  gcc_assert (diff_stmt_vinfo);
+  if (STMT_VINFO_DEF_TYPE (diff_stmt_vinfo) != vect_internal_def)
+    return NULL;
+  if (gimple_assign_rhs_code (diff_stmt) != MINUS_EXPR)
+    return NULL;
+
+  tree half_type0, half_type1;
+  gimple def_stmt;
+
+  tree minus_oprnd0 = gimple_assign_rhs1 (diff_stmt);
+  tree minus_oprnd1 = gimple_assign_rhs2 (diff_stmt);
+
+  if (!types_compatible_p (TREE_TYPE (minus_oprnd0), abs_type)
+      || !types_compatible_p (TREE_TYPE (minus_oprnd1), abs_type))
+    return NULL;
+  if (!type_conversion_p (minus_oprnd0, diff_stmt, false,
+                          &half_type0, &def_stmt, &promotion)
+      || !promotion)
+    return NULL;
+  sad_oprnd0 = gimple_assign_rhs1 (def_stmt);
+
+  if (!type_conversion_p (minus_oprnd1, diff_stmt, false,
+                          &half_type1, &def_stmt, &promotion)
+      || !promotion)
+    return NULL;
+  sad_oprnd1 = gimple_assign_rhs1 (def_stmt);
+
+  if (!types_compatible_p (half_type0, half_type1))
+    return NULL;
+  if (TYPE_PRECISION (abs_type) < TYPE_PRECISION (half_type0) * 2
+      || TYPE_PRECISION (sum_type) < TYPE_PRECISION (half_type0) * 2)
+    return NULL;
+
+  *type_in = TREE_TYPE (sad_oprnd0);
+  *type_out = sum_type;
+
+  /* Pattern detected. Create a stmt to be used to replace the pattern: */
+  tree var = vect_recog_temp_ssa_var (sum_type, NULL);
+  gimple pattern_stmt = gimple_build_assign_with_ops
+                          (SAD_EXPR, var, sad_oprnd0, sad_oprnd1, plus_oprnd1);
+
+  if (dump_enabled_p ())
+    {
+      dump_printf_loc (MSG_NOTE, vect_location,
+                       "vect_recog_sad_pattern: detected: ");
+      dump_gimple_stmt (MSG_NOTE, TDF_SLIM, pattern_stmt, 0);
+      dump_printf (MSG_NOTE, "\n");
+    }
+
+  /* We don't allow changing the order of the computation in the inner-loop
+     when doing outer-loop vectorization.  */
+  gcc_assert (!nested_in_vect_loop_p (loop, last_stmt));
+
+  return pattern_stmt;
+}
+
+
 /* Handle widening operation by a constant.  At the moment we support MULT_EXPR
    and LSHIFT_EXPR.
 
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 8b7b345..0aac75b 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1044,7 +1044,7 @@ extern void vect_slp_transform_bb (basic_block);
    Additional pattern recognition functions can (and will) be added
    in the future.  */
 typedef gimple (* vect_recog_func_ptr) (vec<gimple> *, tree *, tree *);
-#define NUM_PATTERNS 11
+#define NUM_PATTERNS 12
 void vect_pattern_recog (loop_vec_info, bb_vec_info);
 
 /* In tree-vectorizer.c.  */
diff --git a/gcc/tree.def b/gcc/tree.def
index 88c850a..e15ee61 100644
--- a/gcc/tree.def
+++ b/gcc/tree.def
@@ -1155,6 +1155,12 @@ DEFTREECODE (DOT_PROD_EXPR, "dot_prod_expr", tcc_expression, 3)
    with the second argument.  */
 DEFTREECODE (WIDEN_SUM_EXPR, "widen_sum_expr", tcc_binary, 2)
 
+/* Widening sad (sum of absolute differences).
+   The first two arguments are of type t1, which should be an integer type.
+   The third argument and the result are of type t2, such that t2 is at least
+   twice the size of t1.  */
+DEFTREECODE (SAD_EXPR, "sad_expr", tcc_expression, 3)
+
 /* Widening multiplication.
    The two arguments are of type t1.
    The result is of type t2, such that t2 is at least twice

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
  2013-11-01 16:49     ` Cong Hou
@ 2013-11-04 10:06       ` James Greenhalgh
  2013-11-04 18:34         ` Cong Hou
  0 siblings, 1 reply; 27+ messages in thread
From: James Greenhalgh @ 2013-11-04 10:06 UTC (permalink / raw)
  To: Cong Hou; +Cc: Uros Bizjak, ramana.gcc, Richard Biener, gcc-patches

On Fri, Nov 01, 2013 at 04:48:53PM +0000, Cong Hou wrote:
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index 2a5a2e1..8f5d39a 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -4705,6 +4705,16 @@ wider mode, is computed and added to operand 3.
> Operand 3 is of a mode equal or
>  wider than the mode of the product. The result is placed in operand 0, which
>  is of the same mode as operand 3.
> 
> +@cindex @code{ssad@var{m}} instruction pattern
> +@item @samp{ssad@var{m}}
> +@cindex @code{usad@var{m}} instruction pattern
> +@item @samp{usad@var{m}}
> +Compute the sum of absolute differences of two signed/unsigned elements.
> +Operand 1 and operand 2 are of the same mode. Their absolute difference, which
> +is of a wider mode, is computed and added to operand 3. Operand 3 is of a mode
> +equal or wider than the mode of the absolute difference. The result is placed
> +in operand 0, which is of the same mode as operand 3.
> +
>  @cindex @code{ssum_widen@var{m3}} instruction pattern
>  @item @samp{ssum_widen@var{m3}}
>  @cindex @code{usum_widen@var{m3}} instruction pattern
> diff --git a/gcc/expr.c b/gcc/expr.c
> index 4975a64..1db8a49 100644

I'm not sure I follow, and if I do - I don't think it matches what
you have implemented for i386.

From your text description I would guess the series of operations to be:

  v1 = widen (operands[1])
  v2 = widen (operands[2])
  v3 = abs (v1 - v2)
  operands[0] = v3 + operands[3]

But if I understand the behaviour of PSADBW correctly, what you have
actually implemented is:

  v1 = widen (operands[1])
  v2 = widen (operands[2])
  v3 = abs (v1 - v2)
  v4 = reduce_plus (v3)
  operands[0] = v4 + operands[3]

To my mind, synthesizing the reduce_plus step will be wasteful for targets
who do not get this for free with their Absolute Difference step. Imagine a
simple loop where we have synthesized the reduce_plus, we compute partial
sums each loop iteration, though we would be better to leave the reduce_plus
step until after the loop. "REDUC_PLUS_EXPR" would be the appropriate
Tree code for this.

I would prefer to see this Tree code not imply the reduce_plus.

Thanks,
James


* Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
  2013-11-04 10:06       ` James Greenhalgh
@ 2013-11-04 18:34         ` Cong Hou
  2013-11-05 10:03           ` James Greenhalgh
  0 siblings, 1 reply; 27+ messages in thread
From: Cong Hou @ 2013-11-04 18:34 UTC (permalink / raw)
  To: James Greenhalgh; +Cc: Uros Bizjak, ramana.gcc, Richard Biener, gcc-patches

On Mon, Nov 4, 2013 at 2:06 AM, James Greenhalgh
<james.greenhalgh@arm.com> wrote:
> On Fri, Nov 01, 2013 at 04:48:53PM +0000, Cong Hou wrote:
>> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
>> index 2a5a2e1..8f5d39a 100644
>> --- a/gcc/doc/md.texi
>> +++ b/gcc/doc/md.texi
>> @@ -4705,6 +4705,16 @@ wider mode, is computed and added to operand 3.
>> Operand 3 is of a mode equal or
>>  wider than the mode of the product. The result is placed in operand 0, which
>>  is of the same mode as operand 3.
>>
>> +@cindex @code{ssad@var{m}} instruction pattern
>> +@item @samp{ssad@var{m}}
>> +@cindex @code{usad@var{m}} instruction pattern
>> +@item @samp{usad@var{m}}
>> +Compute the sum of absolute differences of two signed/unsigned elements.
>> +Operand 1 and operand 2 are of the same mode. Their absolute difference, which
>> +is of a wider mode, is computed and added to operand 3. Operand 3 is of a mode
>> +equal or wider than the mode of the absolute difference. The result is placed
>> +in operand 0, which is of the same mode as operand 3.
>> +
>>  @cindex @code{ssum_widen@var{m3}} instruction pattern
>>  @item @samp{ssum_widen@var{m3}}
>>  @cindex @code{usum_widen@var{m3}} instruction pattern
>> diff --git a/gcc/expr.c b/gcc/expr.c
>> index 4975a64..1db8a49 100644
>
> I'm not sure I follow, and if I do - I don't think it matches what
> you have implemented for i386.
>
> From your text description I would guess the series of operations to be:
>
>   v1 = widen (operands[1])
>   v2 = widen (operands[2])
>   v3 = abs (v1 - v2)
>   operands[0] = v3 + operands[3]
>
> But if I understand the behaviour of PSADBW correctly, what you have
> actually implemented is:
>
>   v1 = widen (operands[1])
>   v2 = widen (operands[2])
>   v3 = abs (v1 - v2)
>   v4 = reduce_plus (v3)
>   operands[0] = v4 + operands[3]
>
> To my mind, synthesizing the reduce_plus step will be wasteful for targets
> who do not get this for free with their Absolute Difference step. Imagine a
> simple loop where we have synthesized the reduce_plus, we compute partial
> sums each loop iteration, though we would be better to leave the reduce_plus
> step until after the loop. "REDUC_PLUS_EXPR" would be the appropriate
> Tree code for this.

What do you mean when you use "synthesizing" here? For each pattern,
the only synthesized operation is the one being returned from the
pattern recognizer. In this case, it is USAD_EXPR. The recognition of
reduce sum is necessary as we need corresponding prolog and epilog for
reductions, which is already done before pattern recognition. Note
that reduction is not a pattern but is a type of vector definition. A
vectorization pattern can still be a reduction operation as long as
STMT_VINFO_RELATED_STMT of this pattern is a reduction operation. You
can check the other two reduction patterns: widen_sum_pattern and
dot_prod_pattern for reference.

Thank you for your comment!


Cong

>
> I would prefer to see this Tree code not imply the reduce_plus.
>
> Thanks,
> James
>


* Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
  2013-11-04 18:34         ` Cong Hou
@ 2013-11-05 10:03           ` James Greenhalgh
  2013-11-05 18:14             ` Cong Hou
  0 siblings, 1 reply; 27+ messages in thread
From: James Greenhalgh @ 2013-11-05 10:03 UTC (permalink / raw)
  To: Cong Hou; +Cc: Uros Bizjak, ramana.gcc, Richard Biener, gcc-patches

On Mon, Nov 04, 2013 at 06:30:55PM +0000, Cong Hou wrote:
> On Mon, Nov 4, 2013 at 2:06 AM, James Greenhalgh
> <james.greenhalgh@arm.com> wrote:
> > On Fri, Nov 01, 2013 at 04:48:53PM +0000, Cong Hou wrote:
> >> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> >> index 2a5a2e1..8f5d39a 100644
> >> --- a/gcc/doc/md.texi
> >> +++ b/gcc/doc/md.texi
> >> @@ -4705,6 +4705,16 @@ wider mode, is computed and added to operand 3.
> >> Operand 3 is of a mode equal or
> >>  wider than the mode of the product. The result is placed in operand 0, which
> >>  is of the same mode as operand 3.
> >>
> >> +@cindex @code{ssad@var{m}} instruction pattern
> >> +@item @samp{ssad@var{m}}
> >> +@cindex @code{usad@var{m}} instruction pattern
> >> +@item @samp{usad@var{m}}
> >> +Compute the sum of absolute differences of two signed/unsigned elements.
> >> +Operand 1 and operand 2 are of the same mode. Their absolute difference, which
> >> +is of a wider mode, is computed and added to operand 3. Operand 3 is of a mode
> >> +equal or wider than the mode of the absolute difference. The result is placed
> >> +in operand 0, which is of the same mode as operand 3.
> >> +
> >>  @cindex @code{ssum_widen@var{m3}} instruction pattern
> >>  @item @samp{ssum_widen@var{m3}}
> >>  @cindex @code{usum_widen@var{m3}} instruction pattern
> >> diff --git a/gcc/expr.c b/gcc/expr.c
> >> index 4975a64..1db8a49 100644
> >
> > I'm not sure I follow, and if I do - I don't think it matches what
> > you have implemented for i386.
> >
> > From your text description I would guess the series of operations to be:
> >
> >   v1 = widen (operands[1])
> >   v2 = widen (operands[2])
> >   v3 = abs (v1 - v2)
> >   operands[0] = v3 + operands[3]
> >
> > But if I understand the behaviour of PSADBW correctly, what you have
> > actually implemented is:
> >
> >   v1 = widen (operands[1])
> >   v2 = widen (operands[2])
> >   v3 = abs (v1 - v2)
> >   v4 = reduce_plus (v3)
> >   operands[0] = v4 + operands[3]
> >
> > To my mind, synthesizing the reduce_plus step will be wasteful for targets
> > who do not get this for free with their Absolute Difference step. Imagine a
> > simple loop where we have synthesized the reduce_plus, we compute partial
> > sums each loop iteration, though we would be better to leave the reduce_plus
> > step until after the loop. "REDUC_PLUS_EXPR" would be the appropriate
> > Tree code for this.
> 
> What do you mean when you use "synthesizing" here? For each pattern,
> the only synthesized operation is the one being returned from the
> pattern recognizer. In this case, it is USAD_EXPR. The recognition of
> reduce sum is necessary as we need corresponding prolog and epilog for
> reductions, which is already done before pattern recognition. Note
> that reduction is not a pattern but is a type of vector definition. A
> vectorization pattern can still be a reduction operation as long as
> STMT_VINFO_RELATED_STMT of this pattern is a reduction operation. You
> can check the other two reduction patterns: widen_sum_pattern and
> dot_prod_pattern for reference.

My apologies for not being clear. What I mean is, for a target which does
not have a dedicated PSADBW instruction, the individual steps of
'usad<m>' must be "synthesized" in such a way as to match the expected
behaviour of the tree code.

So, I must expand 'usadm' to a series of equivalent instructions
as USAD_EXPR expects.

If USAD_EXPR requires me to emit a reduction on each loop iteration,
I think that will be inefficient compared to performing the reduction
after the loop body.

To a first approximation on ARM, I would expect from your description
of 'usad<m>' that generating,

     VABAL   ops[3], ops[1], ops[2]
     (Vector widening Absolute Difference and Accumulate)

would fulfil the requirements.

But to match the behaviour you have implemented in the i386
backend I would be required to generate:

    VABAL   ops[3], ops[1], ops[2]
    VPADD   ops[3], ops[3], ops[3] (add one set of pairs)
    VPADD   ops[3], ops[3], ops[3] (and the other)
    VAND    ops[0], ops[3], MASK   (clear high lanes)

Which additionally performs the (redundant) vector reduction
and high lane zeroing step on each loop iteration.

My comment is that your documentation and implementation are
inconsistent so I am not sure which behaviour you intend for USAD_EXPR.

Additionally, I think it would be more generic to choose the first
behaviour, rather than requiring a wasteful decomposition to match
a very particular i386 opcode.

Thanks,
James


* Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
  2013-11-05 10:03           ` James Greenhalgh
@ 2013-11-05 18:14             ` Cong Hou
  2013-11-08  6:42               ` Cong Hou
  0 siblings, 1 reply; 27+ messages in thread
From: Cong Hou @ 2013-11-05 18:14 UTC (permalink / raw)
  To: James Greenhalgh; +Cc: Uros Bizjak, ramana.gcc, Richard Biener, gcc-patches

Thank you for your detailed explanation.

Once GCC detects a reduction operation, it will automatically
accumulate all elements in the vector after the loop. In the loop the
reduction variable is always a vector whose elements are reductions of
corresponding values from other vectors. Therefore in your case the
only instruction you need to generate is:

    VABAL   ops[3], ops[1], ops[2]

It is OK if you accumulate the elements into one in the vector inside
the loop (if one instruction can do this), but you have to make sure
the other elements in the vector remain zero so that the final
result is correct.
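
For reference, the kind of scalar source loop this whole pattern aims
at looks roughly like the sketch below (illustrative, not taken from
the patch): the vectorizer first detects `sum` as a reduction, and the
recognizer then rewrites the widen/subtract/abs/add chain into a
single SAD_EXPR.

```c
#include <stdlib.h>

/* Illustrative scalar SAD kernel.  `sum` is the reduction variable;
   the abs-of-difference body is what SAD_EXPR captures.  */
unsigned int
sad (const unsigned char *a, const unsigned char *b, int n)
{
  unsigned int sum = 0;
  for (int i = 0; i < n; i++)
    sum += abs (a[i] - b[i]);
  return sum;
}
```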

If you are confused about the documentation, check the one for
udot_prod (just above usad in md.texi), as it has very similar
behavior as usad. Actually I copied the text from there and did some
changes. As those two instruction patterns are both for vectorization,
their behavior should not be difficult to explain.

If you have more questions, or think that the documentation is still
unclear, please let me know.

Thank you very much!


Cong


On Tue, Nov 5, 2013 at 1:53 AM, James Greenhalgh
<james.greenhalgh@arm.com> wrote:
> On Mon, Nov 04, 2013 at 06:30:55PM +0000, Cong Hou wrote:
>> On Mon, Nov 4, 2013 at 2:06 AM, James Greenhalgh
>> <james.greenhalgh@arm.com> wrote:
>> > On Fri, Nov 01, 2013 at 04:48:53PM +0000, Cong Hou wrote:
>> >> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
>> >> index 2a5a2e1..8f5d39a 100644
>> >> --- a/gcc/doc/md.texi
>> >> +++ b/gcc/doc/md.texi
>> >> @@ -4705,6 +4705,16 @@ wider mode, is computed and added to operand 3.
>> >> Operand 3 is of a mode equal or
>> >>  wider than the mode of the product. The result is placed in operand 0, which
>> >>  is of the same mode as operand 3.
>> >>
>> >> +@cindex @code{ssad@var{m}} instruction pattern
>> >> +@item @samp{ssad@var{m}}
>> >> +@cindex @code{usad@var{m}} instruction pattern
>> >> +@item @samp{usad@var{m}}
>> >> +Compute the sum of absolute differences of two signed/unsigned elements.
>> >> +Operand 1 and operand 2 are of the same mode. Their absolute difference, which
>> >> +is of a wider mode, is computed and added to operand 3. Operand 3 is of a mode
>> >> +equal or wider than the mode of the absolute difference. The result is placed
>> >> +in operand 0, which is of the same mode as operand 3.
>> >> +
>> >>  @cindex @code{ssum_widen@var{m3}} instruction pattern
>> >>  @item @samp{ssum_widen@var{m3}}
>> >>  @cindex @code{usum_widen@var{m3}} instruction pattern
>> >> diff --git a/gcc/expr.c b/gcc/expr.c
>> >> index 4975a64..1db8a49 100644
>> >
>> > I'm not sure I follow, and if I do - I don't think it matches what
>> > you have implemented for i386.
>> >
>> > From your text description I would guess the series of operations to be:
>> >
>> >   v1 = widen (operands[1])
>> >   v2 = widen (operands[2])
>> >   v3 = abs (v1 - v2)
>> >   operands[0] = v3 + operands[3]
>> >
>> > But if I understand the behaviour of PSADBW correctly, what you have
>> > actually implemented is:
>> >
>> >   v1 = widen (operands[1])
>> >   v2 = widen (operands[2])
>> >   v3 = abs (v1 - v2)
>> >   v4 = reduce_plus (v3)
>> >   operands[0] = v4 + operands[3]
>> >
>> > To my mind, synthesizing the reduce_plus step will be wasteful for targets
>> > who do not get this for free with their Absolute Difference step. Imagine a
>> > simple loop where we have synthesized the reduce_plus, we compute partial
>> > sums each loop iteration, though we would be better to leave the reduce_plus
>> > step until after the loop. "REDUC_PLUS_EXPR" would be the appropriate
>> > Tree code for this.
>>
>> What do you mean when you use "synthesizing" here? For each pattern,
>> the only synthesized operation is the one being returned from the
>> pattern recognizer. In this case, it is USAD_EXPR. The recognition of
>> reduce sum is necessary as we need corresponding prolog and epilog for
>> reductions, which is already done before pattern recognition. Note
>> that reduction is not a pattern but is a type of vector definition. A
>> vectorization pattern can still be a reduction operation as long as
>> STMT_VINFO_RELATED_STMT of this pattern is a reduction operation. You
>> can check the other two reduction patterns: widen_sum_pattern and
>> dot_prod_pattern for reference.
>
> My apologies for not being clear. What I mean is, for a target which does
> not have a dedicated PSADBW instruction, the individual steps of
> 'usad<m>' must be "synthesized" in such a way as to match the expected
> behaviour of the tree code.
>
> So, I must expand 'usadm' to a series of equivalent instructions
> as USAD_EXPR expects.
>
> If USAD_EXPR requires me to emit a reduction on each loop iteration,
> I think that will be inefficient compared to performing the reduction
> after the loop body.
>
> To a first approximation on ARM, I would expect from your description
> of 'usad<m>' that generating,
>
>      VABAL   ops[3], ops[1], ops[2]
>      (Vector widening Absolute Difference and Accumulate)
>
> would fulfil the requirements.
>
> But to match the behaviour you have implemented in the i386
> backend I would be required to generate:
>
>     VABAL   ops[3], ops[1], ops[2]
>     VPADD   ops[3], ops[3], ops[3] (add one set of pairs)
>     VPADD   ops[3], ops[3], ops[3] (and the other)
>     VAND    ops[0], ops[3], MASK   (clear high lanes)
>
> Which additionally performs the (redundant) vector reduction
> and high lane zeroing step on each loop iteration.
>
> My comment is that your documentation and implementation are
> inconsistent so I am not sure which behaviour you intend for USAD_EXPR.
>
> Additionally, I think it would be more generic to choose the first
> behaviour, rather than requiring a wasteful decomposition to match
> a very particular i386 opcode.
>
> Thanks,
> James
>


* Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
  2013-11-05 18:14             ` Cong Hou
@ 2013-11-08  6:42               ` Cong Hou
  2013-11-08 11:30                 ` James Greenhalgh
  0 siblings, 1 reply; 27+ messages in thread
From: Cong Hou @ 2013-11-08  6:42 UTC (permalink / raw)
  To: James Greenhalgh; +Cc: Uros Bizjak, ramana.gcc, Richard Biener, gcc-patches

Now is this patch OK for the trunk? Thank you!



thanks,
Cong


On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou <congh@google.com> wrote:
> Thank you for your detailed explanation.
>
> Once GCC detects a reduction operation, it will automatically
> accumulate all elements in the vector after the loop. In the loop the
> reduction variable is always a vector whose elements are reductions of
> corresponding values from other vectors. Therefore in your case the
> only instruction you need to generate is:
>
>     VABAL   ops[3], ops[1], ops[2]
>
> It is OK if you accumulate the elements into one in the vector inside
> of the loop (if one instruction can do this), but you have to make
> sure other elements in the vector should remain zero so that the final
> result is correct.
>
> If you are confused about the documentation, check the one for
> udot_prod (just above usad in md.texi), as it has very similar
> behavior as usad. Actually I copied the text from there and did some
> changes. As those two instruction patterns are both for vectorization,
> their behavior should not be difficult to explain.
>
> If you have more questions or think that the documentation is still
> improper please let me know.
>
> Thank you very much!
>
>
> Cong
>
>
> On Tue, Nov 5, 2013 at 1:53 AM, James Greenhalgh
> <james.greenhalgh@arm.com> wrote:
>> On Mon, Nov 04, 2013 at 06:30:55PM +0000, Cong Hou wrote:
>>> On Mon, Nov 4, 2013 at 2:06 AM, James Greenhalgh
>>> <james.greenhalgh@arm.com> wrote:
>>> > On Fri, Nov 01, 2013 at 04:48:53PM +0000, Cong Hou wrote:
>>> >> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
>>> >> index 2a5a2e1..8f5d39a 100644
>>> >> --- a/gcc/doc/md.texi
>>> >> +++ b/gcc/doc/md.texi
>>> >> @@ -4705,6 +4705,16 @@ wider mode, is computed and added to operand 3.
>>> >> Operand 3 is of a mode equal or
>>> >>  wider than the mode of the product. The result is placed in operand 0, which
>>> >>  is of the same mode as operand 3.
>>> >>
>>> >> +@cindex @code{ssad@var{m}} instruction pattern
>>> >> +@item @samp{ssad@var{m}}
>>> >> +@cindex @code{usad@var{m}} instruction pattern
>>> >> +@item @samp{usad@var{m}}
>>> >> +Compute the sum of absolute differences of two signed/unsigned elements.
>>> >> +Operand 1 and operand 2 are of the same mode. Their absolute difference, which
>>> >> +is of a wider mode, is computed and added to operand 3. Operand 3 is of a mode
>>> >> +equal or wider than the mode of the absolute difference. The result is placed
>>> >> +in operand 0, which is of the same mode as operand 3.
>>> >> +
>>> >>  @cindex @code{ssum_widen@var{m3}} instruction pattern
>>> >>  @item @samp{ssum_widen@var{m3}}
>>> >>  @cindex @code{usum_widen@var{m3}} instruction pattern
>>> >> diff --git a/gcc/expr.c b/gcc/expr.c
>>> >> index 4975a64..1db8a49 100644
>>> >
>>> > I'm not sure I follow, and if I do - I don't think it matches what
>>> > you have implemented for i386.
>>> >
>>> > From your text description I would guess the series of operations to be:
>>> >
>>> >   v1 = widen (operands[1])
>>> >   v2 = widen (operands[2])
>>> >   v3 = abs (v1 - v2)
>>> >   operands[0] = v3 + operands[3]
>>> >
>>> > But if I understand the behaviour of PSADBW correctly, what you have
>>> > actually implemented is:
>>> >
>>> >   v1 = widen (operands[1])
>>> >   v2 = widen (operands[2])
>>> >   v3 = abs (v1 - v2)
>>> >   v4 = reduce_plus (v3)
>>> >   operands[0] = v4 + operands[3]
>>> >
>>> > To my mind, synthesizing the reduce_plus step will be wasteful for targets
>>> > who do not get this for free with their Absolute Difference step. Imagine a
>>> > simple loop where we have synthesized the reduce_plus, we compute partial
>>> > sums each loop iteration, though we would be better to leave the reduce_plus
>>> > step until after the loop. "REDUC_PLUS_EXPR" would be the appropriate
>>> > Tree code for this.
>>>
>>> What do you mean when you use "synthesizing" here? For each pattern,
>>> the only synthesized operation is the one being returned from the
>>> pattern recognizer. In this case, it is USAD_EXPR. The recognition of
>>> reduce sum is necessary as we need corresponding prolog and epilog for
>>> reductions, which is already done before pattern recognition. Note
>>> that reduction is not a pattern but is a type of vector definition. A
>>> vectorization pattern can still be a reduction operation as long as
>>> STMT_VINFO_RELATED_STMT of this pattern is a reduction operation. You
>>> can check the other two reduction patterns: widen_sum_pattern and
>>> dot_prod_pattern for reference.
>>
>> My apologies for not being clear. What I mean is, for a target which does
>> not have a dedicated PSADBW instruction, the individual steps of
>> 'usad<m>' must be "synthesized" in such a way as to match the expected
>> behaviour of the tree code.
>>
>> So, I must expand 'usadm' to a series of equivalent instructions
>> as USAD_EXPR expects.
>>
>> If USAD_EXPR requires me to emit a reduction on each loop iteration,
>> I think that will be inefficient compared to performing the reduction
>> after the loop body.
>>
>> To a first approximation on ARM, I would expect from your description
>> of 'usad<m>' that generating,
>>
>>      VABAL   ops[3], ops[1], ops[2]
>>      (Vector widening Absolute Difference and Accumulate)
>>
>> would fulfil the requirements.
>>
>> But to match the behaviour you have implemented in the i386
>> backend I would be required to generate:
>>
>>     VABAL   ops[3], ops[1], ops[2]
>>     VPADD   ops[3], ops[3], ops[3] (add one set of pairs)
>>     VPADD   ops[3], ops[3], ops[3] (and the other)
>>     VAND    ops[0], ops[3], MASK   (clear high lanes)
>>
>> Which additionally performs the (redundant) vector reduction
>> and high lane zeroing step on each loop iteration.
>>
>> My comment is that your documentation and implementation are
>> inconsistent so I am not sure which behaviour you intend for USAD_EXPR.
>>
>> Additionally, I think it would be more generic to choose the first
>> behaviour, rather than requiring a wasteful decomposition to match
>> a very particular i386 opcode.
>>
>> Thanks,
>> James
>>


* Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
  2013-11-08  6:42               ` Cong Hou
@ 2013-11-08 11:30                 ` James Greenhalgh
  2013-11-11 21:22                   ` Cong Hou
  0 siblings, 1 reply; 27+ messages in thread
From: James Greenhalgh @ 2013-11-08 11:30 UTC (permalink / raw)
  To: Cong Hou; +Cc: Uros Bizjak, ramana.gcc, Richard Biener, gcc-patches

> On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou <congh@google.com> wrote:
> > Thank you for your detailed explanation.
> >
> > Once GCC detects a reduction operation, it will automatically
> > accumulate all elements in the vector after the loop. In the loop the
> > reduction variable is always a vector whose elements are reductions of
> > corresponding values from other vectors. Therefore in your case the
> > only instruction you need to generate is:
> >
> >     VABAL   ops[3], ops[1], ops[2]
> >
> > It is OK if you accumulate the elements into one in the vector inside
> > of the loop (if one instruction can do this), but you have to make
> > sure other elements in the vector should remain zero so that the final
> > result is correct.
> >
> > If you are confused about the documentation, check the one for
> > udot_prod (just above usad in md.texi), as it has very similar
> > behavior as usad. Actually I copied the text from there and did some
> > changes. As those two instruction patterns are both for vectorization,
> > their behavior should not be difficult to explain.
> >
> > If you have more questions or think that the documentation is still
> > improper please let me know.

Hi Cong,

Thanks for your reply.

I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and
DOT_PROD_EXPR and I see that the same ambiguity exists for
DOT_PROD_EXPR. Can you please add a note in your tree.def
that SAD_EXPR, like DOT_PROD_EXPR, can be expanded as either:

  tmp = WIDEN_MINUS_EXPR (arg1, arg2)
  tmp2 = ABS_EXPR (tmp)
  arg3 = PLUS_EXPR (tmp2, arg3)

or:

  tmp = WIDEN_MINUS_EXPR (arg1, arg2)
  tmp2 = ABS_EXPR (tmp)
  arg3 = WIDEN_SUM_EXPR (tmp2, arg3)

Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a
value of the same (widened) type as arg3.
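
As a scalar sketch (illustrative; the concrete element types are an assumption), both expansions compute the same value per element:

```c
#include <assert.h>
#include <stdlib.h>

/* Scalar model of SAD_EXPR <arg1, arg2, arg3> on one element pair:
   widen both inputs, subtract as signed, take the absolute value,
   accumulate into the wider type.  Both expansions quoted above
   reduce to this.  */
static int
sad_expr_elem (unsigned char arg1, unsigned char arg2, int arg3)
{
  int tmp = (int) arg1 - (int) arg2;   /* WIDEN_MINUS_EXPR (signed)  */
  int tmp2 = abs (tmp);                /* ABS_EXPR                   */
  return tmp2 + arg3;                  /* PLUS_EXPR / WIDEN_SUM_EXPR */
}
```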

Also, while looking for the history of DOT_PROD_EXPR I spotted this
patch:

  [autovect] [patch] detect mult-hi and sad patterns
  http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html

I wonder what the reason was for that patch to be dropped?

Thanks,
James

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
  2013-11-08 11:30                 ` James Greenhalgh
@ 2013-11-11 21:22                   ` Cong Hou
  2013-11-14  7:50                     ` Cong Hou
  2013-12-03  1:07                     ` Cong Hou
  0 siblings, 2 replies; 27+ messages in thread
From: Cong Hou @ 2013-11-11 21:22 UTC (permalink / raw)
  To: James Greenhalgh; +Cc: Uros Bizjak, ramana.gcc, Richard Biener, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2614 bytes --]

Hi James

Sorry for the late reply.


On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh
<james.greenhalgh@arm.com> wrote:
>> On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou <congh@google.com> wrote:
>> > Thank you for your detailed explanation.
>> >
>> > Once GCC detects a reduction operation, it will automatically
>> > accumulate all elements in the vector after the loop. In the loop the
>> > reduction variable is always a vector whose elements are reductions of
>> > corresponding values from other vectors. Therefore in your case the
>> > only instruction you need to generate is:
>> >
>> >     VABAL   ops[3], ops[1], ops[2]
>> >
>> > It is OK if you accumulate the elements into one in the vector inside
>> > of the loop (if one instruction can do this), but you have to make
>> > sure other elements in the vector should remain zero so that the final
>> > result is correct.
>> >
>> > If you are confused about the documentation, check the one for
>> > udot_prod (just above usad in md.texi), as it has very similar
>> > behavior as usad. Actually I copied the text from there and did some
>> > changes. As those two instruction patterns are both for vectorization,
>> > their behavior should not be difficult to explain.
>> >
>> > If you have more questions or think that the documentation is still
>> > improper please let me know.
>
> Hi Cong,
>
> Thanks for your reply.
>
> I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and
> DOT_PROD_EXPR and I see that the same ambiguity exists for
> DOT_PROD_EXPR. Can you please add a note in your tree.def
> that SAD_EXPR, like DOT_PROD_EXPR can be expanded as either:
>
>   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
>   tmp2 = ABS_EXPR (tmp)
>   arg3 = PLUS_EXPR (tmp2, arg3)
>
> or:
>
>   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
>   tmp2 = ABS_EXPR (tmp)
>   arg3 = WIDEN_SUM_EXPR (tmp2, arg3)
>
> Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a
> value of the same (widened) type as arg3.
>


I have added it, although we currently don't have WIDEN_MINUS_EXPR (I
mentioned it in tree.def).


> Also, while looking for the history of DOT_PROD_EXPR I spotted this
> patch:
>
>   [autovect] [patch] detect mult-hi and sad patterns
>   http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html
>
> I wonder what the reason was for that patch to be dropped?
>

It has been 8 years... I have no idea why that patch was ultimately not
accepted; there is not even a reply in that thread. But I believe it is
very important that the SAD pattern be recognized. ARM also provides
instructions for it.
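
For context, the loop shape the new recognizer targets is the classic image/video kernel (a sketch; the function and array names are illustrative):

```c
#include <assert.h>
#include <stdlib.h>

/* Typical SAD kernel from block-based motion estimation: the loop body
   is exactly the abs (a[i] - b[i]) reduction that
   vect_recog_sad_pattern detects and rewrites as SAD_EXPR.  */
static int
block_sad (const unsigned char *a, const unsigned char *b, int n)
{
  int sum = 0;
  for (int i = 0; i < n; i++)
    sum += abs (a[i] - b[i]);
  return sum;
}
```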


Thank you for your comment again!


thanks,
Cong



> Thanks,
> James
>

[-- Attachment #2: patch-sad.txt --]
[-- Type: text/plain, Size: 24649 bytes --]

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 6bdaa31..37ff6c4 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,4 +1,24 @@
-2013-11-01  Trevor Saunders  <tsaunders@mozilla.com>
+2013-10-29  Cong Hou  <congh@google.com>
+
+	* tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD
+	pattern recognition.
+	(type_conversion_p): PROMOTION is true if it's a type promotion
+	conversion, and false otherwise.  Return true if the given expression
+	is a type conversion.
+	* tree-vectorizer.h: Adjust the number of patterns.
+	* tree.def: Add SAD_EXPR.
+	* optabs.def: Add sad_optab.
+	* cfgexpand.c (expand_debug_expr): Add SAD_EXPR case.
+	* expr.c (expand_expr_real_2): Likewise.
+	* gimple-pretty-print.c (dump_ternary_rhs): Likewise.
+	* gimple.c (get_gimple_rhs_num_ops): Likewise.
+	* optabs.c (optab_for_tree_code): Likewise.
+	* tree-cfg.c (estimate_operator_cost): Likewise.
+	* tree-ssa-operands.c (get_expr_operands): Likewise.
+	* tree-vect-loop.c (get_initial_def_for_reduction): Likewise.
+	* config/i386/sse.md: Add SSE2 and AVX2 expand for SAD.
+	* doc/generic.texi: Add document for SAD_EXPR.
+	* doc/md.texi: Add document for ssad and usad.
 
 	* function.c (reorder_blocks): Convert block_stack to a stack_vec.
 	* gimplify.c (gimplify_compound_lval): Likewise.
diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
index fb05ce7..1f824fb 100644
--- a/gcc/cfgexpand.c
+++ b/gcc/cfgexpand.c
@@ -2740,6 +2740,7 @@ expand_debug_expr (tree exp)
 	{
 	case COND_EXPR:
 	case DOT_PROD_EXPR:
+	case SAD_EXPR:
 	case WIDEN_MULT_PLUS_EXPR:
 	case WIDEN_MULT_MINUS_EXPR:
 	case FMA_EXPR:
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 9094a1c..af73817 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -7278,6 +7278,36 @@
   DONE;
 })
 
+(define_expand "usadv16qi"
+  [(match_operand:V4SI 0 "register_operand")
+   (match_operand:V16QI 1 "register_operand")
+   (match_operand:V16QI 2 "nonimmediate_operand")
+   (match_operand:V4SI 3 "nonimmediate_operand")]
+  "TARGET_SSE2"
+{
+  rtx t1 = gen_reg_rtx (V2DImode);
+  rtx t2 = gen_reg_rtx (V4SImode);
+  emit_insn (gen_sse2_psadbw (t1, operands[1], operands[2]));
+  convert_move (t2, t1, 0);
+  emit_insn (gen_addv4si3 (operands[0], t2, operands[3]));
+  DONE;
+})
+
+(define_expand "usadv32qi"
+  [(match_operand:V8SI 0 "register_operand")
+   (match_operand:V32QI 1 "register_operand")
+   (match_operand:V32QI 2 "nonimmediate_operand")
+   (match_operand:V8SI 3 "nonimmediate_operand")]
+  "TARGET_AVX2"
+{
+  rtx t1 = gen_reg_rtx (V4DImode);
+  rtx t2 = gen_reg_rtx (V8SImode);
+  emit_insn (gen_avx2_psadbw (t1, operands[1], operands[2]));
+  convert_move (t2, t1, 0);
+  emit_insn (gen_addv8si3 (operands[0], t2, operands[3]));
+  DONE;
+})
+
 (define_insn "ashr<mode>3"
   [(set (match_operand:VI24_AVX2 0 "register_operand" "=x,x")
 	(ashiftrt:VI24_AVX2
diff --git a/gcc/doc/generic.texi b/gcc/doc/generic.texi
index 73dd123..fa9a19a 100644
--- a/gcc/doc/generic.texi
+++ b/gcc/doc/generic.texi
@@ -1713,6 +1713,7 @@ a value from @code{enum annot_expr_kind}.
 @tindex VEC_PACK_TRUNC_EXPR
 @tindex VEC_PACK_SAT_EXPR
 @tindex VEC_PACK_FIX_TRUNC_EXPR
+@tindex SAD_EXPR
 
 @table @code
 @item VEC_LSHIFT_EXPR
@@ -1793,6 +1794,15 @@ value, it is taken from the second operand. It should never evaluate to
 any other value currently, but optimizations should not rely on that
 property. In contrast with a @code{COND_EXPR}, all operands are always
 evaluated.
+
+@item SAD_EXPR
+This node represents the Sum of Absolute Differences operation.  The three
+operands must be vectors of integral types.  The first and second operand
+must have the same type.  The size of the vector element of the third
+operand must be at least twice the size of the vector element of the
+first and second one.  The SAD is calculated between the first and second
+operands, added to the third operand, and returned.
+
 @end table
 
 
diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index ac10a0a..8142a7f 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -4786,6 +4786,16 @@ wider mode, is computed and added to operand 3. Operand 3 is of a mode equal or
 wider than the mode of the product. The result is placed in operand 0, which
 is of the same mode as operand 3.
 
+@cindex @code{ssad@var{m}} instruction pattern
+@item @samp{ssad@var{m}}
+@cindex @code{usad@var{m}} instruction pattern
+@item @samp{usad@var{m}}
+Compute the sum of absolute differences of two signed/unsigned elements.
+Operand 1 and operand 2 are of the same mode. Their absolute difference, which
+is of a wider mode, is computed and added to operand 3. Operand 3 is of a mode
+equal or wider than the mode of the absolute difference. The result is placed
+in operand 0, which is of the same mode as operand 3.
+
 @cindex @code{ssum_widen@var{m3}} instruction pattern
 @item @samp{ssum_widen@var{m3}}
 @cindex @code{usum_widen@var{m3}} instruction pattern
diff --git a/gcc/expr.c b/gcc/expr.c
index 551a660..ce45a79 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -8973,6 +8973,20 @@ expand_expr_real_2 (sepops ops, rtx target, enum machine_mode tmode,
 	return target;
       }
 
+      case SAD_EXPR:
+      {
+	tree oprnd0 = treeop0;
+	tree oprnd1 = treeop1;
+	tree oprnd2 = treeop2;
+	rtx op2;
+
+	expand_operands (oprnd0, oprnd1, NULL_RTX, &op0, &op1, EXPAND_NORMAL);
+	op2 = expand_normal (oprnd2);
+	target = expand_widen_pattern_expr (ops, op0, op1, op2,
+					    target, unsignedp);
+	return target;
+      }
+
     case REALIGN_LOAD_EXPR:
       {
         tree oprnd0 = treeop0;
diff --git a/gcc/gimple-pretty-print.c b/gcc/gimple-pretty-print.c
index 6842213..56a90f1 100644
--- a/gcc/gimple-pretty-print.c
+++ b/gcc/gimple-pretty-print.c
@@ -429,6 +429,16 @@ dump_ternary_rhs (pretty_printer *buffer, gimple gs, int spc, int flags)
       dump_generic_node (buffer, gimple_assign_rhs3 (gs), spc, flags, false);
       pp_greater (buffer);
       break;
+
+    case SAD_EXPR:
+      pp_string (buffer, "SAD_EXPR <");
+      dump_generic_node (buffer, gimple_assign_rhs1 (gs), spc, flags, false);
+      pp_string (buffer, ", ");
+      dump_generic_node (buffer, gimple_assign_rhs2 (gs), spc, flags, false);
+      pp_string (buffer, ", ");
+      dump_generic_node (buffer, gimple_assign_rhs3 (gs), spc, flags, false);
+      pp_greater (buffer);
+      break;
     
     case VEC_PERM_EXPR:
       pp_string (buffer, "VEC_PERM_EXPR <");
diff --git a/gcc/gimple.c b/gcc/gimple.c
index 20f6010..599d1f6 100644
--- a/gcc/gimple.c
+++ b/gcc/gimple.c
@@ -2582,6 +2582,7 @@ get_gimple_rhs_num_ops (enum tree_code code)
       || (SYM) == WIDEN_MULT_PLUS_EXPR					    \
       || (SYM) == WIDEN_MULT_MINUS_EXPR					    \
       || (SYM) == DOT_PROD_EXPR						    \
+      || (SYM) == SAD_EXPR						    \
       || (SYM) == REALIGN_LOAD_EXPR					    \
       || (SYM) == VEC_COND_EXPR						    \
       || (SYM) == VEC_PERM_EXPR                                             \
diff --git a/gcc/optabs.c b/gcc/optabs.c
index 3755670..564aefe 100644
--- a/gcc/optabs.c
+++ b/gcc/optabs.c
@@ -462,6 +462,9 @@ optab_for_tree_code (enum tree_code code, const_tree type,
     case DOT_PROD_EXPR:
       return TYPE_UNSIGNED (type) ? udot_prod_optab : sdot_prod_optab;
 
+    case SAD_EXPR:
+      return TYPE_UNSIGNED (type) ? usad_optab : ssad_optab;
+
     case WIDEN_MULT_PLUS_EXPR:
       return (TYPE_UNSIGNED (type)
 	      ? (TYPE_SATURATING (type)
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 6b924ac..377763e 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -248,6 +248,8 @@ OPTAB_D (sdot_prod_optab, "sdot_prod$I$a")
 OPTAB_D (ssum_widen_optab, "widen_ssum$I$a3")
 OPTAB_D (udot_prod_optab, "udot_prod$I$a")
 OPTAB_D (usum_widen_optab, "widen_usum$I$a3")
+OPTAB_D (usad_optab, "usad$I$a")
+OPTAB_D (ssad_optab, "ssad$I$a")
 OPTAB_D (vec_extract_optab, "vec_extract$a")
 OPTAB_D (vec_init_optab, "vec_init$a")
 OPTAB_D (vec_pack_sfix_trunc_optab, "vec_pack_sfix_trunc_$a")
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index e9bd852..9528325 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,7 @@
+2013-10-29  Cong Hou  <congh@google.com>
+
+	* gcc.dg/vect/vect-reduc-sad.c: New.
+	* lib/target-supports.exp (check_effective_target_vect_usad_char): New.
 2013-11-01  Marc Glisse  <marc.glisse@inria.fr>
 
 	PR c++/58834
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c
new file mode 100644
index 0000000..15a625f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c
@@ -0,0 +1,54 @@
+/* { dg-require-effective-target vect_usad_char } */
+
+#include <stdarg.h>
+#include "tree-vect.h"
+
+#define N 64
+#define SAD N*N/2
+
+unsigned char X[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+unsigned char Y[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+
+/* Sum of absolute differences between arrays of unsigned char types.
+   Detected as a sad pattern.
+   Vectorized on targets that support sad for unsigned chars.  */
+
+__attribute__ ((noinline)) int
+foo (int len)
+{
+  int i;
+  int result = 0;
+
+  for (i = 0; i < len; i++)
+    result += abs (X[i] - Y[i]);
+
+  return result;
+}
+
+
+int
+main (void)
+{
+  int i;
+  int sad;
+
+  check_vect ();
+
+  for (i = 0; i < N; i++)
+    {
+      X[i] = i;
+      Y[i] = N - i;
+      __asm__ volatile ("");
+    }
+
+  sad = foo (N);
+  if (sad != SAD)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vect_recog_sad_pattern: detected" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index 5ca0b76..3ab71ea 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -3685,6 +3685,26 @@ proc check_effective_target_vect_udot_hi { } {
     return $et_vect_udot_hi_saved
 }
 
+# Return 1 if the target plus current options supports a vector
+# sad operation of unsigned chars, 0 otherwise.
+#
+# This won't change for different subtargets so cache the result.
+
+proc check_effective_target_vect_usad_char { } {
+    global et_vect_usad_char
+
+    if [info exists et_vect_usad_char_saved] {
+        verbose "check_effective_target_vect_usad_char: using cached result" 2
+    } else {
+        set et_vect_usad_char_saved 0
+        if { ([istarget i?86-*-*]
+             || [istarget x86_64-*-*]) } {
+            set et_vect_usad_char_saved 1
+        }
+    }
+    verbose "check_effective_target_vect_usad_char: returning $et_vect_usad_char_saved" 2
+    return $et_vect_usad_char_saved
+}
 
 # Return 1 if the target plus current options supports a vector
 # demotion (packing) of shorts (to chars) and ints (to shorts) 
diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c
index d646693..e63fce0 100644
--- a/gcc/tree-cfg.c
+++ b/gcc/tree-cfg.c
@@ -3830,6 +3830,36 @@ verify_gimple_assign_ternary (gimple stmt)
 
       return false;
 
+    case SAD_EXPR:
+      if (!useless_type_conversion_p (rhs1_type, rhs2_type)
+	  || !useless_type_conversion_p (lhs_type, rhs3_type)
+	  || 2 * GET_MODE_BITSIZE (GET_MODE_INNER
+				     (TYPE_MODE (TREE_TYPE (rhs1_type))))
+	       > GET_MODE_BITSIZE (GET_MODE_INNER
+				     (TYPE_MODE (TREE_TYPE (lhs_type)))))
+	{
+	  error ("type mismatch in sad expression");
+	  debug_generic_expr (lhs_type);
+	  debug_generic_expr (rhs1_type);
+	  debug_generic_expr (rhs2_type);
+	  debug_generic_expr (rhs3_type);
+	  return true;
+	}
+
+      if (TREE_CODE (rhs1_type) != VECTOR_TYPE
+	  || TREE_CODE (rhs2_type) != VECTOR_TYPE
+	  || TREE_CODE (rhs3_type) != VECTOR_TYPE)
+	{
+	  error ("vector types expected in sad expression");
+	  debug_generic_expr (lhs_type);
+	  debug_generic_expr (rhs1_type);
+	  debug_generic_expr (rhs2_type);
+	  debug_generic_expr (rhs3_type);
+	  return true;
+	}
+
+      return false;
+
     case DOT_PROD_EXPR:
     case REALIGN_LOAD_EXPR:
       /* FIXME.  */
diff --git a/gcc/tree-inline.c b/gcc/tree-inline.c
index 77013b3..a9a23ce 100644
--- a/gcc/tree-inline.c
+++ b/gcc/tree-inline.c
@@ -3605,6 +3605,7 @@ estimate_operator_cost (enum tree_code code, eni_weights *weights,
     case WIDEN_SUM_EXPR:
     case WIDEN_MULT_EXPR:
     case DOT_PROD_EXPR:
+    case SAD_EXPR:
     case WIDEN_MULT_PLUS_EXPR:
     case WIDEN_MULT_MINUS_EXPR:
     case WIDEN_LSHIFT_EXPR:
diff --git a/gcc/tree-ssa-operands.c b/gcc/tree-ssa-operands.c
index 4e05d2d..b44aca8 100644
--- a/gcc/tree-ssa-operands.c
+++ b/gcc/tree-ssa-operands.c
@@ -859,6 +859,7 @@ get_expr_operands (gimple stmt, tree *expr_p, int flags)
       }
 
     case DOT_PROD_EXPR:
+    case SAD_EXPR:
     case REALIGN_LOAD_EXPR:
     case WIDEN_MULT_PLUS_EXPR:
     case WIDEN_MULT_MINUS_EXPR:
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index d5f86ad..7700f0c 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -3620,6 +3620,7 @@ get_initial_def_for_reduction (gimple stmt, tree init_val,
     {
       case WIDEN_SUM_EXPR:
       case DOT_PROD_EXPR:
+      case SAD_EXPR:
       case PLUS_EXPR:
       case MINUS_EXPR:
       case BIT_IOR_EXPR:
diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
index 0998804..4f873b2 100644
--- a/gcc/tree-vect-patterns.c
+++ b/gcc/tree-vect-patterns.c
@@ -49,6 +49,8 @@ static gimple vect_recog_widen_mult_pattern (vec<gimple> *, tree *,
 					     tree *);
 static gimple vect_recog_dot_prod_pattern (vec<gimple> *, tree *,
 					   tree *);
+static gimple vect_recog_sad_pattern (vec<gimple> *, tree *,
+				      tree *);
 static gimple vect_recog_pow_pattern (vec<gimple> *, tree *, tree *);
 static gimple vect_recog_over_widening_pattern (vec<gimple> *, tree *,
                                                  tree *);
@@ -66,6 +68,7 @@ static vect_recog_func_ptr vect_vect_recog_func_ptrs[NUM_PATTERNS] = {
 	vect_recog_widen_mult_pattern,
 	vect_recog_widen_sum_pattern,
 	vect_recog_dot_prod_pattern,
+        vect_recog_sad_pattern,
 	vect_recog_pow_pattern,
 	vect_recog_widen_shift_pattern,
 	vect_recog_over_widening_pattern,
@@ -144,9 +147,8 @@ vect_single_imm_use (gimple def_stmt)
 }
 
 /* Check whether NAME, an ssa-name used in USE_STMT,
-   is a result of a type promotion or demotion, such that:
+   is a result of a type promotion, such that:
      DEF_STMT: NAME = NOP (name0)
-   where the type of name0 (ORIG_TYPE) is smaller/bigger than the type of NAME.
    If CHECK_SIGN is TRUE, check that either both types are signed or both are
    unsigned.  */
 
@@ -193,10 +195,8 @@ type_conversion_p (tree name, gimple use_stmt, bool check_sign,
 
   if (TYPE_PRECISION (type) >= (TYPE_PRECISION (*orig_type) * 2))
     *promotion = true;
-  else if (TYPE_PRECISION (*orig_type) >= (TYPE_PRECISION (type) * 2))
-    *promotion = false;
   else
-    return false;
+    *promotion = false;
 
   if (!vect_is_simple_use (oprnd0, *def_stmt, loop_vinfo,
 			   bb_vinfo, &dummy_gimple, &dummy, &dt))
@@ -437,6 +437,240 @@ vect_recog_dot_prod_pattern (vec<gimple> *stmts, tree *type_in,
 }
 
 
+/* Function vect_recog_sad_pattern
+
+   Try to find the following Sum of Absolute Difference (SAD) pattern:
+
+     type x_t, y_t;
+     signed TYPE1 diff, abs_diff;
+     TYPE2 sum = init;
+   loop:
+     sum_0 = phi <init, sum_1>
+     S1  x_t = ...
+     S2  y_t = ...
+     S3  x_T = (TYPE1) x_t;
+     S4  y_T = (TYPE1) y_t;
+     S5  diff = x_T - y_T;
+     S6  abs_diff = ABS_EXPR <diff>;
+     [S7  abs_diff = (TYPE2) abs_diff;  #optional]
+     S8  sum_1 = abs_diff + sum_0;
+
+   where 'TYPE1' is at least double the size of type 'type', and 'TYPE2' is the
+   same size as 'TYPE1' or bigger.  This is a special case of a reduction
+   computation.
+
+   Input:
+
+   * STMTS: Contains a stmt from which the pattern search begins.  In the
+   example, when this function is called with S8, the pattern
+   {S3,S4,S5,S6,S7,S8} will be detected.
+
+   Output:
+
+   * TYPE_IN: The type of the input arguments to the pattern.
+
+   * TYPE_OUT: The type of the output of this pattern.
+
+   * Return value: A new stmt that will be used to replace the sequence of
+   stmts that constitute the pattern. In this case it will be:
+        SAD_EXPR <x_t, y_t, sum_0>
+  */
+
+static gimple
+vect_recog_sad_pattern (vec<gimple> *stmts, tree *type_in,
+			     tree *type_out)
+{
+  gimple last_stmt = (*stmts)[0];
+  tree sad_oprnd0, sad_oprnd1;
+  stmt_vec_info stmt_vinfo = vinfo_for_stmt (last_stmt);
+  tree half_type;
+  loop_vec_info loop_info = STMT_VINFO_LOOP_VINFO (stmt_vinfo);
+  struct loop *loop;
+  bool promotion;
+
+  if (!loop_info)
+    return NULL;
+
+  loop = LOOP_VINFO_LOOP (loop_info);
+
+  if (!is_gimple_assign (last_stmt))
+    return NULL;
+
+  tree sum_type = gimple_expr_type (last_stmt);
+
+  /* Look for the following pattern
+          DX = (TYPE1) X;
+          DY = (TYPE1) Y;
+          DDIFF = DX - DY;
+          DAD = ABS_EXPR <DDIFF>;
+          [DAD = (TYPE2) DAD;   # optional]
+          sum_1 = DAD + sum_0;
+     In which
+     - DX is at least double the size of X
+     - DY is at least double the size of Y
+     - DX, DY, DDIFF, DAD all have the same type
+     - sum is the same size as DAD or bigger
+     - sum has been recognized as a reduction variable.
+
+     This is equivalent to:
+       DDIFF = X w- Y;          #widen sub
+       DAD = ABS_EXPR <DDIFF>;
+       sum_1 = DAD w+ sum_0;    #widen summation
+     or
+       DDIFF = X w- Y;          #widen sub
+       DAD = ABS_EXPR <DDIFF>;
+       sum_1 = DAD + sum_0;     #summation
+   */
+
+  /* Starting from LAST_STMT, follow the defs of its uses in search
+     of the above pattern.  */
+
+  if (gimple_assign_rhs_code (last_stmt) != PLUS_EXPR)
+    return NULL;
+
+  tree plus_oprnd0, plus_oprnd1;
+
+  if (STMT_VINFO_IN_PATTERN_P (stmt_vinfo))
+    {
+      /* Has been detected as widening-summation?  */
+
+      gimple stmt = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+      sum_type = gimple_expr_type (stmt);
+      if (gimple_assign_rhs_code (stmt) != WIDEN_SUM_EXPR)
+        return NULL;
+      plus_oprnd0 = gimple_assign_rhs1 (stmt);
+      plus_oprnd1 = gimple_assign_rhs2 (stmt);
+      half_type = TREE_TYPE (plus_oprnd0);
+    }
+  else
+    {
+      gimple def_stmt;
+
+      if (STMT_VINFO_DEF_TYPE (stmt_vinfo) != vect_reduction_def)
+        return NULL;
+      plus_oprnd0 = gimple_assign_rhs1 (last_stmt);
+      plus_oprnd1 = gimple_assign_rhs2 (last_stmt);
+      if (!types_compatible_p (TREE_TYPE (plus_oprnd0), sum_type)
+	  || !types_compatible_p (TREE_TYPE (plus_oprnd1), sum_type))
+        return NULL;
+
+      /* The type conversion could be promotion, demotion,
+         or just signed -> unsigned.  */
+      if (type_conversion_p (plus_oprnd0, last_stmt, false,
+                             &half_type, &def_stmt, &promotion))
+        plus_oprnd0 = gimple_assign_rhs1 (def_stmt);
+      else
+        half_type = sum_type;
+    }
+
+  /* So far so good.  Since last_stmt was detected as a (summation) reduction,
+     we know that plus_oprnd1 is the reduction variable (defined by a loop-header
+     phi), and plus_oprnd0 is an ssa-name defined by a stmt in the loop body.
+     Then check that plus_oprnd0 is defined by an abs_expr  */
+
+  if (TREE_CODE (plus_oprnd0) != SSA_NAME)
+    return NULL;
+
+  tree abs_type = half_type;
+  gimple abs_stmt = SSA_NAME_DEF_STMT (plus_oprnd0);
+
+  /* It cannot be the sad pattern if the abs_stmt is outside the loop.  */
+  if (!gimple_bb (abs_stmt) || !flow_bb_inside_loop_p (loop, gimple_bb (abs_stmt)))
+    return NULL;
+
+  /* FORNOW.  Can continue analyzing the def-use chain when this stmt is in a
+     phi inside the loop (in case we are analyzing an outer-loop).  */
+  if (!is_gimple_assign (abs_stmt))
+    return NULL;
+
+  stmt_vec_info abs_stmt_vinfo = vinfo_for_stmt (abs_stmt);
+  gcc_assert (abs_stmt_vinfo);
+  if (STMT_VINFO_DEF_TYPE (abs_stmt_vinfo) != vect_internal_def)
+    return NULL;
+  if (gimple_assign_rhs_code (abs_stmt) != ABS_EXPR)
+    return NULL;
+
+  tree abs_oprnd = gimple_assign_rhs1 (abs_stmt);
+  if (!types_compatible_p (TREE_TYPE (abs_oprnd), abs_type))
+    return NULL;
+  if (TYPE_UNSIGNED (abs_type))
+    return NULL;
+
+  /* We then detect if the operand of abs_expr is defined by a minus_expr.  */
+
+  if (TREE_CODE (abs_oprnd) != SSA_NAME)
+    return NULL;
+
+  gimple diff_stmt = SSA_NAME_DEF_STMT (abs_oprnd);
+
+  /* It cannot be the sad pattern if the diff_stmt is outside the loop.  */
+  if (!gimple_bb (diff_stmt)
+      || !flow_bb_inside_loop_p (loop, gimple_bb (diff_stmt)))
+    return NULL;
+
+  /* FORNOW.  Can continue analyzing the def-use chain when this stmt is in a
+     phi inside the loop (in case we are analyzing an outer-loop).  */
+  if (!is_gimple_assign (diff_stmt))
+    return NULL;
+
+  stmt_vec_info diff_stmt_vinfo = vinfo_for_stmt (diff_stmt);
+  gcc_assert (diff_stmt_vinfo);
+  if (STMT_VINFO_DEF_TYPE (diff_stmt_vinfo) != vect_internal_def)
+    return NULL;
+  if (gimple_assign_rhs_code (diff_stmt) != MINUS_EXPR)
+    return NULL;
+
+  tree half_type0, half_type1;
+  gimple def_stmt;
+
+  tree minus_oprnd0 = gimple_assign_rhs1 (diff_stmt);
+  tree minus_oprnd1 = gimple_assign_rhs2 (diff_stmt);
+
+  if (!types_compatible_p (TREE_TYPE (minus_oprnd0), abs_type)
+      || !types_compatible_p (TREE_TYPE (minus_oprnd1), abs_type))
+    return NULL;
+  if (!type_conversion_p (minus_oprnd0, diff_stmt, false,
+                          &half_type0, &def_stmt, &promotion)
+      || !promotion)
+    return NULL;
+  sad_oprnd0 = gimple_assign_rhs1 (def_stmt);
+
+  if (!type_conversion_p (minus_oprnd1, diff_stmt, false,
+                          &half_type1, &def_stmt, &promotion)
+      || !promotion)
+    return NULL;
+  sad_oprnd1 = gimple_assign_rhs1 (def_stmt);
+
+  if (!types_compatible_p (half_type0, half_type1))
+    return NULL;
+  if (TYPE_PRECISION (abs_type) < TYPE_PRECISION (half_type0) * 2
+      || TYPE_PRECISION (sum_type) < TYPE_PRECISION (half_type0) * 2)
+    return NULL;
+
+  *type_in = TREE_TYPE (sad_oprnd0);
+  *type_out = sum_type;
+
+  /* Pattern detected. Create a stmt to be used to replace the pattern: */
+  tree var = vect_recog_temp_ssa_var (sum_type, NULL);
+  gimple pattern_stmt = gimple_build_assign_with_ops
+                          (SAD_EXPR, var, sad_oprnd0, sad_oprnd1, plus_oprnd1);
+
+  if (dump_enabled_p ())
+    {
+      dump_printf_loc (MSG_NOTE, vect_location,
+                       "vect_recog_sad_pattern: detected: ");
+      dump_gimple_stmt (MSG_NOTE, TDF_SLIM, pattern_stmt, 0);
+      dump_printf (MSG_NOTE, "\n");
+    }
+
+  /* We don't allow changing the order of the computation in the inner-loop
+     when doing outer-loop vectorization.  */
+  gcc_assert (!nested_in_vect_loop_p (loop, last_stmt));
+
+  return pattern_stmt;
+}
+
+
 /* Handle widening operation by a constant.  At the moment we support MULT_EXPR
    and LSHIFT_EXPR.
 
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index a2f482d..d3a8137 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1046,7 +1046,7 @@ extern void vect_slp_transform_bb (basic_block);
    Additional pattern recognition functions can (and will) be added
    in the future.  */
 typedef gimple (* vect_recog_func_ptr) (vec<gimple> *, tree *, tree *);
-#define NUM_PATTERNS 11
+#define NUM_PATTERNS 12
 void vect_pattern_recog (loop_vec_info, bb_vec_info);
 
 /* In tree-vectorizer.c.  */
diff --git a/gcc/tree.def b/gcc/tree.def
index 399b5af..e9a147a 100644
--- a/gcc/tree.def
+++ b/gcc/tree.def
@@ -1160,6 +1160,22 @@ DEFTREECODE (DOT_PROD_EXPR, "dot_prod_expr", tcc_expression, 3)
    with the second argument.  */
 DEFTREECODE (WIDEN_SUM_EXPR, "widen_sum_expr", tcc_binary, 2)
 
+/* Widening sad (sum of absolute differences).
+   The first two arguments are of type t1 which should be integer.
+   The third argument and the result are of type t2, such that t2 is at least
+   twice the size of t1.  Like DOT_PROD_EXPR, SAD_EXPR (arg1,arg2,arg3) is
+   equivalent to (note that we don't have WIDEN_MINUS_EXPR yet, but we assume
+   its behavior would be similar to WIDEN_SUM_EXPR):
+       tmp = WIDEN_MINUS_EXPR (arg1, arg2)
+       tmp2 = ABS_EXPR (tmp)
+       arg3 = PLUS_EXPR (tmp2, arg3)
+  or:
+       tmp = WIDEN_MINUS_EXPR (arg1, arg2)
+       tmp2 = ABS_EXPR (tmp)
+       arg3 = WIDEN_SUM_EXPR (tmp2, arg3)
+ */
+DEFTREECODE (SAD_EXPR, "sad_expr", tcc_expression, 3)
+
 /* Widening multiplication.
    The two arguments are of type t1.
    The result is of type t2, such that t2 is at least twice

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
  2013-11-11 21:22                   ` Cong Hou
@ 2013-11-14  7:50                     ` Cong Hou
  2013-11-15 18:47                       ` Cong Hou
  2013-12-03  1:07                     ` Cong Hou
  1 sibling, 1 reply; 27+ messages in thread
From: Cong Hou @ 2013-11-14  7:50 UTC (permalink / raw)
  To: James Greenhalgh; +Cc: Uros Bizjak, ramana.gcc, Richard Biener, gcc-patches

Ping?


thanks,
Cong


On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou <congh@google.com> wrote:
> Hi James
>
> Sorry for the late reply.
>
>
> On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh
> <james.greenhalgh@arm.com> wrote:
>>> On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou <congh@google.com> wrote:
>>> > Thank you for your detailed explanation.
>>> >
>>> > Once GCC detects a reduction operation, it will automatically
>>> > accumulate all elements in the vector after the loop. In the loop the
>>> > reduction variable is always a vector whose elements are reductions of
>>> > corresponding values from other vectors. Therefore in your case the
>>> > only instruction you need to generate is:
>>> >
>>> >     VABAL   ops[3], ops[1], ops[2]
>>> >
>>> > It is OK if you accumulate the elements into one in the vector inside
>>> > of the loop (if one instruction can do this), but you have to make
>>> > sure other elements in the vector should remain zero so that the final
>>> > result is correct.
>>> >
>>> > If you are confused about the documentation, check the one for
>>> > udot_prod (just above usad in md.texi), as it has very similar
>>> > behavior as usad. Actually I copied the text from there and did some
>>> > changes. As those two instruction patterns are both for vectorization,
>>> > their behavior should not be difficult to explain.
>>> >
>>> > If you have more questions or think that the documentation is still
>>> > improper please let me know.
>>
>> Hi Cong,
>>
>> Thanks for your reply.
>>
>> I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and
>> DOT_PROD_EXPR and I see that the same ambiguity exists for
>> DOT_PROD_EXPR. Can you please add a note in your tree.def
>> that SAD_EXPR, like DOT_PROD_EXPR can be expanded as either:
>>
>>   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
>>   tmp2 = ABS_EXPR (tmp)
>>   arg3 = PLUS_EXPR (tmp2, arg3)
>>
>> or:
>>
>>   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
>>   tmp2 = ABS_EXPR (tmp)
>>   arg3 = WIDEN_SUM_EXPR (tmp2, arg3)
>>
>> Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a
>> a value of the same (widened) type as arg3.
>>
>
>
> I have added it, although we currently don't have WIDEN_MINUS_EXPR (I
> mentioned it in tree.def).
>
>
>> Also, while looking for the history of DOT_PROD_EXPR I spotted this
>> patch:
>>
>>   [autovect] [patch] detect mult-hi and sad patterns
>>   http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html
>>
>> I wonder what the reason was for that patch to be dropped?
>>
>
> It has been 8 years... I have no idea why that patch was never
> accepted; there was not even a reply in that thread. But I believe
> recognizing the SAD pattern is very important. ARM also provides
> instructions for it.
>
>
> Thank you for your comment again!
>
>
> thanks,
> Cong
>
>
>
>> Thanks,
>> James
>>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
  2013-11-14  7:50                     ` Cong Hou
@ 2013-11-15 18:47                       ` Cong Hou
  2013-11-20 18:59                         ` Cong Hou
  0 siblings, 1 reply; 27+ messages in thread
From: Cong Hou @ 2013-11-15 18:47 UTC (permalink / raw)
  To: James Greenhalgh; +Cc: Uros Bizjak, ramana.gcc, Richard Biener, gcc-patches

Any more comments?



thanks,
Cong


On Wed, Nov 13, 2013 at 6:06 PM, Cong Hou <congh@google.com> wrote:
> Ping?
>
>
> thanks,
> Cong
>
>
> On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou <congh@google.com> wrote:
>> Hi James
>>
>> Sorry for the late reply.
>>
>>
>> On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh
>> <james.greenhalgh@arm.com> wrote:
>>>> On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou <congh@google.com> wrote:
>>>> > Thank you for your detailed explanation.
>>>> >
>>>> > Once GCC detects a reduction operation, it will automatically
>>>> > accumulate all elements in the vector after the loop. In the loop the
>>>> > reduction variable is always a vector whose elements are reductions of
>>>> > corresponding values from other vectors. Therefore in your case the
>>>> > only instruction you need to generate is:
>>>> >
>>>> >     VABAL   ops[3], ops[1], ops[2]
>>>> >
>>>> > It is OK if you accumulate the elements into one in the vector inside
>>>> > of the loop (if one instruction can do this), but you have to make
>>>> > sure other elements in the vector should remain zero so that the final
>>>> > result is correct.
>>>> >
>>>> > If you are confused about the documentation, check the one for
>>>> > udot_prod (just above usad in md.texi), as it has very similar
>>>> > behavior as usad. Actually I copied the text from there and did some
>>>> > changes. As those two instruction patterns are both for vectorization,
>>>> > their behavior should not be difficult to explain.
>>>> >
>>>> > If you have more questions or think that the documentation is still
>>>> > improper please let me know.
>>>
>>> Hi Cong,
>>>
>>> Thanks for your reply.
>>>
>>> I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and
>>> DOT_PROD_EXPR and I see that the same ambiguity exists for
>>> DOT_PROD_EXPR. Can you please add a note in your tree.def
>>> that SAD_EXPR, like DOT_PROD_EXPR can be expanded as either:
>>>
>>>   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
>>>   tmp2 = ABS_EXPR (tmp)
>>>   arg3 = PLUS_EXPR (tmp2, arg3)
>>>
>>> or:
>>>
>>>   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
>>>   tmp2 = ABS_EXPR (tmp)
>>>   arg3 = WIDEN_SUM_EXPR (tmp2, arg3)
>>>
>>> Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a
>>> a value of the same (widened) type as arg3.
>>>
>>
>>
>> I have added it, although we currently don't have WIDEN_MINUS_EXPR (I
>> mentioned it in tree.def).
>>
>>
>>> Also, while looking for the history of DOT_PROD_EXPR I spotted this
>>> patch:
>>>
>>>   [autovect] [patch] detect mult-hi and sad patterns
>>>   http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html
>>>
>>> I wonder what the reason was for that patch to be dropped?
>>>
>>
>> It has been 8 years... I have no idea why that patch was never
>> accepted; there was not even a reply in that thread. But I believe
>> recognizing the SAD pattern is very important. ARM also provides
>> instructions for it.
>>
>>
>> Thank you for your comment again!
>>
>>
>> thanks,
>> Cong
>>
>>
>>
>>> Thanks,
>>> James
>>>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
  2013-11-15 18:47                       ` Cong Hou
@ 2013-11-20 18:59                         ` Cong Hou
  0 siblings, 0 replies; 27+ messages in thread
From: Cong Hou @ 2013-11-20 18:59 UTC (permalink / raw)
  To: James Greenhalgh; +Cc: Uros Bizjak, ramana.gcc, Richard Biener, gcc-patches

Ping...


thanks,
Cong


On Fri, Nov 15, 2013 at 9:52 AM, Cong Hou <congh@google.com> wrote:
> Any more comments?
>
>
>
> thanks,
> Cong
>
>
> On Wed, Nov 13, 2013 at 6:06 PM, Cong Hou <congh@google.com> wrote:
>> Ping?
>>
>>
>> thanks,
>> Cong
>>
>>
>> On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou <congh@google.com> wrote:
>>> Hi James
>>>
>>> Sorry for the late reply.
>>>
>>>
>>> On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh
>>> <james.greenhalgh@arm.com> wrote:
>>>>> On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou <congh@google.com> wrote:
>>>>> > Thank you for your detailed explanation.
>>>>> >
>>>>> > Once GCC detects a reduction operation, it will automatically
>>>>> > accumulate all elements in the vector after the loop. In the loop the
>>>>> > reduction variable is always a vector whose elements are reductions of
>>>>> > corresponding values from other vectors. Therefore in your case the
>>>>> > only instruction you need to generate is:
>>>>> >
>>>>> >     VABAL   ops[3], ops[1], ops[2]
>>>>> >
>>>>> > It is OK if you accumulate the elements into one in the vector inside
>>>>> > of the loop (if one instruction can do this), but you have to make
>>>>> > sure other elements in the vector should remain zero so that the final
>>>>> > result is correct.
>>>>> >
>>>>> > If you are confused about the documentation, check the one for
>>>>> > udot_prod (just above usad in md.texi), as it has very similar
>>>>> > behavior as usad. Actually I copied the text from there and did some
>>>>> > changes. As those two instruction patterns are both for vectorization,
>>>>> > their behavior should not be difficult to explain.
>>>>> >
>>>>> > If you have more questions or think that the documentation is still
>>>>> > improper please let me know.
>>>>
>>>> Hi Cong,
>>>>
>>>> Thanks for your reply.
>>>>
>>>> I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and
>>>> DOT_PROD_EXPR and I see that the same ambiguity exists for
>>>> DOT_PROD_EXPR. Can you please add a note in your tree.def
>>>> that SAD_EXPR, like DOT_PROD_EXPR can be expanded as either:
>>>>
>>>>   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
>>>>   tmp2 = ABS_EXPR (tmp)
>>>>   arg3 = PLUS_EXPR (tmp2, arg3)
>>>>
>>>> or:
>>>>
>>>>   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
>>>>   tmp2 = ABS_EXPR (tmp)
>>>>   arg3 = WIDEN_SUM_EXPR (tmp2, arg3)
>>>>
>>>> Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a
>>>> a value of the same (widened) type as arg3.
>>>>
>>>
>>>
>>> I have added it, although we currently don't have WIDEN_MINUS_EXPR (I
>>> mentioned it in tree.def).
>>>
>>>
>>>> Also, while looking for the history of DOT_PROD_EXPR I spotted this
>>>> patch:
>>>>
>>>>   [autovect] [patch] detect mult-hi and sad patterns
>>>>   http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html
>>>>
>>>> I wonder what the reason was for that patch to be dropped?
>>>>
>>>
>>> It has been 8 years... I have no idea why that patch was never
>>> accepted; there was not even a reply in that thread. But I believe
>>> recognizing the SAD pattern is very important. ARM also provides
>>> instructions for it.
>>>
>>>
>>> Thank you for your comment again!
>>>
>>>
>>> thanks,
>>> Cong
>>>
>>>
>>>
>>>> Thanks,
>>>> James
>>>>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
  2013-11-11 21:22                   ` Cong Hou
  2013-11-14  7:50                     ` Cong Hou
@ 2013-12-03  1:07                     ` Cong Hou
  2013-12-17 18:04                       ` Cong Hou
  2014-06-24 11:19                       ` Richard Biener
  1 sibling, 2 replies; 27+ messages in thread
From: Cong Hou @ 2013-12-03  1:07 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2953 bytes --]

Hi Richard

Could you please take a look at this patch and see if it is ready for
the trunk? The patch is attached as a text file here again.

Thank you very much!


Cong
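
The reduction-epilogue behavior described in the quoted discussion below
(per-lane partial sums that are only combined into a scalar after the
loop) can be sketched in portable C; lane count 4 is arbitrary, and this
is an illustration, not generated code:

```c
#include <stdlib.h>

/* Sketch of how a vectorized SAD reduction is laid out: partial sums are
   kept per lane inside the loop (standing in for a vector accumulator)
   and are reduced to a scalar only in the loop epilogue, with a scalar
   tail for the remainder.  */
int sad_lanes (const unsigned char *x, const unsigned char *y, int n)
{
  int lane[4] = { 0, 0, 0, 0 };
  int i;
  for (i = 0; i + 4 <= n; i += 4)
    for (int l = 0; l < 4; l++)
      lane[l] += abs (x[i + l] - y[i + l]);
  /* Epilogue: reduce the lanes to a scalar.  */
  int sum = lane[0] + lane[1] + lane[2] + lane[3];
  for (; i < n; i++)
    sum += abs (x[i] - y[i]);
  return sum;
}
```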


On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou <congh@google.com> wrote:
> Hi James
>
> Sorry for the late reply.
>
>
> On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh
> <james.greenhalgh@arm.com> wrote:
>>> On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou <congh@google.com> wrote:
>>> > Thank you for your detailed explanation.
>>> >
>>> > Once GCC detects a reduction operation, it will automatically
>>> > accumulate all elements in the vector after the loop. In the loop the
>>> > reduction variable is always a vector whose elements are reductions of
>>> > corresponding values from other vectors. Therefore in your case the
>>> > only instruction you need to generate is:
>>> >
>>> >     VABAL   ops[3], ops[1], ops[2]
>>> >
>>> > It is OK if you accumulate the elements into one in the vector inside
>>> > of the loop (if one instruction can do this), but you have to make
>>> > sure other elements in the vector should remain zero so that the final
>>> > result is correct.
>>> >
>>> > If you are confused about the documentation, check the one for
>>> > udot_prod (just above usad in md.texi), as it has very similar
>>> > behavior as usad. Actually I copied the text from there and did some
>>> > changes. As those two instruction patterns are both for vectorization,
>>> > their behavior should not be difficult to explain.
>>> >
>>> > If you have more questions or think that the documentation is still
>>> > improper please let me know.
>>
>> Hi Cong,
>>
>> Thanks for your reply.
>>
>> I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and
>> DOT_PROD_EXPR and I see that the same ambiguity exists for
>> DOT_PROD_EXPR. Can you please add a note in your tree.def
>> that SAD_EXPR, like DOT_PROD_EXPR can be expanded as either:
>>
>>   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
>>   tmp2 = ABS_EXPR (tmp)
>>   arg3 = PLUS_EXPR (tmp2, arg3)
>>
>> or:
>>
>>   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
>>   tmp2 = ABS_EXPR (tmp)
>>   arg3 = WIDEN_SUM_EXPR (tmp2, arg3)
>>
>> Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a
>> a value of the same (widened) type as arg3.
>>
>
>
> I have added it, although we currently don't have WIDEN_MINUS_EXPR (I
> mentioned it in tree.def).
>
>
>> Also, while looking for the history of DOT_PROD_EXPR I spotted this
>> patch:
>>
>>   [autovect] [patch] detect mult-hi and sad patterns
>>   http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html
>>
>> I wonder what the reason was for that patch to be dropped?
>>
>
> It has been 8 years.. I have no idea why this patch is not accepted
> finally. There is even no reply in that thread. But I believe the SAD
> pattern is very important to be recognized. ARM also provides
> instructions for it.
>
>
> Thank you for your comment again!
>
>
> thanks,
> Cong
>
>
>
>> Thanks,
>> James
>>
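
For reference alongside the usadv16qi expander in the attached patch: the
PSADBW instruction it emits can be exercised directly through the standard
SSE2 intrinsics. A sketch, assuming an SSE2-capable target (not part of
the patch):

```c
#include <emmintrin.h>
#include <stdint.h>

/* SAD over 16 unsigned bytes with PSADBW (_mm_sad_epu8).  The instruction
   produces two 64-bit partial sums, one per 8-byte half of the vector;
   combining them yields the scalar SAD.  */
int sad16 (const uint8_t *x, const uint8_t *y)
{
  __m128i vx = _mm_loadu_si128 ((const __m128i *) x);
  __m128i vy = _mm_loadu_si128 ((const __m128i *) y);
  __m128i s  = _mm_sad_epu8 (vx, vy);
  /* Low-half sum sits in bits 0..15, high-half sum in bits 64..79.  */
  return _mm_cvtsi128_si32 (s)
         + _mm_cvtsi128_si32 (_mm_srli_si128 (s, 8));
}
```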

[-- Attachment #2: patch-sad.txt --]
[-- Type: text/plain, Size: 24649 bytes --]

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 6bdaa31..37ff6c4 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,4 +1,27 @@
+2013-10-29  Cong Hou  <congh@google.com>
+
+	* tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD
+	pattern recognition.
+	(type_conversion_p): PROMOTION is true if it's a type promotion
+	conversion, and false otherwise.  Return true if the given expression
+	is a type conversion one.
+	* tree-vectorizer.h: Adjust the number of patterns.
+	* tree.def: Add SAD_EXPR.
+	* optabs.def: Add sad_optab.
+	* cfgexpand.c (expand_debug_expr): Add SAD_EXPR case.
+	* expr.c (expand_expr_real_2): Likewise.
+	* gimple-pretty-print.c (dump_ternary_rhs): Likewise.
+	* gimple.c (get_gimple_rhs_num_ops): Likewise.
+	* optabs.c (optab_for_tree_code): Likewise.
+	* tree-cfg.c (verify_gimple_assign_ternary): Likewise.
+	* tree-inline.c (estimate_operator_cost): Likewise.
+	* tree-ssa-operands.c (get_expr_operands): Likewise.
+	* tree-vect-loop.c (get_initial_def_for_reduction): Likewise.
+	* config/i386/sse.md: Add SSE2 and AVX2 expand for SAD.
+	* doc/generic.texi: Add documentation for SAD_EXPR.
+	* doc/md.texi: Add documentation for ssad and usad.
+
 2013-11-01  Trevor Saunders  <tsaunders@mozilla.com>
 
 	* function.c (reorder_blocks): Convert block_stack to a stack_vec.
 	* gimplify.c (gimplify_compound_lval): Likewise.
diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
index fb05ce7..1f824fb 100644
--- a/gcc/cfgexpand.c
+++ b/gcc/cfgexpand.c
@@ -2740,6 +2740,7 @@ expand_debug_expr (tree exp)
 	{
 	case COND_EXPR:
 	case DOT_PROD_EXPR:
+	case SAD_EXPR:
 	case WIDEN_MULT_PLUS_EXPR:
 	case WIDEN_MULT_MINUS_EXPR:
 	case FMA_EXPR:
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 9094a1c..af73817 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -7278,6 +7278,36 @@
   DONE;
 })
 
+(define_expand "usadv16qi"
+  [(match_operand:V4SI 0 "register_operand")
+   (match_operand:V16QI 1 "register_operand")
+   (match_operand:V16QI 2 "nonimmediate_operand")
+   (match_operand:V4SI 3 "nonimmediate_operand")]
+  "TARGET_SSE2"
+{
+  rtx t1 = gen_reg_rtx (V2DImode);
+  rtx t2 = gen_reg_rtx (V4SImode);
+  emit_insn (gen_sse2_psadbw (t1, operands[1], operands[2]));
+  convert_move (t2, t1, 0);
+  emit_insn (gen_addv4si3 (operands[0], t2, operands[3]));
+  DONE;
+})
+
+(define_expand "usadv32qi"
+  [(match_operand:V8SI 0 "register_operand")
+   (match_operand:V32QI 1 "register_operand")
+   (match_operand:V32QI 2 "nonimmediate_operand")
+   (match_operand:V8SI 3 "nonimmediate_operand")]
+  "TARGET_AVX2"
+{
+  rtx t1 = gen_reg_rtx (V4DImode);
+  rtx t2 = gen_reg_rtx (V8SImode);
+  emit_insn (gen_avx2_psadbw (t1, operands[1], operands[2]));
+  convert_move (t2, t1, 0);
+  emit_insn (gen_addv8si3 (operands[0], t2, operands[3]));
+  DONE;
+})
+
 (define_insn "ashr<mode>3"
   [(set (match_operand:VI24_AVX2 0 "register_operand" "=x,x")
 	(ashiftrt:VI24_AVX2
diff --git a/gcc/doc/generic.texi b/gcc/doc/generic.texi
index 73dd123..fa9a19a 100644
--- a/gcc/doc/generic.texi
+++ b/gcc/doc/generic.texi
@@ -1713,6 +1713,7 @@ a value from @code{enum annot_expr_kind}.
 @tindex VEC_PACK_TRUNC_EXPR
 @tindex VEC_PACK_SAT_EXPR
 @tindex VEC_PACK_FIX_TRUNC_EXPR
+@tindex SAD_EXPR
 
 @table @code
 @item VEC_LSHIFT_EXPR
@@ -1793,6 +1794,15 @@ value, it is taken from the second operand. It should never evaluate to
 any other value currently, but optimizations should not rely on that
 property. In contrast with a @code{COND_EXPR}, all operands are always
 evaluated.
+
+@item SAD_EXPR
+This node represents the Sum of Absolute Differences operation.  The three
+operands must be vectors of integral types.  The first and second operands
+must have the same type.  The size of the vector elements of the third
+operand must be at least twice the size of the vector elements of the
+first and second operands.  The sum of absolute differences of the first
+and second operands is computed, added to the third operand, and returned.
+
 @end table
 
 
diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index ac10a0a..8142a7f 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -4786,6 +4786,16 @@ wider mode, is computed and added to operand 3. Operand 3 is of a mode equal or
 wider than the mode of the product. The result is placed in operand 0, which
 is of the same mode as operand 3.
 
+@cindex @code{ssad@var{m}} instruction pattern
+@item @samp{ssad@var{m}}
+@cindex @code{usad@var{m}} instruction pattern
+@item @samp{usad@var{m}}
+Compute the sum of absolute differences of two signed/unsigned elements.
+Operand 1 and operand 2 are of the same mode.  Their absolute difference,
+which is of a wider mode, is computed and added to operand 3.  Operand 3
+is of a mode equal to or wider than the mode of the absolute difference.
+The result is placed in operand 0, which is of the same mode as operand 3.
+
 @cindex @code{ssum_widen@var{m3}} instruction pattern
 @item @samp{ssum_widen@var{m3}}
 @cindex @code{usum_widen@var{m3}} instruction pattern
diff --git a/gcc/expr.c b/gcc/expr.c
index 551a660..ce45a79 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -8973,6 +8973,20 @@ expand_expr_real_2 (sepops ops, rtx target, enum machine_mode tmode,
 	return target;
       }
 
+      case SAD_EXPR:
+      {
+	tree oprnd0 = treeop0;
+	tree oprnd1 = treeop1;
+	tree oprnd2 = treeop2;
+	rtx op2;
+
+	expand_operands (oprnd0, oprnd1, NULL_RTX, &op0, &op1, EXPAND_NORMAL);
+	op2 = expand_normal (oprnd2);
+	target = expand_widen_pattern_expr (ops, op0, op1, op2,
+					    target, unsignedp);
+	return target;
+      }
+
     case REALIGN_LOAD_EXPR:
       {
         tree oprnd0 = treeop0;
diff --git a/gcc/gimple-pretty-print.c b/gcc/gimple-pretty-print.c
index 6842213..56a90f1 100644
--- a/gcc/gimple-pretty-print.c
+++ b/gcc/gimple-pretty-print.c
@@ -429,6 +429,16 @@ dump_ternary_rhs (pretty_printer *buffer, gimple gs, int spc, int flags)
       dump_generic_node (buffer, gimple_assign_rhs3 (gs), spc, flags, false);
       pp_greater (buffer);
       break;
+
+    case SAD_EXPR:
+      pp_string (buffer, "SAD_EXPR <");
+      dump_generic_node (buffer, gimple_assign_rhs1 (gs), spc, flags, false);
+      pp_string (buffer, ", ");
+      dump_generic_node (buffer, gimple_assign_rhs2 (gs), spc, flags, false);
+      pp_string (buffer, ", ");
+      dump_generic_node (buffer, gimple_assign_rhs3 (gs), spc, flags, false);
+      pp_greater (buffer);
+      break;
     
     case VEC_PERM_EXPR:
       pp_string (buffer, "VEC_PERM_EXPR <");
diff --git a/gcc/gimple.c b/gcc/gimple.c
index 20f6010..599d1f6 100644
--- a/gcc/gimple.c
+++ b/gcc/gimple.c
@@ -2582,6 +2582,7 @@ get_gimple_rhs_num_ops (enum tree_code code)
       || (SYM) == WIDEN_MULT_PLUS_EXPR					    \
       || (SYM) == WIDEN_MULT_MINUS_EXPR					    \
       || (SYM) == DOT_PROD_EXPR						    \
+      || (SYM) == SAD_EXPR						    \
       || (SYM) == REALIGN_LOAD_EXPR					    \
       || (SYM) == VEC_COND_EXPR						    \
       || (SYM) == VEC_PERM_EXPR                                             \
diff --git a/gcc/optabs.c b/gcc/optabs.c
index 3755670..564aefe 100644
--- a/gcc/optabs.c
+++ b/gcc/optabs.c
@@ -462,6 +462,9 @@ optab_for_tree_code (enum tree_code code, const_tree type,
     case DOT_PROD_EXPR:
       return TYPE_UNSIGNED (type) ? udot_prod_optab : sdot_prod_optab;
 
+    case SAD_EXPR:
+      return TYPE_UNSIGNED (type) ? usad_optab : ssad_optab;
+
     case WIDEN_MULT_PLUS_EXPR:
       return (TYPE_UNSIGNED (type)
 	      ? (TYPE_SATURATING (type)
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 6b924ac..377763e 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -248,6 +248,8 @@ OPTAB_D (sdot_prod_optab, "sdot_prod$I$a")
 OPTAB_D (ssum_widen_optab, "widen_ssum$I$a3")
 OPTAB_D (udot_prod_optab, "udot_prod$I$a")
 OPTAB_D (usum_widen_optab, "widen_usum$I$a3")
+OPTAB_D (usad_optab, "usad$I$a")
+OPTAB_D (ssad_optab, "ssad$I$a")
 OPTAB_D (vec_extract_optab, "vec_extract$a")
 OPTAB_D (vec_init_optab, "vec_init$a")
 OPTAB_D (vec_pack_sfix_trunc_optab, "vec_pack_sfix_trunc_$a")
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index e9bd852..9528325 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@
+2013-10-29  Cong Hou  <congh@google.com>
+
+	* gcc.dg/vect/vect-reduc-sad.c: New test.
+	* lib/target-supports.exp (check_effective_target_vect_usad_char): New.
+
 2013-11-01  Marc Glisse  <marc.glisse@inria.fr>
 
 	PR c++/58834
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c
new file mode 100644
index 0000000..15a625f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c
@@ -0,0 +1,54 @@
+/* { dg-require-effective-target vect_usad_char } */
+
+#include <stdarg.h>
+#include "tree-vect.h"
+
+#define N 64
+#define SAD (N*N/2)
+
+unsigned char X[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+unsigned char Y[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+
+/* Sum of absolute differences between arrays of unsigned char types.
+   Detected as a sad pattern.
+   Vectorized on targets that support sad for unsigned chars.  */
+
+__attribute__ ((noinline)) int
+foo (int len)
+{
+  int i;
+  int result = 0;
+
+  for (i = 0; i < len; i++)
+    result += abs (X[i] - Y[i]);
+
+  return result;
+}
+
+
+int
+main (void)
+{
+  int i;
+  int sad;
+
+  check_vect ();
+
+  for (i = 0; i < N; i++)
+    {
+      X[i] = i;
+      Y[i] = N - i;
+      __asm__ volatile ("");
+    }
+
+  sad = foo (N);
+  if (sad != SAD)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vect_recog_sad_pattern: detected" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index 5ca0b76..3ab71ea 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -3685,6 +3685,26 @@ proc check_effective_target_vect_udot_hi { } {
     return $et_vect_udot_hi_saved
 }
 
+# Return 1 if the target plus current options supports a vector
+# sad operation of unsigned chars, 0 otherwise.
+#
+# This won't change for different subtargets so cache the result.
+
+proc check_effective_target_vect_usad_char { } {
+    global et_vect_usad_char
+
+    if [info exists et_vect_usad_char_saved] {
+        verbose "check_effective_target_vect_usad_char: using cached result" 2
+    } else {
+        set et_vect_usad_char_saved 0
+        if { ([istarget i?86-*-*]
+             || [istarget x86_64-*-*]) } {
+            set et_vect_usad_char_saved 1
+        }
+    }
+    verbose "check_effective_target_vect_usad_char: returning $et_vect_usad_char_saved" 2
+    return $et_vect_usad_char_saved
+}
 
 # Return 1 if the target plus current options supports a vector
 # demotion (packing) of shorts (to chars) and ints (to shorts) 
diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c
index d646693..e63fce0 100644
--- a/gcc/tree-cfg.c
+++ b/gcc/tree-cfg.c
@@ -3830,6 +3830,36 @@ verify_gimple_assign_ternary (gimple stmt)
 
       return false;
 
+    case SAD_EXPR:
+      if (!useless_type_conversion_p (rhs1_type, rhs2_type)
+	  || !useless_type_conversion_p (lhs_type, rhs3_type)
+	  || 2 * GET_MODE_BITSIZE (GET_MODE_INNER
+				     (TYPE_MODE (TREE_TYPE (rhs1_type))))
+	       > GET_MODE_BITSIZE (GET_MODE_INNER
+				     (TYPE_MODE (TREE_TYPE (lhs_type)))))
+	{
+	  error ("type mismatch in sad expression");
+	  debug_generic_expr (lhs_type);
+	  debug_generic_expr (rhs1_type);
+	  debug_generic_expr (rhs2_type);
+	  debug_generic_expr (rhs3_type);
+	  return true;
+	}
+
+      if (TREE_CODE (rhs1_type) != VECTOR_TYPE
+	  || TREE_CODE (rhs2_type) != VECTOR_TYPE
+	  || TREE_CODE (rhs3_type) != VECTOR_TYPE)
+	{
+	  error ("vector types expected in sad expression");
+	  debug_generic_expr (lhs_type);
+	  debug_generic_expr (rhs1_type);
+	  debug_generic_expr (rhs2_type);
+	  debug_generic_expr (rhs3_type);
+	  return true;
+	}
+
+      return false;
+
     case DOT_PROD_EXPR:
     case REALIGN_LOAD_EXPR:
       /* FIXME.  */
diff --git a/gcc/tree-inline.c b/gcc/tree-inline.c
index 77013b3..a9a23ce 100644
--- a/gcc/tree-inline.c
+++ b/gcc/tree-inline.c
@@ -3605,6 +3605,7 @@ estimate_operator_cost (enum tree_code code, eni_weights *weights,
     case WIDEN_SUM_EXPR:
     case WIDEN_MULT_EXPR:
     case DOT_PROD_EXPR:
+    case SAD_EXPR:
     case WIDEN_MULT_PLUS_EXPR:
     case WIDEN_MULT_MINUS_EXPR:
     case WIDEN_LSHIFT_EXPR:
diff --git a/gcc/tree-ssa-operands.c b/gcc/tree-ssa-operands.c
index 4e05d2d..b44aca8 100644
--- a/gcc/tree-ssa-operands.c
+++ b/gcc/tree-ssa-operands.c
@@ -859,6 +859,7 @@ get_expr_operands (gimple stmt, tree *expr_p, int flags)
       }
 
     case DOT_PROD_EXPR:
+    case SAD_EXPR:
     case REALIGN_LOAD_EXPR:
     case WIDEN_MULT_PLUS_EXPR:
     case WIDEN_MULT_MINUS_EXPR:
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index d5f86ad..7700f0c 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -3620,6 +3620,7 @@ get_initial_def_for_reduction (gimple stmt, tree init_val,
     {
       case WIDEN_SUM_EXPR:
       case DOT_PROD_EXPR:
+      case SAD_EXPR:
       case PLUS_EXPR:
       case MINUS_EXPR:
       case BIT_IOR_EXPR:
diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
index 0998804..4f873b2 100644
--- a/gcc/tree-vect-patterns.c
+++ b/gcc/tree-vect-patterns.c
@@ -49,6 +49,8 @@ static gimple vect_recog_widen_mult_pattern (vec<gimple> *, tree *,
 					     tree *);
 static gimple vect_recog_dot_prod_pattern (vec<gimple> *, tree *,
 					   tree *);
+static gimple vect_recog_sad_pattern (vec<gimple> *, tree *,
+				      tree *);
 static gimple vect_recog_pow_pattern (vec<gimple> *, tree *, tree *);
 static gimple vect_recog_over_widening_pattern (vec<gimple> *, tree *,
                                                  tree *);
@@ -66,6 +68,7 @@ static vect_recog_func_ptr vect_vect_recog_func_ptrs[NUM_PATTERNS] = {
 	vect_recog_widen_mult_pattern,
 	vect_recog_widen_sum_pattern,
 	vect_recog_dot_prod_pattern,
+        vect_recog_sad_pattern,
 	vect_recog_pow_pattern,
 	vect_recog_widen_shift_pattern,
 	vect_recog_over_widening_pattern,
@@ -144,9 +147,8 @@ vect_single_imm_use (gimple def_stmt)
 }
 
 /* Check whether NAME, an ssa-name used in USE_STMT,
-   is a result of a type promotion or demotion, such that:
+   is a result of a type promotion, such that:
      DEF_STMT: NAME = NOP (name0)
-   where the type of name0 (ORIG_TYPE) is smaller/bigger than the type of NAME.
    If CHECK_SIGN is TRUE, check that either both types are signed or both are
    unsigned.  */
 
@@ -193,10 +195,8 @@ type_conversion_p (tree name, gimple use_stmt, bool check_sign,
 
   if (TYPE_PRECISION (type) >= (TYPE_PRECISION (*orig_type) * 2))
     *promotion = true;
-  else if (TYPE_PRECISION (*orig_type) >= (TYPE_PRECISION (type) * 2))
-    *promotion = false;
   else
-    return false;
+    *promotion = false;
 
   if (!vect_is_simple_use (oprnd0, *def_stmt, loop_vinfo,
 			   bb_vinfo, &dummy_gimple, &dummy, &dt))
@@ -437,6 +437,240 @@ vect_recog_dot_prod_pattern (vec<gimple> *stmts, tree *type_in,
 }
 
 
+/* Function vect_recog_sad_pattern
+
+   Try to find the following Sum of Absolute Difference (SAD) pattern:
+
+     type x_t, y_t;
+     signed TYPE1 diff, abs_diff;
+     TYPE2 sum = init;
+   loop:
+     sum_0 = phi <init, sum_1>
+     S1  x_t = ...
+     S2  y_t = ...
+     S3  x_T = (TYPE1) x_t;
+     S4  y_T = (TYPE1) y_t;
+     S5  diff = x_T - y_T;
+     S6  abs_diff = ABS_EXPR <diff>;
+     [S7  abs_diff = (TYPE2) abs_diff;  #optional]
+     S8  sum_1 = abs_diff + sum_0;
+
+   where 'TYPE1' is at least double the size of type 'type', and 'TYPE2' is
+   the same size as 'TYPE1' or bigger.  This is a special case of a reduction
+   computation.
+
+   Input:
+
+   * STMTS: Contains a stmt from which the pattern search begins.  In the
+   example, when this function is called with S8, the pattern
+   {S3,S4,S5,S6,S7,S8} will be detected.
+
+   Output:
+
+   * TYPE_IN: The type of the input arguments to the pattern.
+
+   * TYPE_OUT: The type of the output of this pattern.
+
+   * Return value: A new stmt that will be used to replace the sequence of
+   stmts that constitute the pattern. In this case it will be:
+        SAD_EXPR <x_t, y_t, sum_0>
+  */
+
+static gimple
+vect_recog_sad_pattern (vec<gimple> *stmts, tree *type_in,
+			     tree *type_out)
+{
+  gimple last_stmt = (*stmts)[0];
+  tree sad_oprnd0, sad_oprnd1;
+  stmt_vec_info stmt_vinfo = vinfo_for_stmt (last_stmt);
+  tree half_type;
+  loop_vec_info loop_info = STMT_VINFO_LOOP_VINFO (stmt_vinfo);
+  struct loop *loop;
+  bool promotion;
+
+  if (!loop_info)
+    return NULL;
+
+  loop = LOOP_VINFO_LOOP (loop_info);
+
+  if (!is_gimple_assign (last_stmt))
+    return NULL;
+
+  tree sum_type = gimple_expr_type (last_stmt);
+
+  /* Look for the following pattern
+          DX = (TYPE1) X;
+          DY = (TYPE1) Y;
+          DDIFF = DX - DY;
+          DAD = ABS_EXPR <DDIFF>;
+          [DAD = (TYPE2) DAD;  #optional]
+          sum_1 = DAD + sum_0;
+     In which
+     - DX is at least double the size of X
+     - DY is at least double the size of Y
+     - DX, DY, DDIFF, DAD all have the same type
+     - sum is the same size as DAD or bigger
+     - sum has been recognized as a reduction variable.
+
+     This is equivalent to:
+       DDIFF = X w- Y;          #widen sub
+       DAD = ABS_EXPR <DDIFF>;
+       sum_1 = DAD w+ sum_0;    #widen summation
+     or
+       DDIFF = X w- Y;          #widen sub
+       DAD = ABS_EXPR <DDIFF>;
+       sum_1 = DAD + sum_0;     #summation
+   */
+
+  /* Starting from LAST_STMT, follow the defs of its uses in search
+     of the above pattern.  */
+
+  if (gimple_assign_rhs_code (last_stmt) != PLUS_EXPR)
+    return NULL;
+
+  tree plus_oprnd0, plus_oprnd1;
+
+  if (STMT_VINFO_IN_PATTERN_P (stmt_vinfo))
+    {
+      /* Has been detected as widening-summation?  */
+
+      gimple stmt = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+      sum_type = gimple_expr_type (stmt);
+      if (gimple_assign_rhs_code (stmt) != WIDEN_SUM_EXPR)
+        return NULL;
+      plus_oprnd0 = gimple_assign_rhs1 (stmt);
+      plus_oprnd1 = gimple_assign_rhs2 (stmt);
+      half_type = TREE_TYPE (plus_oprnd0);
+    }
+  else
+    {
+      gimple def_stmt;
+
+      if (STMT_VINFO_DEF_TYPE (stmt_vinfo) != vect_reduction_def)
+        return NULL;
+      plus_oprnd0 = gimple_assign_rhs1 (last_stmt);
+      plus_oprnd1 = gimple_assign_rhs2 (last_stmt);
+      if (!types_compatible_p (TREE_TYPE (plus_oprnd0), sum_type)
+	  || !types_compatible_p (TREE_TYPE (plus_oprnd1), sum_type))
+        return NULL;
+
+      /* The type conversion could be promotion, demotion,
+         or just signed -> unsigned.  */
+      if (type_conversion_p (plus_oprnd0, last_stmt, false,
+                             &half_type, &def_stmt, &promotion))
+        plus_oprnd0 = gimple_assign_rhs1 (def_stmt);
+      else
+        half_type = sum_type;
+    }
+
+  /* So far so good.  Since last_stmt was detected as a (summation) reduction,
+     we know that plus_oprnd1 is the reduction variable (defined by a loop-header
+     phi), and plus_oprnd0 is an ssa-name defined by a stmt in the loop body.
+     Then check that plus_oprnd0 is defined by an abs_expr  */
+
+  if (TREE_CODE (plus_oprnd0) != SSA_NAME)
+    return NULL;
+
+  tree abs_type = half_type;
+  gimple abs_stmt = SSA_NAME_DEF_STMT (plus_oprnd0);
+
+  /* It cannot be the sad pattern if the abs_stmt is outside the loop.  */
+  if (!gimple_bb (abs_stmt) || !flow_bb_inside_loop_p (loop, gimple_bb (abs_stmt)))
+    return NULL;
+
+  /* FORNOW.  Can continue analyzing the def-use chain when this stmt is in
+     a phi inside the loop (in case we are analyzing an outer-loop).  */
+  if (!is_gimple_assign (abs_stmt))
+    return NULL;
+
+  stmt_vec_info abs_stmt_vinfo = vinfo_for_stmt (abs_stmt);
+  gcc_assert (abs_stmt_vinfo);
+  if (STMT_VINFO_DEF_TYPE (abs_stmt_vinfo) != vect_internal_def)
+    return NULL;
+  if (gimple_assign_rhs_code (abs_stmt) != ABS_EXPR)
+    return NULL;
+
+  tree abs_oprnd = gimple_assign_rhs1 (abs_stmt);
+  if (!types_compatible_p (TREE_TYPE (abs_oprnd), abs_type))
+    return NULL;
+  if (TYPE_UNSIGNED (abs_type))
+    return NULL;
+
+  /* We then detect if the operand of abs_expr is defined by a minus_expr.  */
+
+  if (TREE_CODE (abs_oprnd) != SSA_NAME)
+    return NULL;
+
+  gimple diff_stmt = SSA_NAME_DEF_STMT (abs_oprnd);
+
+  /* It cannot be the SAD pattern if diff_stmt is outside the loop.  */
+  if (!gimple_bb (diff_stmt)
+      || !flow_bb_inside_loop_p (loop, gimple_bb (diff_stmt)))
+    return NULL;
+
+  /* FORNOW.  We cannot continue analyzing the def-use chain when this
+     stmt is a phi inside the loop (if we are analyzing an outer-loop).  */
+  if (!is_gimple_assign (diff_stmt))
+    return NULL;
+
+  stmt_vec_info diff_stmt_vinfo = vinfo_for_stmt (diff_stmt);
+  gcc_assert (diff_stmt_vinfo);
+  if (STMT_VINFO_DEF_TYPE (diff_stmt_vinfo) != vect_internal_def)
+    return NULL;
+  if (gimple_assign_rhs_code (diff_stmt) != MINUS_EXPR)
+    return NULL;
+
+  tree half_type0, half_type1;
+  gimple def_stmt;
+
+  tree minus_oprnd0 = gimple_assign_rhs1 (diff_stmt);
+  tree minus_oprnd1 = gimple_assign_rhs2 (diff_stmt);
+
+  if (!types_compatible_p (TREE_TYPE (minus_oprnd0), abs_type)
+      || !types_compatible_p (TREE_TYPE (minus_oprnd1), abs_type))
+    return NULL;
+  if (!type_conversion_p (minus_oprnd0, diff_stmt, false,
+                          &half_type0, &def_stmt, &promotion)
+      || !promotion)
+    return NULL;
+  sad_oprnd0 = gimple_assign_rhs1 (def_stmt);
+
+  if (!type_conversion_p (minus_oprnd1, diff_stmt, false,
+                          &half_type1, &def_stmt, &promotion)
+      || !promotion)
+    return NULL;
+  sad_oprnd1 = gimple_assign_rhs1 (def_stmt);
+
+  if (!types_compatible_p (half_type0, half_type1))
+    return NULL;
+  if (TYPE_PRECISION (abs_type) < TYPE_PRECISION (half_type0) * 2
+      || TYPE_PRECISION (sum_type) < TYPE_PRECISION (half_type0) * 2)
+    return NULL;
+
+  *type_in = TREE_TYPE (sad_oprnd0);
+  *type_out = sum_type;
+
+  /* Pattern detected. Create a stmt to be used to replace the pattern: */
+  tree var = vect_recog_temp_ssa_var (sum_type, NULL);
+  gimple pattern_stmt = gimple_build_assign_with_ops
+                          (SAD_EXPR, var, sad_oprnd0, sad_oprnd1, plus_oprnd1);
+
+  if (dump_enabled_p ())
+    {
+      dump_printf_loc (MSG_NOTE, vect_location,
+                       "vect_recog_sad_pattern: detected: ");
+      dump_gimple_stmt (MSG_NOTE, TDF_SLIM, pattern_stmt, 0);
+      dump_printf (MSG_NOTE, "\n");
+    }
+
+  /* We don't allow changing the order of the computation in the inner-loop
+     when doing outer-loop vectorization.  */
+  gcc_assert (!nested_in_vect_loop_p (loop, last_stmt));
+
+  return pattern_stmt;
+}
+
+
 /* Handle widening operation by a constant.  At the moment we support MULT_EXPR
    and LSHIFT_EXPR.
 
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index a2f482d..d3a8137 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1046,7 +1046,7 @@ extern void vect_slp_transform_bb (basic_block);
    Additional pattern recognition functions can (and will) be added
    in the future.  */
 typedef gimple (* vect_recog_func_ptr) (vec<gimple> *, tree *, tree *);
-#define NUM_PATTERNS 11
+#define NUM_PATTERNS 12
 void vect_pattern_recog (loop_vec_info, bb_vec_info);
 
 /* In tree-vectorizer.c.  */
diff --git a/gcc/tree.def b/gcc/tree.def
index 399b5af..e9a147a 100644
--- a/gcc/tree.def
+++ b/gcc/tree.def
@@ -1160,6 +1160,22 @@ DEFTREECODE (DOT_PROD_EXPR, "dot_prod_expr", tcc_expression, 3)
    with the second argument.  */
 DEFTREECODE (WIDEN_SUM_EXPR, "widen_sum_expr", tcc_binary, 2)
 
+/* Widening SAD (sum of absolute differences).
+   The first two arguments are of integer type t1.
+   The third argument and the result are of type t2, which must be at
+   least twice the size of t1.  Like DOT_PROD_EXPR, SAD_EXPR (arg1, arg2,
+   arg3) is equivalent to (we do not have WIDEN_MINUS_EXPR yet, but its
+   assumed behavior is similar to that of WIDEN_SUM_EXPR):
+       tmp = WIDEN_MINUS_EXPR (arg1, arg2)
+       tmp2 = ABS_EXPR (tmp)
+       arg3 = PLUS_EXPR (tmp2, arg3)
+  or:
+       tmp = WIDEN_MINUS_EXPR (arg1, arg2)
+       tmp2 = ABS_EXPR (tmp)
+       arg3 = WIDEN_SUM_EXPR (tmp2, arg3)
+ */
+DEFTREECODE (SAD_EXPR, "sad_expr", tcc_expression, 3)
+
 /* Widening multiplication.
    The two arguments are of type t1.
    The result is of type t2, such that t2 is at least twice

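To make the SAD_EXPR semantics documented in tree.def above concrete, here is a scalar C model of SAD_EXPR <arg1, arg2, arg3> (an illustrative sketch, not part of the patch; `unsigned char` and `int` stand in for t1 and t2):

```c
#include <stdlib.h>

/* Scalar model of SAD_EXPR <arg1, arg2, arg3>: widen the two narrow
   operands, subtract, take the absolute value, and add the result to
   the wide accumulator arg3.  */
static int
sad_expr_model (unsigned char arg1, unsigned char arg2, int arg3)
{
  int tmp = (int) arg1 - (int) arg2;   /* WIDEN_MINUS_EXPR  */
  int tmp2 = abs (tmp);                /* ABS_EXPR  */
  return tmp2 + arg3;                  /* PLUS_EXPR  */
}
```

On SSE2 a single psadbw computes and sums several such absolute differences at once, which is why the expander only needs psadbw followed by a vector add of the accumulator.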
^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
  2013-12-03  1:07                     ` Cong Hou
@ 2013-12-17 18:04                       ` Cong Hou
  2014-06-23 23:44                         ` Cong Hou
  2014-06-24 11:19                       ` Richard Biener
  1 sibling, 1 reply; 27+ messages in thread
From: Cong Hou @ 2013-12-17 18:04 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

Ping?


thanks,
Cong


On Mon, Dec 2, 2013 at 5:06 PM, Cong Hou <congh@google.com> wrote:
> Hi Richard
>
> Could you please take a look at this patch and see if it is ready for
> the trunk? The patch is pasted as a text file here again.
>
> Thank you very much!
>
>
> Cong
>
>
> On Mon, Nov 11, 2013 at 11:25 AM, Cong Hou <congh@google.com> wrote:
>> Hi James
>>
>> Sorry for the late reply.
>>
>>
>> On Fri, Nov 8, 2013 at 2:55 AM, James Greenhalgh
>> <james.greenhalgh@arm.com> wrote:
>>>> On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou <congh@google.com> wrote:
>>>> > Thank you for your detailed explanation.
>>>> >
>>>> > Once GCC detects a reduction operation, it will automatically
>>>> > accumulate all elements in the vector after the loop. In the loop the
>>>> > reduction variable is always a vector whose elements are reductions of
>>>> > corresponding values from other vectors. Therefore in your case the
>>>> > only instruction you need to generate is:
>>>> >
>>>> >     VABAL   ops[3], ops[1], ops[2]
>>>> >
>>>> > It is OK if you accumulate the elements into one in the vector inside
>>>> > of the loop (if one instruction can do this), but you have to make
>>>> > sure other elements in the vector should remain zero so that the final
>>>> > result is correct.
>>>> >
>>>> > If you are confused about the documentation, check the one for
>>>> > udot_prod (just above usad in md.texi), as it has very similar
>>>> > behavior as usad. Actually I copied the text from there and did some
>>>> > changes. As those two instruction patterns are both for vectorization,
>>>> > their behavior should not be difficult to explain.
>>>> >
>>>> > If you have more questions or think that the documentation is still
>>>> > improper please let me know.
>>>
>>> Hi Cong,
>>>
>>> Thanks for your reply.
>>>
>>> I've looked at Dorit's original patch adding WIDEN_SUM_EXPR and
>>> DOT_PROD_EXPR and I see that the same ambiguity exists for
>>> DOT_PROD_EXPR. Can you please add a note in your tree.def
>>> that SAD_EXPR, like DOT_PROD_EXPR can be expanded as either:
>>>
>>>   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
>>>   tmp2 = ABS_EXPR (tmp)
>>>   arg3 = PLUS_EXPR (tmp2, arg3)
>>>
>>> or:
>>>
>>>   tmp = WIDEN_MINUS_EXPR (arg1, arg2)
>>>   tmp2 = ABS_EXPR (tmp)
>>>   arg3 = WIDEN_SUM_EXPR (tmp2, arg3)
>>>
>>> Where WIDEN_MINUS_EXPR is a signed MINUS_EXPR, returning a
>>> a value of the same (widened) type as arg3.
>>>
>>
>>
>> I have added it, although we currently don't have WIDEN_MINUS_EXPR (I
>> mentioned it in tree.def).
>>
>>
>>> Also, while looking for the history of DOT_PROD_EXPR I spotted this
>>> patch:
>>>
>>>   [autovect] [patch] detect mult-hi and sad patterns
>>>   http://gcc.gnu.org/ml/gcc-patches/2005-10/msg01394.html
>>>
>>> I wonder what the reason was for that patch to be dropped?
>>>
>>
>> It has been 8 years...  I have no idea why that patch was never
>> accepted; there is not even a reply in that thread. But I believe the
>> SAD pattern is very important to recognize, and ARM also provides
>> instructions for it.
>>
>>
>> Thank you for your comment again!
>>
>>
>> thanks,
>> Cong
>>
>>
>>
>>> Thanks,
>>> James
>>>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
  2013-12-17 18:04                       ` Cong Hou
@ 2014-06-23 23:44                         ` Cong Hou
  2014-06-24  7:36                           ` Richard Biener
  0 siblings, 1 reply; 27+ messages in thread
From: Cong Hou @ 2014-06-23 23:44 UTC (permalink / raw)
  To: gcc-patches; +Cc: Richard Biener, Jakub Jelinek

It has been 8 months since this patch was posted, and I have addressed
all of the review comments.

The SAD pattern is very common in multimedia workloads such as ffmpeg,
and this patch can greatly improve the performance of such code. Could
you please have another look and check whether it is OK for trunk? If
necessary, I can re-post the patch in a new thread.
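For reference, a typical scalar SAD kernel of the kind this recognizer targets looks like the following (a generic sketch; the actual kernels in ffmpeg differ in details):

```c
#include <stdlib.h>

/* Sum of absolute differences over two byte arrays.  With this patch,
   the abs/minus/plus reduction below can be vectorized using the new
   SAD_EXPR operation (psadbw on SSE2 targets).  */
int
sad (const unsigned char *a, const unsigned char *b, int n)
{
  int sum = 0;
  for (int i = 0; i < n; i++)
    sum += abs (a[i] - b[i]);
  return sum;
}
```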

Thank you!


Cong


On Tue, Dec 17, 2013 at 10:04 AM, Cong Hou <congh@google.com> wrote:
>
> Ping?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
  2014-06-23 23:44                         ` Cong Hou
@ 2014-06-24  7:36                           ` Richard Biener
  0 siblings, 0 replies; 27+ messages in thread
From: Richard Biener @ 2014-06-24  7:36 UTC (permalink / raw)
  To: Cong Hou; +Cc: gcc-patches, Jakub Jelinek

On Mon, 23 Jun 2014, Cong Hou wrote:

> It has been 8 months since this patch was posted, and I have addressed
> all of the review comments.
> 
> The SAD pattern is very common in multimedia workloads such as ffmpeg,
> and this patch can greatly improve the performance of such code. Could
> you please have another look and check whether it is OK for trunk? If
> necessary, I can re-post the patch in a new thread.

I will try to get to this one this week but can't easily find the
latest patch, so - can you re-post it in a new thread?

Thanks,
Richard.

> Thank you!
> 
> 
> Cong
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE / SUSE Labs
SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
GF: Jeff Hawn, Jennifer Guild, Felix Imend"orffer

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
  2013-12-03  1:07                     ` Cong Hou
  2013-12-17 18:04                       ` Cong Hou
@ 2014-06-24 11:19                       ` Richard Biener
  2014-06-25  2:04                         ` Cong Hou
  1 sibling, 1 reply; 27+ messages in thread
From: Richard Biener @ 2014-06-24 11:19 UTC (permalink / raw)
  To: Cong Hou; +Cc: Richard Biener, gcc-patches

On Tue, Dec 3, 2013 at 2:06 AM, Cong Hou <congh@google.com> wrote:
> Hi Richard
>
> Could you please take a look at this patch and see if it is ready for
> the trunk? The patch is pasted as a text file here again.

(found it)

The patch is ok for trunk.  (please consider re-testing before you commit)

Thanks,
Richard.

> Thank you very much!
>
>
> Cong
>
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
  2014-06-24 11:19                       ` Richard Biener
@ 2014-06-25  2:04                         ` Cong Hou
  0 siblings, 0 replies; 27+ messages in thread
From: Cong Hou @ 2014-06-25  2:04 UTC (permalink / raw)
  To: Richard Biener; +Cc: Richard Biener, gcc-patches

OK. Thank you very much for your review, Richard!

thanks,
Cong


On Tue, Jun 24, 2014 at 4:19 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Tue, Dec 3, 2013 at 2:06 AM, Cong Hou <congh@google.com> wrote:
>> Hi Richard
>>
>> Could you please take a look at this patch and see if it is ready for
>> the trunk? The patch is pasted as a text file here again.
>
> (found it)
>
> The patch is ok for trunk.  (please consider re-testing before you commit)
>
> Thanks,
> Richard.
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
  2013-10-31  1:10   ` Cong Hou
@ 2013-10-31  3:18     ` Ramana Radhakrishnan
  0 siblings, 0 replies; 27+ messages in thread
From: Ramana Radhakrishnan @ 2013-10-31  3:18 UTC (permalink / raw)
  To: Cong Hou; +Cc: GCC Patches, Richard Biener

On Thu, Oct 31, 2013 at 12:29 AM, Cong Hou <congh@google.com> wrote:
> On Tue, Oct 29, 2013 at 4:49 PM, Ramana Radhakrishnan
> <ramana.gcc@googlemail.com> wrote:
>> Cong,
>>
>> Please don't do the following.
>>
>>>+++ b/gcc/testsuite/gcc.dg/vect/
>> vect-reduc-sad.c
>> @@ -0,0 +1,54 @@
>> +/* { dg-require-effective-target sse2 { target { i?86-*-* x86_64-*-* } } } */
>>
>> you are adding a test to gcc.dg/vect - It's a common directory
>> containing tests that need to run on multiple architectures and such
>> tests should be keyed by the feature they enable which can be turned
>> on for ports that have such an instruction.
>>
>> The correct way of doing this is to key this on the feature something
>> like dg-require-effective-target vect_sad_char . And define the
>> equivalent routine in testsuite/lib/target-supports.exp and enable it
>> for sse2 for the x86 port. If in doubt look at
>> check_effective_target_vect_int and a whole family of such functions
>> in testsuite/lib/target-supports.exp
>>
>> This makes life easy for other port maintainers who want to turn on
>> this support. And for bonus points please update the testcase writing
>> wiki page with this information if it isn't already there.
>>
>
> OK, I will likely move the test case to gcc.target/i386 as currently
> only SSE2 provides SAD instruction. But your suggestion also helps!

Sorry, no - I really don't like that approach. If the test remains in
the common directory, keyed off as I suggested, it makes life easier
when turning this support on in other ports: adding the pattern to a
port takes the test from UNSUPPORTED to XPASS, and it keeps
gcc.dg/vect reasonably up to date with respect to testing the features
of the vectorizer and in touch with the way the tests in gcc.dg/vect
have been written to date.

I think Neon has an equivalent instruction called vaba but I will have
to check in the morning when I get back to my machine.


regards
Ramana
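A sketch of what such keying might look like (the effective-target name vect_usad_char and the dump string below are illustrative; the corresponding proc would have to be defined in testsuite/lib/target-supports.exp and enabled for SSE2 on x86):

```c
/* { dg-require-effective-target vect_usad_char } */

#define N 64

unsigned char a[N], b[N];

/* A SAD reduction loop the vectorizer should recognize.  */
int
sad (void)
{
  int i, sum = 0;
  for (i = 0; i < N; i++)
    sum += __builtin_abs (a[i] - b[i]);
  return sum;
}

/* { dg-final { scan-tree-dump "vect_recog_sad_pattern: detected" "vect" } } */
```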


>
>>>      S6  abs_diff = ABS_EXPR <diff>;
>>>      [S7  abs_diff = (TYPE2) abs_diff;  #optional]
>>>      S8  sum_1 = abs_diff + sum_0;
>>>
>>>    where 'TYPE1' is at least double the size of type 'type', and 'TYPE2' is the
>>>    same size of 'TYPE1' or bigger. This is a special case of a reduction
>>>    computation.
>>>
>>> For SSE2, type is char, and TYPE1 and TYPE2 are int.
>>>
>>>
>>> In order to express this new operation, a new expression SAD_EXPR is
>>> introduced in tree.def, and the corresponding entry in optabs is
>>> added. The patch also added the "define_expand" for SSE2 and AVX2
>>> platforms for i386.
>>>
>>> The patch is pasted below and also attached as a text file (in which
>>> you can see tabs). Bootstrap and make check got passed on x86. Please
>>> give me your comments.
>>>
>>>
>>>
>>> thanks,
>>> Cong
>>>
>>>
>>>
>>> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
>>> index 8a38316..d528307 100644
>>> --- a/gcc/ChangeLog
>>> +++ b/gcc/ChangeLog
>>> @@ -1,3 +1,23 @@
>>> +2013-10-29  Cong Hou  <congh@google.com>
>>> +
>>> + * tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD
>>> + pattern recognition.
>>> + (type_conversion_p): PROMOTION is true if it's a type promotion
>>> + conversion, and false otherwise.  Return true if the given expression
>>> + is a type conversion one.
>>> + * tree-vectorizer.h: Adjust the number of patterns.
>>> + * tree.def: Add SAD_EXPR.
>>> + * optabs.def: Add sad_optab.
>>> + * cfgexpand.c (expand_debug_expr): Add SAD_EXPR case.
>>> + * expr.c (expand_expr_real_2): Likewise.
>>> + * gimple-pretty-print.c (dump_ternary_rhs): Likewise.
>>> + * gimple.c (get_gimple_rhs_num_ops): Likewise.
>>> + * optabs.c (optab_for_tree_code): Likewise.
>>> + * tree-cfg.c (estimate_operator_cost): Likewise.
>>> + * tree-ssa-operands.c (get_expr_operands): Likewise.
>>> + * tree-vect-loop.c (get_initial_def_for_reduction): Likewise.
>>> + * config/i386/sse.md: Add SSE2 and AVX2 expand for SAD.
>>> +
>>>  2013-10-14  David Malcolm  <dmalcolm@redhat.com>
>>>
>>>   * dumpfile.h (gcc::dump_manager): New class, to hold state
>>> diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
>>> index 7ed29f5..9ec761a 100644
>>> --- a/gcc/cfgexpand.c
>>> +++ b/gcc/cfgexpand.c
>>> @@ -2730,6 +2730,7 @@ expand_debug_expr (tree exp)
>>>   {
>>>   case COND_EXPR:
>>>   case DOT_PROD_EXPR:
>>> + case SAD_EXPR:
>>>   case WIDEN_MULT_PLUS_EXPR:
>>>   case WIDEN_MULT_MINUS_EXPR:
>>>   case FMA_EXPR:
>>> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
>>> index c3f6c94..ca1ab70 100644
>>> --- a/gcc/config/i386/sse.md
>>> +++ b/gcc/config/i386/sse.md
>>> @@ -6052,6 +6052,40 @@
>>>    DONE;
>>>  })
>>>
>>> +(define_expand "sadv16qi"
>>> +  [(match_operand:V4SI 0 "register_operand")
>>> +   (match_operand:V16QI 1 "register_operand")
>>> +   (match_operand:V16QI 2 "register_operand")
>>> +   (match_operand:V4SI 3 "register_operand")]
>>> +  "TARGET_SSE2"
>>> +{
>>> +  rtx t1 = gen_reg_rtx (V2DImode);
>>> +  rtx t2 = gen_reg_rtx (V4SImode);
>>> +  emit_insn (gen_sse2_psadbw (t1, operands[1], operands[2]));
>>> +  convert_move (t2, t1, 0);
>>> +  emit_insn (gen_rtx_SET (VOIDmode, operands[0],
>>> +  gen_rtx_PLUS (V4SImode,
>>> + operands[3], t2)));
>>> +  DONE;
>>> +})
>>> +
>>> +(define_expand "sadv32qi"
>>> +  [(match_operand:V8SI 0 "register_operand")
>>> +   (match_operand:V32QI 1 "register_operand")
>>> +   (match_operand:V32QI 2 "register_operand")
>>> +   (match_operand:V8SI 3 "register_operand")]
>>> +  "TARGET_AVX2"
>>> +{
>>> +  rtx t1 = gen_reg_rtx (V4DImode);
>>> +  rtx t2 = gen_reg_rtx (V8SImode);
>>> +  emit_insn (gen_avx2_psadbw (t1, operands[1], operands[2]));
>>> +  convert_move (t2, t1, 0);
>>> +  emit_insn (gen_rtx_SET (VOIDmode, operands[0],
>>> +  gen_rtx_PLUS (V8SImode,
>>> + operands[3], t2)));
>>> +  DONE;
>>> +})
>>> +
>>>  (define_insn "ashr<mode>3"
>>>    [(set (match_operand:VI24_AVX2 0 "register_operand" "=x,x")
>>>   (ashiftrt:VI24_AVX2
>>> diff --git a/gcc/expr.c b/gcc/expr.c
>>> index 4975a64..1db8a49 100644
>>> --- a/gcc/expr.c
>>> +++ b/gcc/expr.c
>>> @@ -9026,6 +9026,20 @@ expand_expr_real_2 (sepops ops, rtx target,
>>> enum machine_mode tmode,
>>>   return target;
>>>        }
>>>
>>> +      case SAD_EXPR:
>>> +      {
>>> + tree oprnd0 = treeop0;
>>> + tree oprnd1 = treeop1;
>>> + tree oprnd2 = treeop2;
>>> + rtx op2;
>>> +
>>> + expand_operands (oprnd0, oprnd1, NULL_RTX, &op0, &op1, EXPAND_NORMAL);
>>> + op2 = expand_normal (oprnd2);
>>> + target = expand_widen_pattern_expr (ops, op0, op1, op2,
>>> +    target, unsignedp);
>>> + return target;
>>> +      }
>>> +
>>>      case REALIGN_LOAD_EXPR:
>>>        {
>>>          tree oprnd0 = treeop0;
>>> diff --git a/gcc/gimple-pretty-print.c b/gcc/gimple-pretty-print.c
>>> index f0f8166..514ddd1 100644
>>> --- a/gcc/gimple-pretty-print.c
>>> +++ b/gcc/gimple-pretty-print.c
>>> @@ -425,6 +425,16 @@ dump_ternary_rhs (pretty_printer *buffer, gimple
>>> gs, int spc, int flags)
>>>        dump_generic_node (buffer, gimple_assign_rhs3 (gs), spc, flags, false);
>>>        pp_greater (buffer);
>>>        break;
>>> +
>>> +    case SAD_EXPR:
>>> +      pp_string (buffer, "SAD_EXPR <");
>>> +      dump_generic_node (buffer, gimple_assign_rhs1 (gs), spc, flags, false);
>>> +      pp_string (buffer, ", ");
>>> +      dump_generic_node (buffer, gimple_assign_rhs2 (gs), spc, flags, false);
>>> +      pp_string (buffer, ", ");
>>> +      dump_generic_node (buffer, gimple_assign_rhs3 (gs), spc, flags, false);
>>> +      pp_greater (buffer);
>>> +      break;
>>>
>>>      case VEC_PERM_EXPR:
>>>        pp_string (buffer, "VEC_PERM_EXPR <");
>>> diff --git a/gcc/gimple.c b/gcc/gimple.c
>>> index a12dd67..4975959 100644
>>> --- a/gcc/gimple.c
>>> +++ b/gcc/gimple.c
>>> @@ -2562,6 +2562,7 @@ get_gimple_rhs_num_ops (enum tree_code code)
>>>        || (SYM) == WIDEN_MULT_PLUS_EXPR    \
>>>        || (SYM) == WIDEN_MULT_MINUS_EXPR    \
>>>        || (SYM) == DOT_PROD_EXPR    \
>>> +      || (SYM) == SAD_EXPR    \
>>>        || (SYM) == REALIGN_LOAD_EXPR    \
>>>        || (SYM) == VEC_COND_EXPR    \
>>>        || (SYM) == VEC_PERM_EXPR                                             \
>>> diff --git a/gcc/optabs.c b/gcc/optabs.c
>>> index 06a626c..4ddd4d9 100644
>>> --- a/gcc/optabs.c
>>> +++ b/gcc/optabs.c
>>> @@ -462,6 +462,9 @@ optab_for_tree_code (enum tree_code code, const_tree type,
>>>      case DOT_PROD_EXPR:
>>>        return TYPE_UNSIGNED (type) ? udot_prod_optab : sdot_prod_optab;
>>>
>>> +    case SAD_EXPR:
>>> +      return sad_optab;
>>> +
>>>      case WIDEN_MULT_PLUS_EXPR:
>>>        return (TYPE_UNSIGNED (type)
>>>        ? (TYPE_SATURATING (type)
>>> diff --git a/gcc/optabs.def b/gcc/optabs.def
>>> index 6b924ac..e35d567 100644
>>> --- a/gcc/optabs.def
>>> +++ b/gcc/optabs.def
>>> @@ -248,6 +248,7 @@ OPTAB_D (sdot_prod_optab, "sdot_prod$I$a")
>>>  OPTAB_D (ssum_widen_optab, "widen_ssum$I$a3")
>>>  OPTAB_D (udot_prod_optab, "udot_prod$I$a")
>>>  OPTAB_D (usum_widen_optab, "widen_usum$I$a3")
>>> +OPTAB_D (sad_optab, "sad$I$a")
>>>  OPTAB_D (vec_extract_optab, "vec_extract$a")
>>>  OPTAB_D (vec_init_optab, "vec_init$a")
>>>  OPTAB_D (vec_pack_sfix_trunc_optab, "vec_pack_sfix_trunc_$a")
>>> diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
>>> index 075d071..226b8d5 100644
>>> --- a/gcc/testsuite/ChangeLog
>>> +++ b/gcc/testsuite/ChangeLog
>>> @@ -1,3 +1,7 @@
>>> +2013-10-29  Cong Hou  <congh@google.com>
>>> +
>>> + * gcc.dg/vect/vect-reduc-sad.c: New.
>>> +
>>>  2013-10-14  Tobias Burnus  <burnus@net-b.de>
>>>
>>>   PR fortran/58658
>>> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c
>>> b/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c
>>> new file mode 100644
>>> index 0000000..14ebb3b
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c
>>> @@ -0,0 +1,54 @@
>>> +/* { dg-require-effective-target sse2 { target { i?86-*-* x86_64-*-* } } } */
>>> +
>>> +#include <stdarg.h>
>>> +#include "tree-vect.h"
>>> +
>>> +#define N 64
>>> +#define SAD (N * N / 2)
>>> +
>>> +unsigned char X[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
>>> +unsigned char Y[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
>>> +
>>> +/* Sum of absolute differences between arrays of unsigned char types.
>>> +   Detected as a sad pattern.
>>> +   Vectorized on targets that support sad for unsigned chars.  */
>>> +
>>> +__attribute__ ((noinline)) int
>>> +foo (int len)
>>> +{
>>> +  int i;
>>> +  int result = 0;
>>> +
>>> +  for (i = 0; i < len; i++)
>>> +    result += abs (X[i] - Y[i]);
>>> +
>>> +  return result;
>>> +}
>>> +
>>> +
>>> +int
>>> +main (void)
>>> +{
>>> +  int i;
>>> +  int sad;
>>> +
>>> +  check_vect ();
>>> +
>>> +  for (i = 0; i < N; i++)
>>> +    {
>>> +      X[i] = i;
>>> +      Y[i] = N - i;
>>> +      __asm__ volatile ("");
>>> +    }
>>> +
>>> +  sad = foo (N);
>>> +  if (sad != SAD)
>>> +    abort ();
>>> +
>>> +  return 0;
>>> +}
>>> +
>>> +/* { dg-final { scan-tree-dump-times "vect_recog_sad_pattern: detected" 1 "vect" } } */
>>> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
>>> +/* { dg-final { cleanup-tree-dump "vect" } } */
>>> +
>>> diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c
>>> index 8b66791..d689cac 100644
>>> --- a/gcc/tree-cfg.c
>>> +++ b/gcc/tree-cfg.c
>>> @@ -3797,6 +3797,7 @@ verify_gimple_assign_ternary (gimple stmt)
>>>        return false;
>>>
>>>      case DOT_PROD_EXPR:
>>> +    case SAD_EXPR:
>>>      case REALIGN_LOAD_EXPR:
>>>        /* FIXME.  */
>>>        return false;
>>> diff --git a/gcc/tree-inline.c b/gcc/tree-inline.c
>>> index 2221b9c..44261a3 100644
>>> --- a/gcc/tree-inline.c
>>> +++ b/gcc/tree-inline.c
>>> @@ -3601,6 +3601,7 @@ estimate_operator_cost (enum tree_code code,
>>> eni_weights *weights,
>>>      case WIDEN_SUM_EXPR:
>>>      case WIDEN_MULT_EXPR:
>>>      case DOT_PROD_EXPR:
>>> +    case SAD_EXPR:
>>>      case WIDEN_MULT_PLUS_EXPR:
>>>      case WIDEN_MULT_MINUS_EXPR:
>>>      case WIDEN_LSHIFT_EXPR:
>>> diff --git a/gcc/tree-ssa-operands.c b/gcc/tree-ssa-operands.c
>>> index 603f797..393efc3 100644
>>> --- a/gcc/tree-ssa-operands.c
>>> +++ b/gcc/tree-ssa-operands.c
>>> @@ -854,6 +854,7 @@ get_expr_operands (gimple stmt, tree *expr_p, int flags)
>>>        }
>>>
>>>      case DOT_PROD_EXPR:
>>> +    case SAD_EXPR:
>>>      case REALIGN_LOAD_EXPR:
>>>      case WIDEN_MULT_PLUS_EXPR:
>>>      case WIDEN_MULT_MINUS_EXPR:
>>> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
>>> index 638b981..89aa8c7 100644
>>> --- a/gcc/tree-vect-loop.c
>>> +++ b/gcc/tree-vect-loop.c
>>> @@ -3607,6 +3607,7 @@ get_initial_def_for_reduction (gimple stmt, tree init_val,
>>>      {
>>>        case WIDEN_SUM_EXPR:
>>>        case DOT_PROD_EXPR:
>>> +      case SAD_EXPR:
>>>        case PLUS_EXPR:
>>>        case MINUS_EXPR:
>>>        case BIT_IOR_EXPR:
>>> diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
>>> index 0a4e812..7919449 100644
>>> --- a/gcc/tree-vect-patterns.c
>>> +++ b/gcc/tree-vect-patterns.c
>>> @@ -45,6 +45,8 @@ static gimple vect_recog_widen_mult_pattern
>>> (vec<gimple> *, tree *,
>>>       tree *);
>>>  static gimple vect_recog_dot_prod_pattern (vec<gimple> *, tree *,
>>>     tree *);
>>> +static gimple vect_recog_sad_pattern (vec<gimple> *, tree *,
>>> +      tree *);
>>>  static gimple vect_recog_pow_pattern (vec<gimple> *, tree *, tree *);
>>>  static gimple vect_recog_over_widening_pattern (vec<gimple> *, tree *,
>>>                                                   tree *);
>>> @@ -62,6 +64,7 @@ static vect_recog_func_ptr
>>> vect_vect_recog_func_ptrs[NUM_PATTERNS] = {
>>>   vect_recog_widen_mult_pattern,
>>>   vect_recog_widen_sum_pattern,
>>>   vect_recog_dot_prod_pattern,
>>> +        vect_recog_sad_pattern,
>>>   vect_recog_pow_pattern,
>>>   vect_recog_widen_shift_pattern,
>>>   vect_recog_over_widening_pattern,
>>> @@ -140,9 +143,8 @@ vect_single_imm_use (gimple def_stmt)
>>>  }
>>>
>>>  /* Check whether NAME, an ssa-name used in USE_STMT,
>>> -   is a result of a type promotion or demotion, such that:
>>> +   is a result of a type promotion, such that:
>>>       DEF_STMT: NAME = NOP (name0)
>>> -   where the type of name0 (ORIG_TYPE) is smaller/bigger than the type of NAME.
>>>     If CHECK_SIGN is TRUE, check that either both types are signed or both are
>>>     unsigned.  */
>>>
>>> @@ -189,10 +191,8 @@ type_conversion_p (tree name, gimple use_stmt,
>>> bool check_sign,
>>>
>>>    if (TYPE_PRECISION (type) >= (TYPE_PRECISION (*orig_type) * 2))
>>>      *promotion = true;
>>> -  else if (TYPE_PRECISION (*orig_type) >= (TYPE_PRECISION (type) * 2))
>>> -    *promotion = false;
>>>    else
>>> -    return false;
>>> +    *promotion = false;
>>>
>>>    if (!vect_is_simple_use (oprnd0, *def_stmt, loop_vinfo,
>>>     bb_vinfo, &dummy_gimple, &dummy, &dt))
>>> @@ -433,6 +433,242 @@ vect_recog_dot_prod_pattern (vec<gimple> *stmts,
>>> tree *type_in,
>>>  }
>>>
>>>
>>> +/* Function vect_recog_sad_pattern
>>> +
>>> +   Try to find the following Sum of Absolute Difference (SAD) pattern:
>>> +
>>> +     unsigned type x_t, y_t;
>>> +     signed TYPE1 diff, abs_diff;
>>> +     TYPE2 sum = init;
>>> +   loop:
>>> +     sum_0 = phi <init, sum_1>
>>> +     S1  x_t = ...
>>> +     S2  y_t = ...
>>> +     S3  x_T = (TYPE1) x_t;
>>> +     S4  y_T = (TYPE1) y_t;
>>> +     S5  diff = x_T - y_T;
>>> +     S6  abs_diff = ABS_EXPR <diff>;
>>> +     [S7  abs_diff = (TYPE2) abs_diff;  #optional]
>>> +     S8  sum_1 = abs_diff + sum_0;
>>> +
>>> +   where 'TYPE1' is at least double the size of type 'type', and 'TYPE2' is
>>> +   the same size as 'TYPE1' or bigger.  This is a special case of a
>>> +   reduction computation.
>>> +
>>> +   Input:
>>> +
>>> +   * STMTS: Contains a stmt from which the pattern search begins.  In the
>>> +   example, when this function is called with S8, the pattern
>>> +   {S3,S4,S5,S6,S7,S8} will be detected.
>>> +
>>> +   Output:
>>> +
>>> +   * TYPE_IN: The type of the input arguments to the pattern.
>>> +
>>> +   * TYPE_OUT: The type of the output of this pattern.
>>> +
>>> +   * Return value: A new stmt that will be used to replace the sequence of
>>> +   stmts that constitute the pattern. In this case it will be:
>>> +        SAD_EXPR <x_t, y_t, sum_0>
>>> +  */
>>> +
>>> +static gimple
>>> +vect_recog_sad_pattern (vec<gimple> *stmts, tree *type_in,
>>> +     tree *type_out)
>>> +{
>>> +  gimple last_stmt = (*stmts)[0];
>>> +  tree sad_oprnd0, sad_oprnd1;
>>> +  stmt_vec_info stmt_vinfo = vinfo_for_stmt (last_stmt);
>>> +  tree half_type;
>>> +  loop_vec_info loop_info = STMT_VINFO_LOOP_VINFO (stmt_vinfo);
>>> +  struct loop *loop;
>>> +  bool promotion;
>>> +
>>> +  if (!loop_info)
>>> +    return NULL;
>>> +
>>> +  loop = LOOP_VINFO_LOOP (loop_info);
>>> +
>>> +  if (!is_gimple_assign (last_stmt))
>>> +    return NULL;
>>> +
>>> +  tree sum_type = gimple_expr_type (last_stmt);
>>> +
>>> +  /* Look for the following pattern
>>> +          DX = (TYPE1) X;
>>> +          DY = (TYPE1) Y;
>>> +          DDIFF = DX - DY;
>>> +          DAD = ABS_EXPR <DDIFF>;
>>> +          [DAD = (TYPE2) DAD;  # optional]
>>> +          sum_1 = DAD + sum_0;
>>> +     In which
>>> +     - DX is at least double the size of X
>>> +     - DY is at least double the size of Y
>>> +     - DX, DY, DDIFF, DAD all have the same type
>>> +     - sum is the same size as DAD or bigger
>>> +     - sum has been recognized as a reduction variable.
>>> +
>>> +     This is equivalent to:
>>> +       DDIFF = X w- Y;          #widen sub
>>> +       DAD = ABS_EXPR <DDIFF>;
>>> +       sum_1 = DAD w+ sum_0;    #widen summation
>>> +     or
>>> +       DDIFF = X w- Y;          #widen sub
>>> +       DAD = ABS_EXPR <DDIFF>;
>>> +       sum_1 = DAD + sum_0;     #summation
>>> +   */
>>> +
>>> +  /* Starting from LAST_STMT, follow the defs of its uses in search
>>> +     of the above pattern.  */
>>> +
>>> +  if (gimple_assign_rhs_code (last_stmt) != PLUS_EXPR)
>>> +    return NULL;
>>> +
>>> +  tree plus_oprnd0, plus_oprnd1;
>>> +
>>> +  if (STMT_VINFO_IN_PATTERN_P (stmt_vinfo))
>>> +    {
>>> +      /* Has been detected as widening-summation?  */
>>> +
>>> +      gimple stmt = STMT_VINFO_RELATED_STMT (stmt_vinfo);
>>> +      sum_type = gimple_expr_type (stmt);
>>> +      if (gimple_assign_rhs_code (stmt) != WIDEN_SUM_EXPR)
>>> +        return NULL;
>>> +      plus_oprnd0 = gimple_assign_rhs1 (stmt);
>>> +      plus_oprnd1 = gimple_assign_rhs2 (stmt);
>>> +      half_type = TREE_TYPE (plus_oprnd0);
>>> +    }
>>> +  else
>>> +    {
>>> +      gimple def_stmt;
>>> +
>>> +      if (STMT_VINFO_DEF_TYPE (stmt_vinfo) != vect_reduction_def)
>>> +        return NULL;
>>> +      plus_oprnd0 = gimple_assign_rhs1 (last_stmt);
>>> +      plus_oprnd1 = gimple_assign_rhs2 (last_stmt);
>>> +      if (!types_compatible_p (TREE_TYPE (plus_oprnd0), sum_type)
>>> +  || !types_compatible_p (TREE_TYPE (plus_oprnd1), sum_type))
>>> +        return NULL;
>>> +
>>> +      /* The type conversion could be promotion, demotion,
>>> +         or just signed -> unsigned.  */
>>> +      if (type_conversion_p (plus_oprnd0, last_stmt, false,
>>> +                             &half_type, &def_stmt, &promotion))
>>> +        plus_oprnd0 = gimple_assign_rhs1 (def_stmt);
>>> +      else
>>> +        half_type = sum_type;
>>> +    }
>>> +
>>> +  /* So far so good.  Since last_stmt was detected as a (summation) reduction,
>>> +     we know that plus_oprnd1 is the reduction variable (defined by a
>>> loop-header
>>> +     phi), and plus_oprnd0 is an ssa-name defined by a stmt in the loop body.
>>> +     Then check that plus_oprnd0 is defined by an abs_expr  */
>>> +
>>> +  if (TREE_CODE (plus_oprnd0) != SSA_NAME)
>>> +    return NULL;
>>> +
>>> +  tree abs_type = half_type;
>>> +  gimple abs_stmt = SSA_NAME_DEF_STMT (plus_oprnd0);
>>> +
>>> +  /* It cannot be the SAD pattern if abs_stmt is outside the loop.  */
>>> +  if (!gimple_bb (abs_stmt) || !flow_bb_inside_loop_p (loop,
>>> gimple_bb (abs_stmt)))
>>> +    return NULL;
>>> +
>>> +  /* FORNOW.  Can continue analyzing the def-use chain when this stmt is in
>>> +     a phi inside the loop (in case we are analyzing an outer-loop).  */
>>> +  if (!is_gimple_assign (abs_stmt))
>>> +    return NULL;
>>> +
>>> +  stmt_vec_info abs_stmt_vinfo = vinfo_for_stmt (abs_stmt);
>>> +  gcc_assert (abs_stmt_vinfo);
>>> +  if (STMT_VINFO_DEF_TYPE (abs_stmt_vinfo) != vect_internal_def)
>>> +    return NULL;
>>> +  if (gimple_assign_rhs_code (abs_stmt) != ABS_EXPR)
>>> +    return NULL;
>>> +
>>> +  tree abs_oprnd = gimple_assign_rhs1 (abs_stmt);
>>> +  if (!types_compatible_p (TREE_TYPE (abs_oprnd), abs_type))
>>> +    return NULL;
>>> +  if (TYPE_UNSIGNED (abs_type))
>>> +    return NULL;
>>> +
>>> +  /* We then detect if the operand of abs_expr is defined by a minus_expr.  */
>>> +
>>> +  if (TREE_CODE (abs_oprnd) != SSA_NAME)
>>> +    return NULL;
>>> +
>>> +  gimple diff_stmt = SSA_NAME_DEF_STMT (abs_oprnd);
>>> +
>>> +  /* It cannot be the SAD pattern if diff_stmt is outside the loop.  */
>>> +  if (!gimple_bb (diff_stmt)
>>> +      || !flow_bb_inside_loop_p (loop, gimple_bb (diff_stmt)))
>>> +    return NULL;
>>> +
>>> +  /* FORNOW.  Can continue analyzing the def-use chain when this stmt is in
>>> +     a phi inside the loop (in case we are analyzing an outer-loop).  */
>>> +  if (!is_gimple_assign (diff_stmt))
>>> +    return NULL;
>>> +
>>> +  stmt_vec_info diff_stmt_vinfo = vinfo_for_stmt (diff_stmt);
>>> +  gcc_assert (diff_stmt_vinfo);
>>> +  if (STMT_VINFO_DEF_TYPE (diff_stmt_vinfo) != vect_internal_def)
>>> +    return NULL;
>>> +  if (gimple_assign_rhs_code (diff_stmt) != MINUS_EXPR)
>>> +    return NULL;
>>> +
>>> +  tree half_type0, half_type1;
>>> +  gimple def_stmt;
>>> +
>>> +  tree minus_oprnd0 = gimple_assign_rhs1 (diff_stmt);
>>> +  tree minus_oprnd1 = gimple_assign_rhs2 (diff_stmt);
>>> +
>>> +  if (!types_compatible_p (TREE_TYPE (minus_oprnd0), abs_type)
>>> +      || !types_compatible_p (TREE_TYPE (minus_oprnd1), abs_type))
>>> +    return NULL;
>>> +  if (!type_conversion_p (minus_oprnd0, diff_stmt, false,
>>> +                          &half_type0, &def_stmt, &promotion)
>>> +      || !promotion)
>>> +    return NULL;
>>> +  sad_oprnd0 = gimple_assign_rhs1 (def_stmt);
>>> +
>>> +  if (!type_conversion_p (minus_oprnd1, diff_stmt, false,
>>> +                          &half_type1, &def_stmt, &promotion)
>>> +      || !promotion)
>>> +    return NULL;
>>> +  sad_oprnd1 = gimple_assign_rhs1 (def_stmt);
>>> +
>>> +  if (!types_compatible_p (half_type0, half_type1))
>>> +    return NULL;
>>> +  if (!TYPE_UNSIGNED (half_type0))
>>> +    return NULL;
>>> +  if (TYPE_PRECISION (abs_type) < TYPE_PRECISION (half_type0) * 2
>>> +      || TYPE_PRECISION (sum_type) < TYPE_PRECISION (half_type0) * 2)
>>> +    return NULL;
>>> +
>>> +  *type_in = TREE_TYPE (sad_oprnd0);
>>> +  *type_out = sum_type;
>>> +
>>> +  /* Pattern detected. Create a stmt to be used to replace the pattern: */
>>> +  tree var = vect_recog_temp_ssa_var (sum_type, NULL);
>>> +  gimple pattern_stmt = gimple_build_assign_with_ops
>>> +                          (SAD_EXPR, var, sad_oprnd0, sad_oprnd1, plus_oprnd1);
>>> +
>>> +  if (dump_enabled_p ())
>>> +    {
>>> +      dump_printf_loc (MSG_NOTE, vect_location,
>>> +                       "vect_recog_sad_pattern: detected: ");
>>> +      dump_gimple_stmt (MSG_NOTE, TDF_SLIM, pattern_stmt, 0);
>>> +      dump_printf (MSG_NOTE, "\n");
>>> +    }
>>> +
>>> +  /* We don't allow changing the order of the computation in the inner-loop
>>> +     when doing outer-loop vectorization.  */
>>> +  gcc_assert (!nested_in_vect_loop_p (loop, last_stmt));
>>> +
>>> +  return pattern_stmt;
>>> +}
>>> +
>>> +
>>>  /* Handle widening operation by a constant.  At the moment we support MULT_EXPR
>>>     and LSHIFT_EXPR.
>>>
>>> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
>>> index 8b7b345..0aac75b 100644
>>> --- a/gcc/tree-vectorizer.h
>>> +++ b/gcc/tree-vectorizer.h
>>> @@ -1044,7 +1044,7 @@ extern void vect_slp_transform_bb (basic_block);
>>>     Additional pattern recognition functions can (and will) be added
>>>     in the future.  */
>>>  typedef gimple (* vect_recog_func_ptr) (vec<gimple> *, tree *, tree *);
>>> -#define NUM_PATTERNS 11
>>> +#define NUM_PATTERNS 12
>>>  void vect_pattern_recog (loop_vec_info, bb_vec_info);
>>>
>>>  /* In tree-vectorizer.c.  */
>>> diff --git a/gcc/tree.def b/gcc/tree.def
>>> index 88c850a..31a3b64 100644
>>> --- a/gcc/tree.def
>>> +++ b/gcc/tree.def
>>> @@ -1146,6 +1146,15 @@ DEFTREECODE (REDUC_PLUS_EXPR,
>>> "reduc_plus_expr", tcc_unary, 1)
>>>          arg3 = WIDEN_SUM_EXPR (tmp, arg3); */
>>>  DEFTREECODE (DOT_PROD_EXPR, "dot_prod_expr", tcc_expression, 3)
>>>
>>> +/* Widening sad (sum of absolute differences).
>>> +   The first two arguments are of type t1, which should be an unsigned
>>> +   integer type.  The third argument and the result are of type t2, such
>>> +   that t2 is at least twice the size of t1.  SAD_EXPR (arg1, arg2, arg3)
>>> +   is equivalent to:
>>> + tmp1 = WIDEN_MINUS_EXPR (arg1, arg2);
>>> + tmp2 = ABS_EXPR (tmp1);
>>> + arg3 = PLUS_EXPR (tmp2, arg3); */
>>> +DEFTREECODE (SAD_EXPR, "sad_expr", tcc_expression, 3)
>>> +
>>>  /* Widening summation.
>>>     The first argument is of type t1.
>>>     The second argument is of type t2, such that t2 is at least twice

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
  2013-10-30  0:09 ` Ramana Radhakrishnan
@ 2013-10-31  1:10   ` Cong Hou
  2013-10-31  3:18     ` Ramana Radhakrishnan
  0 siblings, 1 reply; 27+ messages in thread
From: Cong Hou @ 2013-10-31  1:10 UTC (permalink / raw)
  To: ramrad01; +Cc: GCC Patches, Richard Biener

On Tue, Oct 29, 2013 at 4:49 PM, Ramana Radhakrishnan
<ramana.gcc@googlemail.com> wrote:
> Cong,
>
> Please don't do the following.
>
>>+++ b/gcc/testsuite/gcc.dg/vect/
> vect-reduc-sad.c
> @@ -0,0 +1,54 @@
> +/* { dg-require-effective-target sse2 { target { i?86-*-* x86_64-*-* } } } */
>
> you are adding a test to gcc.dg/vect - It's a common directory
> containing tests that need to run on multiple architectures and such
> tests should be keyed by the feature they enable which can be turned
> on for ports that have such an instruction.
>
> The correct way of doing this is to key this on the feature something
> like dg-require-effective-target vect_sad_char . And define the
> equivalent routine in testsuite/lib/target-supports.exp and enable it
> for sse2 for the x86 port. If in doubt look at
> check_effective_target_vect_int and a whole family of such functions
> in testsuite/lib/target-supports.exp
>
> This makes life easy for other port maintainers who want to turn on
> this support. And for bonus points please update the testcase writing
> wiki page with this information if it isn't already there.
>

OK, I will likely move the test case to gcc.target/i386, as currently
only SSE2 provides a SAD instruction. But your suggestion also helps!


> You are also missing documentation updates for SAD_EXPR, md.texi for
> the new standard pattern name. Shouldn't it be called sad<mode>4
> really ?
>


I will add the documentation for the new operation SAD_EXPR.

I chose sad<mode> by simply following udot_prod<mode>, as those two
operations are quite similar:

 OPTAB_D (udot_prod_optab, "udot_prod$I$a")


thanks,
Cong


>
> regards
> Ramana
>
>
>
>
>
> On Tue, Oct 29, 2013 at 10:23 PM, Cong Hou <congh@google.com> wrote:
>> Hi
>>
>> SAD (Sum of Absolute Differences) is a common and important algorithm
>> in image processing and other areas. SSE2 even introduced a new
>> instruction PSADBW for it. A SAD loop can be greatly accelerated by
>> this instruction after being vectorized. This patch introduced a new
>> operation SAD_EXPR and a SAD pattern recognizer in vectorizer.
>>
>> The pattern of SAD is shown below:
>>
>>      unsigned type x_t, y_t;
>>      signed TYPE1 diff, abs_diff;
>>      TYPE2 sum = init;
>>    loop:
>>      sum_0 = phi <init, sum_1>
>>      S1  x_t = ...
>>      S2  y_t = ...
>>      S3  x_T = (TYPE1) x_t;
>>      S4  y_T = (TYPE1) y_t;
>>      S5  diff = x_T - y_T;
>>      S6  abs_diff = ABS_EXPR <diff>;
>>      [S7  abs_diff = (TYPE2) abs_diff;  #optional]
>>      S8  sum_1 = abs_diff + sum_0;
>>
>>    where 'TYPE1' is at least double the size of type 'type', and 'TYPE2' is
>>    the same size as 'TYPE1' or bigger. This is a special case of a reduction
>>    computation.
>>
>> For SSE2, type is char, and TYPE1 and TYPE2 are int.
>>
>>
>> In order to express this new operation, a new expression SAD_EXPR is
>> introduced in tree.def, and the corresponding entry in optabs is
>> added. The patch also added the "define_expand" for SSE2 and AVX2
>> platforms for i386.
>>
>> The patch is pasted below and also attached as a text file (in which
>> you can see tabs). Bootstrap and make check got passed on x86. Please
>> give me your comments.
>>
>>
>>
>> thanks,
>> Cong
>>
>>
>>
>> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
>> index 8a38316..d528307 100644
>> --- a/gcc/ChangeLog
>> +++ b/gcc/ChangeLog
>> @@ -1,3 +1,23 @@
>> +2013-10-29  Cong Hou  <congh@google.com>
>> +
>> + * tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD
>> + pattern recognition.
>> + (type_conversion_p): PROMOTION is true if it's a type promotion
>> + conversion, and false otherwise.  Return true if the given expression
>> + is a type conversion one.
>> + * tree-vectorizer.h: Adjust the number of patterns.
>> + * tree.def: Add SAD_EXPR.
>> + * optabs.def: Add sad_optab.
>> + * cfgexpand.c (expand_debug_expr): Add SAD_EXPR case.
>> + * expr.c (expand_expr_real_2): Likewise.
>> + * gimple-pretty-print.c (dump_ternary_rhs): Likewise.
>> + * gimple.c (get_gimple_rhs_num_ops): Likewise.
>> + * optabs.c (optab_for_tree_code): Likewise.
>> + * tree-cfg.c (estimate_operator_cost): Likewise.
>> + * tree-ssa-operands.c (get_expr_operands): Likewise.
>> + * tree-vect-loop.c (get_initial_def_for_reduction): Likewise.
>> + * config/i386/sse.md: Add SSE2 and AVX2 expand for SAD.
>> +
>>  2013-10-14  David Malcolm  <dmalcolm@redhat.com>
>>
>>   * dumpfile.h (gcc::dump_manager): New class, to hold state
>> diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
>> index 7ed29f5..9ec761a 100644
>> --- a/gcc/cfgexpand.c
>> +++ b/gcc/cfgexpand.c
>> @@ -2730,6 +2730,7 @@ expand_debug_expr (tree exp)
>>   {
>>   case COND_EXPR:
>>   case DOT_PROD_EXPR:
>> + case SAD_EXPR:
>>   case WIDEN_MULT_PLUS_EXPR:
>>   case WIDEN_MULT_MINUS_EXPR:
>>   case FMA_EXPR:
>> [Remainder of the quoted patch snipped; it is identical to the diff quoted
>> earlier in this thread.]
>> -  else if (TYPE_PRECISION (*orig_type) >= (TYPE_PRECISION (type) * 2))
>> -    *promotion = false;
>>    else
>> -    return false;
>> +    *promotion = false;
>>
>>    if (!vect_is_simple_use (oprnd0, *def_stmt, loop_vinfo,
>>     bb_vinfo, &dummy_gimple, &dummy, &dt))
>> @@ -433,6 +433,242 @@ vect_recog_dot_prod_pattern (vec<gimple> *stmts,
>> tree *type_in,
>>  }
>>
>>
>> +/* Function vect_recog_sad_pattern
>> +
>> +   Try to find the following Sum of Absolute Difference (SAD) pattern:
>> +
>> +     unsigned type x_t, y_t;
>> +     signed TYPE1 diff, abs_diff;
>> +     TYPE2 sum = init;
>> +   loop:
>> +     sum_0 = phi <init, sum_1>
>> +     S1  x_t = ...
>> +     S2  y_t = ...
>> +     S3  x_T = (TYPE1) x_t;
>> +     S4  y_T = (TYPE1) y_t;
>> +     S5  diff = x_T - y_T;
>> +     S6  abs_diff = ABS_EXPR <diff>;
>> +     [S7  abs_diff = (TYPE2) abs_diff;  #optional]
>> +     S8  sum_1 = abs_diff + sum_0;
>> +
>> +   where 'TYPE1' is at least double the size of type 'type', and 'TYPE2' is the
>> +   same size of 'TYPE1' or bigger. This is a special case of a reduction
>> +   computation.
>> +
>> +   Input:
>> +
>> +   * STMTS: Contains a stmt from which the pattern search begins.  In the
>> +   example, when this function is called with S8, the pattern
>> +   {S3,S4,S5,S6,S7,S8} will be detected.
>> +
>> +   Output:
>> +
>> +   * TYPE_IN: The type of the input arguments to the pattern.
>> +
>> +   * TYPE_OUT: The type of the output of this pattern.
>> +
>> +   * Return value: A new stmt that will be used to replace the sequence of
>> +   stmts that constitute the pattern. In this case it will be:
>> +        SAD_EXPR <x_t, y_t, sum_0>
>> +  */
>> +
>> +static gimple
>> +vect_recog_sad_pattern (vec<gimple> *stmts, tree *type_in,
>> +     tree *type_out)
>> +{
>> +  gimple last_stmt = (*stmts)[0];
>> +  tree sad_oprnd0, sad_oprnd1;
>> +  stmt_vec_info stmt_vinfo = vinfo_for_stmt (last_stmt);
>> +  tree half_type;
>> +  loop_vec_info loop_info = STMT_VINFO_LOOP_VINFO (stmt_vinfo);
>> +  struct loop *loop;
>> +  bool promotion;
>> +
>> +  if (!loop_info)
>> +    return NULL;
>> +
>> +  loop = LOOP_VINFO_LOOP (loop_info);
>> +
>> +  if (!is_gimple_assign (last_stmt))
>> +    return NULL;
>> +
>> +  tree sum_type = gimple_expr_type (last_stmt);
>> +
>> +  /* Look for the following pattern
>> +          DX = (TYPE1) X;
>> +          DY = (TYPE1) Y;
>> +          DDIFF = DX - DY;
>> +          DAD = ABS_EXPR <DDIFF>;
>> +          DDPROD = (TYPE2) DPROD;
>> +          sum_1 = DAD + sum_0;
>> +     In which
>> +     - DX is at least double the size of X
>> +     - DY is at least double the size of Y
>> +     - DX, DY, DDIFF, DAD all have the same type
>> +     - sum is the same size of DAD or bigger
>> +     - sum has been recognized as a reduction variable.
>> +
>> +     This is equivalent to:
>> +       DDIFF = X w- Y;          #widen sub
>> +       DAD = ABS_EXPR <DDIFF>;
>> +       sum_1 = DAD w+ sum_0;    #widen summation
>> +     or
>> +       DDIFF = X w- Y;          #widen sub
>> +       DAD = ABS_EXPR <DDIFF>;
>> +       sum_1 = DAD + sum_0;     #summation
>> +   */
>> +
>> +  /* Starting from LAST_STMT, follow the defs of its uses in search
>> +     of the above pattern.  */
>> +
>> +  if (gimple_assign_rhs_code (last_stmt) != PLUS_EXPR)
>> +    return NULL;
>> +
>> +  tree plus_oprnd0, plus_oprnd1;
>> +
>> +  if (STMT_VINFO_IN_PATTERN_P (stmt_vinfo))
>> +    {
>> +      /* Has been detected as widening-summation?  */
>> +
>> +      gimple stmt = STMT_VINFO_RELATED_STMT (stmt_vinfo);
>> +      sum_type = gimple_expr_type (stmt);
>> +      if (gimple_assign_rhs_code (stmt) != WIDEN_SUM_EXPR)
>> +        return NULL;
>> +      plus_oprnd0 = gimple_assign_rhs1 (stmt);
>> +      plus_oprnd1 = gimple_assign_rhs2 (stmt);
>> +      half_type = TREE_TYPE (plus_oprnd0);
>> +    }
>> +  else
>> +    {
>> +      gimple def_stmt;
>> +
>> +      if (STMT_VINFO_DEF_TYPE (stmt_vinfo) != vect_reduction_def)
>> +        return NULL;
>> +      plus_oprnd0 = gimple_assign_rhs1 (last_stmt);
>> +      plus_oprnd1 = gimple_assign_rhs2 (last_stmt);
>> +      if (!types_compatible_p (TREE_TYPE (plus_oprnd0), sum_type)
>> +  || !types_compatible_p (TREE_TYPE (plus_oprnd1), sum_type))
>> +        return NULL;
>> +
>> +      /* The type conversion could be promotion, demotion,
>> +         or just signed -> unsigned.  */
>> +      if (type_conversion_p (plus_oprnd0, last_stmt, false,
>> +                             &half_type, &def_stmt, &promotion))
>> +        plus_oprnd0 = gimple_assign_rhs1 (def_stmt);
>> +      else
>> +        half_type = sum_type;
>> +    }
>> +
>> +  /* So far so good.  Since last_stmt was detected as a (summation) reduction,
>> +     we know that plus_oprnd1 is the reduction variable (defined by a
>> loop-header
>> +     phi), and plus_oprnd0 is an ssa-name defined by a stmt in the loop body.
>> +     Then check that plus_oprnd0 is defined by an abs_expr  */
>> +
>> +  if (TREE_CODE (plus_oprnd0) != SSA_NAME)
>> +    return NULL;
>> +
>> +  tree abs_type = half_type;
>> +  gimple abs_stmt = SSA_NAME_DEF_STMT (plus_oprnd0);
>> +
>> +  /* It could not be the sad pattern if the abs_stmt is outside the loop.  */
>> +  if (!gimple_bb (abs_stmt) || !flow_bb_inside_loop_p (loop,
>> gimple_bb (abs_stmt)))
>> +    return NULL;
>> +
>> +  /* FORNOW.  Can continue analyzing the def-use chain when this stmt in a phi
>> +     inside the loop (in case we are analyzing an outer-loop).  */
>> +  if (!is_gimple_assign (abs_stmt))
>> +    return NULL;
>> +
>> +  stmt_vec_info abs_stmt_vinfo = vinfo_for_stmt (abs_stmt);
>> +  gcc_assert (abs_stmt_vinfo);
>> +  if (STMT_VINFO_DEF_TYPE (abs_stmt_vinfo) != vect_internal_def)
>> +    return NULL;
>> +  if (gimple_assign_rhs_code (abs_stmt) != ABS_EXPR)
>> +    return NULL;
>> +
>> +  tree abs_oprnd = gimple_assign_rhs1 (abs_stmt);
>> +  if (!types_compatible_p (TREE_TYPE (abs_oprnd), abs_type))
>> +    return NULL;
>> +  if (TYPE_UNSIGNED (abs_type))
>> +    return NULL;
>> +
>> +  /* We then detect if the operand of abs_expr is defined by a minus_expr.  */
>> +
>> +  if (TREE_CODE (abs_oprnd) != SSA_NAME)
>> +    return NULL;
>> +
>> +  gimple diff_stmt = SSA_NAME_DEF_STMT (abs_oprnd);
>> +
>> +  /* It could not be the sad pattern if the diff_stmt is outside the loop.  */
>> +  if (!gimple_bb (diff_stmt)
>> +      || !flow_bb_inside_loop_p (loop, gimple_bb (diff_stmt)))
>> +    return NULL;
>> +
>> +  /* FORNOW.  Can continue analyzing the def-use chain when this stmt in a phi
>> +     inside the loop (in case we are analyzing an outer-loop).  */
>> +  if (!is_gimple_assign (diff_stmt))
>> +    return NULL;
>> +
>> +  stmt_vec_info diff_stmt_vinfo = vinfo_for_stmt (diff_stmt);
>> +  gcc_assert (diff_stmt_vinfo);
>> +  if (STMT_VINFO_DEF_TYPE (diff_stmt_vinfo) != vect_internal_def)
>> +    return NULL;
>> +  if (gimple_assign_rhs_code (diff_stmt) != MINUS_EXPR)
>> +    return NULL;
>> +
>> +  tree half_type0, half_type1;
>> +  gimple def_stmt;
>> +
>> +  tree minus_oprnd0 = gimple_assign_rhs1 (diff_stmt);
>> +  tree minus_oprnd1 = gimple_assign_rhs2 (diff_stmt);
>> +
>> +  if (!types_compatible_p (TREE_TYPE (minus_oprnd0), abs_type)
>> +      || !types_compatible_p (TREE_TYPE (minus_oprnd1), abs_type))
>> +    return NULL;
>> +  if (!type_conversion_p (minus_oprnd0, diff_stmt, false,
>> +                          &half_type0, &def_stmt, &promotion)
>> +      || !promotion)
>> +    return NULL;
>> +  sad_oprnd0 = gimple_assign_rhs1 (def_stmt);
>> +
>> +  if (!type_conversion_p (minus_oprnd1, diff_stmt, false,
>> +                          &half_type1, &def_stmt, &promotion)
>> +      || !promotion)
>> +    return NULL;
>> +  sad_oprnd1 = gimple_assign_rhs1 (def_stmt);
>> +
>> +  if (!types_compatible_p (half_type0, half_type1))
>> +    return NULL;
>> +  if (!TYPE_UNSIGNED (half_type0))
>> +    return NULL;
>> +  if (TYPE_PRECISION (abs_type) < TYPE_PRECISION (half_type0) * 2
>> +      || TYPE_PRECISION (sum_type) < TYPE_PRECISION (half_type0) * 2)
>> +    return NULL;
>> +
>> +  *type_in = TREE_TYPE (sad_oprnd0);
>> +  *type_out = sum_type;
>> +
>> +  /* Pattern detected. Create a stmt to be used to replace the pattern: */
>> +  tree var = vect_recog_temp_ssa_var (sum_type, NULL);
>> +  gimple pattern_stmt = gimple_build_assign_with_ops
>> +                          (SAD_EXPR, var, sad_oprnd0, sad_oprnd1, plus_oprnd1);
>> +
>> +  if (dump_enabled_p ())
>> +    {
>> +      dump_printf_loc (MSG_NOTE, vect_location,
>> +                       "vect_recog_sad_pattern: detected: ");
>> +      dump_gimple_stmt (MSG_NOTE, TDF_SLIM, pattern_stmt, 0);
>> +      dump_printf (MSG_NOTE, "\n");
>> +    }
>> +
>> +  /* We don't allow changing the order of the computation in the inner-loop
>> +     when doing outer-loop vectorization.  */
>> +  gcc_assert (!nested_in_vect_loop_p (loop, last_stmt));
>> +
>> +  return pattern_stmt;
>> +}
>> +
>> +
>>  /* Handle widening operation by a constant.  At the moment we support MULT_EXPR
>>     and LSHIFT_EXPR.
>>
>> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
>> index 8b7b345..0aac75b 100644
>> --- a/gcc/tree-vectorizer.h
>> +++ b/gcc/tree-vectorizer.h
>> @@ -1044,7 +1044,7 @@ extern void vect_slp_transform_bb (basic_block);
>>     Additional pattern recognition functions can (and will) be added
>>     in the future.  */
>>  typedef gimple (* vect_recog_func_ptr) (vec<gimple> *, tree *, tree *);
>> -#define NUM_PATTERNS 11
>> +#define NUM_PATTERNS 12
>>  void vect_pattern_recog (loop_vec_info, bb_vec_info);
>>
>>  /* In tree-vectorizer.c.  */
>> diff --git a/gcc/tree.def b/gcc/tree.def
>> index 88c850a..31a3b64 100644
>> --- a/gcc/tree.def
>> +++ b/gcc/tree.def
>> @@ -1146,6 +1146,15 @@ DEFTREECODE (REDUC_PLUS_EXPR,
>> "reduc_plus_expr", tcc_unary, 1)
>>          arg3 = WIDEN_SUM_EXPR (tmp, arg3); */
>>  DEFTREECODE (DOT_PROD_EXPR, "dot_prod_expr", tcc_expression, 3)
>>
>> +/* Widening sad (sum of absolute differences).
>> +   The first two arguments are of type t1 which should be unsigned integer.
>> +   The third argument and the result are of type t2, such that t2 is at least
>> +   twice the size of t1. SAD_EXPR(arg1,arg2,arg3) is equivalent to:
>> + tmp1 = WIDEN_MINUS_EXPR (arg1, arg2);
>> + tmp2 = ABS_EXPR (tmp1);
>> + arg3 = PLUS_EXPR (tmp2, arg3); */
>> +DEFTREECODE (SAD_EXPR, "sad_expr", tcc_expression, 3)
>> +
>>  /* Widening summation.
>>     The first argument is of type t1.
>>     The second argument is of type t2, such that t2 is at least twice


* Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
  2013-10-30 12:16 ` Richard Biener
@ 2013-10-31  0:50   ` Cong Hou
  0 siblings, 0 replies; 27+ messages in thread
From: Cong Hou @ 2013-10-31  0:50 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches

On Wed, Oct 30, 2013 at 4:27 AM, Richard Biener <rguenther@suse.de> wrote:
> On Tue, 29 Oct 2013, Cong Hou wrote:
>
>> Hi
>>
>> SAD (Sum of Absolute Differences) is a common and important algorithm
>> in image processing and other areas. SSE2 even introduced a new
>> instruction PSADBW for it. A SAD loop can be greatly accelerated by
>> this instruction after being vectorized. This patch introduced a new
>> operation SAD_EXPR and a SAD pattern recognizer in vectorizer.
>>
>> The pattern of SAD is shown below:
>>
>>      unsigned type x_t, y_t;
>>      signed TYPE1 diff, abs_diff;
>>      TYPE2 sum = init;
>>    loop:
>>      sum_0 = phi <init, sum_1>
>>      S1  x_t = ...
>>      S2  y_t = ...
>>      S3  x_T = (TYPE1) x_t;
>>      S4  y_T = (TYPE1) y_t;
>>      S5  diff = x_T - y_T;
>>      S6  abs_diff = ABS_EXPR <diff>;
>>      [S7  abs_diff = (TYPE2) abs_diff;  #optional]
>>      S8  sum_1 = abs_diff + sum_0;
>>
>>    where 'TYPE1' is at least double the size of type 'type', and 'TYPE2' is the
>>    same size of 'TYPE1' or bigger. This is a special case of a reduction
>>    computation.
>>
>> For SSE2, type is char, and TYPE1 and TYPE2 are int.
>>
>>
>> In order to express this new operation, a new expression SAD_EXPR is
>> introduced in tree.def, and the corresponding entry in optabs is
>> added. The patch also added the "define_expand" for SSE2 and AVX2
>> platforms for i386.
>>
>> The patch is pasted below and also attached as a text file (in which
>> you can see tabs). Bootstrap and make check got passed on x86. Please
>> give me your comments.
>
> Apart from the testcase comment made earlier
>
> +++ b/gcc/tree-cfg.c
> @@ -3797,6 +3797,7 @@ verify_gimple_assign_ternary (gimple stmt)
>        return false;
>
>      case DOT_PROD_EXPR:
> +    case SAD_EXPR:
>      case REALIGN_LOAD_EXPR:
>        /* FIXME.  */
>        return false;
>
> please add proper verification of the operand types.

OK.

>
> +/* Widening sad (sum of absolute differences).
> +   The first two arguments are of type t1 which should be unsigned
> integer.
> +   The third argument and the result are of type t2, such that t2 is at
> least
> +   twice the size of t1. SAD_EXPR(arg1,arg2,arg3) is equivalent to:
> +       tmp1 = WIDEN_MINUS_EXPR (arg1, arg2);
> +       tmp2 = ABS_EXPR (tmp1);
> +       arg3 = PLUS_EXPR (tmp2, arg3);           */
> +DEFTREECODE (SAD_EXPR, "sad_expr", tcc_expression, 3)
>
> WIDEN_MINUS_EXPR doesn't exist so you have to explain on its
> operation (it returns a signed wide difference?).  Why should
> the first two arguments be unsigned?  I cannot see a good reason
> to require that (other than that maybe the x86 target only has
> support for widened unsigned difference?).  So if you want to
> make that restriction maybe change the name to SADU_EXPR
> (sum of absolute differences of unsigned)?
>
> I suppose you tried introducing WIDEN_MINUS_EXPR instead and
> letting combine do it's work, avoiding the very special optab?

I may have used the wrong representation here. I think the behavior of
"WIDEN_MINUS_EXPR" in SAD differs from the general one. SAD usually
operates on unsigned integers (see
http://en.wikipedia.org/wiki/Sum_of_absolute_differences), and before
the difference between two unsigned integers is taken, both are
promoted to wider signed integers. Note that the result of
(int)(char)(1) - (int)(char)(-1) differs from that of
(int)(unsigned char)(1) - (int)(unsigned char)(-1), so we cannot
implement SAD in terms of WIDEN_MINUS_EXPR.

Additionally, the SSE2 instruction PSADBW requires its operands to be
unsigned 8-bit integers.
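
For illustration, PSADBW is reachable from C through the
`_mm_sad_epu8` intrinsic (emmintrin.h): it produces two 64-bit lanes,
each holding the SAD of one 8-byte half, which is the value the
`sadv16qi` expander above converts and adds into the V4SI accumulator.
The sketch below (not part of the patch, SSE2-only) compares one such
step against the scalar loop shape the pattern recognizer matches:

```c
#include <emmintrin.h>
#include <stdlib.h>

/* Scalar reference: the reduction loop the SAD pattern detects.  */
static int
sad_scalar (const unsigned char *x, const unsigned char *y, int n)
{
  int sum = 0;
  for (int i = 0; i < n; i++)
    sum += abs (x[i] - y[i]);
  return sum;
}

/* One PSADBW step over 16 bytes: each 64-bit lane of the result holds
   the partial sum of one 8-byte half in its low 16 bits.  */
static int
sad_psadbw (const unsigned char *x, const unsigned char *y)
{
  __m128i a = _mm_loadu_si128 ((const __m128i *) x);
  __m128i b = _mm_loadu_si128 ((const __m128i *) y);
  __m128i s = _mm_sad_epu8 (a, b);
  return _mm_cvtsi128_si32 (s)
	 + _mm_cvtsi128_si32 (_mm_srli_si128 (s, 8));
}
```

Because the byte differences are taken as unsigned absolute values,
this only matches the scalar loop when the inputs are unsigned chars,
mirroring the restriction discussed above.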

I will remove the improper description as you pointed out.



thanks,
Cong


>
> Thanks,
> Richard.
>
>>
>>
>> thanks,
>> Cong
>>
>>
>>
>> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
>> index 8a38316..d528307 100644
>> --- a/gcc/ChangeLog
>> +++ b/gcc/ChangeLog
>> @@ -1,3 +1,23 @@
>> +2013-10-29  Cong Hou  <congh@google.com>
>> +
>> + * tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD
>> + pattern recognition.
>> + (type_conversion_p): PROMOTION is true if it's a type promotion
>> + conversion, and false otherwise.  Return true if the given expression
>> + is a type conversion one.
>> + * tree-vectorizer.h: Adjust the number of patterns.
>> + * tree.def: Add SAD_EXPR.
>> + * optabs.def: Add sad_optab.
>> + * cfgexpand.c (expand_debug_expr): Add SAD_EXPR case.
>> + * expr.c (expand_expr_real_2): Likewise.
>> + * gimple-pretty-print.c (dump_ternary_rhs): Likewise.
>> + * gimple.c (get_gimple_rhs_num_ops): Likewise.
>> + * optabs.c (optab_for_tree_code): Likewise.
>> + * tree-cfg.c (estimate_operator_cost): Likewise.
>> + * tree-ssa-operands.c (get_expr_operands): Likewise.
>> + * tree-vect-loop.c (get_initial_def_for_reduction): Likewise.
>> + * config/i386/sse.md: Add SSE2 and AVX2 expand for SAD.
>> +
>>  2013-10-14  David Malcolm  <dmalcolm@redhat.com>
>>
>>   * dumpfile.h (gcc::dump_manager): New class, to hold state
>> diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
>> index 7ed29f5..9ec761a 100644
>> --- a/gcc/cfgexpand.c
>> +++ b/gcc/cfgexpand.c
>> @@ -2730,6 +2730,7 @@ expand_debug_expr (tree exp)
>>   {
>>   case COND_EXPR:
>>   case DOT_PROD_EXPR:
>> + case SAD_EXPR:
>>   case WIDEN_MULT_PLUS_EXPR:
>>   case WIDEN_MULT_MINUS_EXPR:
>>   case FMA_EXPR:
>> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
>> index c3f6c94..ca1ab70 100644
>> --- a/gcc/config/i386/sse.md
>> +++ b/gcc/config/i386/sse.md
>> @@ -6052,6 +6052,40 @@
>>    DONE;
>>  })
>>
>> +(define_expand "sadv16qi"
>> +  [(match_operand:V4SI 0 "register_operand")
>> +   (match_operand:V16QI 1 "register_operand")
>> +   (match_operand:V16QI 2 "register_operand")
>> +   (match_operand:V4SI 3 "register_operand")]
>> +  "TARGET_SSE2"
>> +{
>> +  rtx t1 = gen_reg_rtx (V2DImode);
>> +  rtx t2 = gen_reg_rtx (V4SImode);
>> +  emit_insn (gen_sse2_psadbw (t1, operands[1], operands[2]));
>> +  convert_move (t2, t1, 0);
>> +  emit_insn (gen_rtx_SET (VOIDmode, operands[0],
>> +  gen_rtx_PLUS (V4SImode,
>> + operands[3], t2)));
>> +  DONE;
>> +})
>> +
>> +(define_expand "sadv32qi"
>> +  [(match_operand:V8SI 0 "register_operand")
>> +   (match_operand:V32QI 1 "register_operand")
>> +   (match_operand:V32QI 2 "register_operand")
>> +   (match_operand:V8SI 3 "register_operand")]
>> +  "TARGET_AVX2"
>> +{
>> +  rtx t1 = gen_reg_rtx (V4DImode);
>> +  rtx t2 = gen_reg_rtx (V8SImode);
>> +  emit_insn (gen_avx2_psadbw (t1, operands[1], operands[2]));
>> +  convert_move (t2, t1, 0);
>> +  emit_insn (gen_rtx_SET (VOIDmode, operands[0],
>> +  gen_rtx_PLUS (V8SImode,
>> + operands[3], t2)));
>> +  DONE;
>> +})
>> +
>>  (define_insn "ashr<mode>3"
>>    [(set (match_operand:VI24_AVX2 0 "register_operand" "=x,x")
>>   (ashiftrt:VI24_AVX2
>> diff --git a/gcc/expr.c b/gcc/expr.c
>> index 4975a64..1db8a49 100644
>> --- a/gcc/expr.c
>> +++ b/gcc/expr.c
>> @@ -9026,6 +9026,20 @@ expand_expr_real_2 (sepops ops, rtx target,
>> enum machine_mode tmode,
>>   return target;
>>        }
>>
>> +      case SAD_EXPR:
>> +      {
>> + tree oprnd0 = treeop0;
>> + tree oprnd1 = treeop1;
>> + tree oprnd2 = treeop2;
>> + rtx op2;
>> +
>> + expand_operands (oprnd0, oprnd1, NULL_RTX, &op0, &op1, EXPAND_NORMAL);
>> + op2 = expand_normal (oprnd2);
>> + target = expand_widen_pattern_expr (ops, op0, op1, op2,
>> +    target, unsignedp);
>> + return target;
>> +      }
>> +
>>      case REALIGN_LOAD_EXPR:
>>        {
>>          tree oprnd0 = treeop0;
>> diff --git a/gcc/gimple-pretty-print.c b/gcc/gimple-pretty-print.c
>> index f0f8166..514ddd1 100644
>> --- a/gcc/gimple-pretty-print.c
>> +++ b/gcc/gimple-pretty-print.c
>> @@ -425,6 +425,16 @@ dump_ternary_rhs (pretty_printer *buffer, gimple
>> gs, int spc, int flags)
>>        dump_generic_node (buffer, gimple_assign_rhs3 (gs), spc, flags, false);
>>        pp_greater (buffer);
>>        break;
>> +
>> +    case SAD_EXPR:
>> +      pp_string (buffer, "SAD_EXPR <");
>> +      dump_generic_node (buffer, gimple_assign_rhs1 (gs), spc, flags, false);
>> +      pp_string (buffer, ", ");
>> +      dump_generic_node (buffer, gimple_assign_rhs2 (gs), spc, flags, false);
>> +      pp_string (buffer, ", ");
>> +      dump_generic_node (buffer, gimple_assign_rhs3 (gs), spc, flags, false);
>> +      pp_greater (buffer);
>> +      break;
>>
>>      case VEC_PERM_EXPR:
>>        pp_string (buffer, "VEC_PERM_EXPR <");
>> diff --git a/gcc/gimple.c b/gcc/gimple.c
>> index a12dd67..4975959 100644
>> --- a/gcc/gimple.c
>> +++ b/gcc/gimple.c
>> @@ -2562,6 +2562,7 @@ get_gimple_rhs_num_ops (enum tree_code code)
>>        || (SYM) == WIDEN_MULT_PLUS_EXPR    \
>>        || (SYM) == WIDEN_MULT_MINUS_EXPR    \
>>        || (SYM) == DOT_PROD_EXPR    \
>> +      || (SYM) == SAD_EXPR    \
>>        || (SYM) == REALIGN_LOAD_EXPR    \
>>        || (SYM) == VEC_COND_EXPR    \
>>        || (SYM) == VEC_PERM_EXPR                                             \
>> diff --git a/gcc/optabs.c b/gcc/optabs.c
>> index 06a626c..4ddd4d9 100644
>> --- a/gcc/optabs.c
>> +++ b/gcc/optabs.c
>> @@ -462,6 +462,9 @@ optab_for_tree_code (enum tree_code code, const_tree type,
>>      case DOT_PROD_EXPR:
>>        return TYPE_UNSIGNED (type) ? udot_prod_optab : sdot_prod_optab;
>>
>> +    case SAD_EXPR:
>> +      return sad_optab;
>> +
>>      case WIDEN_MULT_PLUS_EXPR:
>>        return (TYPE_UNSIGNED (type)
>>        ? (TYPE_SATURATING (type)
>> diff --git a/gcc/optabs.def b/gcc/optabs.def
>> index 6b924ac..e35d567 100644
>> --- a/gcc/optabs.def
>> +++ b/gcc/optabs.def
>> @@ -248,6 +248,7 @@ OPTAB_D (sdot_prod_optab, "sdot_prod$I$a")
>>  OPTAB_D (ssum_widen_optab, "widen_ssum$I$a3")
>>  OPTAB_D (udot_prod_optab, "udot_prod$I$a")
>>  OPTAB_D (usum_widen_optab, "widen_usum$I$a3")
>> +OPTAB_D (sad_optab, "sad$I$a")
>>  OPTAB_D (vec_extract_optab, "vec_extract$a")
>>  OPTAB_D (vec_init_optab, "vec_init$a")
>>  OPTAB_D (vec_pack_sfix_trunc_optab, "vec_pack_sfix_trunc_$a")
>> diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
>> index 075d071..226b8d5 100644
>> --- a/gcc/testsuite/ChangeLog
>> +++ b/gcc/testsuite/ChangeLog
>> @@ -1,3 +1,7 @@
>> +2013-10-29  Cong Hou  <congh@google.com>
>> +
>> + * gcc.dg/vect/vect-reduc-sad.c: New.
>> +
>>  2013-10-14  Tobias Burnus  <burnus@net-b.de>
>>
>>   PR fortran/58658
>> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c
>> b/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c
>> new file mode 100644
>> index 0000000..14ebb3b
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c
>> @@ -0,0 +1,54 @@
>> +/* { dg-require-effective-target sse2 { target { i?86-*-* x86_64-*-* } } } */
>> +
>> +#include <stdarg.h>
>> +#include "tree-vect.h"
>> +
>> +#define N 64
>> +#define SAD N*N/2
>> +
>> +unsigned char X[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
>> +unsigned char Y[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
>> +
>> +/* Sum of absolute differences between arrays of unsigned char types.
>> +   Detected as a sad pattern.
>> +   Vectorized on targets that support sad for unsigned chars.  */
>> +
>> +__attribute__ ((noinline)) int
>> +foo (int len)
>> +{
>> +  int i;
>> +  int result = 0;
>> +
>> +  for (i = 0; i < len; i++)
>> +    result += abs (X[i] - Y[i]);
>> +
>> +  return result;
>> +}
>> +
>> +
>> +int
>> +main (void)
>> +{
>> +  int i;
>> +  int sad;
>> +
>> +  check_vect ();
>> +
>> +  for (i = 0; i < N; i++)
>> +    {
>> +      X[i] = i;
>> +      Y[i] = N - i;
>> +      __asm__ volatile ("");
>> +    }
>> +
>> +  sad = foo (N);
>> +  if (sad != SAD)
>> +    abort ();
>> +
>> +  return 0;
>> +}
>> +
>> +/* { dg-final { scan-tree-dump-times "vect_recog_sad_pattern:
>> detected" 1 "vect" } } */
>> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
>> +/* { dg-final { cleanup-tree-dump "vect" } } */
>> +
>> diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c
>> index 8b66791..d689cac 100644
>> --- a/gcc/tree-cfg.c
>> +++ b/gcc/tree-cfg.c
>> @@ -3797,6 +3797,7 @@ verify_gimple_assign_ternary (gimple stmt)
>>        return false;
>>
>>      case DOT_PROD_EXPR:
>> +    case SAD_EXPR:
>>      case REALIGN_LOAD_EXPR:
>>        /* FIXME.  */
>>        return false;
>> diff --git a/gcc/tree-inline.c b/gcc/tree-inline.c
>> index 2221b9c..44261a3 100644
>> --- a/gcc/tree-inline.c
>> +++ b/gcc/tree-inline.c
>> @@ -3601,6 +3601,7 @@ estimate_operator_cost (enum tree_code code,
>> eni_weights *weights,
>>      case WIDEN_SUM_EXPR:
>>      case WIDEN_MULT_EXPR:
>>      case DOT_PROD_EXPR:
>> +    case SAD_EXPR:
>>      case WIDEN_MULT_PLUS_EXPR:
>>      case WIDEN_MULT_MINUS_EXPR:
>>      case WIDEN_LSHIFT_EXPR:
>> diff --git a/gcc/tree-ssa-operands.c b/gcc/tree-ssa-operands.c
>> index 603f797..393efc3 100644
>> --- a/gcc/tree-ssa-operands.c
>> +++ b/gcc/tree-ssa-operands.c
>> @@ -854,6 +854,7 @@ get_expr_operands (gimple stmt, tree *expr_p, int flags)
>>        }
>>
>>      case DOT_PROD_EXPR:
>> +    case SAD_EXPR:
>>      case REALIGN_LOAD_EXPR:
>>      case WIDEN_MULT_PLUS_EXPR:
>>      case WIDEN_MULT_MINUS_EXPR:
>> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
>> index 638b981..89aa8c7 100644
>> --- a/gcc/tree-vect-loop.c
>> +++ b/gcc/tree-vect-loop.c
>> @@ -3607,6 +3607,7 @@ get_initial_def_for_reduction (gimple stmt, tree init_val,
>>      {
>>        case WIDEN_SUM_EXPR:
>>        case DOT_PROD_EXPR:
>> +      case SAD_EXPR:
>>        case PLUS_EXPR:
>>        case MINUS_EXPR:
>>        case BIT_IOR_EXPR:
>> diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
>> index 0a4e812..7919449 100644
>> --- a/gcc/tree-vect-patterns.c
>> +++ b/gcc/tree-vect-patterns.c
>> @@ -45,6 +45,8 @@ static gimple vect_recog_widen_mult_pattern
>> (vec<gimple> *, tree *,
>>       tree *);
>>  static gimple vect_recog_dot_prod_pattern (vec<gimple> *, tree *,
>>     tree *);
>> +static gimple vect_recog_sad_pattern (vec<gimple> *, tree *,
>> +      tree *);
>>  static gimple vect_recog_pow_pattern (vec<gimple> *, tree *, tree *);
>>  static gimple vect_recog_over_widening_pattern (vec<gimple> *, tree *,
>>                                                   tree *);
>> @@ -62,6 +64,7 @@ static vect_recog_func_ptr
>> vect_vect_recog_func_ptrs[NUM_PATTERNS] = {
>>   vect_recog_widen_mult_pattern,
>>   vect_recog_widen_sum_pattern,
>>   vect_recog_dot_prod_pattern,
>> +        vect_recog_sad_pattern,
>>   vect_recog_pow_pattern,
>>   vect_recog_widen_shift_pattern,
>>   vect_recog_over_widening_pattern,
>> @@ -140,9 +143,8 @@ vect_single_imm_use (gimple def_stmt)
>>  }
>>
>>  /* Check whether NAME, an ssa-name used in USE_STMT,
>> -   is a result of a type promotion or demotion, such that:
>> +   is a result of a type promotion, such that:
>>       DEF_STMT: NAME = NOP (name0)
>> -   where the type of name0 (ORIG_TYPE) is smaller/bigger than the type of NAME.
>>     If CHECK_SIGN is TRUE, check that either both types are signed or both are
>>     unsigned.  */
>>
>> @@ -189,10 +191,8 @@ type_conversion_p (tree name, gimple use_stmt,
>> bool check_sign,
>>
>>    if (TYPE_PRECISION (type) >= (TYPE_PRECISION (*orig_type) * 2))
>>      *promotion = true;
>> -  else if (TYPE_PRECISION (*orig_type) >= (TYPE_PRECISION (type) * 2))
>> -    *promotion = false;
>>    else
>> -    return false;
>> +    *promotion = false;
>>
>>    if (!vect_is_simple_use (oprnd0, *def_stmt, loop_vinfo,
>>     bb_vinfo, &dummy_gimple, &dummy, &dt))
>> @@ -433,6 +433,242 @@ vect_recog_dot_prod_pattern (vec<gimple> *stmts,
>> tree *type_in,
>>  }
>>
>>
>> +/* Function vect_recog_sad_pattern
>> +
>> +   Try to find the following Sum of Absolute Difference (SAD) pattern:
>> +
>> +     unsigned type x_t, y_t;
>> +     signed TYPE1 diff, abs_diff;
>> +     TYPE2 sum = init;
>> +   loop:
>> +     sum_0 = phi <init, sum_1>
>> +     S1  x_t = ...
>> +     S2  y_t = ...
>> +     S3  x_T = (TYPE1) x_t;
>> +     S4  y_T = (TYPE1) y_t;
>> +     S5  diff = x_T - y_T;
>> +     S6  abs_diff = ABS_EXPR <diff>;
>> +     [S7  abs_diff = (TYPE2) abs_diff;  #optional]
>> +     S8  sum_1 = abs_diff + sum_0;
>> +
>> +   where 'TYPE1' is at least double the size of type 'type', and 'TYPE2' is the
>> +   same size of 'TYPE1' or bigger. This is a special case of a reduction
>> +   computation.
>> +
>> +   Input:
>> +
>> +   * STMTS: Contains a stmt from which the pattern search begins.  In the
>> +   example, when this function is called with S8, the pattern
>> +   {S3,S4,S5,S6,S7,S8} will be detected.
>> +
>> +   Output:
>> +
>> +   * TYPE_IN: The type of the input arguments to the pattern.
>> +
>> +   * TYPE_OUT: The type of the output of this pattern.
>> +
>> +   * Return value: A new stmt that will be used to replace the sequence of
>> +   stmts that constitute the pattern. In this case it will be:
>> +        SAD_EXPR <x_t, y_t, sum_0>
>> +  */
>> +
>> +static gimple
>> +vect_recog_sad_pattern (vec<gimple> *stmts, tree *type_in,
>> +     tree *type_out)
>> +{
>> +  gimple last_stmt = (*stmts)[0];
>> +  tree sad_oprnd0, sad_oprnd1;
>> +  stmt_vec_info stmt_vinfo = vinfo_for_stmt (last_stmt);
>> +  tree half_type;
>> +  loop_vec_info loop_info = STMT_VINFO_LOOP_VINFO (stmt_vinfo);
>> +  struct loop *loop;
>> +  bool promotion;
>> +
>> +  if (!loop_info)
>> +    return NULL;
>> +
>> +  loop = LOOP_VINFO_LOOP (loop_info);
>> +
>> +  if (!is_gimple_assign (last_stmt))
>> +    return NULL;
>> +
>> +  tree sum_type = gimple_expr_type (last_stmt);
>> +
>> +  /* Look for the following pattern
>> +          DX = (TYPE1) X;
>> +          DY = (TYPE1) Y;
>> +          DDIFF = DX - DY;
>> +          DAD = ABS_EXPR <DDIFF>;
>> +          DDPROD = (TYPE2) DPROD;
>> +          sum_1 = DAD + sum_0;
>> +     In which
>> +     - DX is at least double the size of X
>> +     - DY is at least double the size of Y
>> +     - DX, DY, DDIFF, DAD all have the same type
>> +     - sum is the same size of DAD or bigger
>> +     - sum has been recognized as a reduction variable.
>> +
>> +     This is equivalent to:
>> +       DDIFF = X w- Y;          #widen sub
>> +       DAD = ABS_EXPR <DDIFF>;
>> +       sum_1 = DAD w+ sum_0;    #widen summation
>> +     or
>> +       DDIFF = X w- Y;          #widen sub
>> +       DAD = ABS_EXPR <DDIFF>;
>> +       sum_1 = DAD + sum_0;     #summation
>> +   */
>> +
>> +  /* Starting from LAST_STMT, follow the defs of its uses in search
>> +     of the above pattern.  */
>> +
>> +  if (gimple_assign_rhs_code (last_stmt) != PLUS_EXPR)
>> +    return NULL;
>> +
>> +  tree plus_oprnd0, plus_oprnd1;
>> +
>> +  if (STMT_VINFO_IN_PATTERN_P (stmt_vinfo))
>> +    {
>> +      /* Has been detected as widening-summation?  */
>> +
>> +      gimple stmt = STMT_VINFO_RELATED_STMT (stmt_vinfo);
>> +      sum_type = gimple_expr_type (stmt);
>> +      if (gimple_assign_rhs_code (stmt) != WIDEN_SUM_EXPR)
>> +        return NULL;
>> +      plus_oprnd0 = gimple_assign_rhs1 (stmt);
>> +      plus_oprnd1 = gimple_assign_rhs2 (stmt);
>> +      half_type = TREE_TYPE (plus_oprnd0);
>> +    }
>> +  else
>> +    {
>> +      gimple def_stmt;
>> +
>> +      if (STMT_VINFO_DEF_TYPE (stmt_vinfo) != vect_reduction_def)
>> +        return NULL;
>> +      plus_oprnd0 = gimple_assign_rhs1 (last_stmt);
>> +      plus_oprnd1 = gimple_assign_rhs2 (last_stmt);
>> +      if (!types_compatible_p (TREE_TYPE (plus_oprnd0), sum_type)
>> +  || !types_compatible_p (TREE_TYPE (plus_oprnd1), sum_type))
>> +        return NULL;
>> +
>> +      /* The type conversion could be promotion, demotion,
>> +         or just signed -> unsigned.  */
>> +      if (type_conversion_p (plus_oprnd0, last_stmt, false,
>> +                             &half_type, &def_stmt, &promotion))
>> +        plus_oprnd0 = gimple_assign_rhs1 (def_stmt);
>> +      else
>> +        half_type = sum_type;
>> +    }
>> +
>> +  /* So far so good.  Since last_stmt was detected as a (summation) reduction,
>> +     we know that plus_oprnd1 is the reduction variable (defined by a loop-header
>> +     phi), and plus_oprnd0 is an ssa-name defined by a stmt in the loop body.
>> +     Then check that plus_oprnd0 is defined by an abs_expr  */
>> +
>> +  if (TREE_CODE (plus_oprnd0) != SSA_NAME)
>> +    return NULL;
>> +
>> +  tree abs_type = half_type;
>> +  gimple abs_stmt = SSA_NAME_DEF_STMT (plus_oprnd0);
>> +
>> +  /* It could not be the sad pattern if the abs_stmt is outside the loop.  */
>> +  if (!gimple_bb (abs_stmt) || !flow_bb_inside_loop_p (loop, gimple_bb (abs_stmt)))
>> +    return NULL;
>> +
>> +  /* FORNOW.  Can continue analyzing the def-use chain when this stmt in a phi
>> +     inside the loop (in case we are analyzing an outer-loop).  */
>> +  if (!is_gimple_assign (abs_stmt))
>> +    return NULL;
>> +
>> +  stmt_vec_info abs_stmt_vinfo = vinfo_for_stmt (abs_stmt);
>> +  gcc_assert (abs_stmt_vinfo);
>> +  if (STMT_VINFO_DEF_TYPE (abs_stmt_vinfo) != vect_internal_def)
>> +    return NULL;
>> +  if (gimple_assign_rhs_code (abs_stmt) != ABS_EXPR)
>> +    return NULL;
>> +
>> +  tree abs_oprnd = gimple_assign_rhs1 (abs_stmt);
>> +  if (!types_compatible_p (TREE_TYPE (abs_oprnd), abs_type))
>> +    return NULL;
>> +  if (TYPE_UNSIGNED (abs_type))
>> +    return NULL;
>> +
>> +  /* We then detect if the operand of abs_expr is defined by a minus_expr.  */
>> +
>> +  if (TREE_CODE (abs_oprnd) != SSA_NAME)
>> +    return NULL;
>> +
>> +  gimple diff_stmt = SSA_NAME_DEF_STMT (abs_oprnd);
>> +
>> +  /* It could not be the sad pattern if the diff_stmt is outside the loop.  */
>> +  if (!gimple_bb (diff_stmt)
>> +      || !flow_bb_inside_loop_p (loop, gimple_bb (diff_stmt)))
>> +    return NULL;
>> +
>> +  /* FORNOW.  Can continue analyzing the def-use chain when this stmt in a phi
>> +     inside the loop (in case we are analyzing an outer-loop).  */
>> +  if (!is_gimple_assign (diff_stmt))
>> +    return NULL;
>> +
>> +  stmt_vec_info diff_stmt_vinfo = vinfo_for_stmt (diff_stmt);
>> +  gcc_assert (diff_stmt_vinfo);
>> +  if (STMT_VINFO_DEF_TYPE (diff_stmt_vinfo) != vect_internal_def)
>> +    return NULL;
>> +  if (gimple_assign_rhs_code (diff_stmt) != MINUS_EXPR)
>> +    return NULL;
>> +
>> +  tree half_type0, half_type1;
>> +  gimple def_stmt;
>> +
>> +  tree minus_oprnd0 = gimple_assign_rhs1 (diff_stmt);
>> +  tree minus_oprnd1 = gimple_assign_rhs2 (diff_stmt);
>> +
>> +  if (!types_compatible_p (TREE_TYPE (minus_oprnd0), abs_type)
>> +      || !types_compatible_p (TREE_TYPE (minus_oprnd1), abs_type))
>> +    return NULL;
>> +  if (!type_conversion_p (minus_oprnd0, diff_stmt, false,
>> +                          &half_type0, &def_stmt, &promotion)
>> +      || !promotion)
>> +    return NULL;
>> +  sad_oprnd0 = gimple_assign_rhs1 (def_stmt);
>> +
>> +  if (!type_conversion_p (minus_oprnd1, diff_stmt, false,
>> +                          &half_type1, &def_stmt, &promotion)
>> +      || !promotion)
>> +    return NULL;
>> +  sad_oprnd1 = gimple_assign_rhs1 (def_stmt);
>> +
>> +  if (!types_compatible_p (half_type0, half_type1))
>> +    return NULL;
>> +  if (!TYPE_UNSIGNED (half_type0))
>> +    return NULL;
>> +  if (TYPE_PRECISION (abs_type) < TYPE_PRECISION (half_type0) * 2
>> +      || TYPE_PRECISION (sum_type) < TYPE_PRECISION (half_type0) * 2)
>> +    return NULL;
>> +
>> +  *type_in = TREE_TYPE (sad_oprnd0);
>> +  *type_out = sum_type;
>> +
>> +  /* Pattern detected. Create a stmt to be used to replace the pattern: */
>> +  tree var = vect_recog_temp_ssa_var (sum_type, NULL);
>> +  gimple pattern_stmt = gimple_build_assign_with_ops
>> +                          (SAD_EXPR, var, sad_oprnd0, sad_oprnd1, plus_oprnd1);
>> +
>> +  if (dump_enabled_p ())
>> +    {
>> +      dump_printf_loc (MSG_NOTE, vect_location,
>> +                       "vect_recog_sad_pattern: detected: ");
>> +      dump_gimple_stmt (MSG_NOTE, TDF_SLIM, pattern_stmt, 0);
>> +      dump_printf (MSG_NOTE, "\n");
>> +    }
>> +
>> +  /* We don't allow changing the order of the computation in the inner-loop
>> +     when doing outer-loop vectorization.  */
>> +  gcc_assert (!nested_in_vect_loop_p (loop, last_stmt));
>> +
>> +  return pattern_stmt;
>> +}
>> +
>> +
>>  /* Handle widening operation by a constant.  At the moment we support MULT_EXPR
>>     and LSHIFT_EXPR.
>>
>> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
>> index 8b7b345..0aac75b 100644
>> --- a/gcc/tree-vectorizer.h
>> +++ b/gcc/tree-vectorizer.h
>> @@ -1044,7 +1044,7 @@ extern void vect_slp_transform_bb (basic_block);
>>     Additional pattern recognition functions can (and will) be added
>>     in the future.  */
>>  typedef gimple (* vect_recog_func_ptr) (vec<gimple> *, tree *, tree *);
>> -#define NUM_PATTERNS 11
>> +#define NUM_PATTERNS 12
>>  void vect_pattern_recog (loop_vec_info, bb_vec_info);
>>
>>  /* In tree-vectorizer.c.  */
>> diff --git a/gcc/tree.def b/gcc/tree.def
>> index 88c850a..31a3b64 100644
>> --- a/gcc/tree.def
>> +++ b/gcc/tree.def
>> @@ -1146,6 +1146,15 @@ DEFTREECODE (REDUC_PLUS_EXPR, "reduc_plus_expr", tcc_unary, 1)
>>          arg3 = WIDEN_SUM_EXPR (tmp, arg3); */
>>  DEFTREECODE (DOT_PROD_EXPR, "dot_prod_expr", tcc_expression, 3)
>>
>> +/* Widening sad (sum of absolute differences).
>> +   The first two arguments are of type t1 which should be unsigned integer.
>> +   The third argument and the result are of type t2, such that t2 is at least
>> +   twice the size of t1. SAD_EXPR(arg1,arg2,arg3) is equivalent to:
>> + tmp1 = WIDEN_MINUS_EXPR (arg1, arg2);
>> + tmp2 = ABS_EXPR (tmp1);
>> + arg3 = PLUS_EXPR (tmp2, arg3); */
>> +DEFTREECODE (SAD_EXPR, "sad_expr", tcc_expression, 3)
>> +
>>  /* Widening summation.
>>     The first argument is of type t1.
>>     The second argument is of type t2, such that t2 is at least twice
>>
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE / SUSE Labs
> SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
> GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
  2013-10-29 23:05 Cong Hou
  2013-10-30  0:09 ` Ramana Radhakrishnan
@ 2013-10-30 12:16 ` Richard Biener
  2013-10-31  0:50   ` Cong Hou
  1 sibling, 1 reply; 27+ messages in thread
From: Richard Biener @ 2013-10-30 12:16 UTC (permalink / raw)
  To: Cong Hou; +Cc: GCC Patches

On Tue, 29 Oct 2013, Cong Hou wrote:

> Hi
> 
> SAD (Sum of Absolute Differences) is a common and important algorithm
> in image processing and other areas. SSE2 even introduced a new
> instruction PSADBW for it. A SAD loop can be greatly accelerated by
> this instruction after being vectorized. This patch introduced a new
> operation SAD_EXPR and a SAD pattern recognizer in vectorizer.
> 
> The pattern of SAD is shown below:
> 
>      unsigned type x_t, y_t;
>      signed TYPE1 diff, abs_diff;
>      TYPE2 sum = init;
>    loop:
>      sum_0 = phi <init, sum_1>
>      S1  x_t = ...
>      S2  y_t = ...
>      S3  x_T = (TYPE1) x_t;
>      S4  y_T = (TYPE1) y_t;
>      S5  diff = x_T - y_T;
>      S6  abs_diff = ABS_EXPR <diff>;
>      [S7  abs_diff = (TYPE2) abs_diff;  #optional]
>      S8  sum_1 = abs_diff + sum_0;
> 
>    where 'TYPE1' is at least double the size of type 'type', and 'TYPE2' is the
>    same size of 'TYPE1' or bigger. This is a special case of a reduction
>    computation.
> 
> For SSE2, type is char, and TYPE1 and TYPE2 are int.
> 
> 
> In order to express this new operation, a new expression SAD_EXPR is
> introduced in tree.def, and the corresponding entry in optabs is
> added. The patch also added the "define_expand" for SSE2 and AVX2
> platforms for i386.
> 
> The patch is pasted below and also attached as a text file (in which
> you can see tabs). Bootstrap and make check got passed on x86. Please
> give me your comments.

Apart from the testcase comment made earlier

+++ b/gcc/tree-cfg.c
@@ -3797,6 +3797,7 @@ verify_gimple_assign_ternary (gimple stmt)
       return false;

     case DOT_PROD_EXPR:
+    case SAD_EXPR:
     case REALIGN_LOAD_EXPR:
       /* FIXME.  */
       return false;

please add proper verification of the operand types.
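For reference, the constraints such a verifier would need to enforce (taken from the SAD_EXPR description later in the patch) can be modeled standalone. This is only a hedged sketch of the rules — the real tree-cfg.c check would compare the gimple operand types directly — and the function name is made up for illustration:

```c
#include <stdbool.h>

/* Model of the SAD_EXPR operand-type rules: operands 1 and 2 share an
   unsigned type t1; operand 3 and the result share a type t2 that is
   at least twice as wide as t1.  */
static bool
sad_expr_types_ok (unsigned prec1, bool unsigned1,
                   unsigned prec2_in, unsigned prec2_out)
{
  if (!unsigned1)
    return false;                 /* t1 must be unsigned.  */
  if (prec2_in != prec2_out)
    return false;                 /* arg3 and the result share t2.  */
  return prec2_in >= 2 * prec1;   /* t2 at least twice the size of t1.  */
}
```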

+/* Widening sad (sum of absolute differences).
+   The first two arguments are of type t1 which should be unsigned 
integer.
+   The third argument and the result are of type t2, such that t2 is at 
least
+   twice the size of t1. SAD_EXPR(arg1,arg2,arg3) is equivalent to:
+       tmp1 = WIDEN_MINUS_EXPR (arg1, arg2);
+       tmp2 = ABS_EXPR (tmp1);
+       arg3 = PLUS_EXPR (tmp2, arg3);           */
+DEFTREECODE (SAD_EXPR, "sad_expr", tcc_expression, 3)

WIDEN_MINUS_EXPR doesn't exist, so you have to explain its
operation (it returns a signed wide difference?).  Why should
the first two arguments be unsigned?  I cannot see a good reason
to require that (other than that maybe the x86 target only has
support for widened unsigned difference?).  So if you want to
make that restriction maybe change the name to SADU_EXPR
(sum of absolute differences of unsigned)?
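In scalar terms, the semantics under discussion can be sketched as plain C (a model only, with the tree codes mapped onto C arithmetic); it also shows why the inputs can be unsigned while the widened difference has to be signed:

```c
/* Scalar model of SAD_EXPR (arg1, arg2, arg3): widen, subtract, take
   the absolute value, accumulate.  The difference of two unsigned
   chars must be computed in a wider *signed* type; computing it in
   the narrow unsigned type would wrap around and give a wrong
   absolute difference.  */
static int
sad_expr_scalar (unsigned char arg1, unsigned char arg2, int arg3)
{
  int tmp1 = (int) arg1 - (int) arg2;   /* WIDEN_MINUS_EXPR  */
  int tmp2 = tmp1 < 0 ? -tmp1 : tmp1;   /* ABS_EXPR          */
  return tmp2 + arg3;                   /* PLUS_EXPR         */
}
```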

I suppose you tried introducing WIDEN_MINUS_EXPR instead and
letting combine do its work, avoiding the very special optab?

Thanks,
Richard.

> 
> 
> thanks,
> Cong
> 
> 
> 
> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
> index 8a38316..d528307 100644
> --- a/gcc/ChangeLog
> +++ b/gcc/ChangeLog
> @@ -1,3 +1,23 @@
> +2013-10-29  Cong Hou  <congh@google.com>
> +
> + * tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD
> + pattern recognition.
> + (type_conversion_p): PROMOTION is true if it's a type promotion
> + conversion, and false otherwise.  Return true if the given expression
> + is a type conversion one.
> + * tree-vectorizer.h: Adjust the number of patterns.
> + * tree.def: Add SAD_EXPR.
> + * optabs.def: Add sad_optab.
> + * cfgexpand.c (expand_debug_expr): Add SAD_EXPR case.
> + * expr.c (expand_expr_real_2): Likewise.
> + * gimple-pretty-print.c (dump_ternary_rhs): Likewise.
> + * gimple.c (get_gimple_rhs_num_ops): Likewise.
> + * optabs.c (optab_for_tree_code): Likewise.
> + * tree-cfg.c (estimate_operator_cost): Likewise.
> + * tree-ssa-operands.c (get_expr_operands): Likewise.
> + * tree-vect-loop.c (get_initial_def_for_reduction): Likewise.
> + * config/i386/sse.md: Add SSE2 and AVX2 expand for SAD.
> +
>  2013-10-14  David Malcolm  <dmalcolm@redhat.com>
> 
>   * dumpfile.h (gcc::dump_manager): New class, to hold state
> diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
> index 7ed29f5..9ec761a 100644
> --- a/gcc/cfgexpand.c
> +++ b/gcc/cfgexpand.c
> @@ -2730,6 +2730,7 @@ expand_debug_expr (tree exp)
>   {
>   case COND_EXPR:
>   case DOT_PROD_EXPR:
> + case SAD_EXPR:
>   case WIDEN_MULT_PLUS_EXPR:
>   case WIDEN_MULT_MINUS_EXPR:
>   case FMA_EXPR:
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index c3f6c94..ca1ab70 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -6052,6 +6052,40 @@
>    DONE;
>  })
> 
> +(define_expand "sadv16qi"
> +  [(match_operand:V4SI 0 "register_operand")
> +   (match_operand:V16QI 1 "register_operand")
> +   (match_operand:V16QI 2 "register_operand")
> +   (match_operand:V4SI 3 "register_operand")]
> +  "TARGET_SSE2"
> +{
> +  rtx t1 = gen_reg_rtx (V2DImode);
> +  rtx t2 = gen_reg_rtx (V4SImode);
> +  emit_insn (gen_sse2_psadbw (t1, operands[1], operands[2]));
> +  convert_move (t2, t1, 0);
> +  emit_insn (gen_rtx_SET (VOIDmode, operands[0],
> +  gen_rtx_PLUS (V4SImode,
> + operands[3], t2)));
> +  DONE;
> +})
> +
> +(define_expand "sadv32qi"
> +  [(match_operand:V8SI 0 "register_operand")
> +   (match_operand:V32QI 1 "register_operand")
> +   (match_operand:V32QI 2 "register_operand")
> +   (match_operand:V8SI 3 "register_operand")]
> +  "TARGET_AVX2"
> +{
> +  rtx t1 = gen_reg_rtx (V4DImode);
> +  rtx t2 = gen_reg_rtx (V8SImode);
> +  emit_insn (gen_avx2_psadbw (t1, operands[1], operands[2]));
> +  convert_move (t2, t1, 0);
> +  emit_insn (gen_rtx_SET (VOIDmode, operands[0],
> +  gen_rtx_PLUS (V8SImode,
> + operands[3], t2)));
> +  DONE;
> +})
> +
>  (define_insn "ashr<mode>3"
>    [(set (match_operand:VI24_AVX2 0 "register_operand" "=x,x")
>   (ashiftrt:VI24_AVX2
> diff --git a/gcc/expr.c b/gcc/expr.c
> index 4975a64..1db8a49 100644
> --- a/gcc/expr.c
> +++ b/gcc/expr.c
> @@ -9026,6 +9026,20 @@ expand_expr_real_2 (sepops ops, rtx target, enum machine_mode tmode,
>   return target;
>        }
> 
> +      case SAD_EXPR:
> +      {
> + tree oprnd0 = treeop0;
> + tree oprnd1 = treeop1;
> + tree oprnd2 = treeop2;
> + rtx op2;
> +
> + expand_operands (oprnd0, oprnd1, NULL_RTX, &op0, &op1, EXPAND_NORMAL);
> + op2 = expand_normal (oprnd2);
> + target = expand_widen_pattern_expr (ops, op0, op1, op2,
> +    target, unsignedp);
> + return target;
> +      }
> +
>      case REALIGN_LOAD_EXPR:
>        {
>          tree oprnd0 = treeop0;
> diff --git a/gcc/gimple-pretty-print.c b/gcc/gimple-pretty-print.c
> index f0f8166..514ddd1 100644
> --- a/gcc/gimple-pretty-print.c
> +++ b/gcc/gimple-pretty-print.c
> @@ -425,6 +425,16 @@ dump_ternary_rhs (pretty_printer *buffer, gimple gs, int spc, int flags)
>        dump_generic_node (buffer, gimple_assign_rhs3 (gs), spc, flags, false);
>        pp_greater (buffer);
>        break;
> +
> +    case SAD_EXPR:
> +      pp_string (buffer, "SAD_EXPR <");
> +      dump_generic_node (buffer, gimple_assign_rhs1 (gs), spc, flags, false);
> +      pp_string (buffer, ", ");
> +      dump_generic_node (buffer, gimple_assign_rhs2 (gs), spc, flags, false);
> +      pp_string (buffer, ", ");
> +      dump_generic_node (buffer, gimple_assign_rhs3 (gs), spc, flags, false);
> +      pp_greater (buffer);
> +      break;
> 
>      case VEC_PERM_EXPR:
>        pp_string (buffer, "VEC_PERM_EXPR <");
> diff --git a/gcc/gimple.c b/gcc/gimple.c
> index a12dd67..4975959 100644
> --- a/gcc/gimple.c
> +++ b/gcc/gimple.c
> @@ -2562,6 +2562,7 @@ get_gimple_rhs_num_ops (enum tree_code code)
>        || (SYM) == WIDEN_MULT_PLUS_EXPR    \
>        || (SYM) == WIDEN_MULT_MINUS_EXPR    \
>        || (SYM) == DOT_PROD_EXPR    \
> +      || (SYM) == SAD_EXPR    \
>        || (SYM) == REALIGN_LOAD_EXPR    \
>        || (SYM) == VEC_COND_EXPR    \
>        || (SYM) == VEC_PERM_EXPR                                             \
> diff --git a/gcc/optabs.c b/gcc/optabs.c
> index 06a626c..4ddd4d9 100644
> --- a/gcc/optabs.c
> +++ b/gcc/optabs.c
> @@ -462,6 +462,9 @@ optab_for_tree_code (enum tree_code code, const_tree type,
>      case DOT_PROD_EXPR:
>        return TYPE_UNSIGNED (type) ? udot_prod_optab : sdot_prod_optab;
> 
> +    case SAD_EXPR:
> +      return sad_optab;
> +
>      case WIDEN_MULT_PLUS_EXPR:
>        return (TYPE_UNSIGNED (type)
>        ? (TYPE_SATURATING (type)
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index 6b924ac..e35d567 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -248,6 +248,7 @@ OPTAB_D (sdot_prod_optab, "sdot_prod$I$a")
>  OPTAB_D (ssum_widen_optab, "widen_ssum$I$a3")
>  OPTAB_D (udot_prod_optab, "udot_prod$I$a")
>  OPTAB_D (usum_widen_optab, "widen_usum$I$a3")
> +OPTAB_D (sad_optab, "sad$I$a")
>  OPTAB_D (vec_extract_optab, "vec_extract$a")
>  OPTAB_D (vec_init_optab, "vec_init$a")
>  OPTAB_D (vec_pack_sfix_trunc_optab, "vec_pack_sfix_trunc_$a")
> diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
> index 075d071..226b8d5 100644
> --- a/gcc/testsuite/ChangeLog
> +++ b/gcc/testsuite/ChangeLog
> @@ -1,3 +1,7 @@
> +2013-10-29  Cong Hou  <congh@google.com>
> +
> + * gcc.dg/vect/vect-reduc-sad.c: New.
> +
>  2013-10-14  Tobias Burnus  <burnus@net-b.de>
> 
>   PR fortran/58658
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c
> new file mode 100644
> index 0000000..14ebb3b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c
> @@ -0,0 +1,54 @@
> +/* { dg-require-effective-target sse2 { target { i?86-*-* x86_64-*-* } } } */
> +
> +#include <stdarg.h>
> +#include "tree-vect.h"
> +
> +#define N 64
> +#define SAD N*N/2
> +
> +unsigned char X[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
> +unsigned char Y[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
> +
> +/* Sum of absolute differences between arrays of unsigned char types.
> +   Detected as a sad pattern.
> +   Vectorized on targets that support sad for unsigned chars.  */
> +
> +__attribute__ ((noinline)) int
> +foo (int len)
> +{
> +  int i;
> +  int result = 0;
> +
> +  for (i = 0; i < len; i++)
> +    result += abs (X[i] - Y[i]);
> +
> +  return result;
> +}
> +
> +
> +int
> +main (void)
> +{
> +  int i;
> +  int sad;
> +
> +  check_vect ();
> +
> +  for (i = 0; i < N; i++)
> +    {
> +      X[i] = i;
> +      Y[i] = N - i;
> +      __asm__ volatile ("");
> +    }
> +
> +  sad = foo (N);
> +  if (sad != SAD)
> +    abort ();
> +
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "vect_recog_sad_pattern: detected" 1 "vect" } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
> +/* { dg-final { cleanup-tree-dump "vect" } } */
> +
> diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c
> index 8b66791..d689cac 100644
> --- a/gcc/tree-cfg.c
> +++ b/gcc/tree-cfg.c
> @@ -3797,6 +3797,7 @@ verify_gimple_assign_ternary (gimple stmt)
>        return false;
> 
>      case DOT_PROD_EXPR:
> +    case SAD_EXPR:
>      case REALIGN_LOAD_EXPR:
>        /* FIXME.  */
>        return false;
> diff --git a/gcc/tree-inline.c b/gcc/tree-inline.c
> index 2221b9c..44261a3 100644
> --- a/gcc/tree-inline.c
> +++ b/gcc/tree-inline.c
> @@ -3601,6 +3601,7 @@ estimate_operator_cost (enum tree_code code, eni_weights *weights,
>      case WIDEN_SUM_EXPR:
>      case WIDEN_MULT_EXPR:
>      case DOT_PROD_EXPR:
> +    case SAD_EXPR:
>      case WIDEN_MULT_PLUS_EXPR:
>      case WIDEN_MULT_MINUS_EXPR:
>      case WIDEN_LSHIFT_EXPR:
> diff --git a/gcc/tree-ssa-operands.c b/gcc/tree-ssa-operands.c
> index 603f797..393efc3 100644
> --- a/gcc/tree-ssa-operands.c
> +++ b/gcc/tree-ssa-operands.c
> @@ -854,6 +854,7 @@ get_expr_operands (gimple stmt, tree *expr_p, int flags)
>        }
> 
>      case DOT_PROD_EXPR:
> +    case SAD_EXPR:
>      case REALIGN_LOAD_EXPR:
>      case WIDEN_MULT_PLUS_EXPR:
>      case WIDEN_MULT_MINUS_EXPR:
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index 638b981..89aa8c7 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -3607,6 +3607,7 @@ get_initial_def_for_reduction (gimple stmt, tree init_val,
>      {
>        case WIDEN_SUM_EXPR:
>        case DOT_PROD_EXPR:
> +      case SAD_EXPR:
>        case PLUS_EXPR:
>        case MINUS_EXPR:
>        case BIT_IOR_EXPR:
> diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
> index 0a4e812..7919449 100644
> --- a/gcc/tree-vect-patterns.c
> +++ b/gcc/tree-vect-patterns.c
> @@ -45,6 +45,8 @@ static gimple vect_recog_widen_mult_pattern (vec<gimple> *, tree *,
>       tree *);
>  static gimple vect_recog_dot_prod_pattern (vec<gimple> *, tree *,
>     tree *);
> +static gimple vect_recog_sad_pattern (vec<gimple> *, tree *,
> +      tree *);
>  static gimple vect_recog_pow_pattern (vec<gimple> *, tree *, tree *);
>  static gimple vect_recog_over_widening_pattern (vec<gimple> *, tree *,
>                                                   tree *);
> @@ -62,6 +64,7 @@ static vect_recog_func_ptr vect_vect_recog_func_ptrs[NUM_PATTERNS] = {
>   vect_recog_widen_mult_pattern,
>   vect_recog_widen_sum_pattern,
>   vect_recog_dot_prod_pattern,
> +        vect_recog_sad_pattern,
>   vect_recog_pow_pattern,
>   vect_recog_widen_shift_pattern,
>   vect_recog_over_widening_pattern,
> @@ -140,9 +143,8 @@ vect_single_imm_use (gimple def_stmt)
>  }
> 
>  /* Check whether NAME, an ssa-name used in USE_STMT,
> -   is a result of a type promotion or demotion, such that:
> +   is a result of a type promotion, such that:
>       DEF_STMT: NAME = NOP (name0)
> -   where the type of name0 (ORIG_TYPE) is smaller/bigger than the type of NAME.
>     If CHECK_SIGN is TRUE, check that either both types are signed or both are
>     unsigned.  */
> 
> @@ -189,10 +191,8 @@ type_conversion_p (tree name, gimple use_stmt, bool check_sign,
> 
>    if (TYPE_PRECISION (type) >= (TYPE_PRECISION (*orig_type) * 2))
>      *promotion = true;
> -  else if (TYPE_PRECISION (*orig_type) >= (TYPE_PRECISION (type) * 2))
> -    *promotion = false;
>    else
> -    return false;
> +    *promotion = false;
> 
>    if (!vect_is_simple_use (oprnd0, *def_stmt, loop_vinfo,
>     bb_vinfo, &dummy_gimple, &dummy, &dt))
> @@ -433,6 +433,242 @@ vect_recog_dot_prod_pattern (vec<gimple> *stmts, tree *type_in,
>  }
> 
> 
> +/* Function vect_recog_sad_pattern
> +
> +   Try to find the following Sum of Absolute Difference (SAD) pattern:
> +
> +     unsigned type x_t, y_t;
> +     signed TYPE1 diff, abs_diff;
> +     TYPE2 sum = init;
> +   loop:
> +     sum_0 = phi <init, sum_1>
> +     S1  x_t = ...
> +     S2  y_t = ...
> +     S3  x_T = (TYPE1) x_t;
> +     S4  y_T = (TYPE1) y_t;
> +     S5  diff = x_T - y_T;
> +     S6  abs_diff = ABS_EXPR <diff>;
> +     [S7  abs_diff = (TYPE2) abs_diff;  #optional]
> +     S8  sum_1 = abs_diff + sum_0;
> +
> +   where 'TYPE1' is at least double the size of type 'type', and 'TYPE2' is the
> +   same size of 'TYPE1' or bigger. This is a special case of a reduction
> +   computation.
> +
> +   Input:
> +
> +   * STMTS: Contains a stmt from which the pattern search begins.  In the
> +   example, when this function is called with S8, the pattern
> +   {S3,S4,S5,S6,S7,S8} will be detected.
> +
> +   Output:
> +
> +   * TYPE_IN: The type of the input arguments to the pattern.
> +
> +   * TYPE_OUT: The type of the output of this pattern.
> +
> +   * Return value: A new stmt that will be used to replace the sequence of
> +   stmts that constitute the pattern. In this case it will be:
> +        SAD_EXPR <x_t, y_t, sum_0>
> +  */
> +
> +static gimple
> +vect_recog_sad_pattern (vec<gimple> *stmts, tree *type_in,
> +     tree *type_out)
> +{
> +  gimple last_stmt = (*stmts)[0];
> +  tree sad_oprnd0, sad_oprnd1;
> +  stmt_vec_info stmt_vinfo = vinfo_for_stmt (last_stmt);
> +  tree half_type;
> +  loop_vec_info loop_info = STMT_VINFO_LOOP_VINFO (stmt_vinfo);
> +  struct loop *loop;
> +  bool promotion;
> +
> +  if (!loop_info)
> +    return NULL;
> +
> +  loop = LOOP_VINFO_LOOP (loop_info);
> +
> +  if (!is_gimple_assign (last_stmt))
> +    return NULL;
> +
> +  tree sum_type = gimple_expr_type (last_stmt);
> +
> +  /* Look for the following pattern
> +          DX = (TYPE1) X;
> +          DY = (TYPE1) Y;
> +          DDIFF = DX - DY;
> +          DAD = ABS_EXPR <DDIFF>;
> +          DDPROD = (TYPE2) DPROD;
> +          sum_1 = DAD + sum_0;
> +     In which
> +     - DX is at least double the size of X
> +     - DY is at least double the size of Y
> +     - DX, DY, DDIFF, DAD all have the same type
> +     - sum is the same size of DAD or bigger
> +     - sum has been recognized as a reduction variable.
> +
> +     This is equivalent to:
> +       DDIFF = X w- Y;          #widen sub
> +       DAD = ABS_EXPR <DDIFF>;
> +       sum_1 = DAD w+ sum_0;    #widen summation
> +     or
> +       DDIFF = X w- Y;          #widen sub
> +       DAD = ABS_EXPR <DDIFF>;
> +       sum_1 = DAD + sum_0;     #summation
> +   */
> +
> +  /* Starting from LAST_STMT, follow the defs of its uses in search
> +     of the above pattern.  */
> +
> +  if (gimple_assign_rhs_code (last_stmt) != PLUS_EXPR)
> +    return NULL;
> +
> +  tree plus_oprnd0, plus_oprnd1;
> +
> +  if (STMT_VINFO_IN_PATTERN_P (stmt_vinfo))
> +    {
> +      /* Has been detected as widening-summation?  */
> +
> +      gimple stmt = STMT_VINFO_RELATED_STMT (stmt_vinfo);
> +      sum_type = gimple_expr_type (stmt);
> +      if (gimple_assign_rhs_code (stmt) != WIDEN_SUM_EXPR)
> +        return NULL;
> +      plus_oprnd0 = gimple_assign_rhs1 (stmt);
> +      plus_oprnd1 = gimple_assign_rhs2 (stmt);
> +      half_type = TREE_TYPE (plus_oprnd0);
> +    }
> +  else
> +    {
> +      gimple def_stmt;
> +
> +      if (STMT_VINFO_DEF_TYPE (stmt_vinfo) != vect_reduction_def)
> +        return NULL;
> +      plus_oprnd0 = gimple_assign_rhs1 (last_stmt);
> +      plus_oprnd1 = gimple_assign_rhs2 (last_stmt);
> +      if (!types_compatible_p (TREE_TYPE (plus_oprnd0), sum_type)
> +  || !types_compatible_p (TREE_TYPE (plus_oprnd1), sum_type))
> +        return NULL;
> +
> +      /* The type conversion could be promotion, demotion,
> +         or just signed -> unsigned.  */
> +      if (type_conversion_p (plus_oprnd0, last_stmt, false,
> +                             &half_type, &def_stmt, &promotion))
> +        plus_oprnd0 = gimple_assign_rhs1 (def_stmt);
> +      else
> +        half_type = sum_type;
> +    }
> +
> +  /* So far so good.  Since last_stmt was detected as a (summation) reduction,
> +     we know that plus_oprnd1 is the reduction variable (defined by a loop-header
> +     phi), and plus_oprnd0 is an ssa-name defined by a stmt in the loop body.
> +     Then check that plus_oprnd0 is defined by an abs_expr  */
> +
> +  if (TREE_CODE (plus_oprnd0) != SSA_NAME)
> +    return NULL;
> +
> +  tree abs_type = half_type;
> +  gimple abs_stmt = SSA_NAME_DEF_STMT (plus_oprnd0);
> +
> +  /* It could not be the sad pattern if the abs_stmt is outside the loop.  */
> +  if (!gimple_bb (abs_stmt) || !flow_bb_inside_loop_p (loop, gimple_bb (abs_stmt)))
> +    return NULL;
> +
> +  /* FORNOW.  Can continue analyzing the def-use chain when this stmt in a phi
> +     inside the loop (in case we are analyzing an outer-loop).  */
> +  if (!is_gimple_assign (abs_stmt))
> +    return NULL;
> +
> +  stmt_vec_info abs_stmt_vinfo = vinfo_for_stmt (abs_stmt);
> +  gcc_assert (abs_stmt_vinfo);
> +  if (STMT_VINFO_DEF_TYPE (abs_stmt_vinfo) != vect_internal_def)
> +    return NULL;
> +  if (gimple_assign_rhs_code (abs_stmt) != ABS_EXPR)
> +    return NULL;
> +
> +  tree abs_oprnd = gimple_assign_rhs1 (abs_stmt);
> +  if (!types_compatible_p (TREE_TYPE (abs_oprnd), abs_type))
> +    return NULL;
> +  if (TYPE_UNSIGNED (abs_type))
> +    return NULL;
> +
> +  /* We then detect if the operand of abs_expr is defined by a minus_expr.  */
> +
> +  if (TREE_CODE (abs_oprnd) != SSA_NAME)
> +    return NULL;
> +
> +  gimple diff_stmt = SSA_NAME_DEF_STMT (abs_oprnd);
> +
> +  /* It could not be the sad pattern if the diff_stmt is outside the loop.  */
> +  if (!gimple_bb (diff_stmt)
> +      || !flow_bb_inside_loop_p (loop, gimple_bb (diff_stmt)))
> +    return NULL;
> +
> +  /* FORNOW.  Can continue analyzing the def-use chain when this stmt in a phi
> +     inside the loop (in case we are analyzing an outer-loop).  */
> +  if (!is_gimple_assign (diff_stmt))
> +    return NULL;
> +
> +  stmt_vec_info diff_stmt_vinfo = vinfo_for_stmt (diff_stmt);
> +  gcc_assert (diff_stmt_vinfo);
> +  if (STMT_VINFO_DEF_TYPE (diff_stmt_vinfo) != vect_internal_def)
> +    return NULL;
> +  if (gimple_assign_rhs_code (diff_stmt) != MINUS_EXPR)
> +    return NULL;
> +
> +  tree half_type0, half_type1;
> +  gimple def_stmt;
> +
> +  tree minus_oprnd0 = gimple_assign_rhs1 (diff_stmt);
> +  tree minus_oprnd1 = gimple_assign_rhs2 (diff_stmt);
> +
> +  if (!types_compatible_p (TREE_TYPE (minus_oprnd0), abs_type)
> +      || !types_compatible_p (TREE_TYPE (minus_oprnd1), abs_type))
> +    return NULL;
> +  if (!type_conversion_p (minus_oprnd0, diff_stmt, false,
> +                          &half_type0, &def_stmt, &promotion)
> +      || !promotion)
> +    return NULL;
> +  sad_oprnd0 = gimple_assign_rhs1 (def_stmt);
> +
> +  if (!type_conversion_p (minus_oprnd1, diff_stmt, false,
> +                          &half_type1, &def_stmt, &promotion)
> +      || !promotion)
> +    return NULL;
> +  sad_oprnd1 = gimple_assign_rhs1 (def_stmt);
> +
> +  if (!types_compatible_p (half_type0, half_type1))
> +    return NULL;
> +  if (!TYPE_UNSIGNED (half_type0))
> +    return NULL;
> +  if (TYPE_PRECISION (abs_type) < TYPE_PRECISION (half_type0) * 2
> +      || TYPE_PRECISION (sum_type) < TYPE_PRECISION (half_type0) * 2)
> +    return NULL;
> +
> +  *type_in = TREE_TYPE (sad_oprnd0);
> +  *type_out = sum_type;
> +
> +  /* Pattern detected. Create a stmt to be used to replace the pattern: */
> +  tree var = vect_recog_temp_ssa_var (sum_type, NULL);
> +  gimple pattern_stmt = gimple_build_assign_with_ops
> +                          (SAD_EXPR, var, sad_oprnd0, sad_oprnd1, plus_oprnd1);
> +
> +  if (dump_enabled_p ())
> +    {
> +      dump_printf_loc (MSG_NOTE, vect_location,
> +                       "vect_recog_sad_pattern: detected: ");
> +      dump_gimple_stmt (MSG_NOTE, TDF_SLIM, pattern_stmt, 0);
> +      dump_printf (MSG_NOTE, "\n");
> +    }
> +
> +  /* We don't allow changing the order of the computation in the inner-loop
> +     when doing outer-loop vectorization.  */
> +  gcc_assert (!nested_in_vect_loop_p (loop, last_stmt));
> +
> +  return pattern_stmt;
> +}
> +
> +
>  /* Handle widening operation by a constant.  At the moment we support MULT_EXPR
>     and LSHIFT_EXPR.
> 
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index 8b7b345..0aac75b 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -1044,7 +1044,7 @@ extern void vect_slp_transform_bb (basic_block);
>     Additional pattern recognition functions can (and will) be added
>     in the future.  */
>  typedef gimple (* vect_recog_func_ptr) (vec<gimple> *, tree *, tree *);
> -#define NUM_PATTERNS 11
> +#define NUM_PATTERNS 12
>  void vect_pattern_recog (loop_vec_info, bb_vec_info);
> 
>  /* In tree-vectorizer.c.  */
> diff --git a/gcc/tree.def b/gcc/tree.def
> index 88c850a..31a3b64 100644
> --- a/gcc/tree.def
> +++ b/gcc/tree.def
> @@ -1146,6 +1146,15 @@ DEFTREECODE (REDUC_PLUS_EXPR, "reduc_plus_expr", tcc_unary, 1)
>          arg3 = WIDEN_SUM_EXPR (tmp, arg3); */
>  DEFTREECODE (DOT_PROD_EXPR, "dot_prod_expr", tcc_expression, 3)
> 
> +/* Widening sad (sum of absolute differences).
> +   The first two arguments are of type t1 which should be unsigned integer.
> +   The third argument and the result are of type t2, such that t2 is at least
> +   twice the size of t1. SAD_EXPR(arg1,arg2,arg3) is equivalent to:
> + tmp1 = WIDEN_MINUS_EXPR (arg1, arg2);
> + tmp2 = ABS_EXPR (tmp1);
> + arg3 = PLUS_EXPR (tmp2, arg3); */
> +DEFTREECODE (SAD_EXPR, "sad_expr", tcc_expression, 3)
> +
>  /* Widening summation.
>     The first argument is of type t1.
>     The second argument is of type t2, such that t2 is at least twice
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE / SUSE Labs
SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
  2013-10-29 23:05 Cong Hou
@ 2013-10-30  0:09 ` Ramana Radhakrishnan
  2013-10-31  1:10   ` Cong Hou
  2013-10-30 12:16 ` Richard Biener
  1 sibling, 1 reply; 27+ messages in thread
From: Ramana Radhakrishnan @ 2013-10-30  0:09 UTC (permalink / raw)
  To: Cong Hou; +Cc: GCC Patches, Richard Biener

Cong,

Please don't do the following.

>+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c
@@ -0,0 +1,54 @@
+/* { dg-require-effective-target sse2 { target { i?86-*-* x86_64-*-* } } } */

you are adding a test to gcc.dg/vect - that is a common directory
containing tests that need to run on multiple architectures, so such
tests should be keyed by the feature they exercise, which can then be
turned on for ports that have such an instruction.

The correct way of doing this is to key the test on the feature, with
something like dg-require-effective-target vect_sad_char, and to
define the equivalent routine in testsuite/lib/target-supports.exp and
enable it for sse2 on the x86 port. If in doubt, look at
check_effective_target_vect_int and the whole family of such functions
in testsuite/lib/target-supports.exp.
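Concretely, the testcase header would then look something like the
fragment below (vect_sad_char is only a suggested keyword here, not an
existing effective-target name):

```c
/* Hypothetical header for vect-reduc-sad.c, keyed on the feature the
   test exercises rather than on an x86 target triplet.  */
/* { dg-require-effective-target vect_sad_char } */
```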

This makes life easy for other port maintainers who want to turn on
this support. And for bonus points, please update the testcase-writing
wiki page with this information if it isn't already there.

You are also missing documentation updates for SAD_EXPR and an
md.texi entry for the new standard pattern name. Shouldn't it really
be called sad<mode>4?


regards
Ramana
On Tue, Oct 29, 2013 at 10:23 PM, Cong Hou <congh@google.com> wrote:
> Hi
>
> SAD (Sum of Absolute Differences) is a common and important algorithm
> in image processing and other areas. SSE2 even introduced a new
> instruction, PSADBW, for it. A SAD loop can be greatly accelerated by
> this instruction after being vectorized. This patch introduces a new
> operation, SAD_EXPR, and a SAD pattern recognizer in the vectorizer.
>
> The pattern of SAD is shown below:
>
>      unsigned type x_t, y_t;
>      signed TYPE1 diff, abs_diff;
>      TYPE2 sum = init;
>    loop:
>      sum_0 = phi <init, sum_1>
>      S1  x_t = ...
>      S2  y_t = ...
>      S3  x_T = (TYPE1) x_t;
>      S4  y_T = (TYPE1) y_t;
>      S5  diff = x_T - y_T;
>      S6  abs_diff = ABS_EXPR <diff>;
>      [S7  abs_diff = (TYPE2) abs_diff;  #optional]
>      S8  sum_1 = abs_diff + sum_0;
>
>    where 'TYPE1' is at least double the size of type 'type', and 'TYPE2' is the
>    same size as 'TYPE1' or bigger. This is a special case of a reduction
>    computation.
>
> For SSE2, type is char, and TYPE1 and TYPE2 are int.
>
>
> In order to express this new operation, a new expression SAD_EXPR is
> introduced in tree.def, and the corresponding entry in optabs is
> added. The patch also adds "define_expand" patterns for SSE2 and AVX2
> on i386.
>
> The patch is pasted below and also attached as a text file (in which
> you can see tabs). Bootstrap and make check passed on x86. Please
> give me your comments.
>
>
>
> thanks,
> Cong
>
>
>
> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
> index 8a38316..d528307 100644
> --- a/gcc/ChangeLog
> +++ b/gcc/ChangeLog
> @@ -1,3 +1,23 @@
> +2013-10-29  Cong Hou  <congh@google.com>
> +
> + * tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD
> + pattern recognition.
> + (type_conversion_p): PROMOTION is true if it's a type promotion
> + conversion, and false otherwise.  Return true if the given expression
> + is a type conversion one.
> + * tree-vectorizer.h: Adjust the number of patterns.
> + * tree.def: Add SAD_EXPR.
> + * optabs.def: Add sad_optab.
> + * cfgexpand.c (expand_debug_expr): Add SAD_EXPR case.
> + * expr.c (expand_expr_real_2): Likewise.
> + * gimple-pretty-print.c (dump_ternary_rhs): Likewise.
> + * gimple.c (get_gimple_rhs_num_ops): Likewise.
> + * optabs.c (optab_for_tree_code): Likewise.
> + * tree-cfg.c (estimate_operator_cost): Likewise.
> + * tree-ssa-operands.c (get_expr_operands): Likewise.
> + * tree-vect-loop.c (get_initial_def_for_reduction): Likewise.
> + * config/i386/sse.md: Add SSE2 and AVX2 expand for SAD.
> +
>  2013-10-14  David Malcolm  <dmalcolm@redhat.com>
>
>   * dumpfile.h (gcc::dump_manager): New class, to hold state
> diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
> index 7ed29f5..9ec761a 100644
> --- a/gcc/cfgexpand.c
> +++ b/gcc/cfgexpand.c
> @@ -2730,6 +2730,7 @@ expand_debug_expr (tree exp)
>   {
>   case COND_EXPR:
>   case DOT_PROD_EXPR:
> + case SAD_EXPR:
>   case WIDEN_MULT_PLUS_EXPR:
>   case WIDEN_MULT_MINUS_EXPR:
>   case FMA_EXPR:
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index c3f6c94..ca1ab70 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -6052,6 +6052,40 @@
>    DONE;
>  })
>
> +(define_expand "sadv16qi"
> +  [(match_operand:V4SI 0 "register_operand")
> +   (match_operand:V16QI 1 "register_operand")
> +   (match_operand:V16QI 2 "register_operand")
> +   (match_operand:V4SI 3 "register_operand")]
> +  "TARGET_SSE2"
> +{
> +  rtx t1 = gen_reg_rtx (V2DImode);
> +  rtx t2 = gen_reg_rtx (V4SImode);
> +  emit_insn (gen_sse2_psadbw (t1, operands[1], operands[2]));
> +  convert_move (t2, t1, 0);
> +  emit_insn (gen_rtx_SET (VOIDmode, operands[0],
> +  gen_rtx_PLUS (V4SImode,
> + operands[3], t2)));
> +  DONE;
> +})
> +
> +(define_expand "sadv32qi"
> +  [(match_operand:V8SI 0 "register_operand")
> +   (match_operand:V32QI 1 "register_operand")
> +   (match_operand:V32QI 2 "register_operand")
> +   (match_operand:V8SI 3 "register_operand")]
> +  "TARGET_AVX2"
> +{
> +  rtx t1 = gen_reg_rtx (V4DImode);
> +  rtx t2 = gen_reg_rtx (V8SImode);
> +  emit_insn (gen_avx2_psadbw (t1, operands[1], operands[2]));
> +  convert_move (t2, t1, 0);
> +  emit_insn (gen_rtx_SET (VOIDmode, operands[0],
> +  gen_rtx_PLUS (V8SImode,
> + operands[3], t2)));
> +  DONE;
> +})
> +
>  (define_insn "ashr<mode>3"
>    [(set (match_operand:VI24_AVX2 0 "register_operand" "=x,x")
>   (ashiftrt:VI24_AVX2
> diff --git a/gcc/expr.c b/gcc/expr.c
> index 4975a64..1db8a49 100644
> --- a/gcc/expr.c
> +++ b/gcc/expr.c
> @@ -9026,6 +9026,20 @@ expand_expr_real_2 (sepops ops, rtx target, enum machine_mode tmode,
>   return target;
>        }
>
> +      case SAD_EXPR:
> +      {
> + tree oprnd0 = treeop0;
> + tree oprnd1 = treeop1;
> + tree oprnd2 = treeop2;
> + rtx op2;
> +
> + expand_operands (oprnd0, oprnd1, NULL_RTX, &op0, &op1, EXPAND_NORMAL);
> + op2 = expand_normal (oprnd2);
> + target = expand_widen_pattern_expr (ops, op0, op1, op2,
> +    target, unsignedp);
> + return target;
> +      }
> +
>      case REALIGN_LOAD_EXPR:
>        {
>          tree oprnd0 = treeop0;
> diff --git a/gcc/gimple-pretty-print.c b/gcc/gimple-pretty-print.c
> index f0f8166..514ddd1 100644
> --- a/gcc/gimple-pretty-print.c
> +++ b/gcc/gimple-pretty-print.c
> @@ -425,6 +425,16 @@ dump_ternary_rhs (pretty_printer *buffer, gimple gs, int spc, int flags)
>        dump_generic_node (buffer, gimple_assign_rhs3 (gs), spc, flags, false);
>        pp_greater (buffer);
>        break;
> +
> +    case SAD_EXPR:
> +      pp_string (buffer, "SAD_EXPR <");
> +      dump_generic_node (buffer, gimple_assign_rhs1 (gs), spc, flags, false);
> +      pp_string (buffer, ", ");
> +      dump_generic_node (buffer, gimple_assign_rhs2 (gs), spc, flags, false);
> +      pp_string (buffer, ", ");
> +      dump_generic_node (buffer, gimple_assign_rhs3 (gs), spc, flags, false);
> +      pp_greater (buffer);
> +      break;
>
>      case VEC_PERM_EXPR:
>        pp_string (buffer, "VEC_PERM_EXPR <");
> diff --git a/gcc/gimple.c b/gcc/gimple.c
> index a12dd67..4975959 100644
> --- a/gcc/gimple.c
> +++ b/gcc/gimple.c
> @@ -2562,6 +2562,7 @@ get_gimple_rhs_num_ops (enum tree_code code)
>        || (SYM) == WIDEN_MULT_PLUS_EXPR    \
>        || (SYM) == WIDEN_MULT_MINUS_EXPR    \
>        || (SYM) == DOT_PROD_EXPR    \
> +      || (SYM) == SAD_EXPR    \
>        || (SYM) == REALIGN_LOAD_EXPR    \
>        || (SYM) == VEC_COND_EXPR    \
>        || (SYM) == VEC_PERM_EXPR                                             \
> diff --git a/gcc/optabs.c b/gcc/optabs.c
> index 06a626c..4ddd4d9 100644
> --- a/gcc/optabs.c
> +++ b/gcc/optabs.c
> @@ -462,6 +462,9 @@ optab_for_tree_code (enum tree_code code, const_tree type,
>      case DOT_PROD_EXPR:
>        return TYPE_UNSIGNED (type) ? udot_prod_optab : sdot_prod_optab;
>
> +    case SAD_EXPR:
> +      return sad_optab;
> +
>      case WIDEN_MULT_PLUS_EXPR:
>        return (TYPE_UNSIGNED (type)
>        ? (TYPE_SATURATING (type)
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index 6b924ac..e35d567 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -248,6 +248,7 @@ OPTAB_D (sdot_prod_optab, "sdot_prod$I$a")
>  OPTAB_D (ssum_widen_optab, "widen_ssum$I$a3")
>  OPTAB_D (udot_prod_optab, "udot_prod$I$a")
>  OPTAB_D (usum_widen_optab, "widen_usum$I$a3")
> +OPTAB_D (sad_optab, "sad$I$a")
>  OPTAB_D (vec_extract_optab, "vec_extract$a")
>  OPTAB_D (vec_init_optab, "vec_init$a")
>  OPTAB_D (vec_pack_sfix_trunc_optab, "vec_pack_sfix_trunc_$a")
> diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
> index 075d071..226b8d5 100644
> --- a/gcc/testsuite/ChangeLog
> +++ b/gcc/testsuite/ChangeLog
> @@ -1,3 +1,7 @@
> +2013-10-29  Cong Hou  <congh@google.com>
> +
> + * gcc.dg/vect/vect-reduc-sad.c: New.
> +
>  2013-10-14  Tobias Burnus  <burnus@net-b.de>
>
>   PR fortran/58658
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c
> new file mode 100644
> index 0000000..14ebb3b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c
> @@ -0,0 +1,54 @@
> +/* { dg-require-effective-target sse2 { target { i?86-*-* x86_64-*-* } } } */
> +
> +#include <stdarg.h>
> +#include "tree-vect.h"
> +
> +#define N 64
> +#define SAD N*N/2
> +
> +unsigned char X[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
> +unsigned char Y[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
> +
> +/* Sum of absolute differences between arrays of unsigned char types.
> +   Detected as a sad pattern.
> +   Vectorized on targets that support sad for unsigned chars.  */
> +
> +__attribute__ ((noinline)) int
> +foo (int len)
> +{
> +  int i;
> +  int result = 0;
> +
> +  for (i = 0; i < len; i++)
> +    result += abs (X[i] - Y[i]);
> +
> +  return result;
> +}
> +
> +
> +int
> +main (void)
> +{
> +  int i;
> +  int sad;
> +
> +  check_vect ();
> +
> +  for (i = 0; i < N; i++)
> +    {
> +      X[i] = i;
> +      Y[i] = N - i;
> +      __asm__ volatile ("");
> +    }
> +
> +  sad = foo (N);
> +  if (sad != SAD)
> +    abort ();
> +
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "vect_recog_sad_pattern: detected" 1 "vect" } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
> +/* { dg-final { cleanup-tree-dump "vect" } } */
> +
> diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c
> index 8b66791..d689cac 100644
> --- a/gcc/tree-cfg.c
> +++ b/gcc/tree-cfg.c
> @@ -3797,6 +3797,7 @@ verify_gimple_assign_ternary (gimple stmt)
>        return false;
>
>      case DOT_PROD_EXPR:
> +    case SAD_EXPR:
>      case REALIGN_LOAD_EXPR:
>        /* FIXME.  */
>        return false;
> diff --git a/gcc/tree-inline.c b/gcc/tree-inline.c
> index 2221b9c..44261a3 100644
> --- a/gcc/tree-inline.c
> +++ b/gcc/tree-inline.c
> @@ -3601,6 +3601,7 @@ estimate_operator_cost (enum tree_code code, eni_weights *weights,
>      case WIDEN_SUM_EXPR:
>      case WIDEN_MULT_EXPR:
>      case DOT_PROD_EXPR:
> +    case SAD_EXPR:
>      case WIDEN_MULT_PLUS_EXPR:
>      case WIDEN_MULT_MINUS_EXPR:
>      case WIDEN_LSHIFT_EXPR:
> diff --git a/gcc/tree-ssa-operands.c b/gcc/tree-ssa-operands.c
> index 603f797..393efc3 100644
> --- a/gcc/tree-ssa-operands.c
> +++ b/gcc/tree-ssa-operands.c
> @@ -854,6 +854,7 @@ get_expr_operands (gimple stmt, tree *expr_p, int flags)
>        }
>
>      case DOT_PROD_EXPR:
> +    case SAD_EXPR:
>      case REALIGN_LOAD_EXPR:
>      case WIDEN_MULT_PLUS_EXPR:
>      case WIDEN_MULT_MINUS_EXPR:
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index 638b981..89aa8c7 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -3607,6 +3607,7 @@ get_initial_def_for_reduction (gimple stmt, tree init_val,
>      {
>        case WIDEN_SUM_EXPR:
>        case DOT_PROD_EXPR:
> +      case SAD_EXPR:
>        case PLUS_EXPR:
>        case MINUS_EXPR:
>        case BIT_IOR_EXPR:
> diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
> index 0a4e812..7919449 100644
> --- a/gcc/tree-vect-patterns.c
> +++ b/gcc/tree-vect-patterns.c
> @@ -45,6 +45,8 @@ static gimple vect_recog_widen_mult_pattern (vec<gimple> *, tree *,
>       tree *);
>  static gimple vect_recog_dot_prod_pattern (vec<gimple> *, tree *,
>     tree *);
> +static gimple vect_recog_sad_pattern (vec<gimple> *, tree *,
> +      tree *);
>  static gimple vect_recog_pow_pattern (vec<gimple> *, tree *, tree *);
>  static gimple vect_recog_over_widening_pattern (vec<gimple> *, tree *,
>                                                   tree *);
> @@ -62,6 +64,7 @@ static vect_recog_func_ptr vect_vect_recog_func_ptrs[NUM_PATTERNS] = {
>   vect_recog_widen_mult_pattern,
>   vect_recog_widen_sum_pattern,
>   vect_recog_dot_prod_pattern,
> +        vect_recog_sad_pattern,
>   vect_recog_pow_pattern,
>   vect_recog_widen_shift_pattern,
>   vect_recog_over_widening_pattern,
> @@ -140,9 +143,8 @@ vect_single_imm_use (gimple def_stmt)
>  }
>
>  /* Check whether NAME, an ssa-name used in USE_STMT,
> -   is a result of a type promotion or demotion, such that:
> +   is a result of a type promotion, such that:
>       DEF_STMT: NAME = NOP (name0)
> -   where the type of name0 (ORIG_TYPE) is smaller/bigger than the type of NAME.
>     If CHECK_SIGN is TRUE, check that either both types are signed or both are
>     unsigned.  */
>
> @@ -189,10 +191,8 @@ type_conversion_p (tree name, gimple use_stmt,
> bool check_sign,
>
>    if (TYPE_PRECISION (type) >= (TYPE_PRECISION (*orig_type) * 2))
>      *promotion = true;
> -  else if (TYPE_PRECISION (*orig_type) >= (TYPE_PRECISION (type) * 2))
> -    *promotion = false;
>    else
> -    return false;
> +    *promotion = false;
>
>    if (!vect_is_simple_use (oprnd0, *def_stmt, loop_vinfo,
>     bb_vinfo, &dummy_gimple, &dummy, &dt))
> @@ -433,6 +433,242 @@ vect_recog_dot_prod_pattern (vec<gimple> *stmts, tree *type_in,
>  }
>
>
> +/* Function vect_recog_sad_pattern
> +
> +   Try to find the following Sum of Absolute Difference (SAD) pattern:
> +
> +     unsigned type x_t, y_t;
> +     signed TYPE1 diff, abs_diff;
> +     TYPE2 sum = init;
> +   loop:
> +     sum_0 = phi <init, sum_1>
> +     S1  x_t = ...
> +     S2  y_t = ...
> +     S3  x_T = (TYPE1) x_t;
> +     S4  y_T = (TYPE1) y_t;
> +     S5  diff = x_T - y_T;
> +     S6  abs_diff = ABS_EXPR <diff>;
> +     [S7  abs_diff = (TYPE2) abs_diff;  #optional]
> +     S8  sum_1 = abs_diff + sum_0;
> +
> +   where 'TYPE1' is at least double the size of type 'type', and 'TYPE2' is the
> +   same size of 'TYPE1' or bigger. This is a special case of a reduction
> +   computation.
> +
> +   Input:
> +
> +   * STMTS: Contains a stmt from which the pattern search begins.  In the
> +   example, when this function is called with S8, the pattern
> +   {S3,S4,S5,S6,S7,S8} will be detected.
> +
> +   Output:
> +
> +   * TYPE_IN: The type of the input arguments to the pattern.
> +
> +   * TYPE_OUT: The type of the output of this pattern.
> +
> +   * Return value: A new stmt that will be used to replace the sequence of
> +   stmts that constitute the pattern. In this case it will be:
> +        SAD_EXPR <x_t, y_t, sum_0>
> +  */
> +
> +static gimple
> +vect_recog_sad_pattern (vec<gimple> *stmts, tree *type_in,
> +     tree *type_out)
> +{
> +  gimple last_stmt = (*stmts)[0];
> +  tree sad_oprnd0, sad_oprnd1;
> +  stmt_vec_info stmt_vinfo = vinfo_for_stmt (last_stmt);
> +  tree half_type;
> +  loop_vec_info loop_info = STMT_VINFO_LOOP_VINFO (stmt_vinfo);
> +  struct loop *loop;
> +  bool promotion;
> +
> +  if (!loop_info)
> +    return NULL;
> +
> +  loop = LOOP_VINFO_LOOP (loop_info);
> +
> +  if (!is_gimple_assign (last_stmt))
> +    return NULL;
> +
> +  tree sum_type = gimple_expr_type (last_stmt);
> +
> +  /* Look for the following pattern
> +          DX = (TYPE1) X;
> +          DY = (TYPE1) Y;
> +          DDIFF = DX - DY;
> +          DAD = ABS_EXPR <DDIFF>;
> +          DDPROD = (TYPE2) DPROD;
> +          sum_1 = DAD + sum_0;
> +     In which
> +     - DX is at least double the size of X
> +     - DY is at least double the size of Y
> +     - DX, DY, DDIFF, DAD all have the same type
> +     - sum is the same size of DAD or bigger
> +     - sum has been recognized as a reduction variable.
> +
> +     This is equivalent to:
> +       DDIFF = X w- Y;          #widen sub
> +       DAD = ABS_EXPR <DDIFF>;
> +       sum_1 = DAD w+ sum_0;    #widen summation
> +     or
> +       DDIFF = X w- Y;          #widen sub
> +       DAD = ABS_EXPR <DDIFF>;
> +       sum_1 = DAD + sum_0;     #summation
> +   */
> +
> +  /* Starting from LAST_STMT, follow the defs of its uses in search
> +     of the above pattern.  */
> +
> +  if (gimple_assign_rhs_code (last_stmt) != PLUS_EXPR)
> +    return NULL;
> +
> +  tree plus_oprnd0, plus_oprnd1;
> +
> +  if (STMT_VINFO_IN_PATTERN_P (stmt_vinfo))
> +    {
> +      /* Has been detected as widening-summation?  */
> +
> +      gimple stmt = STMT_VINFO_RELATED_STMT (stmt_vinfo);
> +      sum_type = gimple_expr_type (stmt);
> +      if (gimple_assign_rhs_code (stmt) != WIDEN_SUM_EXPR)
> +        return NULL;
> +      plus_oprnd0 = gimple_assign_rhs1 (stmt);
> +      plus_oprnd1 = gimple_assign_rhs2 (stmt);
> +      half_type = TREE_TYPE (plus_oprnd0);
> +    }
> +  else
> +    {
> +      gimple def_stmt;
> +
> +      if (STMT_VINFO_DEF_TYPE (stmt_vinfo) != vect_reduction_def)
> +        return NULL;
> +      plus_oprnd0 = gimple_assign_rhs1 (last_stmt);
> +      plus_oprnd1 = gimple_assign_rhs2 (last_stmt);
> +      if (!types_compatible_p (TREE_TYPE (plus_oprnd0), sum_type)
> +  || !types_compatible_p (TREE_TYPE (plus_oprnd1), sum_type))
> +        return NULL;
> +
> +      /* The type conversion could be promotion, demotion,
> +         or just signed -> unsigned.  */
> +      if (type_conversion_p (plus_oprnd0, last_stmt, false,
> +                             &half_type, &def_stmt, &promotion))
> +        plus_oprnd0 = gimple_assign_rhs1 (def_stmt);
> +      else
> +        half_type = sum_type;
> +    }
> +
> +  /* So far so good.  Since last_stmt was detected as a (summation) reduction,
> +     we know that plus_oprnd1 is the reduction variable (defined by a loop-header
> +     phi), and plus_oprnd0 is an ssa-name defined by a stmt in the loop body.
> +     Then check that plus_oprnd0 is defined by an abs_expr  */
> +
> +  if (TREE_CODE (plus_oprnd0) != SSA_NAME)
> +    return NULL;
> +
> +  tree abs_type = half_type;
> +  gimple abs_stmt = SSA_NAME_DEF_STMT (plus_oprnd0);
> +
> +  /* It could not be the sad pattern if the abs_stmt is outside the loop.  */
> +  if (!gimple_bb (abs_stmt) || !flow_bb_inside_loop_p (loop, gimple_bb (abs_stmt)))
> +    return NULL;
> +
> +  /* FORNOW.  Can continue analyzing the def-use chain when this stmt in a phi
> +     inside the loop (in case we are analyzing an outer-loop).  */
> +  if (!is_gimple_assign (abs_stmt))
> +    return NULL;
> +
> +  stmt_vec_info abs_stmt_vinfo = vinfo_for_stmt (abs_stmt);
> +  gcc_assert (abs_stmt_vinfo);
> +  if (STMT_VINFO_DEF_TYPE (abs_stmt_vinfo) != vect_internal_def)
> +    return NULL;
> +  if (gimple_assign_rhs_code (abs_stmt) != ABS_EXPR)
> +    return NULL;
> +
> +  tree abs_oprnd = gimple_assign_rhs1 (abs_stmt);
> +  if (!types_compatible_p (TREE_TYPE (abs_oprnd), abs_type))
> +    return NULL;
> +  if (TYPE_UNSIGNED (abs_type))
> +    return NULL;
> +
> +  /* We then detect if the operand of abs_expr is defined by a minus_expr.  */
> +
> +  if (TREE_CODE (abs_oprnd) != SSA_NAME)
> +    return NULL;
> +
> +  gimple diff_stmt = SSA_NAME_DEF_STMT (abs_oprnd);
> +
> +  /* It could not be the sad pattern if the diff_stmt is outside the loop.  */
> +  if (!gimple_bb (diff_stmt)
> +      || !flow_bb_inside_loop_p (loop, gimple_bb (diff_stmt)))
> +    return NULL;
> +
> +  /* FORNOW.  Can continue analyzing the def-use chain when this stmt in a phi
> +     inside the loop (in case we are analyzing an outer-loop).  */
> +  if (!is_gimple_assign (diff_stmt))
> +    return NULL;
> +
> +  stmt_vec_info diff_stmt_vinfo = vinfo_for_stmt (diff_stmt);
> +  gcc_assert (diff_stmt_vinfo);
> +  if (STMT_VINFO_DEF_TYPE (diff_stmt_vinfo) != vect_internal_def)
> +    return NULL;
> +  if (gimple_assign_rhs_code (diff_stmt) != MINUS_EXPR)
> +    return NULL;
> +
> +  tree half_type0, half_type1;
> +  gimple def_stmt;
> +
> +  tree minus_oprnd0 = gimple_assign_rhs1 (diff_stmt);
> +  tree minus_oprnd1 = gimple_assign_rhs2 (diff_stmt);
> +
> +  if (!types_compatible_p (TREE_TYPE (minus_oprnd0), abs_type)
> +      || !types_compatible_p (TREE_TYPE (minus_oprnd1), abs_type))
> +    return NULL;
> +  if (!type_conversion_p (minus_oprnd0, diff_stmt, false,
> +                          &half_type0, &def_stmt, &promotion)
> +      || !promotion)
> +    return NULL;
> +  sad_oprnd0 = gimple_assign_rhs1 (def_stmt);
> +
> +  if (!type_conversion_p (minus_oprnd1, diff_stmt, false,
> +                          &half_type1, &def_stmt, &promotion)
> +      || !promotion)
> +    return NULL;
> +  sad_oprnd1 = gimple_assign_rhs1 (def_stmt);
> +
> +  if (!types_compatible_p (half_type0, half_type1))
> +    return NULL;
> +  if (!TYPE_UNSIGNED (half_type0))
> +    return NULL;
> +  if (TYPE_PRECISION (abs_type) < TYPE_PRECISION (half_type0) * 2
> +      || TYPE_PRECISION (sum_type) < TYPE_PRECISION (half_type0) * 2)
> +    return NULL;
> +
> +  *type_in = TREE_TYPE (sad_oprnd0);
> +  *type_out = sum_type;
> +
> +  /* Pattern detected. Create a stmt to be used to replace the pattern: */
> +  tree var = vect_recog_temp_ssa_var (sum_type, NULL);
> +  gimple pattern_stmt = gimple_build_assign_with_ops
> +                          (SAD_EXPR, var, sad_oprnd0, sad_oprnd1, plus_oprnd1);
> +
> +  if (dump_enabled_p ())
> +    {
> +      dump_printf_loc (MSG_NOTE, vect_location,
> +                       "vect_recog_sad_pattern: detected: ");
> +      dump_gimple_stmt (MSG_NOTE, TDF_SLIM, pattern_stmt, 0);
> +      dump_printf (MSG_NOTE, "\n");
> +    }
> +
> +  /* We don't allow changing the order of the computation in the inner-loop
> +     when doing outer-loop vectorization.  */
> +  gcc_assert (!nested_in_vect_loop_p (loop, last_stmt));
> +
> +  return pattern_stmt;
> +}
> +
> +
>  /* Handle widening operation by a constant.  At the moment we support MULT_EXPR
>     and LSHIFT_EXPR.
>
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index 8b7b345..0aac75b 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -1044,7 +1044,7 @@ extern void vect_slp_transform_bb (basic_block);
>     Additional pattern recognition functions can (and will) be added
>     in the future.  */
>  typedef gimple (* vect_recog_func_ptr) (vec<gimple> *, tree *, tree *);
> -#define NUM_PATTERNS 11
> +#define NUM_PATTERNS 12
>  void vect_pattern_recog (loop_vec_info, bb_vec_info);
>
>  /* In tree-vectorizer.c.  */
> diff --git a/gcc/tree.def b/gcc/tree.def
> index 88c850a..31a3b64 100644
> --- a/gcc/tree.def
> +++ b/gcc/tree.def
> @@ -1146,6 +1146,15 @@ DEFTREECODE (REDUC_PLUS_EXPR,
> "reduc_plus_expr", tcc_unary, 1)
>          arg3 = WIDEN_SUM_EXPR (tmp, arg3); */
>  DEFTREECODE (DOT_PROD_EXPR, "dot_prod_expr", tcc_expression, 3)
>
> +/* Widening sad (sum of absolute differences).
> +   The first two arguments are of type t1 which should be unsigned integer.
> +   The third argument and the result are of type t2, such that t2 is at least
> +   twice the size of t1. SAD_EXPR(arg1,arg2,arg3) is equivalent to:
> + tmp1 = WIDEN_MINUS_EXPR (arg1, arg2);
> + tmp2 = ABS_EXPR (tmp1);
> + arg3 = PLUS_EXPR (tmp2, arg3); */
> +DEFTREECODE (SAD_EXPR, "sad_expr", tcc_expression, 3)
> +
>  /* Widening summation.
>     The first argument is of type t1.
>     The second argument is of type t2, such that t2 is at least twice


* [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.
@ 2013-10-29 23:05 Cong Hou
  2013-10-30  0:09 ` Ramana Radhakrishnan
  2013-10-30 12:16 ` Richard Biener
  0 siblings, 2 replies; 27+ messages in thread
From: Cong Hou @ 2013-10-29 23:05 UTC (permalink / raw)
  To: GCC Patches; +Cc: Richard Biener

[-- Attachment #1: Type: text/plain, Size: 21432 bytes --]

Hi

SAD (Sum of Absolute Differences) is a common and important algorithm
in image processing and other areas. SSE2 even introduced a new
instruction, PSADBW, for it. A SAD loop can be greatly accelerated by
this instruction after being vectorized. This patch introduces a new
operation, SAD_EXPR, and a SAD pattern recognizer in the vectorizer.

The pattern of SAD is shown below:

     unsigned type x_t, y_t;
     signed TYPE1 diff, abs_diff;
     TYPE2 sum = init;
   loop:
     sum_0 = phi <init, sum_1>
     S1  x_t = ...
     S2  y_t = ...
     S3  x_T = (TYPE1) x_t;
     S4  y_T = (TYPE1) y_t;
     S5  diff = x_T - y_T;
     S6  abs_diff = ABS_EXPR <diff>;
     [S7  abs_diff = (TYPE2) abs_diff;  #optional]
     S8  sum_1 = abs_diff + sum_0;

   where 'TYPE1' is at least double the size of type 'type', and 'TYPE2' is the
   same size as 'TYPE1' or bigger. This is a special case of a reduction
   computation.

For SSE2, type is char, and TYPE1 and TYPE2 are int.
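As a concrete, hand-expanded instance of statements S1-S8 above (a
sketch for illustration, not code from the patch), the pattern
corresponds to C like this:

```c
#include <stdlib.h>

/* Scalar SAD over n unsigned chars, written so the S1-S8 statements of
   the pattern are visible: widen both inputs, subtract, take the
   absolute value, and accumulate into a wider sum.  */
static int
sad_scalar (const unsigned char *x, const unsigned char *y, int n)
{
  int sum = 0;                    /* TYPE2 sum = init;       */
  for (int i = 0; i < n; i++)
    {
      int x_T = (int) x[i];       /* S3: widen x_t to TYPE1  */
      int y_T = (int) y[i];       /* S4: widen y_t to TYPE1  */
      int diff = x_T - y_T;       /* S5                      */
      int abs_diff = abs (diff);  /* S6                      */
      sum += abs_diff;            /* S8                      */
    }
  return sum;
}
```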


In order to express this new operation, a new expression SAD_EXPR is
introduced in tree.def, and the corresponding entry in optabs is
added. The patch also adds "define_expand" patterns for SSE2 and AVX2
on i386.
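For reference, PSADBW sums the absolute byte differences within each
8-byte half of the vector, producing one 64-bit partial sum per half -
roughly the following (a scalar model of the instruction's semantics,
not of the expander):

```c
#include <stdint.h>

/* Rough scalar model of SSE2 PSADBW on one pair of 16-byte vectors:
   each 8-byte half yields one 64-bit sum of absolute byte
   differences.  */
static void
psadbw_model (const uint8_t a[16], const uint8_t b[16], uint64_t out[2])
{
  for (int half = 0; half < 2; half++)
    {
      uint64_t sum = 0;
      for (int i = 0; i < 8; i++)
        {
          int d = (int) a[half * 8 + i] - (int) b[half * 8 + i];
          sum += d < 0 ? -d : d;
        }
      out[half] = sum;
    }
}
```

This is why the expander needs the follow-up convert_move and add: the
instruction produces V2DI partial sums that still have to be narrowed
and accumulated into the V4SI result.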

The patch is pasted below and also attached as a text file (in which
you can see tabs). Bootstrap and make check passed on x86. Please
give me your comments.



thanks,
Cong



diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 8a38316..d528307 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,23 @@
+2013-10-29  Cong Hou  <congh@google.com>
+
+ * tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD
+ pattern recognition.
+ (type_conversion_p): PROMOTION is true if it's a type promotion
+ conversion, and false otherwise.  Return true if the given expression
+ is a type conversion one.
+ * tree-vectorizer.h: Adjust the number of patterns.
+ * tree.def: Add SAD_EXPR.
+ * optabs.def: Add sad_optab.
+ * cfgexpand.c (expand_debug_expr): Add SAD_EXPR case.
+ * expr.c (expand_expr_real_2): Likewise.
+ * gimple-pretty-print.c (dump_ternary_rhs): Likewise.
+ * gimple.c (get_gimple_rhs_num_ops): Likewise.
+ * optabs.c (optab_for_tree_code): Likewise.
+ * tree-cfg.c (estimate_operator_cost): Likewise.
+ * tree-ssa-operands.c (get_expr_operands): Likewise.
+ * tree-vect-loop.c (get_initial_def_for_reduction): Likewise.
+ * config/i386/sse.md: Add SSE2 and AVX2 expand for SAD.
+
 2013-10-14  David Malcolm  <dmalcolm@redhat.com>

  * dumpfile.h (gcc::dump_manager): New class, to hold state
diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
index 7ed29f5..9ec761a 100644
--- a/gcc/cfgexpand.c
+++ b/gcc/cfgexpand.c
@@ -2730,6 +2730,7 @@ expand_debug_expr (tree exp)
  {
  case COND_EXPR:
  case DOT_PROD_EXPR:
+ case SAD_EXPR:
  case WIDEN_MULT_PLUS_EXPR:
  case WIDEN_MULT_MINUS_EXPR:
  case FMA_EXPR:
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index c3f6c94..ca1ab70 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -6052,6 +6052,40 @@
   DONE;
 })

+(define_expand "sadv16qi"
+  [(match_operand:V4SI 0 "register_operand")
+   (match_operand:V16QI 1 "register_operand")
+   (match_operand:V16QI 2 "register_operand")
+   (match_operand:V4SI 3 "register_operand")]
+  "TARGET_SSE2"
+{
+  rtx t1 = gen_reg_rtx (V2DImode);
+  rtx t2 = gen_reg_rtx (V4SImode);
+  emit_insn (gen_sse2_psadbw (t1, operands[1], operands[2]));
+  convert_move (t2, t1, 0);
+  emit_insn (gen_rtx_SET (VOIDmode, operands[0],
+  gen_rtx_PLUS (V4SImode,
+ operands[3], t2)));
+  DONE;
+})
+
+(define_expand "sadv32qi"
+  [(match_operand:V8SI 0 "register_operand")
+   (match_operand:V32QI 1 "register_operand")
+   (match_operand:V32QI 2 "register_operand")
+   (match_operand:V8SI 3 "register_operand")]
+  "TARGET_AVX2"
+{
+  rtx t1 = gen_reg_rtx (V4DImode);
+  rtx t2 = gen_reg_rtx (V8SImode);
+  emit_insn (gen_avx2_psadbw (t1, operands[1], operands[2]));
+  convert_move (t2, t1, 0);
+  emit_insn (gen_rtx_SET (VOIDmode, operands[0],
+  gen_rtx_PLUS (V8SImode,
+ operands[3], t2)));
+  DONE;
+})
+
 (define_insn "ashr<mode>3"
   [(set (match_operand:VI24_AVX2 0 "register_operand" "=x,x")
  (ashiftrt:VI24_AVX2
diff --git a/gcc/expr.c b/gcc/expr.c
index 4975a64..1db8a49 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -9026,6 +9026,20 @@ expand_expr_real_2 (sepops ops, rtx target, enum machine_mode tmode,
  return target;
       }

+      case SAD_EXPR:
+      {
+ tree oprnd0 = treeop0;
+ tree oprnd1 = treeop1;
+ tree oprnd2 = treeop2;
+ rtx op2;
+
+ expand_operands (oprnd0, oprnd1, NULL_RTX, &op0, &op1, EXPAND_NORMAL);
+ op2 = expand_normal (oprnd2);
+ target = expand_widen_pattern_expr (ops, op0, op1, op2,
+    target, unsignedp);
+ return target;
+      }
+
     case REALIGN_LOAD_EXPR:
       {
         tree oprnd0 = treeop0;
diff --git a/gcc/gimple-pretty-print.c b/gcc/gimple-pretty-print.c
index f0f8166..514ddd1 100644
--- a/gcc/gimple-pretty-print.c
+++ b/gcc/gimple-pretty-print.c
@@ -425,6 +425,16 @@ dump_ternary_rhs (pretty_printer *buffer, gimple gs, int spc, int flags)
       dump_generic_node (buffer, gimple_assign_rhs3 (gs), spc, flags, false);
       pp_greater (buffer);
       break;
+
+    case SAD_EXPR:
+      pp_string (buffer, "SAD_EXPR <");
+      dump_generic_node (buffer, gimple_assign_rhs1 (gs), spc, flags, false);
+      pp_string (buffer, ", ");
+      dump_generic_node (buffer, gimple_assign_rhs2 (gs), spc, flags, false);
+      pp_string (buffer, ", ");
+      dump_generic_node (buffer, gimple_assign_rhs3 (gs), spc, flags, false);
+      pp_greater (buffer);
+      break;

     case VEC_PERM_EXPR:
       pp_string (buffer, "VEC_PERM_EXPR <");
diff --git a/gcc/gimple.c b/gcc/gimple.c
index a12dd67..4975959 100644
--- a/gcc/gimple.c
+++ b/gcc/gimple.c
@@ -2562,6 +2562,7 @@ get_gimple_rhs_num_ops (enum tree_code code)
       || (SYM) == WIDEN_MULT_PLUS_EXPR    \
       || (SYM) == WIDEN_MULT_MINUS_EXPR    \
       || (SYM) == DOT_PROD_EXPR    \
+      || (SYM) == SAD_EXPR    \
       || (SYM) == REALIGN_LOAD_EXPR    \
       || (SYM) == VEC_COND_EXPR    \
       || (SYM) == VEC_PERM_EXPR                                             \
diff --git a/gcc/optabs.c b/gcc/optabs.c
index 06a626c..4ddd4d9 100644
--- a/gcc/optabs.c
+++ b/gcc/optabs.c
@@ -462,6 +462,9 @@ optab_for_tree_code (enum tree_code code, const_tree type,
     case DOT_PROD_EXPR:
       return TYPE_UNSIGNED (type) ? udot_prod_optab : sdot_prod_optab;

+    case SAD_EXPR:
+      return sad_optab;
+
     case WIDEN_MULT_PLUS_EXPR:
       return (TYPE_UNSIGNED (type)
       ? (TYPE_SATURATING (type)
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 6b924ac..e35d567 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -248,6 +248,7 @@ OPTAB_D (sdot_prod_optab, "sdot_prod$I$a")
 OPTAB_D (ssum_widen_optab, "widen_ssum$I$a3")
 OPTAB_D (udot_prod_optab, "udot_prod$I$a")
 OPTAB_D (usum_widen_optab, "widen_usum$I$a3")
+OPTAB_D (sad_optab, "sad$I$a")
 OPTAB_D (vec_extract_optab, "vec_extract$a")
 OPTAB_D (vec_init_optab, "vec_init$a")
 OPTAB_D (vec_pack_sfix_trunc_optab, "vec_pack_sfix_trunc_$a")
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 075d071..226b8d5 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,7 @@
+2013-10-29  Cong Hou  <congh@google.com>
+
+ * gcc.dg/vect/vect-reduc-sad.c: New.
+
 2013-10-14  Tobias Burnus  <burnus@net-b.de>

  PR fortran/58658
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c
new file mode 100644
index 0000000..14ebb3b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c
@@ -0,0 +1,54 @@
+/* { dg-require-effective-target sse2 { target { i?86-*-* x86_64-*-* } } } */
+
+#include <stdarg.h>
+#include "tree-vect.h"
+
+#define N 64
+#define SAD N*N/2
+
+unsigned char X[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+unsigned char Y[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+
+/* Sum of absolute differences between arrays of unsigned char types.
+   Detected as a sad pattern.
+   Vectorized on targets that support sad for unsigned chars.  */
+
+__attribute__ ((noinline)) int
+foo (int len)
+{
+  int i;
+  int result = 0;
+
+  for (i = 0; i < len; i++)
+    result += abs (X[i] - Y[i]);
+
+  return result;
+}
+
+
+int
+main (void)
+{
+  int i;
+  int sad;
+
+  check_vect ();
+
+  for (i = 0; i < N; i++)
+    {
+      X[i] = i;
+      Y[i] = N - i;
+      __asm__ volatile ("");
+    }
+
+  sad = foo (N);
+  if (sad != SAD)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vect_recog_sad_pattern:
detected" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+
diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c
index 8b66791..d689cac 100644
--- a/gcc/tree-cfg.c
+++ b/gcc/tree-cfg.c
@@ -3797,6 +3797,7 @@ verify_gimple_assign_ternary (gimple stmt)
       return false;

     case DOT_PROD_EXPR:
+    case SAD_EXPR:
     case REALIGN_LOAD_EXPR:
       /* FIXME.  */
       return false;
diff --git a/gcc/tree-inline.c b/gcc/tree-inline.c
index 2221b9c..44261a3 100644
--- a/gcc/tree-inline.c
+++ b/gcc/tree-inline.c
@@ -3601,6 +3601,7 @@ estimate_operator_cost (enum tree_code code,
eni_weights *weights,
     case WIDEN_SUM_EXPR:
     case WIDEN_MULT_EXPR:
     case DOT_PROD_EXPR:
+    case SAD_EXPR:
     case WIDEN_MULT_PLUS_EXPR:
     case WIDEN_MULT_MINUS_EXPR:
     case WIDEN_LSHIFT_EXPR:
diff --git a/gcc/tree-ssa-operands.c b/gcc/tree-ssa-operands.c
index 603f797..393efc3 100644
--- a/gcc/tree-ssa-operands.c
+++ b/gcc/tree-ssa-operands.c
@@ -854,6 +854,7 @@ get_expr_operands (gimple stmt, tree *expr_p, int flags)
       }

     case DOT_PROD_EXPR:
+    case SAD_EXPR:
     case REALIGN_LOAD_EXPR:
     case WIDEN_MULT_PLUS_EXPR:
     case WIDEN_MULT_MINUS_EXPR:
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 638b981..89aa8c7 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -3607,6 +3607,7 @@ get_initial_def_for_reduction (gimple stmt, tree init_val,
     {
       case WIDEN_SUM_EXPR:
       case DOT_PROD_EXPR:
+      case SAD_EXPR:
       case PLUS_EXPR:
       case MINUS_EXPR:
       case BIT_IOR_EXPR:
diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
index 0a4e812..7919449 100644
--- a/gcc/tree-vect-patterns.c
+++ b/gcc/tree-vect-patterns.c
@@ -45,6 +45,8 @@ static gimple vect_recog_widen_mult_pattern
(vec<gimple> *, tree *,
      tree *);
 static gimple vect_recog_dot_prod_pattern (vec<gimple> *, tree *,
    tree *);
+static gimple vect_recog_sad_pattern (vec<gimple> *, tree *,
+      tree *);
 static gimple vect_recog_pow_pattern (vec<gimple> *, tree *, tree *);
 static gimple vect_recog_over_widening_pattern (vec<gimple> *, tree *,
                                                  tree *);
@@ -62,6 +64,7 @@ static vect_recog_func_ptr
vect_vect_recog_func_ptrs[NUM_PATTERNS] = {
  vect_recog_widen_mult_pattern,
  vect_recog_widen_sum_pattern,
  vect_recog_dot_prod_pattern,
+        vect_recog_sad_pattern,
  vect_recog_pow_pattern,
  vect_recog_widen_shift_pattern,
  vect_recog_over_widening_pattern,
@@ -140,9 +143,8 @@ vect_single_imm_use (gimple def_stmt)
 }

 /* Check whether NAME, an ssa-name used in USE_STMT,
-   is a result of a type promotion or demotion, such that:
+   is a result of a type promotion, such that:
      DEF_STMT: NAME = NOP (name0)
-   where the type of name0 (ORIG_TYPE) is smaller/bigger than the type of NAME.
    If CHECK_SIGN is TRUE, check that either both types are signed or both are
    unsigned.  */

@@ -189,10 +191,8 @@ type_conversion_p (tree name, gimple use_stmt,
bool check_sign,

   if (TYPE_PRECISION (type) >= (TYPE_PRECISION (*orig_type) * 2))
     *promotion = true;
-  else if (TYPE_PRECISION (*orig_type) >= (TYPE_PRECISION (type) * 2))
-    *promotion = false;
   else
-    return false;
+    *promotion = false;

   if (!vect_is_simple_use (oprnd0, *def_stmt, loop_vinfo,
    bb_vinfo, &dummy_gimple, &dummy, &dt))
@@ -433,6 +433,242 @@ vect_recog_dot_prod_pattern (vec<gimple> *stmts,
tree *type_in,
 }


+/* Function vect_recog_sad_pattern
+
+   Try to find the following Sum of Absolute Difference (SAD) pattern:
+
+     unsigned type x_t, y_t;
+     signed TYPE1 diff, abs_diff;
+     TYPE2 sum = init;
+   loop:
+     sum_0 = phi <init, sum_1>
+     S1  x_t = ...
+     S2  y_t = ...
+     S3  x_T = (TYPE1) x_t;
+     S4  y_T = (TYPE1) y_t;
+     S5  diff = x_T - y_T;
+     S6  abs_diff = ABS_EXPR <diff>;
+     [S7  abs_diff = (TYPE2) abs_diff;  #optional]
+     S8  sum_1 = abs_diff + sum_0;
+
+   where 'TYPE1' is at least double the size of type 'type', and 'TYPE2' is the
+   same size of 'TYPE1' or bigger. This is a special case of a reduction
+   computation.
+
+   Input:
+
+   * STMTS: Contains a stmt from which the pattern search begins.  In the
+   example, when this function is called with S8, the pattern
+   {S3,S4,S5,S6,S7,S8} will be detected.
+
+   Output:
+
+   * TYPE_IN: The type of the input arguments to the pattern.
+
+   * TYPE_OUT: The type of the output of this pattern.
+
+   * Return value: A new stmt that will be used to replace the sequence of
+   stmts that constitute the pattern. In this case it will be:
+        SAD_EXPR <x_t, y_t, sum_0>
+  */
+
+static gimple
+vect_recog_sad_pattern (vec<gimple> *stmts, tree *type_in,
+     tree *type_out)
+{
+  gimple last_stmt = (*stmts)[0];
+  tree sad_oprnd0, sad_oprnd1;
+  stmt_vec_info stmt_vinfo = vinfo_for_stmt (last_stmt);
+  tree half_type;
+  loop_vec_info loop_info = STMT_VINFO_LOOP_VINFO (stmt_vinfo);
+  struct loop *loop;
+  bool promotion;
+
+  if (!loop_info)
+    return NULL;
+
+  loop = LOOP_VINFO_LOOP (loop_info);
+
+  if (!is_gimple_assign (last_stmt))
+    return NULL;
+
+  tree sum_type = gimple_expr_type (last_stmt);
+
+  /* Look for the following pattern
+          DX = (TYPE1) X;
+          DY = (TYPE1) Y;
+          DDIFF = DX - DY;
+          DAD = ABS_EXPR <DDIFF>;
+          DDPROD = (TYPE2) DPROD;
+          sum_1 = DAD + sum_0;
+     In which
+     - DX is at least double the size of X
+     - DY is at least double the size of Y
+     - DX, DY, DDIFF, DAD all have the same type
+     - sum is the same size of DAD or bigger
+     - sum has been recognized as a reduction variable.
+
+     This is equivalent to:
+       DDIFF = X w- Y;          #widen sub
+       DAD = ABS_EXPR <DDIFF>;
+       sum_1 = DAD w+ sum_0;    #widen summation
+     or
+       DDIFF = X w- Y;          #widen sub
+       DAD = ABS_EXPR <DDIFF>;
+       sum_1 = DAD + sum_0;     #summation
+   */
+
+  /* Starting from LAST_STMT, follow the defs of its uses in search
+     of the above pattern.  */
+
+  if (gimple_assign_rhs_code (last_stmt) != PLUS_EXPR)
+    return NULL;
+
+  tree plus_oprnd0, plus_oprnd1;
+
+  if (STMT_VINFO_IN_PATTERN_P (stmt_vinfo))
+    {
+      /* Has been detected as widening-summation?  */
+
+      gimple stmt = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+      sum_type = gimple_expr_type (stmt);
+      if (gimple_assign_rhs_code (stmt) != WIDEN_SUM_EXPR)
+        return NULL;
+      plus_oprnd0 = gimple_assign_rhs1 (stmt);
+      plus_oprnd1 = gimple_assign_rhs2 (stmt);
+      half_type = TREE_TYPE (plus_oprnd0);
+    }
+  else
+    {
+      gimple def_stmt;
+
+      if (STMT_VINFO_DEF_TYPE (stmt_vinfo) != vect_reduction_def)
+        return NULL;
+      plus_oprnd0 = gimple_assign_rhs1 (last_stmt);
+      plus_oprnd1 = gimple_assign_rhs2 (last_stmt);
+      if (!types_compatible_p (TREE_TYPE (plus_oprnd0), sum_type)
+  || !types_compatible_p (TREE_TYPE (plus_oprnd1), sum_type))
+        return NULL;
+
+      /* The type conversion could be promotion, demotion,
+         or just signed -> unsigned.  */
+      if (type_conversion_p (plus_oprnd0, last_stmt, false,
+                             &half_type, &def_stmt, &promotion))
+        plus_oprnd0 = gimple_assign_rhs1 (def_stmt);
+      else
+        half_type = sum_type;
+    }
+
+  /* So far so good.  Since last_stmt was detected as a (summation) reduction,
+     we know that plus_oprnd1 is the reduction variable (defined by a
loop-header
+     phi), and plus_oprnd0 is an ssa-name defined by a stmt in the loop body.
+     Then check that plus_oprnd0 is defined by an abs_expr  */
+
+  if (TREE_CODE (plus_oprnd0) != SSA_NAME)
+    return NULL;
+
+  tree abs_type = half_type;
+  gimple abs_stmt = SSA_NAME_DEF_STMT (plus_oprnd0);
+
+  /* It could not be the sad pattern if the abs_stmt is outside the loop.  */
+  if (!gimple_bb (abs_stmt) || !flow_bb_inside_loop_p (loop,
gimple_bb (abs_stmt)))
+    return NULL;
+
+  /* FORNOW.  Can continue analyzing the def-use chain when this stmt in a phi
+     inside the loop (in case we are analyzing an outer-loop).  */
+  if (!is_gimple_assign (abs_stmt))
+    return NULL;
+
+  stmt_vec_info abs_stmt_vinfo = vinfo_for_stmt (abs_stmt);
+  gcc_assert (abs_stmt_vinfo);
+  if (STMT_VINFO_DEF_TYPE (abs_stmt_vinfo) != vect_internal_def)
+    return NULL;
+  if (gimple_assign_rhs_code (abs_stmt) != ABS_EXPR)
+    return NULL;
+
+  tree abs_oprnd = gimple_assign_rhs1 (abs_stmt);
+  if (!types_compatible_p (TREE_TYPE (abs_oprnd), abs_type))
+    return NULL;
+  if (TYPE_UNSIGNED (abs_type))
+    return NULL;
+
+  /* We then detect if the operand of abs_expr is defined by a minus_expr.  */
+
+  if (TREE_CODE (abs_oprnd) != SSA_NAME)
+    return NULL;
+
+  gimple diff_stmt = SSA_NAME_DEF_STMT (abs_oprnd);
+
+  /* It could not be the sad pattern if the diff_stmt is outside the loop.  */
+  if (!gimple_bb (diff_stmt)
+      || !flow_bb_inside_loop_p (loop, gimple_bb (diff_stmt)))
+    return NULL;
+
+  /* FORNOW.  Can continue analyzing the def-use chain when this stmt in a phi
+     inside the loop (in case we are analyzing an outer-loop).  */
+  if (!is_gimple_assign (diff_stmt))
+    return NULL;
+
+  stmt_vec_info diff_stmt_vinfo = vinfo_for_stmt (diff_stmt);
+  gcc_assert (diff_stmt_vinfo);
+  if (STMT_VINFO_DEF_TYPE (diff_stmt_vinfo) != vect_internal_def)
+    return NULL;
+  if (gimple_assign_rhs_code (diff_stmt) != MINUS_EXPR)
+    return NULL;
+
+  tree half_type0, half_type1;
+  gimple def_stmt;
+
+  tree minus_oprnd0 = gimple_assign_rhs1 (diff_stmt);
+  tree minus_oprnd1 = gimple_assign_rhs2 (diff_stmt);
+
+  if (!types_compatible_p (TREE_TYPE (minus_oprnd0), abs_type)
+      || !types_compatible_p (TREE_TYPE (minus_oprnd1), abs_type))
+    return NULL;
+  if (!type_conversion_p (minus_oprnd0, diff_stmt, false,
+                          &half_type0, &def_stmt, &promotion)
+      || !promotion)
+    return NULL;
+  sad_oprnd0 = gimple_assign_rhs1 (def_stmt);
+
+  if (!type_conversion_p (minus_oprnd1, diff_stmt, false,
+                          &half_type1, &def_stmt, &promotion)
+      || !promotion)
+    return NULL;
+  sad_oprnd1 = gimple_assign_rhs1 (def_stmt);
+
+  if (!types_compatible_p (half_type0, half_type1))
+    return NULL;
+  if (!TYPE_UNSIGNED (half_type0))
+    return NULL;
+  if (TYPE_PRECISION (abs_type) < TYPE_PRECISION (half_type0) * 2
+      || TYPE_PRECISION (sum_type) < TYPE_PRECISION (half_type0) * 2)
+    return NULL;
+
+  *type_in = TREE_TYPE (sad_oprnd0);
+  *type_out = sum_type;
+
+  /* Pattern detected. Create a stmt to be used to replace the pattern: */
+  tree var = vect_recog_temp_ssa_var (sum_type, NULL);
+  gimple pattern_stmt = gimple_build_assign_with_ops
+                          (SAD_EXPR, var, sad_oprnd0, sad_oprnd1, plus_oprnd1);
+
+  if (dump_enabled_p ())
+    {
+      dump_printf_loc (MSG_NOTE, vect_location,
+                       "vect_recog_sad_pattern: detected: ");
+      dump_gimple_stmt (MSG_NOTE, TDF_SLIM, pattern_stmt, 0);
+      dump_printf (MSG_NOTE, "\n");
+    }
+
+  /* We don't allow changing the order of the computation in the inner-loop
+     when doing outer-loop vectorization.  */
+  gcc_assert (!nested_in_vect_loop_p (loop, last_stmt));
+
+  return pattern_stmt;
+}
+
+
 /* Handle widening operation by a constant.  At the moment we support MULT_EXPR
    and LSHIFT_EXPR.

diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 8b7b345..0aac75b 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1044,7 +1044,7 @@ extern void vect_slp_transform_bb (basic_block);
    Additional pattern recognition functions can (and will) be added
    in the future.  */
 typedef gimple (* vect_recog_func_ptr) (vec<gimple> *, tree *, tree *);
-#define NUM_PATTERNS 11
+#define NUM_PATTERNS 12
 void vect_pattern_recog (loop_vec_info, bb_vec_info);

 /* In tree-vectorizer.c.  */
diff --git a/gcc/tree.def b/gcc/tree.def
index 88c850a..31a3b64 100644
--- a/gcc/tree.def
+++ b/gcc/tree.def
@@ -1146,6 +1146,15 @@ DEFTREECODE (REDUC_PLUS_EXPR,
"reduc_plus_expr", tcc_unary, 1)
         arg3 = WIDEN_SUM_EXPR (tmp, arg3); */
 DEFTREECODE (DOT_PROD_EXPR, "dot_prod_expr", tcc_expression, 3)

+/* Widening sad (sum of absolute differences).
+   The first two arguments are of type t1 which should be unsigned integer.
+   The third argument and the result are of type t2, such that t2 is at least
+   twice the size of t1. SAD_EXPR(arg1,arg2,arg3) is equivalent to:
+ tmp1 = WIDEN_MINUS_EXPR (arg1, arg2);
+ tmp2 = ABS_EXPR (tmp1);
+ arg3 = PLUS_EXPR (tmp2, arg3); */
+DEFTREECODE (SAD_EXPR, "sad_expr", tcc_expression, 3)
+
 /* Widening summation.
    The first argument is of type t1.
    The second argument is of type t2, such that t2 is at least twice

[-- Attachment #2: patch-sad.txt --]
[-- Type: text/plain, Size: 20158 bytes --]

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 8a38316..d528307 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,23 @@
+2013-10-29  Cong Hou  <congh@google.com>
+
+	* tree-vect-patterns.c (vect_recog_sad_pattern): New function for SAD
+	pattern recognition.
+	(type_conversion_p): PROMOTION is true if it's a type promotion
+	conversion, and false otherwise.  Return true if the given expression
+	is a type conversion one.
+	* tree-vectorizer.h: Adjust the number of patterns.
+	* tree.def: Add SAD_EXPR.
+	* optabs.def: Add sad_optab.
+	* cfgexpand.c (expand_debug_expr): Add SAD_EXPR case.
+	* expr.c (expand_expr_real_2): Likewise.
+	* gimple-pretty-print.c (dump_ternary_rhs): Likewise.
+	* gimple.c (get_gimple_rhs_num_ops): Likewise.
+	* optabs.c (optab_for_tree_code): Likewise.
+	* tree-cfg.c (estimate_operator_cost): Likewise.
+	* tree-ssa-operands.c (get_expr_operands): Likewise.
+	* tree-vect-loop.c (get_initial_def_for_reduction): Likewise.
+	* config/i386/sse.md: Add SSE2 and AVX2 expand for SAD.
+
 2013-10-14  David Malcolm  <dmalcolm@redhat.com>
 
 	* dumpfile.h (gcc::dump_manager): New class, to hold state
diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
index 7ed29f5..9ec761a 100644
--- a/gcc/cfgexpand.c
+++ b/gcc/cfgexpand.c
@@ -2730,6 +2730,7 @@ expand_debug_expr (tree exp)
 	{
 	case COND_EXPR:
 	case DOT_PROD_EXPR:
+	case SAD_EXPR:
 	case WIDEN_MULT_PLUS_EXPR:
 	case WIDEN_MULT_MINUS_EXPR:
 	case FMA_EXPR:
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index c3f6c94..ca1ab70 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -6052,6 +6052,40 @@
   DONE;
 })
 
+(define_expand "sadv16qi"
+  [(match_operand:V4SI 0 "register_operand")
+   (match_operand:V16QI 1 "register_operand")
+   (match_operand:V16QI 2 "register_operand")
+   (match_operand:V4SI 3 "register_operand")]
+  "TARGET_SSE2"
+{
+  rtx t1 = gen_reg_rtx (V2DImode);
+  rtx t2 = gen_reg_rtx (V4SImode);
+  emit_insn (gen_sse2_psadbw (t1, operands[1], operands[2]));
+  convert_move (t2, t1, 0);
+  emit_insn (gen_rtx_SET (VOIDmode, operands[0],
+			  gen_rtx_PLUS (V4SImode,
+					operands[3], t2)));
+  DONE;
+})
+
+(define_expand "sadv32qi"
+  [(match_operand:V8SI 0 "register_operand")
+   (match_operand:V32QI 1 "register_operand")
+   (match_operand:V32QI 2 "register_operand")
+   (match_operand:V8SI 3 "register_operand")]
+  "TARGET_AVX2"
+{
+  rtx t1 = gen_reg_rtx (V4DImode);
+  rtx t2 = gen_reg_rtx (V8SImode);
+  emit_insn (gen_avx2_psadbw (t1, operands[1], operands[2]));
+  convert_move (t2, t1, 0);
+  emit_insn (gen_rtx_SET (VOIDmode, operands[0],
+			  gen_rtx_PLUS (V8SImode,
+					operands[3], t2)));
+  DONE;
+})
+
 (define_insn "ashr<mode>3"
   [(set (match_operand:VI24_AVX2 0 "register_operand" "=x,x")
 	(ashiftrt:VI24_AVX2
diff --git a/gcc/expr.c b/gcc/expr.c
index 4975a64..1db8a49 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -9026,6 +9026,20 @@ expand_expr_real_2 (sepops ops, rtx target, enum machine_mode tmode,
 	return target;
       }
 
+      case SAD_EXPR:
+      {
+	tree oprnd0 = treeop0;
+	tree oprnd1 = treeop1;
+	tree oprnd2 = treeop2;
+	rtx op2;
+
+	expand_operands (oprnd0, oprnd1, NULL_RTX, &op0, &op1, EXPAND_NORMAL);
+	op2 = expand_normal (oprnd2);
+	target = expand_widen_pattern_expr (ops, op0, op1, op2,
+					    target, unsignedp);
+	return target;
+      }
+
     case REALIGN_LOAD_EXPR:
       {
         tree oprnd0 = treeop0;
diff --git a/gcc/gimple-pretty-print.c b/gcc/gimple-pretty-print.c
index f0f8166..514ddd1 100644
--- a/gcc/gimple-pretty-print.c
+++ b/gcc/gimple-pretty-print.c
@@ -425,6 +425,16 @@ dump_ternary_rhs (pretty_printer *buffer, gimple gs, int spc, int flags)
       dump_generic_node (buffer, gimple_assign_rhs3 (gs), spc, flags, false);
       pp_greater (buffer);
       break;
+
+    case SAD_EXPR:
+      pp_string (buffer, "SAD_EXPR <");
+      dump_generic_node (buffer, gimple_assign_rhs1 (gs), spc, flags, false);
+      pp_string (buffer, ", ");
+      dump_generic_node (buffer, gimple_assign_rhs2 (gs), spc, flags, false);
+      pp_string (buffer, ", ");
+      dump_generic_node (buffer, gimple_assign_rhs3 (gs), spc, flags, false);
+      pp_greater (buffer);
+      break;
     
     case VEC_PERM_EXPR:
       pp_string (buffer, "VEC_PERM_EXPR <");
diff --git a/gcc/gimple.c b/gcc/gimple.c
index a12dd67..4975959 100644
--- a/gcc/gimple.c
+++ b/gcc/gimple.c
@@ -2562,6 +2562,7 @@ get_gimple_rhs_num_ops (enum tree_code code)
       || (SYM) == WIDEN_MULT_PLUS_EXPR					    \
       || (SYM) == WIDEN_MULT_MINUS_EXPR					    \
       || (SYM) == DOT_PROD_EXPR						    \
+      || (SYM) == SAD_EXPR						    \
       || (SYM) == REALIGN_LOAD_EXPR					    \
       || (SYM) == VEC_COND_EXPR						    \
       || (SYM) == VEC_PERM_EXPR                                             \
diff --git a/gcc/optabs.c b/gcc/optabs.c
index 06a626c..4ddd4d9 100644
--- a/gcc/optabs.c
+++ b/gcc/optabs.c
@@ -462,6 +462,9 @@ optab_for_tree_code (enum tree_code code, const_tree type,
     case DOT_PROD_EXPR:
       return TYPE_UNSIGNED (type) ? udot_prod_optab : sdot_prod_optab;
 
+    case SAD_EXPR:
+      return sad_optab;
+
     case WIDEN_MULT_PLUS_EXPR:
       return (TYPE_UNSIGNED (type)
 	      ? (TYPE_SATURATING (type)
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 6b924ac..e35d567 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -248,6 +248,7 @@ OPTAB_D (sdot_prod_optab, "sdot_prod$I$a")
 OPTAB_D (ssum_widen_optab, "widen_ssum$I$a3")
 OPTAB_D (udot_prod_optab, "udot_prod$I$a")
 OPTAB_D (usum_widen_optab, "widen_usum$I$a3")
+OPTAB_D (sad_optab, "sad$I$a")
 OPTAB_D (vec_extract_optab, "vec_extract$a")
 OPTAB_D (vec_init_optab, "vec_init$a")
 OPTAB_D (vec_pack_sfix_trunc_optab, "vec_pack_sfix_trunc_$a")
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 075d071..226b8d5 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,7 @@
+2013-10-29  Cong Hou  <congh@google.com>
+
+	* gcc.dg/vect/vect-reduc-sad.c: New.
+
 2013-10-14  Tobias Burnus  <burnus@net-b.de>
 
 	PR fortran/58658
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c
new file mode 100644
index 0000000..14ebb3b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-sad.c
@@ -0,0 +1,54 @@
+/* { dg-require-effective-target sse2 { target { i?86-*-* x86_64-*-* } } } */
+
+#include <stdarg.h>
+#include "tree-vect.h"
+
+#define N 64
+#define SAD N*N/2
+
+unsigned char X[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+unsigned char Y[N] __attribute__ ((__aligned__(__BIGGEST_ALIGNMENT__)));
+
+/* Sum of absolute differences between arrays of unsigned char types.
+   Detected as a sad pattern.
+   Vectorized on targets that support sad for unsigned chars.  */
+
+__attribute__ ((noinline)) int
+foo (int len)
+{
+  int i;
+  int result = 0;
+
+  for (i = 0; i < len; i++)
+    result += abs (X[i] - Y[i]);
+
+  return result;
+}
+
+
+int
+main (void)
+{
+  int i;
+  int sad;
+
+  check_vect ();
+
+  for (i = 0; i < N; i++)
+    {
+      X[i] = i;
+      Y[i] = N - i;
+      __asm__ volatile ("");
+    }
+
+  sad = foo (N);
+  if (sad != SAD)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vect_recog_sad_pattern: detected" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+
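To make the transformation concrete: for a loop like `foo` in the testcase above, the pattern recognizer collapses the widening/subtract/abs/accumulate statement sequence into a single ternary pattern statement. The dump below is a schematic sketch, not output from an actual compiler run:

```
  # loop body before pattern recognition
  x_t = X[i];
  y_t = Y[i];
  x_T = (int) x_t;
  y_T = (int) y_t;
  diff = x_T - y_T;
  abs_diff = ABS_EXPR <diff>;
  sum_1 = abs_diff + sum_0;

  # after vect_recog_sad_pattern: one pattern statement
  patt_sum = SAD_EXPR <x_t, y_t, sum_0>;
```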
diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c
index 8b66791..d689cac 100644
--- a/gcc/tree-cfg.c
+++ b/gcc/tree-cfg.c
@@ -3797,6 +3797,7 @@ verify_gimple_assign_ternary (gimple stmt)
       return false;
 
     case DOT_PROD_EXPR:
+    case SAD_EXPR:
     case REALIGN_LOAD_EXPR:
       /* FIXME.  */
       return false;
diff --git a/gcc/tree-inline.c b/gcc/tree-inline.c
index 2221b9c..44261a3 100644
--- a/gcc/tree-inline.c
+++ b/gcc/tree-inline.c
@@ -3601,6 +3601,7 @@ estimate_operator_cost (enum tree_code code, eni_weights *weights,
     case WIDEN_SUM_EXPR:
     case WIDEN_MULT_EXPR:
     case DOT_PROD_EXPR:
+    case SAD_EXPR:
     case WIDEN_MULT_PLUS_EXPR:
     case WIDEN_MULT_MINUS_EXPR:
     case WIDEN_LSHIFT_EXPR:
diff --git a/gcc/tree-ssa-operands.c b/gcc/tree-ssa-operands.c
index 603f797..393efc3 100644
--- a/gcc/tree-ssa-operands.c
+++ b/gcc/tree-ssa-operands.c
@@ -854,6 +854,7 @@ get_expr_operands (gimple stmt, tree *expr_p, int flags)
       }
 
     case DOT_PROD_EXPR:
+    case SAD_EXPR:
     case REALIGN_LOAD_EXPR:
     case WIDEN_MULT_PLUS_EXPR:
     case WIDEN_MULT_MINUS_EXPR:
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 638b981..89aa8c7 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -3607,6 +3607,7 @@ get_initial_def_for_reduction (gimple stmt, tree init_val,
     {
       case WIDEN_SUM_EXPR:
       case DOT_PROD_EXPR:
+      case SAD_EXPR:
       case PLUS_EXPR:
       case MINUS_EXPR:
       case BIT_IOR_EXPR:
diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
index 0a4e812..7919449 100644
--- a/gcc/tree-vect-patterns.c
+++ b/gcc/tree-vect-patterns.c
@@ -45,6 +45,8 @@ static gimple vect_recog_widen_mult_pattern (vec<gimple> *, tree *,
 					     tree *);
 static gimple vect_recog_dot_prod_pattern (vec<gimple> *, tree *,
 					   tree *);
+static gimple vect_recog_sad_pattern (vec<gimple> *, tree *,
+				      tree *);
 static gimple vect_recog_pow_pattern (vec<gimple> *, tree *, tree *);
 static gimple vect_recog_over_widening_pattern (vec<gimple> *, tree *,
                                                  tree *);
@@ -62,6 +64,7 @@ static vect_recog_func_ptr vect_vect_recog_func_ptrs[NUM_PATTERNS] = {
 	vect_recog_widen_mult_pattern,
 	vect_recog_widen_sum_pattern,
 	vect_recog_dot_prod_pattern,
+        vect_recog_sad_pattern,
 	vect_recog_pow_pattern,
 	vect_recog_widen_shift_pattern,
 	vect_recog_over_widening_pattern,
@@ -140,9 +143,8 @@ vect_single_imm_use (gimple def_stmt)
 }
 
 /* Check whether NAME, an ssa-name used in USE_STMT,
-   is a result of a type promotion or demotion, such that:
+   is a result of a type promotion, such that:
      DEF_STMT: NAME = NOP (name0)
-   where the type of name0 (ORIG_TYPE) is smaller/bigger than the type of NAME.
    If CHECK_SIGN is TRUE, check that either both types are signed or both are
    unsigned.  */
 
@@ -189,10 +191,8 @@ type_conversion_p (tree name, gimple use_stmt, bool check_sign,
 
   if (TYPE_PRECISION (type) >= (TYPE_PRECISION (*orig_type) * 2))
     *promotion = true;
-  else if (TYPE_PRECISION (*orig_type) >= (TYPE_PRECISION (type) * 2))
-    *promotion = false;
   else
-    return false;
+    *promotion = false;
 
   if (!vect_is_simple_use (oprnd0, *def_stmt, loop_vinfo,
 			   bb_vinfo, &dummy_gimple, &dummy, &dt))
@@ -433,6 +433,242 @@ vect_recog_dot_prod_pattern (vec<gimple> *stmts, tree *type_in,
 }
 
 
+/* Function vect_recog_sad_pattern
+
+   Try to find the following Sum of Absolute Difference (SAD) pattern:
+
+     unsigned type x_t, y_t;
+     signed TYPE1 diff, abs_diff;
+     TYPE2 sum = init;
+   loop:
+     sum_0 = phi <init, sum_1>
+     S1  x_t = ...
+     S2  y_t = ...
+     S3  x_T = (TYPE1) x_t;
+     S4  y_T = (TYPE1) y_t;
+     S5  diff = x_T - y_T;
+     S6  abs_diff = ABS_EXPR <diff>;
+     [S7  abs_diff = (TYPE2) abs_diff;  #optional]
+     S8  sum_1 = abs_diff + sum_0;
+
+   where 'TYPE1' is at least double the size of type 'type', and 'TYPE2' is
+   the same size as 'TYPE1' or bigger.  This is a special case of a
+   reduction computation.
+
+   Input:
+
+   * STMTS: Contains a stmt from which the pattern search begins.  In the
+   example, when this function is called with S8, the pattern
+   {S3,S4,S5,S6,S7,S8} will be detected.
+
+   Output:
+
+   * TYPE_IN: The type of the input arguments to the pattern.
+
+   * TYPE_OUT: The type of the output of this pattern.
+
+   * Return value: A new stmt that will be used to replace the sequence of
+   stmts that constitute the pattern. In this case it will be:
+        SAD_EXPR <x_t, y_t, sum_0>
+  */
+
+static gimple
+vect_recog_sad_pattern (vec<gimple> *stmts, tree *type_in,
+			     tree *type_out)
+{
+  gimple last_stmt = (*stmts)[0];
+  tree sad_oprnd0, sad_oprnd1;
+  stmt_vec_info stmt_vinfo = vinfo_for_stmt (last_stmt);
+  tree half_type;
+  loop_vec_info loop_info = STMT_VINFO_LOOP_VINFO (stmt_vinfo);
+  struct loop *loop;
+  bool promotion;
+
+  if (!loop_info)
+    return NULL;
+
+  loop = LOOP_VINFO_LOOP (loop_info);
+
+  if (!is_gimple_assign (last_stmt))
+    return NULL;
+
+  tree sum_type = gimple_expr_type (last_stmt);
+
+  /* Look for the following pattern
+          DX = (TYPE1) X;
+          DY = (TYPE1) Y;
+          DDIFF = DX - DY;
+          DAD = ABS_EXPR <DDIFF>;
+          [DAD = (TYPE2) DAD;  #optional]
+          sum_1 = DAD + sum_0;
+     In which
+     - DX is at least double the size of X
+     - DY is at least double the size of Y
+     - DX, DY, DDIFF, DAD all have the same type
+     - sum is the same size as DAD or bigger
+     - sum has been recognized as a reduction variable.
+
+     This is equivalent to:
+       DDIFF = X w- Y;          #widen sub
+       DAD = ABS_EXPR <DDIFF>;
+       sum_1 = DAD w+ sum_0;    #widen summation
+     or
+       DDIFF = X w- Y;          #widen sub
+       DAD = ABS_EXPR <DDIFF>;
+       sum_1 = DAD + sum_0;     #summation
+   */
+
+  /* Starting from LAST_STMT, follow the defs of its uses in search
+     of the above pattern.  */
+
+  if (gimple_assign_rhs_code (last_stmt) != PLUS_EXPR)
+    return NULL;
+
+  tree plus_oprnd0, plus_oprnd1;
+
+  if (STMT_VINFO_IN_PATTERN_P (stmt_vinfo))
+    {
+      /* Has been detected as widening-summation?  */
+
+      gimple stmt = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+      sum_type = gimple_expr_type (stmt);
+      if (gimple_assign_rhs_code (stmt) != WIDEN_SUM_EXPR)
+        return NULL;
+      plus_oprnd0 = gimple_assign_rhs1 (stmt);
+      plus_oprnd1 = gimple_assign_rhs2 (stmt);
+      half_type = TREE_TYPE (plus_oprnd0);
+    }
+  else
+    {
+      gimple def_stmt;
+
+      if (STMT_VINFO_DEF_TYPE (stmt_vinfo) != vect_reduction_def)
+        return NULL;
+      plus_oprnd0 = gimple_assign_rhs1 (last_stmt);
+      plus_oprnd1 = gimple_assign_rhs2 (last_stmt);
+      if (!types_compatible_p (TREE_TYPE (plus_oprnd0), sum_type)
+	  || !types_compatible_p (TREE_TYPE (plus_oprnd1), sum_type))
+        return NULL;
+
+      /* The type conversion could be promotion, demotion,
+         or just signed -> unsigned.  */
+      if (type_conversion_p (plus_oprnd0, last_stmt, false,
+                             &half_type, &def_stmt, &promotion))
+        plus_oprnd0 = gimple_assign_rhs1 (def_stmt);
+      else
+        half_type = sum_type;
+    }
+
+  /* So far so good.  Since last_stmt was detected as a (summation) reduction,
+     we know that plus_oprnd1 is the reduction variable (defined by a loop-header
+     phi), and plus_oprnd0 is an ssa-name defined by a stmt in the loop body.
+     Then check that plus_oprnd0 is defined by an abs_expr.  */
+
+  if (TREE_CODE (plus_oprnd0) != SSA_NAME)
+    return NULL;
+
+  tree abs_type = half_type;
+  gimple abs_stmt = SSA_NAME_DEF_STMT (plus_oprnd0);
+
+  /* It cannot be a SAD pattern if the abs_stmt is outside the loop.  */
+  if (!gimple_bb (abs_stmt) || !flow_bb_inside_loop_p (loop, gimple_bb (abs_stmt)))
+    return NULL;
+
+  /* FORNOW.  Can continue analyzing the def-use chain when this stmt is a phi
+     inside the loop (in case we are analyzing an outer-loop).  */
+  if (!is_gimple_assign (abs_stmt))
+    return NULL;
+
+  stmt_vec_info abs_stmt_vinfo = vinfo_for_stmt (abs_stmt);
+  gcc_assert (abs_stmt_vinfo);
+  if (STMT_VINFO_DEF_TYPE (abs_stmt_vinfo) != vect_internal_def)
+    return NULL;
+  if (gimple_assign_rhs_code (abs_stmt) != ABS_EXPR)
+    return NULL;
+
+  tree abs_oprnd = gimple_assign_rhs1 (abs_stmt);
+  if (!types_compatible_p (TREE_TYPE (abs_oprnd), abs_type))
+    return NULL;
+  if (TYPE_UNSIGNED (abs_type))
+    return NULL;
+
+  /* We then detect if the operand of abs_expr is defined by a minus_expr.  */
+
+  if (TREE_CODE (abs_oprnd) != SSA_NAME)
+    return NULL;
+
+  gimple diff_stmt = SSA_NAME_DEF_STMT (abs_oprnd);
+
+  /* It cannot be a SAD pattern if the diff_stmt is outside the loop.  */
+  if (!gimple_bb (diff_stmt)
+      || !flow_bb_inside_loop_p (loop, gimple_bb (diff_stmt)))
+    return NULL;
+
+  /* FORNOW.  Can continue analyzing the def-use chain when this stmt is a phi
+     inside the loop (in case we are analyzing an outer-loop).  */
+  if (!is_gimple_assign (diff_stmt))
+    return NULL;
+
+  stmt_vec_info diff_stmt_vinfo = vinfo_for_stmt (diff_stmt);
+  gcc_assert (diff_stmt_vinfo);
+  if (STMT_VINFO_DEF_TYPE (diff_stmt_vinfo) != vect_internal_def)
+    return NULL;
+  if (gimple_assign_rhs_code (diff_stmt) != MINUS_EXPR)
+    return NULL;
+
+  tree half_type0, half_type1;
+  gimple def_stmt;
+
+  tree minus_oprnd0 = gimple_assign_rhs1 (diff_stmt);
+  tree minus_oprnd1 = gimple_assign_rhs2 (diff_stmt);
+
+  if (!types_compatible_p (TREE_TYPE (minus_oprnd0), abs_type)
+      || !types_compatible_p (TREE_TYPE (minus_oprnd1), abs_type))
+    return NULL;
+  if (!type_conversion_p (minus_oprnd0, diff_stmt, false,
+                          &half_type0, &def_stmt, &promotion)
+      || !promotion)
+    return NULL;
+  sad_oprnd0 = gimple_assign_rhs1 (def_stmt);
+
+  if (!type_conversion_p (minus_oprnd1, diff_stmt, false,
+                          &half_type1, &def_stmt, &promotion)
+      || !promotion)
+    return NULL;
+  sad_oprnd1 = gimple_assign_rhs1 (def_stmt);
+
+  if (!types_compatible_p (half_type0, half_type1))
+    return NULL;
+  if (!TYPE_UNSIGNED (half_type0))
+    return NULL;
+  if (TYPE_PRECISION (abs_type) < TYPE_PRECISION (half_type0) * 2
+      || TYPE_PRECISION (sum_type) < TYPE_PRECISION (half_type0) * 2)
+    return NULL;
+
+  *type_in = TREE_TYPE (sad_oprnd0);
+  *type_out = sum_type;
+
+  /* Pattern detected. Create a stmt to be used to replace the pattern: */
+  tree var = vect_recog_temp_ssa_var (sum_type, NULL);
+  gimple pattern_stmt = gimple_build_assign_with_ops
+                          (SAD_EXPR, var, sad_oprnd0, sad_oprnd1, plus_oprnd1);
+
+  if (dump_enabled_p ())
+    {
+      dump_printf_loc (MSG_NOTE, vect_location,
+                       "vect_recog_sad_pattern: detected: ");
+      dump_gimple_stmt (MSG_NOTE, TDF_SLIM, pattern_stmt, 0);
+      dump_printf (MSG_NOTE, "\n");
+    }
+
+  /* We don't allow changing the order of the computation in the inner-loop
+     when doing outer-loop vectorization.  */
+  gcc_assert (!nested_in_vect_loop_p (loop, last_stmt));
+
+  return pattern_stmt;
+}
+
+
 /* Handle widening operation by a constant.  At the moment we support MULT_EXPR
    and LSHIFT_EXPR.
 
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 8b7b345..0aac75b 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1044,7 +1044,7 @@ extern void vect_slp_transform_bb (basic_block);
    Additional pattern recognition functions can (and will) be added
    in the future.  */
 typedef gimple (* vect_recog_func_ptr) (vec<gimple> *, tree *, tree *);
-#define NUM_PATTERNS 11
+#define NUM_PATTERNS 12
 void vect_pattern_recog (loop_vec_info, bb_vec_info);
 
 /* In tree-vectorizer.c.  */
diff --git a/gcc/tree.def b/gcc/tree.def
index 88c850a..31a3b64 100644
--- a/gcc/tree.def
+++ b/gcc/tree.def
@@ -1146,6 +1146,15 @@ DEFTREECODE (REDUC_PLUS_EXPR, "reduc_plus_expr", tcc_unary, 1)
         arg3 = WIDEN_SUM_EXPR (tmp, arg3);		 */
 DEFTREECODE (DOT_PROD_EXPR, "dot_prod_expr", tcc_expression, 3)
 
+/* Widening SAD (sum of absolute differences).
+   The first two arguments are of type t1, which should be an unsigned
+   integer type.  The third argument and the result are of type t2, such
+   that t2 is at least twice the size of t1.  SAD_EXPR (arg1, arg2, arg3)
+   is equivalent to:
+	tmp1 = WIDEN_MINUS_EXPR (arg1, arg2);
+	tmp2 = ABS_EXPR (tmp1);
+	arg3 = PLUS_EXPR (tmp2, arg3);		 */
+DEFTREECODE (SAD_EXPR, "sad_expr", tcc_expression, 3)
+
 /* Widening summation.
    The first argument is of type t1.
    The second argument is of type t2, such that t2 is at least twice

Thread overview: 27+ messages
2013-10-31 11:26 [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer Uros Bizjak
2013-11-01  2:04 ` Cong Hou
2013-11-01  7:43   ` Uros Bizjak
2013-11-01 10:17   ` James Greenhalgh
2013-11-01 16:49     ` Cong Hou
2013-11-04 10:06       ` James Greenhalgh
2013-11-04 18:34         ` Cong Hou
2013-11-05 10:03           ` James Greenhalgh
2013-11-05 18:14             ` Cong Hou
2013-11-08  6:42               ` Cong Hou
2013-11-08 11:30                 ` James Greenhalgh
2013-11-11 21:22                   ` Cong Hou
2013-11-14  7:50                     ` Cong Hou
2013-11-15 18:47                       ` Cong Hou
2013-11-20 18:59                         ` Cong Hou
2013-12-03  1:07                     ` Cong Hou
2013-12-17 18:04                       ` Cong Hou
2014-06-23 23:44                         ` Cong Hou
2014-06-24  7:36                           ` Richard Biener
2014-06-24 11:19                       ` Richard Biener
2014-06-25  2:04                         ` Cong Hou
  -- strict thread matches above, loose matches on Subject: below --
2013-10-29 23:05 Cong Hou
2013-10-30  0:09 ` Ramana Radhakrishnan
2013-10-31  1:10   ` Cong Hou
2013-10-31  3:18     ` Ramana Radhakrishnan
2013-10-30 12:16 ` Richard Biener
2013-10-31  0:50   ` Cong Hou
