public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH GCC]Improve alias check code generation in vectorizer
@ 2016-06-13 10:06 Bin Cheng
  2016-06-14 12:42 ` Richard Biener
  0 siblings, 1 reply; 2+ messages in thread
From: Bin Cheng @ 2016-06-13 10:06 UTC (permalink / raw)
  To: gcc-patches; +Cc: nd

[-- Attachment #1: Type: text/plain, Size: 3317 bytes --]

Hi,
Take subroutine "DACOP" from spec2k/200.sixtrack as an example: the loop needs to be versioned for vectorization because of possible aliasing.  The alias checks for the data-reference pairs look like:

//pair 1
dr_a:
(Data Ref: 
  bb: 8 
  stmt: _92 = da.cc[_27];
  ref: da.cc[_27];
)
dr_b:
(Data Ref: 
  bb: 8 
  stmt: da.cc[_93] = _92;
  ref: da.cc[_93];
)
//pair 2
dr_a:
(Data Ref: 
  bb: 8 
  stmt: pretmp_29 = da.i2[_27];
  ref: da.i2[_27];
)
dr_b:
(Data Ref: 
  bb: 8 
  stmt: da.i2[_93] = pretmp_29;
  ref: da.i2[_93];
)
//pair 3
dr_a:
(Data Ref: 
  bb: 8 
  stmt: pretmp_28 = da.i1[_27];
  ref: da.i1[_27];
)
dr_b:
(Data Ref: 
  bb: 8 
  stmt: da.i1[_93] = pretmp_28;
  ref: da.i1[_93];
)
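All three pairs share the base object `da`, which is what the patch exploits.  As a rough illustration of the loop shape (a hedged sketch only: the real DACOP is Fortran, and the struct layout, array size, and function below are invented for this example):

```c
#include <assert.h>

#define N 64  /* illustrative size, not the real sixtrack layout */

/* Invented struct mirroring the three arrays in the dumps above.  */
static struct {
  double cc[N];
  int i2[N];
  int i1[N];
} da;

/* Each iteration reads element src+k and writes element dst+k of the
   same arrays.  With src and dst unknown at compile time, the ranges
   may overlap, so the vectorizer must emit runtime alias checks and
   version the loop.  */
static void
dacop_like (int src, int dst, int n)
{
  for (int k = 0; k < n; k++)
    {
      da.cc[dst + k] = da.cc[src + k];
      da.i2[dst + k] = da.i2[src + k];
      da.i1[dst + k] = da.i1[src + k];
    }
}
```

Because every pair shares the same DR_BASE_ADDRESS (`&da`), the overlap checks can be reduced to comparisons of the offset/index parts, which is the simplification the patch performs.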

The code generated for the alias checks is as below:

  <bb 23>:
  # iy_186 = PHI <_413(22), 2(2)>
  # ivtmp_1050 = PHI <ivtmp_1049(22), 512(2)>
  _155 = iy_186 + -2;
  _156 = _155 * 516;
  _241 = iy_186 + -1;
  _242 = _241 * 516;
  _328 = iy_186 * 516;
  _413 = iy_186 + 1;
  _414 = _413 * 516;
  _499 = iy_186 + 2;
  _500 = _499 * 516;
  _998 = iy_186 * 516;
  _997 = (sizetype) _998;
  _996 = _997 + 6;
  _995 = _996 * 4;
  _994 = global_Output.2_16 + _995;
  _993 = iy_186 * 516;
  _992 = (long unsigned int) _993;
  _991 = _992 * 4;
  _990 = _991 + 18446744073709547488;
  _989 = global_Input.0_153 + _990;
  _886 = _989 >= _994;
  _885 = iy_186 * 516;
  _884 = (sizetype) _885;
  _883 = _884 + 1040;
  _882 = _883 * 4;
  _881 = global_Input.0_153 + _882;
  _880 = (sizetype) _998;
  _879 = _880 + 2;
  _878 = _879 * 4;
  _877 = global_Output.2_16 + _878;
  _876 = _877 >= _881;
  _875 = _876 | _886;
  _874 = iy_186 * 516;
  _873 = (sizetype) _874;
  _872 = _873 + 514;
  _871 = _872 * 4;
  _870 = global_Output.2_16 + _871;
  _869 = local_Filter_33 >= _870;
  _868 = local_Filter_33 + 100;
  _867 = (sizetype) _874;
  _866 = _867 + 2;
  _865 = _866 * 4;
  _864 = global_Output.2_16 + _865;
  _863 = _864 >= _868;
  _862 = _863 | _869;
  _861 = _862 & _875;
  if (_861 != 0)
    goto <bb 7>;
  else
    goto <bb 4>;

It contains quite a lot of redundant computation.  The root cause is that the vectorizer simply translates the alias checks into comparisons of full address expressions, and the CSE opportunities are missed by the folder.  This patch improves function vect_create_cond_for_alias_checks by simplifying the comparison: it compares the DR_BASE_ADDRESS/DR_INIT parts of both data-references at compilation time.  It also simplifies conditions of the form:
  (addr_a_min + addr_a_length) <= addr_b_min || (addr_b_min + addr_b_length) <= addr_a_min
into the form:
  cond_expr = addr_b_min - addr_a_min
  cond_expr >= addr_a_length || cond_expr <= -addr_b_length
if the comparison is done in a signed type.  This can be further simplified by the folder if addr_a_length and addr_b_length are equal or constant, which is quite common.
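In scalar C terms the rewrite above looks like the following sketch (invented names; signed `long` stands in for the signed offset type, and the equivalence holds as long as the subtraction does not overflow, which is why the rewrite is restricted to signed types):

```c
#include <assert.h>

/* Original form: two independent address comparisons.  */
static int
no_alias_two_cmp (long a_min, long len_a, long b_min, long len_b)
{
  return a_min + len_a <= b_min || b_min + len_b <= a_min;
}

/* Rewritten form: one shared difference.  When len_a == len_b (the
   common case) the folder can simplify the disjunction further,
   since both arms then test the same length against diff.  */
static int
no_alias_one_diff (long a_min, long len_a, long b_min, long len_b)
{
  long diff = b_min - a_min;
  return diff >= len_a || diff <= -len_b;
}
```

The forms agree term by term: `diff >= len_a` is `a_min + len_a <= b_min` rearranged, and `diff <= -len_b` is `b_min + len_b <= a_min` rearranged.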
I looked into the generated assembly; this patch does introduce small regressions in some cases, but overall I think it's a win.  Bootstrapped and tested on x86_64 and AArch64.  Is it OK?
Thanks,
bin

2016-06-08  Bin Cheng  <bin.cheng@arm.com>

	* tree-vect-loop-manip.c (vect_create_cond_for_alias_checks): New
	parameter.  Simplify alias check conditions at compilation time.
	(vect_loop_versioning): Pass new argument to above function.

[-- Attachment #2: alias-check-condition-20160611.txt --]
[-- Type: text/plain, Size: 8218 bytes --]

diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 438458e..b38a6e4 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -2211,17 +2211,22 @@ vect_create_cond_for_align_checks (loop_vec_info loop_vinfo,
 
    Output:
    COND_EXPR - conditional expression.
+   COND_EXPR_STMT_LIST - statements needed to construct the conditional
+                         expression.
 
    The returned COND_EXPR is the conditional expression to be used in the if
    statement that controls which version of the loop gets executed at runtime.
 */
 
 void
-vect_create_cond_for_alias_checks (loop_vec_info loop_vinfo, tree * cond_expr)
+vect_create_cond_for_alias_checks (loop_vec_info loop_vinfo,
+				   tree * cond_expr,
+				   gimple_seq *cond_expr_stmt_list)
 {
   vec<dr_with_seg_len_pair_t> comp_alias_ddrs =
     LOOP_VINFO_COMP_ALIAS_DDRS (loop_vinfo);
   tree part_cond_expr;
+  gimple_seq seq;
 
   /* Create expression
      ((store_ptr_0 + store_segment_length_0) <= load_ptr_0)
@@ -2237,21 +2242,17 @@ vect_create_cond_for_alias_checks (loop_vec_info loop_vinfo, tree * cond_expr)
 
   for (size_t i = 0, s = comp_alias_ddrs.length (); i < s; ++i)
     {
+      enum tree_code code;
+      tree type_a, type_b, type_offset = ssizetype;
       const dr_with_seg_len& dr_a = comp_alias_ddrs[i].first;
       const dr_with_seg_len& dr_b = comp_alias_ddrs[i].second;
       tree segment_length_a = dr_a.seg_len;
       tree segment_length_b = dr_b.seg_len;
       tree addr_base_a = DR_BASE_ADDRESS (dr_a.dr);
       tree addr_base_b = DR_BASE_ADDRESS (dr_b.dr);
+      tree init_a = DR_INIT (dr_a.dr), init_b = DR_INIT (dr_b.dr);
       tree offset_a = DR_OFFSET (dr_a.dr), offset_b = DR_OFFSET (dr_b.dr);
 
-      offset_a = fold_build2 (PLUS_EXPR, TREE_TYPE (offset_a),
-			      offset_a, DR_INIT (dr_a.dr));
-      offset_b = fold_build2 (PLUS_EXPR, TREE_TYPE (offset_b),
-			      offset_b, DR_INIT (dr_b.dr));
-      addr_base_a = fold_build_pointer_plus (addr_base_a, offset_a);
-      addr_base_b = fold_build_pointer_plus (addr_base_b, offset_b);
-
       if (dump_enabled_p ())
 	{
 	  dump_printf_loc (MSG_NOTE, vect_location,
@@ -2262,31 +2263,140 @@ vect_create_cond_for_alias_checks (loop_vec_info loop_vinfo, tree * cond_expr)
 	  dump_printf (MSG_NOTE, "\n");
 	}
 
-      tree seg_a_min = addr_base_a;
-      tree seg_a_max = fold_build_pointer_plus (addr_base_a, segment_length_a);
       /* For negative step, we need to adjust address range by TYPE_SIZE_UNIT
 	 bytes, e.g., int a[3] -> a[1] range is [a+4, a+16) instead of
-	 [a, a+12) */
+	 [a, a+12).  */
       if (tree_int_cst_compare (DR_STEP (dr_a.dr), size_zero_node) < 0)
 	{
 	  tree unit_size = TYPE_SIZE_UNIT (TREE_TYPE (DR_REF (dr_a.dr)));
-	  seg_a_min = fold_build_pointer_plus (seg_a_max, unit_size);
-	  seg_a_max = fold_build_pointer_plus (addr_base_a, unit_size);
+	  init_a = fold_build2 (PLUS_EXPR, type_offset, init_a,
+				fold_convert (type_offset, unit_size));
 	}
-
-      tree seg_b_min = addr_base_b;
-      tree seg_b_max = fold_build_pointer_plus (addr_base_b, segment_length_b);
       if (tree_int_cst_compare (DR_STEP (dr_b.dr), size_zero_node) < 0)
 	{
 	  tree unit_size = TYPE_SIZE_UNIT (TREE_TYPE (DR_REF (dr_b.dr)));
-	  seg_b_min = fold_build_pointer_plus (seg_b_max, unit_size);
-	  seg_b_max = fold_build_pointer_plus (addr_base_b, unit_size);
+	  init_b = fold_build2 (PLUS_EXPR, type_offset, init_b,
+				fold_convert (type_offset, unit_size));
 	}
 
-      part_cond_expr =
-      	fold_build2 (TRUTH_OR_EXPR, boolean_type_node,
-	  fold_build2 (LE_EXPR, boolean_type_node, seg_a_max, seg_b_min),
-	  fold_build2 (LE_EXPR, boolean_type_node, seg_b_max, seg_a_min));
+      /* No need to consider init parts if they are equal in both DRs.  */
+      if (wi::to_widest (init_a) != wi::to_widest (init_b))
+	{
+	  if (wi::to_widest (init_a) < wi::to_widest (init_b))
+	    {
+	      init_b = fold_build2 (MINUS_EXPR, type_offset, init_b, init_a);
+	      init_a = fold_convert (type_offset, integer_zero_node);
+	    }
+	  else
+	    {
+	      init_a = fold_build2 (MINUS_EXPR, type_offset, init_a, init_b);
+	      init_b = fold_convert (type_offset, integer_zero_node);
+	    }
+
+	  offset_a = fold_build2 (PLUS_EXPR, type_offset, offset_a, init_a);
+	  offset_b = fold_build2 (PLUS_EXPR, type_offset, offset_b, init_b);
+	}
+
+      /* No need to consider base parts if they are equal in both DRs.  */
+      if (operand_equal_p (addr_base_a, addr_base_b, 0))
+	{
+	  code = PLUS_EXPR;
+	  type_a = type_offset;
+	  type_b = type_offset;
+	  addr_base_a = offset_a;
+	  addr_base_b = offset_b;
+	}
+      else
+	{
+	  type_offset = sizetype;
+	  code = POINTER_PLUS_EXPR;
+	  type_a = TREE_TYPE (addr_base_a);
+	  type_b = TREE_TYPE (addr_base_b);
+	  addr_base_a = fold_build_pointer_plus (addr_base_a, offset_a);
+	  addr_base_b = fold_build_pointer_plus (addr_base_b, offset_b);
+	}
+
+      segment_length_a = fold_convert (type_offset, segment_length_a);
+      segment_length_b = fold_convert (type_offset, segment_length_b);
+
+      addr_base_a = force_gimple_operand (addr_base_a, &seq, true, NULL);
+      gimple_seq_add_seq (cond_expr_stmt_list, seq);
+      addr_base_b = force_gimple_operand (addr_base_b, &seq, true, NULL);
+      gimple_seq_add_seq (cond_expr_stmt_list, seq);
+
+      tree seg_a_min;
+      tree seg_a_max;
+      if (tree_int_cst_compare (DR_STEP (dr_a.dr), size_zero_node) < 0)
+	{
+	  seg_a_min = fold_build2 (code, type_a,
+				   addr_base_a, segment_length_a);
+	  seg_a_max = addr_base_a;
+	  /* We need absolute value of segment length later.  */
+	  segment_length_a = fold_build2 (MINUS_EXPR, type_offset,
+					  integer_zero_node, segment_length_a);
+	}
+      else
+	{
+	  seg_a_min = addr_base_a;
+	  seg_a_max = fold_build2 (code, type_a,
+				    addr_base_a, segment_length_a);
+	}
+
+      tree seg_b_min;
+      tree seg_b_max;
+      if (tree_int_cst_compare (DR_STEP (dr_b.dr), size_zero_node) < 0)
+	{
+	  seg_b_min = fold_build2 (code, type_b,
+				   addr_base_b, segment_length_b);
+	  seg_b_max = addr_base_b;
+	  /* We need absolute value of segment length later.  */
+	  segment_length_b = fold_build2 (MINUS_EXPR, type_offset,
+					  integer_zero_node, segment_length_b);
+	}
+      else
+	{
+	  seg_b_min = addr_base_b;
+	  seg_b_max = fold_build2 (code, type_b,
+				   addr_base_b, segment_length_b);
+	}
+
+      if (!TYPE_UNSIGNED (type_offset))
+	{
+	  /* If comparison is done in signed type, below condition checks:
+
+	       (seg_a_max <= seg_b_min || seg_b_max <= seg_a_min)
+
+	     can be simplified into:
+
+	       cond_expr = seg_b_min - seg_a_min;
+	       (cond_expr >= segment_length_a
+		|| cond_expr <= -segment_length_b)
+
+	     It can be further simplified by folder if
+
+	       (segment_length_a == segment_length_b)
+
+	     holds, which is common.  */
+	  tree diff = fold_build2 (MINUS_EXPR, type_offset,
+				   seg_b_min, seg_a_min);
+	  diff = force_gimple_operand (diff, &seq, true, NULL);
+	  gimple_seq_add_seq (cond_expr_stmt_list, seq);
+	  segment_length_b = fold_build2 (MINUS_EXPR, type_offset,
+					  integer_zero_node, segment_length_b);
+	  tree expr1 = fold_build2 (GE_EXPR, boolean_type_node,
+				    diff, segment_length_a);
+	  tree expr2 = fold_build2 (LE_EXPR, boolean_type_node,
+				    diff, segment_length_b);
+	  part_cond_expr = fold_build2 (TRUTH_OR_EXPR,
+					boolean_type_node, expr1, expr2);
+	}
+      else
+	{
+	  part_cond_expr =
+	    fold_build2 (TRUTH_OR_EXPR, boolean_type_node,
+	      fold_build2 (LE_EXPR, boolean_type_node, seg_a_max, seg_b_min),
+	      fold_build2 (LE_EXPR, boolean_type_node, seg_b_max, seg_a_min));
+	}
 
       if (*cond_expr)
 	*cond_expr = fold_build2 (TRUTH_AND_EXPR, boolean_type_node,
@@ -2356,7 +2466,8 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
 				       &cond_expr_stmt_list);
 
   if (version_alias)
-    vect_create_cond_for_alias_checks (loop_vinfo, &cond_expr);
+    vect_create_cond_for_alias_checks (loop_vinfo, &cond_expr,
+				       &cond_expr_stmt_list);
 
   cond_expr = force_gimple_operand_1 (cond_expr, &gimplify_stmt_list,
 				      is_gimple_condexpr, NULL_TREE);

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: [PATCH GCC]Improve alias check code generation in vectorizer
  2016-06-13 10:06 [PATCH GCC]Improve alias check code generation in vectorizer Bin Cheng
@ 2016-06-14 12:42 ` Richard Biener
  0 siblings, 0 replies; 2+ messages in thread
From: Richard Biener @ 2016-06-14 12:42 UTC (permalink / raw)
  To: Bin Cheng; +Cc: gcc-patches, nd

On Mon, Jun 13, 2016 at 12:06 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
> Hi,
> Take subroutine "DACOP" from spec2k/200.sixtrack as an example: the loop needs to be versioned for vectorization because of possible aliasing.  The alias checks for the data-reference pairs look like:
>
> [data-reference dumps and generated alias-check code quoted from the message above snipped]
>
> It contains quite a lot of redundant computation.  The root cause is that the vectorizer simply translates the alias checks into comparisons of full address expressions, and the CSE opportunities are missed by the folder.  This patch improves function vect_create_cond_for_alias_checks by simplifying the comparison: it compares the DR_BASE_ADDRESS/DR_INIT parts of both data-references at compilation time.  It also simplifies conditions of the form:
>   (addr_a_min + addr_a_length) <= addr_b_min || (addr_b_min + addr_b_length) <= addr_a_min
> into the form:
>   cond_expr = addr_b_min - addr_a_min
>   cond_expr >= addr_a_length || cond_expr <= -addr_b_length
> if the comparison is done in a signed type.  This can be further simplified by the folder if addr_a_length and addr_b_length are equal or constant, which is quite common.
> I looked into the generated assembly; this patch does introduce small regressions in some cases, but overall I think it's a win.  Bootstrapped and tested on x86_64 and AArch64.  Is it OK?

Hmm.  I think it's obviously good to make the generated expressions
smaller.  Ideally we'd know the dependence distance
here (but have added them to the runtime checks because the distance
was symbolic - dependence analysis doesn't give us this information).
This means we often can't optimize the check to i0 - i1 >= 4 but
retain useless widening/multiplications.  Does your patch help with
those?

int cmp = tree_int_cst_compare (init_a, init_b);

+      /* No need to consider init parts if they are equal in both DRs.  */
+      if (wi::to_widest (init_a) != wi::to_widest (init_b))

cmp != 0

+       {
+         if (wi::to_widest (init_a) < wi::to_widest (init_b))

cmp < 0

+           {
....

this code seems to mimic some comparison folding we have?  Does that
not trigger for some reason?

That said, not adding the address-base will make that not trigger if
the remainder of ops are unsigned.

Doing intermittent force_gimple_operand will make fold not see the
whole trees and thus simplifications get lost.
In fact you're mixing gimple stmt building and generic expr building -
ideally you'd use gimple_build (...) for
all of it (but that doesn't work as all the DR_ components are GENERIC).

So I think we have to be careful with this kind of micro-optimization
patch.  If you'd extracted the "indices" to
compare somehow that would be great.

Another option would be to use tree-affine to simplify the compares.

Richard.



> Thanks,
> bin
>
> 2016-06-08  Bin Cheng  <bin.cheng@arm.com>
>
>         * tree-vect-loop-manip.c (vect_create_cond_for_alias_checks): New
>         parameter.  Simplify alias check conditions at compilation time.
>         (vect_loop_versioning): Pass new argument to above function.

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2016-06-14 12:42 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-13 10:06 [PATCH GCC]Improve alias check code generation in vectorizer Bin Cheng
2016-06-14 12:42 ` Richard Biener

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).