public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH PR68030/PR69710][RFC]Introduce a simple local CSE interface and use it in vectorizer
@ 2016-05-25 13:03 Bin Cheng
  2016-05-27 12:07 ` Richard Biener
  2016-07-13 20:59 ` Jeff Law
  0 siblings, 2 replies; 11+ messages in thread
From: Bin Cheng @ 2016-05-25 13:03 UTC (permalink / raw)
  To: gcc-patches; +Cc: nd

[-- Attachment #1: Type: text/plain, Size: 3433 bytes --]

Hi,
As analyzed in PR68303 and PR69710, vectorizer generates duplicated computations in loop's pre-header basic block when creating base address for vector reference to the same memory object.  Because the duplicated code is out of loop, IVOPT fails to track base object for these vector references, resulting in missed strength reduction.
It's agreed that vectorizer should be improved to generate optimal (IVOPT-friendly) code, the difficult part is we want a generic infrastructure.  After investigation, I tried to introduce a generic/simple local CSE interface by reusing existing algorithm/data-structure from tree-ssa-dom (tree-ssa-scopedtables).  The interface runs local CSE for each basic block in a bitmap, customers of this interface only need to record basic blocks in the bitmap when necessary.  Note we don't need scopedtables' unwinding facility since the interface runs only for single basic block, this should be good in terms of compilation time.
Besides CSE issue, this patch also re-associates address expressions in vect_create_addr_base_for_vector_ref, specifically, it splits constant offset and adds it back near the expression root in IR.  This is necessary because GCC only handles re-association for commutative operators in CSE.

I checked its impact on various test cases.
With this patch, PR68030's generated assembly is reduced from ~750 lines to ~580 lines on x86_64, with both pre-header and loop body simplified.  But,
1) It doesn't fix all the problem on x86_64.  Root cause is computation for base address of the first reference is somehow moved outside of loop's pre-header, local CSE can't help in this case.  Though split_constant_offset can back track ssa def chain, it causes possible redundant when there is no CSE opportunities in pre-header.
2) It causes regression for PR68030 on AArch64.  I think the regression is caused by IVOPT issues which are exposed by this patch.  Checks on offset validity in get_address_cost is wrong/inaccurate now.  It considers an offset as valid if it's within the maximum offset range that backend supports.  This is not true, for example, AArch64 requires aligned offset additionally.  For example, LDR [base + 2060] is invalid for V4SFmode, although "2060" is within the maximum offset range.  Another issue is also in get_address_cost.  Inaccurate cost is computed for "base + offset + INDEX" address expression.  When register pressure is low, "base+offset" can be hoisted out and we can use [base + INDEX] addressing mode, whichhis is current behavior.

Bootstrap and test on x86_64 and AArch64.  Any comments appreciated.

Thanks,
bin

2016-05-17 Bin Cheng  <bin.cheng@arm.com>

        PR tree-optimization/68030
        PR tree-optimization/69710
        * tree-ssa-dom.c (cse_bbs): New function.
        * tree-ssa-dom.h (cse_bbs): New declaration.
        * tree-vect-data-refs.c (vect_create_addr_base_for_vector_ref):
        Re-associate address by splitting constant offset.
        (vect_create_data_ref_ptr, vect_setup_realignment): Record changed
        basic block.
        * tree-vect-loop-manip.c (vect_gen_niters_for_prolog_loop): Record
        changed basic block.
        * tree-vectorizer.c (tree-ssa-dom.h): Include header file.
        (changed_bbs): New variable.
        (vectorize_loops): Allocate and free CHANGED_BBS.  Call cse_bbs.
        * tree-vectorizer.h (changed_bbs): New declaration.

[-- Attachment #2: pr69710-20160523.txt --]
[-- Type: text/plain, Size: 7145 bytes --]

diff --git a/gcc/tree-ssa-dom.c b/gcc/tree-ssa-dom.c
index 8bf5b3c..b2a0b0c 100644
--- a/gcc/tree-ssa-dom.c
+++ b/gcc/tree-ssa-dom.c
@@ -2090,3 +2090,50 @@ lookup_avail_expr (gimple *stmt, bool insert,
 
   return lhs;
 }
+
+/* A local CSE interface which runs CSE for basic blocks recorded in
+   CHANGED_BBS.  */
+
+void
+cse_bbs (bitmap changed_bbs)
+{
+  unsigned index;
+  bitmap_iterator bi;
+  gimple_stmt_iterator gsi;
+
+  hash_table<expr_elt_hasher> *avail_exprs;
+  class avail_exprs_stack *avail_exprs_stack;
+  class const_and_copies *const_and_copies;
+
+  avail_exprs = new hash_table<expr_elt_hasher> (1024);
+  avail_exprs_stack = new class avail_exprs_stack (avail_exprs);
+  const_and_copies = new class const_and_copies ();
+
+  threadedge_initialize_values ();
+  /* Push a marker on the stacks of local information so that we know how
+     far to unwind when we finalize this block.  */
+  avail_exprs_stack->push_marker ();
+  const_and_copies->push_marker ();
+
+  EXECUTE_IF_SET_IN_BITMAP (changed_bbs, 0, index, bi)
+    {
+      basic_block bb = BASIC_BLOCK_FOR_FN (cfun, index);
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "\n\nRun local cse on block #%d\n\n", bb->index);
+
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	optimize_stmt (bb, gsi, const_and_copies, avail_exprs_stack);
+
+      /* Pop stacks to keep it small.  */
+      avail_exprs_stack->pop_to_marker ();
+      const_and_copies->pop_to_marker ();
+    }
+
+  delete avail_exprs;
+  avail_exprs = NULL;
+
+  delete avail_exprs_stack;
+  delete const_and_copies;
+  threadedge_finalize_values ();
+}
diff --git a/gcc/tree-ssa-dom.h b/gcc/tree-ssa-dom.h
index e6abe65..0ab1ec9 100644
--- a/gcc/tree-ssa-dom.h
+++ b/gcc/tree-ssa-dom.h
@@ -24,5 +24,6 @@ extern bool simple_iv_increment_p (gimple *);
 extern void record_temporary_equivalences (edge,
 					   class const_and_copies *,
 					   class avail_exprs_stack *);
+extern void cse_bbs (bitmap);
 
 #endif /* GCC_TREE_SSA_DOM_H */
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index 7652e21..ed5ef90 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -4108,23 +4108,27 @@ vect_create_addr_base_for_vector_ref (gimple *stmt,
       base_name = get_name (DR_REF (dr));
     }
 
-  /* Create base_offset */
-  base_offset = size_binop (PLUS_EXPR,
-			    fold_convert (sizetype, base_offset),
-			    fold_convert (sizetype, init));
+  base_offset = fold_convert (sizetype, base_offset);
+  init = fold_convert (sizetype, init);
 
   if (offset)
     {
       offset = fold_build2 (MULT_EXPR, sizetype,
 			    fold_convert (sizetype, offset), step);
-      base_offset = fold_build2 (PLUS_EXPR, sizetype,
-				 base_offset, offset);
+      if (TREE_CODE (offset) == INTEGER_CST)
+	init = fold_build2 (PLUS_EXPR, sizetype, init, offset);
+      else
+	base_offset = fold_build2 (PLUS_EXPR, sizetype,
+				   base_offset, offset);
     }
   if (byte_offset)
     {
       byte_offset = fold_convert (sizetype, byte_offset);
-      base_offset = fold_build2 (PLUS_EXPR, sizetype,
-				 base_offset, byte_offset);
+      if (TREE_CODE (byte_offset) == INTEGER_CST)
+	init = fold_build2 (PLUS_EXPR, sizetype, init, byte_offset);
+      else
+	base_offset = fold_build2 (PLUS_EXPR, sizetype,
+				   base_offset, byte_offset);
     }
 
   /* base + base_offset */
@@ -4138,6 +4142,10 @@ vect_create_addr_base_for_vector_ref (gimple *stmt,
     }
 
   vect_ptr_type = build_pointer_type (STMT_VINFO_VECTYPE (stmt_info));
+  addr_base = force_gimple_operand (addr_base, &seq, true, NULL);
+  gimple_seq_add_seq (new_stmt_list, seq);
+  /* Add constant offset at last.  */
+  addr_base = fold_build_pointer_plus (addr_base, init);
   dest = vect_get_new_vect_var (vect_ptr_type, vect_pointer_var, base_name);
   addr_base = force_gimple_operand (addr_base, &seq, true, dest);
   gimple_seq_add_seq (new_stmt_list, seq);
@@ -4371,6 +4379,7 @@ vect_create_data_ref_ptr (gimple *stmt, tree aggr_type, struct loop *at_loop,
         {
           new_bb = gsi_insert_seq_on_edge_immediate (pe, new_stmt_list);
           gcc_assert (!new_bb);
+          bitmap_set_bit (changed_bbs, pe->src->index);
         }
       else
         gsi_insert_seq_before (gsi, new_stmt_list, GSI_SAME_STMT);
@@ -5081,9 +5090,10 @@ vect_setup_realignment (gimple *stmt, gimple_stmt_iterator *gsi,
 							NULL_TREE, loop);
           if (loop)
             {
-   	      pe = loop_preheader_edge (loop);
+	      pe = loop_preheader_edge (loop);
 	      new_bb = gsi_insert_seq_on_edge_immediate (pe, stmts);
 	      gcc_assert (!new_bb);
+	      bitmap_set_bit (changed_bbs, pe->src->index);
             }
           else
              gsi_insert_seq_before (gsi, stmts, GSI_SAME_STMT);
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 7ec6dae..2b93048 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1895,10 +1895,11 @@ vect_gen_niters_for_prolog_loop (loop_vec_info loop_vinfo, tree loop_niters, int
 
       new_bb = gsi_insert_seq_on_edge_immediate (pe, new_stmts);
       gcc_assert (!new_bb);
+      bitmap_set_bit (changed_bbs, pe->src->index);
 
       /* Create:  byte_misalign = addr & (vectype_align - 1)  */
       byte_misalign =
-        fold_build2 (BIT_AND_EXPR, type, fold_convert (type, start_addr), 
+        fold_build2 (BIT_AND_EXPR, type, fold_convert (type, start_addr),
                      vectype_align_minus_1);
 
       /* Create:  elem_misalign = byte_misalign / element_size  */
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 2b25b45..9441121 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -73,6 +73,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "cfgloop.h"
 #include "tree-vectorizer.h"
 #include "tree-ssa-propagate.h"
+#include "tree-ssa-dom.h"
 #include "dbgcnt.h"
 #include "tree-scalar-evolution.h"
 
@@ -82,6 +83,9 @@ source_location vect_location;
 
 /* Vector mapping GIMPLE stmt to stmt_vec_info. */
 vec<stmt_vec_info> stmt_vec_info_vec;
+
+/* Basic blocks should be cleaned up by CSE after vectorization.  */
+bitmap changed_bbs;
 \f
 /* For mapping simduid to vectorization factor.  */
 
@@ -508,6 +512,7 @@ vectorize_loops (void)
     note_simd_array_uses (&simd_array_to_simduid_htab);
 
   init_stmt_vec_info_vec ();
+  changed_bbs = BITMAP_ALLOC (NULL);
 
   /*  ----------- Analyze loops. -----------  */
 
@@ -619,6 +624,9 @@ vectorize_loops (void)
       loop->aux = NULL;
     }
 
+  if (!bitmap_empty_p (changed_bbs))
+    cse_bbs (changed_bbs);
+  BITMAP_FREE (changed_bbs);
   free_stmt_vec_info_vec ();
 
   /* Fold IFN_GOMP_SIMD_{VF,LANE,LAST_LANE,ORDERED_{START,END}} builtins.  */
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index bd1d55a..3cd2d96 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -695,6 +695,7 @@ struct dataref_aux {
 #define MAX_VECTORIZATION_FACTOR 64
 
 extern vec<stmt_vec_info> stmt_vec_info_vec;
+extern bitmap changed_bbs;
 
 void init_stmt_vec_info_vec (void);
 void free_stmt_vec_info_vec (void);

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH PR68030/PR69710][RFC]Introduce a simple local CSE interface and use it in vectorizer
  2016-05-25 13:03 [PATCH PR68030/PR69710][RFC]Introduce a simple local CSE interface and use it in vectorizer Bin Cheng
@ 2016-05-27 12:07 ` Richard Biener
  2016-05-27 12:15   ` Bin.Cheng
  2016-07-13 20:59 ` Jeff Law
  1 sibling, 1 reply; 11+ messages in thread
From: Richard Biener @ 2016-05-27 12:07 UTC (permalink / raw)
  To: Bin Cheng; +Cc: gcc-patches, nd

On Wed, May 25, 2016 at 1:22 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
> Hi,
> As analyzed in PR68303 and PR69710, vectorizer generates duplicated computations in loop's pre-header basic block when creating base address for vector reference to the same memory object.  Because the duplicated code is out of loop, IVOPT fails to track base object for these vector references, resulting in missed strength reduction.
> It's agreed that vectorizer should be improved to generate optimal (IVOPT-friendly) code, the difficult part is we want a generic infrastructure.  After investigation, I tried to introduce a generic/simple local CSE interface by reusing existing algorithm/data-structure from tree-ssa-dom (tree-ssa-scopedtables).  The interface runs local CSE for each basic block in a bitmap, customers of this interface only need to record basic blocks in the bitmap when necessary.  Note we don't need scopedtables' unwinding facility since the interface runs only for single basic block, this should be good in terms of compilation time.
> Besides CSE issue, this patch also re-associates address expressions in vect_create_addr_base_for_vector_ref, specifically, it splits constant offset and adds it back near the expression root in IR.  This is necessary because GCC only handles re-association for commutative operators in CSE.
>
> I checked its impact on various test cases.
> With this patch, PR68030's generated assembly is reduced from ~750 lines to ~580 lines on x86_64, with both pre-header and loop body simplified.  But,
> 1) It doesn't fix all the problem on x86_64.  Root cause is computation for base address of the first reference is somehow moved outside of loop's pre-header, local CSE can't help in this case.  Though split_constant_offset can back track ssa def chain, it causes possible redundant when there is no CSE opportunities in pre-header.
> 2) It causes regression for PR68030 on AArch64.  I think the regression is caused by IVOPT issues which are exposed by this patch.  Checks on offset validity in get_address_cost is wrong/inaccurate now.  It considers an offset as valid if it's within the maximum offset range that backend supports.  This is not true, for example, AArch64 requires aligned offset additionally.  For example, LDR [base + 2060] is invalid for V4SFmode, although "2060" is within the maximum offset range.  Another issue is also in get_address_cost.  Inaccurate cost is computed for "base + offset + INDEX" address expression.  When register pressure is low, "base+offset" can be hoisted out and we can use [base + INDEX] addressing mode, whichhis is current behavior.
>
> Bootstrap and test on x86_64 and AArch64.  Any comments appreciated.

It looks quite straight-forward with the caveat that it has one
obvious piece that is not in the order
of the complexity of a basic-block.  threadedge_initialize_values
creates the SSA value array
which is zero-initialized (upon use).  That's probably a non-issue for
the use you propose for
the vectorizer (call cse_bbs once per function).  As Ideally I would
like this facility to replace
the loop unrollers own propagate_constants_for_unrolling it might
become an issue though.
In that regard the unroller facility is also more powerful because it
handles regions (including
PHIs).

Note that in particular for SLP vectorization the vectorizer itself
may end up creating quite
some redundancies so I wonder if it's worth to CSE the vectorized loop
body as well
(and given we have PHIs eventually CSE the whole vectorized loop with
pre-header as a region...)

Thanks,
Richard.

> Thanks,
> bin
>
> 2016-05-17 Bin Cheng  <bin.cheng@arm.com>
>
>         PR tree-optimization/68030
>         PR tree-optimization/69710
>         * tree-ssa-dom.c (cse_bbs): New function.
>         * tree-ssa-dom.h (cse_bbs): New declaration.
>         * tree-vect-data-refs.c (vect_create_addr_base_for_vector_ref):
>         Re-associate address by splitting constant offset.
>         (vect_create_data_ref_ptr, vect_setup_realignment): Record changed
>         basic block.
>         * tree-vect-loop-manip.c (vect_gen_niters_for_prolog_loop): Record
>         changed basic block.
>         * tree-vectorizer.c (tree-ssa-dom.h): Include header file.
>         (changed_bbs): New variable.
>         (vectorize_loops): Allocate and free CHANGED_BBS.  Call cse_bbs.
>         * tree-vectorizer.h (changed_bbs): New declaration.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH PR68030/PR69710][RFC]Introduce a simple local CSE interface and use it in vectorizer
  2016-05-27 12:07 ` Richard Biener
@ 2016-05-27 12:15   ` Bin.Cheng
  2016-05-27 13:01     ` Richard Biener
  0 siblings, 1 reply; 11+ messages in thread
From: Bin.Cheng @ 2016-05-27 12:15 UTC (permalink / raw)
  To: Richard Biener; +Cc: Bin Cheng, gcc-patches

On Fri, May 27, 2016 at 11:45 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Wed, May 25, 2016 at 1:22 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>> Hi,
>> As analyzed in PR68303 and PR69710, vectorizer generates duplicated computations in loop's pre-header basic block when creating base address for vector reference to the same memory object.  Because the duplicated code is out of loop, IVOPT fails to track base object for these vector references, resulting in missed strength reduction.
>> It's agreed that vectorizer should be improved to generate optimal (IVOPT-friendly) code, the difficult part is we want a generic infrastructure.  After investigation, I tried to introduce a generic/simple local CSE interface by reusing existing algorithm/data-structure from tree-ssa-dom (tree-ssa-scopedtables).  The interface runs local CSE for each basic block in a bitmap, customers of this interface only need to record basic blocks in the bitmap when necessary.  Note we don't need scopedtables' unwinding facility since the interface runs only for single basic block, this should be good in terms of compilation time.
>> Besides CSE issue, this patch also re-associates address expressions in vect_create_addr_base_for_vector_ref, specifically, it splits constant offset and adds it back near the expression root in IR.  This is necessary because GCC only handles re-association for commutative operators in CSE.
>>
>> I checked its impact on various test cases.
>> With this patch, PR68030's generated assembly is reduced from ~750 lines to ~580 lines on x86_64, with both pre-header and loop body simplified.  But,
>> 1) It doesn't fix all the problem on x86_64.  Root cause is computation for base address of the first reference is somehow moved outside of loop's pre-header, local CSE can't help in this case.  Though split_constant_offset can back track ssa def chain, it causes possible redundant when there is no CSE opportunities in pre-header.
>> 2) It causes regression for PR68030 on AArch64.  I think the regression is caused by IVOPT issues which are exposed by this patch.  Checks on offset validity in get_address_cost is wrong/inaccurate now.  It considers an offset as valid if it's within the maximum offset range that backend supports.  This is not true, for example, AArch64 requires aligned offset additionally.  For example, LDR [base + 2060] is invalid for V4SFmode, although "2060" is within the maximum offset range.  Another issue is also in get_address_cost.  Inaccurate cost is computed for "base + offset + INDEX" address expression.  When register pressure is low, "base+offset" can be hoisted out and we can use [base + INDEX] addressing mode, whichhis is current behavior.
>>
>> Bootstrap and test on x86_64 and AArch64.  Any comments appreciated.
>
> It looks quite straight-forward with the caveat that it has one
> obvious piece that is not in the order
> of the complexity of a basic-block.  threadedge_initialize_values
> creates the SSA value array
I noticed this too, and think it's better to get rid of this init/fini
functions by some kind re-design.  I found it's quite weird to call
threadege_X in tree-vrp.c.  I will keep investigating this.
> which is zero-initialized (upon use).  That's probably a non-issue for
> the use you propose for
> the vectorizer (call cse_bbs once per function).  As Ideally I would
> like this facility to replace
> the loop unrollers own propagate_constants_for_unrolling it might
> become an issue though.
> In that regard the unroller facility is also more powerful because it
> handles regions (including
> PHIs).
With the current data-structure, I think it's not very hard to extend
the interface to regions.  I will keep investigating this too.  BTW,
if it's okay, I tend to do that in following patches.
>
> Note that in particular for SLP vectorization the vectorizer itself
> may end up creating quite
> some redundancies so I wonder if it's worth to CSE the vectorized loop
> body as well
Maybe.  The next step is condition block created by vectorizer.  Both
prune_runtime_alias_test_list and generated alias checks are
sub-optimal now, even worse, somehow computations in alias checks can
be propagated to loop pre-header.  With help of this interface, alias
checks (and later code) can be simplified.

> (and given we have PHIs eventually CSE the whole vectorized loop with
> pre-header as a region...)
>
> Thanks,
> Richard.
>
>> Thanks,
>> bin
>>
>> 2016-05-17 Bin Cheng  <bin.cheng@arm.com>
>>
>>         PR tree-optimization/68030
>>         PR tree-optimization/69710
>>         * tree-ssa-dom.c (cse_bbs): New function.
>>         * tree-ssa-dom.h (cse_bbs): New declaration.
>>         * tree-vect-data-refs.c (vect_create_addr_base_for_vector_ref):
>>         Re-associate address by splitting constant offset.
>>         (vect_create_data_ref_ptr, vect_setup_realignment): Record changed
>>         basic block.
>>         * tree-vect-loop-manip.c (vect_gen_niters_for_prolog_loop): Record
>>         changed basic block.
>>         * tree-vectorizer.c (tree-ssa-dom.h): Include header file.
>>         (changed_bbs): New variable.
>>         (vectorize_loops): Allocate and free CHANGED_BBS.  Call cse_bbs.
>>         * tree-vectorizer.h (changed_bbs): New declaration.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH PR68030/PR69710][RFC]Introduce a simple local CSE interface and use it in vectorizer
  2016-05-27 12:15   ` Bin.Cheng
@ 2016-05-27 13:01     ` Richard Biener
  2016-06-06 11:23       ` Bin.Cheng
  2016-06-21 21:54       ` Jeff Law
  0 siblings, 2 replies; 11+ messages in thread
From: Richard Biener @ 2016-05-27 13:01 UTC (permalink / raw)
  To: Bin.Cheng, Jeff Law; +Cc: Bin Cheng, gcc-patches

On Fri, May 27, 2016 at 1:11 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Fri, May 27, 2016 at 11:45 AM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Wed, May 25, 2016 at 1:22 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>> Hi,
>>> As analyzed in PR68303 and PR69710, vectorizer generates duplicated computations in loop's pre-header basic block when creating base address for vector reference to the same memory object.  Because the duplicated code is out of loop, IVOPT fails to track base object for these vector references, resulting in missed strength reduction.
>>> It's agreed that vectorizer should be improved to generate optimal (IVOPT-friendly) code, the difficult part is we want a generic infrastructure.  After investigation, I tried to introduce a generic/simple local CSE interface by reusing existing algorithm/data-structure from tree-ssa-dom (tree-ssa-scopedtables).  The interface runs local CSE for each basic block in a bitmap, customers of this interface only need to record basic blocks in the bitmap when necessary.  Note we don't need scopedtables' unwinding facility since the interface runs only for single basic block, this should be good in terms of compilation time.
>>> Besides CSE issue, this patch also re-associates address expressions in vect_create_addr_base_for_vector_ref, specifically, it splits constant offset and adds it back near the expression root in IR.  This is necessary because GCC only handles re-association for commutative operators in CSE.
>>>
>>> I checked its impact on various test cases.
>>> With this patch, PR68030's generated assembly is reduced from ~750 lines to ~580 lines on x86_64, with both pre-header and loop body simplified.  But,
>>> 1) It doesn't fix all the problem on x86_64.  Root cause is computation for base address of the first reference is somehow moved outside of loop's pre-header, local CSE can't help in this case.  Though split_constant_offset can back track ssa def chain, it causes possible redundant when there is no CSE opportunities in pre-header.
>>> 2) It causes regression for PR68030 on AArch64.  I think the regression is caused by IVOPT issues which are exposed by this patch.  Checks on offset validity in get_address_cost is wrong/inaccurate now.  It considers an offset as valid if it's within the maximum offset range that backend supports.  This is not true, for example, AArch64 requires aligned offset additionally.  For example, LDR [base + 2060] is invalid for V4SFmode, although "2060" is within the maximum offset range.  Another issue is also in get_address_cost.  Inaccurate cost is computed for "base + offset + INDEX" address expression.  When register pressure is low, "base+offset" can be hoisted out and we can use [base + INDEX] addressing mode, whichhis is current behavior.
>>>
>>> Bootstrap and test on x86_64 and AArch64.  Any comments appreciated.
>>
>> It looks quite straight-forward with the caveat that it has one
>> obvious piece that is not in the order
>> of the complexity of a basic-block.  threadedge_initialize_values
>> creates the SSA value array
> I noticed this too, and think it's better to get rid of this init/fini
> functions by some kind re-design.  I found it's quite weird to call
> threadege_X in tree-vrp.c.  I will keep investigating this.
>> which is zero-initialized (upon use).  That's probably a non-issue for
>> the use you propose for
>> the vectorizer (call cse_bbs once per function).  As Ideally I would
>> like this facility to replace
>> the loop unrollers own propagate_constants_for_unrolling it might
>> become an issue though.
>> In that regard the unroller facility is also more powerful because it
>> handles regions (including
>> PHIs).
> With the current data-structure, I think it's not very hard to extend
> the interface to regions.  I will keep investigating this too.  BTW,
> if it's okay, I tend to do that in following patches.

I'm fine with doing enhancements to this in followup patches (adding Jeff
to CC for his opinions).

>>
>> Note that in particular for SLP vectorization the vectorizer itself
>> may end up creating quite
>> some redundancies so I wonder if it's worth to CSE the vectorized loop
>> body as well
> Maybe.  The next step is condition block created by vectorizer.  Both
> prune_runtime_alias_test_list and generated alias checks are
> sub-optimal now, even worse, somehow computations in alias checks can
> be propagated to loop pre-header.  With help of this interface, alias
> checks (and later code) can be simplified.

Yeah.

Thanks,
Richard.

>> (and given we have PHIs eventually CSE the whole vectorized loop with
>> pre-header as a region...)
>>
>> Thanks,
>> Richard.
>>
>>> Thanks,
>>> bin
>>>
>>> 2016-05-17 Bin Cheng  <bin.cheng@arm.com>
>>>
>>>         PR tree-optimization/68030
>>>         PR tree-optimization/69710
>>>         * tree-ssa-dom.c (cse_bbs): New function.
>>>         * tree-ssa-dom.h (cse_bbs): New declaration.
>>>         * tree-vect-data-refs.c (vect_create_addr_base_for_vector_ref):
>>>         Re-associate address by splitting constant offset.
>>>         (vect_create_data_ref_ptr, vect_setup_realignment): Record changed
>>>         basic block.
>>>         * tree-vect-loop-manip.c (vect_gen_niters_for_prolog_loop): Record
>>>         changed basic block.
>>>         * tree-vectorizer.c (tree-ssa-dom.h): Include header file.
>>>         (changed_bbs): New variable.
>>>         (vectorize_loops): Allocate and free CHANGED_BBS.  Call cse_bbs.
>>>         * tree-vectorizer.h (changed_bbs): New declaration.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH PR68030/PR69710][RFC]Introduce a simple local CSE interface and use it in vectorizer
  2016-05-27 13:01     ` Richard Biener
@ 2016-06-06 11:23       ` Bin.Cheng
  2016-06-21 21:58         ` Jeff Law
  2016-06-21 21:54       ` Jeff Law
  1 sibling, 1 reply; 11+ messages in thread
From: Bin.Cheng @ 2016-06-06 11:23 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jeff Law, Bin Cheng, gcc-patches

On Fri, May 27, 2016 at 12:56 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Fri, May 27, 2016 at 1:11 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Fri, May 27, 2016 at 11:45 AM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Wed, May 25, 2016 at 1:22 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>> Hi,
>>>> As analyzed in PR68303 and PR69710, vectorizer generates duplicated computations in loop's pre-header basic block when creating base address for vector reference to the same memory object.  Because the duplicated code is out of loop, IVOPT fails to track base object for these vector references, resulting in missed strength reduction.
>>>> It's agreed that vectorizer should be improved to generate optimal (IVOPT-friendly) code, the difficult part is we want a generic infrastructure.  After investigation, I tried to introduce a generic/simple local CSE interface by reusing existing algorithm/data-structure from tree-ssa-dom (tree-ssa-scopedtables).  The interface runs local CSE for each basic block in a bitmap, customers of this interface only need to record basic blocks in the bitmap when necessary.  Note we don't need scopedtables' unwinding facility since the interface runs only for single basic block, this should be good in terms of compilation time.
>>>> Besides CSE issue, this patch also re-associates address expressions in vect_create_addr_base_for_vector_ref, specifically, it splits constant offset and adds it back near the expression root in IR.  This is necessary because GCC only handles re-association for commutative operators in CSE.
>>>>
>>>> I checked its impact on various test cases.
>>>> With this patch, PR68030's generated assembly is reduced from ~750 lines to ~580 lines on x86_64, with both pre-header and loop body simplified.  But,
>>>> 1) It doesn't fix all the problem on x86_64.  Root cause is computation for base address of the first reference is somehow moved outside of loop's pre-header, local CSE can't help in this case.  Though split_constant_offset can back track ssa def chain, it causes possible redundant when there is no CSE opportunities in pre-header.
>>>> 2) It causes regression for PR68030 on AArch64.  I think the regression is caused by IVOPT issues which are exposed by this patch.  Checks on offset validity in get_address_cost is wrong/inaccurate now.  It considers an offset as valid if it's within the maximum offset range that backend supports.  This is not true, for example, AArch64 requires aligned offset additionally.  For example, LDR [base + 2060] is invalid for V4SFmode, although "2060" is within the maximum offset range.  Another issue is also in get_address_cost.  Inaccurate cost is computed for "base + offset + INDEX" address expression.  When register pressure is low, "base+offset" can be hoisted out and we can use [base + INDEX] addressing mode, whichhis is current behavior.
>>>>
>>>> Bootstrap and test on x86_64 and AArch64.  Any comments appreciated.
>>>
>>> It looks quite straight-forward with the caveat that it has one
>>> obvious piece that is not in the order
>>> of the complexity of a basic-block.  threadedge_initialize_values
>>> creates the SSA value array
>> I noticed this too, and think it's better to get rid of this init/fini
>> functions by some kind re-design.  I found it's quite weird to call
>> threadege_X in tree-vrp.c.  I will keep investigating this.
>>> which is zero-initialized (upon use).  That's probably a non-issue for
>>> the use you propose for
>>> the vectorizer (call cse_bbs once per function).  As Ideally I would
>>> like this facility to replace
>>> the loop unrollers own propagate_constants_for_unrolling it might
>>> become an issue though.
>>> In that regard the unroller facility is also more powerful because it
>>> handles regions (including
>>> PHIs).
>> With the current data-structure, I think it's not very hard to extend
>> the interface to regions.  I will keep investigating this too.  BTW,
>> if it's okay, I tend to do that in following patches.
>
> I'm fine with doing enhancements to this in followup patches (adding Jeff
> to CC for his opinions).
Hi,
I further investigated the impact of this patch, and believe my
previous conclusion still holds.  On targets that don't support [base
+ index + offset] addressing mode, it may causes IVO generating worse
code than now.  I found and investigated a new case showing exactly
the same problem.  Take AArch64 as an example, before this patch, IVO
computes cost and chooses candidate as below:

Loop preheader:
    vect_p_1 = ...
    vect_p_2 = ...
Loop header:
    MEM[vect_p_1 + VAR]
    MEM[vect_p_2 + VAR]

After this patch, cse opportunities are realized between vect_p_1 and
vect_p_2, and IVO computes cost for the second MEM_REF differently as
below:
Loop preheader:
    vect_p_1 = ....
    vect_p_2 = vect_1_p + offset
Loop header:
    MEM[vect_p_1 + VAR]
    MEM[vect_p_1 + VAR + offset]

On x86, it's optimal because the addressing mode is supported.  On
AArch64, cost computed for [vect_p_1 + VAR + offset] is too large.  As
a result, VAR is not chosen, and more candidates are chosen.  The
inaccurate cost is computed because IVO relies on LEGITIMIZE_ADDRESS
hook, as well as use buffering when constructing rtx address
expression.  IVO should be aware of that "vect_p_1 + offset" can be
computed as an invariant just like before this patch.
I am doing experiments to rewrite address computation in IVO, in the
meantime, I collected various benchmarks data on AArch64 and there is
no big regression caused by the mentioned issue.

Hi Jeff,
What's your opinion on this (and how to extend it to a region based
interface)?  I will update this patch for further review if you are
okay.

Thanks,
bin
>
>>>
>>> Note that in particular for SLP vectorization the vectorizer itself
>>> may end up creating quite
>>> some redundancies so I wonder if it's worth to CSE the vectorized loop
>>> body as well
>> Maybe.  The next step is condition block created by vectorizer.  Both
>> prune_runtime_alias_test_list and generated alias checks are
>> sub-optimal now, even worse, somehow computations in alias checks can
>> be propagated to loop pre-header.  With help of this interface, alias
>> checks (and later code) can be simplified.
>
> Yeah.
>
> Thanks,
> Richard.
>
>>> (and given we have PHIs eventually CSE the whole vectorized loop with
>>> pre-header as a region...)
>>>
>>> Thanks,
>>> Richard.
>>>
>>>> Thanks,
>>>> bin
>>>>
>>>> 2016-05-17 Bin Cheng  <bin.cheng@arm.com>
>>>>
>>>>         PR tree-optimization/68030
>>>>         PR tree-optimization/69710
>>>>         * tree-ssa-dom.c (cse_bbs): New function.
>>>>         * tree-ssa-dom.h (cse_bbs): New declaration.
>>>>         * tree-vect-data-refs.c (vect_create_addr_base_for_vector_ref):
>>>>         Re-associate address by splitting constant offset.
>>>>         (vect_create_data_ref_ptr, vect_setup_realignment): Record changed
>>>>         basic block.
>>>>         * tree-vect-loop-manip.c (vect_gen_niters_for_prolog_loop): Record
>>>>         changed basic block.
>>>>         * tree-vectorizer.c (tree-ssa-dom.h): Include header file.
>>>>         (changed_bbs): New variable.
>>>>         (vectorize_loops): Allocate and free CHANGED_BBS.  Call cse_bbs.
>>>>         * tree-vectorizer.h (changed_bbs): New declaration.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH PR68030/PR69710][RFC]Introduce a simple local CSE interface and use it in vectorizer
  2016-05-27 13:01     ` Richard Biener
  2016-06-06 11:23       ` Bin.Cheng
@ 2016-06-21 21:54       ` Jeff Law
  1 sibling, 0 replies; 11+ messages in thread
From: Jeff Law @ 2016-06-21 21:54 UTC (permalink / raw)
  To: Richard Biener, Bin.Cheng; +Cc: Bin Cheng, gcc-patches

On 05/27/2016 05:56 AM, Richard Biener wrote:
> On Fri, May 27, 2016 at 1:11 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Fri, May 27, 2016 at 11:45 AM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Wed, May 25, 2016 at 1:22 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>> Hi,
>>>> As analyzed in PR68303 and PR69710, vectorizer generates duplicated computations in loop's pre-header basic block when creating base address for vector reference to the same memory object.  Because the duplicated code is out of loop, IVOPT fails to track base object for these vector references, resulting in missed strength reduction.
>>>> It's agreed that vectorizer should be improved to generate optimal (IVOPT-friendly) code, the difficult part is we want a generic infrastructure.  After investigation, I tried to introduce a generic/simple local CSE interface by reusing existing algorithm/data-structure from tree-ssa-dom (tree-ssa-scopedtables).  The interface runs local CSE for each basic block in a bitmap, customers of this interface only need to record basic blocks in the bitmap when necessary.  Note we don't need scopedtables' unwinding facility since the interface runs only for single basic block, this should be good in terms of compilation time.
>>>> Besides CSE issue, this patch also re-associates address expressions in vect_create_addr_base_for_vector_ref, specifically, it splits constant offset and adds it back near the expression root in IR.  This is necessary because GCC only handles re-association for commutative operators in CSE.
>>>>
>>>> I checked its impact on various test cases.
>>>> With this patch, PR68030's generated assembly is reduced from ~750 lines to ~580 lines on x86_64, with both pre-header and loop body simplified.  But,
>>>> 1) It doesn't fix all the problem on x86_64.  Root cause is computation for base address of the first reference is somehow moved outside of loop's pre-header, local CSE can't help in this case.  Though split_constant_offset can back track ssa def chain, it causes possible redundant when there is no CSE opportunities in pre-header.
>>>> 2) It causes regression for PR68030 on AArch64.  I think the regression is caused by IVOPT issues which are exposed by this patch.  Checks on offset validity in get_address_cost is wrong/inaccurate now.  It considers an offset as valid if it's within the maximum offset range that backend supports.  This is not true, for example, AArch64 requires aligned offset additionally.  For example, LDR [base + 2060] is invalid for V4SFmode, although "2060" is within the maximum offset range.  Another issue is also in get_address_cost.  Inaccurate cost is computed for "base + offset + INDEX" address expression.  When register pressure is low, "base+offset" can be hoisted out and we can use [base + INDEX] addressing mode, whichhis is current behavior.
>>>>
>>>> Bootstrap and test on x86_64 and AArch64.  Any comments appreciated.
>>>
>>> It looks quite straight-forward with the caveat that it has one
>>> obvious piece that is not in the order
>>> of the complexity of a basic-block.  threadedge_initialize_values
>>> creates the SSA value array
>> I noticed this too, and think it's better to get rid of this init/fini
>> functions by some kind re-design.  I found it's quite weird to call
>> threadege_X in tree-vrp.c.  I will keep investigating this.
>>> which is zero-initialized (upon use).  That's probably a non-issue for
>>> the use you propose for
>>> the vectorizer (call cse_bbs once per function).  As Ideally I would
>>> like this facility to replace
>>> the loop unrollers own propagate_constants_for_unrolling it might
>>> become an issue though.
>>> In that regard the unroller facility is also more powerful because it
>>> handles regions (including
>>> PHIs).
>> With the current data-structure, I think it's not very hard to extend
>> the interface to regions.  I will keep investigating this too.  BTW,
>> if it's okay, I tend to do that in following patches.
>
> I'm fine with doing enhancements to this in followup patches (adding Jeff
> to CC for his opinions).
Likewise.  Ideally we'll get all the threading stuff out of DOM/VRP. 
VRP is probably the easiest as I think it'll fall-out of some work 
Andrew is doing.  Getting rid of the threading from DOM is more work, 
but nothing that (IMHO) is terribly complex, its just time consuming.

Jeff

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH PR68030/PR69710][RFC]Introduce a simple local CSE interface and use it in vectorizer
  2016-06-06 11:23       ` Bin.Cheng
@ 2016-06-21 21:58         ` Jeff Law
  0 siblings, 0 replies; 11+ messages in thread
From: Jeff Law @ 2016-06-21 21:58 UTC (permalink / raw)
  To: Bin.Cheng, Richard Biener; +Cc: Bin Cheng, gcc-patches

On 06/06/2016 05:23 AM, Bin.Cheng wrote:

> Hi Jeff,
> What's your opinion on this (and how to extend it to a region based
> interface)?  I will update this patch for further review if you are
> okay.
I've never pondered how to revamp DOM into a regional interface (though 
I have pondered how to revamp the threading bits into a regional 
interface -- without any real success).

Conceptually I think it's just a matter of setting an entry block/edge, 
computing all blocks dominated by that entry block/edge and limiting the 
walker to only visiting blocks in that set.  So the regional aspect 
would live in domwalk.[ch].

We'd probably want to avoid all the threading bits in a regional walk -- 
not because threading isn't useful, but sequencing the updates is a PITA 
and trying to do it in the middle of some other pass would just be a mess.

jeff

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH PR68030/PR69710][RFC]Introduce a simple local CSE interface and use it in vectorizer
  2016-05-25 13:03 [PATCH PR68030/PR69710][RFC]Introduce a simple local CSE interface and use it in vectorizer Bin Cheng
  2016-05-27 12:07 ` Richard Biener
@ 2016-07-13 20:59 ` Jeff Law
  2016-11-21 21:34   ` Doug Gilmore
  1 sibling, 1 reply; 11+ messages in thread
From: Jeff Law @ 2016-07-13 20:59 UTC (permalink / raw)
  To: Bin Cheng, gcc-patches; +Cc: nd

On 05/25/2016 05:22 AM, Bin Cheng wrote:
> Hi, As analyzed in PR68303 and PR69710, vectorizer generates
> duplicated computations in loop's pre-header basic block when
> creating base address for vector reference to the same memory object.
Not a huge surprise.  Loop optimizations generally have a tendency to 
create and/or expose CSE opportunities.   Unrolling is a common culprit, 
there's certainly the possibility for header duplication, code motions 
and IV rewriting to also expose/create redundant code.

We haven't really had great success squashing these redundancies within 
the loop optimizer itself.  This has lead to running CSE like passes 
after the loop optimizers or mini-CSE passes like we see here.


> Because the duplicated code is out of loop, IVOPT fails to track base
> object for these vector references, resulting in missed strength
> reduction. It's agreed that vectorizer should be improved to generate
> optimal (IVOPT-friendly) code, the difficult part is we want a
> generic infrastructure.  After investigation, I tried to introduce a
> generic/simple local CSE interface by reusing existing
> algorithm/data-structure from tree-ssa-dom (tree-ssa-scopedtables).
I believe you're the only person that's ever tried to re-use that 
infrastructure -- were there any API/usage patterns that would have 
worked better for your needs?

ISTM that conceptually if we broke out the basic hashing bits that's 
what you're most interested in re-using since you're really just 
building a local CSE.



> The interface runs local CSE for each basic block in a bitmap,
> customers of this interface only need to record basic blocks in the
> bitmap when necessary.  Note we don't need scopedtables' unwinding
> facility since the interface runs only for single basic block, this
> should be good in terms of compilation time.
Right.


>
> I checked its impact on various test cases. With this patch,
> PR68030's generated assembly is reduced from ~750 lines to ~580 lines
> on x86_64, with both pre-header and loop body simplified.
That's a good result :-)



  But, 1) It
> doesn't fix all the problem on x86_64.  Root cause is computation for
> base address of the first reference is somehow moved outside of
> loop's pre-header, local CSE can't help in this case.
That's a bid odd -- have you investigated why this is outside the loop 
header?

  Though
> split_constant_offset can back track ssa def chain, it causes
> possible redundant when there is no CSE opportunities in pre-header.
Is the problem that when you split you don't know if doing so will be 
profitable until you you're actually CSE-ing the pre-header?

> 2) It causes regression for PR68030 on AArch64.  I think the
> regression is caused by IVOPT issues which are exposed by this patch.
> Checks on offset validity in get_address_cost is wrong/inaccurate
> now.  It considers an offset as valid if it's within the maximum
> offset range that backend supports.  This is not true, for example,
> AArch64 requires aligned offset additionally.  For example, LDR [base
> + 2060] is invalid for V4SFmode, although "2060" is within the
> maximum offset range.  Another issue is also in get_address_cost.
> Inaccurate cost is computed for "base + offset + INDEX" address
> expression.  When register pressure is low, "base+offset" can be
> hoisted out and we can use [base + INDEX] addressing mode, whichhis
> is current behavior.
>

 > Bootstrap and test on x86_64 and AArch64.  Any comments appreciated.
 >
 > Thanks,
 > bin
 >
 > 2016-05-17 Bin Cheng  <bin.cheng@arm.com>
 >
 >         PR tree-optimization/68030
 >         PR tree-optimization/69710
 >         * tree-ssa-dom.c (cse_bbs): New function.
 >         * tree-ssa-dom.h (cse_bbs): New declaration.
 >         * tree-vect-data-refs.c (vect_create_addr_base_for_vector_ref):
 >         Re-associate address by splitting constant offset.
 >         (vect_create_data_ref_ptr, vect_setup_realignment): Record 
changed
 >         basic block.
 >         * tree-vect-loop-manip.c (vect_gen_niters_for_prolog_loop): 
Record
 >         changed basic block.
 >         * tree-vectorizer.c (tree-ssa-dom.h): Include header file.
 >         (changed_bbs): New variable.
 >         (vectorize_loops): Allocate and free CHANGED_BBS.  Call cse_bbs.
 >         * tree-vectorizer.h (changed_bbs): New declaration.
 >
 >
 > pr69710-20160523.txt
 >
 >
 > diff --git a/gcc/tree-ssa-dom.c b/gcc/tree-ssa-dom.c
 > index 8bf5b3c..b2a0b0c 100644
 > --- a/gcc/tree-ssa-dom.c
 > +++ b/gcc/tree-ssa-dom.c
 > @@ -2090,3 +2090,50 @@ lookup_avail_expr (gimple *stmt, bool insert,
 >
 >    return lhs;
 >  }
 > +
 > +/* A local CSE interface which runs CSE for basic blocks recorded in
 > +   CHANGED_BBS.  */
 > +
 > +void
 > +cse_bbs (bitmap changed_bbs)
 > +{
 > +  unsigned index;
 > +  bitmap_iterator bi;
 > +  gimple_stmt_iterator gsi;
 > +
 > +  hash_table<expr_elt_hasher> *avail_exprs;
 > +  class avail_exprs_stack *avail_exprs_stack;
 > +  class const_and_copies *const_and_copies;
 > +
 > +  avail_exprs = new hash_table<expr_elt_hasher> (1024);
 > +  avail_exprs_stack = new class avail_exprs_stack (avail_exprs);
 > +  const_and_copies = new class const_and_copies ();
 > +
 > +  threadedge_initialize_values ();
 > +  /* Push a marker on the stacks of local information so that we 
know how
 > +     far to unwind when we finalize this block.  */
 > +  avail_exprs_stack->push_marker ();
 > +  const_and_copies->push_marker ();
 > +
 > +  EXECUTE_IF_SET_IN_BITMAP (changed_bbs, 0, index, bi)
 > +    {
 > +      basic_block bb = BASIC_BLOCK_FOR_FN (cfun, index);
 > +
 > +      if (dump_file && (dump_flags & TDF_DETAILS))
 > +	fprintf (dump_file, "\n\nRun local cse on block #%d\n\n", bb->index);
 > +
 > +      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
 > +	optimize_stmt (bb, gsi, const_and_copies, avail_exprs_stack);
 > +
 > +      /* Pop stacks to keep it small.  */
 > +      avail_exprs_stack->pop_to_marker ();
 > +      const_and_copies->pop_to_marker ();
 > +    }
 > +
 > +  delete avail_exprs;
 > +  avail_exprs = NULL;
 > +
 > +  delete avail_exprs_stack;
 > +  delete const_and_copies;
 > +  threadedge_finalize_values ();
 > +}
No real objections here or how it's used in tree-vectorizer.c.  I'd like 
to get to a place where you didn't have to muck with the stacks or 
threadedge_* at all, but that to me is a cleanup item and not a 
requirement to move forward.

I'll assume the reassociations you do in tree-vect-data-refs are reasonable.

So how do you want to move forward on this?  Are the issues you noted 
above serious enough to warrant holding this up?

jeff

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH PR68030/PR69710][RFC]Introduce a simple local CSE interface and use it in vectorizer
  2016-07-13 20:59 ` Jeff Law
@ 2016-11-21 21:34   ` Doug Gilmore
  2016-11-22 16:07     ` Bin.Cheng
  0 siblings, 1 reply; 11+ messages in thread
From: Doug Gilmore @ 2016-11-21 21:34 UTC (permalink / raw)
  To: Jeff Law, Bin Cheng, gcc-patches; +Cc: nd

[-- Attachment #1: Type: text/plain, Size: 4512 bytes --]

I haven't seen any followups to this discussion of Bin's patch to
PR68303 and PR69710, the patch submission:
http://gcc.gnu.org/ml/gcc-patches/2016-05/msg02000.html

Discussion:
http://gcc.gnu.org/ml/gcc-patches/2016-07/msg00761.html
http://gcc.gnu.org/ml/gcc-patches/2016-06/msg01551.html
http://gcc.gnu.org/ml/gcc-patches/2016-06/msg00372.html
http://gcc.gnu.org/ml/gcc-patches/2016-06/msg01550.html
http://gcc.gnu.org/ml/gcc-patches/2016-05/msg02162.html
http://gcc.gnu.org/ml/gcc-patches/2016-05/msg02155.html
http://gcc.gnu.org/ml/gcc-patches/2016-05/msg02154.html


so I did some investigation to get a better understanding of the
issues involved.

On 07/13/2016 01:59 PM, Jeff Law wrote:
> On 05/25/2016 05:22 AM, Bin Cheng wrote:
>> Hi, As analyzed in PR68303 and PR69710, vectorizer generates
>> duplicated computations in loop's pre-header basic block when
>> creating base address for vector reference to the same memory object.
> Not a huge surprise.  Loop optimizations generally have a tendency
> to create and/or expose CSE opportunities.  Unrolling is a common
> culprit, there's certainly the possibility for header duplication,
> code motions and IV rewriting to also expose/create redundant code.
>
> ...
> 
>  But, 1) It
>> doesn't fix all the problem on x86_64.  Root cause is computation for
>> base address of the first reference is somehow moved outside of
>> loop's pre-header, local CSE can't help in this case.
> That's a bid odd -- have you investigated why this is outside the loop header?
> ...
I didn't look at this issue per se, but I did try running DOM between
autovectorization and IVS.  Just running DOM had little effect, what
was crucial was adding the change Bin mentioned in his original
message:

    Besides CSE issue, this patch also re-associates address
    expressions in vect_create_addr_base_for_vector_ref, specifically,
    it splits constant offset and adds it back near the expression
    root in IR.  This is necessary because GCC only handles
    re-association for commutative operators in CSE.

I attached a patch for these changes only.  These are the important
modifications that address the some of the IVS related issues exposed
by PR68303. I found that adding the CSE change (or calling DOM between
autovectorization and IVOPTS) is not needed, and from what I have
seen, actually makes the code worse.

Applying only the modifications to
vect_create_addr_base_for_vector_ref, additional simplifications will
be done when induction variables are found (function
find_induction_variables).  These simplications are indicated by the
appearance of lines:

Applying pattern match.pd:1056, generic-match.c:11865

in the IVOPS dump file.  Now IVOPTs transforms the code so that
constants now appear in the computation of the effective addresses for
the memory OPs.  However the code generated by IVOPTS still uses a
separate base register for each memory reference.  Later DOM3
transforms the code to use just one base register, which is the form
the code needs to be in for the preliminary phase of IVOPTs where
"IV uses" associated with memory OPs are placed into groups.  At the
time of this grouping, checks are done to ensure that for each member
of a group the constant offsets don't overflow the immediate fields in
actual machine instructions (more in this see * below).

Currently it appears that an IV is generated for each memory
reference.  Instead of generating a new IV for each memory reference,
we could try to detect that value of the new IV is just a constant
offset of an existing IV and just generate a new temp reflecting that.
I haven't worked through what needs to be done to implement that, but
for the issue in PR69710 (saxpy example where the same IV should be
used for a load and store) is straightforward to implement so since
work has already been done in during data dependence analysis to
detect this situation.  I attached a patch for PR69710 that was
bootstrapped and tested on X86_64 without errors.  It does appear that
it needs more testing, since I did notice SPEC 2006 h264ref produces
different results with the patch applied, which I still need to
investigate.

Doug

* Note that when IV uses are grouped, only positive constant offsets
constraints are considered.  That negative offsets can be used are
reflected in the costs of using a different IV than the IV associated
with a particular group.  Thus once the optimal IV set is found, a
different IV may chosen, which causes negative constant offsets to be
used.


[-- Attachment #2: 0001-PR68030-PR69710-vect_create_addr_base_for_vector_ref.patch --]
[-- Type: text/x-patch, Size: 3492 bytes --]

From 3ea70edb3bf68057c955d2b22204f17bb670f65a Mon Sep 17 00:00:00 2001
From: Doug Gilmore <doug.gilmore@imgtec.com>
Date: Fri, 4 Nov 2016 18:49:58 -0700
Subject: [PATCH] [PR68030/PR69710] vect_create_addr_base_for_vector_ref
 changes only fix.

This patch include changes noted in Bin's first patch message:
https://gcc.gnu.org/ml/gcc-patches/2016-05/msg02000.html

    Besides CSE issue, this patch also re-associates address expressions
    in vect_create_addr_base_for_vector_ref, specifically, it splits
    constant offset and adds it back near the expression root in IR.  This
    is necessary because GCC only handles re-association for commutative
    operators in CSE.

With these changes only one IV and one base register is used for
vectorized memory references in the example for PR68030.
Unfortunately memory OP IV uses cannot be grouped per the comment in
tree-ssa-loop-ivopts.c:

   1) The interesting uses of induction variables are found.  This includes

      -- uses of induction variables in non-linear expressions
      -- addresses of arrays
      -- comparisons of induction variables

      Note the interesting uses are categorized and handled in group.
      Generally, address type uses are grouped together if their iv bases
      are different in constant offset.

AFAICS checks to select IVs so that offsets do not overflow the fields
in the machine instructions will not be done unless IV uses are
grouped together.
---
 gcc/tree-vect-data-refs.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index 5a30314..4cbe7ae 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -4239,23 +4239,27 @@ vect_create_addr_base_for_vector_ref (gimple *stmt,
       base_name = get_name (DR_REF (dr));
     }
 
-  /* Create base_offset */
-  base_offset = size_binop (PLUS_EXPR,
-			    fold_convert (sizetype, base_offset),
-			    fold_convert (sizetype, init));
+  base_offset = fold_convert (sizetype, base_offset);
+  init = fold_convert (sizetype, init);
 
   if (offset)
     {
       offset = fold_build2 (MULT_EXPR, sizetype,
 			    fold_convert (sizetype, offset), step);
-      base_offset = fold_build2 (PLUS_EXPR, sizetype,
-				 base_offset, offset);
+      if (TREE_CODE (offset) == INTEGER_CST)
+	init = fold_build2 (PLUS_EXPR, sizetype, init, offset);
+      else
+	base_offset = fold_build2 (PLUS_EXPR, sizetype,
+				   base_offset, offset);
     }
   if (byte_offset)
     {
       byte_offset = fold_convert (sizetype, byte_offset);
-      base_offset = fold_build2 (PLUS_EXPR, sizetype,
-				 base_offset, byte_offset);
+      if (TREE_CODE (byte_offset) == INTEGER_CST)
+	init = fold_build2 (PLUS_EXPR, sizetype, init, byte_offset);
+      else
+	base_offset = fold_build2 (PLUS_EXPR, sizetype,
+				   base_offset, byte_offset);
     }
 
   /* base + base_offset */
@@ -4269,6 +4273,10 @@ vect_create_addr_base_for_vector_ref (gimple *stmt,
     }
 
   vect_ptr_type = build_pointer_type (STMT_VINFO_VECTYPE (stmt_info));
+  addr_base = force_gimple_operand (addr_base, &seq, true, NULL);
+  gimple_seq_add_seq (new_stmt_list, seq);
+  /* Add constant offset at last.  */
+  addr_base = fold_build_pointer_plus (addr_base, init);
   dest = vect_get_new_vect_var (vect_ptr_type, vect_pointer_var, base_name);
   addr_base = force_gimple_operand (addr_base, &seq, true, dest);
   gimple_seq_add_seq (new_stmt_list, seq);
-- 
1.9.1


[-- Attachment #3: 0001-Another-prototype-fix-to-PR69710.patch --]
[-- Type: text/x-patch, Size: 5673 bytes --]

From 68d783d5b8e3a6238a7e7b25d76fc6e457bfc484 Mon Sep 17 00:00:00 2001
From: Doug Gilmore <doug.gilmore@imgtec.com>
Date: Fri, 18 Nov 2016 13:29:47 -0800
Subject: [PATCH] Another prototype fix to PR69710.

Would need to be extended to handle PR68030.

Tested on r240609 several weeks ago: bootstrapped and no make check
regressions on x86_64-unknown-linux-gnu, though a run-time error on
SPEC 2006 h264ref at -O3 with --size=test was observed.
---
 gcc/tree-data-ref.c       |  1 +
 gcc/tree-data-ref.h       |  4 ++++
 gcc/tree-vect-data-refs.c | 50 ++++++++++++++++++++++++++++++++++++++++++++---
 gcc/tree-vect-loop.c      |  2 ++
 gcc/tree-vectorizer.h     |  5 +++++
 5 files changed, 59 insertions(+), 3 deletions(-)

diff --git a/gcc/tree-data-ref.c b/gcc/tree-data-ref.c
index 8152da3..7a4850c 100644
--- a/gcc/tree-data-ref.c
+++ b/gcc/tree-data-ref.c
@@ -1492,6 +1492,7 @@ initialize_data_dependence_relation (struct data_reference *a,
   DDR_SUBSCRIPTS (res).create (0);
   DDR_DIR_VECTS (res).create (0);
   DDR_DIST_VECTS (res).create (0);
+  DDR_PREHEADER_IV (res) = NULL_TREE;
 
   if (a == NULL || b == NULL)
     {
diff --git a/gcc/tree-data-ref.h b/gcc/tree-data-ref.h
index 856dd58..ca67d93 100644
--- a/gcc/tree-data-ref.h
+++ b/gcc/tree-data-ref.h
@@ -264,6 +264,9 @@ struct data_dependence_relation
   /* Set to true when the dependence relation is on the same data
      access.  */
   bool self_reference_p;
+
+  /* For zero distance dependencies, use the same IV in the preheader.  */
+  tree preheader_iv;
 };
 
 typedef struct data_dependence_relation *ddr_p;
@@ -294,6 +297,7 @@ typedef struct data_dependence_relation *ddr_p;
 #define DDR_DIST_VECT(DDR, I) \
   DDR_DIST_VECTS (DDR)[I]
 #define DDR_REVERSED_P(DDR) DDR->reversed_p
+#define DDR_PREHEADER_IV(DDR) DDR->preheader_iv
 
 \f
 bool dr_analyze_innermost (struct data_reference *, struct loop *);
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index 5a30314..12cfbec 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -394,6 +394,7 @@ vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr,
 		}
 	    }
 
+	  LOOP_VINFO_ZERO_DIST_DEPS (loop_vinfo).safe_push (ddr);
 	  continue;
 	}
 
@@ -4492,10 +4493,53 @@ vect_create_data_ref_ptr (gimple *stmt, tree aggr_type, struct loop *at_loop,
   /* (2) Calculate the initial address of the aggregate-pointer, and set
      the aggregate-pointer to point to it before the loop.  */
 
-  /* Create: (&(base[init_val+offset]+byte_offset) in the loop preheader.  */
+  /* Create: (&(base[init_val+offset]+byte_offset) in the loop preheader,
+     taking care to use reuse IVs.  */
 
-  new_temp = vect_create_addr_base_for_vector_ref (stmt, &new_stmt_list,
-						   offset, loop, byte_offset);
+  bool need_new_temp = true;
+  ddr_p zd_ddr = (ddr_p) NULL;
+  if (loop_vinfo)
+    {
+      int i;
+      ddr_p ddr;
+      
+
+      FOR_EACH_VEC_ELT (LOOP_VINFO_ZERO_DIST_DEPS (loop_vinfo), i, ddr)
+	{
+	  if (DDR_A(ddr) == dr || DDR_B(ddr) == dr)
+	    {
+	      zd_ddr = ddr;
+	      break;
+	    }
+	}
+
+      if (zd_ddr && DDR_PREHEADER_IV (zd_ddr))
+	{
+	  new_temp = DDR_PREHEADER_IV (zd_ddr);
+	  need_new_temp = false;
+	  if (dump_enabled_p ())
+	    {
+	      dump_printf_loc (MSG_NOTE, vect_location, "reuse ");
+	      dump_generic_expr (MSG_NOTE, TDF_SLIM, new_temp);
+	      dump_printf (MSG_NOTE, "\n");
+	    }
+	}
+    }
+
+  if (need_new_temp)
+      new_temp = vect_create_addr_base_for_vector_ref (stmt, &new_stmt_list,
+						       offset, loop, byte_offset);
+  if (zd_ddr)
+    {
+      DDR_PREHEADER_IV (zd_ddr) = new_temp;
+      if (dump_enabled_p ())
+	{
+	  dump_printf_loc (MSG_NOTE, vect_location, "save ");
+	  dump_generic_expr (MSG_NOTE, TDF_SLIM, new_temp);
+	  dump_printf (MSG_NOTE, "\n");
+	}
+    }
+  
   if (new_stmt_list)
     {
       if (pe)
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 4150b0d..0bf67e9 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -1173,6 +1173,7 @@ new_loop_vec_info (struct loop *loop)
   LOOP_VINFO_PEELING_FOR_NITER (res) = false;
   LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
   LOOP_VINFO_ORIG_LOOP_INFO (res) = NULL;
+  LOOP_VINFO_ZERO_DIST_DEPS (res) = vNULL;
 
   return res;
 }
@@ -1271,6 +1272,7 @@ destroy_loop_vec_info (loop_vec_info loop_vinfo, bool clean_stmts)
   LOOP_VINFO_GROUPED_STORES (loop_vinfo).release ();
   LOOP_VINFO_REDUCTIONS (loop_vinfo).release ();
   LOOP_VINFO_REDUCTION_CHAINS (loop_vinfo).release ();
+  LOOP_VINFO_ZERO_DIST_DEPS (loop_vinfo).release ();
 
   destroy_cost_data (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo));
   loop_vinfo->scalar_cost_vec.release ();
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 2a7fa0a..7fcc824 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -339,6 +339,10 @@ typedef struct _loop_vec_info : public vec_info {
      this points to the original vectorized loop.  Otherwise NULL.  */
   _loop_vec_info *orig_loop_info;
 
+  /* Dependency relations that have zero distance, thus the associated
+     memory references can share the same IV.  */
+  vec<ddr_p> zero_dist_deps;
+
 } *loop_vec_info;
 
 /* Access Functions.  */
@@ -379,6 +383,7 @@ typedef struct _loop_vec_info : public vec_info {
 #define LOOP_VINFO_SCALAR_ITERATION_COST(L) (L)->scalar_cost_vec
 #define LOOP_VINFO_SINGLE_SCALAR_ITERATION_COST(L) (L)->single_scalar_iteration_cost
 #define LOOP_VINFO_ORIG_LOOP_INFO(L)       (L)->orig_loop_info
+#define LOOP_VINFO_ZERO_DIST_DEPS(L)       (L)->zero_dist_deps
 
 #define LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT(L)	\
   ((L)->may_misalign_stmts.length () > 0)
-- 
1.9.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH PR68030/PR69710][RFC]Introduce a simple local CSE interface and use it in vectorizer
  2016-11-21 21:34   ` Doug Gilmore
@ 2016-11-22 16:07     ` Bin.Cheng
  2016-11-23  0:35       ` Doug Gilmore
  0 siblings, 1 reply; 11+ messages in thread
From: Bin.Cheng @ 2016-11-22 16:07 UTC (permalink / raw)
  To: Doug Gilmore; +Cc: Jeff Law, gcc-patches

On Mon, Nov 21, 2016 at 9:34 PM, Doug Gilmore <Doug.Gilmore@imgtec.com> wrote:
> I haven't seen any followups to this discussion of Bin's patch to
> PR68303 and PR69710, the patch submission:
> http://gcc.gnu.org/ml/gcc-patches/2016-05/msg02000.html
>
> Discussion:
> http://gcc.gnu.org/ml/gcc-patches/2016-07/msg00761.html
> http://gcc.gnu.org/ml/gcc-patches/2016-06/msg01551.html
> http://gcc.gnu.org/ml/gcc-patches/2016-06/msg00372.html
> http://gcc.gnu.org/ml/gcc-patches/2016-06/msg01550.html
> http://gcc.gnu.org/ml/gcc-patches/2016-05/msg02162.html
> http://gcc.gnu.org/ml/gcc-patches/2016-05/msg02155.html
> http://gcc.gnu.org/ml/gcc-patches/2016-05/msg02154.html
>
>
> so I did some investigation to get a better understanding of the
> issues involved.
Hi Doug,
Thanks for looking into this problem.
>
> On 07/13/2016 01:59 PM, Jeff Law wrote:
>> On 05/25/2016 05:22 AM, Bin Cheng wrote:
>>> Hi, As analyzed in PR68303 and PR69710, vectorizer generates
>>> duplicated computations in loop's pre-header basic block when
>>> creating base address for vector reference to the same memory object.
>> Not a huge surprise.  Loop optimizations generally have a tendency
>> to create and/or expose CSE opportunities.  Unrolling is a common
>> culprit, there's certainly the possibility for header duplication,
>> code motions and IV rewriting to also expose/create redundant code.
>>
>> ...
>>
>>  But, 1) It
>>> doesn't fix all the problem on x86_64.  Root cause is computation for
>>> base address of the first reference is somehow moved outside of
>>> loop's pre-header, local CSE can't help in this case.
>> That's a bid odd -- have you investigated why this is outside the loop header?
>> ...
> I didn't look at this issue per se, but I did try running DOM between
> autovectorization and IVS.  Just running DOM had little effect, what
> was crucial was adding the change Bin mentioned in his original
> message:
>
>     Besides CSE issue, this patch also re-associates address
>     expressions in vect_create_addr_base_for_vector_ref, specifically,
>     it splits constant offset and adds it back near the expression
>     root in IR.  This is necessary because GCC only handles
>     re-association for commutative operators in CSE.
>
> I attached a patch for these changes only.  These are the important
> modifications that address the some of the IVS related issues exposed
> by PR68303. I found that adding the CSE change (or calling DOM between
> autovectorization and IVOPTS) is not needed, and from what I have
I checked the code again.  As you said, re-association part is important
to enable CSE opportunities, no matter when and which pass handles it.
After re-association, the computation of base addresses are like:

    //preheader
    b_1 = g_Input + var_offset_1;
    vectp_1 = b_1 + cst_offset_1;
    b_2 = g_Input + var_offset_2;
    vectp_2 = b_2 + cst_offset_2;
    ...
    b_n = g_input + var_offset_n;
    vectp_n = b_n + cst_offset_n;

    //loop
    MEM[vectp_1];
    MEM[vectp_2];
    ...
    MEM[vectp_n];

In fact, var_offset_1, var_offset_2, ..., var_offset_n are equal to others.  So
the addresses are in the form of "g_Input + var_offset + cst_offset_x" differing
to each other wrto constant offset.  The purpose of CSE is to propagate all
parts of this address to IVOPTs, otherwise IVOPTS only knows IVs as below:

    iv_use_1: {b_1 + cst_offset_1, step}_loop
    iv_use_1: {b_2 + cst_offset_2, step}_loop
    ...
    iv_use_n: {b_n + cst_offset_n, step}_loop

> seen, actually makes the code worse.
>
> Applying only the modifications to
> vect_create_addr_base_for_vector_ref, additional simplifications will
> be done when induction variables are found (function
> find_induction_variables).  These simplications are indicated by the
> appearance of lines:
>
> Applying pattern match.pd:1056, generic-match.c:11865
This doesn't look related to this problem to me.  The simplification of this
problem is CSE, it's not what match.pd does.

>
> in the IVOPS dump file.  Now IVOPTs transforms the code so that
> constants now appear in the computation of the effective addresses for
> the memory OPs.  However the code generated by IVOPTS still uses a
> separate base register for each memory reference.  Later DOM3
> transforms the code to use just one base register, which is the form

Indeed CSE now looks like unnecessary fixing the problem, we can relying on
DOM pass to explore the equality among new bases (b_1, b_2, ..., b_n).  This
actually echoes my humble opinion: we shouldn't rely on IVOPTs to fix all bad
code issues.  On the other handle, for cases in which these bases
(b_1, b_2, ..., b_n)
are not equal to each other, there is not much to lose in this way either.

> the code needs to be in for the preliminary phase of IVOPTs where
> "IV uses" associated with memory OPs are placed into groups.  At the
> time of this grouping, checks are done to ensure that for each member
> of a group the constant offsets don't overflow the immediate fields in
> actual machine instructions (more in this see * below).
>
> Currently it appears that an IV is generated for each memory
> reference.  Instead of generating a new IV for each memory reference,
> we could try to detect that value of the new IV is just a constant
> offset of an existing IV and just generate a new temp reflecting that.
> I haven't worked through what needs to be done to implement that, but
> for the issue in PR69710 (saxpy example where the same IV should be
Basic idea is to re-associate generated base addresses for vectors in order
to enable more CSE opportunities.  The original patch re-associates all
vector base addresses, no matter if they share common-sub-expression
with others or not.  This is not good, and could result in sub-optimal code.
That's one reason why the patch is not updated.  The old idea here is to
introduce a vectorizer local fix, for example using hash-table storing existing
base address, and generating new ones (which differ in const offset) based
on it.

> used for a load and store) is straightforward to implement so since
> work has already been done in during data dependence analysis to
> detect this situation.  I attached a patch for PR69710 that was
> bootstrapped and tested on X86_64 without errors.  It does appear that
> it needs more testing, since I did notice SPEC 2006 h264ref produces
> different results with the patch applied, which I still need to
> investigate.
Your patch reminds me of another possible method.  Like tree-predcom.c
does, a light-weight merge&find structure, it records groups of data-refs,
and data-refs in a group having the same base address (except constant
offset).  We can build such group information during analyzing each ddr.
This is lighter than has-table with optimal time complexity.

Above comments are made based on x86, there is one more problem on
AArch64.  Given address iv_use with the form MEM[base + IV + cst_offset],
when cst_offset is out the range of [base + offset] addressing mode, IVOPTs
rewrites it into below form:

    temp_1 = base + IV;
    temp_2 = temp_1 + cst_offset;
    MEM[temp_2];

What we want is:

    temp_1 = base + cst_offset;
    MEM[temp-1 + IV];

Thus temp_1 can be hoisted out of loop when register pressure allows (as
in this problem).  In cases register pressure is high, very likely the addition
is kept in loop, which is still better than now in which register pressure is
high because different IVs are chosen.  Without this change, the vectorization
change causes obvious regression in loop body on AArch64.

But both changes look like stage 1 work.

Thanks,
bin

> Doug
>
> * Note that when IV uses are grouped, only positive constant offsets
> constraints are considered.  That negative offsets can be used are
> reflected in the costs of using a different IV than the IV associated
> with a particular group.  Thus once the optimal IV set is found, a
> different IV may chosen, which causes negative constant offsets to be
> used.
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH PR68030/PR69710][RFC]Introduce a simple local CSE interface and use it in vectorizer
  2016-11-22 16:07     ` Bin.Cheng
@ 2016-11-23  0:35       ` Doug Gilmore
  0 siblings, 0 replies; 11+ messages in thread
From: Doug Gilmore @ 2016-11-23  0:35 UTC (permalink / raw)
  To: Bin.Cheng; +Cc: Jeff Law, gcc-patches

On 11/22/2016 08:07 AM, Bin.Cheng wrote:
> On Mon, Nov 21, 2016 at 9:34 PM, Doug Gilmore <Doug.Gilmore@imgtec.com> wrote:
>> I haven't seen any followups to this discussion of Bin's patch to
>> PR68303 and PR69710, the patch submission:
>> http://gcc.gnu.org/ml/gcc-patches/2016-05/msg02000.html
>>
>> Discussion:
>> http://gcc.gnu.org/ml/gcc-patches/2016-07/msg00761.html
>> http://gcc.gnu.org/ml/gcc-patches/2016-06/msg01551.html
>> http://gcc.gnu.org/ml/gcc-patches/2016-06/msg00372.html
>> http://gcc.gnu.org/ml/gcc-patches/2016-06/msg01550.html
>> http://gcc.gnu.org/ml/gcc-patches/2016-05/msg02162.html
>> http://gcc.gnu.org/ml/gcc-patches/2016-05/msg02155.html
>> http://gcc.gnu.org/ml/gcc-patches/2016-05/msg02154.html
>>
>>
>> so I did some investigation to get a better understanding of the
>> issues involved.
> Hi Doug,
> Thanks for looking into this problem.
>>
>> On 07/13/2016 01:59 PM, Jeff Law wrote:
>>> On 05/25/2016 05:22 AM, Bin Cheng wrote:
>>>> Hi, As analyzed in PR68303 and PR69710, vectorizer generates
>>>> duplicated computations in loop's pre-header basic block when
>>>> creating base address for vector reference to the same memory object.
>>> Not a huge surprise.  Loop optimizations generally have a tendency
>>> to create and/or expose CSE opportunities.  Unrolling is a common
>>> culprit, there's certainly the possibility for header duplication,
>>> code motions and IV rewriting to also expose/create redundant code.
>>>
>>> ...
>>>
>>>  But, 1) It
>>>> doesn't fix all the problem on x86_64.  Root cause is computation for
>>>> base address of the first reference is somehow moved outside of
>>>> loop's pre-header, local CSE can't help in this case.
>>> That's a bid odd -- have you investigated why this is outside the loop header?
>>> ...
>> I didn't look at this issue per se, but I did try running DOM between
>> autovectorization and IVS.  Just running DOM had little effect, what
>> was crucial was adding the change Bin mentioned in his original
>> message:
>>
>>     Besides CSE issue, this patch also re-associates address
>>     expressions in vect_create_addr_base_for_vector_ref, specifically,
>>     it splits constant offset and adds it back near the expression
>>     root in IR.  This is necessary because GCC only handles
>>     re-association for commutative operators in CSE.
>>
>> I attached a patch for these changes only.  These are the important
>> modifications that address the some of the IVS related issues exposed
>> by PR68303. I found that adding the CSE change (or calling DOM between
>> autovectorization and IVOPTS) is not needed, and from what I have
> I checked the code again.  As you said, re-association part is important
> to enable CSE opportunities, no matter when and which pass handles it.
> After re-association, the computation of base addresses are like:
> 
>     //preheader
>     b_1 = g_Input + var_offset_1;
>     vectp_1 = b_1 + cst_offset_1;
>     b_2 = g_Input + var_offset_2;
>     vectp_2 = b_2 + cst_offset_2;
>     ...
>     b_n = g_input + var_offset_n;
>     vectp_n = b_n + cst_offset_n;
> 
>     //loop
>     MEM[vectp_1];
>     MEM[vectp_2];
>     ...
>     MEM[vectp_n];
> 
> In fact, var_offset_1, var_offset_2, ..., var_offset_n are equal to others.  So
> the addresses are in the form of "g_Input + var_offset + cst_offset_x" differing
> to each other wrto constant offset.  The purpose of CSE is to propagate all
> parts of this address to IVOPTs, otherwise IVOPTS only knows IVs as below:
> 
>     iv_use_1: {b_1 + cst_offset_1, step}_loop
>     iv_use_1: {b_2 + cst_offset_2, step}_loop
>     ...
>     iv_use_n: {b_n + cst_offset_n, step}_loop
> 
>> seen, actually makes the code worse.
>>
>> Applying only the modifications to
>> vect_create_addr_base_for_vector_ref, additional simplifications will
>> be done when induction variables are found (function
>> find_induction_variables).  These simplications are indicated by the
>> appearance of lines:
>>
>> Applying pattern match.pd:1056, generic-match.c:11865
> This doesn't look related to this problem to me.  The simplification of this
> problem is CSE, it's not what match.pd does.
> 
>>
>> in the IVOPS dump file.  Now IVOPTs transforms the code so that
>> constants now appear in the computation of the effective addresses for
>> the memory OPs.  However the code generated by IVOPTS still uses a
>> separate base register for each memory reference.  Later DOM3
>> transforms the code to use just one base register, which is the form
> 
> Indeed CSE now looks like unnecessary fixing the problem, we can relying on
> DOM pass to explore the equality among new bases (b_1, b_2, ..., b_n).  This
> actually echoes my humble opinion: we shouldn't rely on IVOPTs to fix all bad
> code issues.  On the other handle, for cases in which these bases
> (b_1, b_2, ..., b_n)
> are not equal to each other, there is not much to lose in this way either.
> 
>> the code needs to be in for the preliminary phase of IVOPTs where
>> "IV uses" associated with memory OPs are placed into groups.  At the
>> time of this grouping, checks are done to ensure that for each member
>> of a group the constant offsets don't overflow the immediate fields in
>> actual machine instructions (more in this see * below).
>>
>> Currently it appears that an IV is generated for each memory
>> reference.  Instead of generating a new IV for each memory reference,
>> we could try to detect that value of the new IV is just a constant
>> offset of an existing IV and just generate a new temp reflecting that.
>> I haven't worked through what needs to be done to implement that, but
>> for the issue in PR69710 (saxpy example where the same IV should be
> Basic idea is to re-associate generated base addresses for vectors in order
> to enable more CSE opportunities.  The original patch re-associates all
> vector base addresses, no matter if they share common-sub-expression
> with others or not.  This is not good, and could result in sub-optimal code.
> That's one reason why the patch is not updated.  The old idea here is to
> introduce a vectorizer local fix, for example using hash-table storing existing
> base address, and generating new ones (which differ in const offset) based
> on it.
> 
>> used for a load and store) is straightforward to implement so since
>> work has already been done in during data dependence analysis to
>> detect this situation.  I attached a patch for PR69710 that was
>> bootstrapped and tested on X86_64 without errors.  It does appear that
>> it needs more testing, since I did notice SPEC 2006 h264ref produces
>> different results with the patch applied, which I still need to
>> investigate.
> Your patch reminds me of another possible method.  Like tree-predcom.c
> does, a light-weight merge&find structure, it records groups of data-refs,
> and data-refs in a group having the same base address (except constant
> offset).  We can build such group information during analyzing each ddr.
> This is lighter than has-table with optimal time complexity.
Hi Bin,

Thank you for the quick response.

Isn't this work ready done for us?

See trace output in the "vect" details dump for DR structures:

...
	base_object: *global_Input.0_145 + (sizetype) ((unsigned int) iy_179 * 2064)
	Access function 0: {4144B, +, 4}_2
...
> 
> Above comments are made based on x86, there is one more problem on
> AArch64.  Given address iv_use with the form MEM[base + IV + cst_offset],
> when cst_offset is out the range of [base + offset] addressing mode, IVOPTs
> rewrites it into below form:
> 
>     temp_1 = base + IV;
>     temp_2 = temp_1 + cst_offset;
>     MEM[temp_2];
> 
> What we want is:
> 
>     temp_1 = base + cst_offset;
>     MEM[temp-1 + IV];
> 
> Thus temp_1 can be hoisted out of loop when register pressure allows (as
> in this problem).  In cases register pressure is high, very likely the addition
> is kept in loop, which is still better than now in which register pressure is
> high because different IVs are chosen.  Without this change, the vectorization
> change causes obvious regression in loop body on AArch64.
> 
> But both changes look like stage 1 work.
Right -- Thanks!

Doug
> 
> Thanks,
> bin
> 
>> Doug
>>
>> * Note that when IV uses are grouped, only positive constant offsets
>> constraints are considered.  That negative offsets can be used are
>> reflected in the costs of using a different IV than the IV associated
>> with a particular group.  Thus once the optimal IV set is found, a
>> different IV may chosen, which causes negative constant offsets to be
>> used.
>>

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2016-11-23  0:35 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-25 13:03 [PATCH PR68030/PR69710][RFC]Introduce a simple local CSE interface and use it in vectorizer Bin Cheng
2016-05-27 12:07 ` Richard Biener
2016-05-27 12:15   ` Bin.Cheng
2016-05-27 13:01     ` Richard Biener
2016-06-06 11:23       ` Bin.Cheng
2016-06-21 21:58         ` Jeff Law
2016-06-21 21:54       ` Jeff Law
2016-07-13 20:59 ` Jeff Law
2016-11-21 21:34   ` Doug Gilmore
2016-11-22 16:07     ` Bin.Cheng
2016-11-23  0:35       ` Doug Gilmore

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).