[PATCH] Optional alternative base_expr in finding basis for CAND

public inbox for gcc-patches@gcc.gnu.org

* [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
@ 2013-11-04 18:46 Yufeng Zhang
  2013-11-11 18:10 ` Bill Schmidt
  0 siblings, 1 reply; 34+ messages in thread
From: Yufeng Zhang @ 2013-11-04 18:46 UTC (permalink / raw)
  To: gcc-patches; +Cc: Bill Schmidt, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 4970 bytes --]

Hi,

This patch extends the slsr pass to optionally use an alternative base
expression when finding a basis for CAND_REFs.  Currently the pass uses
a hash-based algorithm to match the base_expr of a candidate.  Given a
test case like the following, slsr cannot recognize that the two
CAND_REFs have the same basis, because their base_exprs are different
SSA_NAMEs:

typedef int arr_2[20][20];

void foo (arr_2 a2, int i, int j)
{
   a2[i][j] = 1;
   a2[i + 10][j] = 2;
}

The gimple dump before slsr looks like the following (using an
arm-none-eabi gcc):

   i.0_2 = (unsigned int) i_1(D);
   _3 = i.0_2 * 80;
   _5 = a2_4(D) + _3;
   *_5[j_7(D)] = 1;      <----
   _9 = _3 + 800;
   _10 = a2_4(D) + _9;
   *_10[j_7(D)] = 2;     <----

Here are the dumps for the two CAND_REFs generated for the two
statements pointed to by the arrows:


   4  [2] _5 = a2_4(D) + _3;
      ADD  : a2_4(D) + (80 * i_1(D)) : int[20] *
      basis: 0  dependent: 0  sibling: 0
      next-interp: 0  dead-savings: 0

   8  [2] *_10[j_7(D)] = 2;
      REF  : _10 + ((sizetype) j_7(D) * 4) + 0 : int[20] *
      basis: 5  dependent: 0  sibling: 0
      next-interp: 0  dead-savings: 0

As mentioned previously, slsr cannot establish that candidate 4 is the
basis for candidate 8, as they have different base_exprs: a2_4(D)
and _10, respectively.  However, the two references actually differ only
by an immediate offset (800).
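
As a quick check of the 800-byte figure, here is a minimal sketch of the
address arithmetic from the gimple dump.  The helper names byte_offset
and row_delta are made up for illustration and are not part of the
patch; the 800 assumes a 4-byte int, as on the targets mentioned here.

```c
#include <assert.h>
#include <stddef.h>

enum { ROWS = 20, COLS = 20 };   /* matches typedef int arr_2[20][20] */

/* Byte offset of a2[i][j] from a2, computed as in the gimple dump:
   row index times row size plus column index times element size.  */
static size_t
byte_offset (size_t i, size_t j)
{
  assert (i < (size_t) ROWS && j < (size_t) COLS);
  return i * (COLS * sizeof (int)) + j * sizeof (int);
}

/* Byte distance between a2[i + 10][j] and a2[i][j].  The i and j
   terms cancel, so only the constant 10 * COLS * sizeof (int) is left.  */
static size_t
row_delta (size_t i, size_t j)
{
  return byte_offset (i + 10, j) - byte_offset (i, j);
}
```

Calling row_delta with any in-range i and j gives the same constant,
which is why the two references can share one basis.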

This patch uses the tree affine combination facilities to create an
optional alternative base expression to be used in finding (as well as
recording) the basis.  It calls tree_to_aff_combination_expand on
base_expr, resets the offset field of the generated aff_tree to 0, and
generates a tree from it by calling aff_combination_to_tree.
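
The effect of zeroing the offset can be sketched with a toy model.  The
struct toy_aff below is a hypothetical stand-in that holds a single
term; it is not GCC's aff_tree, and strip_offset and same_alt_base are
illustrative names only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy affine form: base + coeff * var + offset.  */
struct toy_aff
{
  const char *base;   /* e.g. "a2_4(D)" */
  const char *var;    /* e.g. "i_1(D)" */
  long coeff;         /* e.g. 80 */
  long offset;        /* constant part, e.g. 800 */
};

/* Drop the constant part, mirroring the reset of the offset field.  */
static struct toy_aff
strip_offset (struct toy_aff a)
{
  a.offset = 0;
  return a;
}

/* Two expressions share an alternative base if they are identical
   once their constant offsets have been removed.  */
static bool
same_alt_base (struct toy_aff a, struct toy_aff b)
{
  assert (a.base != NULL && a.var != NULL);
  assert (b.base != NULL && b.var != NULL);
  a = strip_offset (a);
  b = strip_offset (b);
  return strcmp (a.base, b.base) == 0
         && strcmp (a.var, b.var) == 0
         && a.coeff == b.coeff
         && a.offset == b.offset;
}
```

With the example above, a2_4(D) + 80 * i_1(D) + 0 and
a2_4(D) + 80 * i_1(D) + 800 compare equal under same_alt_base.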

The new tree is recorded as a potential basis, and when
find_basis_for_candidate fails to find a basis for a CAND_REF with its
normal approach, it searches again using the tree expanded in this way.
Such an expanded tree usually discloses the expression behind an
SSA_NAME.  In the example above, instead of seeing strength
reduction candidate chains like this:

   _5 -> 5
   _10 -> 8

we now have:

   _5 -> 5
   _10 -> 8
   a2_4(D) + (sizetype) i_1(D) * 80 -> 5 -> 8

With the candidates 5 and 8 linked to the same tree expression (a2_4(D) 
+ (sizetype) i_1(D) * 80), slsr is now able to establish that 5 is the 
basis of 8.
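
The record-then-fall-back lookup can be sketched as follows.  This is a
simplified model under stated assumptions: keys are plain strings rather
than trees, a linear array stands in for the hash table, and the
dominance, stride and type checks of the real pass are left out.  All
toy_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CHAINS 16

struct toy_chain
{
  const char *key;   /* textual base expression */
  int cand;          /* most recent candidate recorded under KEY */
};

static struct toy_chain chains[MAX_CHAINS];
static size_t n_chains;

/* Remember CAND as the most recent potential basis for KEY.  */
static void
toy_record (const char *key, int cand)
{
  for (size_t k = 0; k < n_chains; k++)
    if (strcmp (chains[k].key, key) == 0)
      {
        chains[k].cand = cand;
        return;
      }
  assert (n_chains < MAX_CHAINS);
  chains[n_chains].key = key;
  chains[n_chains].cand = cand;
  n_chains++;
}

/* Return the candidate recorded under KEY, or 0 for "no basis".  */
static int
toy_lookup (const char *key)
{
  for (size_t k = 0; k < n_chains; k++)
    if (strcmp (chains[k].key, key) == 0)
      return chains[k].cand;
  return 0;
}

/* Try the normal key first; fall back to ALT_KEY if one is given.  */
static int
toy_find_basis (const char *key, const char *alt_key)
{
  int basis = toy_lookup (key);
  if (!basis && alt_key != NULL)
    basis = toy_lookup (alt_key);
  return basis;
}
```

Recording candidate 5 under both "_5" and the expanded expression lets a
later lookup for "_10" succeed through the alternative key.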

The patch doesn't attempt to change the content of any CAND_REF, though.
It only enables CAND_REFs that (1) have the same stride and (2) have
base_exprs whose underlying expressions differ only in immediate
offsets to be recognized as having the same basis.  The statements with
such CAND_REFs will be lowered to MEM_REFs, and later on the RTL
expander should be able to fold and re-associate the immediate offsets to
the rightmost side of the addressing expression, thereby exposing
the common sub-expression.

The code-gen difference for the example code on arm with -O2
-mcpu=cortex-a15 is:

         mov     r3, r1, asl #6
-       add     ip, r0, r2, asl #2
         str     lr, [sp, #-4]!
+       mov     ip, #1
+       mov     lr, #2
         add     r1, r3, r1, asl #4
-       mov     lr, #1
-       mov     r3, #2
         add     r0, r0, r1
-       add     r0, r0, #800
-       str     lr, [ip, r1]
-       str     r3, [r0, r2, asl #2]
+       add     r3, r0, r2, asl #2
+       str     ip, [r0, r2, asl #2]
+       str     lr, [r3, #800]
         ldr     pc, [sp], #4

One fewer instruction in this simple case.

The example used for illustration is too simple to show a code-gen
difference on x86_64, but the included test case shows the benefit
of the patch clearly.

The patch has passed

* bootstrapping on arm and x86_64
* regtest on arm-none-eabi,  aarch64-none-elf and x86_64

There is no regression in SPEC2K on arm or x86_64.

OK to commit to the trunk?

Any comments are welcome!

Thanks,
Yufeng


gcc/

         * gimple-ssa-strength-reduction.c: Include tree-affine.h.
         (find_basis_for_base_expr): Update comment.
         (find_basis_for_candidate): Add new parameter 'alt_base_expr' of
         type 'tree'.  Optionally call find_basis_for_base_expr with
         'alt_base_expr'.
         (record_potential_basis): Add new parameter 'alt_base_expr' of
         type 'tree'; set node->base_expr with 'alt_base_expr' if it is
         not NULL.
         (name_expansions): New static variable.
         (get_alternative_base): New function.
         (alloc_cand_and_find_basis): Call get_alternative_base for
         CAND_REF.  Update calls to find_basis_for_candidate and
         record_potential_basis.
         (execute_strength_reduction): Call free_affine_expand_cache with
         &name_expansions.

gcc/testsuite/

         * gcc.dg/tree-ssa/slsr-41.c: New test.

[-- Attachment #2: patch --]
[-- Type: text/plain, Size: 5908 bytes --]

diff --git a/gcc/gimple-ssa-strength-reduction.c b/gcc/gimple-ssa-strength-reduction.c
index 9a5072c..3150046 100644
--- a/gcc/gimple-ssa-strength-reduction.c
+++ b/gcc/gimple-ssa-strength-reduction.c
@@ -48,6 +48,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "expmed.h"
 #include "params.h"
 #include "hash-table.h"
+#include "tree-affine.h"
 \f
 /* Information about a strength reduction candidate.  Each statement
    in the candidate table represents an expression of one of the
@@ -434,9 +435,10 @@ find_phi_def (tree base)
   return c->cand_num;
 }
 
-/* Helper routine for find_basis_for_candidate.  May be called twice:
+/* Helper routine for find_basis_for_candidate.  May be called three times:
    once for the candidate's base expr, and optionally again for the
-   candidate's phi definition.  */
+   candidate's phi definition, as well as for an alternative base expr
+   passed as the 2nd argument to find_basis_for_candidate.  */
 
 static slsr_cand_t
 find_basis_for_base_expr (slsr_cand_t c, tree base_expr)
@@ -477,10 +479,13 @@ find_basis_for_base_expr (slsr_cand_t c, tree base_expr)
    appear in a block that dominates the candidate statement and have
    the same stride and type.  If more than one possible basis exists,
    the one with highest index in the vector is chosen; this will be
-   the most immediately dominating basis.  */
+   the most immediately dominating basis.
+
+   When ALT_BASE_EXPR is not NULL, it will also be used to look for
+   possible candidates if all previous attempts have failed.  */
 
 static int
-find_basis_for_candidate (slsr_cand_t c)
+find_basis_for_candidate (slsr_cand_t c, tree alt_base_expr)
 {
   slsr_cand_t basis = find_basis_for_base_expr (c, c->base_expr);
 
@@ -513,6 +518,9 @@ find_basis_for_candidate (slsr_cand_t c)
 	}
     }
 
+  if (!basis && alt_base_expr)
+    basis = find_basis_for_base_expr (c, alt_base_expr);
+
   if (basis)
     {
       c->sibling = basis->dependent;
@@ -524,16 +532,17 @@ find_basis_for_candidate (slsr_cand_t c)
 }
 
 /* Record a mapping from the base expression of C to C itself, indicating that
-   C may potentially serve as a basis using that base expression.  */
+   C may potentially serve as a basis using that base expression.  Use
+   ALT_BASE_EXPR as the base expression instead, if it is not NULL.  */
 
 static void
-record_potential_basis (slsr_cand_t c)
+record_potential_basis (slsr_cand_t c, tree alt_base_expr)
 {
   cand_chain_t node;
   cand_chain **slot;
 
   node = (cand_chain_t) obstack_alloc (&chain_obstack, sizeof (cand_chain));
-  node->base_expr = c->base_expr;
+  node->base_expr = alt_base_expr ? alt_base_expr : c->base_expr;
   node->cand = c;
   node->next = NULL;
   slot = base_cand_map.find_slot (node, INSERT);
@@ -548,14 +557,46 @@ record_potential_basis (slsr_cand_t c)
     *slot = node;
 }
 
+static struct pointer_map_t *name_expansions;
+
+/* Given BASE, use the tree affine combination facilities to
+   find the underlying tree expression for BASE, with any
+   immediate offset excluded.  */
+
+static tree
+get_alternative_base (tree base)
+{
+  tree expr;
+  aff_tree aff;
+
+  tree_to_aff_combination_expand (base, TREE_TYPE (base),
+				  &aff, &name_expansions);
+  aff.offset = tree_to_double_int (integer_zero_node);
+  expr = aff_combination_to_tree (&aff);
+
+  if (expr == base)
+    expr = NULL;
+
+  return expr;
+}
+
 /* Allocate storage for a new candidate and initialize its fields.
-   Attempt to find a basis for the candidate.  */
+   Attempt to find a basis for the candidate.
+
+   For CAND_REF, an alternative base may also be recorded and used
+   to find a basis.  This helps cases where the expression hidden
+   behind BASE (which is usually an SSA_NAME) has immediate offset,
+   e.g.
+
+     a2[i][j] = 1;
+     a2[i + 20][j] = 2;  */
 
 static slsr_cand_t
-alloc_cand_and_find_basis (enum cand_kind kind, gimple gs, tree base, 
+alloc_cand_and_find_basis (enum cand_kind kind, gimple gs, tree base,
 			   double_int index, tree stride, tree ctype,
 			   unsigned savings)
 {
+  tree alt_base_expr = NULL;
   slsr_cand_t c = (slsr_cand_t) obstack_alloc (&cand_obstack,
 					       sizeof (slsr_cand));
   c->cand_stmt = gs;
@@ -573,12 +614,17 @@ alloc_cand_and_find_basis (enum cand_kind kind, gimple gs, tree base,
 
   cand_vec.safe_push (c);
 
+  if (kind == CAND_REF)
+    alt_base_expr = get_alternative_base (base);
+
   if (kind == CAND_PHI)
     c->basis = 0;
   else
-    c->basis = find_basis_for_candidate (c);
+    c->basis = find_basis_for_candidate (c, alt_base_expr);
 
-  record_potential_basis (c);
+  record_potential_basis (c, NULL);
+  if (alt_base_expr)
+    record_potential_basis (c, alt_base_expr);
 
   return c;
 }
@@ -3534,6 +3580,8 @@ execute_strength_reduction (void)
       dump_cand_chains ();
     }
 
+  free_affine_expand_cache (&name_expansions);
+
   /* Analyze costs and make appropriate replacements.  */
   analyze_candidates_and_replace ();
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/slsr-41.c b/gcc/testsuite/gcc.dg/tree-ssa/slsr-41.c
new file mode 100644
index 0000000..870d714
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/slsr-41.c
@@ -0,0 +1,24 @@
+/* Verify straight-line strength reduction in using
+   alternative base expr to record and look for the
+   potential candidate.  */
+
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-slsr" } */
+
+typedef int arr_2[50][50];
+
+void foo (arr_2 a2, int v1)
+{
+  int i, j;
+
+  i = v1 + 5;
+  j = i;
+  a2 [i-10] [j] = 2;
+  a2 [i] [j++] = i;
+  a2 [i+20] [j++] = i;
+  a2 [i-3] [i-1] += 1;
+  return;
+}
+
+/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
+/* { dg-final { cleanup-tree-dump "slsr" } } */


* Re: [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-11-04 18:46 [PATCH] Optional alternative base_expr in finding basis for CAND_REFs Yufeng Zhang
@ 2013-11-11 18:10 ` Bill Schmidt
  2013-11-12 23:44   ` Yufeng Zhang
  0 siblings, 1 reply; 34+ messages in thread
From: Bill Schmidt @ 2013-11-11 18:10 UTC (permalink / raw)
  To: Yufeng Zhang; +Cc: gcc-patches, Richard Biener

Hi Yufeng,

The idea is a good one but I don't like your implementation of adding an
extra expression parameter to look at on the find_basis_for_candidate
lookup.  This goes against the design of the pass and may not be
sufficiently general (might there be situations where a third possible
basis could exist?).

The overall design is set up to have alternate interpretations of
candidates in the candidate table to handle this sort of ambiguity.  The
goal for your example is to create a second candidate (chained to the
first one by way of the next_interp field) so that the candidate table
looks like this:

   8  [2] *_10[j_7(D)] = 2;
      REF  : _10 + ((sizetype) j_7(D) * 4) + 0 : int[20] *
      basis: 0  dependent: 0  sibling: 0
      next-interp: 9  dead-savings: 0

   9  [2] *_10[j_7(D)] = 2;
      REF  : _5 + ((sizetype) j_7(D) * 4) + 800 : int[20] *
      basis: 5  dependent: 0  sibling: 0
      next-interp: 0  dead-savings: 0

This will in turn allow subsequent candidates to be seen in terms of
either _5 or _10, which may be necessary to avoid missed opportunities.
There may be a subsequent REF _15 +... that can be an affine expression
of either of these, for example.

If you fail to find a basis for a candidate with its first
interpretation, you can then follow the next-interp chain to look for a
basis for the next one, without the messy passing of extra possibilities
to the find-basis routine.
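
A minimal sketch of walking the next-interp chain is shown below.  The
struct toy_cand is a cut-down, hypothetical version of a candidate
record with only the fields needed here, and first_interp_with_basis is
an illustrative name rather than a function in the pass.

```c
#include <assert.h>
#include <stddef.h>

struct toy_cand
{
  int cand_num;     /* 1-based candidate number */
  int basis;        /* 0 means no basis was found */
  int next_interp;  /* next interpretation, 0 ends the chain */
};

/* TABLE holds N candidates, with candidate K stored at index K - 1.
   Starting at candidate FIRST, follow next_interp and return the
   number of the first interpretation that has a basis, or 0.  */
static int
first_interp_with_basis (const struct toy_cand *table, size_t n, int first)
{
  for (int num = first; num != 0; num = table[num - 1].next_interp)
    {
      assert (num > 0 && (size_t) num <= n);
      if (table[num - 1].basis != 0)
        return num;
    }
  return 0;
}
```

For the table shown above, starting at candidate 8 (no basis) the walk
reaches candidate 9, whose basis is 5.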

I haven't read the patch in detail, but I think this should give you
enough to work with to re-design the idea to fit better with the
existing framework.  Please let me know if you need more information, or
if you feel I've misunderstood something.

Thanks,
Bill



* Re: [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-11-11 18:10 ` Bill Schmidt
@ 2013-11-12 23:44   ` Yufeng Zhang
  2013-11-13 21:12     ` Bill Schmidt
  0 siblings, 1 reply; 34+ messages in thread
From: Yufeng Zhang @ 2013-11-12 23:44 UTC (permalink / raw)
  To: Bill Schmidt; +Cc: gcc-patches, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 14703 bytes --]

Hi Bill,

Many thanks for the review.

I find your suggestion of using the next_interp field quite
enlightening.  I have prepared a patch that adds the changes without
modifying the framework.  With the patch, the slsr pass now tries to
create a second candidate for each memory-accessing gimple statement and
chains it to the first one via the next_interp field.

There are two implications of this approach, though:

1) For each memory-accessing gimple statement, there can be two
candidates, and these two candidates can be part of different dependency
graphs (based on different base exprs).  Only one of the
dependency graphs should be traversed to do replace_refs.  Most of the
changes in the patch are there to handle this implication.

I am aware that you suggested following the next-interp chain only when
the search fails for the first interpretation.  However, that doesn't
work very well, as it can result in worse code-gen.  Take a variation
of the added test slsr-41.c as an example:

i1:  a2 [i] [j] = 1;
i2:  a2 [i] [j+1] = 2;
i3:  a2 [i+20] [j] = i;

With the 2nd interpretation created conditionally, the following two 
dependency chains will be established:

   i1 --> i2  (base expr is an SSA_NAME defined as (a2 + i * 200))
   i1 --> i3  (base expr is a tree expression of (a2 + i * 200))

The result is that the three gimple statements will be lowered to
MEM_REFs differently (as the candidates have different base_exprs);
later passes can get confused and generate worse code.

What this patch does is to create two interpretations where possible (if 
different base exprs exist); the following dependency chains will be 
produced:

   i1 --> i2  (base expr is an SSA_NAME defined as (a2 + i * 200))
   i1 --> i2 --> i3  (base expr is a tree expression of (a2 + i * 200))

In analyze_candidates_and_replace, a new function preferred_ref_cand is
called to analyze a root CAND_REF, and replace_refs is only called if
this root CAND_REF is found to be part of a larger dependency graph (or
a longer dependency chain in simple cases).  In the example above, the
2nd dependency chain will be picked to do replace_refs.
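
The selection idea can be sketched as below.  This is only a model of
the description above, not the code in the attached patch: the struct
toy_cand, count_dependents and preferred_interp are hypothetical, and
the tie-breaking rule (keep the first interpretation) is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down candidate record; the table is indexed by candidate
   number and slot 0 is unused so that 0 can mean "none".  */
struct toy_cand
{
  int dependent;    /* first candidate that uses this one as basis */
  int sibling;      /* next candidate sharing the same basis */
  int next_interp;  /* alternative interpretation of the statement */
};

/* Count every candidate reachable from NUM through dependent and
   sibling links, i.e. the size of the dependency graph below NUM.  */
static int
count_dependents (const struct toy_cand *t, int num)
{
  assert (t != NULL && num > 0);
  int total = 0;
  for (int d = t[num].dependent; d != 0; d = t[d].sibling)
    total += 1 + count_dependents (t, d);
  return total;
}

/* Among FIRST and its alternative interpretations, return the one
   with the most dependents.  Ties keep the earlier interpretation.  */
static int
preferred_interp (const struct toy_cand *t, int first)
{
  int best = first;
  int best_count = count_dependents (t, first);
  for (int n = t[first].next_interp; n != 0; n = t[n].next_interp)
    {
      int c = count_dependents (t, n);
      if (c > best_count)
        {
          best = n;
          best_count = c;
        }
    }
  return best;
}
```

In the i1/i2/i3 example, the interpretation rooted at the expanded tree
expression has two dependents against one for the SSA_NAME
interpretation, so it is the one selected.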

2) The 2nd implication is that the alternative candidate may expose the
underlying tree expression of a base expr, which can cause more
aggressive extraction and folding of immediate offsets.  Taking the new
test slsr-41 as an example, the code-gen difference on x86_64 between
the original patch and this patch is (-O2):

-       leal    5(%rsi), %edx
+       leal    5(%rsi), %eax
         movslq  %esi, %rsi
-       salq    $2, %rsi
-       movslq  %edx, %rax
-       leaq    (%rax,%rax,4), %rax
-       leaq    (%rax,%rax,4), %rcx
-       salq    $3, %rcx
-       leaq    (%rdi,%rcx), %rax
-       addq    %rsi, %rax
-       movl    $2, -1980(%rax)
-       movl    %edx, 20(%rax)
-       movl    %edx, 4024(%rax)
-       leaq    -600(%rdi,%rcx), %rax
-       addl    $1, 16(%rsi,%rax)
+       imulq   $204, %rsi, %rsi
+       addq    %rsi, %rdi
+       movl    $2, -980(%rdi)
+       movl    %eax, 1020(%rdi)
+       movl    %eax, 5024(%rdi)
+       addl    $1, 416(%rdi)
         ret

As you can see, larger offsets are produced, as the affine expander
is able to look deep into the tree expression.  This raises the concern
that larger immediates can cause worse code-gen when they are out
of the supported range on a target.  On x86_64 this is not obvious (as
it allows larger ranges), but on arm cortex-a15 the store with the
immediate 5024 will be done by

         movw    r2, #5024
         str     r3, [r0, r2]

which is not optimal.  Things can get worse when there are multiple
loads/stores with large immediates, as each one may require an extra mov
immediate instruction.  One thing that could potentially be done is to
reduce the strength of multiple large immediates later on in some RTL
pass by doing an initial offsetting first.  What do you think?  Are you
particularly concerned about this issue?
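
The immediates in the new x86_64 sequence can be reproduced by hand from
the test source.  The sketch below assumes a 4-byte int, so one row of
arr_2[50][50] is 200 bytes and the shared term is 204 * v1 (matching the
imulq $204); ref_offset and slsr41_immediates are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

enum { ELT_BYTES = 4, ROW_BYTES = 50 * ELT_BYTES };  /* assumes 4-byte int */

/* Byte offset of a2[row][col] from a2.  */
static long
ref_offset (long row, long col)
{
  return row * ROW_BYTES + col * ELT_BYTES;
}

/* Fill OUT with the constant part of each of the four references in
   slsr-41.c, i.e. the byte offset minus the shared 204 * v1 term.  */
static void
slsr41_immediates (long v1, long out[4])
{
  assert (out != NULL);
  long i = v1 + 5;
  long j = i;
  long shared = (ROW_BYTES + ELT_BYTES) * v1;   /* 204 * v1 */

  out[0] = ref_offset (i - 10, j) - shared;     /* a2[i-10][j]   */
  out[1] = ref_offset (i, j++) - shared;        /* a2[i][j++]    */
  out[2] = ref_offset (i + 20, j++) - shared;   /* a2[i+20][j++] */
  out[3] = ref_offset (i - 3, i - 1) - shared;  /* a2[i-3][i-1]  */
}
```

For any v1 this yields -980, 1020, 5024 and 416, the four displacements
seen in the listing above.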

The patch passes bootstrapping on arm and x86_64; the regtest is
still running.

Here is the changelog:

gcc/

         * gimple-ssa-strength-reduction.c: Include tree-affine.h.
         (name_expansions): New static variable.
         (get_alternative_base): New function.
         (restructure_reference): Add new local variables 'alt_base' and
         'delta'; call get_alternative_base and alloc_cand_and_find_basis
         to create an alternative interpretation.
         (num_of_dependents): New function.
         (preferred_ref_cand): Ditto.
         (analyze_candidates_and_replace): Call preferred_ref_cand for
         CAND_REF and skip replace_refs if the returned value is
         different.
         (execute_strength_reduction): Call free_affine_expand_cache with
         &name_expansions.

gcc/testsuite/

         * gcc.dg/tree-ssa/slsr-41.c: New test.


For your consideration, I've also attached another patch, which is an
improvement on the original patch.  It reduces the number of changes to
the existing framework, e.g. by leaving find_basis_for_base_expr
unchanged.  While it still slightly modifies the interfaces
(find_basis_for_candidate and record_potential_basis), it has an
advantage over the 1st patch attached here: its impact on code-gen is
much smaller, as it enables more ARRAY_REFs to be lowered without
handing the underlying tree expression over to replace_ref.  It creates
the following dependency chains for the aforementioned example:

   i1 --> i2  (base expr is an SSA_NAME defined as (a2 + i * 200))
   i1 --> i2 --> i3  (base expr is a tree expression of (a2 + i * 200))

While they look the same as what the 1st patch produces, only one
candidate is generated for each memory-accessing gimple statement; some
candidates are chained twice, once to a cand_chain with an SSA_NAME as
its base_expr and once to a cand_chain with the underlying tree
expression as its base_expr.  In other words, it produces two different
dependency graphs without creating different interpretations, by
utilizing the existing framework of cand_chain and
find_basis_for_base_expr.

The patch passes bootstrapping on arm and x86_64, as well as regtest
on x86_64.  The following is the changelog entry:

gcc/

         * gimple-ssa-strength-reduction.c: Include tree-affine.h.
         (name_expansions): New static variable.
         (alt_base_map): Ditto.
         (get_alternative_base): New function.
         (find_basis_for_candidate): For CAND_REF, optionally call
         find_basis_for_base_expr with the returned value from
         get_alternative_base.
         (record_potential_basis): Add new parameter 'base' of type 'tree';
         return if base == NULL; use base to set node->base_expr.
         (alloc_cand_and_find_basis): Update; call record_potential_basis
         for CAND_REF with the returned value from get_alternative_base.
         (execute_strength_reduction): Call pointer_map_create for
         alt_base_map; call free_affine_expand_cache with
         &name_expansions.

gcc/testsuite/

         * gcc.dg/tree-ssa/slsr-41.c: New test.


Which patch do you prefer?

If you have any questions about either patch, please let me know.

Regards,
Yufeng


On 11/11/13 17:09, Bill Schmidt wrote:
> Hi Yufeng,
>
> The idea is a good one but I don't like your implementation of adding an
> extra expression parameter to look at on the find_basis_for_candidate
> lookup.  This goes against the design of the pass and may not be
> sufficiently general (might there be situations where a third possible
> basis could exist?).
>
> The overall design is set up to have alternate interpretations of
> candidates in the candidate table to handle this sort of ambiguity.  The
> goal for your example is create a second candidate (chained to the first
> one by way of the next_interp field) so that the candidate table looks
> like this:
>
>     8  [2] *_10[j_7(D)] = 2;
>        REF  : _10 + ((sizetype) j_7(D) * 4) + 0 : int[20] *
>        basis: 0  dependent: 0  sibling: 0
>        next-interp: 9  dead-savings: 0
>
>     9  [2] *_10[j_7(D)] = 2;
>        REF  : _5 + ((sizetype) j_7(D) * 4) + 800 : int[20] *
>        basis: 5  dependent: 0  sibling: 0
>        next-interp: 0  dead-savings: 0
>
> This will in turn allow subsequent candidates to be seen in terms of
> either _5 or _10, which may be necessary to avoid missed opportunities.
> There may be a subsequent REF _15 +... that can be an affine expression
> of either of these, for example.
>
> If you fail to find a basis for a candidate with its first
> interpretation, you can then follow the next-interp chain to look for a
> basis for the next one, without the messy passing of extra possibilities
> to the find-basis routine.
>
> I haven't read the patch in detail, but I think this should give you
> enough to work with to re-design the idea to fit better with the
> existing framework.  Please let me know if you need more information, or
> if you feel I've misunderstood something.
>
> Thanks,
> Bill
>
> On Mon, 2013-11-04 at 18:41 +0000, Yufeng Zhang wrote:
>> Hi,
>>
>> This patch extends the slsr pass to optionally use an alternative base
>> expression in finding basis for CAND_REFs.  Currently the pass uses
>> hash-based algorithm to match the base_expr in a candidate.  Given a
>> test case like the following, slsr will not be able to recognize the two
>> CAND_REFs have the same basis, as their base_expr are of different
>> SSA_NAMEs:
>>
>> typedef int arr_2[20][20];
>>
>> void foo (arr_2 a2, int i, int j)
>> {
>>     a2[i][j] = 1;
>>     a2[i + 10][j] = 2;
>> }
>>
>> The gimple dump before slsr is like the following (using an
>> arm-none-eabi gcc):
>>
>>     i.0_2 = (unsigned int) i_1(D);
>>     _3 = i.0_2 * 80;
>>     _5 = a2_4(D) + _3;
>>     *_5[j_7(D)] = 1;<----
>>     _9 = _3 + 800;
>>     _10 = a2_4(D) + _9;
>>     *_10[j_7(D)] = 2;<----
>>
>> Here are the dumps for the two CAND_REFs generated for the two
>> statements pointed by the arrows:
>>
>>
>>     4  [2] _5 = a2_4(D) + _3;
>>        ADD  : a2_4(D) + (80 * i_1(D)) : int[20] *
>>        basis: 0  dependent: 0  sibling: 0
>>        next-interp: 0  dead-savings: 0
>>
>>     8  [2] *_10[j_7(D)] = 2;
>>        REF  : _10 + ((sizetype) j_7(D) * 4) + 0 : int[20] *
>>        basis: 5  dependent: 0  sibling: 0
>>        next-interp: 0  dead-savings: 0
>>
>> As mentioned previously, slsr cannot establish that candidate 4 is the
>> basis for the candidate 8, as they have different base_exprs: a2_4(D)
>> and _10, respectively.  However, the two references actually only differ
>> by an immediate offset (800).
>>
>> This patch uses the tree affine combination facilities to create an
>> optional alternative base expression to be used in finding (as well as
>> recording) the basis.  It calls tree_to_aff_combination_expand on
>> base_expr, reset the offset field of the generated aff_tree to 0 and
>> generate a tree from it by calling aff_combination_to_tree.
>>
>> The new tree is recorded as a potential basis, and when
>> find_basis_for_candidate fails to find a basis for a CAND_REF in its
>> normal approach, it searches again using a tree expanded in such way.
>> Such an expanded tree usually discloses the expression behind an
>> SSA_NAME.  In the example above, instead of seeing the strength
>> reduction candidate chains like this:
>>
>>     _5 ->  5
>>     _10 ->  8
>>
>> we are now having:
>>
>>     _5 ->  5
>>     _10 ->  8
>>     a2_4(D) + (sizetype) i_1(D) * 80 ->  5 ->  8
>>
>> With the candidates 5 and 8 linked to the same tree expression (a2_4(D)
>> + (sizetype) i_1(D) * 80), slsr is now able to establish that 5 is the
>> basis of 8.
>>
>> The patch doesn't attempt to change the content of any CAND_REF though.
>>    It only enables CAND_REFs which (1) have the same stride and (2) have
>> the underlying expressions of their base_expr only differ in immediate
>> offsets,  to be recognized to have the same basis.  The statements with
>> such CAND_REFs will be lowered to MEM_REFs, and later on the RTL
>> expander shall be able to fold and re-associate the immediate offsets to
>> the rightmost side of the addressing expression, and therefore exposes
>> the common sub-expression successfully.
>>
>> The code-gen difference of the example code on arm with -O2
>> -mcpu=cortex-15 is:
>>
>>           mov     r3, r1, asl #6
>> -       add     ip, r0, r2, asl #2
>>           str     lr, [sp, #-4]!
>> +       mov     ip, #1
>> +       mov     lr, #2
>>           add     r1, r3, r1, asl #4
>> -       mov     lr, #1
>> -       mov     r3, #2
>>           add     r0, r0, r1
>> -       add     r0, r0, #800
>> -       str     lr, [ip, r1]
>> -       str     r3, [r0, r2, asl #2]
>> +       add     r3, r0, r2, asl #2
>> +       str     ip, [r0, r2, asl #2]
>> +       str     lr, [r3, #800]
>>           ldr     pc, [sp], #4
>>
>> One fewer instruction in this simple case.
>>
>> The example used in illustration is too simple to show code-gen
>> difference on x86_64, but the included test case will show the benefit
>> of the patch quite obviously.
>>
>> The patch has passed
>>
>> * bootstrapping on arm and x86_64
>> * regtest on arm-none-eabi,  aarch64-none-elf and x86_64
>>
>> There is no regression in SPEC2K on arm or x86_64.
>>
>> OK to commit to the trunk?
>>
>> Any comments are welcome!
>>
>> Thanks,
>> Yufeng
>>
>>
>> gcc/
>>
>>           * gimple-ssa-strength-reduction.c: Include tree-affine.h.
>>           (find_basis_for_base_expr): Update comment.
>>           (find_basis_for_candidate): Add new parameter 'alt_base_expr' of
>>           type 'tree'.  Optionally call find_basis_for_base_expr with
>>           'alt_base_expr'.
>>           (record_potential_basis): Add new parameter 'alt_base_expr' of
>>           type 'tree'; set node->base_expr with 'alt_base_expr' if it is
>>           not NULL.
>>           (name_expansions): New static variable.
>>           (get_alternative_base): New function.
>>           (alloc_cand_and_find_basis): Call get_alternative_base for
>>           CAND_REF.  Update calls to find_basis_for_candidate and
>>           record_potential_basis.
>>           (execute_strength_reduction): Call free_affine_expand_cache with
>>           &name_expansions.
>>
>> gcc/testsuite/
>>
>>           * gcc.dg/tree-ssa/slsr-41.c: New test.
>
>

[-- Attachment #2: use-next-interp.patch --]
[-- Type: text/x-patch; name=use-next-interp.patch, Size: 6542 bytes --]

diff --git a/gcc/gimple-ssa-strength-reduction.c b/gcc/gimple-ssa-strength-reduction.c
index 88afc91..30e3763 100644
--- a/gcc/gimple-ssa-strength-reduction.c
+++ b/gcc/gimple-ssa-strength-reduction.c
@@ -53,6 +53,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "params.h"
 #include "hash-table.h"
 #include "tree-ssa-address.h"
+#include "tree-affine.h"
 \f
 /* Information about a strength reduction candidate.  Each statement
    in the candidate table represents an expression of one of the
@@ -420,6 +421,31 @@ cand_chain_hasher::equal (const value_type *chain1, const compare_type *chain2)
 /* Hash table embodying a mapping from base exprs to chains of candidates.  */
 static hash_table <cand_chain_hasher> base_cand_map;
 \f
+/* Pointer map used by tree_to_aff_combination_expand.  */
+static struct pointer_map_t *name_expansions;
+
+/* Given BASE, use the tree affine combination facilities to
+   find the underlying tree expression for BASE, with any
+   immediate offset excluded.  */
+
+static tree
+get_alternative_base (tree base, double_int *offset)
+{
+  tree expr;
+  aff_tree aff;
+
+  tree_to_aff_combination_expand (base, TREE_TYPE (base),
+				  &aff, &name_expansions);
+  *offset = aff.offset;
+  aff.offset = tree_to_double_int (integer_zero_node);
+  expr = aff_combination_to_tree (&aff);
+
+  if (expr == base)
+    return NULL;
+  else
+    return expr;
+}
+
 /* Look in the candidate table for a CAND_PHI that defines BASE and
    return it if found; otherwise return NULL.  */
 
@@ -912,11 +938,11 @@ restructure_reference (tree *pbase, tree *poffset, double_int *pindex,
 static void
 slsr_process_ref (gimple gs)
 {
-  tree ref_expr, base, offset, type;
+  tree ref_expr, base, offset, type, alt_base;
   HOST_WIDE_INT bitsize, bitpos;
   enum machine_mode mode;
   int unsignedp, volatilep;
-  double_int index;
+  double_int index, delta;
   slsr_cand_t c;
 
   if (gimple_vdef (gs))
@@ -942,6 +968,16 @@ slsr_process_ref (gimple gs)
 
   /* Add the candidate to the statement-candidate mapping.  */
   add_cand_for_stmt (gs, c);
+
+  /* Add alternate interpretation.  */
+  if ((alt_base = get_alternative_base (base, &delta)))
+    {
+      slsr_cand_t c2 =
+	alloc_cand_and_find_basis (CAND_REF, gs, alt_base, index + delta,
+				   offset, type, 0);
+
+      c->next_interp = c2->cand_num;
+    }
 }
 
 /* Create a candidate entry for a statement GS, where GS multiplies
@@ -1802,6 +1838,80 @@ dump_incr_vec (void)
     }
 }
 \f
+/* Helper routine for preferred_ref_cand.  Given C which is a CAND_REF,
+   recursively count and return the number of dependents, including
+   itself.  */
+
+static int
+num_of_dependents (slsr_cand_t c)
+{
+  int n = 1;
+
+  if (c->sibling)
+    n += num_of_dependents (lookup_cand (c->sibling));
+
+  if (c->dependent)
+    n += num_of_dependents (lookup_cand (c->dependent));
+
+  return n;
+}
+
+/* Some of the memory accessing gimple statements have two CAND_REF
+   candidates as a result of an optional backtracing into the base
+   expr.  The routine checks and compares the two candidates, if both
+   exist; the candidate with a more dominating basis or the one
+   whose dependency graph has more nodes is returned.  In the case of
+   a draw, the candidate with the original base expr (primary) is
+   preferred to the backtraced one (secondary).  C is the CAND_REF
+   to be checked.
+
+   The whole idea is to keep these gimple statements from being
+   replace_ref'ed twice, and in a random order.  */
+
+static slsr_cand_t
+preferred_ref_cand (slsr_cand_t c)
+{
+  slsr_cand_t primary, secondary, theother;
+  slsr_cand_t *result
+    = (slsr_cand_t *) pointer_map_contains (stmt_cand_map,
+					    c->cand_stmt);
+  gcc_assert (result);
+
+  primary = *result;
+  if (primary->next_interp)
+    secondary = lookup_cand (primary->next_interp);
+  else
+    secondary = NULL;
+
+  gcc_assert (c == primary || c == secondary);
+  theother = c == primary ? secondary : primary;
+
+  if (theother)
+    {
+      /* An earlier basis exists!  The replacement may have
+	 already happened.  */
+      if (theother->basis != 0)
+	return theother;
+
+      if (theother->dependent != 0)
+	{
+	  int num_c = num_of_dependents (c);
+	  int num_t = num_of_dependents (theother);
+
+	  /* Fewer dependents, lower priority.  */
+	  if (num_c < num_t)
+	    return theother;
+
+	  /* When the numbers are the same, the primary candidate
+	     is preferred.  */
+	  if (num_c == num_t && theother == primary)
+	    return theother;
+	}
+    }
+
+  return c;
+}
+
 /* Replace *EXPR in candidate C with an equivalent strength-reduced
    data reference.  */
 
@@ -3453,7 +3563,20 @@ analyze_candidates_and_replace (void)
       /* If this is a chain of CAND_REFs, unconditionally replace
 	 each of them with a strength-reduced data reference.  */
       if (c->kind == CAND_REF)
-	replace_refs (c);
+	{
+	  slsr_cand_t t = preferred_ref_cand (c);
+
+	  if (t != c)
+	    {
+	      if (dump_file && (dump_flags & TDF_DETAILS))
+		fprintf (dump_file, "\tProcessing skipped: "
+			 "higher-priority dependency tree is detected, "
+			 "where %d is chained.\n", t->cand_num);
+	      continue;
+	    }
+
+	  replace_refs (c);
+	}
 
       /* If the common stride of all related candidates is a known
 	 constant, each candidate without a phi-dependence can be
@@ -3539,6 +3662,8 @@ execute_strength_reduction (void)
       dump_cand_chains ();
     }
 
+  free_affine_expand_cache (&name_expansions);
+
   /* Analyze costs and make appropriate replacements.  */
   analyze_candidates_and_replace ();
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/slsr-41.c b/gcc/testsuite/gcc.dg/tree-ssa/slsr-41.c
new file mode 100644
index 0000000..870d714
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/slsr-41.c
@@ -0,0 +1,24 @@
+/* Verify straight-line strength reduction in using
+   alternative base expr to record and look for the
+   potential candidate.  */
+
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-slsr" } */
+
+typedef int arr_2[50][50];
+
+void foo (arr_2 a2, int v1)
+{
+  int i, j;
+
+  i = v1 + 5;
+  j = i;
+  a2 [i-10] [j] = 2;
+  a2 [i] [j++] = i;
+  a2 [i+20] [j++] = i;
+  a2 [i-3] [i-1] += 1;
+  return;
+}
+
+/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
+/* { dg-final { cleanup-tree-dump "slsr" } } */

[-- Attachment #3: use-aff-tree-v2.patch --]
[-- Type: text/x-patch; name=use-aff-tree-v2.patch, Size: 6362 bytes --]

diff --git a/gcc/gimple-ssa-strength-reduction.c b/gcc/gimple-ssa-strength-reduction.c
index 88afc91..d069246 100644
--- a/gcc/gimple-ssa-strength-reduction.c
+++ b/gcc/gimple-ssa-strength-reduction.c
@@ -53,6 +53,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "params.h"
 #include "hash-table.h"
 #include "tree-ssa-address.h"
+#include "tree-affine.h"
 \f
 /* Information about a strength reduction candidate.  Each statement
    in the candidate table represents an expression of one of the
@@ -420,6 +421,42 @@ cand_chain_hasher::equal (const value_type *chain1, const compare_type *chain2)
 /* Hash table embodying a mapping from base exprs to chains of candidates.  */
 static hash_table <cand_chain_hasher> base_cand_map;
 \f
+/* Pointer map used by tree_to_aff_combination_expand.  */
+static struct pointer_map_t *name_expansions;
+/* Pointer map embodying a mapping from bases to alternative bases.  */
+static struct pointer_map_t *alt_base_map;
+
+/* Given BASE, use the tree affine combination facilities to
+   find the underlying tree expression for BASE, with any
+   immediate offset excluded.  */
+
+static tree
+get_alternative_base (tree base)
+{
+  tree *result = (tree *) pointer_map_contains (alt_base_map, base);
+
+  if (result == NULL)
+    {
+      tree expr;
+      aff_tree aff;
+
+      tree_to_aff_combination_expand (base, TREE_TYPE (base),
+				      &aff, &name_expansions);
+      aff.offset = tree_to_double_int (integer_zero_node);
+      expr = aff_combination_to_tree (&aff);
+
+      result = (tree *) pointer_map_insert (alt_base_map, base);
+      gcc_assert (!*result);
+
+      if (expr == base)
+	*result = NULL;
+      else
+	*result = expr;
+    }
+
+  return *result;
+}
+
 /* Look in the candidate table for a CAND_PHI that defines BASE and
    return it if found; otherwise return NULL.  */
 
@@ -439,9 +476,10 @@ find_phi_def (tree base)
   return c->cand_num;
 }
 
-/* Helper routine for find_basis_for_candidate.  May be called twice:
+/* Helper routine for find_basis_for_candidate.  May be called three times:
    once for the candidate's base expr, and optionally again for the
-   candidate's phi definition.  */
+   candidate's phi definition, as well as for an alternative base expr
+   in the case of CAND_REF.  */
 
 static slsr_cand_t
 find_basis_for_base_expr (slsr_cand_t c, tree base_expr)
@@ -518,6 +556,13 @@ find_basis_for_candidate (slsr_cand_t c)
 	}
     }
 
+  if (!basis && c->kind == CAND_REF)
+    {
+      tree alt_base_expr = get_alternative_base (c->base_expr);
+      if (alt_base_expr)
+	basis = find_basis_for_base_expr (c, alt_base_expr);
+    }
+
   if (basis)
     {
       c->sibling = basis->dependent;
@@ -528,17 +573,22 @@ find_basis_for_candidate (slsr_cand_t c)
   return 0;
 }
 
-/* Record a mapping from the base expression of C to C itself, indicating that
-   C may potentially serve as a basis using that base expression.  */
+/* Record a mapping from BASE to C, indicating that C may potentially serve
+   as a basis using that base expression.  BASE may be the same as
+   C->BASE_EXPR; alternatively BASE can be a different tree that shares the
+   underlying expression of C->BASE_EXPR.  */
 
 static void
-record_potential_basis (slsr_cand_t c)
+record_potential_basis (slsr_cand_t c, tree base)
 {
   cand_chain_t node;
   cand_chain **slot;
 
+  if (base == NULL)
+    return;
+
   node = (cand_chain_t) obstack_alloc (&chain_obstack, sizeof (cand_chain));
-  node->base_expr = c->base_expr;
+  node->base_expr = base;
   node->cand = c;
   node->next = NULL;
   slot = base_cand_map.find_slot (node, INSERT);
@@ -554,10 +604,18 @@ record_potential_basis (slsr_cand_t c)
 }
 
 /* Allocate storage for a new candidate and initialize its fields.
-   Attempt to find a basis for the candidate.  */
+   Attempt to find a basis for the candidate.
+
+   For CAND_REF, an alternative base may also be recorded and used
+   to find a basis.  This helps cases where the expression hidden
+   behind BASE (which is usually an SSA_NAME) has an immediate offset,
+   e.g.
+
+     a2[i][j] = 1;
+     a2[i + 20][j] = 2;  */
 
 static slsr_cand_t
-alloc_cand_and_find_basis (enum cand_kind kind, gimple gs, tree base, 
+alloc_cand_and_find_basis (enum cand_kind kind, gimple gs, tree base,
 			   double_int index, tree stride, tree ctype,
 			   unsigned savings)
 {
@@ -583,7 +641,9 @@ alloc_cand_and_find_basis (enum cand_kind kind, gimple gs, tree base,
   else
     c->basis = find_basis_for_candidate (c);
 
-  record_potential_basis (c);
+  record_potential_basis (c, base);
+  if (kind == CAND_REF)
+    record_potential_basis (c, get_alternative_base (base));
 
   return c;
 }
@@ -3524,6 +3584,9 @@ execute_strength_reduction (void)
   /* Allocate the mapping from base expressions to candidate chains.  */
   base_cand_map.create (500);
 
+  /* Allocate the mapping from bases to alternative bases.  */
+  alt_base_map = pointer_map_create ();
+
   /* Initialize the loop optimizer.  We need to detect flow across
      back edges, and this gives us dominator information as well.  */
   loop_optimizer_init (AVOID_CFG_MODIFICATIONS);
@@ -3539,6 +3602,9 @@ execute_strength_reduction (void)
       dump_cand_chains ();
     }
 
+  pointer_map_destroy (alt_base_map);
+  free_affine_expand_cache (&name_expansions);
+
   /* Analyze costs and make appropriate replacements.  */
   analyze_candidates_and_replace ();
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/slsr-41.c b/gcc/testsuite/gcc.dg/tree-ssa/slsr-41.c
new file mode 100644
index 0000000..870d714
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/slsr-41.c
@@ -0,0 +1,24 @@
+/* Verify straight-line strength reduction in using
+   alternative base expr to record and look for the
+   potential candidate.  */
+
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-slsr" } */
+
+typedef int arr_2[50][50];
+
+void foo (arr_2 a2, int v1)
+{
+  int i, j;
+
+  i = v1 + 5;
+  j = i;
+  a2 [i-10] [j] = 2;
+  a2 [i] [j++] = i;
+  a2 [i+20] [j++] = i;
+  a2 [i-3] [i-1] += 1;
+  return;
+}
+
+/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
+/* { dg-final { cleanup-tree-dump "slsr" } } */

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-11-12 23:44   ` Yufeng Zhang
@ 2013-11-13 21:12     ` Bill Schmidt
  2013-11-13 22:29       ` Yufeng Zhang
  0 siblings, 1 reply; 34+ messages in thread
From: Bill Schmidt @ 2013-11-13 21:12 UTC (permalink / raw)
  To: Yufeng Zhang; +Cc: gcc-patches, Richard Biener

Hi Yufeng,

On Tue, 2013-11-12 at 22:34 +0000, Yufeng Zhang wrote:
> Hi Bill,
> 
> Many thanks for the review.
> 
> I find your suggestion on using the next_interp field quite 
> enlightening.  I prepared a patch which adds changes without modifying 
> the framework.  With the patch, the slsr pass now tries to create a 
> second candidate for each memory accessing gimple statement, and chain 
> it to the first one via the next_interp field.
> 
> There are two implications in this approach though:
> 
> 1) For each memory accessing gimple statement, there can be two 
> candidates, and these two candidates can be part of different dependency 
> graphs respectively (based on different base expr).  Only one of the 
> dependency graphs should be traversed to do replace_refs.  Most of the 
> changes in the patch are to handle this implication.
> 
> I am aware that you suggest to follow the next-interp chain only when 
> the searching fails for the first interpretation.  However, that doesn't 
> work very well, as it can result in worse code-gen.  Taking a varied 
> form of the added test slsr-41.c for example:
> 
> i1:  a2 [i] [j] = 1;
> i2:  a2 [i] [j+1] = 2;
> i3:  a2 [i+20] [j] = i;
> 
> With the 2nd interpretation created conditionally, the following two 
> dependency chains will be established:
> 
>    i1 --> i2  (base expr is an SSA_NAME defined as (a2 + i * 200))
>    i1 --> i3  (base expr is a tree expression of (a2 + i * 200))

So it seems to me that really what needs to happen is to unify those two
base_exprs.  We don't currently have logic in this pass to look up an
SSA name based on {base, index, stride, cand_type}, but that could be
done with a hash table.  For now to save processing time it would make
sense to only do that for MEM candidates, though the cand_type should be
included in the hash to allow this to be used for other candidate types
if necessary.  Of course, the SSA name definition must dominate the
candidate to be eligible as a basis, and that should be checked, but
this should generally be the case.

The goal should be for all of these references to have the same base
expr so that i3 can choose either i1 or i2 as a basis.  (For now the
logic in the pass chooses the most dominating basis, but eventually I
would like to add heuristics to make better choices.)

If all three of these use the same base expr, that should eliminate your
concerns, right?
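
A lookup of this kind could be sketched roughly as follows.  This is
only an illustration under stated assumptions: plain integers stand in
for trees and types, and struct cand_key, record_name and lookup_name
are made-up names rather than anything that exists in
gimple-ssa-strength-reduction.c.  The dominance check mentioned above
is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical key: plain integers stand in for GCC trees and types.  */
struct cand_key
{
  int base;       /* id of the base expression */
  long index;     /* constant index */
  int stride;     /* id of the stride expression */
  int cand_type;  /* id of the candidate type */
};

enum { TABLE_SIZE = 64, NO_NAME = -1 };

struct slot
{
  int used;
  struct cand_key key;
  int ssa_name;
};

static struct slot table[TABLE_SIZE];

/* Combine all four fields so that candidates of different types
   never share a key.  */
static size_t
hash_key (const struct cand_key *k)
{
  size_t h = (size_t) k->base;
  h = h * 31 + (size_t) k->index;
  h = h * 31 + (size_t) k->stride;
  h = h * 31 + (size_t) k->cand_type;
  return h % TABLE_SIZE;
}

static int
key_equal (const struct cand_key *a, const struct cand_key *b)
{
  return a->base == b->base && a->index == b->index
         && a->stride == b->stride && a->cand_type == b->cand_type;
}

/* Record that SSA_NAME computes the expression described by K.
   Linear probing; the sketch assumes the table never fills up.  */
static void
record_name (struct cand_key k, int ssa_name)
{
  size_t i = hash_key (&k);
  assert (ssa_name != NO_NAME);
  while (table[i].used && !key_equal (&table[i].key, &k))
    i = (i + 1) % TABLE_SIZE;
  table[i].used = 1;
  table[i].key = k;
  table[i].ssa_name = ssa_name;
}

/* Return the SSA name recorded for K, or NO_NAME if there is none.  */
static int
lookup_name (struct cand_key k)
{
  size_t i = hash_key (&k);
  while (table[i].used)
    {
      if (key_equal (&table[i].key, &k))
        return table[i].ssa_name;
      i = (i + 1) % TABLE_SIZE;
    }
  return NO_NAME;
}
```

A miss (NO_NAME) would simply mean that no SSA name is known for that
combination, in which case the candidate keeps its own base expr.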

> 
> the result is that three gimples will be lowered to MEM_REFs differently 
> (as the candidates have different base_exprs); the later passes can get 
> confused, generating worse code.
> 
> What this patch does is to create two interpretations where possible (if 
> different base exprs exist); the following dependency chains will be 
> produced:
> 
>    i1 --> i2  (base expr is an SSA_NAME defined as (a2 + i * 200))
>    i1 --> i2 --> i3  (base expr is a tree expression of (a2 + i * 200))
> 
> In analyze_candidates_and_replace, a new function preferred_ref_cand is 
> called to analyze a root CAND_REF and replace_refs is only called if 
> this root CAND_REF is found to be part of a larger dependency graph (or 
> longer dependency chain in simple cases).  In the example above, the 2nd 
> dependency chain will be picked up to do replace_refs.
> 
> 2) The 2nd implication is that the alternative candidate may expose the 
> underlying tree expression of a base expr, which can cause more 
> aggressive extraction and folding of immediate offsets.  Taking the new 
> test slsr-41 for example, the code-gen difference on x86_64 with the 
> original patch and this patch is (-O2):
> 
> -       leal    5(%rsi), %edx
> +       leal    5(%rsi), %eax
>          movslq  %esi, %rsi
> -       salq    $2, %rsi
> -       movslq  %edx, %rax
> -       leaq    (%rax,%rax,4), %rax
> -       leaq    (%rax,%rax,4), %rcx
> -       salq    $3, %rcx
> -       leaq    (%rdi,%rcx), %rax
> -       addq    %rsi, %rax
> -       movl    $2, -1980(%rax)
> -       movl    %edx, 20(%rax)
> -       movl    %edx, 4024(%rax)
> -       leaq    -600(%rdi,%rcx), %rax
> -       addl    $1, 16(%rsi,%rax)
> +       imulq   $204, %rsi, %rsi
> +       addq    %rsi, %rdi
> +       movl    $2, -980(%rdi)
> +       movl    %eax, 1020(%rdi)
> +       movl    %eax, 5024(%rdi)
> +       addl    $1, 416(%rdi)
>          ret
> 
> As you can see, larger offsets are produced because the affine expander 
> is able to look deep into the tree expression.  This raises the concern 
> that larger immediates can cause worse code-gen when the immediates are 
> out of the supported range on a target.  On x86_64 it is not obvious (as 
> it allows larger ranges), but on arm cortex-a15 the store with the 
> immediate 5024 will be done by
> 
>          movw    r2, #5024
>          str     r3, [r0, r2]
> 
> which is not optimal.  Things can get worse when there are multiple 
> loads/stores with large immediates, as each one may require an extra mov 
> immediate instruction.  One thing that can potentially be done is to 
> reduce the strength of multiple large immediates later on in some RTL 
> pass by doing an initial offsetting first.  What do you think?  Are you 
> particularly concerned about this issue?

To me, this seems like something that the middle end should not concern
itself with, but that may be oversimplifying.  I would think this is
not the only pass that can create such issues, and the overall code
generation should usually be improved anyway.  Richard, would you care
to weigh in here?
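
For reference, the displacements in the second x86_64 listing can be
reproduced by hand.  The following is a minimal sketch, assuming a
4-byte int and the arr_2 type from slsr-41.c (rows of 50 ints, so 200
bytes per row); the helper names byte_offset and full_offset are made
up for illustration.  With i = v1 + 5, every access has the form
a2 + 204 * v1 + K, and byte_offset returns the constant K.

```c
#include <assert.h>

/* Assumptions for illustration: sizeof (int) == 4 and rows of 50 ints,
   as in the arr_2 typedef of slsr-41.c.  */
enum { ELEM_SIZE = 4, ROW_SIZE = 50 * ELEM_SIZE };

/* Hypothetical helper.  For an access a2[v1 + row_c][v1 + col_c] the
   byte address is a2 + 204 * v1 + K; return the constant K.  */
static long
byte_offset (long row_c, long col_c)
{
  return row_c * ROW_SIZE + col_c * ELEM_SIZE;
}

/* Byte address of a2[v1 + row_c][v1 + col_c] relative to a2, computed
   directly, to cross-check the factored form above.  */
static long
full_offset (long v1, long row_c, long col_c)
{
  assert (ROW_SIZE + ELEM_SIZE == 204);
  return (v1 + row_c) * ROW_SIZE + (v1 + col_c) * ELEM_SIZE;
}
```

The four accesses in the test correspond to (row_c, col_c) pairs of
(-5, 5), (5, 5), (25, 6) and (2, 4), which give -980, 1020, 5024 and
416, matching the displacements off %rdi in the listing.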

A couple of quick comments on the next_interp patch:

 * You don't need num_of_dependents ().  You should be able to add a
forward declaration for count_candidates () and use it.
 * Your new test case is missing a final newline, so your patch doesn't
apply cleanly.

Please look into unifying the base expressions, as I believe you should
not need the preferred_ref_cand logic if you do that.

I still prefer the approach of using next_interp for its generality and
expandability.

Thanks,
Bill



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-11-13 21:12     ` Bill Schmidt
@ 2013-11-13 22:29       ` Yufeng Zhang
  2013-11-13 22:30         ` Bill Schmidt
  0 siblings, 1 reply; 34+ messages in thread
From: Yufeng Zhang @ 2013-11-13 22:29 UTC (permalink / raw)
  To: Bill Schmidt; +Cc: gcc-patches, Richard Biener

Hi Bill,

On 11/13/13 18:04, Bill Schmidt wrote:
> Hi Yufeng,
>
> On Tue, 2013-11-12 at 22:34 +0000, Yufeng Zhang wrote:
>> Hi Bill,
>>
>> Many thanks for the review.
>>
>> I find your suggestion on using the next_interp field quite
>> enlightening.  I prepared a patch which adds changes without modifying
>> the framework.  With the patch, the slsr pass now tries to create a
>> second candidate for each memory accessing gimple statement, and chain
>> it to the first one via the next_interp field.
>>
>> There are two implications in this approach though:
>>
>> 1) For each memory accessing gimple statement, there can be two
>> candidates, and these two candidates can be part of different dependency
>> graphs respectively (based on different base expr).  Only one of the
>> dependency graphs should be traversed to do replace_refs.  Most of the
>> changes in the patch are to handle this implication.
>>
>> I am aware that you suggest to follow the next-interp chain only when
>> the searching fails for the first interpretation.  However, that doesn't
>> work very well, as it can result in worse code-gen.  Taking a varied
>> form of the added test slsr-41.c for example:
>>
>> i1:  a2 [i] [j] = 1;
>> i2:  a2 [i] [j+1] = 2;
>> i3:  a2 [i+20] [j] = i;
>>
>> With the 2nd interpretation created conditionally, the following two
>> dependency chains will be established:
>>
>>     i1 -->  i2  (base expr is an SSA_NAME defined as (a2 + i * 200))
>>     i1 -->  i3  (base expr is a tree expression of (a2 + i * 200))
>
> So it seems to me that really what needs to happen is to unify those two
> base_exprs.  We don't currently have logic in this pass to look up an
> SSA name based on {base, index, stride, cand_type}, but that could be
> done with a hash table.  For now to save processing time it would make
> sense to only do that for MEM candidates, though the cand_type should be
> included in the hash to allow this to be used for other candidate types
> if necessary.  Of course, the SSA name definition must dominate the
> candidate to be eligible as a basis, and that should be checked, but
> this should generally be the case.

I'm not quite sure if the SSA_NAME look-up works; maybe I haven't fully 
understood what you suggest.

For i1 --> i3, the base_expr is the tree expression (a2 + i * 200), 
which is the result of a sequence of operations (conversion to affine, 
immediate offset removal and conversion to tree), with another SSA_NAME 
as the input.  In other words, there are two SSA_NAMEs involved in the 
example:

   _s1: (a2 + i * 200).
   _s2: (a2 + (i * 200 + 4000))

their strides and indexes are different.

I guess what you suggest is that given the tree expression (a2 + i * 
200), look up an SSA_NAME and return _s1.  If that is the case, the 
challenge will be how to analyze the tree expression and get the 
information on its {base, index, stride, cand_type}.  While it would be 
too specific and narrative to check for a POINTER_PLUS_EXPR expression, 
the existing framework (e.g. create_add_ssa_cand) seems to assume that 
the analyzed tree represents a genuine gimple statement.

Moreover, such an SSA_NAME may not exist at all, for example in the 
following case:

   i1:  a2 [i+1] [j] = 1;
   i2:  a2 [i+1] [j+1] = 2;
   i3:  a2 [i+20] [j] = i;

you wouldn't be able to find an SSA_NAME for (a2 + i * 200).
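
The point can be illustrated with a toy model.  This is only a sketch
and not GCC's aff_tree: struct toy_aff and its helpers are
hypothetical, a row address is modeled as a2 + coef_i * i + offset
assuming 200-byte rows, and the j term is ignored.  Clearing the
offset, in the same spirit as zeroing aff.offset in
get_alternative_base, makes the bases of a2[i+1] and a2[i+20] compare
equal even though no SSA_NAME computes a2 + i * 200 itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for an affine combination: a2 + coef_i * i + offset.
   Hypothetical type; GCC's real aff_tree is more general.  */
struct toy_aff
{
  long coef_i;   /* multiplier of the variable i, in bytes */
  long offset;   /* immediate byte offset */
};

/* Model of the row address &a2[i + row_c], assuming 200-byte rows.  */
static struct toy_aff
row_address (long row_c)
{
  struct toy_aff a = { 200, row_c * 200 };
  return a;
}

/* Return A with its immediate offset cleared, and store the removed
   offset in *DELTA.  */
static struct toy_aff
strip_offset (struct toy_aff a, long *delta)
{
  assert (delta != 0);
  *delta = a.offset;
  a.offset = 0;
  return a;
}

/* Two combinations name the same alternative base when everything
   that remains after stripping the offsets matches.  */
static bool
same_alt_base (struct toy_aff a, struct toy_aff b)
{
  long da, db;
  struct toy_aff sa = strip_offset (a, &da);
  struct toy_aff sb = strip_offset (b, &db);
  return sa.coef_i == sb.coef_i && sa.offset == sb.offset;
}
```

Here row_address (1) and row_address (20) carry offsets of 200 and
4000 respectively, and both strip down to the same a2 + i * 200.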

[snip]
> A couple of quick comments on the next_interp patch:
>
>   * You don't need num_of_dependents ().  You should be able to add a
> forward declaration for count_candidates () and use it.

Missed count_candidates (); thanks!

>   * Your new test case is missing a final newline, so your patch doesn't
> apply cleanly.

I'll fix it.

> Please look into unifying the base expressions, as I believe you should
> not need the preferred_ref_cand logic if you do that.

I would also like to live without preferred_ref_cand if feasible. :)

> I still prefer the approach of using next_interp for its generality and
> expandability.

Sure; this approach indeed fits the framework better.


Regards,
Yufeng

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-11-13 22:29       ` Yufeng Zhang
@ 2013-11-13 22:30         ` Bill Schmidt
  2013-11-13 23:14           ` Bill Schmidt
  0 siblings, 1 reply; 34+ messages in thread
From: Bill Schmidt @ 2013-11-13 22:30 UTC (permalink / raw)
  To: Yufeng Zhang; +Cc: gcc-patches, Richard Biener

Hi Yufeng,

On Wed, 2013-11-13 at 19:32 +0000, Yufeng Zhang wrote:
> Hi Bill,
> 
> On 11/13/13 18:04, Bill Schmidt wrote:
> > Hi Yufeng,
> >
> > On Tue, 2013-11-12 at 22:34 +0000, Yufeng Zhang wrote:
> >> Hi Bill,
> >>
> >> Many thanks for the review.
> >>
> >> I find your suggestion on using the next_interp field quite
> >> enlightening.  I prepared a patch which adds changes without modifying
> >> the framework.  With the patch, the slsr pass now tries to create a
> >> second candidate for each memory accessing gimple statement, and chain
> >> it to the first one via the next_interp field.
> >>
> >> There are two implications in this approach though:
> >>
> >> 1) For each memory accessing gimple statement, there can be two
> >> candidates, and these two candidates can be part of different dependency
> >> graphs respectively (based on different base expr).  Only one of the
> >> dependency graphs should be traversed to do replace_refs.  Most of the
> >> changes in the patch are to handle this implication.
> >>
> >> I am aware that you suggest to follow the next-interp chain only when
> >> the searching fails for the first interpretation.  However, that doesn't
> >> work very well, as it can result in worse code-gen.  Taking a varied
> >> form of the added test slsr-41.c for example:
> >>
> >> i1:  a2 [i] [j] = 1;
> >> i2:  a2 [i] [j+1] = 2;
> >> i3:  a2 [i+20] [j] = i;
> >>
> >> With the 2nd interpretation created conditionally, the following two
> >> dependency chains will be established:
> >>
> >>     i1 -->  i2  (base expr is an SSA_NAME defined as (a2 + i * 200))
> >>     i1 -->  i3  (base expr is a tree expression of (a2 + i * 200))
> >
> > So it seems to me that really what needs to happen is to unify those two
> > base_exprs.  We don't currently have logic in this pass to look up an
> > SSA name based on {base, index, stride, cand_type}, but that could be
> > done with a hash table.  For now to save processing time it would make
> > sense to only do that for MEM candidates, though the cand_type should be
> > included in the hash to allow this to be used for other candidate types
> > if necessary.  Of course, the SSA name definition must dominate the
> > candidate to be eligible as a basis, and that should be checked, but
> > this should generally be the case.
> 
> I'm not quite sure if the SSA_NAME look-up works; maybe I haven't fully 
> understood what you suggest.
> 
> For i1 --> i3, the base_expr is the tree expression (a2 + i * 200), 
> which is the result of a sequence of operations (conversion to affine, 
> immediate offset removal and conversion to tree), with another SSA_NAME 
> as the input.  In other words, there are two SSA_NAMEs involved in the 
> example:
> 
>    _s1: (a2 + i * 200).
>    _s2: (a2 + (i * 200 + 4000))
> 
> their strides and indexes are different.
> 
> I guess what you suggest is that given the tree expression (a2 + i * 
> 200), look up an SSA_NAME and return _s1.  If that is the case, the 
> challenge will be how to analyze the tree expression and get the 
> information on its {base, index, stride, cand_type}.  While it would be 
> too specific and narrative to check for a POINTER_PLUS_EXPR expression, 
> the existing framework (e.g. create_add_ssa_cand) seems to assume that 
> the analyzed tree represents a genuine gimple statement.
> 
> Moreover, such an SSA_NAME may not exist at all, for example in the 
> following case:
> 
>    i1:  a2 [i+1] [j] = 1;
>    i2:  a2 [i+1] [j+1] = 2;
>    i3:  a2 [i+20] [j] = i;
> 
> you wouldn't be able to find an SSA_NAME for (a2 + i * 200).

Ok.  It is probably too much to hope for to get a sufficiently general
approach to handle all of these cases cleanly.

Bleah.  The whole preferred_ref_cand business seems very ad hoc to me,
and to some extent is closing the barn door after the cows have escaped.
Perhaps we can't use the next-interpretation infrastructure to solve
this problem ideally, in which case I apologize for leading you down
this path.  The alternate patch at least keeps the candidate tree in a
straightforward state, and the new version is less intrusive than the
original.

Let me look that version over more carefully and I'll get back to you.
Thanks for your patience.

Bill

> 
> [snip]
> > A couple of quick comments on the next_interp patch:
> >
> >   * You don't need num_of_dependents ().  You should be able to add a
> > forward declaration for count_candidates () and use it.
> 
> Missed count_candidates (); thanks!
> 
> >   * Your new test case is missing a final newline, so your patch doesn't
> > apply cleanly.
> 
> I'll fix it.
> 
> > Please look into unifying the base expressions, as I believe you should
> > not need the preferred_ref_cand logic if you do that.
> 
> I would also like to live without preferred_ref_cand if feasible. :)
> 
> > I still prefer the approach of using next_interp for its generality and
> > expandability.
> 
> Sure; this approach indeed fits the framework better.
> 
> 
> Regards,
> Yufeng
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-11-13 22:30         ` Bill Schmidt
@ 2013-11-13 23:14           ` Bill Schmidt
  2013-11-13 23:25             ` Bill Schmidt
  2013-11-14  4:07             ` Yufeng Zhang
  0 siblings, 2 replies; 34+ messages in thread
From: Bill Schmidt @ 2013-11-13 23:14 UTC (permalink / raw)
  To: Yufeng Zhang; +Cc: gcc-patches, Richard Biener

Hi Yufeng,

The second version of your original patch is ok with me with the
following changes.  Sorry for the little side adventure into the
next-interp logic; in the end that's going to hurt more than it helps in
this case.  Thanks for having a look at it, anyway.  Thanks also for
cleaning up this version to be less intrusive to common interfaces; I
appreciate it.


>diff --git a/gcc/gimple-ssa-strength-reduction.c b/gcc/gimple-ssa-strength-reduction.c
>index 88afc91..d069246 100644
>--- a/gcc/gimple-ssa-strength-reduction.c
>+++ b/gcc/gimple-ssa-strength-reduction.c
>@@ -53,6 +53,7 @@ along with GCC; see the file COPYING3.  If not see
> #include "params.h"
> #include "hash-table.h"
> #include "tree-ssa-address.h"
>+#include "tree-affine.h"
> \f
> /* Information about a strength reduction candidate.  Each statement
>    in the candidate table represents an expression of one of the
>@@ -420,6 +421,42 @@ cand_chain_hasher::equal (const value_type *chain1, const compare_type *chain2)
> /* Hash table embodying a mapping from base exprs to chains of candidates.  */
> static hash_table <cand_chain_hasher> base_cand_map;
> \f
>+/* Pointer map used by tree_to_aff_combination_expand.  */
>+static struct pointer_map_t *name_expansions;
>+/* Pointer map embodying a mapping from bases to alternative bases.  */
>+static struct pointer_map_t *alt_base_map;
>+
>+/* Given BASE, use the tree affine combination facilities to
>+   find the underlying tree expression for BASE, with any
>+   immediate offset excluded.  */
>+
>+static tree
>+get_alternative_base (tree base)
>+{
>+  tree *result = (tree *) pointer_map_contains (alt_base_map, base);
>+
>+  if (result == NULL)
>+    {
>+      tree expr;
>+      aff_tree aff;
>+
>+      tree_to_aff_combination_expand (base, TREE_TYPE (base),
>+				      &aff, &name_expansions);
>+      aff.offset = tree_to_double_int (integer_zero_node);
>+      expr = aff_combination_to_tree (&aff);
>+
>+      result = (tree *) pointer_map_insert (alt_base_map, base);
>+      gcc_assert (!*result);
>+
>+      if (expr == base)
>+	*result = NULL;
>+      else
>+	*result = expr;
>+    }
>+
>+  return *result;
>+}
>+
> /* Look in the candidate table for a CAND_PHI that defines BASE and
>    return it if found; otherwise return NULL.  */
> 
>@@ -439,9 +476,10 @@ find_phi_def (tree base)
>   return c->cand_num;
> }
> 
>-/* Helper routine for find_basis_for_candidate.  May be called twice:
>+/* Helper routine for find_basis_for_candidate.  May be called three times:
>    once for the candidate's base expr, and optionally again for the
>-   candidate's phi definition.  */
>+   candidate's phi definition, as well as for an alternative base expr
>+   in the case of CAND_REF.  */

Technically this will never be called three times.  It will be called
once for the candidate's base expression, and optionally either for the
candidate's phi definition or for a CAND_REF's alternative base
expression.  (There is no phi processing for a CAND_REF.)
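
As a rough illustration of that call pattern, here is a minimal
standalone sketch.  It is not the pass's code: the struct, the integer
keys, and the names lookup_basis and find_basis are hypothetical
stand-ins, and the lookup always fails so that every fallback runs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the pass's data structures.  */
enum cand_kind { CAND_MULT, CAND_ADD, CAND_REF, CAND_PHI };

struct cand
{
  enum cand_kind kind;
  int base;      /* Key of the candidate's base expression.  */
  int phi_base;  /* Key of the phi definition's base, or 0 if none.  */
  int alt_base;  /* Key of the alternative base, or 0 if none.  */
};

/* Number of times lookup_basis has been called.  */
static int lookup_calls;

/* Stand-in for find_basis_for_base_expr: count the call and report
   "no basis found" (0) so that every fallback is exercised.  */
static int
lookup_basis (int key)
{
  assert (key != 0);
  lookup_calls++;
  return 0;
}

/* Stand-in for find_basis_for_candidate.  */
static int
find_basis (const struct cand *c)
{
  int basis = lookup_basis (c->base);

  /* Phi fallback: assumed never to apply to a CAND_REF.  */
  if (!basis && c->kind != CAND_REF && c->phi_base)
    basis = lookup_basis (c->phi_base);

  /* Alternative-base fallback: CAND_REF only.  */
  if (!basis && c->kind == CAND_REF && c->alt_base)
    basis = lookup_basis (c->alt_base);

  return basis;
}
```

Because the two fallbacks are guarded by mutually exclusive kind
checks, the helper runs at most twice per candidate in this model.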

> 
> static slsr_cand_t
> find_basis_for_base_expr (slsr_cand_t c, tree base_expr)
>@@ -518,6 +556,13 @@ find_basis_for_candidate (slsr_cand_t c)
> 	}
>     }
> 
>+  if (!basis && c->kind == CAND_REF)
>+    {
>+      tree alt_base_expr = get_alternative_base (c->base_expr);
>+      if (alt_base_expr)
>+	basis = find_basis_for_base_expr (c, alt_base_expr);
>+    }
>+
>   if (basis)
>     {
>       c->sibling = basis->dependent;
>@@ -528,17 +573,22 @@ find_basis_for_candidate (slsr_cand_t c)
>   return 0;
> }
> 
>-/* Record a mapping from the base expression of C to C itself, indicating that
>-   C may potentially serve as a basis using that base expression.  */
>+/* Record a mapping from BASE to C, indicating that C may potentially serve
>+   as a basis using that base expression.  BASE may be the same as
>+   C->BASE_EXPR; alternatively BASE can be a different tree that shares the
>+   underlying expression of C->BASE_EXPR.  */
> 
> static void
>-record_potential_basis (slsr_cand_t c)
>+record_potential_basis (slsr_cand_t c, tree base)
> {
>   cand_chain_t node;
>   cand_chain **slot;
> 
>+  if (base == NULL)
>+    return;

Please do this check outside the common code; it's not necessary except
for CAND_REFs.  Replace with:

  gcc_assert (base);

>+
>   node = (cand_chain_t) obstack_alloc (&chain_obstack, sizeof (cand_chain));
>-  node->base_expr = c->base_expr;
>+  node->base_expr = base;
>   node->cand = c;
>   node->next = NULL;
>   slot = base_cand_map.find_slot (node, INSERT);
>@@ -554,10 +604,18 @@ record_potential_basis (slsr_cand_t c)
> }
> 
> /* Allocate storage for a new candidate and initialize its fields.
>-   Attempt to find a basis for the candidate.  */
>+   Attempt to find a basis for the candidate.
>+
>+   For CAND_REF, an alternative base may also be recorded and used
>+   to find a basis.  This helps cases where the expression hidden
>+   behind BASE (which is usually an SSA_NAME) has immediate offset,
>+   e.g.
>+
>+     a2[i][j] = 1;
>+     a2[i + 20][j] = 2;  */
> 
> static slsr_cand_t
>-alloc_cand_and_find_basis (enum cand_kind kind, gimple gs, tree base, 
>+alloc_cand_and_find_basis (enum cand_kind kind, gimple gs, tree base,
> 			   double_int index, tree stride, tree ctype,
> 			   unsigned savings)
> {
>@@ -583,7 +641,9 @@ alloc_cand_and_find_basis (enum cand_kind kind, gimple gs, tree base,
>   else
>     c->basis = find_basis_for_candidate (c);
> 
>-  record_potential_basis (c);
>+  record_potential_basis (c, base);
>+  if (kind == CAND_REF)
>+    record_potential_basis (c, get_alternative_base (base));

Tied to the above change:

if (kind == CAND_REF)
  {
    tree alt_base = get_alternative_base (base);
    if (alt_base)
      record_potential_basis (c, alt_base);
  }

> 
>   return c;
> }
>@@ -3524,6 +3584,9 @@ execute_strength_reduction (void)
>   /* Allocate the mapping from base expressions to candidate chains.  */
>   base_cand_map.create (500);
> 
>+  /* Allocate the mapping from bases to alternative bases.  */
>+  alt_base_map = pointer_map_create ();
>+
>   /* Initialize the loop optimizer.  We need to detect flow across
>      back edges, and this gives us dominator information as well.  */
>   loop_optimizer_init (AVOID_CFG_MODIFICATIONS);
>@@ -3539,6 +3602,9 @@ execute_strength_reduction (void)
>       dump_cand_chains ();
>     }
> 
>+  pointer_map_destroy (alt_base_map);
>+  free_affine_expand_cache (&name_expansions);
>+
>   /* Analyze costs and make appropriate replacements.  */
>   analyze_candidates_and_replace ();
> 
>diff --git a/gcc/testsuite/gcc.dg/tree-ssa/slsr-41.c b/gcc/testsuite/gcc.dg/tree-ssa/slsr-41.c
>new file mode 100644
>index 0000000..870d714
>--- /dev/null
>+++ b/gcc/testsuite/gcc.dg/tree-ssa/slsr-41.c
>@@ -0,0 +1,24 @@
>+/* Verify straight-line strength reduction in using
>+   alternative base expr to record and look for the
>+   potential candidate.  */
>+
>+/* { dg-do compile } */
>+/* { dg-options "-O2 -fdump-tree-slsr" } */
>+
>+typedef int arr_2[50][50];
>+
>+void foo (arr_2 a2, int v1)
>+{
>+  int i, j;
>+
>+  i = v1 + 5;
>+  j = i;
>+  a2 [i-10] [j] = 2;
>+  a2 [i] [j++] = i;
>+  a2 [i+20] [j++] = i;
>+  a2 [i-3] [i-1] += 1;
>+  return;
>+}
>+
>+/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
>+/* { dg-final { cleanup-tree-dump "slsr" } } */

As mentioned previously, please add the missing newline at EOF.

Everything else looks OK to me.  Please ask Richard for final approval,
as I'm not a maintainer.

Thanks,
Bill

(P.S. I prefer inline patches rather than attachments; it makes it
easier to reply with markup.)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-11-13 23:14           ` Bill Schmidt
  2013-11-13 23:25             ` Bill Schmidt
@ 2013-11-14  4:07             ` Yufeng Zhang
  2013-11-19 12:32               ` [PING] " Yufeng Zhang
  2013-11-26 15:22               ` Richard Biener
  1 sibling, 2 replies; 34+ messages in thread
From: Yufeng Zhang @ 2013-11-14  4:07 UTC (permalink / raw)
  To: Bill Schmidt, Richard Biener; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1697 bytes --]

Hi Bill,

On 11/13/13 20:54, Bill Schmidt wrote:
> Hi Yufeng,
>
> The second version of your original patch is ok with me with the
> following changes.  Sorry for the little side adventure into the
> next-interp logic; in the end that's going to hurt more than it helps in
> this case.  Thanks for having a look at it, anyway.  Thanks also for
> cleaning up this version to be less intrusive to common interfaces; I
> appreciate it.

Thanks a lot for the review.  I've attached an updated patch with the 
suggested changes incorporated.

For the next-interp adventure, I was quite happy to do the experiment;
it was a good chance to gain insight into the pass.  Many thanks for
your prompt replies and your patience in guiding me!

> Everything else looks OK to me.  Please ask Richard for final approval,
> as I'm not a maintainer.

Hi Richard, would you be happy to OK the patch?

Regards,
Yufeng

gcc/

	* gimple-ssa-strength-reduction.c: Include tree-affine.h.
	(name_expansions): New static variable.
	(alt_base_map): Ditto.
	(get_alternative_base): New function.
	(find_basis_for_candidate): For CAND_REF, optionally call
	find_basis_for_base_expr with the returned value from
	get_alternative_base.
	(record_potential_basis): Add new parameter 'base' of type 'tree';
	add an assertion of non-NULL base; use base to set node->base_expr.
	(alloc_cand_and_find_basis): Update; call record_potential_basis
	for CAND_REF with the returned value from get_alternative_base.
	(execute_strength_reduction): Call pointer_map_create for
	alt_base_map; call free_affine_expand_cache with &name_expansions.

gcc/testsuite/

	* gcc.dg/tree-ssa/slsr-41.c: New test.

[-- Attachment #2: patch-use-affine-v3.txt --]
[-- Type: text/plain, Size: 6185 bytes --]

diff --git a/gcc/gimple-ssa-strength-reduction.c b/gcc/gimple-ssa-strength-reduction.c
index 88afc91..26502c3 100644
--- a/gcc/gimple-ssa-strength-reduction.c
+++ b/gcc/gimple-ssa-strength-reduction.c
@@ -53,6 +53,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "params.h"
 #include "hash-table.h"
 #include "tree-ssa-address.h"
+#include "tree-affine.h"
 \f
 /* Information about a strength reduction candidate.  Each statement
    in the candidate table represents an expression of one of the
@@ -420,6 +421,42 @@ cand_chain_hasher::equal (const value_type *chain1, const compare_type *chain2)
 /* Hash table embodying a mapping from base exprs to chains of candidates.  */
 static hash_table <cand_chain_hasher> base_cand_map;
 \f
+/* Pointer map used by tree_to_aff_combination_expand.  */
+static struct pointer_map_t *name_expansions;
+/* Pointer map embodying a mapping from bases to alternative bases.  */
+static struct pointer_map_t *alt_base_map;
+
+/* Given BASE, use the tree affine combination facilities to
+   find the underlying tree expression for BASE, with any
+   immediate offset excluded.  */
+
+static tree
+get_alternative_base (tree base)
+{
+  tree *result = (tree *) pointer_map_contains (alt_base_map, base);
+
+  if (result == NULL)
+    {
+      tree expr;
+      aff_tree aff;
+
+      tree_to_aff_combination_expand (base, TREE_TYPE (base),
+				      &aff, &name_expansions);
+      aff.offset = tree_to_double_int (integer_zero_node);
+      expr = aff_combination_to_tree (&aff);
+
+      result = (tree *) pointer_map_insert (alt_base_map, base);
+      gcc_assert (!*result);
+
+      if (expr == base)
+	*result = NULL;
+      else
+	*result = expr;
+    }
+
+  return *result;
+}
+
 /* Look in the candidate table for a CAND_PHI that defines BASE and
    return it if found; otherwise return NULL.  */
 
@@ -440,8 +477,9 @@ find_phi_def (tree base)
 }
 
 /* Helper routine for find_basis_for_candidate.  May be called twice:
-   once for the candidate's base expr, and optionally again for the
-   candidate's phi definition.  */
+   once for the candidate's base expr, and optionally again either for
+   the candidate's phi definition or for a CAND_REF's alternative base
+   expression.  */
 
 static slsr_cand_t
 find_basis_for_base_expr (slsr_cand_t c, tree base_expr)
@@ -518,6 +556,13 @@ find_basis_for_candidate (slsr_cand_t c)
 	}
     }
 
+  if (!basis && c->kind == CAND_REF)
+    {
+      tree alt_base_expr = get_alternative_base (c->base_expr);
+      if (alt_base_expr)
+	basis = find_basis_for_base_expr (c, alt_base_expr);
+    }
+
   if (basis)
     {
       c->sibling = basis->dependent;
@@ -528,17 +573,21 @@ find_basis_for_candidate (slsr_cand_t c)
   return 0;
 }
 
-/* Record a mapping from the base expression of C to C itself, indicating that
-   C may potentially serve as a basis using that base expression.  */
+/* Record a mapping from BASE to C, indicating that C may potentially serve
+   as a basis using that base expression.  BASE may be the same as
+   C->BASE_EXPR; alternatively BASE can be a different tree that shares the
+   underlying expression of C->BASE_EXPR.  */
 
 static void
-record_potential_basis (slsr_cand_t c)
+record_potential_basis (slsr_cand_t c, tree base)
 {
   cand_chain_t node;
   cand_chain **slot;
 
+  gcc_assert (base);
+
   node = (cand_chain_t) obstack_alloc (&chain_obstack, sizeof (cand_chain));
-  node->base_expr = c->base_expr;
+  node->base_expr = base;
   node->cand = c;
   node->next = NULL;
   slot = base_cand_map.find_slot (node, INSERT);
@@ -554,10 +603,18 @@ record_potential_basis (slsr_cand_t c)
 }
 
 /* Allocate storage for a new candidate and initialize its fields.
-   Attempt to find a basis for the candidate.  */
+   Attempt to find a basis for the candidate.
+
+   For CAND_REF, an alternative base may also be recorded and used
+   to find a basis.  This helps cases where the expression hidden
+   behind BASE (which is usually an SSA_NAME) has immediate offset,
+   e.g.
+
+     a2[i][j] = 1;
+     a2[i + 20][j] = 2;  */
 
 static slsr_cand_t
-alloc_cand_and_find_basis (enum cand_kind kind, gimple gs, tree base, 
+alloc_cand_and_find_basis (enum cand_kind kind, gimple gs, tree base,
 			   double_int index, tree stride, tree ctype,
 			   unsigned savings)
 {
@@ -583,7 +640,13 @@ alloc_cand_and_find_basis (enum cand_kind kind, gimple gs, tree base,
   else
     c->basis = find_basis_for_candidate (c);
 
-  record_potential_basis (c);
+  record_potential_basis (c, base);
+  if (kind == CAND_REF)
+    {
+      tree alt_base = get_alternative_base (base);
+      if (alt_base)
+	record_potential_basis (c, alt_base);
+    }
 
   return c;
 }
@@ -3524,6 +3587,9 @@ execute_strength_reduction (void)
   /* Allocate the mapping from base expressions to candidate chains.  */
   base_cand_map.create (500);
 
+  /* Allocate the mapping from bases to alternative bases.  */
+  alt_base_map = pointer_map_create ();
+
   /* Initialize the loop optimizer.  We need to detect flow across
      back edges, and this gives us dominator information as well.  */
   loop_optimizer_init (AVOID_CFG_MODIFICATIONS);
@@ -3539,6 +3605,9 @@ execute_strength_reduction (void)
       dump_cand_chains ();
     }
 
+  pointer_map_destroy (alt_base_map);
+  free_affine_expand_cache (&name_expansions);
+
   /* Analyze costs and make appropriate replacements.  */
   analyze_candidates_and_replace ();
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/slsr-41.c b/gcc/testsuite/gcc.dg/tree-ssa/slsr-41.c
new file mode 100644
index 0000000..870d714
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/slsr-41.c
@@ -0,0 +1,24 @@
+/* Verify straight-line strength reduction in using
+   alternative base expr to record and look for the
+   potential candidate.  */
+
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-slsr" } */
+
+typedef int arr_2[50][50];
+
+void foo (arr_2 a2, int v1)
+{
+  int i, j;
+
+  i = v1 + 5;
+  j = i;
+  a2 [i-10] [j] = 2;
+  a2 [i] [j++] = i;
+  a2 [i+20] [j++] = i;
+  a2 [i-3] [i-1] += 1;
+  return;
+}
+
+/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
+/* { dg-final { cleanup-tree-dump "slsr" } } */

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PING] [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-11-14  4:07             ` Yufeng Zhang
@ 2013-11-19 12:32               ` Yufeng Zhang
  2013-11-26 14:53                 ` [PING^2] " Yufeng Zhang
  2013-11-26 15:22               ` Richard Biener
  1 sibling, 1 reply; 34+ messages in thread
From: Yufeng Zhang @ 2013-11-19 12:32 UTC (permalink / raw)
  To: Richard Biener; +Cc: Bill Schmidt, gcc-patches

Hi Richard,

Can I get an approval or some feedback from you about the patch?

Regards,
Yufeng

On 11/13/13 23:25, Yufeng Zhang wrote:
> On 11/13/13 20:54, Bill Schmidt wrote:
>> Hi Yufeng,
>>
>> The second version of your original patch is ok with me with the
>> following changes.
>
> Thanks a lot for the review.  I've attached an updated patch with the
> suggested changes incorporated.
>
>> Everything else looks OK to me.  Please ask Richard for final approval,
>> as I'm not a maintainer.
>
> Hi Richard, would you be happy to OK the patch?
>
> Regards,
> Yufeng
>
> gcc/
>
> 	* gimple-ssa-strength-reduction.c: Include tree-affine.h.
> 	(name_expansions): New static variable.
> 	(alt_base_map): Ditto.
> 	(get_alternative_base): New function.
> 	(find_basis_for_candidate): For CAND_REF, optionally call
> 	find_basis_for_base_expr with the returned value from
> 	get_alternative_base.
> 	(record_potential_basis): Add new parameter 'base' of type 'tree';
> 	add an assertion of non-NULL base; use base to set node->base_expr.
> 	(alloc_cand_and_find_basis): Update; call record_potential_basis
> 	for CAND_REF with the returned value from get_alternative_base.
> 	(execute_strength_reduction): Call pointer_map_create for
> 	alt_base_map; call free_affine_expand_cache with &name_expansions.
>
> gcc/testsuite/
>
> 	* gcc.dg/tree-ssa/slsr-41.c: New test.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PING^2] [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-11-19 12:32               ` [PING] " Yufeng Zhang
@ 2013-11-26 14:53                 ` Yufeng Zhang
  0 siblings, 0 replies; 34+ messages in thread
From: Yufeng Zhang @ 2013-11-26 14:53 UTC (permalink / raw)
  To: Richard Biener; +Cc: Bill Schmidt, gcc-patches

Ping^2

The patch was posted here:

http://gcc.gnu.org/ml/gcc-patches/2013-11/msg01523.html

Thanks,
Yufeng

On 11/19/13 11:45, Yufeng Zhang wrote:
> Hi Richard,
>
> Can I get an approval or some feedback from you about the patch?
>
> Regards,
> Yufeng
>
> On 11/13/13 23:25, Yufeng Zhang wrote:
>> On 11/13/13 20:54, Bill Schmidt wrote:
>>> Hi Yufeng,
>>>
>>> The second version of your original patch is ok with me with the
>>> following changes.
>>
>> Thanks a lot for the review.  I've attached an updated patch with the
>> suggested changes incorporated.
>>
>>> Everything else looks OK to me.  Please ask Richard for final approval,
>>> as I'm not a maintainer.
>>
>> Hi Richard, would you be happy to OK the patch?
>>
>> Regards,
>> Yufeng
>>
>> gcc/
>>
>> 	* gimple-ssa-strength-reduction.c: Include tree-affine.h.
>> 	(name_expansions): New static variable.
>> 	(alt_base_map): Ditto.
>> 	(get_alternative_base): New function.
>> 	(find_basis_for_candidate): For CAND_REF, optionally call
>> 	find_basis_for_base_expr with the returned value from
>> 	get_alternative_base.
>> 	(record_potential_basis): Add new parameter 'base' of type 'tree';
>> 	add an assertion of non-NULL base; use base to set node->base_expr.
>> 	(alloc_cand_and_find_basis): Update; call record_potential_basis
>> 	for CAND_REF with the returned value from get_alternative_base.
>> 	(execute_strength_reduction): Call pointer_map_create for
>> 	alt_base_map; call free_affine_expand_cache with &name_expansions.
>>
>> gcc/testsuite/
>>
>> 	* gcc.dg/tree-ssa/slsr-41.c: New test.
>
>
>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-11-14  4:07             ` Yufeng Zhang
  2013-11-19 12:32               ` [PING] " Yufeng Zhang
@ 2013-11-26 15:22               ` Richard Biener
  2013-11-26 18:06                 ` Yufeng Zhang
  1 sibling, 1 reply; 34+ messages in thread
From: Richard Biener @ 2013-11-26 15:22 UTC (permalink / raw)
  To: Yufeng Zhang; +Cc: Bill Schmidt, gcc-patches

On Thu, Nov 14, 2013 at 12:25 AM, Yufeng Zhang <Yufeng.Zhang@arm.com> wrote:
> Hi Bill,
>
>
> On 11/13/13 20:54, Bill Schmidt wrote:
>>
>> Hi Yufeng,
>>
>> The second version of your original patch is ok with me with the
>> following changes.  Sorry for the little side adventure into the
>> next-interp logic; in the end that's going to hurt more than it helps in
>> this case.  Thanks for having a look at it, anyway.  Thanks also for
>> cleaning up this version to be less intrusive to common interfaces; I
>> appreciate it.
>
>
> Thanks a lot for the review.  I've attached an updated patch with the
> suggested changes incorporated.
>
> For the next-interp adventure, I was quite happy to do the experiment; it's
> a good chance of gaining insight into the pass.  Many thanks for your prompt
> replies and patience in guiding!
>
>
>> Everything else looks OK to me.  Please ask Richard for final approval,
>> as I'm not a maintainer.
>
>
> Hi Richard, would you be happy to OK the patch?

Hmm,

+static tree
+get_alternative_base (tree base)
+{
+  tree *result = (tree *) pointer_map_contains (alt_base_map, base);
+
+  if (result == NULL)
+    {
+      tree expr;
+      aff_tree aff;
+
+      tree_to_aff_combination_expand (base, TREE_TYPE (base),
+                                     &aff, &name_expansions);
+      aff.offset = tree_to_double_int (integer_zero_node);
+      expr = aff_combination_to_tree (&aff);
+
+      result = (tree *) pointer_map_insert (alt_base_map, base);
+      gcc_assert (!*result);

I believe this cache will never hit (unless you repeatedly ask for
the exact same statement?) - any non-trivial 'base' trees are
not shared and thus not pointer equivalent.
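
To illustrate the concern, here is a minimal standalone sketch.  It is
not GCC code: the node type, the tiny cache, and the helper names are
hypothetical.  It shows that a cache keyed on addresses misses for a
second, separately built expression even when the two are structurally
identical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature expression node; not GCC's tree.  */
struct node
{
  const char *op;  /* Operator or leaf name.  */
  const struct node *lhs;
  const struct node *rhs;
};

#define CACHE_SIZE 16

/* A tiny cache keyed on node address, in the spirit of a pointer map.  */
struct ptr_cache
{
  const struct node *keys[CACHE_SIZE];
  size_t used;
};

/* Return 1 if KEY (compared by address only) is already cached.  */
static int
cache_contains (const struct ptr_cache *cache, const struct node *key)
{
  for (size_t i = 0; i < cache->used; i++)
    if (cache->keys[i] == key)
      return 1;
  return 0;
}

/* Record KEY in the cache.  */
static void
cache_insert (struct ptr_cache *cache, const struct node *key)
{
  assert (cache->used < CACHE_SIZE);
  cache->keys[cache->used++] = key;
}

/* Return 1 if A and B have the same shape and the same leaves.  */
static int
structurally_equal (const struct node *a, const struct node *b)
{
  if (a == b)
    return 1;
  if (!a || !b)
    return 0;
  return strcmp (a->op, b->op) == 0
         && structurally_equal (a->lhs, b->lhs)
         && structurally_equal (a->rhs, b->rhs);
}
```

Inserting one "a2 + _3" node and then querying a second, separately
allocated "a2 + _3" node reports a miss; only a query with the very
same pointer hits.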

Also using tree_to_aff_combination_expand to get at - what
exactly? The address with any constant offset stripped?
Where do you re-construct that offset?  That is, aff.offset,
which you definitely need to get at a candidate?

+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-slsr" } */
+
+typedef int arr_2[50][50];
+
+void foo (arr_2 a2, int v1)
+{
+  int i, j;
+
+  i = v1 + 5;
+  j = i;
+  a2 [i-10] [j] = 2;
+  a2 [i] [j++] = i;
+  a2 [i+20] [j++] = i;
+  a2 [i-3] [i-1] += 1;
+  return;
+}
+
+/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
+/* { dg-final { cleanup-tree-dump "slsr" } } */

Scanning for 5 MEMs looks nonsensical.  What transform do
you expect?  I see other slsr testcases do similar nonsensical
checking, which is bad, too.

Richard.

> Regards,
>
> Yufeng
>
> gcc/
>
>         * gimple-ssa-strength-reduction.c: Include tree-affine.h.
>         (name_expansions): New static variable.
>         (alt_base_map): Ditto.
>         (get_alternative_base): New function.
>         (find_basis_for_candidate): For CAND_REF, optionally call
>         find_basis_for_base_expr with the returned value from
>         get_alternative_base.
>         (record_potential_basis): Add new parameter 'base' of type 'tree';
>         add an assertion of non-NULL base; use base to set node->base_expr.
>
>         (alloc_cand_and_find_basis): Update; call record_potential_basis
>         for CAND_REF with the returned value from get_alternative_base.
>         (execute_strength_reduction): Call pointer_map_create for
>         alt_base_map; call free_affine_expand_cache with &name_expansions.
>
> gcc/testsuite/
>
>         * gcc.dg/tree-ssa/slsr-41.c: New test.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-11-26 15:22               ` Richard Biener
@ 2013-11-26 18:06                 ` Yufeng Zhang
  2013-12-02 15:48                   ` [PING] " Yufeng Zhang
  0 siblings, 1 reply; 34+ messages in thread
From: Yufeng Zhang @ 2013-11-26 18:06 UTC (permalink / raw)
  To: Richard Biener; +Cc: Bill Schmidt, gcc-patches

On 11/26/13 12:45, Richard Biener wrote:
> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng Zhang <Yufeng.Zhang@arm.com> wrote:
>> Hi Bill,
>>
>>
>> On 11/13/13 20:54, Bill Schmidt wrote:
>>>
>>> Hi Yufeng,
>>>
>>> The second version of your original patch is ok with me with the
>>> following changes.  Sorry for the little side adventure into the
>>> next-interp logic; in the end that's going to hurt more than it helps in
>>> this case.  Thanks for having a look at it, anyway.  Thanks also for
>>> cleaning up this version to be less intrusive to common interfaces; I
>>> appreciate it.
>>
>>
>> Thanks a lot for the review.  I've attached an updated patch with the
>> suggested changes incorporated.
>>
>> For the next-interp adventure, I was quite happy to do the experiment; it's
>> a good chance of gaining insight into the pass.  Many thanks for your prompt
>> replies and patience in guiding!
>>
>>
>>> Everything else looks OK to me.  Please ask Richard for final approval,
>>> as I'm not a maintainer.
>>
>>
>> Hi Richard, would you be happy to OK the patch?
>
> Hmm,
>
> +static tree
> +get_alternative_base (tree base)
> +{
> +  tree *result = (tree *) pointer_map_contains (alt_base_map, base);
> +
> +  if (result == NULL)
> +    {
> +      tree expr;
> +      aff_tree aff;
> +
> +      tree_to_aff_combination_expand (base, TREE_TYPE (base),
> +				      &aff, &name_expansions);
> +      aff.offset = tree_to_double_int (integer_zero_node);
> +      expr = aff_combination_to_tree (&aff);
> +
> +      result = (tree *) pointer_map_insert (alt_base_map, base);
> +      gcc_assert (!*result);
>
> I believe this cache will never hit (unless you repeatedly ask for
> the exact same statement?) - any non-trivial 'base' trees are
> not shared and thus not pointer equivalent.

Yes, you are right that non-trivial 'base' trees are rarely shared.
The cache was introduced mainly because get_alternative_base () may be
called twice on the same 'base' tree: once in
find_basis_for_candidate () for look-up and once in
alloc_cand_and_find_basis () for record_potential_basis ().  I'm happy
to leave out the cache if you think the benefit is trivial.

> Also using tree_to_aff_combination_expand to get at - what
> exactly? The address with any constant offset stripped?
> Where do you re-construct that offset?  That is, aff.offset,
> which you definitely need to get at a candidate?

As explained in the previous RFC emails, the expanded and 
constant-offset-stripped base expr is only used for the purpose of basis 
look-up.  The corresponding candidate still has the unexpanded base expr 
as its 'base_expr', therefore the info on the constant offset is not 
lost and doesn't need to be re-constructed.
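
As a rough illustration of that point, here is a toy model in plain C.
It is not GCC code: toy_aff, toy_strip_offset and toy_same_key are names
invented for this sketch, and it assumes an address of the simple form
sym + coef * var + offset.  Zeroing the offset on a copy gives two
references the same look-up key, while each original keeps its own
offset:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for an affine address: SYM + COEF * VAR + OFFSET.
   Hypothetical type, only loosely modelled on an affine combination.  */
struct toy_aff
{
  int sym;      /* Identifies the pointer, e.g. a2.  */
  int var;      /* Identifies the variable, e.g. i.  */
  long coef;    /* Scale applied to VAR, e.g. 80 bytes per row.  */
  long offset;  /* Constant part, e.g. 800.  */
};

/* Return a copy of A with the constant offset zeroed.  The copy is
   meant to be used as a look-up key only; A itself is left alone.  */
static struct toy_aff
toy_strip_offset (struct toy_aff a)
{
  a.offset = 0;
  return a;
}

/* Return true if A and B are identical in every field.  */
static bool
toy_same_key (struct toy_aff a, struct toy_aff b)
{
  return (a.sym == b.sym && a.var == b.var
	  && a.coef == b.coef && a.offset == b.offset);
}
```

With r1 = { 1, 2, 80, 0 } and r2 = { 1, 2, 80, 800 }, the two values
compare unequal as they are, but their stripped copies compare equal,
and r2.offset is still 800 afterwards.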

> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-slsr" } */
> +
> +typedef int arr_2[50][50];
> +
> +void foo (arr_2 a2, int v1)
> +{
> +  int i, j;
> +
> +  i = v1 + 5;
> +  j = i;
> +  a2 [i-10] [j] = 2;
> +  a2 [i] [j++] = i;
> +  a2 [i+20] [j++] = i;
> +  a2 [i-3] [i-1] += 1;
> +  return;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
> +/* { dg-final { cleanup-tree-dump "slsr" } } */
>
> scanning for 5 MEMs looks non-sensical.  What transform do
> you expect?  I see other slsr testcases do similar non-sensical
> checking which is bad, too.

As slsr optimizes CAND_REF candidates simply by lowering them from
e.g. ARRAY_REF to MEM_REF, I think scanning for the number of MEM_REFs
is an effective check.  Alternatively, I can add a follow-up patch that
adds a dumping facility to replace_ref () to print out the replacements
when -fdump-tree-slsr-details is on.

I hope these can address your concerns.


Regards,
Yufeng



>
> Richard.
>
>> Regards,
>>
>> Yufeng
>>
>> gcc/
>>
>>          * gimple-ssa-strength-reduction.c: Include tree-affine.h.
>>          (name_expansions): New static variable.
>>          (alt_base_map): Ditto.
>>          (get_alternative_base): New function.
>>          (find_basis_for_candidate): For CAND_REF, optionally call
>>          find_basis_for_base_expr with the returned value from
>>          get_alternative_base.
>>          (record_potential_basis): Add new parameter 'base' of type 'tree';
>>          add an assertion of non-NULL base; use base to set node->base_expr.
>>
>>          (alloc_cand_and_find_basis): Update; call record_potential_basis
>>          for CAND_REF with the returned value from get_alternative_base.
>>          (execute_strength_reduction): Call pointer_map_create for
>>          alt_base_map; call free_affine_expand_cache with &name_expansions.
>>
>> gcc/testsuite/
>>
>>          * gcc.dg/tree-ssa/slsr-41.c: New test.
>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PING] [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-11-26 18:06                 ` Yufeng Zhang
@ 2013-12-02 15:48                   ` Yufeng Zhang
  2013-12-03  6:50                     ` Jeff Law
  0 siblings, 1 reply; 34+ messages in thread
From: Yufeng Zhang @ 2013-12-02 15:48 UTC (permalink / raw)
  To: Richard Biener; +Cc: Bill Schmidt, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 5067 bytes --]

Ping~

http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html

Thanks,
Yufeng

On 11/26/13 15:02, Yufeng Zhang wrote:
> On 11/26/13 12:45, Richard Biener wrote:
>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng Zhang<Yufeng.Zhang@arm.com>   wrote:
>>> On 11/13/13 20:54, Bill Schmidt wrote:
>>>> The second version of your original patch is ok with me with the
>>>> following changes.  Sorry for the little side adventure into the
>>>> next-interp logic; in the end that's going to hurt more than it helps in
>>>> this case.  Thanks for having a look at it, anyway.  Thanks also for
>>>> cleaning up this version to be less intrusive to common interfaces; I
>>>> appreciate it.
>>>
>>>
>>> Thanks a lot for the review.  I've attached an updated patch with the
>>> suggested changes incorporated.
>>>
>>> For the next-interp adventure, I was quite happy to do the experiment; it's
>>> a good chance of gaining insight into the pass.  Many thanks for your prompt
>>> replies and patience in guiding!
>>>
>>>
>>>> Everything else looks OK to me.  Please ask Richard for final approval,
>>>> as I'm not a maintainer.
>>>
>>>
>>> Hi Richard, would you be happy to OK the patch?
>>
>> Hmm,
>>
>> +static tree
>> +get_alternative_base (tree base)
>> +{
>> +  tree *result = (tree *) pointer_map_contains (alt_base_map, base);
>> +
>> +  if (result == NULL)
>> +    {
>> +      tree expr;
>> +      aff_tree aff;
>> +
>> +      tree_to_aff_combination_expand (base, TREE_TYPE (base),
>> +				      &aff, &name_expansions);
>> +      aff.offset = tree_to_double_int (integer_zero_node);
>> +      expr = aff_combination_to_tree (&aff);
>> +
>> +      result = (tree *) pointer_map_insert (alt_base_map, base);
>> +      gcc_assert (!*result);
>>
>> I believe this cache will never hit (unless you repeatedly ask for
>> the exact same statement?) - any non-trivial 'base' trees are
>> not shared and thus not pointer equivalent.
>
> Yes, you are right that non-trivial 'base' trees are rarely shared.
> The cache was introduced mainly because get_alternative_base () may be
> called twice on the same 'base' tree: once in
> find_basis_for_candidate () for look-up and once in
> alloc_cand_and_find_basis () for record_potential_basis ().  I'm happy
> to leave out the cache if you think the benefit is trivial.
>
>> Also using tree_to_aff_combination_expand to get at - what
>> exactly? The address with any constant offset stripped?
>> Where do you re-construct that offset?  That is, aff.offset,
>> which you definitely need to get at a candidate?
>
> As explained in the previous RFC emails, the expanded and
> constant-offset-stripped base expr is only used for the purpose of basis
> look-up.  The corresponding candidate still has the unexpanded base expr
> as its 'base_expr', therefore the info on the constant offset is not
> lost and doesn't need to be re-constructed.
>
>> +/* { dg-do compile } */
>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
>> +
>> +typedef int arr_2[50][50];
>> +
>> +void foo (arr_2 a2, int v1)
>> +{
>> +  int i, j;
>> +
>> +  i = v1 + 5;
>> +  j = i;
>> +  a2 [i-10] [j] = 2;
>> +  a2 [i] [j++] = i;
>> +  a2 [i+20] [j++] = i;
>> +  a2 [i-3] [i-1] += 1;
>> +  return;
>> +}
>> +
>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
>>
>> scanning for 5 MEMs looks non-sensical.  What transform do
>> you expect?  I see other slsr testcases do similar non-sensical
>> checking which is bad, too.
>
> As the slsr optimizes CAND_REF candidates by simply lowering them to
> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of MEM_REFs
> is an effective check.  Alternatively, I can add a follow-up patch to
> add some dumping facility in replace_ref () to print out the replacing
> actions when -fdump-tree-slsr-details is on.
>
> I hope these can address your concerns.
>
>
> Regards,
> Yufeng
>
>
>
>>
>> Richard.
>>
>>> Regards,
>>>
>>> Yufeng
>>>
>>> gcc/
>>>
>>>           * gimple-ssa-strength-reduction.c: Include tree-affine.h.
>>>           (name_expansions): New static variable.
>>>           (alt_base_map): Ditto.
>>>           (get_alternative_base): New function.
>>>           (find_basis_for_candidate): For CAND_REF, optionally call
>>>           find_basis_for_base_expr with the returned value from
>>>           get_alternative_base.
>>>           (record_potential_basis): Add new parameter 'base' of type 'tree';
>>>           add an assertion of non-NULL base; use base to set node->base_expr.
>>>
>>>           (alloc_cand_and_find_basis): Update; call record_potential_basis
>>>           for CAND_REF with the returned value from get_alternative_base.
>>>           (execute_strength_reduction): Call pointer_map_create for
>>>           alt_base_map; call free_affine_expand_cache with &name_expansions.
>>>
>>> gcc/testsuite/
>>>
>>>           * gcc.dg/tree-ssa/slsr-41.c: New test.
>>
>
>
>

[-- Attachment #2: patch-use-affine-v3.txt --]
[-- Type: text/plain, Size: 6376 bytes --]

diff --git a/gcc/gimple-ssa-strength-reduction.c b/gcc/gimple-ssa-strength-reduction.c
index 88afc91..26502c3 100644
--- a/gcc/gimple-ssa-strength-reduction.c
+++ b/gcc/gimple-ssa-strength-reduction.c
@@ -53,6 +53,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "params.h"
 #include "hash-table.h"
 #include "tree-ssa-address.h"
+#include "tree-affine.h"
 \f
 /* Information about a strength reduction candidate.  Each statement
    in the candidate table represents an expression of one of the
@@ -420,6 +421,42 @@ cand_chain_hasher::equal (const value_type *chain1, const compare_type *chain2)
 /* Hash table embodying a mapping from base exprs to chains of candidates.  */
 static hash_table <cand_chain_hasher> base_cand_map;
 \f
+/* Pointer map used by tree_to_aff_combination_expand.  */
+static struct pointer_map_t *name_expansions;
+/* Pointer map embodying a mapping from bases to alternative bases.  */
+static struct pointer_map_t *alt_base_map;
+
+/* Given BASE, use the tree affine combination facilities to
+   find the underlying tree expression for BASE, with any
+   immediate offset excluded.  */
+
+static tree
+get_alternative_base (tree base)
+{
+  tree *result = (tree *) pointer_map_contains (alt_base_map, base);
+
+  if (result == NULL)
+    {
+      tree expr;
+      aff_tree aff;
+
+      tree_to_aff_combination_expand (base, TREE_TYPE (base),
+				      &aff, &name_expansions);
+      aff.offset = tree_to_double_int (integer_zero_node);
+      expr = aff_combination_to_tree (&aff);
+
+      result = (tree *) pointer_map_insert (alt_base_map, base);
+      gcc_assert (!*result);
+
+      if (expr == base)
+	*result = NULL;
+      else
+	*result = expr;
+    }
+
+  return *result;
+}
+
 /* Look in the candidate table for a CAND_PHI that defines BASE and
    return it if found; otherwise return NULL.  */
 
@@ -440,8 +477,9 @@ find_phi_def (tree base)
 }
 
 /* Helper routine for find_basis_for_candidate.  May be called twice:
-   once for the candidate's base expr, and optionally again for the
-   candidate's phi definition.  */
+   once for the candidate's base expr, and optionally again either for
+   the candidate's phi definition or for a CAND_REF's alternative base
+   expression.  */
 
 static slsr_cand_t
 find_basis_for_base_expr (slsr_cand_t c, tree base_expr)
@@ -518,6 +556,13 @@ find_basis_for_candidate (slsr_cand_t c)
 	}
     }
 
+  if (!basis && c->kind == CAND_REF)
+    {
+      tree alt_base_expr = get_alternative_base (c->base_expr);
+      if (alt_base_expr)
+	basis = find_basis_for_base_expr (c, alt_base_expr);
+    }
+
   if (basis)
     {
       c->sibling = basis->dependent;
@@ -528,17 +573,21 @@ find_basis_for_candidate (slsr_cand_t c)
   return 0;
 }
 
-/* Record a mapping from the base expression of C to C itself, indicating that
-   C may potentially serve as a basis using that base expression.  */
+/* Record a mapping from BASE to C, indicating that C may potentially serve
+   as a basis using that base expression.  BASE may be the same as
+   C->BASE_EXPR; alternatively BASE can be a different tree that shares the
+   underlying expression of C->BASE_EXPR.  */
 
 static void
-record_potential_basis (slsr_cand_t c)
+record_potential_basis (slsr_cand_t c, tree base)
 {
   cand_chain_t node;
   cand_chain **slot;
 
+  gcc_assert (base);
+
   node = (cand_chain_t) obstack_alloc (&chain_obstack, sizeof (cand_chain));
-  node->base_expr = c->base_expr;
+  node->base_expr = base;
   node->cand = c;
   node->next = NULL;
   slot = base_cand_map.find_slot (node, INSERT);
@@ -554,10 +603,18 @@ record_potential_basis (slsr_cand_t c)
 }
 
 /* Allocate storage for a new candidate and initialize its fields.
-   Attempt to find a basis for the candidate.  */
+   Attempt to find a basis for the candidate.
+
+   For CAND_REF, an alternative base may also be recorded and used
+   to find a basis.  This helps cases where the expression hidden
+   behind BASE (which is usually an SSA_NAME) has immediate offset,
+   e.g.
+
+     a2[i][j] = 1;
+     a2[i + 20][j] = 2;  */
 
 static slsr_cand_t
-alloc_cand_and_find_basis (enum cand_kind kind, gimple gs, tree base, 
+alloc_cand_and_find_basis (enum cand_kind kind, gimple gs, tree base,
 			   double_int index, tree stride, tree ctype,
 			   unsigned savings)
 {
@@ -583,7 +640,13 @@ alloc_cand_and_find_basis (enum cand_kind kind, gimple gs, tree base,
   else
     c->basis = find_basis_for_candidate (c);
 
-  record_potential_basis (c);
+  record_potential_basis (c, base);
+  if (kind == CAND_REF)
+    {
+      tree alt_base = get_alternative_base (base);
+      if (alt_base)
+	record_potential_basis (c, alt_base);
+    }
 
   return c;
 }
@@ -3524,6 +3587,9 @@ execute_strength_reduction (void)
   /* Allocate the mapping from base expressions to candidate chains.  */
   base_cand_map.create (500);
 
+  /* Allocate the mapping from bases to alternative bases.  */
+  alt_base_map = pointer_map_create ();
+
   /* Initialize the loop optimizer.  We need to detect flow across
      back edges, and this gives us dominator information as well.  */
   loop_optimizer_init (AVOID_CFG_MODIFICATIONS);
@@ -3539,6 +3605,9 @@ execute_strength_reduction (void)
       dump_cand_chains ();
     }
 
+  pointer_map_destroy (alt_base_map);
+  free_affine_expand_cache (&name_expansions);
+
   /* Analyze costs and make appropriate replacements.  */
   analyze_candidates_and_replace ();
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/slsr-41.c b/gcc/testsuite/gcc.dg/tree-ssa/slsr-41.c
new file mode 100644
index 0000000..870d714
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/slsr-41.c
@@ -0,0 +1,24 @@
+/* Verify straight-line strength reduction in using
+   alternative base expr to record and look for the
+   potential candidate.  */
+
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-slsr" } */
+
+typedef int arr_2[50][50];
+
+void foo (arr_2 a2, int v1)
+{
+  int i, j;
+
+  i = v1 + 5;
+  j = i;
+  a2 [i-10] [j] = 2;
+  a2 [i] [j++] = i;
+  a2 [i+20] [j++] = i;
+  a2 [i-3] [i-1] += 1;
+  return;
+}
+
+/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
+/* { dg-final { cleanup-tree-dump "slsr" } } */

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PING] [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-12-02 15:48                   ` [PING] " Yufeng Zhang
@ 2013-12-03  6:50                     ` Jeff Law
  2013-12-03 12:51                       ` Yufeng Zhang
  0 siblings, 1 reply; 34+ messages in thread
From: Jeff Law @ 2013-12-03  6:50 UTC (permalink / raw)
  To: Yufeng Zhang, Richard Biener; +Cc: Bill Schmidt, gcc-patches

On 12/02/13 08:47, Yufeng Zhang wrote:
> Ping~
>
> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html

>
> Thanks,
> Yufeng
>
> On 11/26/13 15:02, Yufeng Zhang wrote:
>> On 11/26/13 12:45, Richard Biener wrote:
>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
>>> Zhang<Yufeng.Zhang@arm.com>   wrote:
>>>> On 11/13/13 20:54, Bill Schmidt wrote:
>>>>> The second version of your original patch is ok with me with the
>>>>> following changes.  Sorry for the little side adventure into the
>>>>> next-interp logic; in the end that's going to hurt more than it
>>>>> helps in
>>>>> this case.  Thanks for having a look at it, anyway.  Thanks also for
>>>>> cleaning up this version to be less intrusive to common interfaces; I
>>>>> appreciate it.
>>>>
>>>>
>>>> Thanks a lot for the review.  I've attached an updated patch with the
>>>> suggested changes incorporated.
>>>>
>>>> For the next-interp adventure, I was quite happy to do the
>>>> experiment; it's
>>>> a good chance of gaining insight into the pass.  Many thanks for
>>>> your prompt
>>>> replies and patience in guiding!
>>>>
>>>>
>>>>> Everything else looks OK to me.  Please ask Richard for final
>>>>> approval,
>>>>> as I'm not a maintainer.
First a note, I need to check on voting for Bill as the slsr maintainer 
from the steering committee.   Voting was in progress just before the 
close of stage1 development so I haven't tallied the results :-)

>>
>> Yes, you are right that non-trivial 'base' trees are rarely shared.
>> The cache was introduced mainly because get_alternative_base () may be
>> called twice on the same 'base' tree: once in
>> find_basis_for_candidate () for look-up and once in
>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm happy
>> to leave out the cache if you think the benefit is trivial.
Without some sense of how expensive the lookups are vs how often the 
cache hits it's awful hard to know if the cache is worth it.

I'd say take it out unless you have some sense it's really saving time. 
  It's a pretty minor implementation detail either way.


>>
>>> +/* { dg-do compile } */
>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
>>> +
>>> +typedef int arr_2[50][50];
>>> +
>>> +void foo (arr_2 a2, int v1)
>>> +{
>>> +  int i, j;
>>> +
>>> +  i = v1 + 5;
>>> +  j = i;
>>> +  a2 [i-10] [j] = 2;
>>> +  a2 [i] [j++] = i;
>>> +  a2 [i+20] [j++] = i;
>>> +  a2 [i-3] [i-1] += 1;
>>> +  return;
>>> +}
>>> +
>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
>>>
>>> scanning for 5 MEMs looks non-sensical.  What transform do
>>> you expect?  I see other slsr testcases do similar non-sensical
>>> checking which is bad, too.
>>
>> As the slsr optimizes CAND_REF candidates by simply lowering them to
>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of MEM_REFs
>> is an effective check.  Alternatively, I can add a follow-up patch to
>> add some dumping facility in replace_ref () to print out the replacing
>> actions when -fdump-tree-slsr-details is on.
I think adding some details to the dump and scanning for them would be 
better.  That's the only change that is required for this to move forward.

I suggest doing it quickly.  We're well past stage1 close at this point.

jeff

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PING] [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-12-03  6:50                     ` Jeff Law
@ 2013-12-03 12:51                       ` Yufeng Zhang
  2013-12-03 14:21                         ` Richard Biener
  0 siblings, 1 reply; 34+ messages in thread
From: Yufeng Zhang @ 2013-12-03 12:51 UTC (permalink / raw)
  To: Jeff Law; +Cc: Richard Biener, Bill Schmidt, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 5151 bytes --]

On 12/03/13 06:48, Jeff Law wrote:
> On 12/02/13 08:47, Yufeng Zhang wrote:
>> Ping~
>>
>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
>
>>
>> Thanks,
>> Yufeng
>>
>> On 11/26/13 15:02, Yufeng Zhang wrote:
>>> On 11/26/13 12:45, Richard Biener wrote:
>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
>>>> Zhang<Yufeng.Zhang@arm.com>    wrote:
>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
>>>>>> The second version of your original patch is ok with me with the
>>>>>> following changes.  Sorry for the little side adventure into the
>>>>>> next-interp logic; in the end that's going to hurt more than it
>>>>>> helps in
>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks also for
>>>>>> cleaning up this version to be less intrusive to common interfaces; I
>>>>>> appreciate it.
>>>>>
>>>>>
>>>>> Thanks a lot for the review.  I've attached an updated patch with the
>>>>> suggested changes incorporated.
>>>>>
>>>>> For the next-interp adventure, I was quite happy to do the
>>>>> experiment; it's
>>>>> a good chance of gaining insight into the pass.  Many thanks for
>>>>> your prompt
>>>>> replies and patience in guiding!
>>>>>
>>>>>
>>>>>> Everything else looks OK to me.  Please ask Richard for final
>>>>>> approval,
>>>>>> as I'm not a maintainer.
> First a note, I need to check on voting for Bill as the slsr maintainer
> from the steering committee.   Voting was in progress just before the
> close of stage1 development so I haven't tallied the results :-)

Looking forward to some good news! :)

>>>
>>> Yes, you are right that non-trivial 'base' trees are rarely shared.
>>> The cache was introduced mainly because get_alternative_base () may be
>>> called twice on the same 'base' tree: once in
>>> find_basis_for_candidate () for look-up and once in
>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm happy
>>> to leave out the cache if you think the benefit is trivial.
> Without some sense of how expensive the lookups are vs how often the
> cache hits it's awful hard to know if the cache is worth it.
>
> I'd say take it out unless you have some sense it's really saving time.
>    It's a pretty minor implementation detail either way.

I think the affine tree routines are generally expensive, so it is worth
having a cache to avoid calling them too many times.  I ran the slsr-*.c
tests under gcc.dg/tree-ssa/ and found that the cache hit rates range
from 55.6% to 90%, with an average of 73.5%.  The samples may not
represent real-world scenarios well, but they do show that the 'base'
tree can be shared to some extent.  So I'd like to keep the cache in
the patch.
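
For reference, the kind of measurement behind those numbers can be
sketched roughly as follows.  This is a toy pointer-keyed memo in plain
C, not the pointer_map code in the patch; toy_cache, toy_expensive and
toy_lookup are hypothetical names, and the doubling merely stands in for
the affine expansion:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CACHE_SIZE 64

/* A tiny memo keyed on pointer identity, with hit/miss counters.  */
struct toy_cache
{
  const int *keys[TOY_CACHE_SIZE];
  long values[TOY_CACHE_SIZE];
  size_t used;
  unsigned hits;
  unsigned misses;
};

/* Stand-in for an expensive computation on *KEY.  */
static long
toy_expensive (const int *key)
{
  return 2L * *key;
}

/* Return the value for KEY, comparing keys by address rather than by
   content.  Count a hit when KEY was seen before and a miss otherwise;
   once the table is full, new keys are computed but not stored.  */
static long
toy_lookup (struct toy_cache *c, const int *key)
{
  size_t i;
  long value;

  assert (c != NULL && key != NULL);

  for (i = 0; i < c->used; i++)
    if (c->keys[i] == key)
      {
	c->hits++;
	return c->values[i];
      }

  c->misses++;
  value = toy_expensive (key);
  if (c->used < TOY_CACHE_SIZE)
    {
      c->keys[c->used] = key;
      c->values[c->used] = value;
      c->used++;
    }
  return value;
}
```

Two look-ups with the same pointer count as one miss and one hit, while
an equal value stored at a different address is another miss, which is
the pointer-equivalence concern raised in the review.  The hit rate is
then hits / (hits + misses).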

>
>>>
>>>> +/* { dg-do compile } */
>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
>>>> +
>>>> +typedef int arr_2[50][50];
>>>> +
>>>> +void foo (arr_2 a2, int v1)
>>>> +{
>>>> +  int i, j;
>>>> +
>>>> +  i = v1 + 5;
>>>> +  j = i;
>>>> +  a2 [i-10] [j] = 2;
>>>> +  a2 [i] [j++] = i;
>>>> +  a2 [i+20] [j++] = i;
>>>> +  a2 [i-3] [i-1] += 1;
>>>> +  return;
>>>> +}
>>>> +
>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
>>>>
>>>> scanning for 5 MEMs looks non-sensical.  What transform do
>>>> you expect?  I see other slsr testcases do similar non-sensical
>>>> checking which is bad, too.
>>>
>>> As the slsr optimizes CAND_REF candidates by simply lowering them to
>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of MEM_REFs
>>> is an effective check.  Alternatively, I can add a follow-up patch to
>>> add some dumping facility in replace_ref () to print out the replacing
>>> actions when -fdump-tree-slsr-details is on.
> I think adding some details to the dump and scanning for them would be
> better.  That's the only change that is required for this to move forward.

I've updated the patch to dump more details when -fdump-tree-slsr-details
is on.  The tests have also been updated to scan for these new dumps
instead of MEMs.

>
> I suggest doing it quickly.  We're well past stage1 close at this point.

The bootstrapping on x86_64 is still running.  OK to commit if it succeeds?

Thanks,
Yufeng

gcc/

	* gimple-ssa-strength-reduction.c: Include tree-affine.h.
	(name_expansions): New static variable.
	(alt_base_map): Ditto.
	(get_alternative_base): New function.
	(find_basis_for_candidate): For CAND_REF, optionally call
	find_basis_for_base_expr with the returned value from
	get_alternative_base.
	(record_potential_basis): Add new parameter 'base' of type 'tree';
	add an assertion of non-NULL base; use base to set node->base_expr.
	(alloc_cand_and_find_basis): Update; call record_potential_basis
	for CAND_REF with the returned value from get_alternative_base.
	(replace_refs): Dump details on the replacing.
	(execute_strength_reduction): Call pointer_map_create for
	alt_base_map; call free_affine_expand_cache with &name_expansions.

gcc/testsuite/

	* gcc.dg/tree-ssa/slsr-39.c: Update.
	* gcc.dg/tree-ssa/slsr-41.c: New test.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: use-affine-v4.patch --]
[-- Type: text/x-patch; name=use-affine-v4.patch, Size: 7856 bytes --]

diff --git a/gcc/gimple-ssa-strength-reduction.c b/gcc/gimple-ssa-strength-reduction.c
index 88afc91..bf3362f 100644
--- a/gcc/gimple-ssa-strength-reduction.c
+++ b/gcc/gimple-ssa-strength-reduction.c
@@ -53,6 +53,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "params.h"
 #include "hash-table.h"
 #include "tree-ssa-address.h"
+#include "tree-affine.h"
 \f
 /* Information about a strength reduction candidate.  Each statement
    in the candidate table represents an expression of one of the
@@ -420,6 +421,42 @@ cand_chain_hasher::equal (const value_type *chain1, const compare_type *chain2)
 /* Hash table embodying a mapping from base exprs to chains of candidates.  */
 static hash_table <cand_chain_hasher> base_cand_map;
 \f
+/* Pointer map used by tree_to_aff_combination_expand.  */
+static struct pointer_map_t *name_expansions;
+/* Pointer map embodying a mapping from bases to alternative bases.  */
+static struct pointer_map_t *alt_base_map;
+
+/* Given BASE, use the tree affine combination facilities to
+   find the underlying tree expression for BASE, with any
+   immediate offset excluded.  */
+
+static tree
+get_alternative_base (tree base)
+{
+  tree *result = (tree *) pointer_map_contains (alt_base_map, base);
+
+  if (result == NULL)
+    {
+      tree expr;
+      aff_tree aff;
+
+      tree_to_aff_combination_expand (base, TREE_TYPE (base),
+				      &aff, &name_expansions);
+      aff.offset = tree_to_double_int (integer_zero_node);
+      expr = aff_combination_to_tree (&aff);
+
+      result = (tree *) pointer_map_insert (alt_base_map, base);
+      gcc_assert (!*result);
+
+      if (expr == base)
+	*result = NULL;
+      else
+	*result = expr;
+    }
+
+  return *result;
+}
+
 /* Look in the candidate table for a CAND_PHI that defines BASE and
    return it if found; otherwise return NULL.  */
 
@@ -440,8 +477,9 @@ find_phi_def (tree base)
 }
 
 /* Helper routine for find_basis_for_candidate.  May be called twice:
-   once for the candidate's base expr, and optionally again for the
-   candidate's phi definition.  */
+   once for the candidate's base expr, and optionally again either for
+   the candidate's phi definition or for a CAND_REF's alternative base
+   expression.  */
 
 static slsr_cand_t
 find_basis_for_base_expr (slsr_cand_t c, tree base_expr)
@@ -518,6 +556,13 @@ find_basis_for_candidate (slsr_cand_t c)
 	}
     }
 
+  if (!basis && c->kind == CAND_REF)
+    {
+      tree alt_base_expr = get_alternative_base (c->base_expr);
+      if (alt_base_expr)
+	basis = find_basis_for_base_expr (c, alt_base_expr);
+    }
+
   if (basis)
     {
       c->sibling = basis->dependent;
@@ -528,17 +573,21 @@ find_basis_for_candidate (slsr_cand_t c)
   return 0;
 }
 
-/* Record a mapping from the base expression of C to C itself, indicating that
-   C may potentially serve as a basis using that base expression.  */
+/* Record a mapping from BASE to C, indicating that C may potentially serve
+   as a basis using that base expression.  BASE may be the same as
+   C->BASE_EXPR; alternatively BASE can be a different tree that shares the
+   underlying expression of C->BASE_EXPR.  */
 
 static void
-record_potential_basis (slsr_cand_t c)
+record_potential_basis (slsr_cand_t c, tree base)
 {
   cand_chain_t node;
   cand_chain **slot;
 
+  gcc_assert (base);
+
   node = (cand_chain_t) obstack_alloc (&chain_obstack, sizeof (cand_chain));
-  node->base_expr = c->base_expr;
+  node->base_expr = base;
   node->cand = c;
   node->next = NULL;
   slot = base_cand_map.find_slot (node, INSERT);
@@ -554,10 +603,18 @@ record_potential_basis (slsr_cand_t c)
 }
 
 /* Allocate storage for a new candidate and initialize its fields.
-   Attempt to find a basis for the candidate.  */
+   Attempt to find a basis for the candidate.
+
+   For CAND_REF, an alternative base may also be recorded and used
+   to find a basis.  This helps cases where the expression hidden
+   behind BASE (which is usually an SSA_NAME) has immediate offset,
+   e.g.
+
+     a2[i][j] = 1;
+     a2[i + 20][j] = 2;  */
 
 static slsr_cand_t
-alloc_cand_and_find_basis (enum cand_kind kind, gimple gs, tree base, 
+alloc_cand_and_find_basis (enum cand_kind kind, gimple gs, tree base,
 			   double_int index, tree stride, tree ctype,
 			   unsigned savings)
 {
@@ -583,7 +640,13 @@ alloc_cand_and_find_basis (enum cand_kind kind, gimple gs, tree base,
   else
     c->basis = find_basis_for_candidate (c);
 
-  record_potential_basis (c);
+  record_potential_basis (c, base);
+  if (kind == CAND_REF)
+    {
+      tree alt_base = get_alternative_base (base);
+      if (alt_base)
+	record_potential_basis (c, alt_base);
+    }
 
   return c;
 }
@@ -1843,6 +1906,12 @@ replace_ref (tree *expr, slsr_cand_t c)
 static void
 replace_refs (slsr_cand_t c)
 {
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fputs ("Replacing reference: ", dump_file);
+      print_gimple_stmt (dump_file, c->cand_stmt, 0, 0);
+    }
+
   if (gimple_vdef (c->cand_stmt))
     {
       tree *lhs = gimple_assign_lhs_ptr (c->cand_stmt);
@@ -1854,6 +1923,13 @@ replace_refs (slsr_cand_t c)
       replace_ref (rhs, c);
     }
 
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fputs ("With: ", dump_file);
+      print_gimple_stmt (dump_file, c->cand_stmt, 0, 0);
+      fputs ("\n", dump_file);
+    }
+
   if (c->sibling)
     replace_refs (lookup_cand (c->sibling));
 
@@ -3524,6 +3600,9 @@ execute_strength_reduction (void)
   /* Allocate the mapping from base expressions to candidate chains.  */
   base_cand_map.create (500);
 
+  /* Allocate the mapping from bases to alternative bases.  */
+  alt_base_map = pointer_map_create ();
+
   /* Initialize the loop optimizer.  We need to detect flow across
      back edges, and this gives us dominator information as well.  */
   loop_optimizer_init (AVOID_CFG_MODIFICATIONS);
@@ -3539,6 +3618,9 @@ execute_strength_reduction (void)
       dump_cand_chains ();
     }
 
+  pointer_map_destroy (alt_base_map);
+  free_affine_expand_cache (&name_expansions);
+
   /* Analyze costs and make appropriate replacements.  */
   analyze_candidates_and_replace ();
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/slsr-39.c b/gcc/testsuite/gcc.dg/tree-ssa/slsr-39.c
index 8cc2798..c146219 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/slsr-39.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/slsr-39.c
@@ -6,7 +6,7 @@
     *PINDEX:   C1 + (C2 * C3) + C4  */
 
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-slsr" } */
+/* { dg-options "-O2 -fdump-tree-slsr-details" } */
 
 typedef int arr_2[50][50];
 
@@ -22,5 +22,5 @@ void foo (arr_2 a2, int v1)
   return;
 }
 
-/* { dg-final { scan-tree-dump-times "MEM" 4 "slsr" } } */
+/* { dg-final { scan-tree-dump-times "Replacing reference: " 4 "slsr" } } */
 /* { dg-final { cleanup-tree-dump "slsr" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/slsr-41.c b/gcc/testsuite/gcc.dg/tree-ssa/slsr-41.c
new file mode 100644
index 0000000..2c9d908
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/slsr-41.c
@@ -0,0 +1,24 @@
+/* Verify straight-line strength reduction in using
+   alternative base expr to record and look for the
+   potential candidate.  */
+
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-slsr-details" } */
+
+typedef int arr_2[50][50];
+
+void foo (arr_2 a2, int v1)
+{
+  int i, j;
+
+  i = v1 + 5;
+  j = i;
+  a2 [i-10] [j] = 2;
+  a2 [i] [j++] = i;
+  a2 [i+20] [j++] = i;
+  a2 [i-3] [i-1] += 1;
+  return;
+}
+
+/* { dg-final { scan-tree-dump-times "Replacing reference: " 5 "slsr" } } */
+/* { dg-final { cleanup-tree-dump "slsr" } } */


* Re: [PING] [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-12-03 12:51                       ` Yufeng Zhang
@ 2013-12-03 14:21                         ` Richard Biener
  2013-12-03 15:52                           ` Yufeng Zhang
  0 siblings, 1 reply; 34+ messages in thread
From: Richard Biener @ 2013-12-03 14:21 UTC (permalink / raw)
  To: Yufeng Zhang; +Cc: Jeff Law, Bill Schmidt, gcc-patches

On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang <Yufeng.Zhang@arm.com> wrote:
> On 12/03/13 06:48, Jeff Law wrote:
>>
>> On 12/02/13 08:47, Yufeng Zhang wrote:
>>>
>>> Ping~
>>>
>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
>>
>>
>>>
>>> Thanks,
>>> Yufeng
>>>
>>> On 11/26/13 15:02, Yufeng Zhang wrote:
>>>>
>>>> On 11/26/13 12:45, Richard Biener wrote:
>>>>>
>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
>>>>> Zhang<Yufeng.Zhang@arm.com>    wrote:
>>>>>>
>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
>>>>>>>
>>>>>>> The second version of your original patch is ok with me with the
>>>>>>> following changes.  Sorry for the little side adventure into the
>>>>>>> next-interp logic; in the end that's going to hurt more than it
>>>>>>> helps in
>>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks also for
>>>>>>> cleaning up this version to be less intrusive to common interfaces; I
>>>>>>> appreciate it.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks a lot for the review.  I've attached an updated patch with the
>>>>>> suggested changes incorporated.
>>>>>>
>>>>>> For the next-interp adventure, I was quite happy to do the
>>>>>> experiment; it's
>>>>>> a good chance of gaining insight into the pass.  Many thanks for
>>>>>> your prompt
>>>>>> replies and patience in guiding!
>>>>>>
>>>>>>
>>>>>>> Everything else looks OK to me.  Please ask Richard for final
>>>>>>> approval,
>>>>>>> as I'm not a maintainer.
>>
>> First a note, I need to check on voting for Bill as the slsr maintainer
>> from the steering committee.   Voting was in progress just before the
>> close of stage1 development so I haven't tallied the results :-)
>
>
> Looking forward to some good news! :)
>
>
>>>>
>>>> Yes, you are right about the non-trivial 'base' tree are rarely shared.
>>>>     The cached is introduced mainly because get_alternative_base () may
>>>> be
>>>> called twice on the same 'base' tree, once in the
>>>> find_basis_for_candidate () for look-up and the other time in
>>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm happy
>>>> to leave out the cache if you think the benefit is trivial.
>>
>> Without some sense of how expensive the lookups are vs how often the
>> cache hits it's awful hard to know if the cache is worth it.
>>
>> I'd say take it out unless you have some sense it's really saving time.
>>    It's a pretty minor implementation detail either way.
>
>
> I think the affine tree routines are generally expensive; it is worth having
> a cache to avoid calling them too many times.  I run the slsr-*.c tests
> under gcc.dg/tree-ssa/ and find out that the cache hit rates range from
> 55.6% to 90%, with 73.5% as the average.  The samples may not well represent
> the real world scenario, but they do show the fact that the 'base' tree can
> be shared to some extent.  So I'd like to have the cache in the patch.
>
>
>>
>>>>
>>>>> +/* { dg-do compile } */
>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
>>>>> +
>>>>> +typedef int arr_2[50][50];
>>>>> +
>>>>> +void foo (arr_2 a2, int v1)
>>>>> +{
>>>>> +  int i, j;
>>>>> +
>>>>> +  i = v1 + 5;
>>>>> +  j = i;
>>>>> +  a2 [i-10] [j] = 2;
>>>>> +  a2 [i] [j++] = i;
>>>>> +  a2 [i+20] [j++] = i;
>>>>> +  a2 [i-3] [i-1] += 1;
>>>>> +  return;
>>>>> +}
>>>>> +
>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
>>>>>
>>>>> scanning for 5 MEMs looks non-sensical.  What transform do
>>>>> you expect?  I see other slsr testcases do similar non-sensical
>>>>> checking which is bad, too.
>>>>
>>>>
>>>> As the slsr optimizes CAND_REF candidates by simply lowering them to
>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of MEM_REFs
>>>> is an effective check.  Alternatively, I can add a follow-up patch to
>>>> add some dumping facility in replace_ref () to print out the replacing
>>>> actions when -fdump-tree-slsr-details is on.
>>
>> I think adding some details to the dump and scanning for them would be
>> better.  That's the only change that is required for this to move forward.
>
>
> I've updated to patch to dump more details when -fdump-tree-slsr-details is
> on.  The tests have also been updated to scan for these new dumps instead of
> MEMs.
>
>
>>
>> I suggest doing it quickly.  We're well past stage1 close at this point.
>
>
> The bootstrapping on x86_64 is still running.  OK to commit if it succeeds?

I still don't like it.  It's using the wrong and too expensive tools to do
stuff.  What kind of bases are we ultimately interested in?  Browsing
the code it looks like we're having

  /* Base expression for the chain of candidates:  often, but not
     always, an SSA name.  */
  tree base_expr;

which isn't really too informative but I suppose they are all
kind-of-gimple_val()s?  That said, I wonder if you can simply
use get_addr_base_and_unit_offset in place of get_alternative_base (),
ignoring the returned offset.

Richard.

> Thanks,
>
> Yufeng
>
> gcc/
>
>         * gimple-ssa-strength-reduction.c: Include tree-affine.h.
>         (name_expansions): New static variable.
>         (alt_base_map): Ditto.
>         (get_alternative_base): New function.
>         (find_basis_for_candidate): For CAND_REF, optionally call
>         find_basis_for_base_expr with the returned value from
>         get_alternative_base.
>         (record_potential_basis): Add new parameter 'base' of type 'tree';
>         add an assertion of non-NULL base; use base to set node->base_expr.
>         (alloc_cand_and_find_basis): Update; call record_potential_basis
>         for CAND_REF with the returned value from get_alternative_base.
>         (replace_refs): Dump details on the replacing.
>
>         (execute_strength_reduction): Call pointer_map_create for
>         alt_base_map; call free_affine_expand_cache with &name_expansions.
>
> gcc/testsuite/
>
>         * gcc.dg/tree-ssa/slsr-39.c: Update.
>         * gcc.dg/tree-ssa/slsr-41.c: New test.


* Re: [PING] [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-12-03 14:21                         ` Richard Biener
@ 2013-12-03 15:52                           ` Yufeng Zhang
  2013-12-03 19:21                             ` Jeff Law
  2013-12-03 20:32                             ` Richard Biener
  0 siblings, 2 replies; 34+ messages in thread
From: Yufeng Zhang @ 2013-12-03 15:52 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jeff Law, Bill Schmidt, gcc-patches

On 12/03/13 14:20, Richard Biener wrote:
> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang<Yufeng.Zhang@arm.com>  wrote:
>> On 12/03/13 06:48, Jeff Law wrote:
>>>
>>> On 12/02/13 08:47, Yufeng Zhang wrote:
>>>>
>>>> Ping~
>>>>
>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
>>>
>>>
>>>>
>>>> Thanks,
>>>> Yufeng
>>>>
>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
>>>>>
>>>>> On 11/26/13 12:45, Richard Biener wrote:
>>>>>>
>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
>>>>>> Zhang<Yufeng.Zhang@arm.com>     wrote:
>>>>>>>
>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
>>>>>>>>
>>>>>>>> The second version of your original patch is ok with me with the
>>>>>>>> following changes.  Sorry for the little side adventure into the
>>>>>>>> next-interp logic; in the end that's going to hurt more than it
>>>>>>>> helps in
>>>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks also for
>>>>>>>> cleaning up this version to be less intrusive to common interfaces; I
>>>>>>>> appreciate it.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks a lot for the review.  I've attached an updated patch with the
>>>>>>> suggested changes incorporated.
>>>>>>>
>>>>>>> For the next-interp adventure, I was quite happy to do the
>>>>>>> experiment; it's
>>>>>>> a good chance of gaining insight into the pass.  Many thanks for
>>>>>>> your prompt
>>>>>>> replies and patience in guiding!
>>>>>>>
>>>>>>>
>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
>>>>>>>> approval,
>>>>>>>> as I'm not a maintainer.
>>>
>>> First a note, I need to check on voting for Bill as the slsr maintainer
>>> from the steering committee.   Voting was in progress just before the
>>> close of stage1 development so I haven't tallied the results :-)
>>
>>
>> Looking forward to some good news! :)
>>
>>
>>>>>
>>>>> Yes, you are right about the non-trivial 'base' tree are rarely shared.
>>>>>      The cached is introduced mainly because get_alternative_base () may
>>>>> be
>>>>> called twice on the same 'base' tree, once in the
>>>>> find_basis_for_candidate () for look-up and the other time in
>>>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm happy
>>>>> to leave out the cache if you think the benefit is trivial.
>>>
>>> Without some sense of how expensive the lookups are vs how often the
>>> cache hits it's awful hard to know if the cache is worth it.
>>>
>>> I'd say take it out unless you have some sense it's really saving time.
>>>     It's a pretty minor implementation detail either way.
>>
>>
>> I think the affine tree routines are generally expensive; it is worth having
>> a cache to avoid calling them too many times.  I run the slsr-*.c tests
>> under gcc.dg/tree-ssa/ and find out that the cache hit rates range from
>> 55.6% to 90%, with 73.5% as the average.  The samples may not well represent
>> the real world scenario, but they do show the fact that the 'base' tree can
>> be shared to some extent.  So I'd like to have the cache in the patch.
>>
>>
>>>
>>>>>
>>>>>> +/* { dg-do compile } */
>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
>>>>>> +
>>>>>> +typedef int arr_2[50][50];
>>>>>> +
>>>>>> +void foo (arr_2 a2, int v1)
>>>>>> +{
>>>>>> +  int i, j;
>>>>>> +
>>>>>> +  i = v1 + 5;
>>>>>> +  j = i;
>>>>>> +  a2 [i-10] [j] = 2;
>>>>>> +  a2 [i] [j++] = i;
>>>>>> +  a2 [i+20] [j++] = i;
>>>>>> +  a2 [i-3] [i-1] += 1;
>>>>>> +  return;
>>>>>> +}
>>>>>> +
>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
>>>>>>
>>>>>> scanning for 5 MEMs looks non-sensical.  What transform do
>>>>>> you expect?  I see other slsr testcases do similar non-sensical
>>>>>> checking which is bad, too.
>>>>>
>>>>>
>>>>> As the slsr optimizes CAND_REF candidates by simply lowering them to
>>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of MEM_REFs
>>>>> is an effective check.  Alternatively, I can add a follow-up patch to
>>>>> add some dumping facility in replace_ref () to print out the replacing
>>>>> actions when -fdump-tree-slsr-details is on.
>>>
>>> I think adding some details to the dump and scanning for them would be
>>> better.  That's the only change that is required for this to move forward.
>>
>>
>> I've updated to patch to dump more details when -fdump-tree-slsr-details is
>> on.  The tests have also been updated to scan for these new dumps instead of
>> MEMs.
>>
>>
>>>
>>> I suggest doing it quickly.  We're well past stage1 close at this point.
>>
>>
>> The bootstrapping on x86_64 is still running.  OK to commit if it succeeds?
>
> I still don't like it.  It's using the wrong and too expensive tools to do
> stuff.  What kind of bases are we ultimately interested in?  Browsing
> the code it looks like we're having
>
>    /* Base expression for the chain of candidates:  often, but not
>       always, an SSA name.  */
>    tree base_expr;
>
> which isn't really too informative but I suppose they are all
> kind-of-gimple_val()s?  That said, I wonder if you can simply
> use get_addr_base_and_unit_offset in place of get_alternative_base (),
> ignoring the returned offset.

'base_expr' is essentially the base address of a handled_component_p, 
e.g. ARRAY_REF, COMPONENT_REF, etc.  In most cases, it is the address of 
the object returned by get_inner_reference ().

Given a test case like the following:

typedef int arr_2[20][20];

void foo (arr_2 a2, int i, int j)
{
   a2[i+10][j] = 1;
   a2[i+10][j+1] = 1;
   a2[i+20][j] = 1;
}

The IR before SLSR is (on x86_64):

   _2 = (long unsigned int) i_1(D);
   _3 = _2 * 80;
   _4 = _3 + 800;
   _6 = a2_5(D) + _4;
   *_6[j_8(D)] = 1;
   _10 = j_8(D) + 1;
   *_6[_10] = 1;
   _12 = _3 + 1600;
   _13 = a2_5(D) + _12;
   *_13[j_8(D)] = 1;

The base_exprs for the 1st and 2nd memory references are the same, i.e. 
_6, while the base_expr for a2[i+20][j] is _13.

_13 is essentially (_6 + 800), so all of the three memory references 
essentially share the same base address.  As their strides are also the 
same (MULT_EXPR (j, 4)), the three references can all be lowered to 
MEM_REFs.  What this patch does is to use the tree affine tools to help 
recognize the underlying base address expression; as it requires looking 
into the definitions of SSA_NAMEs, get_addr_base_and_unit_offset () 
won't help here.
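
To make the 800-byte relationship concrete, here is a minimal standalone
sketch (not GCC code).  The helper ref_offset is hypothetical; it only
mirrors the address arithmetic in the IR above, assuming rows of type
int[20] as in the test case.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of GCC: byte offset of a2[row][col]
   from a2, where each row has type int[20].  */
size_t
ref_offset (size_t row, size_t col)
{
  assert (col < 20);
  return row * sizeof (int[20]) + col * sizeof (int);
}
```

For any i and j, ref_offset (i + 20, j) - ref_offset (i + 10, j) equals
10 * sizeof (int[20]), which is 800 when int is 4 bytes; that is the
constant difference between _13 and _6.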

Bill has helped me explore other ways of achieving this in SLSR, but so 
far we think this is the best way to proceed.  The use of tree affine 
routines has been restricted to CAND_REFs only and there is the 
aforementioned cache facility to help reduce the overhead.

Thanks,
Yufeng

P.S. Some more details on what the patch does:

The CAND_REFs for the three memory references are:

   6  [2] *_6[j_8(D)] = 1;
      REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
      basis: 0  dependent: 8  sibling: 0
      next-interp: 0  dead-savings: 0

   8  [2] *_6[_10] = 1;
      REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
      basis: 6  dependent: 11  sibling: 0
      next-interp: 0  dead-savings: 0

  11  [2] *_13[j_8(D)] = 1;
      REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
      basis: 8  dependent: 0  sibling: 0
      next-interp: 0  dead-savings: 0

Before the patch, the strength reduction candidate chains for the three 
CAND_REFs are:

   _6 -> 6 -> 8
   _13 -> 11

i.e. SLSR recognizes that the first two references share the same basis, 
while the last one is on its own.

With the patch, an extra candidate chain can be recognized:

   a2_5(D) + (sizetype) i_1(D) * 80 -> 6 -> 11 -> 8

i.e. all of the three references are found to have the same basis 
(a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded _6 
or _13, with the immediate offset removed.  The pass is now able to 
lower all of the three references, instead of the first two only, to 
MEM_REFs.
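
The grouping can be modelled with a toy affine form.  This is only a
sketch under stated assumptions: the toy_* names are made up for
illustration and are not the tree-affine API, and the form tracks just
the two leaf values from the example (a2_5(D) and i_1(D)) plus a
constant.  Two expanded bases are treated as the same alternative base
when they agree on everything except the constant.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy affine form: coef_a2 * a2_5(D) + coef_i * i_1(D) + cst.
   Hypothetical; not the representation used by tree-affine.  */
struct toy_aff
{
  long coef_a2;
  long coef_i;
  long cst;
};

struct toy_aff
toy_leaf_a2 (void)
{
  struct toy_aff r = { 1, 0, 0 };
  return r;
}

struct toy_aff
toy_leaf_i (void)
{
  struct toy_aff r = { 0, 1, 0 };
  return r;
}

struct toy_aff
toy_add (struct toy_aff x, struct toy_aff y)
{
  struct toy_aff r = { x.coef_a2 + y.coef_a2, x.coef_i + y.coef_i,
		       x.cst + y.cst };
  return r;
}

struct toy_aff
toy_scale (struct toy_aff x, long k)
{
  struct toy_aff r = { x.coef_a2 * k, x.coef_i * k, x.cst * k };
  return r;
}

struct toy_aff
toy_add_cst (struct toy_aff x, long c)
{
  x.cst += c;
  return x;
}

/* Same alternative base: equal once the immediate offset is ignored.  */
bool
toy_same_alt_base (struct toy_aff x, struct toy_aff y)
{
  return x.coef_a2 == y.coef_a2 && x.coef_i == y.coef_i;
}

/* Expand a base the way the IR defines _6 (row_offset 800) and
   _13 (row_offset 1600).  */
struct toy_aff
toy_expand (long row_offset)
{
  struct toy_aff t3 = toy_scale (toy_leaf_i (), 80);	/* _3 = _2 * 80 */
  struct toy_aff t = toy_add_cst (t3, row_offset);	/* _4 or _12 */
  return toy_add (toy_leaf_a2 (), t);			/* _6 or _13 */
}

void
toy_demo (void)
{
  struct toy_aff b6 = toy_expand (800);
  struct toy_aff b13 = toy_expand (1600);

  assert (toy_same_alt_base (b6, b13));
  assert (b13.cst - b6.cst == 800);
}
```

In this model both _6 and _13 reduce to a2_5(D) + i_1(D) * 80 once the
constant is dropped, which is why candidates 6, 8 and 11 can land in one
chain.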


* Re: [PING] [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-12-03 15:52                           ` Yufeng Zhang
@ 2013-12-03 19:21                             ` Jeff Law
  2013-12-03 20:32                             ` Richard Biener
  1 sibling, 0 replies; 34+ messages in thread
From: Jeff Law @ 2013-12-03 19:21 UTC (permalink / raw)
  To: Yufeng Zhang, Richard Biener; +Cc: Bill Schmidt, gcc-patches

On 12/03/13 08:52, Yufeng Zhang wrote:
>>
>> I still don't like it.  It's using the wrong and too expensive tools
>> to do
>> stuff.  What kind of bases are we ultimately interested in?  Browsing
>> the code it looks like we're having
>>
>>    /* Base expression for the chain of candidates:  often, but not
>>       always, an SSA name.  */
>>    tree base_expr;
>>
>> which isn't really too informative but I suppose they are all
>> kind-of-gimple_val()s?  That said, I wonder if you can simply
>> use get_addr_base_and_unit_offset in place of get_alternative_base (),
>> ignoring the returned offset.
>
> 'base_expr' is essentially the base address of a handled_component_p,
> e.g. ARRAY_REF, COMPONENT_REF, etc.  In most case, it is the address of
> the object returned by get_inner_reference ().
>
> Given a test case like the following:
>
> typedef int arr_2[20][20];
>
> void foo (arr_2 a2, int i, int j)
> {
>    a2[i+10][j] = 1;
>    a2[i+10][j+1] = 1;
>    a2[i+20][j] = 1;
> }
>
> The IR before SLSR is (on x86_64):
>
>    _2 = (long unsigned int) i_1(D);
>    _3 = _2 * 80;
>    _4 = _3 + 800;
>    _6 = a2_5(D) + _4;
>    *_6[j_8(D)] = 1;
>    _10 = j_8(D) + 1;
>    *_6[_10] = 1;
>    _12 = _3 + 1600;
>    _13 = a2_5(D) + _12;
>    *_13[j_8(D)] = 1;
>
> The base_expr for the 1st and 2nd memory reference are the same, i.e.
> _6, while the base_expr for a2[i+20][j] is _13.
>
> _13 is essentially (_6 + 800), so all of the three memory references
> essentially share the same base address.  As their strides are also the
> same (MULT_EXPR (j, 4)), the three references can all be lowered to
> MEM_REFs.  What this patch does is to use the tree affine tools to help
> recognize the underlying base address expression; as it requires looking
> into the definitions of SSA_NAMEs, get_addr_base_and_unit_offset ()
> won't help here.
>
> Bill has helped me exploit other ways of achieving this in SLSR, but so
> far we think this is the best way to proceed.  The use of tree affine
> routines has been restricted to CAND_REFs only and there is the
> aforementioned cache facility to help reduce the overhead.
Right and I think Bill's opinions should carry the weight here since he 
wrote the SLSR code and will likely be its maintainer.

So let's keep the cache per your data.  OK for the trunk.  Thanks for 
your patience and understanding.

jeff


* Re: [PING] [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-12-03 15:52                           ` Yufeng Zhang
  2013-12-03 19:21                             ` Jeff Law
@ 2013-12-03 20:32                             ` Richard Biener
  2013-12-03 21:57                               ` Yufeng Zhang
  2013-12-03 22:04                               ` Bill Schmidt
  1 sibling, 2 replies; 34+ messages in thread
From: Richard Biener @ 2013-12-03 20:32 UTC (permalink / raw)
  To: Yufeng Zhang; +Cc: Jeff Law, Bill Schmidt, gcc-patches

Yufeng Zhang <Yufeng.Zhang@arm.com> wrote:
>On 12/03/13 14:20, Richard Biener wrote:
>> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang<Yufeng.Zhang@arm.com> 
>wrote:
>>> On 12/03/13 06:48, Jeff Law wrote:
>>>>
>>>> On 12/02/13 08:47, Yufeng Zhang wrote:
>>>>>
>>>>> Ping~
>>>>>
>>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
>>>>
>>>>
>>>>>
>>>>> Thanks,
>>>>> Yufeng
>>>>>
>>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
>>>>>>
>>>>>> On 11/26/13 12:45, Richard Biener wrote:
>>>>>>>
>>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
>>>>>>> Zhang<Yufeng.Zhang@arm.com>     wrote:
>>>>>>>>
>>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
>>>>>>>>>
>>>>>>>>> The second version of your original patch is ok with me with
>the
>>>>>>>>> following changes.  Sorry for the little side adventure into
>the
>>>>>>>>> next-interp logic; in the end that's going to hurt more than
>it
>>>>>>>>> helps in
>>>>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks
>also for
>>>>>>>>> cleaning up this version to be less intrusive to common
>interfaces; I
>>>>>>>>> appreciate it.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks a lot for the review.  I've attached an updated patch
>with the
>>>>>>>> suggested changes incorporated.
>>>>>>>>
>>>>>>>> For the next-interp adventure, I was quite happy to do the
>>>>>>>> experiment; it's
>>>>>>>> a good chance of gaining insight into the pass.  Many thanks
>for
>>>>>>>> your prompt
>>>>>>>> replies and patience in guiding!
>>>>>>>>
>>>>>>>>
>>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
>>>>>>>>> approval,
>>>>>>>>> as I'm not a maintainer.
>>>>
>>>> First a note, I need to check on voting for Bill as the slsr
>maintainer
>>>> from the steering committee.   Voting was in progress just before
>the
>>>> close of stage1 development so I haven't tallied the results :-)
>>>
>>>
>>> Looking forward to some good news! :)
>>>
>>>
>>>>>>
>>>>>> Yes, you are right about the non-trivial 'base' tree are rarely
>shared.
>>>>>>      The cached is introduced mainly because get_alternative_base
>() may
>>>>>> be
>>>>>> called twice on the same 'base' tree, once in the
>>>>>> find_basis_for_candidate () for look-up and the other time in
>>>>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm
>happy
>>>>>> to leave out the cache if you think the benefit is trivial.
>>>>
>>>> Without some sense of how expensive the lookups are vs how often
>the
>>>> cache hits it's awful hard to know if the cache is worth it.
>>>>
>>>> I'd say take it out unless you have some sense it's really saving
>time.
>>>>     It's a pretty minor implementation detail either way.
>>>
>>>
>>> I think the affine tree routines are generally expensive; it is
>worth having
>>> a cache to avoid calling them too many times.  I run the slsr-*.c
>tests
>>> under gcc.dg/tree-ssa/ and find out that the cache hit rates range
>from
>>> 55.6% to 90%, with 73.5% as the average.  The samples may not well
>represent
>>> the real world scenario, but they do show the fact that the 'base'
>tree can
>>> be shared to some extent.  So I'd like to have the cache in the
>patch.
>>>
>>>
>>>>
>>>>>>
>>>>>>> +/* { dg-do compile } */
>>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
>>>>>>> +
>>>>>>> +typedef int arr_2[50][50];
>>>>>>> +
>>>>>>> +void foo (arr_2 a2, int v1)
>>>>>>> +{
>>>>>>> +  int i, j;
>>>>>>> +
>>>>>>> +  i = v1 + 5;
>>>>>>> +  j = i;
>>>>>>> +  a2 [i-10] [j] = 2;
>>>>>>> +  a2 [i] [j++] = i;
>>>>>>> +  a2 [i+20] [j++] = i;
>>>>>>> +  a2 [i-3] [i-1] += 1;
>>>>>>> +  return;
>>>>>>> +}
>>>>>>> +
>>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
>>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
>>>>>>>
>>>>>>> scanning for 5 MEMs looks non-sensical.  What transform do
>>>>>>> you expect?  I see other slsr testcases do similar non-sensical
>>>>>>> checking which is bad, too.
>>>>>>
>>>>>>
>>>>>> As the slsr optimizes CAND_REF candidates by simply lowering them
>to
>>>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of
>MEM_REFs
>>>>>> is an effective check.  Alternatively, I can add a follow-up
>patch to
>>>>>> add some dumping facility in replace_ref () to print out the
>replacing
>>>>>> actions when -fdump-tree-slsr-details is on.
>>>>
>>>> I think adding some details to the dump and scanning for them would
>be
>>>> better.  That's the only change that is required for this to move
>forward.
>>>
>>>
>>> I've updated to patch to dump more details when
>-fdump-tree-slsr-details is
>>> on.  The tests have also been updated to scan for these new dumps
>instead of
>>> MEMs.
>>>
>>>
>>>>
>>>> I suggest doing it quickly.  We're well past stage1 close at this
>point.
>>>
>>>
>>> The bootstrapping on x86_64 is still running.  OK to commit if it
>succeeds?
>>
>> I still don't like it.  It's using the wrong and too expensive tools
>to do
>> stuff.  What kind of bases are we ultimately interested in?  Browsing
>> the code it looks like we're having
>>
>>    /* Base expression for the chain of candidates:  often, but not
>>       always, an SSA name.  */
>>    tree base_expr;
>>
>> which isn't really too informative but I suppose they are all
>> kind-of-gimple_val()s?  That said, I wonder if you can simply
>> use get_addr_base_and_unit_offset in place of get_alternative_base
>(),
>> ignoring the returned offset.
>
>'base_expr' is essentially the base address of a handled_component_p, 
>e.g. ARRAY_REF, COMPONENT_REF, etc.  In most case, it is the address of
>
>the object returned by get_inner_reference ().
>
>Given a test case like the following:
>
>typedef int arr_2[20][20];
>
>void foo (arr_2 a2, int i, int j)
>{
>   a2[i+10][j] = 1;
>   a2[i+10][j+1] = 1;
>   a2[i+20][j] = 1;
>}
>
>The IR before SLSR is (on x86_64):
>
>   _2 = (long unsigned int) i_1(D);
>   _3 = _2 * 80;
>   _4 = _3 + 800;
>   _6 = a2_5(D) + _4;
>   *_6[j_8(D)] = 1;
>   _10 = j_8(D) + 1;
>   *_6[_10] = 1;
>   _12 = _3 + 1600;
>   _13 = a2_5(D) + _12;
>   *_13[j_8(D)] = 1;
>
>The base_expr for the 1st and 2nd memory reference are the same, i.e. 
>_6, while the base_expr for a2[i+20][j] is _13.
>
>_13 is essentially (_6 + 800), so all of the three memory references 
>essentially share the same base address.  As their strides are also the
>
>same (MULT_EXPR (j, 4)), the three references can all be lowered to 
>MEM_REFs.  What this patch does is to use the tree affine tools to help
>
>recognize the underlying base address expression; as it requires
>looking 
>into the definitions of SSA_NAMEs, get_addr_base_and_unit_offset () 
>won't help here.
>
>Bill has helped me exploit other ways of achieving this in SLSR, but so
>
>far we think this is the best way to proceed.  The use of tree affine 
>routines has been restricted to CAND_REFs only and there is the 
>aforementioned cache facility to help reduce the overhead.
>
>Thanks,
>Yufeng
>
>P.S. some more details what the patch does:
>
>The CAND_REF for the three memory references are:
>
>  6  [2] *_6[j_8(D)] = 1;
>      REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>      basis: 0  dependent: 8  sibling: 0
>      next-interp: 0  dead-savings: 0
>
>   8  [2] *_6[_10] = 1;
>      REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
>      basis: 6  dependent: 11  sibling: 0
>      next-interp: 0  dead-savings: 0
>
>  11  [2] *_13[j_8(D)] = 1;
>      REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>      basis: 8  dependent: 0  sibling: 0
>      next-interp: 0  dead-savings: 0
>
>Before the patch, the strength reduction candidate chains for the three
>
>CAND_REFs are:
>
>   _6 -> 6 -> 8
>   _13 -> 11
>
>i.e. SLSR recognizes the first two references share the same basis, 
>while the last one is on it own.
>
>With the patch, an extra candidate chain can be recognized:
>
>   a2_5(D) + (sizetype) i_1(D) * 80 -> 6 -> 11 -> 8
>
>i.e. all of the three references are found to have the same basis 
>(a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded
>_6 
>or _13, with the immediate offset removed.  The pass is now able to 
>lower all of the three references, instead of the first two only, to 
>MEM_REFs.

Ok, so slsr handles arbitrarily complex bases and figures out common
components?  If so, then why not just use get_inner_reference?  After
all, slsr does not use tree-affine as the representation for bases
(which it could?)

Richard.



* Re: [PING] [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-12-03 20:32                             ` Richard Biener
@ 2013-12-03 21:57                               ` Yufeng Zhang
  2013-12-03 22:19                                 ` Bill Schmidt
  2013-12-03 22:04                               ` Bill Schmidt
  1 sibling, 1 reply; 34+ messages in thread
From: Yufeng Zhang @ 2013-12-03 21:57 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jeff Law, Bill Schmidt, gcc-patches

On 12/03/13 20:35, Richard Biener wrote:
> Yufeng Zhang<Yufeng.Zhang@arm.com>  wrote:
>> On 12/03/13 14:20, Richard Biener wrote:
>>> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang<Yufeng.Zhang@arm.com>
>> wrote:
>>>> On 12/03/13 06:48, Jeff Law wrote:
>>>>>
>>>>> On 12/02/13 08:47, Yufeng Zhang wrote:
>>>>>>
>>>>>> Ping~
>>>>>>
>>>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
>>>>>
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Yufeng
>>>>>>
>>>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
>>>>>>>
>>>>>>> On 11/26/13 12:45, Richard Biener wrote:
>>>>>>>>
>>>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
>>>>>>>> Zhang<Yufeng.Zhang@arm.com>      wrote:
>>>>>>>>>
>>>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
>>>>>>>>>>
>>>>>>>>>> The second version of your original patch is ok with me with
>> the
>>>>>>>>>> following changes.  Sorry for the little side adventure into
>> the
>>>>>>>>>> next-interp logic; in the end that's going to hurt more than
>> it
>>>>>>>>>> helps in
>>>>>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks
>> also for
>>>>>>>>>> cleaning up this version to be less intrusive to common
>> interfaces; I
>>>>>>>>>> appreciate it.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks a lot for the review.  I've attached an updated patch
>> with the
>>>>>>>>> suggested changes incorporated.
>>>>>>>>>
>>>>>>>>> For the next-interp adventure, I was quite happy to do the
>>>>>>>>> experiment; it's
>>>>>>>>> a good chance of gaining insight into the pass.  Many thanks
>> for
>>>>>>>>> your prompt
>>>>>>>>> replies and patience in guiding!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
>>>>>>>>>> approval,
>>>>>>>>>> as I'm not a maintainer.
>>>>>
>>>>> First a note, I need to check on voting for Bill as the slsr
>> maintainer
>>>>> from the steering committee.   Voting was in progress just before
>> the
>>>>> close of stage1 development so I haven't tallied the results :-)
>>>>
>>>>
>>>> Looking forward to some good news! :)
>>>>
>>>>
>>>>>>>
>>>>>>> Yes, you are right about the non-trivial 'base' tree are rarely
>> shared.
>>>>>>>       The cached is introduced mainly because get_alternative_base
>> () may
>>>>>>> be
>>>>>>> called twice on the same 'base' tree, once in the
>>>>>>> find_basis_for_candidate () for look-up and the other time in
>>>>>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm
>> happy
>>>>>>> to leave out the cache if you think the benefit is trivial.
>>>>>
>>>>> Without some sense of how expensive the lookups are vs how often
>> the
>>>>> cache hits it's awful hard to know if the cache is worth it.
>>>>>
>>>>> I'd say take it out unless you have some sense it's really saving
>> time.
>>>>>      It's a pretty minor implementation detail either way.
>>>>
>>>>
>>>> I think the affine tree routines are generally expensive; it is
>> worth having
>>>> a cache to avoid calling them too many times.  I run the slsr-*.c
>> tests
>>>> under gcc.dg/tree-ssa/ and find out that the cache hit rates range
>> from
>>>> 55.6% to 90%, with 73.5% as the average.  The samples may not well
>> represent
>>>> the real world scenario, but they do show the fact that the 'base'
>> tree can
>>>> be shared to some extent.  So I'd like to have the cache in the
>> patch.
>>>>
>>>>
>>>>>
>>>>>>>
>>>>>>>> +/* { dg-do compile } */
>>>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
>>>>>>>> +
>>>>>>>> +typedef int arr_2[50][50];
>>>>>>>> +
>>>>>>>> +void foo (arr_2 a2, int v1)
>>>>>>>> +{
>>>>>>>> +  int i, j;
>>>>>>>> +
>>>>>>>> +  i = v1 + 5;
>>>>>>>> +  j = i;
>>>>>>>> +  a2 [i-10] [j] = 2;
>>>>>>>> +  a2 [i] [j++] = i;
>>>>>>>> +  a2 [i+20] [j++] = i;
>>>>>>>> +  a2 [i-3] [i-1] += 1;
>>>>>>>> +  return;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
>>>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
>>>>>>>>
>>>>>>>> scanning for 5 MEMs looks non-sensical.  What transform do
>>>>>>>> you expect?  I see other slsr testcases do similar non-sensical
>>>>>>>> checking which is bad, too.
>>>>>>>
>>>>>>>
>>>>>>> As the slsr optimizes CAND_REF candidates by simply lowering them
>> to
>>>>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of
>> MEM_REFs
>>>>>>> is an effective check.  Alternatively, I can add a follow-up
>> patch to
>>>>>>> add some dumping facility in replace_ref () to print out the
>> replacing
>>>>>>> actions when -fdump-tree-slsr-details is on.
>>>>>
>>>>> I think adding some details to the dump and scanning for them would
>> be
>>>>> better.  That's the only change that is required for this to move
>> forward.
>>>>
>>>>
>>>> I've updated to patch to dump more details when
>> -fdump-tree-slsr-details is
>>>> on.  The tests have also been updated to scan for these new dumps
>> instead of
>>>> MEMs.
>>>>
>>>>
>>>>>
>>>>> I suggest doing it quickly.  We're well past stage1 close at this
>> point.
>>>>
>>>>
>>>> The bootstrapping on x86_64 is still running.  OK to commit if it
>> succeeds?
>>>
>>> I still don't like it.  It's using the wrong and too expensive tools
>> to do
>>> stuff.  What kind of bases are we ultimately interested in?  Browsing
>>> the code it looks like we're having
>>>
>>>     /* Base expression for the chain of candidates:  often, but not
>>>        always, an SSA name.  */
>>>     tree base_expr;
>>>
>>> which isn't really too informative but I suppose they are all
>>> kind-of-gimple_val()s?  That said, I wonder if you can simply
>>> use get_addr_base_and_unit_offset in place of get_alternative_base
>> (),
>>> ignoring the returned offset.
>>
>> 'base_expr' is essentially the base address of a handled_component_p,
>> e.g. ARRAY_REF, COMPONENT_REF, etc.  In most case, it is the address of
>>
>> the object returned by get_inner_reference ().
>>
>> Given a test case like the following:
>>
>> typedef int arr_2[20][20];
>>
>> void foo (arr_2 a2, int i, int j)
>> {
>>    a2[i+10][j] = 1;
>>    a2[i+10][j+1] = 1;
>>    a2[i+20][j] = 1;
>> }
>>
>> The IR before SLSR is (on x86_64):
>>
>>    _2 = (long unsigned int) i_1(D);
>>    _3 = _2 * 80;
>>    _4 = _3 + 800;
>>    _6 = a2_5(D) + _4;
>>    *_6[j_8(D)] = 1;
>>    _10 = j_8(D) + 1;
>>    *_6[_10] = 1;
>>    _12 = _3 + 1600;
>>    _13 = a2_5(D) + _12;
>>    *_13[j_8(D)] = 1;
>>
>> The base_expr for the 1st and 2nd memory reference are the same, i.e.
>> _6, while the base_expr for a2[i+20][j] is _13.
>>
>> _13 is essentially (_6 + 800), so all of the three memory references
>> essentially share the same base address.  As their strides are also the
>>
>> same (MULT_EXPR (j, 4)), the three references can all be lowered to
>> MEM_REFs.  What this patch does is to use the tree affine tools to help
>>
>> recognize the underlying base address expression; as it requires
>> looking
>> into the definitions of SSA_NAMEs, get_addr_base_and_unit_offset ()
>> won't help here.
>>
>> Bill has helped me exploit other ways of achieving this in SLSR, but so
>>
>> far we think this is the best way to proceed.  The use of tree affine
>> routines has been restricted to CAND_REFs only and there is the
>> aforementioned cache facility to help reduce the overhead.
>>
>> Thanks,
>> Yufeng
>>
>> P.S. some more details what the patch does:
>>
>> The CAND_REF for the three memory references are:
>>
>>   6  [2] *_6[j_8(D)] = 1;
>>       REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>>       basis: 0  dependent: 8  sibling: 0
>>       next-interp: 0  dead-savings: 0
>>
>>    8  [2] *_6[_10] = 1;
>>       REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
>>       basis: 6  dependent: 11  sibling: 0
>>       next-interp: 0  dead-savings: 0
>>
>>   11  [2] *_13[j_8(D)] = 1;
>>       REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>>       basis: 8  dependent: 0  sibling: 0
>>       next-interp: 0  dead-savings: 0
>>
>> Before the patch, the strength reduction candidate chains for the three
>>
>> CAND_REFs are:
>>
>>    _6 ->  6 ->  8
>>    _13 ->  11
>>
>> i.e. SLSR recognizes the first two references share the same basis,
>> while the last one is on it own.
>>
>> With the patch, an extra candidate chain can be recognized:
>>
>>    a2_5(D) + (sizetype) i_1(D) * 80 ->  6 ->  11 ->  8
>>
>> i.e. all of the three references are found to have the same basis
>> (a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded
>> _6
>> or _13, with the immediate offset removed.  The pass is now able to
>> lower all of the three references, instead of the first two only, to
>> MEM_REFs.
>
> Ok, so slsr handles arbitrary complex bases and figures out common components? If so, then why not just use get_inner_reference?

slsr does indeed already use get_inner_reference () to figure out the
common components; see restructure_reference () and the comment at the
beginning of gimple-ssa-strength-reduction.c.  I quote part of that
comment here for convenience:

    Specifically, we are interested in references for which
    get_inner_reference returns a base address, offset, and bitpos as
    follows:

      base:    MEM_REF (T1, C1)
      offset:  MULT_EXPR (PLUS_EXPR (T2, C2), C3)
      bitpos:  C4 * BITS_PER_UNIT

    Here T1 and T2 are arbitrary trees, and C1, C2, C3, C4 are
    arbitrary integer constants.  Note that C2 may be zero, in which
    case the offset will be MULT_EXPR (T2, C3).

    When this pattern is recognized, the original memory reference
    can be replaced with:

      MEM_REF (POINTER_PLUS_EXPR (T1, MULT_EXPR (T2, C3)),
               C1 + (C2 * C3) + C4)

    which distributes the multiply to allow constant folding.  When
    two or more addressing expressions can be represented by MEM_REFs
    of this form, differing only in the constants C1, C2, and C4,
    making this substitution produces more efficient addressing during
    the RTL phases.  When there are not at least two expressions with
    the same values of T1, T2, and C3, there is nothing to be gained
    by the replacement.

    Strength reduction of CAND_REFs uses the same infrastructure as
    that used by CAND_MULTs and CAND_ADDs.  We record T1 in the base (B)
    field, MULT_EXPR (T2, C3) in the stride (S) field, and
    C1 + (C2 * C3) + C4 in the index (i) field.  A basis for a CAND_REF
    is thus another CAND_REF with the same B and S values.  When at
    least two CAND_REFs are chained together using the basis relation,
    each of them is replaced as above, resulting in improved code
    generation for addressing.

As the last paragraph says, a basis for a CAND_REF is another CAND_REF
with the same B and S values.  This patch extends the definition of
basis: the B values no longer have to be identical, as long as they
differ only by an immediate constant.
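
To make the extended definition concrete, here is a minimal sketch in
plain C.  It is not SLSR code and uses no GCC internals; struct
ref_parts and the functions split_ref, same_basis and check_three_refs
are hypothetical names made up for illustration.  It splits each of the
three references from the test case quoted above into a base with the
constant row offset removed, a stride, and a folded constant index, and
checks that all three then agree on base and stride.  It assumes a flat
address space, so that pointer-to-integer conversion preserves byte
offsets.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical split of a reference a2[i + c_row][j + c_col] into the
   three fields recorded for a CAND_REF.  Not a GCC data structure.  */
struct ref_parts
{
  uintptr_t base;    /* B: a2 + i * sizeof (row), row constant removed */
  uintptr_t stride;  /* S: j * sizeof (int) */
  uintptr_t index;   /* i: every immediate offset folded together */
};

static struct ref_parts
split_ref (int (*a2)[20], unsigned i, unsigned j,
           unsigned c_row, unsigned c_col)
{
  struct ref_parts p;
  p.base = (uintptr_t) a2 + i * sizeof (int[20]);
  p.stride = j * sizeof (int);
  p.index = c_row * sizeof (int[20]) + c_col * sizeof (int);
  return p;
}

/* Two references can be chained when B and S match; only the constant
   index is allowed to differ.  */
static int
same_basis (struct ref_parts a, struct ref_parts b)
{
  return a.base == b.base && a.stride == b.stride;
}

/* Returns 1 when all three references from the test case share a basis.  */
int
check_three_refs (void)
{
  static int storage[40][20];
  int (*a2)[20] = storage;
  unsigned i = 3, j = 5;

  struct ref_parts r1 = split_ref (a2, i, j, 10, 0);  /* a2[i+10][j]   */
  struct ref_parts r2 = split_ref (a2, i, j, 10, 1);  /* a2[i+10][j+1] */
  struct ref_parts r3 = split_ref (a2, i, j, 20, 0);  /* a2[i+20][j]   */

  /* Each split must reproduce the address the compiler computes.  */
  assert (r1.base + r1.stride + r1.index == (uintptr_t) &a2[i + 10][j]);
  assert (r2.base + r2.stride + r2.index == (uintptr_t) &a2[i + 10][j + 1]);
  assert (r3.base + r3.stride + r3.index == (uintptr_t) &a2[i + 20][j]);

  return same_basis (r1, r2) && same_basis (r1, r3);
}
```

With a 4-byte int the three index values come out as 800, 804 and
1600, matching the immediates in the dump above.  Without the
alternative base, the third reference would be keyed on _13 rather than
on the shared expression, so it would never be compared against the
first two.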

>  After all slsr does not use tree-affine as representation for bases (which it could?)

In theory it could use tree-affine for bases, and I experimented with
this approach as well, but I ran into some unexpected re-association
issues when building spec2k: when a tree-affine combination is converted
back to a tree, the association order can be different from, or worse
than, what it was before.  I also didn't see any obvious benefit, so I
didn't proceed further.

In the long run, the additional lowering of memory accesses you 
mentioned in http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03731.html may 
be a better solution to what I'm trying to tackle here.  I'll see if I 
can get time to work out something useful for 4.10. :)

Yufeng

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PING] [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-12-03 20:32                             ` Richard Biener
  2013-12-03 21:57                               ` Yufeng Zhang
@ 2013-12-03 22:04                               ` Bill Schmidt
  2013-12-04 10:26                                 ` Richard Biener
  1 sibling, 1 reply; 34+ messages in thread
From: Bill Schmidt @ 2013-12-03 22:04 UTC (permalink / raw)
  To: Richard Biener; +Cc: Yufeng Zhang, Jeff Law, gcc-patches

On Tue, 2013-12-03 at 21:35 +0100, Richard Biener wrote:
> Yufeng Zhang <Yufeng.Zhang@arm.com> wrote:
> >On 12/03/13 14:20, Richard Biener wrote:
> >> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang<Yufeng.Zhang@arm.com> 
> >wrote:
> >>> On 12/03/13 06:48, Jeff Law wrote:
> >>>>
> >>>> On 12/02/13 08:47, Yufeng Zhang wrote:
> >>>>>
> >>>>> Ping~
> >>>>>
> >>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
> >>>>
> >>>>
> >>>>>
> >>>>> Thanks,
> >>>>> Yufeng
> >>>>>
> >>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
> >>>>>>
> >>>>>> On 11/26/13 12:45, Richard Biener wrote:
> >>>>>>>
> >>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
> >>>>>>> Zhang<Yufeng.Zhang@arm.com>     wrote:
> >>>>>>>>
> >>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
> >>>>>>>>>
> >>>>>>>>> The second version of your original patch is ok with me with
> >the
> >>>>>>>>> following changes.  Sorry for the little side adventure into
> >the
> >>>>>>>>> next-interp logic; in the end that's going to hurt more than
> >it
> >>>>>>>>> helps in
> >>>>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks
> >also for
> >>>>>>>>> cleaning up this version to be less intrusive to common
> >interfaces; I
> >>>>>>>>> appreciate it.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Thanks a lot for the review.  I've attached an updated patch
> >with the
> >>>>>>>> suggested changes incorporated.
> >>>>>>>>
> >>>>>>>> For the next-interp adventure, I was quite happy to do the
> >>>>>>>> experiment; it's
> >>>>>>>> a good chance of gaining insight into the pass.  Many thanks
> >for
> >>>>>>>> your prompt
> >>>>>>>> replies and patience in guiding!
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
> >>>>>>>>> approval,
> >>>>>>>>> as I'm not a maintainer.
> >>>>
> >>>> First a note, I need to check on voting for Bill as the slsr
> >maintainer
> >>>> from the steering committee.   Voting was in progress just before
> >the
> >>>> close of stage1 development so I haven't tallied the results :-)
> >>>
> >>>
> >>> Looking forward to some good news! :)
> >>>
> >>>
> >>>>>>
> >>>>>> Yes, you are right about the non-trivial 'base' tree are rarely
> >shared.
> >>>>>>      The cached is introduced mainly because get_alternative_base
> >() may
> >>>>>> be
> >>>>>> called twice on the same 'base' tree, once in the
> >>>>>> find_basis_for_candidate () for look-up and the other time in
> >>>>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm
> >happy
> >>>>>> to leave out the cache if you think the benefit is trivial.
> >>>>
> >>>> Without some sense of how expensive the lookups are vs how often
> >the
> >>>> cache hits it's awful hard to know if the cache is worth it.
> >>>>
> >>>> I'd say take it out unless you have some sense it's really saving
> >time.
> >>>>     It's a pretty minor implementation detail either way.
> >>>
> >>>
> >>> I think the affine tree routines are generally expensive; it is
> >worth having
> >>> a cache to avoid calling them too many times.  I run the slsr-*.c
> >tests
> >>> under gcc.dg/tree-ssa/ and find out that the cache hit rates range
> >from
> >>> 55.6% to 90%, with 73.5% as the average.  The samples may not well
> >represent
> >>> the real world scenario, but they do show the fact that the 'base'
> >tree can
> >>> be shared to some extent.  So I'd like to have the cache in the
> >patch.
> >>>
> >>>
> >>>>
> >>>>>>
> >>>>>>> +/* { dg-do compile } */
> >>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
> >>>>>>> +
> >>>>>>> +typedef int arr_2[50][50];
> >>>>>>> +
> >>>>>>> +void foo (arr_2 a2, int v1)
> >>>>>>> +{
> >>>>>>> +  int i, j;
> >>>>>>> +
> >>>>>>> +  i = v1 + 5;
> >>>>>>> +  j = i;
> >>>>>>> +  a2 [i-10] [j] = 2;
> >>>>>>> +  a2 [i] [j++] = i;
> >>>>>>> +  a2 [i+20] [j++] = i;
> >>>>>>> +  a2 [i-3] [i-1] += 1;
> >>>>>>> +  return;
> >>>>>>> +}
> >>>>>>> +
> >>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
> >>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
> >>>>>>>
> >>>>>>> scanning for 5 MEMs looks non-sensical.  What transform do
> >>>>>>> you expect?  I see other slsr testcases do similar non-sensical
> >>>>>>> checking which is bad, too.
> >>>>>>
> >>>>>>
> >>>>>> As the slsr optimizes CAND_REF candidates by simply lowering them
> >to
> >>>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of
> >MEM_REFs
> >>>>>> is an effective check.  Alternatively, I can add a follow-up
> >patch to
> >>>>>> add some dumping facility in replace_ref () to print out the
> >replacing
> >>>>>> actions when -fdump-tree-slsr-details is on.
> >>>>
> >>>> I think adding some details to the dump and scanning for them would
> >be
> >>>> better.  That's the only change that is required for this to move
> >forward.
> >>>
> >>>
> >>> I've updated to patch to dump more details when
> >-fdump-tree-slsr-details is
> >>> on.  The tests have also been updated to scan for these new dumps
> >instead of
> >>> MEMs.
> >>>
> >>>
> >>>>
> >>>> I suggest doing it quickly.  We're well past stage1 close at this
> >point.
> >>>
> >>>
> >>> The bootstrapping on x86_64 is still running.  OK to commit if it
> >succeeds?
> >>
> >> I still don't like it.  It's using the wrong and too expensive tools
> >to do
> >> stuff.  What kind of bases are we ultimately interested in?  Browsing
> >> the code it looks like we're having
> >>
> >>    /* Base expression for the chain of candidates:  often, but not
> >>       always, an SSA name.  */
> >>    tree base_expr;
> >>
> >> which isn't really too informative but I suppose they are all
> >> kind-of-gimple_val()s?  That said, I wonder if you can simply
> >> use get_addr_base_and_unit_offset in place of get_alternative_base
> >(),
> >> ignoring the returned offset.
> >
> >'base_expr' is essentially the base address of a handled_component_p, 
> >e.g. ARRAY_REF, COMPONENT_REF, etc.  In most case, it is the address of
> >
> >the object returned by get_inner_reference ().
> >
> >Given a test case like the following:
> >
> >typedef int arr_2[20][20];
> >
> >void foo (arr_2 a2, int i, int j)
> >{
> >   a2[i+10][j] = 1;
> >   a2[i+10][j+1] = 1;
> >   a2[i+20][j] = 1;
> >}
> >
> >The IR before SLSR is (on x86_64):
> >
> >   _2 = (long unsigned int) i_1(D);
> >   _3 = _2 * 80;
> >   _4 = _3 + 800;
> >   _6 = a2_5(D) + _4;
> >   *_6[j_8(D)] = 1;
> >   _10 = j_8(D) + 1;
> >   *_6[_10] = 1;
> >   _12 = _3 + 1600;
> >   _13 = a2_5(D) + _12;
> >   *_13[j_8(D)] = 1;
> >
> >The base_expr for the 1st and 2nd memory reference are the same, i.e. 
> >_6, while the base_expr for a2[i+20][j] is _13.
> >
> >_13 is essentially (_6 + 800), so all of the three memory references 
> >essentially share the same base address.  As their strides are also the
> >
> >same (MULT_EXPR (j, 4)), the three references can all be lowered to 
> >MEM_REFs.  What this patch does is to use the tree affine tools to help
> >
> >recognize the underlying base address expression; as it requires
> >looking 
> >into the definitions of SSA_NAMEs, get_addr_base_and_unit_offset () 
> >won't help here.
> >
> >Bill has helped me exploit other ways of achieving this in SLSR, but so
> >
> >far we think this is the best way to proceed.  The use of tree affine 
> >routines has been restricted to CAND_REFs only and there is the 
> >aforementioned cache facility to help reduce the overhead.
> >
> >Thanks,
> >Yufeng
> >
> >P.S. some more details what the patch does:
> >
> >The CAND_REF for the three memory references are:
> >
> >  6  [2] *_6[j_8(D)] = 1;
> >      REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> >      basis: 0  dependent: 8  sibling: 0
> >      next-interp: 0  dead-savings: 0
> >
> >   8  [2] *_6[_10] = 1;
> >      REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
> >      basis: 6  dependent: 11  sibling: 0
> >      next-interp: 0  dead-savings: 0
> >
> >  11  [2] *_13[j_8(D)] = 1;
> >      REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> >      basis: 8  dependent: 0  sibling: 0
> >      next-interp: 0  dead-savings: 0
> >
> >Before the patch, the strength reduction candidate chains for the three
> >
> >CAND_REFs are:
> >
> >   _6 -> 6 -> 8
> >   _13 -> 11
> >
> >i.e. SLSR recognizes the first two references share the same basis, 
> >while the last one is on it own.
> >
> >With the patch, an extra candidate chain can be recognized:
> >
> >   a2_5(D) + (sizetype) i_1(D) * 80 -> 6 -> 11 -> 8
> >
> >i.e. all of the three references are found to have the same basis 
> >(a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded
> >_6 
> >or _13, with the immediate offset removed.  The pass is now able to 
> >lower all of the three references, instead of the first two only, to 
> >MEM_REFs.
> 
> Ok, so slsr handles arbitrary complex bases and figures out common components? If so, then why not just use get_inner_reference? After all slsr does not use tree-affine as representation for bases (which it could?)

I think that's overstating SLSR's current capabilities a bit. :)  We do
use get_inner_reference to come up with the base expression for
reference candidates (based on some of your suggestions a couple of
years back).  However, in the case of multiple levels of array
references, we miss opportunities because get_inner_reference stops at
an SSA name that could be further expanded by following its definition
back to a more fundamental base expression.
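
A toy model of what "following its definition back" means, for anyone
reading along.  This is only a sketch: struct ssa_def, expand and
same_base_modulo_offset are hypothetical names, and none of it is the
tree-affine API or anything in the pass.  It assumes each SSA name is
either an opaque leaf, a name plus a constant, or a sum of two names,
and it compares leaves in order, where a real implementation would
canonicalize first.

```c
#include <assert.h>
#include <stddef.h>

enum def_kind { DEF_LEAF, DEF_ADD_CONST, DEF_ADD };

/* Toy stand-in for the defining statement of an SSA name.  */
struct ssa_def
{
  enum def_kind kind;
  const struct ssa_def *op0;  /* first operand, NULL for a leaf */
  const struct ssa_def *op1;  /* second operand, DEF_ADD only */
  long cst;                   /* immediate, DEF_ADD_CONST only */
};

#define MAX_LEAVES 8

struct expanded
{
  const struct ssa_def *leaves[MAX_LEAVES];
  size_t n_leaves;
  long offset;                /* sum of all immediates seen */
};

/* Walk the definitions, collecting leaves and folding constants.  */
static void
expand (const struct ssa_def *d, struct expanded *out)
{
  switch (d->kind)
    {
    case DEF_ADD_CONST:
      out->offset += d->cst;
      expand (d->op0, out);
      break;
    case DEF_ADD:
      expand (d->op0, out);
      expand (d->op1, out);
      break;
    case DEF_LEAF:
      assert (out->n_leaves < MAX_LEAVES);
      out->leaves[out->n_leaves++] = d;
      break;
    }
}

/* Returns 1 and sets *DELTA when A and B differ only by a constant.  */
int
same_base_modulo_offset (const struct ssa_def *a, const struct ssa_def *b,
                         long *delta)
{
  struct expanded ea = { { 0 }, 0, 0 };
  struct expanded eb = { { 0 }, 0, 0 };
  size_t k;

  expand (a, &ea);
  expand (b, &eb);
  if (ea.n_leaves != eb.n_leaves)
    return 0;
  for (k = 0; k < ea.n_leaves; k++)
    if (ea.leaves[k] != eb.leaves[k])
      return 0;
  *delta = eb.offset - ea.offset;
  return 1;
}

/* The IR from Yufeng's example, with _3 = _2 * 80 kept opaque.  */
int
demo_delta (long *delta)
{
  static const struct ssa_def a2 = { DEF_LEAF, NULL, NULL, 0 };
  static const struct ssa_def t3 = { DEF_LEAF, NULL, NULL, 0 };
  static const struct ssa_def t4 = { DEF_ADD_CONST, &t3, NULL, 800 };
  static const struct ssa_def t6 = { DEF_ADD, &a2, &t4, 0 };
  static const struct ssa_def t12 = { DEF_ADD_CONST, &t3, NULL, 1600 };
  static const struct ssa_def t13 = { DEF_ADD, &a2, &t12, 0 };

  return same_base_modulo_offset (&t6, &t13, delta);
}
```

In this model _6 and _13 expand to the same two leaves, a2 and _3, and
differ only in the folded constant, which is the relationship
get_inner_reference cannot see because it stops at the SSA name.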

Part of the issue here is that reference candidates are the basis for a
more specific optimization than the mult and add candidates.  The latter
have a more general framework for building up a record of simple affine
expressions that can be strength-reduced.  Ultimately we ought to be
able to do something similar for reference candidates, building up
simple affine expressions from base expressions, so that everything is
done in a forward order and the tree-affine interfaces aren't needed.
But that will take some more fundamental design changes, and since this
provides some good improvements for important cases, I feel it's
reasonable to get this into the release.

Thanks,
Bill

> 
> Richard.
> 
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PING] [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-12-03 21:57                               ` Yufeng Zhang
@ 2013-12-03 22:19                                 ` Bill Schmidt
  0 siblings, 0 replies; 34+ messages in thread
From: Bill Schmidt @ 2013-12-03 22:19 UTC (permalink / raw)
  To: Yufeng Zhang; +Cc: Richard Biener, Jeff Law, gcc-patches

On Tue, 2013-12-03 at 21:57 +0000, Yufeng Zhang wrote:
> On 12/03/13 20:35, Richard Biener wrote:
> > Yufeng Zhang<Yufeng.Zhang@arm.com>  wrote:
> >> On 12/03/13 14:20, Richard Biener wrote:
> >>> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang<Yufeng.Zhang@arm.com>
> >> wrote:
> >>>> On 12/03/13 06:48, Jeff Law wrote:
> >>>>>
> >>>>> On 12/02/13 08:47, Yufeng Zhang wrote:
> >>>>>>
> >>>>>> Ping~
> >>>>>>
> >>>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Yufeng
> >>>>>>
> >>>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
> >>>>>>>
> >>>>>>> On 11/26/13 12:45, Richard Biener wrote:
> >>>>>>>>
> >>>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
> >>>>>>>> Zhang<Yufeng.Zhang@arm.com>      wrote:
> >>>>>>>>>
> >>>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
> >>>>>>>>>>
> >>>>>>>>>> The second version of your original patch is ok with me with
> >> the
> >>>>>>>>>> following changes.  Sorry for the little side adventure into
> >> the
> >>>>>>>>>> next-interp logic; in the end that's going to hurt more than
> >> it
> >>>>>>>>>> helps in
> >>>>>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks
> >> also for
> >>>>>>>>>> cleaning up this version to be less intrusive to common
> >> interfaces; I
> >>>>>>>>>> appreciate it.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks a lot for the review.  I've attached an updated patch
> >> with the
> >>>>>>>>> suggested changes incorporated.
> >>>>>>>>>
> >>>>>>>>> For the next-interp adventure, I was quite happy to do the
> >>>>>>>>> experiment; it's
> >>>>>>>>> a good chance of gaining insight into the pass.  Many thanks
> >> for
> >>>>>>>>> your prompt
> >>>>>>>>> replies and patience in guiding!
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
> >>>>>>>>>> approval,
> >>>>>>>>>> as I'm not a maintainer.
> >>>>>
> >>>>> First a note, I need to check on voting for Bill as the slsr
> >> maintainer
> >>>>> from the steering committee.   Voting was in progress just before
> >> the
> >>>>> close of stage1 development so I haven't tallied the results :-)
> >>>>
> >>>>
> >>>> Looking forward to some good news! :)
> >>>>
> >>>>
> >>>>>>>
> >>>>>>> Yes, you are right about the non-trivial 'base' tree are rarely
> >> shared.
> >>>>>>>       The cached is introduced mainly because get_alternative_base
> >> () may
> >>>>>>> be
> >>>>>>> called twice on the same 'base' tree, once in the
> >>>>>>> find_basis_for_candidate () for look-up and the other time in
> >>>>>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm
> >> happy
> >>>>>>> to leave out the cache if you think the benefit is trivial.
> >>>>>
> >>>>> Without some sense of how expensive the lookups are vs how often
> >> the
> >>>>> cache hits it's awful hard to know if the cache is worth it.
> >>>>>
> >>>>> I'd say take it out unless you have some sense it's really saving
> >> time.
> >>>>>      It's a pretty minor implementation detail either way.
> >>>>
> >>>>
> >>>> I think the affine tree routines are generally expensive; it is
> >> worth having
> >>>> a cache to avoid calling them too many times.  I run the slsr-*.c
> >> tests
> >>>> under gcc.dg/tree-ssa/ and find out that the cache hit rates range
> >> from
> >>>> 55.6% to 90%, with 73.5% as the average.  The samples may not well
> >> represent
> >>>> the real world scenario, but they do show the fact that the 'base'
> >> tree can
> >>>> be shared to some extent.  So I'd like to have the cache in the
> >> patch.
> >>>>
> >>>>
> >>>>>
> >>>>>>>
> >>>>>>>> +/* { dg-do compile } */
> >>>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
> >>>>>>>> +
> >>>>>>>> +typedef int arr_2[50][50];
> >>>>>>>> +
> >>>>>>>> +void foo (arr_2 a2, int v1)
> >>>>>>>> +{
> >>>>>>>> +  int i, j;
> >>>>>>>> +
> >>>>>>>> +  i = v1 + 5;
> >>>>>>>> +  j = i;
> >>>>>>>> +  a2 [i-10] [j] = 2;
> >>>>>>>> +  a2 [i] [j++] = i;
> >>>>>>>> +  a2 [i+20] [j++] = i;
> >>>>>>>> +  a2 [i-3] [i-1] += 1;
> >>>>>>>> +  return;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
> >>>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
> >>>>>>>>
> >>>>>>>> scanning for 5 MEMs looks non-sensical.  What transform do
> >>>>>>>> you expect?  I see other slsr testcases do similar non-sensical
> >>>>>>>> checking which is bad, too.
> >>>>>>>
> >>>>>>>
> >>>>>>> As the slsr optimizes CAND_REF candidates by simply lowering them
> >> to
> >>>>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of
> >> MEM_REFs
> >>>>>>> is an effective check.  Alternatively, I can add a follow-up
> >> patch to
> >>>>>>> add some dumping facility in replace_ref () to print out the
> >> replacing
> >>>>>>> actions when -fdump-tree-slsr-details is on.
> >>>>>
> >>>>> I think adding some details to the dump and scanning for them would
> >> be
> >>>>> better.  That's the only change that is required for this to move
> >> forward.
> >>>>
> >>>>
> >>>> I've updated to patch to dump more details when
> >> -fdump-tree-slsr-details is
> >>>> on.  The tests have also been updated to scan for these new dumps
> >> instead of
> >>>> MEMs.
> >>>>
> >>>>
> >>>>>
> >>>>> I suggest doing it quickly.  We're well past stage1 close at this
> >> point.
> >>>>
> >>>>
> >>>> The bootstrapping on x86_64 is still running.  OK to commit if it
> >> succeeds?
> >>>
> >>> I still don't like it.  It's using the wrong and too expensive tools
> >> to do
> >>> stuff.  What kind of bases are we ultimately interested in?  Browsing
> >>> the code it looks like we're having
> >>>
> >>>     /* Base expression for the chain of candidates:  often, but not
> >>>        always, an SSA name.  */
> >>>     tree base_expr;
> >>>
> >>> which isn't really too informative but I suppose they are all
> >>> kind-of-gimple_val()s?  That said, I wonder if you can simply
> >>> use get_addr_base_and_unit_offset in place of get_alternative_base
> >> (),
> >>> ignoring the returned offset.
> >>
> >> 'base_expr' is essentially the base address of a handled_component_p,
> >> e.g. ARRAY_REF, COMPONENT_REF, etc.  In most case, it is the address of
> >>
> >> the object returned by get_inner_reference ().
> >>
> >> Given a test case like the following:
> >>
> >> typedef int arr_2[20][20];
> >>
> >> void foo (arr_2 a2, int i, int j)
> >> {
> >>    a2[i+10][j] = 1;
> >>    a2[i+10][j+1] = 1;
> >>    a2[i+20][j] = 1;
> >> }
> >>
> >> The IR before SLSR is (on x86_64):
> >>
> >>    _2 = (long unsigned int) i_1(D);
> >>    _3 = _2 * 80;
> >>    _4 = _3 + 800;
> >>    _6 = a2_5(D) + _4;
> >>    *_6[j_8(D)] = 1;
> >>    _10 = j_8(D) + 1;
> >>    *_6[_10] = 1;
> >>    _12 = _3 + 1600;
> >>    _13 = a2_5(D) + _12;
> >>    *_13[j_8(D)] = 1;
> >>
> >> The base_expr for the 1st and 2nd memory reference are the same, i.e.
> >> _6, while the base_expr for a2[i+20][j] is _13.
> >>
> >> _13 is essentially (_6 + 800), so all of the three memory references
> >> essentially share the same base address.  As their strides are also the
> >>
> >> same (MULT_EXPR (j, 4)), the three references can all be lowered to
> >> MEM_REFs.  What this patch does is to use the tree affine tools to help
> >>
> >> recognize the underlying base address expression; as it requires
> >> looking
> >> into the definitions of SSA_NAMEs, get_addr_base_and_unit_offset ()
> >> won't help here.
> >>
> >> Bill has helped me exploit other ways of achieving this in SLSR, but so
> >>
> >> far we think this is the best way to proceed.  The use of tree affine
> >> routines has been restricted to CAND_REFs only and there is the
> >> aforementioned cache facility to help reduce the overhead.
> >>
> >> Thanks,
> >> Yufeng
> >>
> >> P.S. some more details what the patch does:
> >>
> >> The CAND_REF for the three memory references are:
> >>
> >>   6  [2] *_6[j_8(D)] = 1;
> >>       REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> >>       basis: 0  dependent: 8  sibling: 0
> >>       next-interp: 0  dead-savings: 0
> >>
> >>    8  [2] *_6[_10] = 1;
> >>       REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
> >>       basis: 6  dependent: 11  sibling: 0
> >>       next-interp: 0  dead-savings: 0
> >>
> >>   11  [2] *_13[j_8(D)] = 1;
> >>       REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> >>       basis: 8  dependent: 0  sibling: 0
> >>       next-interp: 0  dead-savings: 0
> >>
> >> Before the patch, the strength reduction candidate chains for the three
> >>
> >> CAND_REFs are:
> >>
> >>    _6 ->  6 ->  8
> >>    _13 ->  11
> >>
> >> i.e. SLSR recognizes the first two references share the same basis,
> >> while the last one is on it own.
> >>
> >> With the patch, an extra candidate chain can be recognized:
> >>
> >>    a2_5(D) + (sizetype) i_1(D) * 80 ->  6 ->  11 ->  8
> >>
> >> i.e. all of the three references are found to have the same basis
> >> (a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded
> >> _6
> >> or _13, with the immediate offset removed.  The pass is now able to
> >> lower all of the three references, instead of the first two only, to
> >> MEM_REFs.
> >
> > Ok, so slsr handles arbitrary complex bases and figures out common components? If so, then why not just use get_inner_reference?
> 
> slsr does indeed already use get_inner_reference () to figure out the
> common components; see restructure_reference () and the comment at the
> beginning of gimple-ssa-strength-reduction.c.  I quote part of that
> comment here for convenience:
> 
>     Specifically, we are interested in references for which
>     get_inner_reference returns a base address, offset, and bitpos as
>     follows:
> 
>       base:    MEM_REF (T1, C1)
>       offset:  MULT_EXPR (PLUS_EXPR (T2, C2), C3)
>       bitpos:  C4 * BITS_PER_UNIT
> 
>     Here T1 and T2 are arbitrary trees, and C1, C2, C3, C4 are
>     arbitrary integer constants.  Note that C2 may be zero, in which
>     case the offset will be MULT_EXPR (T2, C3).
> 
>     When this pattern is recognized, the original memory reference
>     can be replaced with:
> 
>       MEM_REF (POINTER_PLUS_EXPR (T1, MULT_EXPR (T2, C3)),
>                C1 + (C2 * C3) + C4)
> 
>     which distributes the multiply to allow constant folding.  When
>     two or more addressing expressions can be represented by MEM_REFs
>     of this form, differing only in the constants C1, C2, and C4,
>     making this substitution produces more efficient addressing during
>     the RTL phases.  When there are not at least two expressions with
>     the same values of T1, T2, and C3, there is nothing to be gained
>     by the replacement.
> 
>     Strength reduction of CAND_REFs uses the same infrastructure as
>     that used by CAND_MULTs and CAND_ADDs.  We record T1 in the base (B)
>     field, MULT_EXPR (T2, C3) in the stride (S) field, and
>     C1 + (C2 * C3) + C4 in the index (i) field.  A basis for a CAND_REF
>     is thus another CAND_REF with the same B and S values.  When at
>     least two CAND_REFs are chained together using the basis relation,
>     each of them is replaced as above, resulting in improved code
>     generation for addressing.
> 
> As the last paragraph says, a basis for a CAND_REF is another CAND_REF
> with the same B and S values.  This patch extends the definition of
> basis: the B values no longer have to be identical, as long as they
> differ only by an immediate constant.
> 
> >  After all slsr does not use tree-affine as representation for bases (which it could?)
> 
> In theory it could use tree-affine for bases, and I experimented with
> this approach as well, but I ran into some unexpected re-association
> issues when building spec2k: when a tree-affine combination is converted
> back to a tree, the association order can be different from, or worse
> than, what it was before.  I also didn't see any obvious benefit, so I
> didn't proceed further.
> 
> In the long run, the additional lowering of memory accesses you 
> mentioned in http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03731.html may 
> be a better solution to what I'm trying to tackle here.  I'll see if I 
> can get time to work out something useful for 4.10. :)

This can be pretty dicey as well.  I originally tried to address the
reference candidate commoning by doing earlier lowering of memory
references.  The problem I ran into is a premature loss of aliasing
information, resulting in much worse code generation in a number of
places.  We couldn't think of a good way to carry the lost aliasing
information forward that wasn't a complete bloody hack.  So I backed off
from that.  Richard then made the astute observation that this was a
special case of straight line strength reduction, which GCC didn't
handle yet.  And that's how we ended up where we are...

I don't want to discourage you from looking at further lowering of
memory reference components, but be aware that you need to be thinking
carefully about the TBAA issue from the start.  Otherwise the RTL phases
can be much more constrained (particularly scheduling).

Bill

> 
> Yufeng
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PING] [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-12-03 22:04                               ` Bill Schmidt
@ 2013-12-04 10:26                                 ` Richard Biener
  2013-12-04 10:30                                   ` Richard Biener
  2013-12-04 13:08                                   ` Bill Schmidt
  0 siblings, 2 replies; 34+ messages in thread
From: Richard Biener @ 2013-12-04 10:26 UTC (permalink / raw)
  To: Bill Schmidt; +Cc: Yufeng Zhang, Jeff Law, gcc-patches

On Tue, Dec 3, 2013 at 11:04 PM, Bill Schmidt
<wschmidt@linux.vnet.ibm.com> wrote:
> On Tue, 2013-12-03 at 21:35 +0100, Richard Biener wrote:
>> Yufeng Zhang <Yufeng.Zhang@arm.com> wrote:
>> >On 12/03/13 14:20, Richard Biener wrote:
>> >> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang<Yufeng.Zhang@arm.com>
>> >wrote:
>> >>> On 12/03/13 06:48, Jeff Law wrote:
>> >>>>
>> >>>> On 12/02/13 08:47, Yufeng Zhang wrote:
>> >>>>>
>> >>>>> Ping~
>> >>>>>
>> >>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
>> >>>>
>> >>>>
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Yufeng
>> >>>>>
>> >>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
>> >>>>>>
>> >>>>>> On 11/26/13 12:45, Richard Biener wrote:
>> >>>>>>>
>> >>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
>> >>>>>>> Zhang<Yufeng.Zhang@arm.com>     wrote:
>> >>>>>>>>
>> >>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
>> >>>>>>>>>
>> >>>>>>>>> The second version of your original patch is ok with me with
>> >the
>> >>>>>>>>> following changes.  Sorry for the little side adventure into
>> >the
>> >>>>>>>>> next-interp logic; in the end that's going to hurt more than
>> >it
>> >>>>>>>>> helps in
>> >>>>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks
>> >also for
>> >>>>>>>>> cleaning up this version to be less intrusive to common
>> >interfaces; I
>> >>>>>>>>> appreciate it.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Thanks a lot for the review.  I've attached an updated patch
>> >with the
>> >>>>>>>> suggested changes incorporated.
>> >>>>>>>>
>> >>>>>>>> For the next-interp adventure, I was quite happy to do the
>> >>>>>>>> experiment; it's
>> >>>>>>>> a good chance of gaining insight into the pass.  Many thanks
>> >for
>> >>>>>>>> your prompt
>> >>>>>>>> replies and patience in guiding!
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
>> >>>>>>>>> approval,
>> >>>>>>>>> as I'm not a maintainer.
>> >>>>
>> >>>> First a note, I need to check on voting for Bill as the slsr
>> >maintainer
>> >>>> from the steering committee.   Voting was in progress just before
>> >the
>> >>>> close of stage1 development so I haven't tallied the results :-)
>> >>>
>> >>>
>> >>> Looking forward to some good news! :)
>> >>>
>> >>>
>> >>>>>>
>> >>>>>> Yes, you are right about the non-trivial 'base' tree are rarely
>> >shared.
>> >>>>>>      The cached is introduced mainly because get_alternative_base
>> >() may
>> >>>>>> be
>> >>>>>> called twice on the same 'base' tree, once in the
>> >>>>>> find_basis_for_candidate () for look-up and the other time in
>> >>>>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm
>> >happy
>> >>>>>> to leave out the cache if you think the benefit is trivial.
>> >>>>
>> >>>> Without some sense of how expensive the lookups are vs how often
>> >the
>> >>>> cache hits it's awful hard to know if the cache is worth it.
>> >>>>
>> >>>> I'd say take it out unless you have some sense it's really saving
>> >time.
>> >>>>     It's a pretty minor implementation detail either way.
>> >>>
>> >>>
>> >>> I think the affine tree routines are generally expensive; it is
>> >worth having
>> >>> a cache to avoid calling them too many times.  I run the slsr-*.c
>> >tests
>> >>> under gcc.dg/tree-ssa/ and find out that the cache hit rates range
>> >from
>> >>> 55.6% to 90%, with 73.5% as the average.  The samples may not well
>> >represent
>> >>> the real world scenario, but they do show the fact that the 'base'
>> >tree can
>> >>> be shared to some extent.  So I'd like to have the cache in the
>> >patch.
>> >>>
>> >>>
>> >>>>
>> >>>>>>
>> >>>>>>> +/* { dg-do compile } */
>> >>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
>> >>>>>>> +
>> >>>>>>> +typedef int arr_2[50][50];
>> >>>>>>> +
>> >>>>>>> +void foo (arr_2 a2, int v1)
>> >>>>>>> +{
>> >>>>>>> +  int i, j;
>> >>>>>>> +
>> >>>>>>> +  i = v1 + 5;
>> >>>>>>> +  j = i;
>> >>>>>>> +  a2 [i-10] [j] = 2;
>> >>>>>>> +  a2 [i] [j++] = i;
>> >>>>>>> +  a2 [i+20] [j++] = i;
>> >>>>>>> +  a2 [i-3] [i-1] += 1;
>> >>>>>>> +  return;
>> >>>>>>> +}
>> >>>>>>> +
>> >>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
>> >>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
>> >>>>>>>
>> >>>>>>> scanning for 5 MEMs looks non-sensical.  What transform do
>> >>>>>>> you expect?  I see other slsr testcases do similar non-sensical
>> >>>>>>> checking which is bad, too.
>> >>>>>>
>> >>>>>>
>> >>>>>> As the slsr optimizes CAND_REF candidates by simply lowering them
>> >to
>> >>>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of
>> >MEM_REFs
>> >>>>>> is an effective check.  Alternatively, I can add a follow-up
>> >patch to
>> >>>>>> add some dumping facility in replace_ref () to print out the
>> >replacing
>> >>>>>> actions when -fdump-tree-slsr-details is on.
>> >>>>
>> >>>> I think adding some details to the dump and scanning for them would
>> >be
>> >>>> better.  That's the only change that is required for this to move
>> >forward.
>> >>>
>> >>>
>> >>> I've updated to patch to dump more details when
>> >-fdump-tree-slsr-details is
>> >>> on.  The tests have also been updated to scan for these new dumps
>> >instead of
>> >>> MEMs.
>> >>>
>> >>>
>> >>>>
>> >>>> I suggest doing it quickly.  We're well past stage1 close at this
>> >point.
>> >>>
>> >>>
>> >>> The bootstrapping on x86_64 is still running.  OK to commit if it
>> >succeeds?
>> >>
>> >> I still don't like it.  It's using the wrong and too expensive tools
>> >to do
>> >> stuff.  What kind of bases are we ultimately interested in?  Browsing
>> >> the code it looks like we're having
>> >>
>> >>    /* Base expression for the chain of candidates:  often, but not
>> >>       always, an SSA name.  */
>> >>    tree base_expr;
>> >>
>> >> which isn't really too informative but I suppose they are all
>> >> kind-of-gimple_val()s?  That said, I wonder if you can simply
>> >> use get_addr_base_and_unit_offset in place of get_alternative_base
>> >(),
>> >> ignoring the returned offset.
>> >
>> >'base_expr' is essentially the base address of a handled_component_p,
>> >e.g. ARRAY_REF, COMPONENT_REF, etc.  In most case, it is the address of
>> >
>> >the object returned by get_inner_reference ().
>> >
>> >Given a test case like the following:
>> >
>> >typedef int arr_2[20][20];
>> >
>> >void foo (arr_2 a2, int i, int j)
>> >{
>> >   a2[i+10][j] = 1;
>> >   a2[i+10][j+1] = 1;
>> >   a2[i+20][j] = 1;
>> >}
>> >
>> >The IR before SLSR is (on x86_64):
>> >
>> >   _2 = (long unsigned int) i_1(D);
>> >   _3 = _2 * 80;
>> >   _4 = _3 + 800;
>> >   _6 = a2_5(D) + _4;
>> >   *_6[j_8(D)] = 1;
>> >   _10 = j_8(D) + 1;
>> >   *_6[_10] = 1;
>> >   _12 = _3 + 1600;
>> >   _13 = a2_5(D) + _12;
>> >   *_13[j_8(D)] = 1;
>> >
>> >The base_expr for the 1st and 2nd memory reference are the same, i.e.
>> >_6, while the base_expr for a2[i+20][j] is _13.
>> >
>> >_13 is essentially (_6 + 800), so all of the three memory references
>> >essentially share the same base address.  As their strides are also the
>> >
>> >same (MULT_EXPR (j, 4)), the three references can all be lowered to
>> >MEM_REFs.  What this patch does is to use the tree affine tools to help
>> >
>> >recognize the underlying base address expression; as it requires
>> >looking
>> >into the definitions of SSA_NAMEs, get_addr_base_and_unit_offset ()
>> >won't help here.
>> >
>> >Bill has helped me exploit other ways of achieving this in SLSR, but so
>> >
>> >far we think this is the best way to proceed.  The use of tree affine
>> >routines has been restricted to CAND_REFs only and there is the
>> >aforementioned cache facility to help reduce the overhead.
>> >
>> >Thanks,
>> >Yufeng
>> >
>> >P.S. some more details what the patch does:
>> >
>> >The CAND_REF for the three memory references are:
>> >
>> >  6  [2] *_6[j_8(D)] = 1;
>> >      REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>> >      basis: 0  dependent: 8  sibling: 0
>> >      next-interp: 0  dead-savings: 0
>> >
>> >   8  [2] *_6[_10] = 1;
>> >      REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
>> >      basis: 6  dependent: 11  sibling: 0
>> >      next-interp: 0  dead-savings: 0
>> >
>> >  11  [2] *_13[j_8(D)] = 1;
>> >      REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>> >      basis: 8  dependent: 0  sibling: 0
>> >      next-interp: 0  dead-savings: 0
>> >
>> >Before the patch, the strength reduction candidate chains for the three
>> >
>> >CAND_REFs are:
>> >
>> >   _6 -> 6 -> 8
>> >   _13 -> 11
>> >
>> >i.e. SLSR recognizes the first two references share the same basis,
>> >while the last one is on it own.
>> >
>> >With the patch, an extra candidate chain can be recognized:
>> >
>> >   a2_5(D) + (sizetype) i_1(D) * 80 -> 6 -> 11 -> 8
>> >
>> >i.e. all of the three references are found to have the same basis
>> >(a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded
>> >_6
>> >or _13, with the immediate offset removed.  The pass is now able to
>> >lower all of the three references, instead of the first two only, to
>> >MEM_REFs.
>>
>> Ok, so slsr handles arbitrary complex bases and figures out common components? If so, then why not just use get_inner_reference? After all slsr does not use tree-affine as representation for bases (which it could?)
>
> I think that's overstating SLSR's current capabilities a bit. :)  We do
> use get_inner_reference to come up with the base expression for
> reference candidates (based on some of your suggestions a couple of
> years back).  However, in the case of multiple levels of array
> references, we miss opportunities because get_inner_reference stops at
> an SSA name that could be further expanded by following its definition
> back to a more fundamental base expression.

Using tree-affine.c to_affine_comb / affine_comb_to_tree has exactly the
same problem.

> Part of the issue here is that reference candidates are basis for a more
> specific optimization than the mult and add candidates.  The latter have
> a more general framework for building up a recording of simple affine
> expressions that can be strength-reduced.  Ultimately we ought to be
> able to do something similar for reference candidates, building up
> simple affine expressions from base expressions, so that everything is
> done in a forward order and the tree-affine interfaces aren't needed.
> But that will take some more fundamental design changes, and since this
> provides some good improvements for important cases, I feel it's
> reasonable to get this into the release.

But I fail to see what is special about doing the dance to affine and
then back to trees just to drop the constant offset, which would be
done by get_inner_reference as well, and cheaper, if you just ignore
bitpos.

?!
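
The arithmetic behind "drop the constant offset" can be checked with a
small standalone C sketch (plain C, not GCC code; the function names are
hypothetical).  With rows of 20 ints, the row bases of a2[i+10] and
a2[i+20] differ by ten rows regardless of i, which is 800 bytes when int
is 4 bytes wide, matching the +800 / +1600 in the quoted dump:

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of row `row` from the start of an array of int[20] rows.  */
size_t row_byte_offset (size_t row)
{
  return row * sizeof (int[20]);
}

/* Byte distance between the row bases rows[i + 20] and rows[i + 10].
   The caller must pass an array with at least i + 21 rows.  */
ptrdiff_t base_delta (int (*rows)[20], int i)
{
  char *lo = (char *) rows[i + 10];
  char *hi = (char *) rows[i + 20];
  return hi - lo;
}
```

The delta does not depend on i, which is why the two references can
share one base once the constant part is set aside.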

Richard.

> Thanks,
> Bill
>
>>
>> Richard.
>>
>>
>


* Re: [PING] [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-12-04 10:26                                 ` Richard Biener
@ 2013-12-04 10:30                                   ` Richard Biener
  2013-12-04 11:32                                     ` Yufeng Zhang
  2013-12-04 13:14                                     ` Bill Schmidt
  2013-12-04 13:08                                   ` Bill Schmidt
  1 sibling, 2 replies; 34+ messages in thread
From: Richard Biener @ 2013-12-04 10:30 UTC (permalink / raw)
  To: Bill Schmidt; +Cc: Yufeng Zhang, Jeff Law, gcc-patches

On Wed, Dec 4, 2013 at 11:26 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Tue, Dec 3, 2013 at 11:04 PM, Bill Schmidt
> <wschmidt@linux.vnet.ibm.com> wrote:
>> On Tue, 2013-12-03 at 21:35 +0100, Richard Biener wrote:
>>> Yufeng Zhang <Yufeng.Zhang@arm.com> wrote:
>>> >On 12/03/13 14:20, Richard Biener wrote:
>>> >> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang<Yufeng.Zhang@arm.com>
>>> >wrote:
>>> >>> On 12/03/13 06:48, Jeff Law wrote:
>>> >>>>
>>> >>>> On 12/02/13 08:47, Yufeng Zhang wrote:
>>> >>>>>
>>> >>>>> Ping~
>>> >>>>>
>>> >>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
>>> >>>>
>>> >>>>
>>> >>>>>
>>> >>>>> Thanks,
>>> >>>>> Yufeng
>>> >>>>>
>>> >>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
>>> >>>>>>
>>> >>>>>> On 11/26/13 12:45, Richard Biener wrote:
>>> >>>>>>>
>>> >>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
>>> >>>>>>> Zhang<Yufeng.Zhang@arm.com>     wrote:
>>> >>>>>>>>
>>> >>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
>>> >>>>>>>>>
>>> >>>>>>>>> The second version of your original patch is ok with me with
>>> >the
>>> >>>>>>>>> following changes.  Sorry for the little side adventure into
>>> >the
>>> >>>>>>>>> next-interp logic; in the end that's going to hurt more than
>>> >it
>>> >>>>>>>>> helps in
>>> >>>>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks
>>> >also for
>>> >>>>>>>>> cleaning up this version to be less intrusive to common
>>> >interfaces; I
>>> >>>>>>>>> appreciate it.
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> Thanks a lot for the review.  I've attached an updated patch
>>> >with the
>>> >>>>>>>> suggested changes incorporated.
>>> >>>>>>>>
>>> >>>>>>>> For the next-interp adventure, I was quite happy to do the
>>> >>>>>>>> experiment; it's
>>> >>>>>>>> a good chance of gaining insight into the pass.  Many thanks
>>> >for
>>> >>>>>>>> your prompt
>>> >>>>>>>> replies and patience in guiding!
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
>>> >>>>>>>>> approval,
>>> >>>>>>>>> as I'm not a maintainer.
>>> >>>>
>>> >>>> First a note, I need to check on voting for Bill as the slsr
>>> >maintainer
>>> >>>> from the steering committee.   Voting was in progress just before
>>> >the
>>> >>>> close of stage1 development so I haven't tallied the results :-)
>>> >>>
>>> >>>
>>> >>> Looking forward to some good news! :)
>>> >>>
>>> >>>
>>> >>>>>>
>>> >>>>>> Yes, you are right about the non-trivial 'base' tree are rarely
>>> >shared.
>>> >>>>>>      The cached is introduced mainly because get_alternative_base
>>> >() may
>>> >>>>>> be
>>> >>>>>> called twice on the same 'base' tree, once in the
>>> >>>>>> find_basis_for_candidate () for look-up and the other time in
>>> >>>>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm
>>> >happy
>>> >>>>>> to leave out the cache if you think the benefit is trivial.
>>> >>>>
>>> >>>> Without some sense of how expensive the lookups are vs how often
>>> >the
>>> >>>> cache hits it's awful hard to know if the cache is worth it.
>>> >>>>
>>> >>>> I'd say take it out unless you have some sense it's really saving
>>> >time.
>>> >>>>     It's a pretty minor implementation detail either way.
>>> >>>
>>> >>>
>>> >>> I think the affine tree routines are generally expensive; it is
>>> >worth having
>>> >>> a cache to avoid calling them too many times.  I run the slsr-*.c
>>> >tests
>>> >>> under gcc.dg/tree-ssa/ and find out that the cache hit rates range
>>> >from
>>> >>> 55.6% to 90%, with 73.5% as the average.  The samples may not well
>>> >represent
>>> >>> the real world scenario, but they do show the fact that the 'base'
>>> >tree can
>>> >>> be shared to some extent.  So I'd like to have the cache in the
>>> >patch.
>>> >>>
>>> >>>
>>> >>>>
>>> >>>>>>
>>> >>>>>>> +/* { dg-do compile } */
>>> >>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
>>> >>>>>>> +
>>> >>>>>>> +typedef int arr_2[50][50];
>>> >>>>>>> +
>>> >>>>>>> +void foo (arr_2 a2, int v1)
>>> >>>>>>> +{
>>> >>>>>>> +  int i, j;
>>> >>>>>>> +
>>> >>>>>>> +  i = v1 + 5;
>>> >>>>>>> +  j = i;
>>> >>>>>>> +  a2 [i-10] [j] = 2;
>>> >>>>>>> +  a2 [i] [j++] = i;
>>> >>>>>>> +  a2 [i+20] [j++] = i;
>>> >>>>>>> +  a2 [i-3] [i-1] += 1;
>>> >>>>>>> +  return;
>>> >>>>>>> +}
>>> >>>>>>> +
>>> >>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
>>> >>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
>>> >>>>>>>
>>> >>>>>>> scanning for 5 MEMs looks non-sensical.  What transform do
>>> >>>>>>> you expect?  I see other slsr testcases do similar non-sensical
>>> >>>>>>> checking which is bad, too.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> As the slsr optimizes CAND_REF candidates by simply lowering them
>>> >to
>>> >>>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of
>>> >MEM_REFs
>>> >>>>>> is an effective check.  Alternatively, I can add a follow-up
>>> >patch to
>>> >>>>>> add some dumping facility in replace_ref () to print out the
>>> >replacing
>>> >>>>>> actions when -fdump-tree-slsr-details is on.
>>> >>>>
>>> >>>> I think adding some details to the dump and scanning for them would
>>> >be
>>> >>>> better.  That's the only change that is required for this to move
>>> >forward.
>>> >>>
>>> >>>
>>> >>> I've updated to patch to dump more details when
>>> >-fdump-tree-slsr-details is
>>> >>> on.  The tests have also been updated to scan for these new dumps
>>> >instead of
>>> >>> MEMs.
>>> >>>
>>> >>>
>>> >>>>
>>> >>>> I suggest doing it quickly.  We're well past stage1 close at this
>>> >point.
>>> >>>
>>> >>>
>>> >>> The bootstrapping on x86_64 is still running.  OK to commit if it
>>> >succeeds?
>>> >>
>>> >> I still don't like it.  It's using the wrong and too expensive tools
>>> >to do
>>> >> stuff.  What kind of bases are we ultimately interested in?  Browsing
>>> >> the code it looks like we're having
>>> >>
>>> >>    /* Base expression for the chain of candidates:  often, but not
>>> >>       always, an SSA name.  */
>>> >>    tree base_expr;
>>> >>
>>> >> which isn't really too informative but I suppose they are all
>>> >> kind-of-gimple_val()s?  That said, I wonder if you can simply
>>> >> use get_addr_base_and_unit_offset in place of get_alternative_base
>>> >(),
>>> >> ignoring the returned offset.
>>> >
>>> >'base_expr' is essentially the base address of a handled_component_p,
>>> >e.g. ARRAY_REF, COMPONENT_REF, etc.  In most case, it is the address of
>>> >
>>> >the object returned by get_inner_reference ().
>>> >
>>> >Given a test case like the following:
>>> >
>>> >typedef int arr_2[20][20];
>>> >
>>> >void foo (arr_2 a2, int i, int j)
>>> >{
>>> >   a2[i+10][j] = 1;
>>> >   a2[i+10][j+1] = 1;
>>> >   a2[i+20][j] = 1;
>>> >}
>>> >
>>> >The IR before SLSR is (on x86_64):
>>> >
>>> >   _2 = (long unsigned int) i_1(D);
>>> >   _3 = _2 * 80;
>>> >   _4 = _3 + 800;
>>> >   _6 = a2_5(D) + _4;
>>> >   *_6[j_8(D)] = 1;
>>> >   _10 = j_8(D) + 1;
>>> >   *_6[_10] = 1;
>>> >   _12 = _3 + 1600;
>>> >   _13 = a2_5(D) + _12;
>>> >   *_13[j_8(D)] = 1;
>>> >
>>> >The base_expr for the 1st and 2nd memory reference are the same, i.e.
>>> >_6, while the base_expr for a2[i+20][j] is _13.
>>> >
>>> >_13 is essentially (_6 + 800), so all of the three memory references
>>> >essentially share the same base address.  As their strides are also the
>>> >
>>> >same (MULT_EXPR (j, 4)), the three references can all be lowered to
>>> >MEM_REFs.  What this patch does is to use the tree affine tools to help
>>> >
>>> >recognize the underlying base address expression; as it requires
>>> >looking
>>> >into the definitions of SSA_NAMEs, get_addr_base_and_unit_offset ()
>>> >won't help here.
>>> >
>>> >Bill has helped me exploit other ways of achieving this in SLSR, but so
>>> >
>>> >far we think this is the best way to proceed.  The use of tree affine
>>> >routines has been restricted to CAND_REFs only and there is the
>>> >aforementioned cache facility to help reduce the overhead.
>>> >
>>> >Thanks,
>>> >Yufeng
>>> >
>>> >P.S. some more details what the patch does:
>>> >
>>> >The CAND_REF for the three memory references are:
>>> >
>>> >  6  [2] *_6[j_8(D)] = 1;
>>> >      REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>>> >      basis: 0  dependent: 8  sibling: 0
>>> >      next-interp: 0  dead-savings: 0
>>> >
>>> >   8  [2] *_6[_10] = 1;
>>> >      REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
>>> >      basis: 6  dependent: 11  sibling: 0
>>> >      next-interp: 0  dead-savings: 0
>>> >
>>> >  11  [2] *_13[j_8(D)] = 1;
>>> >      REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>>> >      basis: 8  dependent: 0  sibling: 0
>>> >      next-interp: 0  dead-savings: 0
>>> >
>>> >Before the patch, the strength reduction candidate chains for the three
>>> >
>>> >CAND_REFs are:
>>> >
>>> >   _6 -> 6 -> 8
>>> >   _13 -> 11
>>> >
>>> >i.e. SLSR recognizes the first two references share the same basis,
>>> >while the last one is on it own.
>>> >
>>> >With the patch, an extra candidate chain can be recognized:
>>> >
>>> >   a2_5(D) + (sizetype) i_1(D) * 80 -> 6 -> 11 -> 8
>>> >
>>> >i.e. all of the three references are found to have the same basis
>>> >(a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded
>>> >_6
>>> >or _13, with the immediate offset removed.  The pass is now able to
>>> >lower all of the three references, instead of the first two only, to
>>> >MEM_REFs.
>>>
>>> Ok, so slsr handles arbitrary complex bases and figures out common components? If so, then why not just use get_inner_reference? After all slsr does not use tree-affine as representation for bases (which it could?)
>>
>> I think that's overstating SLSR's current capabilities a bit. :)  We do
>> use get_inner_reference to come up with the base expression for
>> reference candidates (based on some of your suggestions a couple of
>> years back).  However, in the case of multiple levels of array
>> references, we miss opportunities because get_inner_reference stops at
>> an SSA name that could be further expanded by following its definition
>> back to a more fundamental base expression.
>
> Using tree-affine.c to_affine_comb / affine_comb_to_tree has exactly the
> same problem.

Oh, you're using affine combination expansion ... which is even more
expensive.  So why isn't that then done for all ref candidates?  That is,
why do two different things, get_inner_reference _and_ affine-combination
dances?  And why build back trees from that instead of storing the
affine combination?
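
As a rough illustration of what "storing the affine combination" could
look like, here is a toy model in plain C.  It is only a sketch under
simplifying assumptions (one base, one index, byte offsets), it does not
use the tree-affine.c interfaces, and every name in it is hypothetical:

```c
#include <assert.h>
#include <string.h>

/* Toy affine value: base + coeff * index + offset, in bytes.  */
struct affine
{
  const char *base;   /* symbolic base pointer, or NULL if none yet */
  const char *index;  /* symbolic index variable */
  long coeff;         /* multiplier applied to the index */
  long offset;        /* constant byte offset */
};

/* Models a statement such as _3 = i * 80.  */
struct affine scaled_index (const char *index, long coeff)
{
  struct affine a = { NULL, index, coeff, 0 };
  return a;
}

/* Models a statement such as _4 = _3 + 800.  */
struct affine plus_const (struct affine a, long c)
{
  a.offset += c;
  return a;
}

/* Models a statement such as _6 = a2 + _4.  */
struct affine with_base (struct affine a, const char *base)
{
  a.base = base;
  return a;
}

/* Compare two names, treating NULL as equal only to NULL.  */
static int same_name (const char *x, const char *y)
{
  if (x == NULL || y == NULL)
    return x == y;
  return strcmp (x, y) == 0;
}

/* Two values share a base when they agree on everything except the
   constant offset.  */
int same_base_ignoring_offset (const struct affine *x, const struct affine *y)
{
  return same_name (x->base, y->base)
         && same_name (x->index, y->index)
         && x->coeff == y->coeff;
}
```

Building _6 and _13 from the quoted dump this way gives two values that
compare equal under same_base_ignoring_offset and whose offsets differ
by 800, without converting anything back to a tree.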

I'll bet we come back with compile-time issues after this patch
goes in.  I'll count on you two to fix them then.

Richard.

>> Part of the issue here is that reference candidates are basis for a more
>> specific optimization than the mult and add candidates.  The latter have
>> a more general framework for building up a recording of simple affine
>> expressions that can be strength-reduced.  Ultimately we ought to be
>> able to do something similar for reference candidates, building up
>> simple affine expressions from base expressions, so that everything is
>> done in a forward order and the tree-affine interfaces aren't needed.
>> But that will take some more fundamental design changes, and since this
>> provides some good improvements for important cases, I feel it's
>> reasonable to get this into the release.
>
> But I fail to see what is special about doing the dance to affine and
> then back to trees just to drop the constant offset which would be
> done by get_inner_reference as well and cheaper if you just ignore
> bitpos.
>
> ?!
>
> Richard.
>
>> Thanks,
>> Bill
>>
>>>
>>> Richard.
>>>
>>>
>>


* Re: [PING] [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-12-04 10:30                                   ` Richard Biener
@ 2013-12-04 11:32                                     ` Yufeng Zhang
  2013-12-04 13:24                                       ` Bill Schmidt
  2013-12-04 13:14                                     ` Bill Schmidt
  1 sibling, 1 reply; 34+ messages in thread
From: Yufeng Zhang @ 2013-12-04 11:32 UTC (permalink / raw)
  To: Richard Biener; +Cc: Bill Schmidt, Jeff Law, gcc-patches

On 12/04/13 10:30, Richard Biener wrote:
> On Wed, Dec 4, 2013 at 11:26 AM, Richard Biener
> <richard.guenther@gmail.com>  wrote:
>> On Tue, Dec 3, 2013 at 11:04 PM, Bill Schmidt
>> <wschmidt@linux.vnet.ibm.com>  wrote:
>>> On Tue, 2013-12-03 at 21:35 +0100, Richard Biener wrote:
>>>> Yufeng Zhang<Yufeng.Zhang@arm.com>  wrote:
>>>>> On 12/03/13 14:20, Richard Biener wrote:
>>>>>> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang<Yufeng.Zhang@arm.com>
>>>>> wrote:
>>>>>>> On 12/03/13 06:48, Jeff Law wrote:
>>>>>>>>
>>>>>>>> On 12/02/13 08:47, Yufeng Zhang wrote:
>>>>>>>>>
>>>>>>>>> Ping~
>>>>>>>>>
>>>>>>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Yufeng
>>>>>>>>>
>>>>>>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
>>>>>>>>>>
>>>>>>>>>> On 11/26/13 12:45, Richard Biener wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
>>>>>>>>>>> Zhang<Yufeng.Zhang@arm.com>      wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> The second version of your original patch is ok with me with
>>>>> the
>>>>>>>>>>>>> following changes.  Sorry for the little side adventure into
>>>>> the
>>>>>>>>>>>>> next-interp logic; in the end that's going to hurt more than
>>>>> it
>>>>>>>>>>>>> helps in
>>>>>>>>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks
>>>>> also for
>>>>>>>>>>>>> cleaning up this version to be less intrusive to common
>>>>> interfaces; I
>>>>>>>>>>>>> appreciate it.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks a lot for the review.  I've attached an updated patch
>>>>> with the
>>>>>>>>>>>> suggested changes incorporated.
>>>>>>>>>>>>
>>>>>>>>>>>> For the next-interp adventure, I was quite happy to do the
>>>>>>>>>>>> experiment; it's
>>>>>>>>>>>> a good chance of gaining insight into the pass.  Many thanks
>>>>> for
>>>>>>>>>>>> your prompt
>>>>>>>>>>>> replies and patience in guiding!
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
>>>>>>>>>>>>> approval,
>>>>>>>>>>>>> as I'm not a maintainer.
>>>>>>>>
>>>>>>>> First a note, I need to check on voting for Bill as the slsr
>>>>> maintainer
>>>>>>>> from the steering committee.   Voting was in progress just before
>>>>> the
>>>>>>>> close of stage1 development so I haven't tallied the results :-)
>>>>>>>
>>>>>>>
>>>>>>> Looking forward to some good news! :)
>>>>>>>
>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yes, you are right about the non-trivial 'base' tree are rarely
>>>>> shared.
>>>>>>>>>>       The cached is introduced mainly because get_alternative_base
>>>>> () may
>>>>>>>>>> be
>>>>>>>>>> called twice on the same 'base' tree, once in the
>>>>>>>>>> find_basis_for_candidate () for look-up and the other time in
>>>>>>>>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm
>>>>> happy
>>>>>>>>>> to leave out the cache if you think the benefit is trivial.
>>>>>>>>
>>>>>>>> Without some sense of how expensive the lookups are vs how often
>>>>> the
>>>>>>>> cache hits it's awful hard to know if the cache is worth it.
>>>>>>>>
>>>>>>>> I'd say take it out unless you have some sense it's really saving
>>>>> time.
>>>>>>>>      It's a pretty minor implementation detail either way.
>>>>>>>
>>>>>>>
>>>>>>> I think the affine tree routines are generally expensive; it is
>>>>> worth having
>>>>>>> a cache to avoid calling them too many times.  I run the slsr-*.c
>>>>> tests
>>>>>>> under gcc.dg/tree-ssa/ and find out that the cache hit rates range
>>>>> from
>>>>>>> 55.6% to 90%, with 73.5% as the average.  The samples may not well
>>>>> represent
>>>>>>> the real world scenario, but they do show the fact that the 'base'
>>>>> tree can
>>>>>>> be shared to some extent.  So I'd like to have the cache in the
>>>>> patch.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> +/* { dg-do compile } */
>>>>>>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
>>>>>>>>>>> +
>>>>>>>>>>> +typedef int arr_2[50][50];
>>>>>>>>>>> +
>>>>>>>>>>> +void foo (arr_2 a2, int v1)
>>>>>>>>>>> +{
>>>>>>>>>>> +  int i, j;
>>>>>>>>>>> +
>>>>>>>>>>> +  i = v1 + 5;
>>>>>>>>>>> +  j = i;
>>>>>>>>>>> +  a2 [i-10] [j] = 2;
>>>>>>>>>>> +  a2 [i] [j++] = i;
>>>>>>>>>>> +  a2 [i+20] [j++] = i;
>>>>>>>>>>> +  a2 [i-3] [i-1] += 1;
>>>>>>>>>>> +  return;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
>>>>>>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
>>>>>>>>>>>
>>>>>>>>>>> scanning for 5 MEMs looks non-sensical.  What transform do
>>>>>>>>>>> you expect?  I see other slsr testcases do similar non-sensical
>>>>>>>>>>> checking which is bad, too.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> As the slsr optimizes CAND_REF candidates by simply lowering them
>>>>> to
>>>>>>>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of
>>>>> MEM_REFs
>>>>>>>>>> is an effective check.  Alternatively, I can add a follow-up
>>>>> patch to
>>>>>>>>>> add some dumping facility in replace_ref () to print out the
>>>>> replacing
>>>>>>>>>> actions when -fdump-tree-slsr-details is on.
>>>>>>>>
>>>>>>>> I think adding some details to the dump and scanning for them would
>>>>> be
>>>>>>>> better.  That's the only change that is required for this to move
>>>>> forward.
>>>>>>>
>>>>>>>
>>>>>>> I've updated to patch to dump more details when
>>>>> -fdump-tree-slsr-details is
>>>>>>> on.  The tests have also been updated to scan for these new dumps
>>>>> instead of
>>>>>>> MEMs.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> I suggest doing it quickly.  We're well past stage1 close at this
>>>>> point.
>>>>>>>
>>>>>>>
>>>>>>> The bootstrapping on x86_64 is still running.  OK to commit if it
>>>>> succeeds?
>>>>>>
>>>>>> I still don't like it.  It's using the wrong and too expensive tools
>>>>> to do
>>>>>> stuff.  What kind of bases are we ultimately interested in?  Browsing
>>>>>> the code it looks like we're having
>>>>>>
>>>>>>     /* Base expression for the chain of candidates:  often, but not
>>>>>>        always, an SSA name.  */
>>>>>>     tree base_expr;
>>>>>>
>>>>>> which isn't really too informative but I suppose they are all
>>>>>> kind-of-gimple_val()s?  That said, I wonder if you can simply
>>>>>> use get_addr_base_and_unit_offset in place of get_alternative_base
>>>>> (),
>>>>>> ignoring the returned offset.
>>>>>
>>>>> 'base_expr' is essentially the base address of a handled_component_p,
>>>>> e.g. ARRAY_REF, COMPONENT_REF, etc.  In most case, it is the address of
>>>>>
>>>>> the object returned by get_inner_reference ().
>>>>>
>>>>> Given a test case like the following:
>>>>>
>>>>> typedef int arr_2[20][20];
>>>>>
>>>>> void foo (arr_2 a2, int i, int j)
>>>>> {
>>>>>    a2[i+10][j] = 1;
>>>>>    a2[i+10][j+1] = 1;
>>>>>    a2[i+20][j] = 1;
>>>>> }
>>>>>
>>>>> The IR before SLSR is (on x86_64):
>>>>>
>>>>>    _2 = (long unsigned int) i_1(D);
>>>>>    _3 = _2 * 80;
>>>>>    _4 = _3 + 800;
>>>>>    _6 = a2_5(D) + _4;
>>>>>    *_6[j_8(D)] = 1;
>>>>>    _10 = j_8(D) + 1;
>>>>>    *_6[_10] = 1;
>>>>>    _12 = _3 + 1600;
>>>>>    _13 = a2_5(D) + _12;
>>>>>    *_13[j_8(D)] = 1;
>>>>>
>>>>> The base_expr for the 1st and 2nd memory reference are the same, i.e.
>>>>> _6, while the base_expr for a2[i+20][j] is _13.
>>>>>
>>>>> _13 is essentially (_6 + 800), so all of the three memory references
>>>>> essentially share the same base address.  As their strides are also the
>>>>>
>>>>> same (MULT_EXPR (j, 4)), the three references can all be lowered to
>>>>> MEM_REFs.  What this patch does is to use the tree affine tools to help
>>>>>
>>>>> recognize the underlying base address expression; as it requires
>>>>> looking
>>>>> into the definitions of SSA_NAMEs, get_addr_base_and_unit_offset ()
>>>>> won't help here.
>>>>>
>>>>> Bill has helped me exploit other ways of achieving this in SLSR, but so
>>>>>
>>>>> far we think this is the best way to proceed.  The use of tree affine
>>>>> routines has been restricted to CAND_REFs only and there is the
>>>>> aforementioned cache facility to help reduce the overhead.
>>>>>
>>>>> Thanks,
>>>>> Yufeng
>>>>>
>>>>> P.S. some more details what the patch does:
>>>>>
>>>>> The CAND_REF for the three memory references are:
>>>>>
>>>>>   6  [2] *_6[j_8(D)] = 1;
>>>>>       REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>>>>>       basis: 0  dependent: 8  sibling: 0
>>>>>       next-interp: 0  dead-savings: 0
>>>>>
>>>>>    8  [2] *_6[_10] = 1;
>>>>>       REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
>>>>>       basis: 6  dependent: 11  sibling: 0
>>>>>       next-interp: 0  dead-savings: 0
>>>>>
>>>>>   11  [2] *_13[j_8(D)] = 1;
>>>>>       REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>>>>>       basis: 8  dependent: 0  sibling: 0
>>>>>       next-interp: 0  dead-savings: 0
>>>>>
>>>>> Before the patch, the strength reduction candidate chains for the three
>>>>>
>>>>> CAND_REFs are:
>>>>>
>>>>>    _6 ->  6 ->  8
>>>>>    _13 ->  11
>>>>>
>>>>> i.e. SLSR recognizes the first two references share the same basis,
>>>>> while the last one is on it own.
>>>>>
>>>>> With the patch, an extra candidate chain can be recognized:
>>>>>
>>>>>    a2_5(D) + (sizetype) i_1(D) * 80 ->  6 ->  11 ->  8
>>>>>
>>>>> i.e. all of the three references are found to have the same basis
>>>>> (a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded
>>>>> _6
>>>>> or _13, with the immediate offset removed.  The pass is now able to
>>>>> lower all of the three references, instead of the first two only, to
>>>>> MEM_REFs.
>>>>
>>>> Ok, so slsr handles arbitrary complex bases and figures out common components? If so, then why not just use get_inner_reference? After all slsr does not use tree-affine as representation for bases (which it could?)
>>>
>>> I think that's overstating SLSR's current capabilities a bit. :)  We do
>>> use get_inner_reference to come up with the base expression for
>>> reference candidates (based on some of your suggestions a couple of
>>> years back).  However, in the case of multiple levels of array
>>> references, we miss opportunities because get_inner_reference stops at
>>> an SSA name that could be further expanded by following its definition
>>> back to a more fundamental base expression.
>>
>> Using tree-affine.c to_affine_comb / affine_comb_to_tree has exactly the
>> same problem.
>
> Oh, you're using affine combination expansion ... which is even more
> expensive.  So why isn't that then done for all ref candidates?  That is,
> why do two different things, get_inner_reference _and_ affine-combination
> dances.

The affine-combination expansion is only invoked at the point where
get_inner_reference stops (an SSA_NAME), rather than on the whole address.
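
To make that concrete, here is a toy sketch in plain C of expanding an SSA
name into an affine combination and then ignoring the constant offset.  It
does not use GCC's tree-affine API; the names (struct def, struct aff,
expand, same_base) and the table of definitions are all hypothetical, and
the cast in "_2 = (long unsigned int) i_1(D)" is folded into the leaf for
i.  The table mirrors the x86_64 IR quoted earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Toy SSA names; T6 and T13 stand for _6 and _13 in the quoted IR.  */
enum { A2, I, T3, T4, T6, T12, T13, NNAMES };

enum kind { LEAF, ADD_NAMES, ADD_CONST, MUL_CONST };

/* One defining statement per name (hypothetical representation).  */
struct def { enum kind kind; int op0, op1; long cst; };

/* Affine combination: sum of coef[k] * leaf_k, plus a constant offset.  */
struct aff { long coef[NNAMES]; long offset; };

const struct def defs[NNAMES] = {
  [A2]  = { LEAF,      0,  0,   0    },
  [I]   = { LEAF,      0,  0,   0    },
  [T3]  = { MUL_CONST, I,  0,   80   },  /* _3  = i * 80     */
  [T4]  = { ADD_CONST, T3, 0,   800  },  /* _4  = _3 + 800   */
  [T6]  = { ADD_NAMES, A2, T4,  0    },  /* _6  = a2 + _4    */
  [T12] = { ADD_CONST, T3, 0,   1600 },  /* _12 = _3 + 1600  */
  [T13] = { ADD_NAMES, A2, T12, 0    },  /* _13 = a2 + _12   */
};

/* Follow definitions back to the leaves, accumulating coefficients.  */
struct aff expand (int name)
{
  struct aff r = { { 0 }, 0 };
  int k;

  assert (name >= 0 && name < NNAMES);
  switch (defs[name].kind)
    {
    case LEAF:
      r.coef[name] = 1;
      break;
    case ADD_NAMES:
      {
        struct aff a = expand (defs[name].op0);
        struct aff b = expand (defs[name].op1);
        for (k = 0; k < NNAMES; k++)
          r.coef[k] = a.coef[k] + b.coef[k];
        r.offset = a.offset + b.offset;
        break;
      }
    case ADD_CONST:
      r = expand (defs[name].op0);
      r.offset += defs[name].cst;
      break;
    case MUL_CONST:
      r = expand (defs[name].op0);
      for (k = 0; k < NNAMES; k++)
        r.coef[k] *= defs[name].cst;
      r.offset *= defs[name].cst;
      break;
    }
  return r;
}

/* Return 1 if N1 and N2 differ only by a constant, stored in *DELTA.  */
int same_base (int n1, int n2, long *delta)
{
  struct aff a = expand (n1), b = expand (n2);
  int k;

  for (k = 0; k < NNAMES; k++)
    if (a.coef[k] != b.coef[k])
      return 0;
  *delta = b.offset - a.offset;
  return 1;
}
```

In this toy model, same_base (T6, T13, &delta) reports a shared base with
delta equal to 800, which is the "_13 is essentially (_6 + 800)"
observation from the earlier mail.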

> And why build back trees from that instead of storing the
> affine combination.

I think we can store the affine combination instead, but it would need 
changes to the infrastructure in slsr.

>
> I'll bet we come back with compile-time issues after this patch
> went in.  I'll count on you two to fix them then.

I'll do some timing this week; if the effect on compilation time is not
acceptable, I'll revert the patch.

Yufeng

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PING] [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-12-04 10:26                                 ` Richard Biener
  2013-12-04 10:30                                   ` Richard Biener
@ 2013-12-04 13:08                                   ` Bill Schmidt
  2013-12-05 12:02                                     ` Yufeng Zhang
  1 sibling, 1 reply; 34+ messages in thread
From: Bill Schmidt @ 2013-12-04 13:08 UTC (permalink / raw)
  To: Richard Biener; +Cc: Yufeng Zhang, Jeff Law, gcc-patches

On Wed, 2013-12-04 at 11:26 +0100, Richard Biener wrote:
> On Tue, Dec 3, 2013 at 11:04 PM, Bill Schmidt
> <wschmidt@linux.vnet.ibm.com> wrote:
> > On Tue, 2013-12-03 at 21:35 +0100, Richard Biener wrote:
> >> Yufeng Zhang <Yufeng.Zhang@arm.com> wrote:
> >> >On 12/03/13 14:20, Richard Biener wrote:
> >> >> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang<Yufeng.Zhang@arm.com>
> >> >wrote:
> >> >>> On 12/03/13 06:48, Jeff Law wrote:
> >> >>>>
> >> >>>> On 12/02/13 08:47, Yufeng Zhang wrote:
> >> >>>>>
> >> >>>>> Ping~
> >> >>>>>
> >> >>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
> >> >>>>
> >> >>>>
> >> >>>>>
> >> >>>>> Thanks,
> >> >>>>> Yufeng
> >> >>>>>
> >> >>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
> >> >>>>>>
> >> >>>>>> On 11/26/13 12:45, Richard Biener wrote:
> >> >>>>>>>
> >> >>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
> >> >>>>>>> Zhang<Yufeng.Zhang@arm.com>     wrote:
> >> >>>>>>>>
> >> >>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
> >> >>>>>>>>>
> >> >>>>>>>>> The second version of your original patch is ok with me with
> >> >the
> >> >>>>>>>>> following changes.  Sorry for the little side adventure into
> >> >the
> >> >>>>>>>>> next-interp logic; in the end that's going to hurt more than
> >> >it
> >> >>>>>>>>> helps in
> >> >>>>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks
> >> >also for
> >> >>>>>>>>> cleaning up this version to be less intrusive to common
> >> >interfaces; I
> >> >>>>>>>>> appreciate it.
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> Thanks a lot for the review.  I've attached an updated patch
> >> >with the
> >> >>>>>>>> suggested changes incorporated.
> >> >>>>>>>>
> >> >>>>>>>> For the next-interp adventure, I was quite happy to do the
> >> >>>>>>>> experiment; it's
> >> >>>>>>>> a good chance of gaining insight into the pass.  Many thanks
> >> >for
> >> >>>>>>>> your prompt
> >> >>>>>>>> replies and patience in guiding!
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
> >> >>>>>>>>> approval,
> >> >>>>>>>>> as I'm not a maintainer.
> >> >>>>
> >> >>>> First a note, I need to check on voting for Bill as the slsr
> >> >maintainer
> >> >>>> from the steering committee.   Voting was in progress just before
> >> >the
> >> >>>> close of stage1 development so I haven't tallied the results :-)
> >> >>>
> >> >>>
> >> >>> Looking forward to some good news! :)
> >> >>>
> >> >>>
> >> >>>>>>
> >> >>>>>> Yes, you are right about the non-trivial 'base' tree are rarely
> >> >shared.
> >> >>>>>>      The cached is introduced mainly because get_alternative_base
> >> >() may
> >> >>>>>> be
> >> >>>>>> called twice on the same 'base' tree, once in the
> >> >>>>>> find_basis_for_candidate () for look-up and the other time in
> >> >>>>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm
> >> >happy
> >> >>>>>> to leave out the cache if you think the benefit is trivial.
> >> >>>>
> >> >>>> Without some sense of how expensive the lookups are vs how often
> >> >the
> >> >>>> cache hits it's awful hard to know if the cache is worth it.
> >> >>>>
> >> >>>> I'd say take it out unless you have some sense it's really saving
> >> >time.
> >> >>>>     It's a pretty minor implementation detail either way.
> >> >>>
> >> >>>
> >> >>> I think the affine tree routines are generally expensive; it is
> >> >worth having
> >> >>> a cache to avoid calling them too many times.  I run the slsr-*.c
> >> >tests
> >> >>> under gcc.dg/tree-ssa/ and find out that the cache hit rates range
> >> >from
> >> >>> 55.6% to 90%, with 73.5% as the average.  The samples may not well
> >> >represent
> >> >>> the real world scenario, but they do show the fact that the 'base'
> >> >tree can
> >> >>> be shared to some extent.  So I'd like to have the cache in the
> >> >patch.
> >> >>>
> >> >>>
> >> >>>>
> >> >>>>>>
> >> >>>>>>> +/* { dg-do compile } */
> >> >>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
> >> >>>>>>> +
> >> >>>>>>> +typedef int arr_2[50][50];
> >> >>>>>>> +
> >> >>>>>>> +void foo (arr_2 a2, int v1)
> >> >>>>>>> +{
> >> >>>>>>> +  int i, j;
> >> >>>>>>> +
> >> >>>>>>> +  i = v1 + 5;
> >> >>>>>>> +  j = i;
> >> >>>>>>> +  a2 [i-10] [j] = 2;
> >> >>>>>>> +  a2 [i] [j++] = i;
> >> >>>>>>> +  a2 [i+20] [j++] = i;
> >> >>>>>>> +  a2 [i-3] [i-1] += 1;
> >> >>>>>>> +  return;
> >> >>>>>>> +}
> >> >>>>>>> +
> >> >>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
> >> >>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
> >> >>>>>>>
> >> >>>>>>> scanning for 5 MEMs looks non-sensical.  What transform do
> >> >>>>>>> you expect?  I see other slsr testcases do similar non-sensical
> >> >>>>>>> checking which is bad, too.
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> As the slsr optimizes CAND_REF candidates by simply lowering them
> >> >to
> >> >>>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of
> >> >MEM_REFs
> >> >>>>>> is an effective check.  Alternatively, I can add a follow-up
> >> >patch to
> >> >>>>>> add some dumping facility in replace_ref () to print out the
> >> >replacing
> >> >>>>>> actions when -fdump-tree-slsr-details is on.
> >> >>>>
> >> >>>> I think adding some details to the dump and scanning for them would
> >> >be
> >> >>>> better.  That's the only change that is required for this to move
> >> >forward.
> >> >>>
> >> >>>
> >> >>> I've updated to patch to dump more details when
> >> >-fdump-tree-slsr-details is
> >> >>> on.  The tests have also been updated to scan for these new dumps
> >> >instead of
> >> >>> MEMs.
> >> >>>
> >> >>>
> >> >>>>
> >> >>>> I suggest doing it quickly.  We're well past stage1 close at this
> >> >point.
> >> >>>
> >> >>>
> >> >>> The bootstrapping on x86_64 is still running.  OK to commit if it
> >> >succeeds?
> >> >>
> >> >> I still don't like it.  It's using the wrong and too expensive tools
> >> >to do
> >> >> stuff.  What kind of bases are we ultimately interested in?  Browsing
> >> >> the code it looks like we're having
> >> >>
> >> >>    /* Base expression for the chain of candidates:  often, but not
> >> >>       always, an SSA name.  */
> >> >>    tree base_expr;
> >> >>
> >> >> which isn't really too informative but I suppose they are all
> >> >> kind-of-gimple_val()s?  That said, I wonder if you can simply
> >> >> use get_addr_base_and_unit_offset in place of get_alternative_base
> >> >(),
> >> >> ignoring the returned offset.
> >> >
> >> >'base_expr' is essentially the base address of a handled_component_p,
> >> >e.g. ARRAY_REF, COMPONENT_REF, etc.  In most case, it is the address of
> >> >
> >> >the object returned by get_inner_reference ().
> >> >
> >> >Given a test case like the following:
> >> >
> >> >typedef int arr_2[20][20];
> >> >
> >> >void foo (arr_2 a2, int i, int j)
> >> >{
> >> >   a2[i+10][j] = 1;
> >> >   a2[i+10][j+1] = 1;
> >> >   a2[i+20][j] = 1;
> >> >}
> >> >
> >> >The IR before SLSR is (on x86_64):
> >> >
> >> >   _2 = (long unsigned int) i_1(D);
> >> >   _3 = _2 * 80;
> >> >   _4 = _3 + 800;
> >> >   _6 = a2_5(D) + _4;
> >> >   *_6[j_8(D)] = 1;
> >> >   _10 = j_8(D) + 1;
> >> >   *_6[_10] = 1;
> >> >   _12 = _3 + 1600;
> >> >   _13 = a2_5(D) + _12;
> >> >   *_13[j_8(D)] = 1;
> >> >
> >> >The base_expr for the 1st and 2nd memory reference are the same, i.e.
> >> >_6, while the base_expr for a2[i+20][j] is _13.
> >> >
> >> >_13 is essentially (_6 + 800), so all of the three memory references
> >> >essentially share the same base address.  As their strides are also the
> >> >
> >> >same (MULT_EXPR (j, 4)), the three references can all be lowered to
> >> >MEM_REFs.  What this patch does is to use the tree affine tools to help
> >> >
> >> >recognize the underlying base address expression; as it requires
> >> >looking
> >> >into the definitions of SSA_NAMEs, get_addr_base_and_unit_offset ()
> >> >won't help here.
> >> >
> >> >Bill has helped me exploit other ways of achieving this in SLSR, but so
> >> >
> >> >far we think this is the best way to proceed.  The use of tree affine
> >> >routines has been restricted to CAND_REFs only and there is the
> >> >aforementioned cache facility to help reduce the overhead.
> >> >
> >> >Thanks,
> >> >Yufeng
> >> >
> >> >P.S. some more details what the patch does:
> >> >
> >> >The CAND_REF for the three memory references are:
> >> >
> >> >  6  [2] *_6[j_8(D)] = 1;
> >> >      REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> >> >      basis: 0  dependent: 8  sibling: 0
> >> >      next-interp: 0  dead-savings: 0
> >> >
> >> >   8  [2] *_6[_10] = 1;
> >> >      REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
> >> >      basis: 6  dependent: 11  sibling: 0
> >> >      next-interp: 0  dead-savings: 0
> >> >
> >> >  11  [2] *_13[j_8(D)] = 1;
> >> >      REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> >> >      basis: 8  dependent: 0  sibling: 0
> >> >      next-interp: 0  dead-savings: 0
> >> >
> >> >Before the patch, the strength reduction candidate chains for the three
> >> >
> >> >CAND_REFs are:
> >> >
> >> >   _6 -> 6 -> 8
> >> >   _13 -> 11
> >> >
> >> >i.e. SLSR recognizes the first two references share the same basis,
> >> >while the last one is on it own.
> >> >
> >> >With the patch, an extra candidate chain can be recognized:
> >> >
> >> >   a2_5(D) + (sizetype) i_1(D) * 80 -> 6 -> 11 -> 8
> >> >
> >> >i.e. all of the three references are found to have the same basis
> >> >(a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded
> >> >_6
> >> >or _13, with the immediate offset removed.  The pass is now able to
> >> >lower all of the three references, instead of the first two only, to
> >> >MEM_REFs.
> >>
> >> Ok, so slsr handles arbitrary complex bases and figures out common components? If so, then why not just use get_inner_reference? After all slsr does not use tree-affine as representation for bases (which it could?)
> >
> > I think that's overstating SLSR's current capabilities a bit. :)  We do
> > use get_inner_reference to come up with the base expression for
> > reference candidates (based on some of your suggestions a couple of
> > years back).  However, in the case of multiple levels of array
> > references, we miss opportunities because get_inner_reference stops at
> > an SSA name that could be further expanded by following its definition
> > back to a more fundamental base expression.
> 
> Using tree-affine.c to_affine_comb / affine_comb_to_tree has exactly the
> same problem.
> 
> > Part of the issue here is that reference candidates are basis for a more
> > specific optimization than the mult and add candidates.  The latter have
> > a more general framework for building up a recording of simple affine
> > expressions that can be strength-reduced.  Ultimately we ought to be
> > able to do something similar for reference candidates, building up
> > simple affine expressions from base expressions, so that everything is
> > done in a forward order and the tree-affine interfaces aren't needed.
> > But that will take some more fundamental design changes, and since this
> > provides some good improvements for important cases, I feel it's
> > reasonable to get this into the release.
> 
> But I fail to see what is special about doing the dance to affine and
> then back to trees just to drop the constant offset which would be
> done by get_inner_reference as well and cheaper if you just ignore
> bitpos.

I'm not sure what you're suggesting he apply get_inner_reference to at
this point.  By the time the affine machinery is invoked, the
memory reference was already expanded with get_inner_reference, and
there was no basis involving the SSA name produced as the base.  The
affine machinery is invoked on that SSA name to see if it is hiding
another base.  There's no additional memory reference to use
get_inner_reference on, just potentially some pointer arithmetic.
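
As a rough illustration of that ordering, the following toy C sketch looks
for a basis under the primary base first and falls back to the alternative
base only when that fails.  It is not the SLSR implementation: struct cand,
find_basis, example_cands and the use of strings for base expressions are
all hypothetical simplifications, and the real pass also applies checks
that are omitted here.  The three entries correspond to candidates 6, 8
and 11 from the dump quoted earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified candidate record.  */
struct cand {
  int num;               /* candidate number as shown in the dump  */
  const char *base;      /* primary base_expr, e.g. "_6"           */
  const char *alt_base;  /* expanded base with offset dropped      */
};

static int same (const char *a, const char *b)
{
  return a != NULL && b != NULL && strcmp (a, b) == 0;
}

/* Return the number of the most recent earlier candidate sharing a base
   with cands[idx], trying the primary base first, or 0 if none.  */
int find_basis (const struct cand *cands, size_t idx)
{
  size_t k;

  assert (cands != NULL);

  /* Step 1: primary base, as produced by get_inner_reference.  */
  for (k = idx; k-- > 0; )
    if (same (cands[k].base, cands[idx].base))
      return cands[k].num;

  /* Step 2: only now consult the alternative (expanded) base.  */
  for (k = idx; k-- > 0; )
    if (same (cands[k].alt_base, cands[idx].alt_base))
      return cands[k].num;

  return 0;
}

const struct cand example_cands[3] = {
  { 6,  "_6",  "a2_5(D) + (sizetype) i_1(D) * 80" },
  { 8,  "_6",  "a2_5(D) + (sizetype) i_1(D) * 80" },
  { 11, "_13", "a2_5(D) + (sizetype) i_1(D) * 80" },
};
```

With these entries, candidate 8 finds candidate 6 through the primary base,
while candidate 11 finds a basis only through the alternative base.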

That said, if we have real compile-time issues, we should hold off on
this patch for this release.

Yufeng, please time some reasonably large benchmarks (some version of
SPECint or similar) and report back here before the patch goes in.

I will respond in a different part of the thread about the real
underlying problem that needs to be solved in a more general way.

Thanks,
Bill

> 
> ?!
> 
> Richard.
> 
> > Thanks,
> > Bill
> >
> >>
> >> Richard.
> >>
> >>
> >
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PING] [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-12-04 10:30                                   ` Richard Biener
  2013-12-04 11:32                                     ` Yufeng Zhang
@ 2013-12-04 13:14                                     ` Bill Schmidt
  2013-12-04 13:28                                       ` Bill Schmidt
  1 sibling, 1 reply; 34+ messages in thread
From: Bill Schmidt @ 2013-12-04 13:14 UTC (permalink / raw)
  To: Richard Biener; +Cc: Yufeng Zhang, Jeff Law, gcc-patches

On Wed, 2013-12-04 at 11:30 +0100, Richard Biener wrote:
> On Wed, Dec 4, 2013 at 11:26 AM, Richard Biener
> <richard.guenther@gmail.com> wrote:
> > On Tue, Dec 3, 2013 at 11:04 PM, Bill Schmidt
> > <wschmidt@linux.vnet.ibm.com> wrote:
> >> On Tue, 2013-12-03 at 21:35 +0100, Richard Biener wrote:
> >>> Yufeng Zhang <Yufeng.Zhang@arm.com> wrote:
> >>> >On 12/03/13 14:20, Richard Biener wrote:
> >>> >> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang<Yufeng.Zhang@arm.com>
> >>> >wrote:
> >>> >>> On 12/03/13 06:48, Jeff Law wrote:
> >>> >>>>
> >>> >>>> On 12/02/13 08:47, Yufeng Zhang wrote:
> >>> >>>>>
> >>> >>>>> Ping~
> >>> >>>>>
> >>> >>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
> >>> >>>>
> >>> >>>>
> >>> >>>>>
> >>> >>>>> Thanks,
> >>> >>>>> Yufeng
> >>> >>>>>
> >>> >>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
> >>> >>>>>>
> >>> >>>>>> On 11/26/13 12:45, Richard Biener wrote:
> >>> >>>>>>>
> >>> >>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
> >>> >>>>>>> Zhang<Yufeng.Zhang@arm.com>     wrote:
> >>> >>>>>>>>
> >>> >>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
> >>> >>>>>>>>>
> >>> >>>>>>>>> The second version of your original patch is ok with me with
> >>> >the
> >>> >>>>>>>>> following changes.  Sorry for the little side adventure into
> >>> >the
> >>> >>>>>>>>> next-interp logic; in the end that's going to hurt more than
> >>> >it
> >>> >>>>>>>>> helps in
> >>> >>>>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks
> >>> >also for
> >>> >>>>>>>>> cleaning up this version to be less intrusive to common
> >>> >interfaces; I
> >>> >>>>>>>>> appreciate it.
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>> Thanks a lot for the review.  I've attached an updated patch
> >>> >with the
> >>> >>>>>>>> suggested changes incorporated.
> >>> >>>>>>>>
> >>> >>>>>>>> For the next-interp adventure, I was quite happy to do the
> >>> >>>>>>>> experiment; it's
> >>> >>>>>>>> a good chance of gaining insight into the pass.  Many thanks
> >>> >for
> >>> >>>>>>>> your prompt
> >>> >>>>>>>> replies and patience in guiding!
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
> >>> >>>>>>>>> approval,
> >>> >>>>>>>>> as I'm not a maintainer.
> >>> >>>>
> >>> >>>> First a note, I need to check on voting for Bill as the slsr
> >>> >maintainer
> >>> >>>> from the steering committee.   Voting was in progress just before
> >>> >the
> >>> >>>> close of stage1 development so I haven't tallied the results :-)
> >>> >>>
> >>> >>>
> >>> >>> Looking forward to some good news! :)
> >>> >>>
> >>> >>>
> >>> >>>>>>
> >>> >>>>>> Yes, you are right about the non-trivial 'base' tree are rarely
> >>> >shared.
> >>> >>>>>>      The cached is introduced mainly because get_alternative_base
> >>> >() may
> >>> >>>>>> be
> >>> >>>>>> called twice on the same 'base' tree, once in the
> >>> >>>>>> find_basis_for_candidate () for look-up and the other time in
> >>> >>>>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm
> >>> >happy
> >>> >>>>>> to leave out the cache if you think the benefit is trivial.
> >>> >>>>
> >>> >>>> Without some sense of how expensive the lookups are vs how often
> >>> >the
> >>> >>>> cache hits it's awful hard to know if the cache is worth it.
> >>> >>>>
> >>> >>>> I'd say take it out unless you have some sense it's really saving
> >>> >time.
> >>> >>>>     It's a pretty minor implementation detail either way.
> >>> >>>
> >>> >>>
> >>> >>> I think the affine tree routines are generally expensive; it is
> >>> >worth having
> >>> >>> a cache to avoid calling them too many times.  I run the slsr-*.c
> >>> >tests
> >>> >>> under gcc.dg/tree-ssa/ and find out that the cache hit rates range
> >>> >from
> >>> >>> 55.6% to 90%, with 73.5% as the average.  The samples may not well
> >>> >represent
> >>> >>> the real world scenario, but they do show the fact that the 'base'
> >>> >tree can
> >>> >>> be shared to some extent.  So I'd like to have the cache in the
> >>> >patch.
> >>> >>>
> >>> >>>
> >>> >>>>
> >>> >>>>>>
> >>> >>>>>>> +/* { dg-do compile } */
> >>> >>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
> >>> >>>>>>> +
> >>> >>>>>>> +typedef int arr_2[50][50];
> >>> >>>>>>> +
> >>> >>>>>>> +void foo (arr_2 a2, int v1)
> >>> >>>>>>> +{
> >>> >>>>>>> +  int i, j;
> >>> >>>>>>> +
> >>> >>>>>>> +  i = v1 + 5;
> >>> >>>>>>> +  j = i;
> >>> >>>>>>> +  a2 [i-10] [j] = 2;
> >>> >>>>>>> +  a2 [i] [j++] = i;
> >>> >>>>>>> +  a2 [i+20] [j++] = i;
> >>> >>>>>>> +  a2 [i-3] [i-1] += 1;
> >>> >>>>>>> +  return;
> >>> >>>>>>> +}
> >>> >>>>>>> +
> >>> >>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
> >>> >>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
> >>> >>>>>>>
> >>> >>>>>>> scanning for 5 MEMs looks non-sensical.  What transform do
> >>> >>>>>>> you expect?  I see other slsr testcases do similar non-sensical
> >>> >>>>>>> checking which is bad, too.
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>> As the slsr optimizes CAND_REF candidates by simply lowering them
> >>> >to
> >>> >>>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of
> >>> >MEM_REFs
> >>> >>>>>> is an effective check.  Alternatively, I can add a follow-up
> >>> >patch to
> >>> >>>>>> add some dumping facility in replace_ref () to print out the
> >>> >replacing
> >>> >>>>>> actions when -fdump-tree-slsr-details is on.
> >>> >>>>
> >>> >>>> I think adding some details to the dump and scanning for them would
> >>> >be
> >>> >>>> better.  That's the only change that is required for this to move
> >>> >forward.
> >>> >>>
> >>> >>>
> >>> >>> I've updated to patch to dump more details when
> >>> >-fdump-tree-slsr-details is
> >>> >>> on.  The tests have also been updated to scan for these new dumps
> >>> >instead of
> >>> >>> MEMs.
> >>> >>>
> >>> >>>
> >>> >>>>
> >>> >>>> I suggest doing it quickly.  We're well past stage1 close at this
> >>> >point.
> >>> >>>
> >>> >>>
> >>> >>> The bootstrapping on x86_64 is still running.  OK to commit if it
> >>> >succeeds?
> >>> >>
> >>> >> I still don't like it.  It's using the wrong and too expensive tools
> >>> >to do
> >>> >> stuff.  What kind of bases are we ultimately interested in?  Browsing
> >>> >> the code it looks like we're having
> >>> >>
> >>> >>    /* Base expression for the chain of candidates:  often, but not
> >>> >>       always, an SSA name.  */
> >>> >>    tree base_expr;
> >>> >>
> >>> >> which isn't really too informative but I suppose they are all
> >>> >> kind-of-gimple_val()s?  That said, I wonder if you can simply
> >>> >> use get_addr_base_and_unit_offset in place of get_alternative_base
> >>> >(),
> >>> >> ignoring the returned offset.
> >>> >
> >>> >'base_expr' is essentially the base address of a handled_component_p,
> >>> >e.g. ARRAY_REF, COMPONENT_REF, etc.  In most case, it is the address of
> >>> >
> >>> >the object returned by get_inner_reference ().
> >>> >
> >>> >Given a test case like the following:
> >>> >
> >>> >typedef int arr_2[20][20];
> >>> >
> >>> >void foo (arr_2 a2, int i, int j)
> >>> >{
> >>> >   a2[i+10][j] = 1;
> >>> >   a2[i+10][j+1] = 1;
> >>> >   a2[i+20][j] = 1;
> >>> >}
> >>> >
> >>> >The IR before SLSR is (on x86_64):
> >>> >
> >>> >   _2 = (long unsigned int) i_1(D);
> >>> >   _3 = _2 * 80;
> >>> >   _4 = _3 + 800;
> >>> >   _6 = a2_5(D) + _4;
> >>> >   *_6[j_8(D)] = 1;
> >>> >   _10 = j_8(D) + 1;
> >>> >   *_6[_10] = 1;
> >>> >   _12 = _3 + 1600;
> >>> >   _13 = a2_5(D) + _12;
> >>> >   *_13[j_8(D)] = 1;
> >>> >
> >>> >The base_expr for the 1st and 2nd memory reference are the same, i.e.
> >>> >_6, while the base_expr for a2[i+20][j] is _13.
> >>> >
> >>> >_13 is essentially (_6 + 800), so all of the three memory references
> >>> >essentially share the same base address.  As their strides are also the
> >>> >
> >>> >same (MULT_EXPR (j, 4)), the three references can all be lowered to
> >>> >MEM_REFs.  What this patch does is to use the tree affine tools to help
> >>> >
> >>> >recognize the underlying base address expression; as it requires
> >>> >looking
> >>> >into the definitions of SSA_NAMEs, get_addr_base_and_unit_offset ()
> >>> >won't help here.
> >>> >
> >>> >Bill has helped me exploit other ways of achieving this in SLSR, but so
> >>> >
> >>> >far we think this is the best way to proceed.  The use of tree affine
> >>> >routines has been restricted to CAND_REFs only and there is the
> >>> >aforementioned cache facility to help reduce the overhead.
> >>> >
> >>> >Thanks,
> >>> >Yufeng
> >>> >
> >>> >P.S. some more details what the patch does:
> >>> >
> >>> >The CAND_REF for the three memory references are:
> >>> >
> >>> >  6  [2] *_6[j_8(D)] = 1;
> >>> >      REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> >>> >      basis: 0  dependent: 8  sibling: 0
> >>> >      next-interp: 0  dead-savings: 0
> >>> >
> >>> >   8  [2] *_6[_10] = 1;
> >>> >      REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
> >>> >      basis: 6  dependent: 11  sibling: 0
> >>> >      next-interp: 0  dead-savings: 0
> >>> >
> >>> >  11  [2] *_13[j_8(D)] = 1;
> >>> >      REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> >>> >      basis: 8  dependent: 0  sibling: 0
> >>> >      next-interp: 0  dead-savings: 0
> >>> >
> >>> >Before the patch, the strength reduction candidate chains for the three
> >>> >
> >>> >CAND_REFs are:
> >>> >
> >>> >   _6 -> 6 -> 8
> >>> >   _13 -> 11
> >>> >
> >>> >i.e. SLSR recognizes the first two references share the same basis,
> >>> >while the last one is on it own.
> >>> >
> >>> >With the patch, an extra candidate chain can be recognized:
> >>> >
> >>> >   a2_5(D) + (sizetype) i_1(D) * 80 -> 6 -> 11 -> 8
> >>> >
> >>> >i.e. all of the three references are found to have the same basis
> >>> >(a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded
> >>> >_6
> >>> >or _13, with the immediate offset removed.  The pass is now able to
> >>> >lower all of the three references, instead of the first two only, to
> >>> >MEM_REFs.
> >>>
> >>> Ok, so slsr handles arbitrary complex bases and figures out common components? If so, then why not just use get_inner_reference? After all slsr does not use tree-affine as representation for bases (which it could?)
> >>
> >> I think that's overstating SLSR's current capabilities a bit. :)  We do
> >> use get_inner_reference to come up with the base expression for
> >> reference candidates (based on some of your suggestions a couple of
> >> years back).  However, in the case of multiple levels of array
> >> references, we miss opportunities because get_inner_reference stops at
> >> an SSA name that could be further expanded by following its definition
> >> back to a more fundamental base expression.
> >
> > Using tree-affine.c to_affine_comb / affine_comb_to_tree has exactly the
> > same problem.
> 
> Oh, you're using affine combination expansion ... which is even more
> expensive.  So why isn't that then done for all ref candidates?  That is,
> why do two different things, get_inner_reference _and_ affine-combination
> dances.  And why build back trees from that instead of storing the
> affine combination.

Well, the original design had no desire to use the expensive machinery
of affine combination expansion.  For what was envisioned, the simpler
mechanisms of get_inner_reference have been plenty.

My thought, and please correct me if I'm wrong, is that once we've
already reduced to an SSA name from get_inner_reference, the affine
machinery will terminate fairly quickly -- we shouldn't get into too
deep a search on underlying pointer arithmetic in most cases.  But
compile time testing will tell us whether this is reasonable.

Bill

> 
> I'll bet we come back with compile-time issues after this patch
> went in.  I'll count on you two to fix them then.
> 
> Richard.
> 
> >> Part of the issue here is that reference candidates are basis for a more
> >> specific optimization than the mult and add candidates.  The latter have
> >> a more general framework for building up a recording of simple affine
> >> expressions that can be strength-reduced.  Ultimately we ought to be
> >> able to do something similar for reference candidates, building up
> >> simple affine expressions from base expressions, so that everything is
> >> done in a forward order and the tree-affine interfaces aren't needed.
> >> But that will take some more fundamental design changes, and since this
> >> provides some good improvements for important cases, I feel it's
> >> reasonable to get this into the release.
> >
> > But I fail to see what is special about doing the dance to affine and
> > then back to trees just to drop the constant offset which would be
> > done by get_inner_reference as well and cheaper if you just ignore
> > bitpos.
> >
> > ?!
> >
> > Richard.
> >
> >> Thanks,
> >> Bill
> >>
> >>>
> >>> Richard.
> >>>
> >>>
> >>
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PING] [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-12-04 11:32                                     ` Yufeng Zhang
@ 2013-12-04 13:24                                       ` Bill Schmidt
  0 siblings, 0 replies; 34+ messages in thread
From: Bill Schmidt @ 2013-12-04 13:24 UTC (permalink / raw)
  To: Yufeng Zhang; +Cc: Richard Biener, Jeff Law, gcc-patches

On Wed, 2013-12-04 at 11:32 +0000, Yufeng Zhang wrote:
> On 12/04/13 10:30, Richard Biener wrote:
> > On Wed, Dec 4, 2013 at 11:26 AM, Richard Biener
> > <richard.guenther@gmail.com>  wrote:
> >> On Tue, Dec 3, 2013 at 11:04 PM, Bill Schmidt
> >> <wschmidt@linux.vnet.ibm.com>  wrote:
> >>> On Tue, 2013-12-03 at 21:35 +0100, Richard Biener wrote:
> >>>> Yufeng Zhang<Yufeng.Zhang@arm.com>  wrote:
> >>>>> On 12/03/13 14:20, Richard Biener wrote:
> >>>>>> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang<Yufeng.Zhang@arm.com>
> >>>>> wrote:
> >>>>>>> On 12/03/13 06:48, Jeff Law wrote:
> >>>>>>>>
> >>>>>>>> On 12/02/13 08:47, Yufeng Zhang wrote:
> >>>>>>>>>
> >>>>>>>>> Ping~
> >>>>>>>>>
> >>>>>>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Yufeng
> >>>>>>>>>
> >>>>>>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 11/26/13 12:45, Richard Biener wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
> >>>>>>>>>>> Zhang<Yufeng.Zhang@arm.com>      wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The second version of your original patch is ok with me with
> >>>>> the
> >>>>>>>>>>>>> following changes.  Sorry for the little side adventure into
> >>>>> the
> >>>>>>>>>>>>> next-interp logic; in the end that's going to hurt more than
> >>>>> it
> >>>>>>>>>>>>> helps in
> >>>>>>>>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks
> >>>>> also for
> >>>>>>>>>>>>> cleaning up this version to be less intrusive to common
> >>>>> interfaces; I
> >>>>>>>>>>>>> appreciate it.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks a lot for the review.  I've attached an updated patch
> >>>>> with the
> >>>>>>>>>>>> suggested changes incorporated.
> >>>>>>>>>>>>
> >>>>>>>>>>>> For the next-interp adventure, I was quite happy to do the
> >>>>>>>>>>>> experiment; it's
> >>>>>>>>>>>> a good chance of gaining insight into the pass.  Many thanks
> >>>>> for
> >>>>>>>>>>>> your prompt
> >>>>>>>>>>>> replies and patience in guiding!
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
> >>>>>>>>>>>>> approval,
> >>>>>>>>>>>>> as I'm not a maintainer.
> >>>>>>>>
> >>>>>>>> First a note, I need to check on voting for Bill as the slsr
> >>>>> maintainer
> >>>>>>>> from the steering committee.   Voting was in progress just before
> >>>>> the
> >>>>>>>> close of stage1 development so I haven't tallied the results :-)
> >>>>>>>
> >>>>>>>
> >>>>>>> Looking forward to some good news! :)
> >>>>>>>
> >>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Yes, you are right that non-trivial 'base' trees are rarely shared.
> >>>>>>>>>> The cache is introduced mainly because get_alternative_base () may be
> >>>>>>>>>> called twice on the same 'base' tree, once in the
> >>>>>>>>>> find_basis_for_candidate () for look-up and the other time in
> >>>>>>>>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm
> >>>>> happy
> >>>>>>>>>> to leave out the cache if you think the benefit is trivial.
> >>>>>>>>
> >>>>>>>> Without some sense of how expensive the lookups are vs how often
> >>>>> the
> >>>>>>>> cache hits it's awful hard to know if the cache is worth it.
> >>>>>>>>
> >>>>>>>> I'd say take it out unless you have some sense it's really saving
> >>>>> time.
> >>>>>>>>      It's a pretty minor implementation detail either way.
> >>>>>>>
> >>>>>>>
> >>>>>>> I think the affine tree routines are generally expensive; it is
> >>>>> worth having
> >>>>>>> a cache to avoid calling them too many times.  I run the slsr-*.c
> >>>>> tests
> >>>>>>> under gcc.dg/tree-ssa/ and find out that the cache hit rates range
> >>>>> from
> >>>>>>> 55.6% to 90%, with 73.5% as the average.  The samples may not well
> >>>>> represent
> >>>>>>> the real world scenario, but they do show the fact that the 'base'
> >>>>> tree can
> >>>>>>> be shared to some extent.  So I'd like to have the cache in the
> >>>>> patch.
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> +/* { dg-do compile } */
> >>>>>>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
> >>>>>>>>>>> +
> >>>>>>>>>>> +typedef int arr_2[50][50];
> >>>>>>>>>>> +
> >>>>>>>>>>> +void foo (arr_2 a2, int v1)
> >>>>>>>>>>> +{
> >>>>>>>>>>> +  int i, j;
> >>>>>>>>>>> +
> >>>>>>>>>>> +  i = v1 + 5;
> >>>>>>>>>>> +  j = i;
> >>>>>>>>>>> +  a2 [i-10] [j] = 2;
> >>>>>>>>>>> +  a2 [i] [j++] = i;
> >>>>>>>>>>> +  a2 [i+20] [j++] = i;
> >>>>>>>>>>> +  a2 [i-3] [i-1] += 1;
> >>>>>>>>>>> +  return;
> >>>>>>>>>>> +}
> >>>>>>>>>>> +
> >>>>>>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
> >>>>>>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
> >>>>>>>>>>>
> >>>>>>>>>>> scanning for 5 MEMs looks non-sensical.  What transform do
> >>>>>>>>>>> you expect?  I see other slsr testcases do similar non-sensical
> >>>>>>>>>>> checking which is bad, too.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> As the slsr optimizes CAND_REF candidates by simply lowering them
> >>>>> to
> >>>>>>>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of
> >>>>> MEM_REFs
> >>>>>>>>>> is an effective check.  Alternatively, I can add a follow-up
> >>>>> patch to
> >>>>>>>>>> add some dumping facility in replace_ref () to print out the
> >>>>> replacing
> >>>>>>>>>> actions when -fdump-tree-slsr-details is on.
> >>>>>>>>
> >>>>>>>> I think adding some details to the dump and scanning for them would
> >>>>> be
> >>>>>>>> better.  That's the only change that is required for this to move
> >>>>> forward.
> >>>>>>>
> >>>>>>>
> >>>>>>> I've updated the patch to dump more details when
> >>>>> -fdump-tree-slsr-details is
> >>>>>>> on.  The tests have also been updated to scan for these new dumps
> >>>>> instead of
> >>>>>>> MEMs.
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> I suggest doing it quickly.  We're well past stage1 close at this
> >>>>> point.
> >>>>>>>
> >>>>>>>
> >>>>>>> The bootstrapping on x86_64 is still running.  OK to commit if it
> >>>>> succeeds?
> >>>>>>
> >>>>>> I still don't like it.  It's using the wrong and too expensive tools
> >>>>> to do
> >>>>>> stuff.  What kind of bases are we ultimately interested in?  Browsing
> >>>>>> the code it looks like we're having
> >>>>>>
> >>>>>>     /* Base expression for the chain of candidates:  often, but not
> >>>>>>        always, an SSA name.  */
> >>>>>>     tree base_expr;
> >>>>>>
> >>>>>> which isn't really too informative but I suppose they are all
> >>>>>> kind-of-gimple_val()s?  That said, I wonder if you can simply
> >>>>>> use get_addr_base_and_unit_offset in place of get_alternative_base
> >>>>> (),
> >>>>>> ignoring the returned offset.
> >>>>>
> >>>>> 'base_expr' is essentially the base address of a handled_component_p,
> >>>>> e.g. ARRAY_REF, COMPONENT_REF, etc.  In most cases, it is the address of
> >>>>>
> >>>>> the object returned by get_inner_reference ().
> >>>>>
> >>>>> Given a test case like the following:
> >>>>>
> >>>>> typedef int arr_2[20][20];
> >>>>>
> >>>>> void foo (arr_2 a2, int i, int j)
> >>>>> {
> >>>>>    a2[i+10][j] = 1;
> >>>>>    a2[i+10][j+1] = 1;
> >>>>>    a2[i+20][j] = 1;
> >>>>> }
> >>>>>
> >>>>> The IR before SLSR is (on x86_64):
> >>>>>
> >>>>>    _2 = (long unsigned int) i_1(D);
> >>>>>    _3 = _2 * 80;
> >>>>>    _4 = _3 + 800;
> >>>>>    _6 = a2_5(D) + _4;
> >>>>>    *_6[j_8(D)] = 1;
> >>>>>    _10 = j_8(D) + 1;
> >>>>>    *_6[_10] = 1;
> >>>>>    _12 = _3 + 1600;
> >>>>>    _13 = a2_5(D) + _12;
> >>>>>    *_13[j_8(D)] = 1;
> >>>>>
> >>>>> The base_expr for the 1st and 2nd memory reference are the same, i.e.
> >>>>> _6, while the base_expr for a2[i+20][j] is _13.
> >>>>>
> >>>>> _13 is essentially (_6 + 800), so all of the three memory references
> >>>>> essentially share the same base address.  As their strides are also the
> >>>>>
> >>>>> same (MULT_EXPR (j, 4)), the three references can all be lowered to
> >>>>> MEM_REFs.  What this patch does is to use the tree affine tools to help
> >>>>>
> >>>>> recognize the underlying base address expression; as it requires
> >>>>> looking
> >>>>> into the definitions of SSA_NAMEs, get_addr_base_and_unit_offset ()
> >>>>> won't help here.
> >>>>>
> >>>>> Bill has helped me explore other ways of achieving this in SLSR, but so
> >>>>>
> >>>>> far we think this is the best way to proceed.  The use of tree affine
> >>>>> routines has been restricted to CAND_REFs only and there is the
> >>>>> aforementioned cache facility to help reduce the overhead.
> >>>>>
> >>>>> Thanks,
> >>>>> Yufeng
> >>>>>
> >>>>> P.S. some more details what the patch does:
> >>>>>
> >>>>> The CAND_REF for the three memory references are:
> >>>>>
> >>>>>   6  [2] *_6[j_8(D)] = 1;
> >>>>>       REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> >>>>>       basis: 0  dependent: 8  sibling: 0
> >>>>>       next-interp: 0  dead-savings: 0
> >>>>>
> >>>>>    8  [2] *_6[_10] = 1;
> >>>>>       REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
> >>>>>       basis: 6  dependent: 11  sibling: 0
> >>>>>       next-interp: 0  dead-savings: 0
> >>>>>
> >>>>>   11  [2] *_13[j_8(D)] = 1;
> >>>>>       REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> >>>>>       basis: 8  dependent: 0  sibling: 0
> >>>>>       next-interp: 0  dead-savings: 0

The Real Problem here is that all of these references really need to be
recorded as based on a2_5(D) as the fundamental base.  Using
get_inner_reference is not giving us the full picture needed to fully
unwrap the arithmetic.
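
To make the arithmetic concrete, here is a small stand-alone sketch (not part
of the patch, and not GCC code).  It assumes the sizes seen in the dumps above,
i.e. a 4-byte int and 80-byte rows; the helper names ref_offset, const_part and
check_three_refs are made up for illustration.  It checks that, once the shared
part a2 + i * 80 + j * 4 is subtracted, each of the three references differs
from it only by a constant:

```c
#include <assert.h>

/* Sizes as in the dumps above: 4-byte int, rows of 20 ints (80 bytes).  */
enum { ELEM_SIZE = 4, ROW_SIZE = 20 * ELEM_SIZE };

/* Byte offset of a2[row][col] from a2.  Hypothetical helper.  */
static long
ref_offset (long row, long col)
{
  return row * ROW_SIZE + col * ELEM_SIZE;
}

/* Constant part of a2[i + di][j + dj] once the shared part
   a2 + i * 80 + j * 4 has been subtracted out.  */
static long
const_part (long i, long j, long di, long dj)
{
  return ref_offset (i + di, j + dj) - ref_offset (i, j);
}

/* The three references from the test case, for arbitrary i and j.  */
static void
check_three_refs (long i, long j)
{
  assert (const_part (i, j, 10, 0) == 800);   /* a2[i+10][j]   */
  assert (const_part (i, j, 10, 1) == 804);   /* a2[i+10][j+1] */
  assert (const_part (i, j, 20, 0) == 1600);  /* a2[i+20][j]   */
}
```

Calling check_three_refs with any i and j passes, which is just another way of
saying that the constants do not depend on the index values.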

But the information needed is all present in the candidate table, or can
be added to the candidate table, to allow this to be done fully in a
forward order according to the pass's design philosophy.  That's what we
need to concentrate on in the long term.  In the end I do not want the
affine machinery to be how we do this, because this requires us to go
back over ground we should have already been able to cover with the
forward analysis.  But this will require recording some more complicated
expressions in the table than the current infrastructure envisions.
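
Very roughly, and only as a toy model of what "recording in forward order"
could look like, something along these lines.  None of this is the existing
slsr_cand layout or any GCC interface; struct toy_rec and every function below
are hypothetical names invented for the sketch.  Each definition builds its
record from the records of its operands, so no backward walk is needed, and
two references share a basis when everything but the constant offset matches:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy record: value == base + scale * index + offset.
   Hypothetical; this is not GCC's candidate table entry.  */
struct toy_rec
{
  const char *base;   /* fundamental base name, or NULL */
  const char *index;  /* index variable name, or NULL */
  long scale;
  long offset;
};

static struct toy_rec
rec_base (const char *name)
{
  struct toy_rec r = { name, NULL, 0, 0 };
  return r;
}

static struct toy_rec
rec_index (const char *name)
{
  struct toy_rec r = { NULL, name, 1, 0 };
  return r;
}

static struct toy_rec
rec_mult (struct toy_rec a, long k)
{
  a.scale *= k;
  a.offset *= k;
  return a;
}

static struct toy_rec
rec_plus_const (struct toy_rec a, long k)
{
  a.offset += k;
  return a;
}

/* p + off, assuming p carries only a base and off only an index term.  */
static struct toy_rec
rec_pointer_plus (struct toy_rec p, struct toy_rec off)
{
  struct toy_rec r = { p.base, off.index, off.scale, p.offset + off.offset };
  return r;
}

static int
name_eq (const char *a, const char *b)
{
  if (a == NULL || b == NULL)
    return a == b;
  return strcmp (a, b) == 0;
}

/* Same basis: everything matches except the constant offset.  */
static int
same_basis (struct toy_rec a, struct toy_rec b)
{
  return name_eq (a.base, b.base) && name_eq (a.index, b.index)
	 && a.scale == b.scale;
}

/* Replays the x86_64 IR from the test case in forward order.  */
static void
replay_example (void)
{
  struct toy_rec a2 = rec_base ("a2_5(D)");
  struct toy_rec t2 = rec_index ("i_1(D)");        /* _2 = (cast) i_1(D) */
  struct toy_rec t3 = rec_mult (t2, 80);           /* _3 = _2 * 80 */
  struct toy_rec t4 = rec_plus_const (t3, 800);    /* _4 = _3 + 800 */
  struct toy_rec t6 = rec_pointer_plus (a2, t4);   /* _6 = a2_5(D) + _4 */
  struct toy_rec t12 = rec_plus_const (t3, 1600);  /* _12 = _3 + 1600 */
  struct toy_rec t13 = rec_pointer_plus (a2, t12); /* _13 = a2_5(D) + _12 */

  assert (same_basis (t6, t13));
  assert (t13.offset - t6.offset == 800);
}
```

The real table would of course need to handle more than one index term and
non-constant strides; the sketch only covers the shape of the test case.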

Richard, if you're not comfortable with Yufeng's implementation as a
temporary solution for this release, then we'll have to hold it off.
I'd like to see if the compile-time issue is real or a chimera before we
dismiss it, though.

Thanks,
Bill

> >>>>>
> >>>>> Before the patch, the strength reduction candidate chains for the three
> >>>>>
> >>>>> CAND_REFs are:
> >>>>>
> >>>>>    _6 ->  6 ->  8
> >>>>>    _13 ->  11
> >>>>>
> >>>>> i.e. SLSR recognizes the first two references share the same basis,
> >>>>> while the last one is on its own.
> >>>>>
> >>>>> With the patch, an extra candidate chain can be recognized:
> >>>>>
> >>>>>    a2_5(D) + (sizetype) i_1(D) * 80 ->  6 ->  11 ->  8
> >>>>>
> >>>>> i.e. all of the three references are found to have the same basis
> >>>>> (a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded
> >>>>> _6
> >>>>> or _13, with the immediate offset removed.  The pass is now able to
> >>>>> lower all of the three references, instead of the first two only, to
> >>>>> MEM_REFs.
> >>>>
> >>>> Ok, so slsr handles arbitrarily complex bases and figures out common
> >>>> components?  If so, then why not just use get_inner_reference?  After
> >>>> all, slsr does not use tree-affine as the representation for bases (which
> >>>> it could?)
> >>>
> >>> I think that's overstating SLSR's current capabilities a bit. :)  We do
> >>> use get_inner_reference to come up with the base expression for
> >>> reference candidates (based on some of your suggestions a couple of
> >>> years back).  However, in the case of multiple levels of array
> >>> references, we miss opportunities because get_inner_reference stops at
> >>> an SSA name that could be further expanded by following its definition
> >>> back to a more fundamental base expression.
> >>
> >> Using tree-affine.c to_affine_comb / affine_comb_to_tree has exactly the
> >> same problem.
> >
> > Oh, you're using affine combination expansion ... which is even more
> > expensive.  So why isn't that then done for all ref candidates?  That is,
> > why do two different things, get_inner_reference _and_ affine-combination
> > dances.
> 
> affine-combination is only called from where get_inner_reference stops 
> (an SSA_NAME), rather than on the whole address.
> 
> > And why build back trees from that instead of storing the
> > affine combination.
> 
> I think we can store the affine combination instead, but it would need 
> changes to the infrastructure in slsr.
> 
> >
> > I'll bet we come back with compile-time issues after this patch
> > went in.  I'll count on you two to fix them then.
> 
> I'll do some timing this week; if the result on the compilation time is 
> not good, I'll revert the patch.
> 
> Yufeng
> 



* Re: [PING] [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-12-04 13:14                                     ` Bill Schmidt
@ 2013-12-04 13:28                                       ` Bill Schmidt
  2013-12-05  8:49                                         ` Richard Biener
  0 siblings, 1 reply; 34+ messages in thread
From: Bill Schmidt @ 2013-12-04 13:28 UTC (permalink / raw)
  To: Richard Biener; +Cc: Yufeng Zhang, Jeff Law, gcc-patches

On Wed, 2013-12-04 at 07:13 -0600, Bill Schmidt wrote:
> On Wed, 2013-12-04 at 11:30 +0100, Richard Biener wrote:
> > On Wed, Dec 4, 2013 at 11:26 AM, Richard Biener
> > <richard.guenther@gmail.com> wrote:
> > > On Tue, Dec 3, 2013 at 11:04 PM, Bill Schmidt
> > > <wschmidt@linux.vnet.ibm.com> wrote:
> > >> On Tue, 2013-12-03 at 21:35 +0100, Richard Biener wrote:
> > >>> Yufeng Zhang <Yufeng.Zhang@arm.com> wrote:
> > >>> >On 12/03/13 14:20, Richard Biener wrote:
> > >>> >> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang<Yufeng.Zhang@arm.com>
> > >>> >wrote:
> > >>> >>> On 12/03/13 06:48, Jeff Law wrote:
> > >>> >>>>
> > >>> >>>> On 12/02/13 08:47, Yufeng Zhang wrote:
> > >>> >>>>>
> > >>> >>>>> Ping~
> > >>> >>>>>
> > >>> >>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
> > >>> >>>>
> > >>> >>>>
> > >>> >>>>>
> > >>> >>>>> Thanks,
> > >>> >>>>> Yufeng
> > >>> >>>>>
> > >>> >>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
> > >>> >>>>>>
> > >>> >>>>>> On 11/26/13 12:45, Richard Biener wrote:
> > >>> >>>>>>>
> > >>> >>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
> > >>> >>>>>>> Zhang<Yufeng.Zhang@arm.com>     wrote:
> > >>> >>>>>>>>
> > >>> >>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
> > >>> >>>>>>>>>
> > >>> >>>>>>>>> The second version of your original patch is ok with me with
> > >>> >the
> > >>> >>>>>>>>> following changes.  Sorry for the little side adventure into
> > >>> >the
> > >>> >>>>>>>>> next-interp logic; in the end that's going to hurt more than
> > >>> >it
> > >>> >>>>>>>>> helps in
> > >>> >>>>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks
> > >>> >also for
> > >>> >>>>>>>>> cleaning up this version to be less intrusive to common
> > >>> >interfaces; I
> > >>> >>>>>>>>> appreciate it.
> > >>> >>>>>>>>
> > >>> >>>>>>>>
> > >>> >>>>>>>>
> > >>> >>>>>>>> Thanks a lot for the review.  I've attached an updated patch
> > >>> >with the
> > >>> >>>>>>>> suggested changes incorporated.
> > >>> >>>>>>>>
> > >>> >>>>>>>> For the next-interp adventure, I was quite happy to do the
> > >>> >>>>>>>> experiment; it's
> > >>> >>>>>>>> a good chance of gaining insight into the pass.  Many thanks
> > >>> >for
> > >>> >>>>>>>> your prompt
> > >>> >>>>>>>> replies and patience in guiding!
> > >>> >>>>>>>>
> > >>> >>>>>>>>
> > >>> >>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
> > >>> >>>>>>>>> approval,
> > >>> >>>>>>>>> as I'm not a maintainer.
> > >>> >>>>
> > >>> >>>> First a note, I need to check on voting for Bill as the slsr
> > >>> >maintainer
> > >>> >>>> from the steering committee.   Voting was in progress just before
> > >>> >the
> > >>> >>>> close of stage1 development so I haven't tallied the results :-)
> > >>> >>>
> > >>> >>>
> > >>> >>> Looking forward to some good news! :)
> > >>> >>>
> > >>> >>>
> > >>> >>>>>>
> > >>> >>>>>> Yes, you are right that non-trivial 'base' trees are rarely shared.
> > >>> >>>>>> The cache is introduced mainly because get_alternative_base () may be
> > >>> >>>>>> called twice on the same 'base' tree, once in the
> > >>> >>>>>> find_basis_for_candidate () for look-up and the other time in
> > >>> >>>>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm
> > >>> >happy
> > >>> >>>>>> to leave out the cache if you think the benefit is trivial.
> > >>> >>>>
> > >>> >>>> Without some sense of how expensive the lookups are vs how often
> > >>> >the
> > >>> >>>> cache hits it's awful hard to know if the cache is worth it.
> > >>> >>>>
> > >>> >>>> I'd say take it out unless you have some sense it's really saving
> > >>> >time.
> > >>> >>>>     It's a pretty minor implementation detail either way.
> > >>> >>>
> > >>> >>>
> > >>> >>> I think the affine tree routines are generally expensive; it is
> > >>> >worth having
> > >>> >>> a cache to avoid calling them too many times.  I run the slsr-*.c
> > >>> >tests
> > >>> >>> under gcc.dg/tree-ssa/ and find out that the cache hit rates range
> > >>> >from
> > >>> >>> 55.6% to 90%, with 73.5% as the average.  The samples may not well
> > >>> >represent
> > >>> >>> the real world scenario, but they do show the fact that the 'base'
> > >>> >tree can
> > >>> >>> be shared to some extent.  So I'd like to have the cache in the
> > >>> >patch.
> > >>> >>>
> > >>> >>>
> > >>> >>>>
> > >>> >>>>>>
> > >>> >>>>>>> +/* { dg-do compile } */
> > >>> >>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
> > >>> >>>>>>> +
> > >>> >>>>>>> +typedef int arr_2[50][50];
> > >>> >>>>>>> +
> > >>> >>>>>>> +void foo (arr_2 a2, int v1)
> > >>> >>>>>>> +{
> > >>> >>>>>>> +  int i, j;
> > >>> >>>>>>> +
> > >>> >>>>>>> +  i = v1 + 5;
> > >>> >>>>>>> +  j = i;
> > >>> >>>>>>> +  a2 [i-10] [j] = 2;
> > >>> >>>>>>> +  a2 [i] [j++] = i;
> > >>> >>>>>>> +  a2 [i+20] [j++] = i;
> > >>> >>>>>>> +  a2 [i-3] [i-1] += 1;
> > >>> >>>>>>> +  return;
> > >>> >>>>>>> +}
> > >>> >>>>>>> +
> > >>> >>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
> > >>> >>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
> > >>> >>>>>>>
> > >>> >>>>>>> scanning for 5 MEMs looks non-sensical.  What transform do
> > >>> >>>>>>> you expect?  I see other slsr testcases do similar non-sensical
> > >>> >>>>>>> checking which is bad, too.
> > >>> >>>>>>
> > >>> >>>>>>
> > >>> >>>>>> As the slsr optimizes CAND_REF candidates by simply lowering them
> > >>> >to
> > >>> >>>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of
> > >>> >MEM_REFs
> > >>> >>>>>> is an effective check.  Alternatively, I can add a follow-up
> > >>> >patch to
> > >>> >>>>>> add some dumping facility in replace_ref () to print out the
> > >>> >replacing
> > >>> >>>>>> actions when -fdump-tree-slsr-details is on.
> > >>> >>>>
> > >>> >>>> I think adding some details to the dump and scanning for them would
> > >>> >be
> > >>> >>>> better.  That's the only change that is required for this to move
> > >>> >forward.
> > >>> >>>
> > >>> >>>
> > >>> >>> I've updated the patch to dump more details when
> > >>> >-fdump-tree-slsr-details is
> > >>> >>> on.  The tests have also been updated to scan for these new dumps
> > >>> >instead of
> > >>> >>> MEMs.
> > >>> >>>
> > >>> >>>
> > >>> >>>>
> > >>> >>>> I suggest doing it quickly.  We're well past stage1 close at this
> > >>> >point.
> > >>> >>>
> > >>> >>>
> > >>> >>> The bootstrapping on x86_64 is still running.  OK to commit if it
> > >>> >succeeds?
> > >>> >>
> > >>> >> I still don't like it.  It's using the wrong and too expensive tools
> > >>> >to do
> > >>> >> stuff.  What kind of bases are we ultimately interested in?  Browsing
> > >>> >> the code it looks like we're having
> > >>> >>
> > >>> >>    /* Base expression for the chain of candidates:  often, but not
> > >>> >>       always, an SSA name.  */
> > >>> >>    tree base_expr;
> > >>> >>
> > >>> >> which isn't really too informative but I suppose they are all
> > >>> >> kind-of-gimple_val()s?  That said, I wonder if you can simply
> > >>> >> use get_addr_base_and_unit_offset in place of get_alternative_base
> > >>> >(),
> > >>> >> ignoring the returned offset.
> > >>> >
> > >>> >'base_expr' is essentially the base address of a handled_component_p,
> > >>> >e.g. ARRAY_REF, COMPONENT_REF, etc.  In most cases, it is the address of
> > >>> >
> > >>> >the object returned by get_inner_reference ().
> > >>> >
> > >>> >Given a test case like the following:
> > >>> >
> > >>> >typedef int arr_2[20][20];
> > >>> >
> > >>> >void foo (arr_2 a2, int i, int j)
> > >>> >{
> > >>> >   a2[i+10][j] = 1;
> > >>> >   a2[i+10][j+1] = 1;
> > >>> >   a2[i+20][j] = 1;
> > >>> >}
> > >>> >
> > >>> >The IR before SLSR is (on x86_64):
> > >>> >
> > >>> >   _2 = (long unsigned int) i_1(D);
> > >>> >   _3 = _2 * 80;
> > >>> >   _4 = _3 + 800;
> > >>> >   _6 = a2_5(D) + _4;
> > >>> >   *_6[j_8(D)] = 1;
> > >>> >   _10 = j_8(D) + 1;
> > >>> >   *_6[_10] = 1;
> > >>> >   _12 = _3 + 1600;
> > >>> >   _13 = a2_5(D) + _12;
> > >>> >   *_13[j_8(D)] = 1;
> > >>> >
> > >>> >The base_expr for the 1st and 2nd memory reference are the same, i.e.
> > >>> >_6, while the base_expr for a2[i+20][j] is _13.
> > >>> >
> > >>> >_13 is essentially (_6 + 800), so all of the three memory references
> > >>> >essentially share the same base address.  As their strides are also the
> > >>> >
> > >>> >same (MULT_EXPR (j, 4)), the three references can all be lowered to
> > >>> >MEM_REFs.  What this patch does is to use the tree affine tools to help
> > >>> >
> > >>> >recognize the underlying base address expression; as it requires
> > >>> >looking
> > >>> >into the definitions of SSA_NAMEs, get_addr_base_and_unit_offset ()
> > >>> >won't help here.
> > >>> >
> > >>> >Bill has helped me explore other ways of achieving this in SLSR, but so
> > >>> >
> > >>> >far we think this is the best way to proceed.  The use of tree affine
> > >>> >routines has been restricted to CAND_REFs only and there is the
> > >>> >aforementioned cache facility to help reduce the overhead.
> > >>> >
> > >>> >Thanks,
> > >>> >Yufeng
> > >>> >
> > >>> >P.S. some more details what the patch does:
> > >>> >
> > >>> >The CAND_REF for the three memory references are:
> > >>> >
> > >>> >  6  [2] *_6[j_8(D)] = 1;
> > >>> >      REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> > >>> >      basis: 0  dependent: 8  sibling: 0
> > >>> >      next-interp: 0  dead-savings: 0
> > >>> >
> > >>> >   8  [2] *_6[_10] = 1;
> > >>> >      REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
> > >>> >      basis: 6  dependent: 11  sibling: 0
> > >>> >      next-interp: 0  dead-savings: 0
> > >>> >
> > >>> >  11  [2] *_13[j_8(D)] = 1;
> > >>> >      REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> > >>> >      basis: 8  dependent: 0  sibling: 0
> > >>> >      next-interp: 0  dead-savings: 0
> > >>> >
> > >>> >Before the patch, the strength reduction candidate chains for the three
> > >>> >
> > >>> >CAND_REFs are:
> > >>> >
> > >>> >   _6 -> 6 -> 8
> > >>> >   _13 -> 11
> > >>> >
> > >>> >i.e. SLSR recognizes the first two references share the same basis,
> > >>> >while the last one is on its own.
> > >>> >
> > >>> >With the patch, an extra candidate chain can be recognized:
> > >>> >
> > >>> >   a2_5(D) + (sizetype) i_1(D) * 80 -> 6 -> 11 -> 8
> > >>> >
> > >>> >i.e. all of the three references are found to have the same basis
> > >>> >(a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded
> > >>> >_6
> > >>> >or _13, with the immediate offset removed.  The pass is now able to
> > >>> >lower all of the three references, instead of the first two only, to
> > >>> >MEM_REFs.
> > >>>
> > >>> Ok, so slsr handles arbitrarily complex bases and figures out common
> > >>> components?  If so, then why not just use get_inner_reference?  After
> > >>> all, slsr does not use tree-affine as the representation for bases (which
> > >>> it could?)
> > >>
> > >> I think that's overstating SLSR's current capabilities a bit. :)  We do
> > >> use get_inner_reference to come up with the base expression for
> > >> reference candidates (based on some of your suggestions a couple of
> > >> years back).  However, in the case of multiple levels of array
> > >> references, we miss opportunities because get_inner_reference stops at
> > >> an SSA name that could be further expanded by following its definition
> > >> back to a more fundamental base expression.
> > >
> > > Using tree-affine.c to_affine_comb / affine_comb_to_tree has exactly the
> > > same problem.
> > 
> > Oh, you're using affine combination expansion ... which is even more
> > expensive.  So why isn't that then done for all ref candidates?  That is,
> > why do two different things, get_inner_reference _and_ affine-combination
> > dances.  And why build back trees from that instead of storing the
> > affine combination.
> 
> Well, the original design had no desire to use the expensive machinery
> of affine combination expansion.  For what was envisioned, the simpler
> mechanisms of get_inner_reference have been plenty.
> 
> My thought, and please correct me if I'm wrong, is that once we've
> already reduced to an SSA name from get_inner_reference, the affine
> machinery will terminate fairly quickly -- we shouldn't get into too
> deep a search on underlying pointer arithmetic in most cases.  But
> compile time testing will tell us whether this is reasonable.

As a middle ground, may I suggest that we only do the extra tree_affine
expansion at -O2 and above?  Any extra compile time should be a blip at
those levels.  At -O1 there could be legitimate issues, though.

Bill

> 
> Bill
> 
> > 
> > I'll bet we come back with compile-time issues after this patch
> > went in.  I'll count on you two to fix them then.
> > 
> > Richard.
> > 
> > >> Part of the issue here is that reference candidates are the basis for a more
> > >> specific optimization than the mult and add candidates.  The latter have
> > >> a more general framework for building up a recording of simple affine
> > >> expressions that can be strength-reduced.  Ultimately we ought to be
> > >> able to do something similar for reference candidates, building up
> > >> simple affine expressions from base expressions, so that everything is
> > >> done in a forward order and the tree-affine interfaces aren't needed.
> > >> But that will take some more fundamental design changes, and since this
> > >> provides some good improvements for important cases, I feel it's
> > >> reasonable to get this into the release.
> > >
> > > But I fail to see what is special about doing the dance to affine and
> > > then back to trees just to drop the constant offset which would be
> > > done by get_inner_reference as well and cheaper if you just ignore
> > > bitpos.
> > >
> > > ?!
> > >
> > > Richard.
> > >
> > >> Thanks,
> > >> Bill
> > >>
> > >>>
> > >>> Richard.
> > >>>
> > >>>
> > >>
> > 
> 



* Re: [PING] [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-12-04 13:28                                       ` Bill Schmidt
@ 2013-12-05  8:49                                         ` Richard Biener
  0 siblings, 0 replies; 34+ messages in thread
From: Richard Biener @ 2013-12-05  8:49 UTC (permalink / raw)
  To: Bill Schmidt; +Cc: Yufeng Zhang, Jeff Law, gcc-patches

On Wed, Dec 4, 2013 at 2:27 PM, Bill Schmidt
<wschmidt@linux.vnet.ibm.com> wrote:
> On Wed, 2013-12-04 at 07:13 -0600, Bill Schmidt wrote:
>> On Wed, 2013-12-04 at 11:30 +0100, Richard Biener wrote:
>> > On Wed, Dec 4, 2013 at 11:26 AM, Richard Biener
>> > <richard.guenther@gmail.com> wrote:
>> > > On Tue, Dec 3, 2013 at 11:04 PM, Bill Schmidt
>> > > <wschmidt@linux.vnet.ibm.com> wrote:
>> > >> On Tue, 2013-12-03 at 21:35 +0100, Richard Biener wrote:
>> > >>> Yufeng Zhang <Yufeng.Zhang@arm.com> wrote:
>> > >>> >On 12/03/13 14:20, Richard Biener wrote:
>> > >>> >> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang<Yufeng.Zhang@arm.com>
>> > >>> >wrote:
>> > >>> >>> On 12/03/13 06:48, Jeff Law wrote:
>> > >>> >>>>
>> > >>> >>>> On 12/02/13 08:47, Yufeng Zhang wrote:
>> > >>> >>>>>
>> > >>> >>>>> Ping~
>> > >>> >>>>>
>> > >>> >>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
>> > >>> >>>>
>> > >>> >>>>
>> > >>> >>>>>
>> > >>> >>>>> Thanks,
>> > >>> >>>>> Yufeng
>> > >>> >>>>>
>> > >>> >>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
>> > >>> >>>>>>
>> > >>> >>>>>> On 11/26/13 12:45, Richard Biener wrote:
>> > >>> >>>>>>>
>> > >>> >>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
>> > >>> >>>>>>> Zhang<Yufeng.Zhang@arm.com>     wrote:
>> > >>> >>>>>>>>
>> > >>> >>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
>> > >>> >>>>>>>>>
>> > >>> >>>>>>>>> The second version of your original patch is ok with me with
>> > >>> >the
>> > >>> >>>>>>>>> following changes.  Sorry for the little side adventure into
>> > >>> >the
>> > >>> >>>>>>>>> next-interp logic; in the end that's going to hurt more than
>> > >>> >it
>> > >>> >>>>>>>>> helps in
>> > >>> >>>>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks
>> > >>> >also for
>> > >>> >>>>>>>>> cleaning up this version to be less intrusive to common
>> > >>> >interfaces; I
>> > >>> >>>>>>>>> appreciate it.
>> > >>> >>>>>>>>
>> > >>> >>>>>>>>
>> > >>> >>>>>>>>
>> > >>> >>>>>>>> Thanks a lot for the review.  I've attached an updated patch
>> > >>> >with the
>> > >>> >>>>>>>> suggested changes incorporated.
>> > >>> >>>>>>>>
>> > >>> >>>>>>>> For the next-interp adventure, I was quite happy to do the
>> > >>> >>>>>>>> experiment; it's
>> > >>> >>>>>>>> a good chance of gaining insight into the pass.  Many thanks
>> > >>> >for
>> > >>> >>>>>>>> your prompt
>> > >>> >>>>>>>> replies and patience in guiding!
>> > >>> >>>>>>>>
>> > >>> >>>>>>>>
>> > >>> >>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
>> > >>> >>>>>>>>> approval,
>> > >>> >>>>>>>>> as I'm not a maintainer.
>> > >>> >>>>
>> > >>> >>>> First a note, I need to check on voting for Bill as the slsr
>> > >>> >maintainer
>> > >>> >>>> from the steering committee.   Voting was in progress just before
>> > >>> >the
>> > >>> >>>> close of stage1 development so I haven't tallied the results :-)
>> > >>> >>>
>> > >>> >>>
>> > >>> >>> Looking forward to some good news! :)
>> > >>> >>>
>> > >>> >>>
>> > >>> >>>>>>
>> > >>> >>>>>> Yes, you are right that non-trivial 'base' trees are rarely shared.
>> > >>> >>>>>> The cache is introduced mainly because get_alternative_base () may be
>> > >>> >>>>>> called twice on the same 'base' tree, once in the
>> > >>> >>>>>> find_basis_for_candidate () for look-up and the other time in
>> > >>> >>>>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm
>> > >>> >happy
>> > >>> >>>>>> to leave out the cache if you think the benefit is trivial.
>> > >>> >>>>
>> > >>> >>>> Without some sense of how expensive the lookups are vs how often
>> > >>> >the
>> > >>> >>>> cache hits it's awful hard to know if the cache is worth it.
>> > >>> >>>>
>> > >>> >>>> I'd say take it out unless you have some sense it's really saving
>> > >>> >time.
>> > >>> >>>>     It's a pretty minor implementation detail either way.
>> > >>> >>>
>> > >>> >>>
>> > >>> >>> I think the affine tree routines are generally expensive; it is
>> > >>> >worth having
>> > >>> >>> a cache to avoid calling them too many times.  I run the slsr-*.c
>> > >>> >tests
>> > >>> >>> under gcc.dg/tree-ssa/ and find out that the cache hit rates range
>> > >>> >from
>> > >>> >>> 55.6% to 90%, with 73.5% as the average.  The samples may not well
>> > >>> >represent
>> > >>> >>> the real world scenario, but they do show the fact that the 'base'
>> > >>> >tree can
>> > >>> >>> be shared to some extent.  So I'd like to have the cache in the
>> > >>> >patch.
>> > >>> >>>
>> > >>> >>>
>> > >>> >>>>
>> > >>> >>>>>>
>> > >>> >>>>>>> +/* { dg-do compile } */
>> > >>> >>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
>> > >>> >>>>>>> +
>> > >>> >>>>>>> +typedef int arr_2[50][50];
>> > >>> >>>>>>> +
>> > >>> >>>>>>> +void foo (arr_2 a2, int v1)
>> > >>> >>>>>>> +{
>> > >>> >>>>>>> +  int i, j;
>> > >>> >>>>>>> +
>> > >>> >>>>>>> +  i = v1 + 5;
>> > >>> >>>>>>> +  j = i;
>> > >>> >>>>>>> +  a2 [i-10] [j] = 2;
>> > >>> >>>>>>> +  a2 [i] [j++] = i;
>> > >>> >>>>>>> +  a2 [i+20] [j++] = i;
>> > >>> >>>>>>> +  a2 [i-3] [i-1] += 1;
>> > >>> >>>>>>> +  return;
>> > >>> >>>>>>> +}
>> > >>> >>>>>>> +
>> > >>> >>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
>> > >>> >>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
>> > >>> >>>>>>>
>> > >>> >>>>>>> scanning for 5 MEMs looks non-sensical.  What transform do
>> > >>> >>>>>>> you expect?  I see other slsr testcases do similar non-sensical
>> > >>> >>>>>>> checking which is bad, too.
>> > >>> >>>>>>
>> > >>> >>>>>>
>> > >>> >>>>>> As the slsr optimizes CAND_REF candidates by simply lowering them
>> > >>> >to
>> > >>> >>>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of
>> > >>> >MEM_REFs
>> > >>> >>>>>> is an effective check.  Alternatively, I can add a follow-up
>> > >>> >patch to
>> > >>> >>>>>> add some dumping facility in replace_ref () to print out the
>> > >>> >replacing
>> > >>> >>>>>> actions when -fdump-tree-slsr-details is on.
>> > >>> >>>>
>> > >>> >>>> I think adding some details to the dump and scanning for them would
>> > >>> >be
>> > >>> >>>> better.  That's the only change that is required for this to move
>> > >>> >forward.
>> > >>> >>>
>> > >>> >>>
>> > >>> >>> I've updated the patch to dump more details when
>> > >>> >-fdump-tree-slsr-details is
>> > >>> >>> on.  The tests have also been updated to scan for these new dumps
>> > >>> >instead of
>> > >>> >>> MEMs.
>> > >>> >>>
>> > >>> >>>
>> > >>> >>>>
>> > >>> >>>> I suggest doing it quickly.  We're well past stage1 close at this
>> > >>> >point.
>> > >>> >>>
>> > >>> >>>
>> > >>> >>> The bootstrapping on x86_64 is still running.  OK to commit if it
>> > >>> >succeeds?
>> > >>> >>
>> > >>> >> I still don't like it.  It's using the wrong and too expensive tools
>> > >>> >to do
>> > >>> >> stuff.  What kind of bases are we ultimately interested in?  Browsing
>> > >>> >> the code it looks like we're having
>> > >>> >>
>> > >>> >>    /* Base expression for the chain of candidates:  often, but not
>> > >>> >>       always, an SSA name.  */
>> > >>> >>    tree base_expr;
>> > >>> >>
>> > >>> >> which isn't really too informative but I suppose they are all
>> > >>> >> kind-of-gimple_val()s?  That said, I wonder if you can simply
>> > >>> >> use get_addr_base_and_unit_offset in place of get_alternative_base
>> > >>> >(),
>> > >>> >> ignoring the returned offset.
>> > >>> >
>> > >>> >'base_expr' is essentially the base address of a handled_component_p,
>> > >>> >e.g. ARRAY_REF, COMPONENT_REF, etc.  In most case, it is the address of
>> > >>> >
>> > >>> >the object returned by get_inner_reference ().
>> > >>> >
>> > >>> >Given a test case like the following:
>> > >>> >
>> > >>> >typedef int arr_2[20][20];
>> > >>> >
>> > >>> >void foo (arr_2 a2, int i, int j)
>> > >>> >{
>> > >>> >   a2[i+10][j] = 1;
>> > >>> >   a2[i+10][j+1] = 1;
>> > >>> >   a2[i+20][j] = 1;
>> > >>> >}
>> > >>> >
>> > >>> >The IR before SLSR is (on x86_64):
>> > >>> >
>> > >>> >   _2 = (long unsigned int) i_1(D);
>> > >>> >   _3 = _2 * 80;
>> > >>> >   _4 = _3 + 800;
>> > >>> >   _6 = a2_5(D) + _4;
>> > >>> >   *_6[j_8(D)] = 1;
>> > >>> >   _10 = j_8(D) + 1;
>> > >>> >   *_6[_10] = 1;
>> > >>> >   _12 = _3 + 1600;
>> > >>> >   _13 = a2_5(D) + _12;
>> > >>> >   *_13[j_8(D)] = 1;
>> > >>> >
>> > >>> >The base_expr for the 1st and 2nd memory reference are the same, i.e.
>> > >>> >_6, while the base_expr for a2[i+20][j] is _13.
>> > >>> >
>> > >>> >_13 is essentially (_6 + 800), so all of the three memory references
>> > >>> >essentially share the same base address.  As their strides are also the
>> > >>> >
>> > >>> >same (MULT_EXPR (j, 4)), the three references can all be lowered to
>> > >>> >MEM_REFs.  What this patch does is to use the tree affine tools to help
>> > >>> >
>> > >>> >recognize the underlying base address expression; as it requires
>> > >>> >looking
>> > >>> >into the definitions of SSA_NAMEs, get_addr_base_and_unit_offset ()
>> > >>> >won't help here.
>> > >>> >
>> > >>> >Bill has helped me exploit other ways of achieving this in SLSR, but so
>> > >>> >
>> > >>> >far we think this is the best way to proceed.  The use of tree affine
>> > >>> >routines has been restricted to CAND_REFs only and there is the
>> > >>> >aforementioned cache facility to help reduce the overhead.
>> > >>> >
>> > >>> >Thanks,
>> > >>> >Yufeng
>> > >>> >
>> > >>> >P.S. some more details what the patch does:
>> > >>> >
>> > >>> >The CAND_REF for the three memory references are:
>> > >>> >
>> > >>> >  6  [2] *_6[j_8(D)] = 1;
>> > >>> >      REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>> > >>> >      basis: 0  dependent: 8  sibling: 0
>> > >>> >      next-interp: 0  dead-savings: 0
>> > >>> >
>> > >>> >   8  [2] *_6[_10] = 1;
>> > >>> >      REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
>> > >>> >      basis: 6  dependent: 11  sibling: 0
>> > >>> >      next-interp: 0  dead-savings: 0
>> > >>> >
>> > >>> >  11  [2] *_13[j_8(D)] = 1;
>> > >>> >      REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>> > >>> >      basis: 8  dependent: 0  sibling: 0
>> > >>> >      next-interp: 0  dead-savings: 0
>> > >>> >
>> > >>> >Before the patch, the strength reduction candidate chains for the three
>> > >>> >
>> > >>> >CAND_REFs are:
>> > >>> >
>> > >>> >   _6 -> 6 -> 8
>> > >>> >   _13 -> 11
>> > >>> >
>> > >>> >i.e. SLSR recognizes the first two references share the same basis,
>> > >>> >while the last one is on it own.
>> > >>> >
>> > >>> >With the patch, an extra candidate chain can be recognized:
>> > >>> >
>> > >>> >   a2_5(D) + (sizetype) i_1(D) * 80 -> 6 -> 11 -> 8
>> > >>> >
>> > >>> >i.e. all of the three references are found to have the same basis
>> > >>> >(a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded
>> > >>> >_6
>> > >>> >or _13, with the immediate offset removed.  The pass is now able to
>> > >>> >lower all of the three references, instead of the first two only, to
>> > >>> >MEM_REFs.
>> > >>>
>> > >>> Ok, so slsr handles arbitrary complex bases and figures out common components? If so, then why not just use get_inner_reference? After all slsr does not use tree-affine as representation for bases (which it could?)
>> > >>
>> > >> I think that's overstating SLSR's current capabilities a bit. :)  We do
>> > >> use get_inner_reference to come up with the base expression for
>> > >> reference candidates (based on some of your suggestions a couple of
>> > >> years back).  However, in the case of multiple levels of array
>> > >> references, we miss opportunities because get_inner_reference stops at
>> > >> an SSA name that could be further expanded by following its definition
>> > >> back to a more fundamental base expression.
>> > >
>> > > Using tree-affine.c to_affine_comb / affine_comb_to_tree has exactly the
>> > > same problem.
>> >
>> > Oh, you're using affine combination expansion ... which is even more
>> > expensive.  So why isn't that then done for all ref candidates?  That is,
>> > why do two different things, get_inner_reference _and_ affine-combination
>> > dances.  And why build back trees from that instead of storing the
>> > affine combination.
>>
>> Well, the original design had no desire to use the expensive machinery
>> of affine combination expansion.  For what was envisioned, the simpler
>> mechanisms of get_inner_reference have been plenty.
>>
>> My thought, and please correct me if I'm wrong, is that once we've
>> already reduced to an SSA name from get_inner_reference, the affine
>> machinery will terminate fairly quickly -- we shouldn't get into too
>> deep a search on underlying pointer arithmetic in most cases.  But
>> compile time testing will tell us whether this is reasonable.

Indeed.  Doing this still feels backward and odd - the pass should be
able to determine this globally instead of repeating local analysis
(even with a cache).

> As a middle ground, may I suggest that we only do the extra tree_affine
> expansion at -O2 and above?  Any extra compile time should be a blip at
> those levels.  At -O1 there could be legitimate issues, though.

You should check flag_expensive_optimizations here I think.
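
A minimal sketch of such a guard, written as a standalone model rather
than as GCC code.  flag_expensive_optimizations is the real GCC flag
(enabled at -O2 and above), modelled here by a plain global;
expand_base_model and alternative_base_or_null are hypothetical names
used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for GCC's global of the same name.  In GCC it is set by the
   option machinery; here it is an ordinary variable (assumption).  */
static int flag_expensive_optimizations = 0;

/* Counts how often the modelled expensive expansion actually runs.  */
static int expansions_run = 0;

/* Hypothetical stand-in for the tree-affine expansion of a base.  */
static const char *
expand_base_model (const char *base)
{
  assert (base != NULL);
  expansions_run++;
  return base;  /* The real expansion would return the expanded base.  */
}

/* Return an alternative base only when expensive optimizations are
   enabled; otherwise report "no alternative" without doing any work.  */
static const char *
alternative_base_or_null (const char *base)
{
  if (!flag_expensive_optimizations)
    return NULL;
  return expand_base_model (base);
}
```

With the flag clear the helper returns NULL and the expansion counter
stays at zero, so callers fall back to the ordinary base_expr lookup.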

Richard.

> Bill
>
>>
>> Bill
>>
>> >
>> > I'll bet we come back with compile-time issues after this patch
>> > went in.  I'll count on you two to fix them then.
>> >
>> > Richard.
>> >
>> > >> Part of the issue here is that reference candidates are basis for a more
>> > >> specific optimization than the mult and add candidates.  The latter have
>> > >> a more general framework for building up a recording of simple affine
>> > >> expressions that can be strength-reduced.  Ultimately we ought to be
>> > >> able to do something similar for reference candidates, building up
>> > >> simple affine expressions from base expressions, so that everything is
>> > >> done in a forward order and the tree-affine interfaces aren't needed.
>> > >> But that will take some more fundamental design changes, and since this
>> > >> provides some good improvements for important cases, I feel it's
>> > >> reasonable to get this into the release.
>> > >
>> > > But I fail to see what is special about doing the dance to affine and
>> > > then back to trees just to drop the constant offset which would be
>> > > done by get_inner_reference as well and cheaper if you just ignore
>> > > bitpos.
>> > >
>> > > ?!
>> > >
>> > > Richard.
>> > >
>> > >> Thanks,
>> > >> Bill
>> > >>
>> > >>>
>> > >>> Richard.
>> > >>>
>> > >>>
>> > >>
>> >
>>
>
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PING] [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-12-04 13:08                                   ` Bill Schmidt
@ 2013-12-05 12:02                                     ` Yufeng Zhang
  2013-12-05 13:22                                       ` Bill Schmidt
  0 siblings, 1 reply; 34+ messages in thread
From: Yufeng Zhang @ 2013-12-05 12:02 UTC (permalink / raw)
  To: Bill Schmidt; +Cc: Richard Biener, Jeff Law, gcc-patches

On 12/04/13 13:08, Bill Schmidt wrote:
> On Wed, 2013-12-04 at 11:26 +0100, Richard Biener wrote:
>> On Tue, Dec 3, 2013 at 11:04 PM, Bill Schmidt
>> <wschmidt@linux.vnet.ibm.com>  wrote:
>>> On Tue, 2013-12-03 at 21:35 +0100, Richard Biener wrote:
>>>> Yufeng Zhang<Yufeng.Zhang@arm.com>  wrote:
>>>>> On 12/03/13 14:20, Richard Biener wrote:
>>>>>> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang<Yufeng.Zhang@arm.com>
>>>>> wrote:
>>>>>>> On 12/03/13 06:48, Jeff Law wrote:
>>>>>>>>
>>>>>>>> On 12/02/13 08:47, Yufeng Zhang wrote:
>>>>>>>>>
>>>>>>>>> Ping~
>>>>>>>>>
>>>>>>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Yufeng
>>>>>>>>>
>>>>>>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
>>>>>>>>>>
>>>>>>>>>> On 11/26/13 12:45, Richard Biener wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
>>>>>>>>>>> Zhang<Yufeng.Zhang@arm.com>      wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> The second version of your original patch is ok with me with
>>>>> the
>>>>>>>>>>>>> following changes.  Sorry for the little side adventure into
>>>>> the
>>>>>>>>>>>>> next-interp logic; in the end that's going to hurt more than
>>>>> it
>>>>>>>>>>>>> helps in
>>>>>>>>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks
>>>>> also for
>>>>>>>>>>>>> cleaning up this version to be less intrusive to common
>>>>> interfaces; I
>>>>>>>>>>>>> appreciate it.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks a lot for the review.  I've attached an updated patch
>>>>> with the
>>>>>>>>>>>> suggested changes incorporated.
>>>>>>>>>>>>
>>>>>>>>>>>> For the next-interp adventure, I was quite happy to do the
>>>>>>>>>>>> experiment; it's
>>>>>>>>>>>> a good chance of gaining insight into the pass.  Many thanks
>>>>> for
>>>>>>>>>>>> your prompt
>>>>>>>>>>>> replies and patience in guiding!
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
>>>>>>>>>>>>> approval,
>>>>>>>>>>>>> as I'm not a maintainer.
>>>>>>>>
>>>>>>>> First a note, I need to check on voting for Bill as the slsr
>>>>> maintainer
>>>>>>>> from the steering committee.   Voting was in progress just before
>>>>> the
>>>>>>>> close of stage1 development so I haven't tallied the results :-)
>>>>>>>
>>>>>>>
>>>>>>> Looking forward to some good news! :)
>>>>>>>
>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yes, you are right about the non-trivial 'base' tree are rarely
>>>>> shared.
>>>>>>>>>>       The cached is introduced mainly because get_alternative_base
>>>>> () may
>>>>>>>>>> be
>>>>>>>>>> called twice on the same 'base' tree, once in the
>>>>>>>>>> find_basis_for_candidate () for look-up and the other time in
>>>>>>>>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm
>>>>> happy
>>>>>>>>>> to leave out the cache if you think the benefit is trivial.
>>>>>>>>
>>>>>>>> Without some sense of how expensive the lookups are vs how often
>>>>> the
>>>>>>>> cache hits it's awful hard to know if the cache is worth it.
>>>>>>>>
>>>>>>>> I'd say take it out unless you have some sense it's really saving
>>>>> time.
>>>>>>>>      It's a pretty minor implementation detail either way.
>>>>>>>
>>>>>>>
>>>>>>> I think the affine tree routines are generally expensive; it is
>>>>> worth having
>>>>>>> a cache to avoid calling them too many times.  I run the slsr-*.c
>>>>> tests
>>>>>>> under gcc.dg/tree-ssa/ and find out that the cache hit rates range
>>>>> from
>>>>>>> 55.6% to 90%, with 73.5% as the average.  The samples may not well
>>>>> represent
>>>>>>> the real world scenario, but they do show the fact that the 'base'
>>>>> tree can
>>>>>>> be shared to some extent.  So I'd like to have the cache in the
>>>>> patch.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> +/* { dg-do compile } */
>>>>>>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
>>>>>>>>>>> +
>>>>>>>>>>> +typedef int arr_2[50][50];
>>>>>>>>>>> +
>>>>>>>>>>> +void foo (arr_2 a2, int v1)
>>>>>>>>>>> +{
>>>>>>>>>>> +  int i, j;
>>>>>>>>>>> +
>>>>>>>>>>> +  i = v1 + 5;
>>>>>>>>>>> +  j = i;
>>>>>>>>>>> +  a2 [i-10] [j] = 2;
>>>>>>>>>>> +  a2 [i] [j++] = i;
>>>>>>>>>>> +  a2 [i+20] [j++] = i;
>>>>>>>>>>> +  a2 [i-3] [i-1] += 1;
>>>>>>>>>>> +  return;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
>>>>>>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
>>>>>>>>>>>
>>>>>>>>>>> scanning for 5 MEMs looks non-sensical.  What transform do
>>>>>>>>>>> you expect?  I see other slsr testcases do similar non-sensical
>>>>>>>>>>> checking which is bad, too.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> As the slsr optimizes CAND_REF candidates by simply lowering them
>>>>> to
>>>>>>>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of
>>>>> MEM_REFs
>>>>>>>>>> is an effective check.  Alternatively, I can add a follow-up
>>>>> patch to
>>>>>>>>>> add some dumping facility in replace_ref () to print out the
>>>>> replacing
>>>>>>>>>> actions when -fdump-tree-slsr-details is on.
>>>>>>>>
>>>>>>>> I think adding some details to the dump and scanning for them would
>>>>> be
>>>>>>>> better.  That's the only change that is required for this to move
>>>>> forward.
>>>>>>>
>>>>>>>
>>>>>>> I've updated to patch to dump more details when
>>>>> -fdump-tree-slsr-details is
>>>>>>> on.  The tests have also been updated to scan for these new dumps
>>>>> instead of
>>>>>>> MEMs.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> I suggest doing it quickly.  We're well past stage1 close at this
>>>>> point.
>>>>>>>
>>>>>>>
>>>>>>> The bootstrapping on x86_64 is still running.  OK to commit if it
>>>>> succeeds?
>>>>>>
>>>>>> I still don't like it.  It's using the wrong and too expensive tools
>>>>> to do
>>>>>> stuff.  What kind of bases are we ultimately interested in?  Browsing
>>>>>> the code it looks like we're having
>>>>>>
>>>>>>     /* Base expression for the chain of candidates:  often, but not
>>>>>>        always, an SSA name.  */
>>>>>>     tree base_expr;
>>>>>>
>>>>>> which isn't really too informative but I suppose they are all
>>>>>> kind-of-gimple_val()s?  That said, I wonder if you can simply
>>>>>> use get_addr_base_and_unit_offset in place of get_alternative_base
>>>>> (),
>>>>>> ignoring the returned offset.
>>>>>
>>>>> 'base_expr' is essentially the base address of a handled_component_p,
>>>>> e.g. ARRAY_REF, COMPONENT_REF, etc.  In most case, it is the address of
>>>>>
>>>>> the object returned by get_inner_reference ().
>>>>>
>>>>> Given a test case like the following:
>>>>>
>>>>> typedef int arr_2[20][20];
>>>>>
>>>>> void foo (arr_2 a2, int i, int j)
>>>>> {
>>>>>    a2[i+10][j] = 1;
>>>>>    a2[i+10][j+1] = 1;
>>>>>    a2[i+20][j] = 1;
>>>>> }
>>>>>
>>>>> The IR before SLSR is (on x86_64):
>>>>>
>>>>>    _2 = (long unsigned int) i_1(D);
>>>>>    _3 = _2 * 80;
>>>>>    _4 = _3 + 800;
>>>>>    _6 = a2_5(D) + _4;
>>>>>    *_6[j_8(D)] = 1;
>>>>>    _10 = j_8(D) + 1;
>>>>>    *_6[_10] = 1;
>>>>>    _12 = _3 + 1600;
>>>>>    _13 = a2_5(D) + _12;
>>>>>    *_13[j_8(D)] = 1;
>>>>>
>>>>> The base_expr for the 1st and 2nd memory reference are the same, i.e.
>>>>> _6, while the base_expr for a2[i+20][j] is _13.
>>>>>
>>>>> _13 is essentially (_6 + 800), so all of the three memory references
>>>>> essentially share the same base address.  As their strides are also the
>>>>>
>>>>> same (MULT_EXPR (j, 4)), the three references can all be lowered to
>>>>> MEM_REFs.  What this patch does is to use the tree affine tools to help
>>>>>
>>>>> recognize the underlying base address expression; as it requires
>>>>> looking
>>>>> into the definitions of SSA_NAMEs, get_addr_base_and_unit_offset ()
>>>>> won't help here.
>>>>>
>>>>> Bill has helped me exploit other ways of achieving this in SLSR, but so
>>>>>
>>>>> far we think this is the best way to proceed.  The use of tree affine
>>>>> routines has been restricted to CAND_REFs only and there is the
>>>>> aforementioned cache facility to help reduce the overhead.
>>>>>
>>>>> Thanks,
>>>>> Yufeng
>>>>>
>>>>> P.S. some more details what the patch does:
>>>>>
>>>>> The CAND_REF for the three memory references are:
>>>>>
>>>>>   6  [2] *_6[j_8(D)] = 1;
>>>>>       REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>>>>>       basis: 0  dependent: 8  sibling: 0
>>>>>       next-interp: 0  dead-savings: 0
>>>>>
>>>>>    8  [2] *_6[_10] = 1;
>>>>>       REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
>>>>>       basis: 6  dependent: 11  sibling: 0
>>>>>       next-interp: 0  dead-savings: 0
>>>>>
>>>>>   11  [2] *_13[j_8(D)] = 1;
>>>>>       REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
>>>>>       basis: 8  dependent: 0  sibling: 0
>>>>>       next-interp: 0  dead-savings: 0
>>>>>
>>>>> Before the patch, the strength reduction candidate chains for the three
>>>>>
>>>>> CAND_REFs are:
>>>>>
>>>>>    _6 ->  6 ->  8
>>>>>    _13 ->  11
>>>>>
>>>>> i.e. SLSR recognizes the first two references share the same basis,
>>>>> while the last one is on it own.
>>>>>
>>>>> With the patch, an extra candidate chain can be recognized:
>>>>>
>>>>>    a2_5(D) + (sizetype) i_1(D) * 80 ->  6 ->  11 ->  8
>>>>>
>>>>> i.e. all of the three references are found to have the same basis
>>>>> (a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded
>>>>> _6
>>>>> or _13, with the immediate offset removed.  The pass is now able to
>>>>> lower all of the three references, instead of the first two only, to
>>>>> MEM_REFs.
>>>>
>>>> Ok, so slsr handles arbitrary complex bases and figures out common components? If so, then why not just use get_inner_reference? After all slsr does not use tree-affine as representation for bases (which it could?)
>>>
>>> I think that's overstating SLSR's current capabilities a bit. :)  We do
>>> use get_inner_reference to come up with the base expression for
>>> reference candidates (based on some of your suggestions a couple of
>>> years back).  However, in the case of multiple levels of array
>>> references, we miss opportunities because get_inner_reference stops at
>>> an SSA name that could be further expanded by following its definition
>>> back to a more fundamental base expression.
>>
>> Using tree-affine.c to_affine_comb / affine_comb_to_tree has exactly the
>> same problem.
>>
>>> Part of the issue here is that reference candidates are basis for a more
>>> specific optimization than the mult and add candidates.  The latter have
>>> a more general framework for building up a recording of simple affine
>>> expressions that can be strength-reduced.  Ultimately we ought to be
>>> able to do something similar for reference candidates, building up
>>> simple affine expressions from base expressions, so that everything is
>>> done in a forward order and the tree-affine interfaces aren't needed.
>>> But that will take some more fundamental design changes, and since this
>>> provides some good improvements for important cases, I feel it's
>>> reasonable to get this into the release.
>>
>> But I fail to see what is special about doing the dance to affine and
>> then back to trees just to drop the constant offset which would be
>> done by get_inner_reference as well and cheaper if you just ignore
>> bitpos.
>
> I'm not sure what you're suggesting that he use get_inner_reference on
> at this point.  At the point where the affine machinery is invoked, the
> memory reference was already expanded with get_inner_reference, and
> there was no basis involving the SSA name produced as the base.  The
> affine machinery is invoked on that SSA name to see if it is hiding
> another base.  There's no additional memory reference to use
> get_inner_reference on, just potentially some pointer arithmetic.
>
> That said, if we have real compile-time issues, we should hold off on
> this patch for this release.
>
> Yufeng, please time some reasonably large benchmarks (some version of
> SPECint or similar) and report back here before the patch goes in.

I've got some build time data for SPEC2Kint.

On x86_64 the -O3 builds take about 4m7.5s with or without the patch
(consistent over 3 samples).  The difference in the -O3 build time on
ARM Cortex-A15 is also within 2 seconds.

The bootstrapping time on x86_64 is 134m48.040s without the patch and
134m46.889s with the patch.  This data is preliminary as I only sampled
once; the difference in bootstrapping time on ARM Cortex-A15 is likewise
within 5 seconds.
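
To put the x86_64 bootstrap figures in proportion, the arithmetic can be
sketched as below.  The helper names to_seconds and
relative_change_percent are made up for this illustration; since each
configuration was sampled once, the result says only that the difference
is small, not which build is faster.

```c
#include <assert.h>

/* Convert a reading of MINUTES and SECONDS, as printed by "time",
   into seconds.  */
static double
to_seconds (int minutes, double seconds)
{
  return minutes * 60.0 + seconds;
}

/* Relative change of WITH versus WITHOUT, in percent.  Negative means
   the patched build was faster.  */
static double
relative_change_percent (double without, double with)
{
  assert (without > 0.0);
  return (with - without) / without * 100.0;
}
```

Feeding in 134m48.040s and 134m46.889s gives a difference of about 1.15
seconds on a run of roughly 8088 seconds, i.e. on the order of one
hundredth of a percent.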

I can further time SPEC2006int if necessary.

I've also prepared a patch that further reduces the number of calls to
the tree-affine expansion by checking whether the BASE passed to
get_alternative_base () is simply an SSA_NAME of a declared variable.
Please see the inline patch below.

Thanks,
Yufeng


diff --git a/gcc/gimple-ssa-strength-reduction.c b/gcc/gimple-ssa-strength-reduction.c
index 26502c3..2984f06 100644
--- a/gcc/gimple-ssa-strength-reduction.c
+++ b/gcc/gimple-ssa-strength-reduction.c
@@ -437,13 +437,22 @@ get_alternative_base (tree base)

    if (result == NULL)
      {
-      tree expr;
-      aff_tree aff;
+      tree expr = NULL;
+      gimple def = NULL;

-      tree_to_aff_combination_expand (base, TREE_TYPE (base),
-				      &aff, &name_expansions);
-      aff.offset = tree_to_double_int (integer_zero_node);
-      expr = aff_combination_to_tree (&aff);
+      if (TREE_CODE (base) == SSA_NAME)
+	def = SSA_NAME_DEF_STMT (base);
+
+      /* Avoid calling expensive tree-affine expansion if BASE
+         is just an SSA_NAME of, e.g. a PARM_DECL.  */
+      if (!def || (is_gimple_assign (def) && gimple_assign_lhs (def) == base))
+	{
+	  aff_tree aff;
+	  tree_to_aff_combination_expand (base, TREE_TYPE (base),
+					  &aff, &name_expansions);
+	  aff.offset = tree_to_double_int (integer_zero_node);
+	  expr = aff_combination_to_tree (&aff);
+	}

        result = (tree *) pointer_map_insert (alt_base_map, base);
        gcc_assert (!*result);
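
The idea behind the guard and the offset-dropping expansion can be
modelled outside GCC roughly as follows.  This is a sketch under
simplifying assumptions: SSA names are a tiny hand-rolled struct, type
conversions are ignored, and every identifier (ssa_name, affine, expand,
alternative_base, same_terms) is hypothetical rather than taken from
tree-affine.c or the pass.

```c
#include <assert.h>
#include <stddef.h>

/* How a modelled SSA name is defined.  */
enum def_kind
{
  DEF_PARAM,      /* default definition, e.g. a2_5(D) */
  DEF_ADD_CONST,  /* op0 + cst */
  DEF_ADD_NAMES,  /* op0 + op1 */
  DEF_MUL_CONST   /* op0 * cst */
};

struct ssa_name
{
  enum def_kind kind;
  const struct ssa_name *op0, *op1;
  long cst;
};

#define MAX_TERMS 4

/* Models sum (coef[k] * leaf[k]) + offset.  */
struct affine
{
  const struct ssa_name *leaf[MAX_TERMS];
  long coef[MAX_TERMS];
  int n;
  long offset;
};

/* Add COEF * LEAF to AFF, merging with an existing term for LEAF.  */
static void
add_term (struct affine *aff, const struct ssa_name *leaf, long coef)
{
  for (int k = 0; k < aff->n; k++)
    if (aff->leaf[k] == leaf)
      {
        aff->coef[k] += coef;
        return;
      }
  assert (aff->n < MAX_TERMS);
  aff->leaf[aff->n] = leaf;
  aff->coef[aff->n] = coef;
  aff->n++;
}

/* Accumulate SCALE * NAME into AFF by following definitions down to
   parameters.  */
static void
expand (const struct ssa_name *name, long scale, struct affine *aff)
{
  switch (name->kind)
    {
    case DEF_PARAM:
      add_term (aff, name, scale);
      break;
    case DEF_ADD_CONST:
      expand (name->op0, scale, aff);
      aff->offset += scale * name->cst;
      break;
    case DEF_ADD_NAMES:
      expand (name->op0, scale, aff);
      expand (name->op1, scale, aff);
      break;
    case DEF_MUL_CONST:
      expand (name->op0, scale * name->cst, aff);
      break;
    }
}

/* Return 0 without expanding when BASE is a plain parameter; otherwise
   store the expansion of BASE in OUT with the constant offset dropped
   and return 1.  */
static int
alternative_base (const struct ssa_name *base, struct affine *out)
{
  if (base->kind == DEF_PARAM)
    return 0;
  out->n = 0;
  out->offset = 0;
  expand (base, 1, out);
  out->offset = 0;
  return 1;
}

/* Return 1 if A and B have the same terms, in any order.  */
static int
same_terms (const struct affine *a, const struct affine *b)
{
  if (a->n != b->n)
    return 0;
  for (int i = 0; i < a->n; i++)
    {
      int found = 0;
      for (int j = 0; j < b->n; j++)
        if (a->leaf[i] == b->leaf[j] && a->coef[i] == b->coef[j])
          found = 1;
      if (!found)
        return 0;
    }
  return 1;
}
```

Modelling _6 = a2 + (i * 80 + 800) and _13 = a2 + (i * 80 + 1600) from
the earlier example, both names expand to the terms a2 and 80 * i once
the offset is dropped, while a bare parameter such as a2 is rejected
before any expansion work is done.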

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PING] [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-12-05 12:02                                     ` Yufeng Zhang
@ 2013-12-05 13:22                                       ` Bill Schmidt
  2013-12-05 14:01                                         ` Yufeng Zhang
  0 siblings, 1 reply; 34+ messages in thread
From: Bill Schmidt @ 2013-12-05 13:22 UTC (permalink / raw)
  To: Yufeng Zhang; +Cc: Richard Biener, Jeff Law, gcc-patches

On Thu, 2013-12-05 at 12:02 +0000, Yufeng Zhang wrote:
> On 12/04/13 13:08, Bill Schmidt wrote:
> > On Wed, 2013-12-04 at 11:26 +0100, Richard Biener wrote:
> >> On Tue, Dec 3, 2013 at 11:04 PM, Bill Schmidt
> >> <wschmidt@linux.vnet.ibm.com>  wrote:
> >>> On Tue, 2013-12-03 at 21:35 +0100, Richard Biener wrote:
> >>>> Yufeng Zhang<Yufeng.Zhang@arm.com>  wrote:
> >>>>> On 12/03/13 14:20, Richard Biener wrote:
> >>>>>> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang<Yufeng.Zhang@arm.com>
> >>>>> wrote:
> >>>>>>> On 12/03/13 06:48, Jeff Law wrote:
> >>>>>>>>
> >>>>>>>> On 12/02/13 08:47, Yufeng Zhang wrote:
> >>>>>>>>>
> >>>>>>>>> Ping~
> >>>>>>>>>
> >>>>>>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Yufeng
> >>>>>>>>>
> >>>>>>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 11/26/13 12:45, Richard Biener wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
> >>>>>>>>>>> Zhang<Yufeng.Zhang@arm.com>      wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The second version of your original patch is ok with me with
> >>>>> the
> >>>>>>>>>>>>> following changes.  Sorry for the little side adventure into
> >>>>> the
> >>>>>>>>>>>>> next-interp logic; in the end that's going to hurt more than
> >>>>> it
> >>>>>>>>>>>>> helps in
> >>>>>>>>>>>>> this case.  Thanks for having a look at it, anyway.  Thanks
> >>>>> also for
> >>>>>>>>>>>>> cleaning up this version to be less intrusive to common
> >>>>> interfaces; I
> >>>>>>>>>>>>> appreciate it.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks a lot for the review.  I've attached an updated patch
> >>>>> with the
> >>>>>>>>>>>> suggested changes incorporated.
> >>>>>>>>>>>>
> >>>>>>>>>>>> For the next-interp adventure, I was quite happy to do the
> >>>>>>>>>>>> experiment; it's
> >>>>>>>>>>>> a good chance of gaining insight into the pass.  Many thanks
> >>>>> for
> >>>>>>>>>>>> your prompt
> >>>>>>>>>>>> replies and patience in guiding!
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
> >>>>>>>>>>>>> approval,
> >>>>>>>>>>>>> as I'm not a maintainer.
> >>>>>>>>
> >>>>>>>> First a note, I need to check on voting for Bill as the slsr
> >>>>> maintainer
> >>>>>>>> from the steering committee.   Voting was in progress just before
> >>>>> the
> >>>>>>>> close of stage1 development so I haven't tallied the results :-)
> >>>>>>>
> >>>>>>>
> >>>>>>> Looking forward to some good news! :)
> >>>>>>>
> >>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Yes, you are right about the non-trivial 'base' tree are rarely
> >>>>> shared.
> >>>>>>>>>>       The cached is introduced mainly because get_alternative_base
> >>>>> () may
> >>>>>>>>>> be
> >>>>>>>>>> called twice on the same 'base' tree, once in the
> >>>>>>>>>> find_basis_for_candidate () for look-up and the other time in
> >>>>>>>>>> alloc_cand_and_find_basis () for record_potential_basis ().  I'm
> >>>>> happy
> >>>>>>>>>> to leave out the cache if you think the benefit is trivial.
> >>>>>>>>
> >>>>>>>> Without some sense of how expensive the lookups are vs how often
> >>>>> the
> >>>>>>>> cache hits it's awful hard to know if the cache is worth it.
> >>>>>>>>
> >>>>>>>> I'd say take it out unless you have some sense it's really saving
> >>>>> time.
> >>>>>>>>      It's a pretty minor implementation detail either way.
> >>>>>>>
> >>>>>>>
> >>>>>>> I think the affine tree routines are generally expensive; it is
> >>>>> worth having
> >>>>>>> a cache to avoid calling them too many times.  I run the slsr-*.c
> >>>>> tests
> >>>>>>> under gcc.dg/tree-ssa/ and find out that the cache hit rates range
> >>>>> from
> >>>>>>> 55.6% to 90%, with 73.5% as the average.  The samples may not well
> >>>>> represent
> >>>>>>> the real world scenario, but they do show the fact that the 'base'
> >>>>> tree can
> >>>>>>> be shared to some extent.  So I'd like to have the cache in the
> >>>>> patch.
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> +/* { dg-do compile } */
> >>>>>>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
> >>>>>>>>>>> +
> >>>>>>>>>>> +typedef int arr_2[50][50];
> >>>>>>>>>>> +
> >>>>>>>>>>> +void foo (arr_2 a2, int v1)
> >>>>>>>>>>> +{
> >>>>>>>>>>> +  int i, j;
> >>>>>>>>>>> +
> >>>>>>>>>>> +  i = v1 + 5;
> >>>>>>>>>>> +  j = i;
> >>>>>>>>>>> +  a2 [i-10] [j] = 2;
> >>>>>>>>>>> +  a2 [i] [j++] = i;
> >>>>>>>>>>> +  a2 [i+20] [j++] = i;
> >>>>>>>>>>> +  a2 [i-3] [i-1] += 1;
> >>>>>>>>>>> +  return;
> >>>>>>>>>>> +}
> >>>>>>>>>>> +
> >>>>>>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
> >>>>>>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
> >>>>>>>>>>>
> >>>>>>>>>>> scanning for 5 MEMs looks non-sensical.  What transform do
> >>>>>>>>>>> you expect?  I see other slsr testcases do similar non-sensical
> >>>>>>>>>>> checking which is bad, too.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> As the slsr optimizes CAND_REF candidates by simply lowering them
> >>>>> to
> >>>>>>>>>> MEM_REF from e.g. ARRAY_REF, I think scanning for the number of
> >>>>> MEM_REFs
> >>>>>>>>>> is an effective check.  Alternatively, I can add a follow-up
> >>>>> patch to
> >>>>>>>>>> add some dumping facility in replace_ref () to print out the
> >>>>> replacing
> >>>>>>>>>> actions when -fdump-tree-slsr-details is on.
> >>>>>>>>
> >>>>>>>> I think adding some details to the dump and scanning for them would
> >>>>> be
> >>>>>>>> better.  That's the only change that is required for this to move
> >>>>> forward.
> >>>>>>>
> >>>>>>>
> >>>>>>> I've updated to patch to dump more details when
> >>>>> -fdump-tree-slsr-details is
> >>>>>>> on.  The tests have also been updated to scan for these new dumps
> >>>>> instead of
> >>>>>>> MEMs.
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> I suggest doing it quickly.  We're well past stage1 close at this
> >>>>> point.
> >>>>>>>
> >>>>>>>
> >>>>>>> The bootstrapping on x86_64 is still running.  OK to commit if it
> >>>>> succeeds?
> >>>>>>
> >>>>>> I still don't like it.  It's using the wrong and too expensive tools
> >>>>> to do
> >>>>>> stuff.  What kind of bases are we ultimately interested in?  Browsing
> >>>>>> the code it looks like we're having
> >>>>>>
> >>>>>>     /* Base expression for the chain of candidates:  often, but not
> >>>>>>        always, an SSA name.  */
> >>>>>>     tree base_expr;
> >>>>>>
> >>>>>> which isn't really too informative but I suppose they are all
> >>>>>> kind-of-gimple_val()s?  That said, I wonder if you can simply
> >>>>>> use get_addr_base_and_unit_offset in place of get_alternative_base
> >>>>> (),
> >>>>>> ignoring the returned offset.
> >>>>>
> >>>>> 'base_expr' is essentially the base address of a handled_component_p,
> >>>>> e.g. ARRAY_REF, COMPONENT_REF, etc.  In most case, it is the address of
> >>>>>
> >>>>> the object returned by get_inner_reference ().
> >>>>>
> >>>>> Given a test case like the following:
> >>>>>
> >>>>> typedef int arr_2[20][20];
> >>>>>
> >>>>> void foo (arr_2 a2, int i, int j)
> >>>>> {
> >>>>>    a2[i+10][j] = 1;
> >>>>>    a2[i+10][j+1] = 1;
> >>>>>    a2[i+20][j] = 1;
> >>>>> }
> >>>>>
> >>>>> The IR before SLSR is (on x86_64):
> >>>>>
> >>>>>    _2 = (long unsigned int) i_1(D);
> >>>>>    _3 = _2 * 80;
> >>>>>    _4 = _3 + 800;
> >>>>>    _6 = a2_5(D) + _4;
> >>>>>    *_6[j_8(D)] = 1;
> >>>>>    _10 = j_8(D) + 1;
> >>>>>    *_6[_10] = 1;
> >>>>>    _12 = _3 + 1600;
> >>>>>    _13 = a2_5(D) + _12;
> >>>>>    *_13[j_8(D)] = 1;
> >>>>>
> >>>>> The base_expr for the 1st and 2nd memory reference are the same, i.e.
> >>>>> _6, while the base_expr for a2[i+20][j] is _13.
> >>>>>
> >>>>> _13 is essentially (_6 + 800), so all of the three memory references
> >>>>> essentially share the same base address.  As their strides are also the
> >>>>>
> >>>>> same (MULT_EXPR (j, 4)), the three references can all be lowered to
> >>>>> MEM_REFs.  What this patch does is to use the tree affine tools to help
> >>>>>
> >>>>> recognize the underlying base address expression; as it requires
> >>>>> looking
> >>>>> into the definitions of SSA_NAMEs, get_addr_base_and_unit_offset ()
> >>>>> won't help here.
> >>>>>
> >>>>> Bill has helped me exploit other ways of achieving this in SLSR, but so
> >>>>>
> >>>>> far we think this is the best way to proceed.  The use of tree affine
> >>>>> routines has been restricted to CAND_REFs only and there is the
> >>>>> aforementioned cache facility to help reduce the overhead.
> >>>>>
> >>>>> Thanks,
> >>>>> Yufeng
> >>>>>
> >>>>> P.S. some more details what the patch does:
> >>>>>
> >>>>> The CAND_REF for the three memory references are:
> >>>>>
> >>>>>   6  [2] *_6[j_8(D)] = 1;
> >>>>>       REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> >>>>>       basis: 0  dependent: 8  sibling: 0
> >>>>>       next-interp: 0  dead-savings: 0
> >>>>>
> >>>>>    8  [2] *_6[_10] = 1;
> >>>>>       REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
> >>>>>       basis: 6  dependent: 11  sibling: 0
> >>>>>       next-interp: 0  dead-savings: 0
> >>>>>
> >>>>>   11  [2] *_13[j_8(D)] = 1;
> >>>>>       REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> >>>>>       basis: 8  dependent: 0  sibling: 0
> >>>>>       next-interp: 0  dead-savings: 0
> >>>>>
> >>>>> Before the patch, the strength reduction candidate chains for the three
> >>>>> CAND_REFs are:
> >>>>>
> >>>>>    _6 ->  6 ->  8
> >>>>>    _13 ->  11
> >>>>>
> >>>>> i.e. SLSR recognizes that the first two references share the same basis,
> >>>>> while the last one is on its own.
> >>>>>
> >>>>> With the patch, an extra candidate chain can be recognized:
> >>>>>
> >>>>>    a2_5(D) + (sizetype) i_1(D) * 80 ->  6 ->  11 ->  8
> >>>>>
> >>>>> i.e. all three references are found to have the same basis
> >>>>> (a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded
> >>>>> _6 or _13, with the immediate offset removed.  The pass is now able to
> >>>>> lower all three references, instead of the first two only, to
> >>>>> MEM_REFs.
> >>>>
> >>>> Ok, so slsr handles arbitrarily complex bases and figures out common
> >>>> components?  If so, then why not just use get_inner_reference?  After
> >>>> all, slsr does not use tree-affine as the representation for bases (which
> >>>> it could?).
> >>>
> >>> I think that's overstating SLSR's current capabilities a bit. :)  We do
> >>> use get_inner_reference to come up with the base expression for
> >>> reference candidates (based on some of your suggestions a couple of
> >>> years back).  However, in the case of multiple levels of array
> >>> references, we miss opportunities because get_inner_reference stops at
> >>> an SSA name that could be further expanded by following its definition
> >>> back to a more fundamental base expression.
> >>
> >> Using tree-affine.c to_affine_comb / affine_comb_to_tree has exactly the
> >> same problem.
> >>
> >>> Part of the issue here is that reference candidates are the basis for a more
> >>> specific optimization than the mult and add candidates.  The latter have
> >>> a more general framework for building up a record of simple affine
> >>> expressions that can be strength-reduced.  Ultimately we ought to be
> >>> able to do something similar for reference candidates, building up
> >>> simple affine expressions from base expressions, so that everything is
> >>> done in a forward order and the tree-affine interfaces aren't needed.
> >>> But that will take some more fundamental design changes, and since this
> >>> provides some good improvements for important cases, I feel it's
> >>> reasonable to get this into the release.
> >>
> >> But I fail to see what is special about doing the dance to affine and
> >> then back to trees just to drop the constant offset which would be
> >> done by get_inner_reference as well and cheaper if you just ignore
> >> bitpos.
> >
> > I'm not sure what you're suggesting that he use get_inner_reference on
> > at this point.  At the point where the affine machinery is invoked, the
> > memory reference was already expanded with get_inner_reference, and
> > there was no basis involving the SSA name produced as the base.  The
> > affine machinery is invoked on that SSA name to see if it is hiding
> > another base.  There's no additional memory reference to use
> > get_inner_reference on, just potentially some pointer arithmetic.
> >
> > That said, if we have real compile-time issues, we should hold off on
> > this patch for this release.
> >
> > Yufeng, please time some reasonably large benchmarks (some version of
> > SPECint or similar) and report back here before the patch goes in.
> 
> I've got some build time data for SPEC2Kint.
> 
> On x86_64 the -O3 builds take about 4m7.5s with or without the patch 
> (consistent over 3 samples).  The difference of the -O3 build time on 
> arm cortex-a15 is also within 2 seconds.
> 
> The bootstrapping time on x86_64 is 134m48.040s without the patch and 
> 134m46.889s with the patch; this data is preliminary as I only sampled 
> once, but the difference of the bootstrapping time on arm cortex-a15 is 
> also within 5 seconds.
> 
> I can further time SPEC2006int if necessary.
> 
> I've also prepared a patch to further reduce the number of calls to 
> tree-affine expansion, by checking whether or not the passed-in BASE in 
> get_alternative_base () is simply an SSA_NAME of a declared variable. 
> Please see the inlined patch below.
> 
> Thanks,
> Yufeng
> 
> 
> diff --git a/gcc/gimple-ssa-strength-reduction.c b/gcc/gimple-ssa-strength-reduction.c
> index 26502c3..2984f06 100644
> --- a/gcc/gimple-ssa-strength-reduction.c
> +++ b/gcc/gimple-ssa-strength-reduction.c
> @@ -437,13 +437,22 @@ get_alternative_base (tree base)
> 
>     if (result == NULL)
>       {
> -      tree expr;
> -      aff_tree aff;
> +      tree expr = NULL;
> +      gimple def = NULL;
> 
> -      tree_to_aff_combination_expand (base, TREE_TYPE (base),
> -				      &aff, &name_expansions);
> -      aff.offset = tree_to_double_int (integer_zero_node);
> -      expr = aff_combination_to_tree (&aff);
> +      if (TREE_CODE (base) == SSA_NAME)
> +	def = SSA_NAME_DEF_STMT (base);
> +
> +      /* Avoid calling expensive tree-affine expansion if BASE
> +         is just an SSA_NAME of, e.g. a para_decl.  */
> +      if (!def || (is_gimple_assign (def) && gimple_assign_lhs (def) == base))

Well, that just isn't right.  !def indicates you have a parameter, so
why call tree_to_aff_combination_expand in that case?  Just forget this
addition and check for flag_expensive_optimizations as Richard suggested
in another branch of this thread.

Previous version of the patch is ok with this change, and with a comment
added that we should eliminate this backtracking with better forward
analysis in a future release.

Thanks,
Bill

> +	{
> +	  aff_tree aff;
> +	  tree_to_aff_combination_expand (base, TREE_TYPE (base),
> +					  &aff, &name_expansions);
> +	  aff.offset = tree_to_double_int (integer_zero_node);
> +	  expr = aff_combination_to_tree (&aff);
> +	}
> 
>         result = (tree *) pointer_map_insert (alt_base_map, base);
>         gcc_assert (!*result);
> 
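
For readers following the address arithmetic in the quoted test case, here is a minimal sketch in plain C (not GCC code). The helper names ref_offset and ref_offset_by_pointer are hypothetical and exist only for this illustration. It shows that a2[i+10][j], a2[i+10][j+1] and a2[i+20][j] differ from one another only by constant byte offsets, which is why they can share one base once the constant is split off. The figures 80, 800 and 4 in the dumps correspond to a 4-byte int; the code uses sizeof so it does not depend on that.

```c
#include <stddef.h>

typedef int arr_2[20][20];

/* Byte offset of a2[row][col] from the start of a2, computed the way the
   quoted GIMPLE does it: row * sizeof (int[20]) + col * sizeof (int).
   Hypothetical helper, for illustration only.  */
static size_t
ref_offset (size_t row, size_t col)
{
  return row * sizeof (int[20]) + col * sizeof (int);
}

/* The same offset obtained from real pointer arithmetic, as a cross-check.
   The parameter decays to int (*)[20], so any array of int[20] rows that
   is large enough can be passed.  */
static size_t
ref_offset_by_pointer (arr_2 a2, size_t row, size_t col)
{
  return (size_t) ((char *) &a2[row][col] - (char *) a2);
}
```

With these helpers, ref_offset (i + 20, j) - ref_offset (i + 10, j) is always 10 * sizeof (int[20]) regardless of i and j, which matches the "_13 is essentially (_6 + 800)" observation when int is 4 bytes wide.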


* Re: [PING] [PATCH] Optional alternative base_expr in finding basis for CAND_REFs
  2013-12-05 13:22                                       ` Bill Schmidt
@ 2013-12-05 14:01                                         ` Yufeng Zhang
  0 siblings, 0 replies; 34+ messages in thread
From: Yufeng Zhang @ 2013-12-05 14:01 UTC (permalink / raw)
  To: Bill Schmidt; +Cc: Richard Biener, Jeff Law, gcc-patches

On 12/05/13 13:21, Bill Schmidt wrote:
> On Thu, 2013-12-05 at 12:02 +0000, Yufeng Zhang wrote:
>> On 12/04/13 13:08, Bill Schmidt wrote:
>>> On Wed, 2013-12-04 at 11:26 +0100, Richard Biener wrote:
[snip]
>>>
>>> I'm not sure what you're suggesting that he use get_inner_reference on
>>> at this point.  At the point where the affine machinery is invoked, the
>>> memory reference was already expanded with get_inner_reference, and
>>> there was no basis involving the SSA name produced as the base.  The
>>> affine machinery is invoked on that SSA name to see if it is hiding
>>> another base.  There's no additional memory reference to use
>>> get_inner_reference on, just potentially some pointer arithmetic.
>>>
>>> That said, if we have real compile-time issues, we should hold off on
>>> this patch for this release.
>>>
>>> Yufeng, please time some reasonably large benchmarks (some version of
>>> SPECint or similar) and report back here before the patch goes in.
>>
>> I've got some build time data for SPEC2Kint.
>>
>> On x86_64 the -O3 builds take about 4m7.5s with or without the patch
>> (consistent over 3 samples).  The difference of the -O3 build time on
>> arm cortex-a15 is also within 2 seconds.
>>
>> The bootstrapping time on x86_64 is 134m48.040s without the patch and
>> 134m46.889s with the patch; this data is preliminary as I only sampled
>> once, but the difference of the bootstrapping time on arm cortex-a15 is
>> also within 5 seconds.
>>
>> I can further time SPEC2006int if necessary.
>>
>> I've also prepared a patch to further reduce the number of calls to
>> tree-affine expansion, by checking whether or not the passed-in BASE in
>> get_alternative_base () is simply an SSA_NAME of a declared variable.
>> Please see the inlined patch below.
>>
>> Thanks,
>> Yufeng
>>
>>
>> diff --git a/gcc/gimple-ssa-strength-reduction.c b/gcc/gimple-ssa-strength-reduction.c
>> index 26502c3..2984f06 100644
>> --- a/gcc/gimple-ssa-strength-reduction.c
>> +++ b/gcc/gimple-ssa-strength-reduction.c
>> @@ -437,13 +437,22 @@ get_alternative_base (tree base)
>>
>>      if (result == NULL)
>>        {
>> -      tree expr;
>> -      aff_tree aff;
>> +      tree expr = NULL;
>> +      gimple def = NULL;
>>
>> -      tree_to_aff_combination_expand (base, TREE_TYPE (base),
>> -				      &aff, &name_expansions);
>> -      aff.offset = tree_to_double_int (integer_zero_node);
>> -      expr = aff_combination_to_tree (&aff);
>> +      if (TREE_CODE (base) == SSA_NAME)
>> +     def = SSA_NAME_DEF_STMT (base);
>> +
>> +      /* Avoid calling expensive tree-affine expansion if BASE
>> +         is just an SSA_NAME of, e.g. a para_decl.  */
>> +      if (!def || (is_gimple_assign (def) && gimple_assign_lhs (def) == base))
>
> Well, that just isn't right.  !def indicates you have a parameter, so
> why call tree_to_aff_combination_expand in that case?  Just forget this
> addition and check for flag_expensive_optimizations as Richard suggested
> in another branch of this thread.

I thought every SSA_NAME had a DEF_STMT; at least in the cases I checked,
they were GIMPLE_NOPs.  That's why I used !def to check for cases where
BASE is not an SSA_NAME (a bad programming habit, I guess).

Anyway, I'll leave out this addition.

> Previous version of the patch is ok with this change, and with a comment
> added that we should eliminate this backtracking with better forward
> analysis in a future release.

Thanks.  The following inlined diff is the incremental change.

Thanks again for your review and help.

Regards,
Yufeng


diff --git a/gcc/gimple-ssa-strength-reduction.c b/gcc/gimple-ssa-strength-reduction.c
index 26502c3..f406794 100644
--- a/gcc/gimple-ssa-strength-reduction.c
+++ b/gcc/gimple-ssa-strength-reduction.c
@@ -428,7 +428,10 @@ static struct pointer_map_t *alt_base_map;

  /* Given BASE, use the tree affine combiniation facilities to
     find the underlying tree expression for BASE, with any
-   immediate offset excluded.  */
+   immediate offset excluded.
+
+   N.B. we should eliminate this backtracking with better forward
+   analysis in a future release.  */

  static tree
  get_alternative_base (tree base)
@@ -556,7 +559,7 @@ find_basis_for_candidate (slsr_cand_t c)
  	}
      }

-  if (!basis && c->kind == CAND_REF)
+  if (flag_expensive_optimizations && !basis && c->kind == CAND_REF)
      {
        tree alt_base_expr = get_alternative_base (c->base_expr);
        if (alt_base_expr)
@@ -641,7 +644,7 @@ alloc_cand_and_find_basis (enum cand_kind kind, gimple gs, tree base,
      c->basis = find_basis_for_candidate (c);

    record_potential_basis (c, base);
-  if (kind == CAND_REF)
+  if (flag_expensive_optimizations && kind == CAND_REF)
      {
        tree alt_base = get_alternative_base (base);
        if (alt_base)
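
As a rough illustration of the idea behind get_alternative_base (), the following is a toy model in plain C, not GCC code. It assumes a base address can be written as sym + coeff * i + offset; the type toy_aff and the helpers toy_alt_base and toy_same_base are hypothetical names invented for this sketch, and GCC's real aff_tree is far more general. Zeroing the constant offset is the step that lets two different SSA-name bases compare equal.

```c
#include <stdbool.h>
#include <string.h>

/* Toy stand-in for an affine combination: sym + coeff * i + offset.
   Hypothetical type, for illustration only.  */
struct toy_aff
{
  const char *sym;   /* symbolic base, e.g. "a2" */
  long coeff;        /* multiplier on the index variable i */
  long offset;       /* constant byte offset */
};

/* Model of the alternative base: the same expression with the
   immediate offset dropped.  */
static struct toy_aff
toy_alt_base (struct toy_aff a)
{
  a.offset = 0;
  return a;
}

/* Two bases match only if every component is identical.  */
static bool
toy_same_base (struct toy_aff a, struct toy_aff b)
{
  return strcmp (a.sym, b.sym) == 0
	 && a.coeff == b.coeff
	 && a.offset == b.offset;
}
```

Modelling _6 as { "a2", 80, 800 } and _13 as { "a2", 80, 1600 }, toy_same_base () rejects the pair as written but accepts them once both have gone through toy_alt_base (), mirroring the extra candidate chain described earlier in the thread.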


end of thread, other threads:[~2013-12-05 14:01 UTC | newest]

Thread overview: 34+ messages
2013-11-04 18:46 [PATCH] Optional alternative base_expr in finding basis for CAND_REFs Yufeng Zhang
2013-11-11 18:10 ` Bill Schmidt
2013-11-12 23:44   ` Yufeng Zhang
2013-11-13 21:12     ` Bill Schmidt
2013-11-13 22:29       ` Yufeng Zhang
2013-11-13 22:30         ` Bill Schmidt
2013-11-13 23:14           ` Bill Schmidt
2013-11-13 23:25             ` Bill Schmidt
2013-11-14  4:07             ` Yufeng Zhang
2013-11-19 12:32               ` [PING] " Yufeng Zhang
2013-11-26 14:53                 ` [PING^2] " Yufeng Zhang
2013-11-26 15:22               ` Richard Biener
2013-11-26 18:06                 ` Yufeng Zhang
2013-12-02 15:48                   ` [PING] " Yufeng Zhang
2013-12-03  6:50                     ` Jeff Law
2013-12-03 12:51                       ` Yufeng Zhang
2013-12-03 14:21                         ` Richard Biener
2013-12-03 15:52                           ` Yufeng Zhang
2013-12-03 19:21                             ` Jeff Law
2013-12-03 20:32                             ` Richard Biener
2013-12-03 21:57                               ` Yufeng Zhang
2013-12-03 22:19                                 ` Bill Schmidt
2013-12-03 22:04                               ` Bill Schmidt
2013-12-04 10:26                                 ` Richard Biener
2013-12-04 10:30                                   ` Richard Biener
2013-12-04 11:32                                     ` Yufeng Zhang
2013-12-04 13:24                                       ` Bill Schmidt
2013-12-04 13:14                                     ` Bill Schmidt
2013-12-04 13:28                                       ` Bill Schmidt
2013-12-05  8:49                                         ` Richard Biener
2013-12-04 13:08                                   ` Bill Schmidt
2013-12-05 12:02                                     ` Yufeng Zhang
2013-12-05 13:22                                       ` Bill Schmidt
2013-12-05 14:01                                         ` Yufeng Zhang
