[PATCH][0/n] Merge from match-and-simplify

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [PATCH][0/n] Merge from match-and-simplify
@ 2014-10-15 13:20 Richard Biener
  2014-10-15 16:30 ` Kyrill Tkachov
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Richard Biener @ 2014-10-15 13:20 UTC (permalink / raw)
  To: gcc-patches

I have posted 5 patches as part of a larger series to merge
(parts) from the match-and-simplify branch.  While I think
there was overall consensus that the idea behind the project
is sound there are technical questions left for how the
thing should look in the end.  I've raised them in 3/n
which is the only patch of the series that contains any
patterns sofar.

To re-iterate here (as I expect most people will only look
at [0/n] patches ;)), the question is whether we are fine
with making fold-const (thus fold_{unary,binary,ternary})
not handle some cases it handles currently.

Some cases may result from the fact that the autogenerated
code from genmatch does not apply selective STRIP_NOPS/STRIP_SIGN_NOPS
to the outermost expression operands as fold does.

Some cases may result from the fact that the autogenerated
code bails out on operands with side-effects (patterns
really target GIMPLE where operands never have side-effects).

In 10 years (well, maybe earlier) we shouldn't do any
expression simplification from the frontends (and thus
GENERIC) if not explicitely required by language standards.
Thus I expect fold-const.c to "vanish" anyway.

Generally it is of course impossible to prove that adding
a pattern and removing existing fold-const.c and
gimple-fold.c/tree-ssa-forwprop.c code will not regress.
But I suppose that's expected and passing the testsuite
is fine (together with possibly amending it with testcases
for foldings that now should apply on GIMPLE).

For stage1 I'd like to merge [1/n] to [5/n] as one revision
(well, excluding 3/n) and add patterns in smaller chunks
to be able to bisect to issues that show up.

Any comments and reviews welcome (I don't think that
my maintainership covers enough to simply check this in
without approval).

Thanks,
Richard.

-- 
Richard Biener <rguenther@suse.de>
SUSE / SUSE Labs
SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
GF: Jeff Hawn, Jennifer Guild, Felix Imend"orffer

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH][0/n] Merge from match-and-simplify
  2014-10-15 13:20 [PATCH][0/n] Merge from match-and-simplify Richard Biener
@ 2014-10-15 16:30 ` Kyrill Tkachov
  2014-10-17  7:39   ` Ramana Radhakrishnan
  2014-10-15 17:13 ` Jakub Jelinek
  2014-10-16 20:43 ` Sebastian Pop
  2 siblings, 1 reply; 15+ messages in thread
From: Kyrill Tkachov @ 2014-10-15 16:30 UTC (permalink / raw)
  To: Richard Biener, gcc-patches


On 15/10/14 14:00, Richard Biener wrote:
>
> Any comments and reviews welcome (I don't think that
> my maintainership covers enough to simply check this in
> without approval).
>
Hi Richard,

The match-and-simplify branch bootstrapped successfully on 
aarch64-none-linux-gnu FWIW.

Thanks,
Kyrill


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH][0/n] Merge from match-and-simplify
  2014-10-15 16:30 ` Kyrill Tkachov
@ 2014-10-17  7:39   ` Ramana Radhakrishnan
  2014-10-17  8:24     ` Richard Biener
  0 siblings, 1 reply; 15+ messages in thread
From: Ramana Radhakrishnan @ 2014-10-17  7:39 UTC (permalink / raw)
  To: Kyrill Tkachov; +Cc: Richard Biener, gcc-patches

On Wed, Oct 15, 2014 at 5:29 PM, Kyrill Tkachov <kyrylo.tkachov@arm.com> wrote:
>
> On 15/10/14 14:00, Richard Biener wrote:
>>
>>
>> Any comments and reviews welcome (I don't think that
>> my maintainership covers enough to simply check this in
>> without approval).
>>
> Hi Richard,
>
> The match-and-simplify branch bootstrapped successfully on
> aarch64-none-linux-gnu FWIW.
>

What about regression tests ?

> Thanks,
> Kyrill
>
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH][0/n] Merge from match-and-simplify
  2014-10-17  7:39   ` Ramana Radhakrishnan
@ 2014-10-17  8:24     ` Richard Biener
  2014-10-17 11:58       ` Richard Biener
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Biener @ 2014-10-17  8:24 UTC (permalink / raw)
  To: ramrad01; +Cc: Kyrill Tkachov, gcc-patches

On Fri, 17 Oct 2014, Ramana Radhakrishnan wrote:

> On Wed, Oct 15, 2014 at 5:29 PM, Kyrill Tkachov <kyrylo.tkachov@arm.com> wrote:
> >
> > On 15/10/14 14:00, Richard Biener wrote:
> >>
> >>
> >> Any comments and reviews welcome (I don't think that
> >> my maintainership covers enough to simply check this in
> >> without approval).
> >>
> > Hi Richard,
> >
> > The match-and-simplify branch bootstrapped successfully on
> > aarch64-none-linux-gnu FWIW.
> >
> 
> What about regression tests ?

Note the branch isn't regression free on x86_64 either.  The branch
does more than I want to merge to trunk (and it also retains all
folding code I added patterns for).  I've gone farther there to
explore whether it will end up working in the end and what kind
of features the IL and the APIs need.

I've pasted testsuite results on x86_64 below for rev. 216324
which is based on trunk rev. 216315 which unfortunately has
lots of regressions on its own.

This is why I want to restrict the effect of the machinery to
fold (), fold_stmt () and tree-ssa-forwprop.c for the moment
and merge individual patterns (well, maybe in small groups)
separately to allow for easy bi-section.

I suppose I should push the most visible change to trunk first,
namely tree-ssa-forwprop.c folding all statements via fold_stmt
after the merge.  I suspect this alone can have some odd effects
like the sub + cmp fusing.  That would be sth like the patch
attached below.

Richard.

Index: gcc/tree-ssa-forwprop.c
===================================================================
--- gcc/tree-ssa-forwprop.c	(revision 216258)
+++ gcc/tree-ssa-forwprop.c	(working copy)
@@ -54,6 +54,8 @@ along with GCC; see the file COPYING3.
 #include "tree-ssa-propagate.h"
 #include "tree-ssa-dom.h"
 #include "builtins.h"
+#include "tree-cfgcleanup.h"
+#include "tree-into-ssa.h"
 
 /* This pass propagates the RHS of assignment statements into use
    sites of the LHS of the assignment.  It's basically a specialized
@@ -3586,6 +3588,8 @@ simplify_mult (gimple_stmt_iterator *gsi
 
   return false;
 }
+
+
 /* Main entry point for the forward propagation and statement combine
    optimizer.  */
 
@@ -3626,6 +3630,40 @@ pass_forwprop::execute (function *fun)
 
   cfg_changed = false;
 
+  /* Combine stmts with the stmts defining their operands.  Do that
+     in an order that guarantees visiting SSA defs before SSA uses.  */
+  int *postorder = XNEWVEC (int, n_basic_blocks_for_fn (fun));
+  int postorder_num = inverted_post_order_compute (postorder);
+  for (int i = 0; i < postorder_num; ++i)
+    {
+      bb = BASIC_BLOCK_FOR_FN (fun, postorder[i]);
+      for (gimple_stmt_iterator gsi = gsi_start_bb (bb);
+	   !gsi_end_p (gsi); gsi_next (&gsi))
+	{
+	  gimple stmt = gsi_stmt (gsi);
+	  gimple orig_stmt = stmt;
+
+	  if (fold_stmt (&gsi))
+	    {
+	      stmt = gsi_stmt (gsi);
+	      if (maybe_clean_or_replace_eh_stmt (orig_stmt, stmt)
+		  && gimple_purge_dead_eh_edges (bb))
+		cfg_changed = true;
+	      update_stmt (stmt);
+	    }
+	}
+    }
+  free (postorder);
+
+  /* ???  Code below doesn't expect non-renamed VOPs and the above
+     doesn't keep virtual operand form up-to-date.  */
+  if (cfg_changed)
+    {
+      cleanup_tree_cfg ();
+      cfg_changed = false;
+    }
+  update_ssa (TODO_update_ssa_only_virtuals);
+
   FOR_EACH_BB_FN (bb, fun)
     {
       gimple_stmt_iterator gsi;

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH][0/n] Merge from match-and-simplify
  2014-10-17  8:24     ` Richard Biener
@ 2014-10-17 11:58       ` Richard Biener
  0 siblings, 0 replies; 15+ messages in thread
From: Richard Biener @ 2014-10-17 11:58 UTC (permalink / raw)
  To: ramrad01; +Cc: Kyrill Tkachov, gcc-patches

On Fri, 17 Oct 2014, Richard Biener wrote:

> On Fri, 17 Oct 2014, Ramana Radhakrishnan wrote:
> 
> > On Wed, Oct 15, 2014 at 5:29 PM, Kyrill Tkachov <kyrylo.tkachov@arm.com> wrote:
> > >
> > > On 15/10/14 14:00, Richard Biener wrote:
> > >>
> > >>
> > >> Any comments and reviews welcome (I don't think that
> > >> my maintainership covers enough to simply check this in
> > >> without approval).
> > >>
> > > Hi Richard,
> > >
> > > The match-and-simplify branch bootstrapped successfully on
> > > aarch64-none-linux-gnu FWIW.
> > >
> > 
> > What about regression tests ?
> 
> Note the branch isn't regression free on x86_64 either.  The branch
> does more than I want to merge to trunk (and it also retains all
> folding code I added patterns for).  I've gone farther there to
> explore whether it will end up working in the end and what kind
> of features the IL and the APIs need.
> 
> I've pasted testsuite results on x86_64 below for rev. 216324
> which is based on trunk rev. 216315 which unfortunately has
> lots of regressions on its own.
> 
> This is why I want to restrict the effect of the machinery to
> fold (), fold_stmt () and tree-ssa-forwprop.c for the moment
> and merge individual patterns (well, maybe in small groups)
> separately to allow for easy bi-section.
> 
> I suppose I should push the most visible change to trunk first,
> namely tree-ssa-forwprop.c folding all statements via fold_stmt
> after the merge.  I suspect this alone can have some odd effects
> like the sub + cmp fusing.  That would be sth like the patch
> attached below.

Just finished testing this (with -m32 on x86_64), showing regressions
in the testsuite like

FAIL: gcc.dg/tree-ssa/slsr-19.c scan-tree-dump-times optimized " \\\\* y" 
1
FAIL: gcc.dg/vect/bb-slp-27.c -flto -ffat-lto-objects  
scan-tree-dump-times slp2
 "basic block vectorized" 1
FAIL: gcc.dg/vect/bb-slp-27.c scan-tree-dump-times slp2 "basic block 
vectorized"
 1
FAIL: gcc.dg/vect/bb-slp-8b.c -flto -ffat-lto-objects  
scan-tree-dump-times slp2
 "basic block vectorized" 1
FAIL: gcc.dg/vect/bb-slp-8b.c scan-tree-dump-times slp2 "basic block 
vectorized"
 1
FAIL: gcc.dg/vect/slp-cond-3.c -flto -ffat-lto-objects  
scan-tree-dump-times vec
t "vectorizing stmts using SLP" 1
FAIL: gcc.dg/vect/slp-cond-3.c scan-tree-dump-times vect "vectorizing 
stmts usin
g SLP" 1

Bah.

I suppose I need to investigate this (simply folding a stmt shouldn't
cause any of the above... - with SLP it is probably operand
canonicalization, but not sure).

Richard.

> Richard.
> 
> Index: gcc/tree-ssa-forwprop.c
> ===================================================================
> --- gcc/tree-ssa-forwprop.c	(revision 216258)
> +++ gcc/tree-ssa-forwprop.c	(working copy)
> @@ -54,6 +54,8 @@ along with GCC; see the file COPYING3.
>  #include "tree-ssa-propagate.h"
>  #include "tree-ssa-dom.h"
>  #include "builtins.h"
> +#include "tree-cfgcleanup.h"
> +#include "tree-into-ssa.h"
>  
>  /* This pass propagates the RHS of assignment statements into use
>     sites of the LHS of the assignment.  It's basically a specialized
> @@ -3586,6 +3588,8 @@ simplify_mult (gimple_stmt_iterator *gsi
>  
>    return false;
>  }
> +
> +
>  /* Main entry point for the forward propagation and statement combine
>     optimizer.  */
>  
> @@ -3626,6 +3630,40 @@ pass_forwprop::execute (function *fun)
>  
>    cfg_changed = false;
>  
> +  /* Combine stmts with the stmts defining their operands.  Do that
> +     in an order that guarantees visiting SSA defs before SSA uses.  */
> +  int *postorder = XNEWVEC (int, n_basic_blocks_for_fn (fun));
> +  int postorder_num = inverted_post_order_compute (postorder);
> +  for (int i = 0; i < postorder_num; ++i)
> +    {
> +      bb = BASIC_BLOCK_FOR_FN (fun, postorder[i]);
> +      for (gimple_stmt_iterator gsi = gsi_start_bb (bb);
> +	   !gsi_end_p (gsi); gsi_next (&gsi))
> +	{
> +	  gimple stmt = gsi_stmt (gsi);
> +	  gimple orig_stmt = stmt;
> +
> +	  if (fold_stmt (&gsi))
> +	    {
> +	      stmt = gsi_stmt (gsi);
> +	      if (maybe_clean_or_replace_eh_stmt (orig_stmt, stmt)
> +		  && gimple_purge_dead_eh_edges (bb))
> +		cfg_changed = true;
> +	      update_stmt (stmt);
> +	    }
> +	}
> +    }
> +  free (postorder);
> +
> +  /* ???  Code below doesn't expect non-renamed VOPs and the above
> +     doesn't keep virtual operand form up-to-date.  */
> +  if (cfg_changed)
> +    {
> +      cleanup_tree_cfg ();
> +      cfg_changed = false;
> +    }
> +  update_ssa (TODO_update_ssa_only_virtuals);
> +
>    FOR_EACH_BB_FN (bb, fun)
>      {
>        gimple_stmt_iterator gsi;
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE / SUSE Labs
SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
GF: Jeff Hawn, Jennifer Guild, Felix Imend"orffer

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH][0/n] Merge from match-and-simplify
  2014-10-15 13:20 [PATCH][0/n] Merge from match-and-simplify Richard Biener
  2014-10-15 16:30 ` Kyrill Tkachov
@ 2014-10-15 17:13 ` Jakub Jelinek
  2014-10-16 20:43 ` Sebastian Pop
  2 siblings, 0 replies; 15+ messages in thread
From: Jakub Jelinek @ 2014-10-15 17:13 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

On Wed, Oct 15, 2014 at 03:00:57PM +0200, Richard Biener wrote:
> To re-iterate here (as I expect most people will only look
> at [0/n] patches ;)), the question is whether we are fine
> with making fold-const (thus fold_{unary,binary,ternary})
> not handle some cases it handles currently.

I'd say not dealing with side-effects should be smaller problem
than looking through casts etc., expressions with side-effects
generally shouldn't be folding into constants, where I'd expect the highest
risks (something previously folded into constants and used in initializers
not working any longer, or __builtin_constant_p, etc.).

I'd say just trying hard not to regress, and adding testsuite coverage
where not already covered, should be good enough, if something important
regresses, hopefully mass rebuilds will reveal it or users report it to us.

	Jakub

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH][0/n] Merge from match-and-simplify
  2014-10-15 13:20 [PATCH][0/n] Merge from match-and-simplify Richard Biener
  2014-10-15 16:30 ` Kyrill Tkachov
  2014-10-15 17:13 ` Jakub Jelinek
@ 2014-10-16 20:43 ` Sebastian Pop
  2014-10-16 20:50   ` Andrew Pinski
  2014-10-17  8:00   ` Richard Biener
  2 siblings, 2 replies; 15+ messages in thread
From: Sebastian Pop @ 2014-10-16 20:43 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

Richard Biener wrote:
> 
> I have posted 5 patches as part of a larger series to merge
> (parts) from the match-and-simplify branch.  While I think
> there was overall consensus that the idea behind the project
> is sound there are technical questions left for how the
> thing should look in the end.  I've raised them in 3/n
> which is the only patch of the series that contains any
> patterns sofar.
> 
> To re-iterate here (as I expect most people will only look
> at [0/n] patches ;)), the question is whether we are fine
> with making fold-const (thus fold_{unary,binary,ternary})
> not handle some cases it handles currently.

I have tested on aarch64 all the code in the match-and-simplify against trunk as
of the last merge at r216315:

2014-10-16  Richard Biener  <rguenther@suse.de>

        Merge from trunk r216235 through r216315.

Overall, I see a lot of perf regressions (about 2/3 of the tests) than
improvements (1/3 of the tests).  I will try to reduce tests.

For instance, saxpy regresses at -O3 on aarch64:

void saxpy(double* x, double* y, double* z) {
    int i=0;
    for (i = 0 ; i < ARRAY_SIZE; i++) {
        z[i] = x[i] + scalar*y[i];
    }
}

$ diff -u base.s mas.s
--- base.s      2014-10-16 15:30:15.351430000 -0500
+++ mas.s       2014-10-16 15:30:16.183035000 -0500
@@ -2,12 +2,14 @@
        add     x1, x2, 800
        ldr     q0, [x0, x2]
        add     x3, x2, 1600
+       cmp     x0, 784
        ldr     q1, [x0, x1]
+       add     x1, x0, 16
        fmla    v0.2d, v1.2d, v2.2d
        str     q0, [x0, x3]
-       add     x0, x0, 16
-       cmp     x0, 800
+       mov     x0, x1
        bne     .L140
 .LBE179:
-       subs    w4, w4, #1
+       cmp     w4, 1
+       sub     w4, w4, #1
        bne     .L139



Thanks,
Sebastian

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH][0/n] Merge from match-and-simplify
  2014-10-16 20:43 ` Sebastian Pop
@ 2014-10-16 20:50   ` Andrew Pinski
  2014-10-17  7:29     ` Ramana Radhakrishnan
  2014-10-17  8:00   ` Richard Biener
  1 sibling, 1 reply; 15+ messages in thread
From: Andrew Pinski @ 2014-10-16 20:50 UTC (permalink / raw)
  To: Sebastian Pop; +Cc: Richard Biener, GCC Patches

On Thu, Oct 16, 2014 at 1:38 PM, Sebastian Pop <sebpop@gmail.com> wrote:
> Richard Biener wrote:
>>
>> I have posted 5 patches as part of a larger series to merge
>> (parts) from the match-and-simplify branch.  While I think
>> there was overall consensus that the idea behind the project
>> is sound there are technical questions left for how the
>> thing should look in the end.  I've raised them in 3/n
>> which is the only patch of the series that contains any
>> patterns sofar.
>>
>> To re-iterate here (as I expect most people will only look
>> at [0/n] patches ;)), the question is whether we are fine
>> with making fold-const (thus fold_{unary,binary,ternary})
>> not handle some cases it handles currently.
>
> I have tested on aarch64 all the code in the match-and-simplify against trunk as
> of the last merge at r216315:
>
> 2014-10-16  Richard Biener  <rguenther@suse.de>
>
>         Merge from trunk r216235 through r216315.
>
> Overall, I see a lot of perf regressions (about 2/3 of the tests) than
> improvements (1/3 of the tests).  I will try to reduce tests.
>
> For instance, saxpy regresses at -O3 on aarch64:
>
> void saxpy(double* x, double* y, double* z) {
>     int i=0;
>     for (i = 0 ; i < ARRAY_SIZE; i++) {
>         z[i] = x[i] + scalar*y[i];
>     }
> }

This looks like a scheduling issue rather than anything else.  The
scheduler for a57 is not complete and does not model some things like
the fusion of the compares and branch which is most likely what you
are seeing.

Thanks,
Andrew Pinski

>
> $ diff -u base.s mas.s
> --- base.s      2014-10-16 15:30:15.351430000 -0500
> +++ mas.s       2014-10-16 15:30:16.183035000 -0500
> @@ -2,12 +2,14 @@
>         add     x1, x2, 800
>         ldr     q0, [x0, x2]
>         add     x3, x2, 1600
> +       cmp     x0, 784
>         ldr     q1, [x0, x1]
> +       add     x1, x0, 16
>         fmla    v0.2d, v1.2d, v2.2d
>         str     q0, [x0, x3]
> -       add     x0, x0, 16
> -       cmp     x0, 800
> +       mov     x0, x1
>         bne     .L140
>  .LBE179:
> -       subs    w4, w4, #1
> +       cmp     w4, 1
> +       sub     w4, w4, #1
>         bne     .L139
>
>
>
> Thanks,
> Sebastian

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH][0/n] Merge from match-and-simplify
  2014-10-16 20:50   ` Andrew Pinski
@ 2014-10-17  7:29     ` Ramana Radhakrishnan
  0 siblings, 0 replies; 15+ messages in thread
From: Ramana Radhakrishnan @ 2014-10-17  7:29 UTC (permalink / raw)
  To: Andrew Pinski, Sebastian Pop, Richard Biener; +Cc: GCC Patches



On 16/10/14 21:43, Andrew Pinski wrote:
> On Thu, Oct 16, 2014 at 1:38 PM, Sebastian Pop <sebpop@gmail.com> wrote:
>> Richard Biener wrote:
>>>
>>> I have posted 5 patches as part of a larger series to merge
>>> (parts) from the match-and-simplify branch.  While I think
>>> there was overall consensus that the idea behind the project
>>> is sound there are technical questions left for how the
>>> thing should look in the end.  I've raised them in 3/n
>>> which is the only patch of the series that contains any
>>> patterns sofar.
>>>
>>> To re-iterate here (as I expect most people will only look
>>> at [0/n] patches ;)), the question is whether we are fine
>>> with making fold-const (thus fold_{unary,binary,ternary})
>>> not handle some cases it handles currently.


>>
>> I have tested on aarch64 all the code in the match-and-simplify against trunk as
>> of the last merge at r216315:
>>
>> 2014-10-16  Richard Biener  <rguenther@suse.de>
>>
>>          Merge from trunk r216235 through r216315.
>>
>> Overall, I see a lot of perf regressions (about 2/3 of the tests) than
>> improvements (1/3 of the tests).  I will try to reduce tests.

>>
>> For instance, saxpy regresses at -O3 on aarch64:
>>
>> void saxpy(double* x, double* y, double* z) {
>>      int i=0;
>>      for (i = 0 ; i < ARRAY_SIZE; i++) {
>>          z[i] = x[i] + scalar*y[i];
>>      }
>> }
>
> This looks like a scheduling issue rather than anything else.  The
> scheduler for a57 is not complete and does not model some things like
> the fusion of the compares and branch which is most likely what you
> are seeing.


Huh !! how is that related to the code generation shown by Seb ?

See the replacement of subs by cmp and sub. Folding cmp into other flag 
setting instructions is a very useful optimization on ARM and AArch64 
and that's what appears missing in fold-const. That maybe what's causing 
the slowdown. I've never known that to be caused by any scheduler 
vagaries !

regards
Ramana




>
> Thanks,
> Andrew Pinski
>
>>
>> $ diff -u base.s mas.s
>> --- base.s      2014-10-16 15:30:15.351430000 -0500
>> +++ mas.s       2014-10-16 15:30:16.183035000 -0500
>> @@ -2,12 +2,14 @@
>>          add     x1, x2, 800
>>          ldr     q0, [x0, x2]
>>          add     x3, x2, 1600
>> +       cmp     x0, 784
>>          ldr     q1, [x0, x1]
>> +       add     x1, x0, 16
>>          fmla    v0.2d, v1.2d, v2.2d
>>          str     q0, [x0, x3]
>> -       add     x0, x0, 16
>> -       cmp     x0, 800
>> +       mov     x0, x1
>>          bne     .L140
>>   .LBE179:
>> -       subs    w4, w4, #1
>> +       cmp     w4, 1
>> +       sub     w4, w4, #1
>>          bne     .L139
>>
>>
>>
>> Thanks,
>> Sebastian
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH][0/n] Merge from match-and-simplify
  2014-10-16 20:43 ` Sebastian Pop
  2014-10-16 20:50   ` Andrew Pinski
@ 2014-10-17  8:00   ` Richard Biener
  2014-10-17 16:44     ` Sebastian Pop
  1 sibling, 1 reply; 15+ messages in thread
From: Richard Biener @ 2014-10-17  8:00 UTC (permalink / raw)
  To: Sebastian Pop; +Cc: gcc-patches

On Thu, 16 Oct 2014, Sebastian Pop wrote:

> Richard Biener wrote:
> > 
> > I have posted 5 patches as part of a larger series to merge
> > (parts) from the match-and-simplify branch.  While I think
> > there was overall consensus that the idea behind the project
> > is sound there are technical questions left for how the
> > thing should look in the end.  I've raised them in 3/n
> > which is the only patch of the series that contains any
> > patterns sofar.
> > 
> > To re-iterate here (as I expect most people will only look
> > at [0/n] patches ;)), the question is whether we are fine
> > with making fold-const (thus fold_{unary,binary,ternary})
> > not handle some cases it handles currently.
> 
> I have tested on aarch64 all the code in the match-and-simplify against trunk as
> of the last merge at r216315:
> 
> 2014-10-16  Richard Biener  <rguenther@suse.de>
> 
>         Merge from trunk r216235 through r216315.
> 
> Overall, I see a lot of perf regressions (about 2/3 of the tests) than
> improvements (1/3 of the tests).  I will try to reduce tests.

Note that the branch goes much further in exercising the machinery
than I want to merge at this point (that applies mostly to all
passes using the SSA propagator such as CCP and VRP and passes
exercising value-numbering - FRE and PRE).

It may also simply show the effect of now folding all statements
from tree-ssa-forwprop.c.  I have yet to investigate the testsuite
fallout of [1/n] to [5/n] - testresults have been very noisy lately
due to the C11 change and now ICF.

> For instance, saxpy regresses at -O3 on aarch64:
> 
> void saxpy(double* x, double* y, double* z) {
>     int i=0;
>     for (i = 0 ; i < ARRAY_SIZE; i++) {
>         z[i] = x[i] + scalar*y[i];
>     }
> }
> 
> $ diff -u base.s mas.s
> --- base.s      2014-10-16 15:30:15.351430000 -0500
> +++ mas.s       2014-10-16 15:30:16.183035000 -0500
> @@ -2,12 +2,14 @@
>         add     x1, x2, 800
>         ldr     q0, [x0, x2]
>         add     x3, x2, 1600
> +       cmp     x0, 784
>         ldr     q1, [x0, x1]
> +       add     x1, x0, 16
>         fmla    v0.2d, v1.2d, v2.2d
>         str     q0, [x0, x3]
> -       add     x0, x0, 16
> -       cmp     x0, 800
> +       mov     x0, x1
>         bne     .L140
>  .LBE179:
> -       subs    w4, w4, #1
> +       cmp     w4, 1
> +       sub     w4, w4, #1
>         bne     .L139

I don't understand AARCH64 assembly very well but the above looks like
RTL issues and/or IVOPTs issues?

Thanks for doing performance measurements.

Richard.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH][0/n] Merge from match-and-simplify
  2014-10-17  8:00   ` Richard Biener
@ 2014-10-17 16:44     ` Sebastian Pop
  2014-10-17 17:37       ` Richard Biener
  2014-10-17 18:32       ` Sebastian Pop
  0 siblings, 2 replies; 15+ messages in thread
From: Sebastian Pop @ 2014-10-17 16:44 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

Richard Biener wrote:
> On Thu, 16 Oct 2014, Sebastian Pop wrote:
> 
> > Richard Biener wrote:
> > > 
> > > I have posted 5 patches as part of a larger series to merge
> > > (parts) from the match-and-simplify branch.  While I think
> > > there was overall consensus that the idea behind the project
> > > is sound there are technical questions left for how the
> > > thing should look in the end.  I've raised them in 3/n
> > > which is the only patch of the series that contains any
> > > patterns sofar.
> > > 
> > > To re-iterate here (as I expect most people will only look
> > > at [0/n] patches ;)), the question is whether we are fine
> > > with making fold-const (thus fold_{unary,binary,ternary})
> > > not handle some cases it handles currently.
> > 
> > I have tested on aarch64 all the code in the match-and-simplify against trunk as
> > of the last merge at r216315:
> > 
> > 2014-10-16  Richard Biener  <rguenther@suse.de>
> > 
> >         Merge from trunk r216235 through r216315.
> > 
> > Overall, I see a lot of perf regressions (about 2/3 of the tests) than
> > improvements (1/3 of the tests).  I will try to reduce tests.
> 
> Note that the branch goes much further in exercising the machinery
> than I want to merge at this point (that applies mostly to all
> passes using the SSA propagator such as CCP and VRP and passes
> exercising value-numbering - FRE and PRE).

I see.  Should I run benchmarks only with the patches that you submitted for
trunk?

> I don't understand AARCH64 assembly very well but the above looks like
> RTL issues and/or IVOPTs issues?

I should have posted the first diff between the compilers with -fdump-tree-all:
that would expose the problem at its root.

I have seen that there is a way to dump the folded expressions from the new
functionality, is there a flag to print the folded expressions in current trunk?
It would be interesting to have the same kind of output, such that we could run
a diff between.

Thanks,
Sebastian

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH][0/n] Merge from match-and-simplify
  2014-10-17 16:44     ` Sebastian Pop
@ 2014-10-17 17:37       ` Richard Biener
  2014-10-17 18:32       ` Sebastian Pop
  1 sibling, 0 replies; 15+ messages in thread
From: Richard Biener @ 2014-10-17 17:37 UTC (permalink / raw)
  To: Sebastian Pop; +Cc: gcc-patches

On October 17, 2014 6:35:58 PM CEST, Sebastian Pop <sebpop@gmail.com> wrote:
>Richard Biener wrote:
>> On Thu, 16 Oct 2014, Sebastian Pop wrote:
>> 
>> > Richard Biener wrote:
>> > > 
>> > > I have posted 5 patches as part of a larger series to merge
>> > > (parts) from the match-and-simplify branch.  While I think
>> > > there was overall consensus that the idea behind the project
>> > > is sound there are technical questions left for how the
>> > > thing should look in the end.  I've raised them in 3/n
>> > > which is the only patch of the series that contains any
>> > > patterns sofar.
>> > > 
>> > > To re-iterate here (as I expect most people will only look
>> > > at [0/n] patches ;)), the question is whether we are fine
>> > > with making fold-const (thus fold_{unary,binary,ternary})
>> > > not handle some cases it handles currently.
>> > 
>> > I have tested on aarch64 all the code in the match-and-simplify
>against trunk as
>> > of the last merge at r216315:
>> > 
>> > 2014-10-16  Richard Biener  <rguenther@suse.de>
>> > 
>> >         Merge from trunk r216235 through r216315.
>> > 
>> > Overall, I see a lot of perf regressions (about 2/3 of the tests)
>than
>> > improvements (1/3 of the tests).  I will try to reduce tests.
>> 
>> Note that the branch goes much further in exercising the machinery
>> than I want to merge at this point (that applies mostly to all
>> passes using the SSA propagator such as CCP and VRP and passes
>> exercising value-numbering - FRE and PRE).
>
>I see.  Should I run benchmarks only with the patches that you
>submitted for
>trunk?

Yes. What I posted sofar should be a no-op performance wise.

Benchmarks on the branch are still useful though as eventually all changes should get merged to trunk if enough patterns are implemented.

Richard.

>> I don't understand AARCH64 assembly very well but the above looks
>like
>> RTL issues and/or IVOPTs issues?
>
>I should have posted the first diff between the compilers with
>-fdump-tree-all:
>that would expose the problem at its root.
>
>I have seen that there is a way to dump the folded expressions from the
>new
>functionality, is there a flag to print the folded expressions in
>current trunk?
>It would be interesting to have the same kind of output, such that we
>could run
>a diff between.
>
>Thanks,
>Sebastian


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH][0/n] Merge from match-and-simplify
  2014-10-17 16:44     ` Sebastian Pop
  2014-10-17 17:37       ` Richard Biener
@ 2014-10-17 18:32       ` Sebastian Pop
  2014-10-20 11:47         ` Richard Biener
  1 sibling, 1 reply; 15+ messages in thread
From: Sebastian Pop @ 2014-10-17 18:32 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

Sebastian Pop wrote:
> Richard Biener wrote:
> > looks like
> > RTL issues and/or IVOPTs issues?
> 
> I should have posted the first diff between the compilers with -fdump-tree-all:
> that would expose the problem at its root.

Looks like this is caused by the fwprop pass:

diff -u -r ./foo.i.087t.forwprop3 ../mas/foo.i.087t.forwprop3
--- ./foo.i.087t.forwprop3      2014-10-17 13:17:29.985327000 -0500
+++ ../mas/foo.i.087t.forwprop3 2014-10-17 13:17:29.308814000 -0500
@@ -5,6 +5,8 @@
 Pass statistics:
 ----------------
 
+Applying pattern match-comparison.pd:43, gimple-match.c:11747
+gimple_simplified to if (i_20 != 99)
 
 Pass statistics:
 ----------------
@@ -60,7 +62,7 @@
   i_17 = i_20 + 1;
   # DEBUG iD.2450 => i_17
   # DEBUG iD.2450 => i_17
-  if (i_17 != 100)
+  if (i_20 != 99)
     goto <bb 3>;
   else
     goto <bb 4>;

[...]
diff -u -r ./foo.i.089t.ccp3 ../mas/foo.i.089t.ccp3
--- ./foo.i.089t.ccp3   2014-10-17 13:17:29.991734000 -0500
+++ ../mas/foo.i.089t.ccp3      2014-10-17 13:17:29.316140000 -0500
@@ -53,13 +53,13 @@
 # VUSE <.MEM_16>
 return;
 
-i_17 : -->2 uses.
+i_17 : --> single use.
 i_20 = PHI <i_17(3), 0(2)>
 # DEBUG i => i_17
-if (i_17 != 100)
 # DEBUG i => i_17
 
-i_20 : -->2 uses.
+i_20 : -->3 uses.
+if (i_20 != 99)
 i_17 = i_20 + 1;
 _4 = (long unsigned int) i_20;
 # DEBUG i => i_20

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH][0/n] Merge from match-and-simplify
  2014-10-17 18:32       ` Sebastian Pop
@ 2014-10-20 11:47         ` Richard Biener
  2014-10-22 21:06           ` Jeff Law
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Biener @ 2014-10-20 11:47 UTC (permalink / raw)
  To: Sebastian Pop; +Cc: gcc-patches, marc.glisse

On Fri, 17 Oct 2014, Sebastian Pop wrote:

> Sebastian Pop wrote:
> > Richard Biener wrote:
> > > looks like
> > > RTL issues and/or IVOPTs issues?
> > 
> > I should have posted the first diff between the compilers with -fdump-tree-all:
> > that would expose the problem at its root.
> 
> Looks like this is caused by the fwprop pass:
> 
> diff -u -r ./foo.i.087t.forwprop3 ../mas/foo.i.087t.forwprop3
> --- ./foo.i.087t.forwprop3      2014-10-17 13:17:29.985327000 -0500
> +++ ../mas/foo.i.087t.forwprop3 2014-10-17 13:17:29.308814000 -0500
> @@ -5,6 +5,8 @@
>  Pass statistics:
>  ----------------
>  
> +Applying pattern match-comparison.pd:43, gimple-match.c:11747
> +gimple_simplified to if (i_20 != 99)
>  
>  Pass statistics:
>  ----------------
> @@ -60,7 +62,7 @@
>    i_17 = i_20 + 1;
>    # DEBUG iD.2450 => i_17
>    # DEBUG iD.2450 => i_17
> -  if (i_17 != 100)
> +  if (i_20 != 99)
>      goto <bb 3>;
>    else
>      goto <bb 4>;

Ok, so this is one effect on the thing Marc pointed out - currently
no patterns (well, no but one) guards itself with has_single_use
predicates.

That was a conscious decision and the idea was that the caller should
do this via its lattice valueization function which could look like

tree
valueize (tree t)
{
  if (TREE_CODE (t) == SSA_NAME
      && !has_single_use (t))
    return NULL_TREE;
  return t;
}

But of course doing that unconditionally would also pessimize code.
Generally we'd like to avoid un-CSEing stuff in a way that cannot
be CSEd again.  That's a more complex condition than what can be
implemented with has_single_use.  You might also consider a
stmt doing a_1 + a_1 where a_1 has two uses now.

For Sebastians case above the issue is that we are appearantly
bad at optimizing post-increment exit tests.  But if you'd consider
code like

  i_2 = i_1 + 1;
  b1_3 = i_2 < 100;
  b2_4 = i_2 > 50;
  if (b1_3 && b2_4)
    ...

then it is profitable to remove i_2 by changing the two comparisons
to i_2 <= 98 and i_2 > 49.

I thought about doing all simplifications first without committing
any simplified sequence to the IL, then scanning over the result,
pruning out cases that end up pessimizing code (how exactly isn't
yet clear to me).

So I'm not sure what we want to do here now.  I don't very much like
doing things explicitely in the pattern description (nor using the
"has_single_use" predicate).
I suppose for the gimple_build () stuff we could restrict simplifications
to the expression we are building (not simplifying with SSA defs in the 
IL), more exactly mimicing fold_buildN behavior.
I suppose for forwprop we could use the above valueize hook (but then
regress because not all patterns as implemented in forwprop guard
their def stmt lookup with has_single_use...).

Any opinion on this?  Any idea of a "simple" cost function if
you have the functions IL before and after simplifications (but
without any DCE/CSE applied)?

Thanks,
Richard.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH][0/n] Merge from match-and-simplify
  2014-10-20 11:47         ` Richard Biener
@ 2014-10-22 21:06           ` Jeff Law
  0 siblings, 0 replies; 15+ messages in thread
From: Jeff Law @ 2014-10-22 21:06 UTC (permalink / raw)
  To: Richard Biener, Sebastian Pop; +Cc: gcc-patches, marc.glisse

On 10/20/14 05:42, Richard Biener wrote:
> That was a conscious decision and the idea was that the caller should
> do this via its lattice valueization function which could look like
>
> tree
> valueize (tree t)
> {
>    if (TREE_CODE (t) == SSA_NAME
>        && !has_single_use (t))
>      return NULL_TREE;
>    return t;
> }
>
> But of course doing that unconditionally would also pessimize code.
> Generally we'd like to avoid un-CSEing stuff in a way that cannot
> be CSEd again.  That's a more complex condition than what can be
> implemented with has_single_use.  You might also consider a
> stmt doing a_1 + a_1 where a_1 has two uses now.
FWIW, I wouldn't worry much about the two uses in a single statement 
case.  I looked at that in RTL eons ago it just doesn't happen enough to 
bother trying to detect and treat as a single use.

>
> I thought about doing all simplifications first without committing
> any simplified sequence to the IL, then scanning over the result,
> pruning out cases that end up pessimizing code (how exactly isn't
> yet clear to me).
>
> So I'm not sure what we want to do here now.  I don't very much like
> doing things explicitely in the pattern description (nor using the
> "has_single_use" predicate).
> I suppose for the gimple_build () stuff we could restrict simplifications
> to the expression we are building (not simplifying with SSA defs in the
> IL), more exactly mimicing fold_buildN behavior.
> I suppose for forwprop we could use the above valueize hook (but then
> regress because not all patterns as implemented in forwprop guard
> their def stmt lookup with has_single_use...).
>
> Any opinion on this?  Any idea of a "simple" cost function if
> you have the functions IL before and after simplifications (but
> without any DCE/CSE applied)?
It's certainly ideal to be able to be able to CSE/un-CSE depending on 
final context and it's a design goal I've heard other compiler 
developers making.  ie, every transformation early which may be somewhat 
speculative must be "un-doable" later.  But the infrastructure for that 
is, umm, hard.

The concept of simplify on the side, then prune out stuff that isn't 
profitable is nice, but as you state, that's nontrivial as well.

In general, the has_single_use case is profitable.  So we want to 
aggressively go after those and I think we can commit those immediately 
and use the valueize function shown above.

Maybe you then look at the more speculative cases...

jeff

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2014-10-22 20:40 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-10-15 13:20 [PATCH][0/n] Merge from match-and-simplify Richard Biener
2014-10-15 16:30 ` Kyrill Tkachov
2014-10-17  7:39   ` Ramana Radhakrishnan
2014-10-17  8:24     ` Richard Biener
2014-10-17 11:58       ` Richard Biener
2014-10-15 17:13 ` Jakub Jelinek
2014-10-16 20:43 ` Sebastian Pop
2014-10-16 20:50   ` Andrew Pinski
2014-10-17  7:29     ` Ramana Radhakrishnan
2014-10-17  8:00   ` Richard Biener
2014-10-17 16:44     ` Sebastian Pop
2014-10-17 17:37       ` Richard Biener
2014-10-17 18:32       ` Sebastian Pop
2014-10-20 11:47         ` Richard Biener
2014-10-22 21:06           ` Jeff Law

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).