public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH][RFC] Add FRE in pass_vectorize
@ 2015-06-10 14:11 Richard Biener
  2015-06-23 20:22 ` Jeff Law
  0 siblings, 1 reply; 7+ messages in thread
From: Richard Biener @ 2015-06-10 14:11 UTC (permalink / raw)
  To: gcc-patches


The following patch adds FRE after vectorization which is needed
for IVOPTs to remove redundant PHI nodes (well, I'm testing a
patch for FRE that will do it already there).

The patch also makes FRE preserve loop-closed SSA form and thus
make it suitable for use in the loop pipeline.

With the placement in the vectorizer sub-pass FRE will effectively
be enabled by -O3 only (well, or if one requests loop vectorization).
I've considered placing it after complete_unroll instead but that
would enable it at -O1 already.  I have no strong opinion on the
exact placement, but it should help all passes between vectorizing
and ivopts for vectorized loops.

Yeah, it adds yet another pass and thus I don't like it very much.
But it for example improves code generated for
gfortran.dg/vect/fast-math-pr37021.f90
from

.L14:
        movupd  (%r11), %xmm3
        addl    $1, %ecx
        addq    %rax, %r11
        movupd  (%r8), %xmm0
        addq    %rax, %r8
        unpckhpd        %xmm3, %xmm3
        movupd  (%rdi), %xmm2
        unpcklpd        %xmm0, %xmm0
        addq    %rsi, %rdi
        movupd  (%rbx), %xmm1
        mulpd   %xmm3, %xmm2
        addq    %rsi, %rbx
        cmpl    %ecx, %ebp
        palignr $8, %xmm1, %xmm1
        mulpd   %xmm1, %xmm0
        movapd  %xmm2, %xmm1
        addpd   %xmm0, %xmm1
        subpd   %xmm2, %xmm0
        shufpd  $2, %xmm0, %xmm1
        addpd   %xmm1, %xmm4
        jne     .L14

to

.L14:
        movupd  (%r8), %xmm0
        addl    $1, %ecx
        addq    %rax, %r8
        movapd  %xmm0, %xmm2
        movupd  (%rdi), %xmm1
        addq    %rsi, %rdi
        cmpl    %ecx, %r11d
        unpckhpd        %xmm0, %xmm2
        unpcklpd        %xmm0, %xmm0
        mulpd   %xmm1, %xmm2
        palignr $8, %xmm1, %xmm1
        mulpd   %xmm1, %xmm0
        movapd  %xmm2, %xmm1
        addpd   %xmm0, %xmm1
        subpd   %xmm2, %xmm0
        shufpd  $2, %xmm0, %xmm1
        addpd   %xmm1, %xmm3
        jne     .L14

(yeah, the vectorizer happily generates redundant loads and one IV
for each such load)

Any other suggestions on pass placement?  I can of course key
that FRE run on -O3 explicitely.  Not sure if we at this point
want to start playing fancy games like setting a property
when a pass (likely) generated redundancies that are worth
fixing up and then key FRE on that one (it gets harder and
less predictable what transforms are run on code).

Bootstrap / regtest running on x86_64-unknown-linux-gnu.  With
other placements I'd expect quite some testsuite fallout
eventually.

Thoughts?

Thanks,
Richard.

2015-06-10  Richard Biener  <rguenther@suse.de>

	* passes.def (pass_vectorize): Add pass_fre.
	* tree-ssa-pre.c (eliminate_dom_walker::before_dom_children):
	Preserve loop-closed SSA form.

Index: gcc/passes.def
===================================================================
*** gcc/passes.def	(revision 224324)
--- gcc/passes.def	(working copy)
*************** along with GCC; see the file COPYING3.
*** 252,257 ****
--- 252,258 ----
  	     Please do not add any other passes in between.  */
  	  NEXT_PASS (pass_vectorize);
            PUSH_INSERT_PASSES_WITHIN (pass_vectorize)
+ 	      NEXT_PASS (pass_fre);
  	      NEXT_PASS (pass_dce);
            POP_INSERT_PASSES ()
            NEXT_PASS (pass_predcom);
Index: gcc/tree-ssa-pre.c
===================================================================
*** gcc/tree-ssa-pre.c	(revision 224324)
--- gcc/tree-ssa-pre.c	(working copy)
*************** eliminate_dom_walker::before_dom_childre
*** 4013,4018 ****
--- 4013,4028 ----
       tailmerging.  Eventually we can reduce its reliance on SCCVN now
       that we fully copy/constant-propagate (most) things.  */
  
+   /* Compute whether this block has loop-closed PHI nodes we need
+      to preserve.  */
+   bool lc_phi = false;
+   edge e;
+   if (loops_state_satisfies_p (LOOP_CLOSED_SSA)
+       && single_pred_p (b)
+       && (e = single_pred_edge (b))
+       && loop_exit_edge_p (e->src->loop_father, e))
+     lc_phi = true;
+ 
    for (gphi_iterator gsi = gsi_start_phis (b); !gsi_end_p (gsi);)
      {
        gphi *phi = gsi.phi ();
*************** eliminate_dom_walker::before_dom_childre
*** 4026,4032 ****
  
        tree sprime = eliminate_avail (res);
        if (sprime
! 	  && sprime != res)
  	{
  	  if (dump_file && (dump_flags & TDF_DETAILS))
  	    {
--- 4036,4043 ----
  
        tree sprime = eliminate_avail (res);
        if (sprime
! 	  && sprime != res
! 	  && !lc_phi)
  	{
  	  if (dump_file && (dump_flags & TDF_DETAILS))
  	    {
*************** eliminate_dom_walker::before_dom_childre
*** 4466,4472 ****
  
    /* Replace destination PHI arguments.  */
    edge_iterator ei;
-   edge e;
    FOR_EACH_EDGE (e, ei, b->succs)
      {
        for (gphi_iterator gsi = gsi_start_phis (e->dest);
--- 4477,4482 ----

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH][RFC] Add FRE in pass_vectorize
  2015-06-10 14:11 [PATCH][RFC] Add FRE in pass_vectorize Richard Biener
@ 2015-06-23 20:22 ` Jeff Law
  2015-06-24  8:16   ` Richard Biener
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff Law @ 2015-06-23 20:22 UTC (permalink / raw)
  To: Richard Biener, gcc-patches

On 06/10/2015 08:02 AM, Richard Biener wrote:
>
> The following patch adds FRE after vectorization which is needed
> for IVOPTs to remove redundant PHI nodes (well, I'm testing a
> patch for FRE that will do it already there).
Redundant or degenerates which should be propagated?

I believe Alan Lawrence has run into similar issues (unpropagated 
degenerates) with his changes to make loop header copying more 
aggressive.  Threading will also create them.  The phi-only propagator 
may be the solution.  It ought to be cheaper than FRE.



> The patch also makes FRE preserve loop-closed SSA form and thus
> make it suitable for use in the loop pipeline.
Loop optimizations will tend to create opportunities for redundancy 
elimination, so the ability to use FRE in the loop pipeline seems like a 
good thing.  We ran into this in RTL land, so I'm not surprised to see 
it occurring in the gimple optimizers and thus I'm not opposed to 
running FRE in the loop pipeline.



>
> With the placement in the vectorizer sub-pass FRE will effectively
> be enabled by -O3 only (well, or if one requests loop vectorization).
> I've considered placing it after complete_unroll instead but that
> would enable it at -O1 already.  I have no strong opinion on the
> exact placement, but it should help all passes between vectorizing
> and ivopts for vectorized loops.
For -O3/vectorization it seems like a no-brainer.  -O1 less so.  IIRC we 
conditionalize -frerun-cse-after-loop on -O2 which seems more 
appropriate than doing it with -O1.

>
> Any other suggestions on pass placement?  I can of course key
> that FRE run on -O3 explicitely.  Not sure if we at this point
> want to start playing fancy games like setting a property
> when a pass (likely) generated redundancies that are worth
> fixing up and then key FRE on that one (it gets harder and
> less predictable what transforms are run on code).
RTL CSE is bloody expensive and so many times I wanted the ability to 
know a bit about what the loop optimizer had done (or not done) so that 
I could conditionally skip the second CSE pass.   We never built that, 
but it's something I've wanted for decades.

Jeff

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH][RFC] Add FRE in pass_vectorize
  2015-06-23 20:22 ` Jeff Law
@ 2015-06-24  8:16   ` Richard Biener
  2015-06-25  3:40     ` Jeff Law
  0 siblings, 1 reply; 7+ messages in thread
From: Richard Biener @ 2015-06-24  8:16 UTC (permalink / raw)
  To: Jeff Law; +Cc: gcc-patches

On Tue, 23 Jun 2015, Jeff Law wrote:

> On 06/10/2015 08:02 AM, Richard Biener wrote:
> > 
> > The following patch adds FRE after vectorization which is needed
> > for IVOPTs to remove redundant PHI nodes (well, I'm testing a
> > patch for FRE that will do it already there).
> Redundant or degenerates which should be propagated?

Redundant, basically two IVs with the same initial value and same step.
IVOPTs can deal with this if the initial values and the step are already
same "enough" - the vectorizer can end up generating redundant huge
expressions for both.

> I believe Alan Lawrence has run into similar issues (unpropagated degenerates)
> with his changes to make loop header copying more aggressive.  Threading will
> also create them.  The phi-only propagator may be the solution.  It ought to
> be cheaper than FRE.

Yes, but that's unrelated (see above).

> > The patch also makes FRE preserve loop-closed SSA form and thus
> > make it suitable for use in the loop pipeline.
> Loop optimizations will tend to create opportunities for redundancy
> elimination, so the ability to use FRE in the loop pipeline seems like a good
> thing.  We ran into this in RTL land, so I'm not surprised to see it occurring
> in the gimple optimizers and thus I'm not opposed to running FRE in the loop
> pipeline.
> 
> 
> 
> > 
> > With the placement in the vectorizer sub-pass FRE will effectively
> > be enabled by -O3 only (well, or if one requests loop vectorization).
> > I've considered placing it after complete_unroll instead but that
> > would enable it at -O1 already.  I have no strong opinion on the
> > exact placement, but it should help all passes between vectorizing
> > and ivopts for vectorized loops.
> For -O3/vectorization it seems like a no-brainer.  -O1 less so.  IIRC we
> conditionalize -frerun-cse-after-loop on -O2 which seems more appropriate than
> doing it with -O1.
> 
> > 
> > Any other suggestions on pass placement?  I can of course key
> > that FRE run on -O3 explicitely.  Not sure if we at this point
> > want to start playing fancy games like setting a property
> > when a pass (likely) generated redundancies that are worth
> > fixing up and then key FRE on that one (it gets harder and
> > less predictable what transforms are run on code).
> RTL CSE is bloody expensive and so many times I wanted the ability to know a
> bit about what the loop optimizer had done (or not done) so that I could
> conditionally skip the second CSE pass.   We never built that, but it's
> something I've wanted for decades.

Hmm, ok.  We can abuse pass properties for this but I don't think
they are a scalable fit.  Not sure if we'd like to go full way
adding sth like PROP_want_ccp PROP_want_copyprop PROP_want_cse, etc.
(any others?).  And whether FRE would then catch a PROP_want_copyprop
because it also can do copy propagation.

Eventually we'll just end up setting PROP_want_* from every pass...
(like we schedule a CFG cleanup from nearly every pass that did
anything).

Going a bit further here, esp. in the loop context, would be to
have the basic cleanups be region-based.  Because given a big
function with many loops and just one vectorized it would be
enough to cleanup the vectorized loop (yes, and in theory
all downstream effects, but that's probably secondary and not
so important).  It's not too difficult to make FRE run on
a MEME region, the interesting part, engineering-wise, is to
really make it O(size of MEME region) - that is, eliminate
things like O(num_ssa_names) or O(n_basic_blocks) setup cost.

And then there is the possibility of making passes generate less
needs to perform cleanups after them - like in the present case
with the redundant IVs make them more appearant redundant by
CSEing the initial value and step during vectorizer code generation.
I'm playing with the idea of adding a simple CSE machinery to
the gimple_build () interface (aka match-and-simplify).  It
eventually invokes (well, not currently, but that can be fixed)
maybe_push_res_to_seq which is a good place to maintain a
table of already generated expressions.  That of course only
works if you either always append to the same sequence or at least
insert at the same place.

I'm now back to match-and-simplify and will pursue that last idea
a bit (also wanting it for SCCVN itself).

Richard.

> Jeff
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Dilip Upmanyu, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH][RFC] Add FRE in pass_vectorize
  2015-06-24  8:16   ` Richard Biener
@ 2015-06-25  3:40     ` Jeff Law
  2015-07-02 11:40       ` Alan Lawrence
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff Law @ 2015-06-25  3:40 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

On 06/24/2015 01:59 AM, Richard Biener wrote:
> Redundant, basically two IVs with the same initial value and same step.
> IVOPTs can deal with this if the initial values and the step are already
> same "enough" - the vectorizer can end up generating redundant huge
> expressions for both.
Ah, so yes, this is a totally different issue than Alan and I are 
discussing.

>> RTL CSE is bloody expensive and so many times I wanted the ability to know a
>> bit about what the loop optimizer had done (or not done) so that I could
>> conditionally skip the second CSE pass.   We never built that, but it's
>> something I've wanted for decades.
>
> Hmm, ok.  We can abuse pass properties for this but I don't think
> they are a scalable fit.  Not sure if we'd like to go full way
> adding sth like PROP_want_ccp PROP_want_copyprop PROP_want_cse, etc.
> (any others?).  And whether FRE would then catch a PROP_want_copyprop
> because it also can do copy propagation.
And that's why we haven't pushed hard on this issue -- it doesn't scale 
and to make it scale requires rethinking the basics of the pass manager.


> Going a bit further here, esp. in the loop context, would be to
> have the basic cleanups be region-based.  Because given a big
> function with many loops and just one vectorized it would be
> enough to cleanup the vectorized loop (yes, and in theory
> all downstream effects, but that's probably secondary and not
> so important).  It's not too difficult to make FRE run on
> a MEME region, the interesting part, engineering-wise, is to
> really make it O(size of MEME region) - that is, eliminate
> things like O(num_ssa_names) or O(n_basic_blocks) setup cost.
I had a long talk with some of the SGI compiler guys many years ago 
about region-based optimizations.  It was something they had been trying 
to bring into their compiler for years, but never got it working to a 
point where they were happy with it.  While they didn't show me the 
code, they indicated the changes were highly invasive -- and all the 
code had been #ifdef'd out because it just didn't work.  Naturally it 
was all bitrotting.






>
> And then there is the possibility of making passes generate less
> needs to perform cleanups after them - like in the present case
> with the redundant IVs make them more appearant redundant by
> CSEing the initial value and step during vectorizer code generation.
> I'm playing with the idea of adding a simple CSE machinery to
> the gimple_build () interface (aka match-and-simplify).  It
> eventually invokes (well, not currently, but that can be fixed)
> maybe_push_res_to_seq which is a good place to maintain a
> table of already generated expressions.  That of course only
> works if you either always append to the same sequence or at least
> insert at the same place.
As you know we've gone back and forth on this in the past.  It's always 
a trade-off.  I still ponder from time to time putting the simple CSE 
and cprop bits back into the SSA rewriting phase to avoid generating all 
kinds of garbage that just needs to be cleaned up later -- particularly 
for incremental SSA updates.



Jeff

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH][RFC] Add FRE in pass_vectorize
  2015-06-25  3:40     ` Jeff Law
@ 2015-07-02 11:40       ` Alan Lawrence
  2015-07-02 12:11         ` Richard Biener
  2015-07-02 17:52         ` Jeff Law
  0 siblings, 2 replies; 7+ messages in thread
From: Alan Lawrence @ 2015-07-02 11:40 UTC (permalink / raw)
  To: Jeff Law; +Cc: Richard Biener, gcc-patches

Jeff Law wrote:
> On 06/24/2015 01:59 AM, Richard Biener wrote:
>> And then there is the possibility of making passes generate less
>> needs to perform cleanups after them - like in the present case
>> with the redundant IVs make them more appearant redundant by
>> CSEing the initial value and step during vectorizer code generation.
>> I'm playing with the idea of adding a simple CSE machinery to
>> the gimple_build () interface (aka match-and-simplify).  It
>> eventually invokes (well, not currently, but that can be fixed)
>> maybe_push_res_to_seq which is a good place to maintain a
>> table of already generated expressions.  That of course only
>> works if you either always append to the same sequence or at least
>> insert at the same place.
> As you know we've gone back and forth on this in the past.  It's always 
> a trade-off.  I still ponder from time to time putting the simple CSE 
> and cprop bits back into the SSA rewriting phase to avoid generating all 
> kinds of garbage that just needs to be cleaned up later -- particularly 
> for incremental SSA updates.

Coming to this rather late, and without the background knowledge about having 
gone back and forth, sorry! But what are the arguments against this? Am I right 
in thinking that the "SSA Rewriting" phase would not trigger as often as 
gimple_build(), or are these the same thing?

Presumably when you say "simple CSE machinery" you'd have to bail out quickly 
from tricky cases like, say:

if (P)
   {
     use ...expr...
   }
...
if (Q)
   {
     now building a new ...expr... here
   }

Thanks, Alan

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH][RFC] Add FRE in pass_vectorize
  2015-07-02 11:40       ` Alan Lawrence
@ 2015-07-02 12:11         ` Richard Biener
  2015-07-02 17:52         ` Jeff Law
  1 sibling, 0 replies; 7+ messages in thread
From: Richard Biener @ 2015-07-02 12:11 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: Jeff Law, gcc-patches

On Thu, 2 Jul 2015, Alan Lawrence wrote:

> Jeff Law wrote:
> > On 06/24/2015 01:59 AM, Richard Biener wrote:
> > > And then there is the possibility of making passes generate less
> > > needs to perform cleanups after them - like in the present case
> > > with the redundant IVs make them more appearant redundant by
> > > CSEing the initial value and step during vectorizer code generation.
> > > I'm playing with the idea of adding a simple CSE machinery to
> > > the gimple_build () interface (aka match-and-simplify).  It
> > > eventually invokes (well, not currently, but that can be fixed)
> > > maybe_push_res_to_seq which is a good place to maintain a
> > > table of already generated expressions.  That of course only
> > > works if you either always append to the same sequence or at least
> > > insert at the same place.
> > As you know we've gone back and forth on this in the past.  It's always a
> > trade-off.  I still ponder from time to time putting the simple CSE and
> > cprop bits back into the SSA rewriting phase to avoid generating all kinds
> > of garbage that just needs to be cleaned up later -- particularly for
> > incremental SSA updates.
> 
> Coming to this rather late, and without the background knowledge about having
> gone back and forth, sorry! But what are the arguments against this? Am I
> right in thinking that the "SSA Rewriting" phase would not trigger as often as
> gimple_build(), or are these the same thing?

Not sure what Jeff means here either.  We don't do any SSA rewriting
nowadays but instead produce SSA gimple even from force_gimple_operand
operating on GENERIC.

As the goal is to get rid of force_gimple_operand ("late gimplifying")
this means to build the CSE into its replacement.

> Presumably when you say "simple CSE machinery" you'd have to bail out quickly
> from tricky cases like, say:
> 
> if (P)
>   {
>     use ...expr...
>   }
> ...
> if (Q)
>   {
>     now building a new ...expr... here
>   }

Yes, of course.

Richard.

> Thanks, Alan
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Dilip Upmanyu, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH][RFC] Add FRE in pass_vectorize
  2015-07-02 11:40       ` Alan Lawrence
  2015-07-02 12:11         ` Richard Biener
@ 2015-07-02 17:52         ` Jeff Law
  1 sibling, 0 replies; 7+ messages in thread
From: Jeff Law @ 2015-07-02 17:52 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: Richard Biener, gcc-patches

On 07/02/2015 05:40 AM, Alan Lawrence wrote:
> Jeff Law wrote:
>> On 06/24/2015 01:59 AM, Richard Biener wrote:
>>> And then there is the possibility of making passes generate less
>>> needs to perform cleanups after them - like in the present case
>>> with the redundant IVs make them more appearant redundant by
>>> CSEing the initial value and step during vectorizer code generation.
>>> I'm playing with the idea of adding a simple CSE machinery to
>>> the gimple_build () interface (aka match-and-simplify).  It
>>> eventually invokes (well, not currently, but that can be fixed)
>>> maybe_push_res_to_seq which is a good place to maintain a
>>> table of already generated expressions.  That of course only
>>> works if you either always append to the same sequence or at least
>>> insert at the same place.
>> As you know we've gone back and forth on this in the past.  It's
>> always a trade-off.  I still ponder from time to time putting the
>> simple CSE and cprop bits back into the SSA rewriting phase to avoid
>> generating all kinds of garbage that just needs to be cleaned up later
>> -- particularly for incremental SSA updates.
>
> Coming to this rather late, and without the background knowledge about
> having gone back and forth, sorry! But what are the arguments against
> this? Am I right in thinking that the "SSA Rewriting" phase would not
> trigger as often as gimple_build(), or are these the same thing?
It's the into-ssa and incremental update phases.  The basic idea is it 
is very inexpensive to do const/copy propagation and simple CSE at that 
point.

When processing an assignment, after rewriting the inputs from _DECL 
nodes to SSA_NAMEs, you lookup the RHS in your hash table.  If you get a 
hit, you replace the expression with the SSA_NAME from the hash table 
and record that the destination has an equivalence.

Diego took this out several years ago with the idea that the into-ssa & 
updates should be kept separate from optimizations.  With the ongoing 
need for early cleanups to make IPA more effective, I think it's time to 
revisit that decision as we get a lot of the obvious redundancies out of 
the stream by just being smart during into-ssa.  Which in turn means we 
don't have to do as much in the early optimizations before IPA.

>
> Presumably when you say "simple CSE machinery" you'd have to bail out
> quickly from tricky cases like, say:
>
> if (P)
>    {
>      use ...expr...
>    }
> ...
> if (Q)
>    {
>      now building a new ...expr... here
>    }
Not sure the problem here.  The simple CSE/cprop occurs as we're going 
into SSA form -- because into-ssa is inherently a dominator walk and 
we're rewriting operands as we go, we can trivially determine that we've 
already seen a given expression earlier in the dominator tree and that 
the result of that expression hasn't changed (by the nature of SSA).

Jeff


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2015-07-02 17:52 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-10 14:11 [PATCH][RFC] Add FRE in pass_vectorize Richard Biener
2015-06-23 20:22 ` Jeff Law
2015-06-24  8:16   ` Richard Biener
2015-06-25  3:40     ` Jeff Law
2015-07-02 11:40       ` Alan Lawrence
2015-07-02 12:11         ` Richard Biener
2015-07-02 17:52         ` Jeff Law

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).