public inbox for gcc@gcc.gnu.org
* Good news about increased jump threading from my PHI merge patch
From: Kazu Hirata @ 2004-12-11  7:25 UTC (permalink / raw)
  To: gcc; +Cc: stevenb, law, dvorakz

Hi,

I was playing with my PHI merge patch that merges two PHI nodes like

  # tem_6 = PHI <tem_17(8), tem_23(7)>;
<L9>:;

  # tem_3 = PHI <tem_6(9), tem_2(5)>;
<L10>:;

into one PHI node like so

  # tem_3 = PHI <tem_23(7), tem_2(5), tem_17(8)>;
<L10>:;
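
To make the data flow concrete, here is a toy, self-contained C sketch
of what the merge does to the argument lists.  It is only an
illustration with made-up types and a made-up function (phi_arg,
merge_phi_args); it is not the actual patch or GCC's PHI
representation.

/* Replace the successor PHI's argument that arrives over the forwarder
   block with the forwarder PHI's own arguments, one per original
   incoming edge.  */
#include <stdio.h>

struct phi_arg
{
  const char *value;  /* SSA name or constant */
  int src_bb;         /* source block of the incoming edge */
};

static int
merge_phi_args (const struct phi_arg *fwd_args, int n_fwd, int fwd_bb,
                const struct phi_arg *succ_args, int n_succ,
                struct phi_arg *merged)
{
  int n = 0;
  for (int i = 0; i < n_succ; i++)
    {
      if (succ_args[i].src_bb == fwd_bb)
        {
          /* This argument flowed through the forwarder block; expand
             it into the forwarder PHI's arguments.  */
          for (int j = 0; j < n_fwd; j++)
            merged[n++] = fwd_args[j];
        }
      else
        merged[n++] = succ_args[i];
    }
  return n;
}

int
main (void)
{
  /* Forwarder block 9:  tem_6 = PHI <tem_17(8), tem_23(7)>  */
  struct phi_arg fwd[] = { { "tem_17", 8 }, { "tem_23", 7 } };
  /* Successor block 10: tem_3 = PHI <tem_6(9), tem_2(5)>    */
  struct phi_arg succ[] = { { "tem_6", 9 }, { "tem_2", 5 } };
  struct phi_arg merged[4];
  int n = merge_phi_args (fwd, 2, 9, succ, 2, merged);

  printf ("  # tem_3 = PHI <");
  for (int i = 0; i < n; i++)
    printf ("%s(%d)%s", merged[i].value, merged[i].src_bb,
            i + 1 < n ? ", " : "");
  printf (">;\n");
  return 0;
}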

I've been saying my PHI merge patch would improve jump threading but
never said by how much.  So here is that number.

Specifically, I compiled cc1-i files.  I grepped for "Threaded" in
t??.dom[123] dumps to see how many times we do tree-level jump
threading.  Similarly, I grepped for "JUMP-BYPASS" in 07.bypass to see
how many times we do RTL-level threading.

     mainline patched   diff%
tree    18918   20524 +7.824%
RTL      2648    2415 -9.648%
-----------------------------
total   21566   22939 +5.985%

OK, so from the point of view of final generated code, the net
increase of 6% is not bad.  I have yet to measure the speed of the
generated code.

It's also good from the point of view of doing more work at the tree
level than at the RTL level.  We've taken away nearly 10% of the work
from the RTL-level jump threader.

Compile time is reduced by 0.3% or so.  More on this later in this
post.

Actually, what I am doing is a little more than the PHI merge
described above.  Sometimes a PHI merge opportunity is "blocked" by an
intervening cast, like so

  # tem_6 = PHI <0(8), 1(7)>;
<L9>:;
  tem_7 = (_Bool) tem_6;

  # tem_3 = PHI <tem_7(9), 1(5)>;
<L10>:;

To merge PHI nodes even in this kind of situation, my pass has a
PHI-specific constant propagator to eat the cast, like so

  # tem_7 = PHI <0(8), 1(7)>;  <- Notice PHI result is tem_7, not tem_6
<L9>:;

  # tem_3 = PHI <tem_7(9), 1(5)>;
<L10>:;
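
The idea of that propagation step, in toy form (a made-up C
illustration, not GCC code): when every PHI argument is a constant and
the PHI result's only use is the cast, apply the cast to each argument
and let the PHI define the cast's result directly.

#include <stdbool.h>
#include <stdio.h>

/* The cast from the example above: tem_7 = (_Bool) tem_6.  */
static int
cast_to_bool (int v)
{
  return (bool) v;
}

int
main (void)
{
  /* tem_6 = PHI <0(8), 1(7)>;  followed by  tem_7 = (_Bool) tem_6;  */
  int args[] = { 0, 1 };
  int edges[] = { 8, 7 };

  /* Fold the cast into the (constant) arguments ...  */
  for (int i = 0; i < 2; i++)
    args[i] = cast_to_bool (args[i]);

  /* ... and let the PHI define tem_7 directly; the cast statement is
     now dead.  */
  printf ("  # tem_7 = PHI <%d(%d), %d(%d)>;\n",
          args[0], edges[0], args[1], edges[1]);
  return 0;
}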

To be precise, my patch actually iterates two subpasses, the PHI merge
and this PHI constant propagator, to a fixed point.  They create
opportunities for each other.  Without this PHI constant propagator,
my patch gives about a 1% compile-time improvement.  (Yes, I need to
speed up the PHI-specific constant propagator or use a smart worklist
shared between the two subpasses.)
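
The driver is then nothing more than a loop like the sketch below; the
two subpass bodies here are placeholders for illustration only, not
the real code.

#include <stdbool.h>
#include <stdio.h>

/* Placeholder: would merge PHI nodes through forwarder blocks;
   returns true if anything changed.  */
static bool
merge_phi_nodes (void)
{
  return false;
}

/* Placeholder: would fold single-use casts of constant-argument PHIs;
   returns true if anything changed.  */
static bool
propagate_phi_constants (void)
{
  return false;
}

int
main (void)
{
  /* Alternate the two subpasses until neither makes progress; each
     one can expose new opportunities for the other.  */
  int iterations = 0;
  bool changed;
  do
    {
      changed = false;
      changed |= merge_phi_nodes ();
      changed |= propagate_phi_constants ();
      iterations++;
    }
  while (changed);
  printf ("fixed point reached after %d iteration(s)\n", iterations);
  return 0;
}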

I have not included any patch to fix PR 18649 in this benchmark.  I
don't expect any big difference because the PR covers a rather rare
corner case.

Aside from the fact that we are in Stage 3, let me note another reason
for holding this until 4.1: we create large PHI nodes.  The current
PHI-OPT cannot handle these, but the one on the tcb (tree cleanup)
branch can, thanks to Andrew Pinski.

Right now I am running this pass only once, immediately after the
first forwprop.  The key requirement for this pass is the absence of
dead code, so that I can find forwarder blocks with PHI nodes.  It may
be beneficial to run it one or two more times during tree optimization.

In any event, I think this is a good candidate to consider for 4.1.
Any comment?

Kazu Hirata


* Re: Good news about increased jump threading from my PHI merge patch
From: Jeffrey A Law @ 2004-12-13 16:38 UTC (permalink / raw)
  To: Kazu Hirata; +Cc: gcc, stevenb, dvorakz

On Sat, 2004-12-11 at 02:24 -0500, Kazu Hirata wrote:
> Hi,
> 
> I was playing with my PHI merge patch that merges two PHI nodes like
> 
>   # tem_6 = PHI <tem_17(8), tem_23(7)>;
> <L9>:;
> 
>   # tem_3 = PHI <tem_6(9), tem_2(5)>;
> <L10>:;
> 
> into one PHI node like so
> 
>   # tem_3 = PHI <tem_23(7), tem_2(5), tem_17(8)>;
> <L10>:;
> 
> I've been saying my PHI merge patch would improve jump threading but
> never said by how much.  So here is that number.
[ ... ]
Thanks.  It's still on my list of things to evaluate for 4.0, mostly
because of its potential to help compile-times on some of our 
problematical cases without introducing lots of regressions elsewhere.

You mentioned that we can't handle large PHI nodes.  Can you comment
more on that -- that would be my largest individual concern right now.

jeff



* Re: Good news about increased jump threading from my PHI merge patch
From: Kazu Hirata @ 2004-12-13 17:57 UTC (permalink / raw)
  To: law; +Cc: gcc, stevenb, dvorakz

Hi Jeff,

> Thanks.  It's still on my list of things to evaluate for 4.0, mostly
> because of its potential to help compile-times on some of our 
> problematical cases without introducing lots of regressions elsewhere.

If we are just merging PHI nodes and not constant-propagating PHI
nodes (like killing casts), I get about 1% speed up on some test
cases.  (For a more accurate number, I have to retest.)

What I am doing is really the same as Zdenek's thread_jumps rewrite,
which you recently approved.  The only difference is that I am trying
to remove forwarder blocks with PHI nodes.  Zdenek's version tries to
remove those without.

But do note, though, that my pass helps if it is run once or twice,
but if it's part of cleanup_tree_cfg, the compiler measurably slows
down because cleanup_tree_cfg is run so many times.  I think my pass
basically requires DCE.  We clean up CFG before entering SSA, so we
shouldn't have any PHI merge opportunity unless some dead code between
two PHI nodes is removed.

Here is a quick advertisement. :-) Let me mention that I can quickly
determine if I can merge a forwarder block A (and its PHI nodes) into
basic block B most of the time.  The only condition I need is:

  if (!dominated_by_p (CDI_DOMINATORS, B, A))

because if A does not dominate B, then the only valid place where any
values produced in A can be used is PHI arguments on edge A->B.
Otherwise, we would have a use that is not dominated by its def.  In
other words, we don't need to compute immediate uses.
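
As a self-contained toy version of that test, consider the sketch
below.  The dominated_by_p here is a little stand-in that walks a
hand-built idom array; it is not GCC's real function, which also takes
the CDI_DOMINATORS argument and basic_block pointers.

#include <stdbool.h>
#include <stdio.h>

#define N_BLOCKS 6

/* Immediate dominator of each block; block 0 is the entry and is its
   own idom.  Made-up CFG for illustration.  */
static const int idom[N_BLOCKS] = { 0, 0, 1, 1, 1, 0 };

/* Return true if block BB is dominated by block DOM.  */
static bool
dominated_by_p (int bb, int dom)
{
  for (int b = bb; ; b = idom[b])
    {
      if (b == dom)
        return true;
      if (b == idom[b])  /* reached the entry block */
        return false;
    }
}

int
main (void)
{
  int a = 3;  /* the forwarder block */
  int b = 4;  /* its successor */

  /* If A does not dominate B, every value defined in A can only be
     used in PHI arguments on the edge A->B, so A can be folded into B
     without computing immediate uses.  */
  if (!dominated_by_p (b, a))
    printf ("block %d can be merged into block %d\n", a, b);
  else
    printf ("block %d dominates block %d; be careful (loop header?)\n",
            a, b);
  return 0;
}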

The reason why I said "most of the time" above is that sometimes A
dominates B even if B has PHI nodes (and thus multiple incoming
edges).  This case applies to B being a loop header.  I don't know how
aggressive we want to be in this case because we are effectively
undoing create_preheader.

> You mentioned that we can't handle large PHI nodes.  Can you comment
> more on that -- that would be my largest individual concern right now.

The only thing that's technically blocking my PHI merge is PHI-OPT.
IIRC, PHI-OPT looks for a basic block with a single PHI node with
exactly two arguments like so:

  a_1 = PHI <0(here), 1(there)>
<L123>:;

and it tries to convert this to

<L123>:;
  a_1 = COND;
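
At the source level, the effect is roughly the following (a toy C
example I made up for illustration; PHI-OPT of course operates on
GIMPLE, not on C source):

#include <stdio.h>

/* Before PHI-OPT: the join point has a_1 = PHI <0(else), 1(then)>.  */
static int
before (int x)
{
  int a;
  if (x > 10)
    a = 1;
  else
    a = 0;
  return a;
}

/* After PHI-OPT: the PHI is replaced by the condition, a_1 = COND.  */
static int
after (int x)
{
  return x > 10;
}

int
main (void)
{
  for (int x = 0; x <= 20; x += 5)
    printf ("x = %2d: before = %d, after = %d\n", x, before (x), after (x));
  return 0;
}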

The newer version of PHI-OPT on the tree cleanup branch looks at a
COND_EXPR instead of a PHI node and checks whether the two arms of the
COND_EXPR feed 0 and 1 into some PHI node, so we end up with

<L100>:;
  :
  :
  c_7 = COND;
  goto <L123>;

  a_1 = PHI <c_7(here), ... arbitrarily long ...>
<L123>:;

Of course, in the special case where those are the PHI node's only two
incoming edges, the PHI node is equivalent to a copy statement because
we would have

  a_1 = PHI <c_7(here)>
<L123>:;

Kazu Hirata


* Re: Good news about increased jump threading from my PHI merge patch
From: Jeffrey A Law @ 2004-12-13 22:56 UTC (permalink / raw)
  To: Kazu Hirata; +Cc: gcc, stevenb, dvorakz

On Mon, 2004-12-13 at 12:57 -0500, Kazu Hirata wrote:
> Hi Jeff,
> 
> > Thanks.  It's still on my list of things to evaluate for 4.0, mostly
> > because of its potential to help compile-times on some of our 
> > problematical cases without introducing lots of regressions elsewhere.
> 
> If we are just merging PHI nodes and not constant-propagating PHI
> nodes (like killing casts), I get about 1% speed up on some test
> cases.  (For a more accurate number, I have to retest.)
> 
> What I am doing is really the same as Zdenek's thread_jumps rewrite,
> which you recently approved.  The only difference is that I am trying
> to remove forwarder blocks with PHI nodes.  Zdenek's version tries to
> remove those without.
Right.  It seems like a pretty natural extension to Zdenek's code.


> But do note, though, that my pass helps if it is run once or twice,
> but if it's part of cleanup_tree_cfg, the compiler measurably slows
> down because cleanup_tree_cfg is run so many times.  
Sigh.   That is something on the long term todo list, but I haven't
really figured out where/how to start cutting down those calls.


> I think my pass
> basically requires DCE.  We clean up CFG before entering SSA, so we
> shouldn't have any PHI merge opportunity unless some dead code between
> two PHI nodes is removed.
Which almost makes me wonder if removing forwarder blocks is better
off as its own pass.

We would schedule it to run after the DCE/DSE.  We would also call
it from cleanup_tree_cfg if and only if we were able to determine
the result of a COND_EXPR_COND or SWITCH_COND.

  
> Here is a quick advertisement. :-) Let me mention that I can quickly
> determine if I can merge a forwarder block A (and its PHI nodes) into
> basic block B most of the time.  The only condition I need is:
> 
>   if (!dominated_by_p (CDI_DOMINATORS, B, A))
> 
> because if A does not dominate B, then the only valid place where any
> values produced in A can be used is PHI arguments on edge A->B.
> Otherwise, we would have a use that is not dominated by its def.  In
> other words, we don't need to compute immediate uses.
> 
> The reason why I said "most of the time" above is that sometimes A
> dominates B even if B has PHI nodes (and thus multiple incoming
> edges).  This case applies to B being a loop header.  I don't know how
> aggressive we want to be in this case because we are effectively
> undoing create_preheader.
Presumably we've already eliminated degenerate PHIs by this point.  I
see.

> 
> > You mentioned that we can't handle large PHI nodes.  Can you comment
> > more on that -- that would be my largest individual concern right now.
> 
> The only thing that's technically blocking my PHI merge is PHI-OPT.
> IIRC, PHI-OPT looks for a basic block with a single PHI node with
> exactly two arguments like so:
Oh.  That code.  I thought you meant there was some time-complexity
problem you needed to solve with large PHI arguments.

It would seem to me that the phi-opt code would just make your changes
more effective, but the phi-opt code shouldn't be a prerequisite for
your code to merge phis.

jeff



* Re: Good news about increased jump threading from my PHI merge patch
From: Kazu Hirata @ 2004-12-14  0:13 UTC (permalink / raw)
  To: law; +Cc: gcc, stevenb, dvorakz

Hi Jeff,

> > I think my pass
> > basically requires DCE.  We clean up CFG before entering SSA, so we
> > shouldn't have any PHI merge opportunity unless some dead code between
> > two PHI nodes is removed.
> Which almost makes me wonder if removing forwarder blocks is better
> off as its own pass.

That's what I am doing for PHI merge.

> We would schedule it to run after the DCE/DSE.  We would also call
> it from cleanup_tree_cfg if and only if we were able to determine
> the result of a COND_EXPR_COND or SWITCH_COND.

But we have more ways to create forwarder blocks than just DCE and
DSE.  split_edge creates forwarder blocks.  So does ch.

I instrumented cleanup_tree_cfg and printed the current pass name
whenever forwarder blocks were removed while compiling the cc1-i files.
Here is the number of times cleanup_forwarder_blocks did some work:

   5034 dce
   5024 loopdone
   2352 ch
   1903 cddce
   1477 final_cleanup
   1449 cfg
   1373 dom
     32 tailr
     28 ccp
      3 phiopt

Passes that call cleanup_tree_cfg (either directly or via
TODO_cleanup_cfg) are ccp, dom, dce, cddce, loopdone, ch, phiopt,
tailr, tailc, pre, and final_cleanup.  The only passes that do not
appear in the stats above are tailc and pre.  PRE's absence is because
abnormal edges don't appear when compiling the GCC sources.

Kazu Hirata


