Hi Richard,

here's a new version of the pass. I attempted to address as much as possible
your comments. The pass was bootstrapped and reg-tested on x86_64.

On 06/14/2011 05:01 PM, Richard Guenther wrote:
> On Fri, Jun 10, 2011 at 6:54 PM, Tom de Vries <vries@codesourcery.com> wrote:
>> Hi Richard,
>>
>> thanks for the review.
>>
>> On 06/08/2011 11:55 AM, Richard Guenther wrote:
>>> On Wed, Jun 8, 2011 at 11:42 AM, Tom de Vries <vries@codesourcery.com> wrote:
>>>> Hi Richard,
>>>>
>>>> I have a patch for PR43864. The patch adds a gimple level duplicate block
>>>> cleanup. The patch has been bootstrapped and reg-tested on x86_64, and
>>>> reg-tested on ARM. The size impact on ARM for spec2000 is shown in the following
>>>> table (%, lower is better).
>>>>
>>>>                     none            pic
>>>>                thumb1  thumb2  thumb1 thumb2
>>>> spec2000         99.9    99.9    99.8   99.8
>>>>
>>>> PR43864 is currently marked as a duplicate of PR20070, but I'm not sure that the
>>>> optimizations proposed in PR20070 would fix this PR.
>>>>
>>>> The problem in this PR is that when compiling with -O2, the example below should
>>>> only have one call to free. The original problem is formulated in terms of -Os,
>>>> but currently we generate one call to free with -Os, although still not the
>>>> smallest code possible. I'll show here the -O2 case, since that's similar to the
>>>> original PR.
>>>>
>>
>> Example A. (naming it for reference below)
>>
>>>> #include <stdio.h>
>>>> void foo (char*, FILE*);
>>>> char* hprofStartupp(char *outputFileName, char *ctx)
>>>> {
>>>>    char fileName[1000];
>>>>    FILE *fp;
>>>>    sprintf(fileName, outputFileName);
>>>>    if (access(fileName, 1) == 0) {
>>>>        free(ctx);
>>>>        return 0;
>>>>    }
>>>>
>>>>    fp = fopen(fileName, 0);
>>>>    if (fp == 0) {
>>>>        free(ctx);
>>>>        return 0;
>>>>    }
>>>>
>>>>    foo(outputFileName, fp);
>>>>
>>>>    return ctx;
>>>> }
>>>>
>>>> AFAIU, there are 2 complementary methods of rtl optimizations proposed in PR20070.
>>>> - Merging 2 blocks which are identical expect for input registers, by using a
>>>>  conditional move to choose between the different input registers.
>>>> - Merging 2 blocks which have different local registers, by ignoring those
>>>>  differences
>>>>
>>>> Blocks .L6 and.L7 have no difference in local registers, but they have a
>>>> difference in input registers: r3 and r1. Replacing the move to r5 by a
>>>> conditional move would probably be benificial in terms of size, but it's not
>>>> clear what condition the conditional move should be using. Calculating such a
>>>> condition would add in size and increase the execution path.
>>>>
>>>> gcc -O2 -march=armv7-a -mthumb pr43864.c -S:
>>>> ...
>>>>        push    {r4, r5, lr}
>>>>        mov     r4, r0
>>>>        sub     sp, sp, #1004
>>>>        mov     r5, r1
>>>>        mov     r0, sp
>>>>        mov     r1, r4
>>>>        bl      sprintf
>>>>        mov     r0, sp
>>>>        movs    r1, #1
>>>>        bl      access
>>>>        mov     r3, r0
>>>>        cbz     r0, .L6
>>>>        movs    r1, #0
>>>>        mov     r0, sp
>>>>        bl      fopen
>>>>        mov     r1, r0
>>>>        cbz     r0, .L7
>>>>        mov     r0, r4
>>>>        bl      foo
>>>> .L3:
>>>>        mov     r0, r5
>>>>        add     sp, sp, #1004
>>>>        pop     {r4, r5, pc}
>>>> .L6:
>>>>        mov     r0, r5
>>>>        mov     r5, r3
>>>>        bl      free
>>>>        b       .L3
>>>> .L7:
>>>>        mov     r0, r5
>>>>        mov     r5, r1
>>>>        bl      free
>>>>        b       .L3
>>>> ...
>>>>
>>>> The proposed patch solved the problem by dealing with the 2 blocks at a level
>>>> when they are still identical: at gimple level. It detect that the 2 blocks are
>>>> identical, and removes one of them.
>>>>
>>>> The following table shows the impact of the patch on the example in terms of
>>>> size for -march=armv7-a:
>>>>
>>>>          without     with    delta
>>>> Os      :     108      104       -4
>>>> O2      :     120      104      -16
>>>> Os thumb:      68       64       -4
>>>> O2 thumb:      76       64      -12
>>>>
>>>> The gain in size for -O2 is that of removing the entire block, plus the
>>>> replacement of 2 moves by a constant set, which also decreases the execution
>>>> path. The patch ensures optimal code for both -O2 and -Os.
>>>>
>>>>
>>>> By keeping track of equivalent definitions in the 2 blocks, we can ignore those
>>>> differences in comparison. Without this feature, we would only match blocks with
>>>> resultless operations, due the the ssa-nature of gimples.
>>>> For example, with this feature, we reduce the following function to its minimum
>>>> at gimple level, rather than at rtl level.
>>>>
>>
>> Example B. (naming it for reference below)
>>
>>>> int f(int c, int b, int d)
>>>> {
>>>>  int r, e;
>>>>
>>>>  if (c)
>>>>    r = b + d;
>>>>  else
>>>>    {
>>>>      e = b + d;
>>>>      r = e;
>>>>    }
>>>>
>>>>  return r;
>>>> }
>>>>
>>>> ;; Function f (f)
>>>>
>>>> f (int c, int b, int d)
>>>> {
>>>>  int e;
>>>>
>>>> <bb 2>:
>>>>  e_6 = b_3(D) + d_4(D);
>>>>  return e_6;
>>>>
>>>> }
>>>>
>>>> I'll send the patch with the testcases in a separate email.
>>>>
>>>> OK for trunk?
>>>
>>> I don't like that you hook this into cleanup_tree_cfg - that is called
>>> _way_ too often.
>>>
>>
>> Here is a reworked patch that addresses several concerns, particularly the
>> compile time overhead.
>>
>> Changes:
>> - The optimization is now in a separate file.
>> - The optimization is now a pass rather than a cleanup. That allowed me to
>>  remove the test for pass-local flags.
>>  New is the pass driver tail_merge_optimize, based on
>>  tree-cfgcleanup.c:cleanup_tree_cfg_1.
>> - The pass is run once, on SSA. Before, the patch would
>>  fix example A only before SSA and example B only on SSA.
>>  In order to fix example A on SSA, I added these changes:
>>  - handle the vop state at entry of bb1 and bb2 as equal (gimple_equal_p)
>>  - insert vop phi in bb2, and use that one (update_vuses)
>>  - complete pt_solutions_equal_p.
>>
>> Passed x86_64 bootstrapping and regression testing, currently regtesting on ARM.
>>
>> I placed the pass at the earliest point where it fixes example B: After copy
>> propagation and dead code elimination, specifically, after the first invocation
>> of pass_cd_dce. Do you know (any other points) where the pass should be scheduled?
> 
> It's probably reasonable to run it after IPA inlining has taken place which
> means insert it somewhen after the second pass_fre (I'd suggest after
> pass_merge_phi).
> 

I placed it there, but I ran into some interaction with
pass_late_warn_uninitialized.  Addition of the pass makes test
gcc.dg/uninit-pred-2_c.c fail.

FAIL: gcc.dg/uninit-pred-2_c.c bogus uninitialized var warning
                               (test for bogus messages, line 43)
FAIL: gcc.dg/uninit-pred-2_c.c real uninitialized var warning
                               (test for warnings, line 45)

   int foo_2 (int n, int m, int r)
   {
     int flag = 0;
     int v;

     if (n)
       {
         v = r;
	 flag = 1;
       }

     if (m) g++;
     else bar ();

     if (flag)
       blah (v); { dg-bogus "uninitialized" "bogus uninitialized var warning" }
     else
       blah (v); { dg-warning "uninitialized" "real uninitialized var warning" }

     return 0;
   }

The pass replaces the second call to blah with the first one, and eliminates
the if.  After that, the uninitialized warning is issued for the line number
of the first call to blah, while at source level the warning only makes sense
for the second call to blah.

Shall I try putting the pass after pass_late_warn_uninitialized?

> But my general comment still applies - I don't like the structural
> comparison code at all and you should really use the value-numbering
> machineries we have

I now use sccvn.

> or even better, merge this pass with FRE itself
> (or DOM if that suits you more).  For FRE you'd want to hook into
> tree-ssa-pre.c:eliminate().
> 

If we need to do the transformation after pass_late_warn_uninitialized, it needs
to stay on its own, I suppose.

>>> This also duplicates the literal matching done on the RTL level - instead
>>> I think this optimization would be more related to value-numbering
>>> (either that of SCCVN/FRE/PRE or that of DOM which also does
>>> jump-threading).
>>
>> The pass currently does just duplicate block elimination, not cross-jumping.
>> If we would like to extend this to cross-jumping, I think we need to do the
>> reverse of value numbering: walk backwards over the bb, and keep track of the
>> way values are used rather than defined. This will allows us to make a cut
>> halfway a basic block.
> 
> I don't understand - I propose to do literal matching but using value-numbering
> for tracking equivalences to avoid literal matching for stuff we know is
> equivalent.  In fact I think it will be mostly calls and stores where we
> need to do literal matching, but never intermediate computations on
> registers.
> 

I tried to implement that scheme now.

> But maybe I miss something here.
>
>> In general, we cannot do cut halfway a basic block in the current implementation
>> (of value numbering and forward matching), since we assume equivalence of the
>> incoming vops at bb entry. This assumption is in general only valid if we indeed
>> replace the entire block by another entire block.
> 
> Why are VOPs of concern at all?
> 

In the previous version, I inserted the phis for the vops manually.
In the current version of the pass, I let TODO_update_ssa_only_virtuals deal
with vops, so it's not relevant anymore.

>> I imagine that a cross-jumping heuristic would be based on the length of the
>> match and the amount of non-vop phis it would introduce. Then value numbering
>> would be something orthogonal to this optimization, which would reduce amount of
>> phis needed for a cross-jump.
>> I think it would make sense to use SCCVN value numbering at the point that we
>> have this backward matching.
>>
>> I'm not sure whether it's a good idea to try to replace the current forward
>> local value numbering with SCCVN value numbering, since we currently declare
>> vops equal, which are, in the global sense, not equal. And once we go to
>> backward matching, we'll need something to keep track of the uses, and we can
>> reuse the current infrastructure for that, but not the SCCVN value numbering.
>>
>> Does that make any sense?
> 
> Ok, let me think about this a bit.
> 

I tried to to be more clear on this in the header comment of the pass.

> For now about the patch in general.  The functions need renaming to
> something more sensible now that this isn't cfg-cleanup anymore.
> 
> I miss a general overview of the pass - it's hard to reverse engineer
> its working for me.

I added a header comment.

> Like (working backwards), you are detecting
> duplicate predecessors
> - that obviously doesn't work for duplicates
> without any successors, like those ending in noreturn calls.
> 

Merging of blocks without successors works now.

> +  n = EDGE_COUNT (bb->preds);
> +
> +  for (i = 0; i < n; ++i)
> +    {
> +      e1 = EDGE_PRED (bb, i);
> +      if (e1->flags & EDGE_COMPLEX)
> +        continue;
> +      for (j = i + 1; j < n; ++j)
> +        {
> 
> that's quadratic in the number of predecessors.
> 

The quadratic comparison is now limited by PARAM_TAIL_MERGE_MAX_COMPARISONS.
Each bb is compared to maximally PARAM_TAIL_MERGE_MAX_COMPARISONS similar bbs
per worklist iteration.

> +          /* Block e1->src might be deleted.  If bb and e1->src are the same
> +             block, delete e2->src instead, by swapping e1 and e2.  */
> +          e1_swapped = (bb == e1->src) ? e2: e1;
> +          e2_swapped = (bb == e1->src) ? e1: e2;
> 
> is that because you incrementally merge preds two at a time?  As you
> are deleting blocks don't you need to adjust the quadratic walking?
> Thus, with say four equivalent preds won't your code crash anyway?
> 

I think it was to make calculation of dominator info easier, but I use now
functions from dominance.c for that, so this piece of code is gone.

> I think the code needs to delay the CFG manipulation to the end
> of this function.
> 

I now delay the cfg manipulation till after each analysis phase.

> +/* Returns whether for all phis in E1->dest the phi alternatives for E1 and
> +   E2 are either:
> +   - equal, or
> +   - defined locally in E1->src and E2->src.
> +   In the latter case, register the alternatives in *PHI_EQUIV.  */
> +
> +static bool
> +same_or_local_phi_alternatives (equiv_t *phi_equiv, edge e1, edge e2)
> +{
> +  int n1 = e1->dest_idx;
> +  int n2 = e2->dest_idx;
> +  gimple_stmt_iterator gsi;
> +  basic_block dest = e1->dest;
> +  gcc_assert (dest == e2->dest);
> 
> too many asserts in general - I'd say for this case pass in the destination
> block as argument.
> 
> +      gcc_assert (val1 != NULL_TREE);
> +      gcc_assert (val2 != NULL_TREE);
> 
> superfluous.
> 
> +static bool
> +cleanup_duplicate_preds_1 (equiv_t phi_equiv, edge e1, edge e2)
> ...
> +  VEC (edge,heap) *redirected_edges;
> +  gcc_assert (bb == e2->dest);
> 
> same.
> 
> +  if (e1->flags != e2->flags)
> +    return false;
> 
> that's bad - it should handle EDGE_TRUE/FALSE_VALUE mismatches
> by swapping edges in the preds.
> 

That's handled now.

> +  /* TODO: We could allow multiple successor edges here, as long as bb1 and bb2
> +     have the same successors.  */
> +  if (EDGE_COUNT (bb1->succs) != 1 || EDGE_COUNT (bb2->succs) != 1)
> +    return false;
> 
> hm, ok - that would need fixing, too.  Same or mergeable successors
> of course, which makes me wonder if doing this whole transformation
> incrementally and locally is a good idea ;)   Also
> 

Also handled now.

> +  /* Calculate the changes to be made to the dominator info.
> +     Calculate bb2_dom.  */
> ...
> 
> wouldn't be necessary I suppose (just throw away dom info after the
> pass).
> 
> That is, I'd globally record BB equivalences (thus, "value-number"
> BBs) and apply the CFG manipulations at a single point.
> 

I delay the cfg manipulation till after each analysis phase. Delaying the cfg
manipulation till the end of the pass instead might make the analysis code more
convoluted.

> Btw, I miss where you insert PHI nodes for all uses that flow in
> from the preds preds - you do that for VOPs but not for real
> operands?
> 

Indeed, inserting phis for non-vops is a todo.

> +  /* Replace uses of vuse2 with uses of the phi.  */
> +  for (gsi = gsi_start_bb (bb2); !gsi_end_p (gsi); gsi_next (&gsi))
> +    {
> 
> why not walk immediate uses of the old PHI and SET_USE to
> the new one instead (for those uses in the duplicate BB of course)?
> 

And I no longer insert VOP phis, but let a TODO handle that, so this code is gone.

> +    case GSS_CALL:
> +      if (!pt_solution_equal_p (gimple_call_use_set (s1),
> +                                gimple_call_use_set (s2))
> 
> I don't understand why you are concerned about equality of
> points-to information.  Why not simply ior it (pt_solution_ior_into - note
> they are shared so you need to unshare them first).
> 

I let a todo handle the alias info now.

> +/* Return true if p1 and p2 can be considered equal.  */
> +
> +static bool
> +pt_solution_equal_p (struct pt_solution *p1, struct pt_solution *p2)
> 
> would go into tree-ssa-structalias.c instead.
> 
> +static bool
> +gimple_base_equal_p (gimple s1, gimple s2)
> +{
> ...
> +  if (gimple_modified_p (s1) || gimple_modified_p (s2))
> +    return false;
> 
> that shouldn't be of concern.
> 
> +  if (s1->gsbase.subcode != s2->gsbase.subcode)
> +    return false;
> 
> for assigns that are of class GIMPLE_SINGLE_RHS we do not
> update subcode during transformations so it can differ for now
> equal statements.
> 

handled properly now.

> I'm not sure if a splay tree for the SSA name version equivalency
> map is the best representation - I would have used a simple
> array of num_ssa_names size and assign value-numbers
> (the lesser version for example).
> 
> Thus equiv_insert would do
> 
>   value = MIN (SSA_NAME_VERSION (val1), SSA_NAME_VERSION (val2));
>   values[SSA_NAME_VERSION (val1)] = value;
>   values[SSA_NAME_VERSION (val2)] = value;
> 
> if the names are not defined in bb1 resp. bb2 we would have to insert
> a PHI node in the merged block - that would be a cost thingy for
> doing this value-numbering in a more global way.
> 

local value numbering code has been removed.

> You don't seem to be concerned about the SSA names points-to
> information, but it surely has the same issues as that of the calls
> (so either they should be equal or they should be conservatively
> merged).  But as-is you don't allow any values to flow into the
> merged blocks that are not equal for both edges, no?
> 

Correct, that's still a todo.

> +  TV_TREE_CFG,                          /* tv_id */
> 
> add a new timevar.  We wan to be able to turn the pass off,
> so also add a new option (I can see it might make debugging harder
> in some cases).
> 

I added -ftree-tail-merge and TV_TREE_TAIL_MERGE.

> Can you assess the effect of the patch on GCC itself (for example
> when building cc1?)? What's the size benefit and the compile-time
> overhead?
> 

effect on building cc1:

               real        user        sys
without: 19m50.158s  19m 2.090s  0m20.860s
with:    19m59.456s  19m17.170s  0m20.350s
                     ----------
                       +15.080s
                         +1.31%

$ size without/cc1 with/cc1
    text   data      bss       dec      hex     filename
17515986  41320  1364352  18921658  120b8ba  without/cc1
17399226  41320  1364352  18804898  11ef0a2     with/cc1
--------
 -116760
  -0.67%

OK for trunk, provided build & reg-testing on ARM is ok?

Thanks,
- Tom

2011-07-12  Tom de Vries  <tom@codesourcery.com>

	PR middle-end/43864
	* tree-ssa-tail-merge.c: New file.
	(bb_dominated_by_p): New function.
	(scc_vn_ok): New var.
	(init_gvn, delete_gvn, gvn_val, gvn_uses_equal): New function.
	(bb_size): New var.
	(init_bb_size, delete_bb_size): New function.
	(struct same_succ): Define.
	(same_succ_t, const_same_succ_t): New typedef.
	(same_succ_print, same_succ_print_traverse, same_succ_hash)
	(inverse_flags, same_succ_equal, same_succ_alloc, same_succ_delete)
	(same_succ_reset): New function.
	(same_succ_htab, bb_to_same_succ, same_succ_edge_flags)
	(bitmap deleted_bbs, deleted_bb_preds): New vars.
	(debug_same_succ): New function.
	(worklist): New var.
	(print_worklist, add_to_worklist, find_same_succ_bb, find_same_succ)
	(init_worklist, delete_worklist, delete_basic_block_same_succ)
	(update_worklist): New function.
	(struct bb_cluster): Define.
	(bb_cluster_t, const_bb_cluster_t): New typedef.
	(print_cluster, debug_cluster, same_predecessors)
	(add_bb_to_cluster, new_cluster, delete_cluster): New function.
	(merge_cluster, all_clusters): New var.
	(alloc_cluster_vectors, reset_cluster_vectors, delete_cluster_vectors)
	(merge_clusters, set_cluster): New function.
	(gimple_subcode_equal_p, gimple_base_equal_p, gimple_equal_p)
	(bb_gimple_equal_p): New function.
	(find_duplicate, same_phi_alternatives_1, same_phi_alternatives)
	(bb_has_non_vop_phi, find_clusters_1, find_clusters): New function.
	(replace_block_by, apply_clusters): New	function.
	(update_debug_stmt, update_debug_stmt): New function.
	(tail_merge_optimize, gate_tail_merge): New function.
	(pass_tail_merge): New gimple pass.
	* tree-pass.h (pass_tail_merge): Declare new pass.
	* passes.c (init_optimization_passes): Use new pass.
	* Makefile.in (OBJS-common): Add tree-ssa-tail-merge.o.
	(tree-ssa-tail-merge.o): New rule.
	* opts.c (default_options_table): Set OPT_ftree_tail_merge by default at
	OPT_LEVELS_2_PLUS.
	* timevar.def (TV_TREE_TAIL_MERGE): New timevar.
	* common.opt (ftree-tail-merge): New switches.
	* params.def (PARAM_TAIL_MERGE_MAX_COMPARISONS): New parameter.