On 07/12/2011 04:07 PM, Richard Guenther wrote:
> On Tue, Jul 12, 2011 at 2:12 PM, Tom de Vries wrote:
>> Hi Richard,
>>
>> here's a new version of the pass. I attempted to address your comments as
>> much as possible. The pass was bootstrapped and reg-tested on x86_64.
>>
>> On 06/14/2011 05:01 PM, Richard Guenther wrote:
>>> On Fri, Jun 10, 2011 at 6:54 PM, Tom de Vries wrote:
>>>> Hi Richard,
>>>>
>>>> thanks for the review.
>>>>
>>>> On 06/08/2011 11:55 AM, Richard Guenther wrote:
>>>>> On Wed, Jun 8, 2011 at 11:42 AM, Tom de Vries wrote:
>>>>>> Hi Richard,
>>>>>>
>>>>>> I have a patch for PR43864. The patch adds a gimple-level duplicate block
>>>>>> cleanup. The patch has been bootstrapped and reg-tested on x86_64, and
>>>>>> reg-tested on ARM. The size impact on ARM for spec2000 is shown in the
>>>>>> following table (%, lower is better).
>>>>>>
>>>>>>                   none            pic
>>>>>>              thumb1  thumb2  thumb1  thumb2
>>>>>> spec2000      99.9    99.9    99.8    99.8
>>>>>>
>>>>>> PR43864 is currently marked as a duplicate of PR20070, but I'm not sure
>>>>>> that the optimizations proposed in PR20070 would fix this PR.
>>>>>>
>>>>>> The problem in this PR is that when compiling with -O2, the example below
>>>>>> should only have one call to free. The original problem is formulated in
>>>>>> terms of -Os, but currently we generate one call to free with -Os,
>>>>>> although still not the smallest code possible. I'll show here the -O2
>>>>>> case, since that's similar to the original PR.
>>>>>>
>>>>
>>>> Example A. (naming it for reference below)
>>>>
>>>>>> #include <stdio.h>
>>>>>> void foo (char*, FILE*);
>>>>>> char* hprofStartupp(char *outputFileName, char *ctx)
>>>>>> {
>>>>>>   char fileName[1000];
>>>>>>   FILE *fp;
>>>>>>   sprintf(fileName, outputFileName);
>>>>>>   if (access(fileName, 1) == 0) {
>>>>>>     free(ctx);
>>>>>>     return 0;
>>>>>>   }
>>>>>>
>>>>>>   fp = fopen(fileName, 0);
>>>>>>   if (fp == 0) {
>>>>>>     free(ctx);
>>>>>>     return 0;
>>>>>>   }
>>>>>>
>>>>>>   foo(outputFileName, fp);
>>>>>>
>>>>>>   return ctx;
>>>>>> }
>>>>>>
>>>>>> AFAIU, there are 2 complementary methods of rtl optimization proposed in
>>>>>> PR20070:
>>>>>> - Merging 2 blocks which are identical except for input registers, by
>>>>>>   using a conditional move to choose between the different input
>>>>>>   registers.
>>>>>> - Merging 2 blocks which have different local registers, by ignoring
>>>>>>   those differences.
>>>>>>
>>>>>> Blocks .L6 and .L7 have no difference in local registers, but they have a
>>>>>> difference in input registers: r3 and r1. Replacing the move to r5 by a
>>>>>> conditional move would probably be beneficial in terms of size, but it's
>>>>>> not clear what condition the conditional move should be using.
>>>>>> Calculating such a condition would add in size and increase the execution
>>>>>> path.
>>>>>>
>>>>>> gcc -O2 -march=armv7-a -mthumb pr43864.c -S:
>>>>>> ...
>>>>>>         push    {r4, r5, lr}
>>>>>>         mov     r4, r0
>>>>>>         sub     sp, sp, #1004
>>>>>>         mov     r5, r1
>>>>>>         mov     r0, sp
>>>>>>         mov     r1, r4
>>>>>>         bl      sprintf
>>>>>>         mov     r0, sp
>>>>>>         movs    r1, #1
>>>>>>         bl      access
>>>>>>         mov     r3, r0
>>>>>>         cbz     r0, .L6
>>>>>>         movs    r1, #0
>>>>>>         mov     r0, sp
>>>>>>         bl      fopen
>>>>>>         mov     r1, r0
>>>>>>         cbz     r0, .L7
>>>>>>         mov     r0, r4
>>>>>>         bl      foo
>>>>>> .L3:
>>>>>>         mov     r0, r5
>>>>>>         add     sp, sp, #1004
>>>>>>         pop     {r4, r5, pc}
>>>>>> .L6:
>>>>>>         mov     r0, r5
>>>>>>         mov     r5, r3
>>>>>>         bl      free
>>>>>>         b       .L3
>>>>>> .L7:
>>>>>>         mov     r0, r5
>>>>>>         mov     r5, r1
>>>>>>         bl      free
>>>>>>         b       .L3
>>>>>> ...
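(As an aside: at the source level, the merge amounts to giving both
early-return paths one shared tail. A minimal hand-written sketch of the
merged form - an illustration only, not compiler output; the includes and the
access declaration are added here just to make the snippet self-contained:

#include <stdio.h>
#include <stdlib.h>
void foo (char*, FILE*);
int access (const char*, int);

char *
hprofStartupp (char *outputFileName, char *ctx)
{
  char fileName[1000];
  FILE *fp;
  sprintf (fileName, outputFileName);
  if (access (fileName, 1) == 0)
    goto fail;

  fp = fopen (fileName, 0);
  if (fp == 0)
    goto fail;

  foo (outputFileName, fp);
  return ctx;

 fail:
  /* The two duplicate blocks reduce to this single tail.  */
  free (ctx);
  return 0;
}

This is what the gimple-level transformation effectively achieves, without
needing to compute a condition for a conditional move.)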
>>>>>>
>>>>>> The proposed patch solves the problem by dealing with the 2 blocks at a
>>>>>> level where they are still identical: at gimple level. It detects that
>>>>>> the 2 blocks are identical, and removes one of them.
>>>>>>
>>>>>> The following table shows the impact of the patch on the example in terms
>>>>>> of size for -march=armv7-a:
>>>>>>
>>>>>>            without  with  delta
>>>>>> Os      :      108   104     -4
>>>>>> O2      :      120   104    -16
>>>>>> Os thumb:       68    64     -4
>>>>>> O2 thumb:       76    64    -12
>>>>>>
>>>>>> The gain in size for -O2 is that of removing the entire block, plus the
>>>>>> replacement of 2 moves by a constant set, which also decreases the
>>>>>> execution path. The patch ensures optimal code for both -O2 and -Os.
>>>>>>
>>>>>> By keeping track of equivalent definitions in the 2 blocks, we can ignore
>>>>>> those differences in the comparison. Without this feature, we would only
>>>>>> match blocks with resultless operations, due to the SSA nature of gimple.
>>>>>> For example, with this feature, we reduce the following function to its
>>>>>> minimum at gimple level, rather than at rtl level.
>>>>>>
>>>>
>>>> Example B. (naming it for reference below)
>>>>
>>>>>> int f(int c, int b, int d)
>>>>>> {
>>>>>>   int r, e;
>>>>>>
>>>>>>   if (c)
>>>>>>     r = b + d;
>>>>>>   else
>>>>>>     {
>>>>>>       e = b + d;
>>>>>>       r = e;
>>>>>>     }
>>>>>>
>>>>>>   return r;
>>>>>> }
>>>>>>
>>>>>> ;; Function f (f)
>>>>>>
>>>>>> f (int c, int b, int d)
>>>>>> {
>>>>>>   int e;
>>>>>>
>>>>>> <bb 2>:
>>>>>>   e_6 = b_3(D) + d_4(D);
>>>>>>   return e_6;
>>>>>>
>>>>>> }
>>>>>>
>>>>>> I'll send the patch with the testcases in a separate email.
>>>>>>
>>>>>> OK for trunk?
>>>>>
>>>>> I don't like that you hook this into cleanup_tree_cfg - that is called
>>>>> _way_ too often.
>>>>>
>>>>
>>>> Here is a reworked patch that addresses several concerns, particularly the
>>>> compile time overhead.
>>>>
>>>> Changes:
>>>> - The optimization is now in a separate file.
>>>> - The optimization is now a pass rather than a cleanup. That allowed me to
>>>>   remove the test for pass-local flags.
>>>>   New is the pass driver tail_merge_optimize, based on
>>>>   tree-cfgcleanup.c:cleanup_tree_cfg_1.
>>>> - The pass is run once, on SSA. Before, the patch would fix example A only
>>>>   before SSA and example B only on SSA.
>>>>   In order to fix example A on SSA, I added these changes:
>>>>   - handle the vop state at entry of bb1 and bb2 as equal (gimple_equal_p)
>>>>   - insert a vop phi in bb2, and use that one (update_vuses)
>>>>   - complete pt_solutions_equal_p.
>>>>
>>>> Passed x86_64 bootstrapping and regression testing, currently regtesting
>>>> on ARM.
>>>>
>>>> I placed the pass at the earliest point where it fixes example B: after
>>>> copy propagation and dead code elimination, specifically, after the first
>>>> invocation of pass_cd_dce. Do you know (any other points) where the pass
>>>> should be scheduled?
>>>
>>> It's probably reasonable to run it after IPA inlining has taken place which
>>> means insert it somewhere after the second pass_fre (I'd suggest after
>>> pass_merge_phi).
>>>
>>
>> I placed it there, but I ran into some interaction with
>> pass_late_warn_uninitialized. Addition of the pass makes test
>> gcc.dg/uninit-pred-2_c.c fail.
>>
>> FAIL: gcc.dg/uninit-pred-2_c.c bogus uninitialized var warning
>>       (test for bogus messages, line 43)
>> FAIL: gcc.dg/uninit-pred-2_c.c real uninitialized var warning
>>       (test for warnings, line 45)
>>
>> int foo_2 (int n, int m, int r)
>> {
>>   int flag = 0;
>>   int v;
>>
>>   if (n)
>>     {
>>       v = r;
>>       flag = 1;
>>     }
>>
>>   if (m) g++;
>>   else bar ();
>>
>>   if (flag)
>>     blah (v); /* { dg-bogus "uninitialized" "bogus uninitialized var warning" } */
>>   else
>>     blah (v); /* { dg-warning "uninitialized" "real uninitialized var warning" } */
>>
>>   return 0;
>> }
>>
>> The pass replaces the second call to blah with the first one, and eliminates
>> the if. After that, the uninitialized warning is issued for the line number
>> of the first call to blah, while at source level the warning only makes
>> sense for the second call to blah.
>>
>> Shall I try putting the pass after pass_late_warn_uninitialized?
>
> No, simply pass -fno-tree-tail-merge in the testcase.
>
>>> But my general comment still applies - I don't like the structural
>>> comparison code at all and you should really use the value-numbering
>>> machineries we have
>>
>> I now use sccvn.
>
> Good.
>
>>> or even better, merge this pass with FRE itself
>>> (or DOM if that suits you more). For FRE you'd want to hook into
>>> tree-ssa-pre.c:eliminate().
>>>
>>
>> If we need to do the transformation after pass_late_warn_uninitialized, it
>> needs to stay on its own, I suppose.
>
> I suppose part of the high cost of the pass is running SCCVN, so it
> makes sense to share that with the existing FRE run.

Done.

> Any reason
> you use VN_NOWALK?
>

No, that was just a first-try value.

>>>>> This also duplicates the literal matching done on the RTL level - instead
>>>>> I think this optimization would be more related to value-numbering
>>>>> (either that of SCCVN/FRE/PRE or that of DOM which also does
>>>>> jump-threading).
>>>>
>>>> The pass currently does just duplicate block elimination, not
>>>> cross-jumping. If we would like to extend this to cross-jumping, I think
>>>> we need to do the reverse of value numbering: walk backwards over the bb,
>>>> and keep track of the way values are used rather than defined. This will
>>>> allow us to make a cut halfway through a basic block.
>>>
>>> I don't understand - I propose to do literal matching but using
>>> value-numbering for tracking equivalences to avoid literal matching for
>>> stuff we know is equivalent. In fact I think it will be mostly calls and
>>> stores where we need to do literal matching, but never intermediate
>>> computations on registers.
>>>
>>
>> I tried to implement that scheme now.
>>
>>> But maybe I miss something here.
>>>
>>>> In general, we cannot make a cut halfway through a basic block in the
>>>> current implementation (of value numbering and forward matching), since we
>>>> assume equivalence of the incoming vops at bb entry. This assumption is in
>>>> general only valid if we indeed replace the entire block by another entire
>>>> block.
>>>
>>> Why are VOPs of concern at all?
>>>
>>
>> In the previous version, I inserted the phis for the vops manually.
>> In the current version of the pass, I let TODO_update_ssa_only_virtuals deal
>> with vops, so it's not relevant anymore.
>>
>>>> I imagine that a cross-jumping heuristic would be based on the length of
>>>> the match and the amount of non-vop phis it would introduce. Then value
>>>> numbering would be something orthogonal to this optimization, which would
>>>> reduce the amount of phis needed for a cross-jump.
>>>> I think it would make sense to use SCCVN value numbering at the point that
>>>> we have this backward matching.
>>>>
>>>> I'm not sure whether it's a good idea to try to replace the current
>>>> forward local value numbering with SCCVN value numbering, since we
>>>> currently declare vops equal which are, in the global sense, not equal.
>>>> And once we go to backward matching, we'll need something to keep track of
>>>> the uses, and we can reuse the current infrastructure for that, but not
>>>> the SCCVN value numbering.
>>>>
>>>> Does that make any sense?
>>>
>>> Ok, let me think about this a bit.
>>>
>>
>> I tried to be more clear on this in the header comment of the pass.
>>
>>> For now, about the patch in general. The functions need renaming to
>>> something more sensible now that this isn't cfg-cleanup anymore.
>>>
>>> I miss a general overview of the pass - it's hard to reverse engineer
>>> its working for me.
>>
>> I added a header comment.
>>
>>> Like (working backwards), you are detecting duplicate predecessors
>>> - that obviously doesn't work for duplicates
>>> without any successors, like those ending in noreturn calls.
>>>
>>
>> Merging of blocks without successors works now.
>>
>>> +  n = EDGE_COUNT (bb->preds);
>>> +
>>> +  for (i = 0; i < n; ++i)
>>> +    {
>>> +      e1 = EDGE_PRED (bb, i);
>>> +      if (e1->flags & EDGE_COMPLEX)
>>> +        continue;
>>> +      for (j = i + 1; j < n; ++j)
>>> +        {
>>>
>>> that's quadratic in the number of predecessors.
>>>
>>
>> The quadratic comparison is now limited by PARAM_TAIL_MERGE_MAX_COMPARISONS.
>> Each bb is compared to maximally PARAM_TAIL_MERGE_MAX_COMPARISONS similar
>> bbs per worklist iteration.
>>
>>> +  /* Block e1->src might be deleted.  If bb and e1->src are the same
>>> +     block, delete e2->src instead, by swapping e1 and e2.  */
>>> +  e1_swapped = (bb == e1->src) ? e2 : e1;
>>> +  e2_swapped = (bb == e1->src) ? e1 : e2;
>>>
>>> is that because you incrementally merge preds two at a time? As you
>>> are deleting blocks don't you need to adjust the quadratic walking?
>>> Thus, with say four equivalent preds won't your code crash anyway?
>>>
>>
>> I think it was to make calculation of dominator info easier, but I now use
>> functions from dominance.c for that, so this piece of code is gone.
>>
>>> I think the code needs to delay the CFG manipulation to the end
>>> of this function.
>>>
>>
>> I now delay the cfg manipulation till after each analysis phase.
>>
>>> +/* Returns whether for all phis in E1->dest the phi alternatives for E1 and
>>> +   E2 are either:
>>> +   - equal, or
>>> +   - defined locally in E1->src and E2->src.
>>> +   In the latter case, register the alternatives in *PHI_EQUIV.  */
>>> +
>>> +static bool
>>> +same_or_local_phi_alternatives (equiv_t *phi_equiv, edge e1, edge e2)
>>> +{
>>> +  int n1 = e1->dest_idx;
>>> +  int n2 = e2->dest_idx;
>>> +  gimple_stmt_iterator gsi;
>>> +  basic_block dest = e1->dest;
>>> +  gcc_assert (dest == e2->dest);
>>>
>>> too many asserts in general - I'd say for this case pass in the destination
>>> block as argument.
>>>
>>> +  gcc_assert (val1 != NULL_TREE);
>>> +  gcc_assert (val2 != NULL_TREE);
>>>
>>> superfluous.
>>>
>>> +static bool
>>> +cleanup_duplicate_preds_1 (equiv_t phi_equiv, edge e1, edge e2)
>>> ...
>>> +  VEC (edge,heap) *redirected_edges;
>>> +  gcc_assert (bb == e2->dest);
>>>
>>> same.
>>>
>>> +  if (e1->flags != e2->flags)
>>> +    return false;
>>>
>>> that's bad - it should handle EDGE_TRUE/FALSE_VALUE mismatches
>>> by swapping edges in the preds.
>>>
>>
>> That's handled now.
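To illustrate the edge-swapping case: the comparison now accepts two
conditional blocks whose true/false successor edges are swapped, provided the
conditions are each other's inverse. A minimal sketch of the idea - a
hypothetical helper, not the patch's actual code, and it ignores the
HONOR_NANS handling that floating-point comparisons need:

static bool
cond_equal_up_to_inversion (gimple cond1, gimple cond2, bool inv_cond)
{
  enum tree_code code1 = gimple_cond_code (cond1);
  enum tree_code code2 = gimple_cond_code (cond2);

  if (inv_cond)
    /* Swapped EDGE_TRUE_VALUE/EDGE_FALSE_VALUE: the codes must be
       inverses, e.g. EQ_EXPR vs. NE_EXPR.  Second argument: whether
       NaNs are honored; false in this sketch.  */
    code2 = invert_tree_comparison (code2, false);

  return (code1 == code2
          && operand_equal_p (gimple_cond_lhs (cond1),
                              gimple_cond_lhs (cond2), 0)
          && operand_equal_p (gimple_cond_rhs (cond1),
                              gimple_cond_rhs (cond2), 0));
}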
>>
>>> +  /* TODO: We could allow multiple successor edges here, as long as bb1
>>> +     and bb2 have the same successors.  */
>>> +  if (EDGE_COUNT (bb1->succs) != 1 || EDGE_COUNT (bb2->succs) != 1)
>>> +    return false;
>>>
>>> hm, ok - that would need fixing, too. Same or mergeable successors
>>> of course, which makes me wonder if doing this whole transformation
>>> incrementally and locally is a good idea ;)  Also
>>>
>>
>> Also handled now.
>>
>>> +  /* Calculate the changes to be made to the dominator info.
>>> +     Calculate bb2_dom.  */
>>> ...
>>>
>>> wouldn't be necessary I suppose (just throw away dom info after the
>>> pass).
>>>
>>> That is, I'd globally record BB equivalences (thus, "value-number"
>>> BBs) and apply the CFG manipulations at a single point.
>>>
>>
>> I delay the cfg manipulation till after each analysis phase. Delaying the
>> cfg manipulation till the end of the pass instead might make the analysis
>> code more convoluted.
>>
>>> Btw, I miss where you insert PHI nodes for all uses that flow in
>>> from the preds preds - you do that for VOPs but not for real
>>> operands?
>>>
>>
>> Indeed, inserting phis for non-vops is a todo.
>>
>>> +  /* Replace uses of vuse2 with uses of the phi.  */
>>> +  for (gsi = gsi_start_bb (bb2); !gsi_end_p (gsi); gsi_next (&gsi))
>>> +    {
>>>
>>> why not walk immediate uses of the old PHI and SET_USE to
>>> the new one instead (for those uses in the duplicate BB of course)?
>>>
>>
>> And I no longer insert VOP phis, but let a TODO handle that, so this code
>> is gone.
>
> Ok. Somewhat costly in comparison though.
>

I tried to add that back, guarded by update_vops. Handled in update_vuses,
vop_phi, insn_vops, vop_at_entry, replace_block_by.

>>> +    case GSS_CALL:
>>> +      if (!pt_solution_equal_p (gimple_call_use_set (s1),
>>> +                                gimple_call_use_set (s2))
>>>
>>> I don't understand why you are concerned about equality of
>>> points-to information. Why not simply ior it (pt_solution_ior_into - note
>>> they are shared so you need to unshare them first).
>>>
>>
>> I let a todo handle the alias info now.
>
> Hmm, that's not going to work if it's needed for correctness.
>

Should be handled by merge_calls now.

>>> +/* Return true if p1 and p2 can be considered equal.  */
>>> +
>>> +static bool
>>> +pt_solution_equal_p (struct pt_solution *p1, struct pt_solution *p2)
>>>
>>> would go into tree-ssa-structalias.c instead.
>>>
>>> +static bool
>>> +gimple_base_equal_p (gimple s1, gimple s2)
>>> +{
>>> ...
>>> +  if (gimple_modified_p (s1) || gimple_modified_p (s2))
>>> +    return false;
>>>
>>> that shouldn't be of concern.
>>>
>>> +  if (s1->gsbase.subcode != s2->gsbase.subcode)
>>> +    return false;
>>>
>>> for assigns that are of class GIMPLE_SINGLE_RHS we do not
>>> update subcode during transformations so it can differ for now
>>> equal statements.
>>>
>>
>> Handled properly now.
>>
>>> I'm not sure if a splay tree for the SSA name version equivalency
>>> map is the best representation - I would have used a simple
>>> array of num_ssa_names size and assign value-numbers
>>> (the lesser version for example).
>>>
>>> Thus equiv_insert would do
>>>
>>>   value = MIN (SSA_NAME_VERSION (val1), SSA_NAME_VERSION (val2));
>>>   values[SSA_NAME_VERSION (val1)] = value;
>>>   values[SSA_NAME_VERSION (val2)] = value;
>>>
>>> if the names are not defined in bb1 resp. bb2 we would have to insert
>>> a PHI node in the merged block - that would be a cost thingy for
>>> doing this value-numbering in a more global way.
>>>
>>
>> The local value numbering code has been removed.
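(Spelled out, the array-based scheme suggested above would have looked roughly
like this - a sketch of the dropped approach, not part of the patch, which now
reuses SCCVN value numbers instead:

static int *values;  /* Array of size num_ssa_names, mapping each SSA name
                        version to its value number.  */

static void
equiv_insert (tree val1, tree val2)
{
  /* Use the lesser of the two versions as the shared value number.  */
  int value = MIN (SSA_NAME_VERSION (val1), SSA_NAME_VERSION (val2));
  values[SSA_NAME_VERSION (val1)] = value;
  values[SSA_NAME_VERSION (val2)] = value;
}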
>>
>>> You don't seem to be concerned about the SSA names points-to
>>> information, but it surely has the same issues as that of the calls
>>> (so either they should be equal or they should be conservatively
>>> merged). But as-is you don't allow any values to flow into the
>>> merged blocks that are not equal for both edges, no?
>>>
>>
>> Correct, that's still a todo.
>>
>>> +  TV_TREE_CFG, /* tv_id */
>>>
>>> add a new timevar. We want to be able to turn the pass off,
>>> so also add a new option (I can see it might make debugging harder
>>> in some cases).
>>>
>>
>> I added -ftree-tail-merge and TV_TREE_TAIL_MERGE.
>>
>>> Can you assess the effect of the patch on GCC itself (for example
>>> when building cc1?)? What's the size benefit and the compile-time
>>> overhead?
>>>
>>
>> Effect on building cc1:
>>
>>                real        user        sys
>> without: 19m50.158s  19m 2.090s  0m20.860s
>> with:    19m59.456s  19m17.170s  0m20.350s
>>                      ----------
>>                        +15.080s
>>                          +1.31%
>
> That's quite a lot of time.
>

Measurement for this version:

               real        user        sys
without  19m59.995s  19m 9.970s  0m21.050s
with     19m56.160s  19m14.830s  0m21.530s
                     ----------
                         +4.86s
                         +0.42%

    text    data     bss       dec      hex  filename
17547657   41736 1364384  18953777  1213631  without/cc1
17211049   41736 1364384  18617169  11c1351  with/cc1
--------
 -336608
  -1.92%

>> $ size without/cc1 with/cc1
>>     text    data     bss       dec      hex  filename
>> 17515986   41320 1364352  18921658  120b8ba  without/cc1
>> 17399226   41320 1364352  18804898  11ef0a2  with/cc1
>> --------
>>  -116760
>>   -0.67%
>>
>> OK for trunk, provided build & reg-testing on ARM is ok?
>
> I miss additions to the testsuite.
>

I will send an updated patch on thread
http://gcc.gnu.org/ml/gcc-patches/2011-06/msg00625.html.

> +static bool
> +bb_dominated_by_p (basic_block bb1, basic_block bb2)
>
> please use
>
> +      if (TREE_CODE (val1) == SSA_NAME)
> +        {
> +          if (!same_preds
>               && !SSA_NAME_IS_DEFAULT_DEF (val1)
>               && !dominated_by_p (bb2, gimple_bb (SSA_NAME_DEF_STMT (val1)))
>             return false;
>
> instead. All stmts should have a BB apart from def stmts of default defs
> (which are gimple_nops).
>

Done.

> +/* Return the canonical scc_vn tree for X, if we can use scc_vn_info.
> +   Otherwise, return X.  */
> +
> +static tree
> +gvn_val (tree x)
> +{
> +  return ((scc_vn_ok && x != NULL && TREE_CODE (x) == SSA_NAME)
> +          ? VN_INFO ((x))->valnum : x);
> +}
>
> I suppose we want to export vn_valueize from tree-ssa-sccvn.c instead
> which seems to perform the same.

Done.

> Do you ever call the above
> when scc_vn_ok is false or x is NULL?

Not in this version. Earlier, I also ran the pass if sccvn bailed out, but pre
and fre only run if sccvn succeeded.

> +static bool
> +gvn_uses_equal (tree val1, tree val2, basic_block bb1,
> +                basic_block bb2, bool same_preds)
> +{
> +  gimple def1, def2;
> +  basic_block def1_bb, def2_bb;
> +
> +  if (val1 == NULL_TREE || val2 == NULL_TREE)
> +    return false;
>
> does this ever happen?

Not in the current version. Removed.

> +  if (gvn_val (val1) != gvn_val (val2))
> +    return false;
>
> I suppose a shortcut
>
>   if (val1 == val2)
>     return true;
>
> is possible?
>

Indeed. Added.

> +static int *bb_size;
> +
> +/* Init bb_size administration.  */
> +
> +static void
> +init_bb_size (void)
> +{
>
> if you need more per-BB info you can hook it into bb->aux. What's the
> size used for (I guess I'll see below ...)?
>
> +  for (gsi = gsi_start_nondebug_bb (bb);
> +       !gsi_end_p (gsi); gsi_next_nondebug (&gsi))
> +    size++;
>
> is pretty rough.
> I guess for a quick false result for comparing BBs
> (which means you could initialize the info lazily?)

Done.

> +struct same_succ
> +{
> +  /* The bbs that have the same successor bbs.  */
> +  bitmap bbs;
> +  /* The successor bbs.  */
> +  bitmap succs;
> +  /* Indicates whether the EDGE_TRUE/FALSE_VALUEs of succ_flags are swapped
> +     for bb.  */
> +  bitmap inverse;
> +  /* The edge flags for each of the successor bbs.  */
> +  VEC (int, heap) *succ_flags;
> +  /* Indicates whether the struct is in the worklist.  */
> +  bool in_worklist;
> +};
>
> looks somewhat odd at first sight - maybe an overall comment what this
> is used for is missing. Well, let's see.
>

Tried to add an overall comment.

> +static hashval_t
> +same_succ_hash (const void *ve)
> +{
> +  const_same_succ_t e = (const_same_succ_t) ve;
> +  hashval_t hashval = bitmap_hash (e->succs);
> +  int flags;
> +  unsigned int i;
> +  unsigned int first = bitmap_first_set_bit (e->bbs);
> +  int size = bb_size[first];
> +  gimple_stmt_iterator gsi;
> +  gimple stmt;
> +  basic_block bb = BASIC_BLOCK (first);
> +
> +  hashval = iterative_hash_hashval_t (size, hashval);
> +  for (gsi = gsi_start_nondebug_bb (bb);
> +       !gsi_end_p (gsi); gsi_next_nondebug (&gsi))
> +    {
> +      stmt = gsi_stmt (gsi);
> +      hashval = iterative_hash_hashval_t (gimple_code (stmt), hashval);
> +      if (!is_gimple_call (stmt))
> +        continue;
> +      if (gimple_call_internal_p (stmt))
> +        hashval = iterative_hash_hashval_t
> +          ((hashval_t) gimple_call_internal_fn (stmt), hashval);
> +      else
> +        hashval = iterative_hash_expr (gimple_call_fn (stmt), hashval);
>
> you could also keep a cache of the BB hash as you keep a cache
> of the size (if this function is called multiple times per BB).

Right, I forgot it's a closed hash table. Added the cache of the bb hash.

> The hash looks relatively weak - for all assignments it will hash
> in GIMPLE_ASSIGN only ... I'd at least hash in gimple_assign_rhs_code.

Done.

> The call handling OTOH looks overly complicated to me ;)
>

That's an attempt to handle compiling insn-recog.c efficiently. All bbs
without successors are grouped together, and even after selecting on same
function name, there are still thousands of bbs in that group. I added now the
args as well.

> The hash will be dependent on stmt ordering even if that doesn't matter,
> like
>
>   i = i + 1;
>   j = j - 1;
>
> vs. the swapped variant.

Right, that's a todo; added that in the header comment.

> Similar the successor edges are not sorted,
> so true/false edges may be in different order.
>

I keep a cache of the successor edge flags, in order of bbs.

> Not sure yet if your comparison function would make those BBs
> unequal anyway.
>
> +static bool
> +inverse_flags (const_same_succ_t e1, const_same_succ_t e2)
> +{
> +  int f1a, f1b, f2a, f2b;
> +  int mask = ~(EDGE_TRUE_VALUE | EDGE_FALSE_VALUE);
> +
> +  if (VEC_length (int, e1->succ_flags) != 2)
> +    return false;
> ...
>
> I wonder why you keep a VEC of successor edges in same_succ_t
> instead of using the embedded successor edge vector in the basic_block
> structure?
>

To keep the information in order of bbs, for quick comparison.

> +  bb_to_same_succ[bb->index] = *slot;
>
> looks like a candidate for per-BB info in bb->aux, too.
>

Done.
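The per-BB info now lives behind bb->aux. A sketch of the shape this takes
(the patch's actual struct is the aux_bb_info from the ChangeLog with the
BB_SIZE, BB_SAME_SUCC, BB_CLUSTER and BB_VOP_AT_EXIT accessors; the exact
field set here is guessed for illustration):

struct aux_bb_info
{
  int size;               /* Cached number of non-debug stmts in the bb.  */
  same_succ_t same_succ;  /* Same-successor equivalence class of the bb.  */
};

#define BB_SIZE(bb) (((struct aux_bb_info *) (bb)->aux)->size)
#define BB_SAME_SUCC(bb) (((struct aux_bb_info *) (bb)->aux)->same_succ)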
> +static void
> +find_same_succ (void)
> +{
> +  int i;
> +  same_succ_t same = same_succ_alloc ();
> +
> +  for (i = 0; i < last_basic_block; ++i)
> +    {
> +      find_same_succ_bb (BASIC_BLOCK (i), &same);
> +      if (same == NULL)
> +        same = same_succ_alloc ();
> +    }
>
> I suppose you want FOR_EACH_BB (excluding entry/exit block) or
> FOR_ALL_BB (including them). The above also can
> have BASIC_BLOCK (i) == NULL. Similar in other places.
>

Done.

> +  for (i = 0; i < n1; ++i)
> +    {
> +      ei = EDGE_PRED (bb1, i);
> +      for (j = 0; j < n2; ++j)
> +        {
> +          ej = EDGE_PRED (bb2, j);
> +          if (ei->src != ej->src)
> +            continue;
> +          nr_matches++;
> +          break;
> +        }
> +    }
>
>   FOR_EACH_EDGE (ei, iterator, bb1->preds)
>     if (!find_edge (ei->src, bb2))
>       return false;
>
> is easier to parse.
>

Done.

> +static bool
> +gimple_subcode_equal_p (gimple s1, gimple s2, bool inv_cond)
> +{
> +  tree var, var_type;
> +  bool honor_nans;
> +
> +  if (is_gimple_assign (s1)
> +      && gimple_assign_rhs_class (s1) == GIMPLE_SINGLE_RHS)
> +    return true;
>
> the subcode for GIMPLE_SINGLE_RHS is gimple_assign_rhs_code
> (TREE_CODE of gimple_assign_rhs1 actually).
>
> +static bool
> +gimple_base_equal_p (gimple s1, gimple s2, bool inv_cond)
>
> I wonder if you still need this given ..
>
> +static bool
> +gimple_equal_p (gimple s1, gimple s2, bool same_preds, bool inv_cond)
> +{
> +  unsigned int i;
> +  enum gimple_statement_structure_enum gss;
> +  tree lhs1, lhs2;
> +  basic_block bb1 = gimple_bb (s1), bb2 = gimple_bb (s2);
> +
> +  /* Handle omp gimples conservatively.  */
> +  if (is_gimple_omp (s1) || is_gimple_omp (s2))
> +    return false;
> +
> +  /* Handle lhs.  */
> +  lhs1 = gimple_get_lhs (s1);
> +  lhs2 = gimple_get_lhs (s2);
> +  if (lhs1 != NULL_TREE && lhs2 != NULL_TREE)
> +    return (same_preds && TREE_CODE (lhs1) == SSA_NAME
> +            && TREE_CODE (lhs2) == SSA_NAME
> +            && gvn_val (lhs1) == gvn_val (lhs2));
> +  else if (!(lhs1 == NULL_TREE && lhs2 == NULL_TREE))
> +    return false;
>
> all lhs equivalency is deferred to GVN (which means all GIMPLE_ASSIGN
> and GIMPLE_CALL stmts with a lhs).
>
> That leaves the case of calls without a lhs. I'd rather structure this
> function like
>
>   if (gimple_code (s1) != gimple_code (s2))
>     return false;
>   switch (gimple_code (s1))
>     {
>     case GIMPLE_CALL:
>       ... compare arguments ...
>       if equal ok, if not and we have a lhs use GVN.
>
>     case GIMPLE_ASSIGN:
>       ... compare GVN of the LHS ...
>
>     case GIMPLE_COND:
>       ... compare operands ...
>
>     default:
>       return false;
>     }
>

Done.

> +static bool
> +bb_gimple_equal_p (basic_block bb1, basic_block bb2, bool same_preds,
> +                   bool inv_cond)
> +
> +{
>
> you don't do an early out by comparing the pre-computed sizes.

This function is only called for bbs with the same size. The hash table equal
function does have the size comparison.

> Mind
> you can have hashtable collisions where they still differ (did you
> check hashtable stats on it? how is the collision rate?)
>

I managed to lower the collision rate by specifying n_basic_blocks as hash
table size. While compiling insn-recog.c, the highest collision rate for a
function is 2.69.

> +static bool
> +bb_has_non_vop_phi (basic_block bb)
> +{
> +  gimple_seq phis = phi_nodes (bb);
> +  gimple phi;
> +
> +  if (phis == NULL)
> +    return false;
> +
> +  if (!gimple_seq_singleton_p (phis))
> +    return true;
> +
> +  phi = gimple_seq_first_stmt (phis);
> +  return !VOID_TYPE_P (TREE_TYPE (gimple_phi_result (phi)));
>
>   return is_gimple_reg (gimple_phi_result (phi));
>

Done.
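For clarity, with that change applied the function reads (a sketch of the
updated form):

static bool
bb_has_non_vop_phi (basic_block bb)
{
  gimple_seq phis = phi_nodes (bb);

  if (phis == NULL)
    return false;
  /* More than one phi: at least one of them is a non-vop phi.  */
  if (!gimple_seq_singleton_p (phis))
    return true;

  /* A single phi: non-vop iff its result is a real (non-virtual) operand.  */
  return is_gimple_reg (gimple_phi_result (gimple_seq_first_stmt (phis)));
}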
> +static void
> +update_debug_stmts (void)
> +{
> +  int i;
> +  basic_block bb;
> +
> +  for (i = 0; i < last_basic_block; ++i)
> +    {
> +      gimple stmt;
> +      gimple_stmt_iterator gsi;
> +
> +      bb = BASIC_BLOCK (i);
>
> FOR_EACH_BB
>
> it must be possible to avoid scanning basic-blocks that are not affected
> by the transform, no? In fact the only affected basic-blocks should be
> those that were merged with another block?

Done. I also check for MAY_HAVE_DEBUG_STMTS now.

> +  /* Mark vops for updating.  Without this, TODO_update_ssa_only_virtuals
> +     won't do anything.  */
> +  mark_sym_for_renaming (gimple_vop (cfun));
>
> it won't insert any PHIs, that's correct. Still somewhat ugly, a manual
> update of PHI nodes should be possible.
>

Added. I'm trying to be lazy about it though:

+  bool update_vops = ((todo & TODO_update_ssa_only_virtuals) == 0
+                      || !symbol_marked_for_renaming (gimple_vop (cfun)));

If we're going to insert those phis anyway given the current todo, we don't
bother.

> +  if (dump_file)
> +    {
> +      fprintf (dump_file, "Before TODOs.\n");
>
> with TDF_DETAILS only please.
>

Done.

> +  free_dominance_info (CDI_DOMINATORS);
>
> if you keep dominance info up-to-date there is no need to free it.
>

Indeed. And by not freeing it, the info was checked, and I hit validation
errors in that info, so the updating code had problems. I reverted back to
calculating when needed and freeing when changed.

> +  TODO_verify_ssa | TODO_verify_stmts
> +  | TODO_verify_flow | TODO_update_ssa_only_virtuals
> +  | TODO_rebuild_alias
>
> please no TODO_rebuild_alias, simply remove it - alias info in merged
> paths should be compatible enough if there is value-equivalence between
> SSA names. At least you can't rely on TODO_rebuild_alias for
> correctness - it is skipped if IPA PTA was run for example.
>

Done.

> +  | TODO_cleanup_cfg
>
> is that needed? If so return it from your execute function if you changed
> anything only. But I doubt your transformation introduces cleanup
> opportunities?
>

If all the predecessor blocks of a block are merged, the block and its
remaining predecessor block might be merged, so that is a cleanup opportunity.
Removed for the moment.

> New options and params need documentation in doc/invoke.texi.
>

Added.

> Thanks,
> Richard.
>

Bootstrapped and reg-tested on x86_64. Ok for trunk (after ARM testing)?

Thanks,
- Tom

2011-07-17  Tom de Vries

	PR middle-end/43864
	* tree-ssa-tail-merge.c: New file.
	(struct same_succ): Define.
	(same_succ_t, const_same_succ_t): New typedef.
	(struct bb_cluster): Define.
	(bb_cluster_t, const_bb_cluster_t): New typedef.
	(struct aux_bb_info): Define.
	(BB_SIZE, BB_SAME_SUCC, BB_CLUSTER, BB_VOP_AT_EXIT): Define.
	(gvn_uses_equal): New function.
	(same_succ_print, same_succ_print_traverse, same_succ_hash)
	(inverse_flags, same_succ_equal, same_succ_alloc, same_succ_delete)
	(same_succ_reset): New function.
	(same_succ_htab, same_succ_edge_flags)
	(deleted_bbs, deleted_bb_preds): New var.
	(debug_same_succ): New function.
	(worklist): New var.
	(print_worklist, add_to_worklist, find_same_succ_bb, find_same_succ)
	(init_worklist, delete_worklist, delete_basic_block_same_succ)
	(same_succ_flush_bbs, update_worklist): New function.
	(print_cluster, debug_cluster, same_predecessors)
	(add_bb_to_cluster, new_cluster, delete_cluster): New function.
	(all_clusters): New var.
	(alloc_cluster_vectors, reset_cluster_vectors, delete_cluster_vectors)
	(merge_clusters, set_cluster): New function.
	(gimple_equal_p, find_duplicate, same_phi_alternatives_1)
	(same_phi_alternatives, bb_has_non_vop_phi, find_clusters_1)
	(find_clusters): New function.
	(merge_calls, update_vuses, vop_phi, insn_vops, vop_at_entry)
	(replace_block_by): New function.
	(update_bbs): New var.
	(apply_clusters): New function.
	(update_debug_stmt, update_debug_stmts): New function.
	(tail_merge_optimize): New function.
	* tree-flow.h (tail_merge_optimize): Declare.
	* tree-ssa-pre.c (execute_pre): Use tail_merge_optimize.
	* Makefile.in (OBJS-common): Add tree-ssa-tail-merge.o.
	(tree-ssa-tail-merge.o): New rule.
	* opts.c (default_options_table): Set OPT_ftree_tail_merge by default
	at OPT_LEVELS_2_PLUS.
	* tree-ssa-sccvn.c (vn_valueize): Move to ...
	* tree-ssa-sccvn.h (vn_valueize): ... here.
	* tree-ssa-alias.h (pt_solution_ior_into_shared): Declare.
	* tree-ssa-structalias.c (find_what_var_points_to): Factor out and
	use ...
	(pt_solution_share): New function.
	(pt_solution_unshare, pt_solution_ior_into_shared): New function.
	(delete_points_to_sets): Nullify shared_bitmap_table after deletion.
	* timevar.def (TV_TREE_TAIL_MERGE): New timevar.
	* common.opt (ftree-tail-merge): New switch.
	* params.def (PARAM_MAX_TAIL_MERGE_COMPARISONS): New parameter.
	* doc/invoke.texi (Optimization Options, -O2): Add -ftree-tail-merge.
	(-ftree-tail-merge, max-tail-merge-comparisons): New item.