Date: Tue, 31 Aug 2021 13:29:12 +0200 (CEST)
From: Richard Biener
To: Xionghu Luo
Cc: gcc-patches@gcc.gnu.org, segher@kernel.crashing.org, dje.gcc@gmail.com,
    wschmidt@linux.ibm.com, guojiufu@linux.ibm.com, linkw@gcc.gnu.org
Subject: Re: [PATCH v3] Fix incomplete computation in fill_always_executed_in_1

On Tue, 31 Aug 2021, Xionghu Luo wrote:

> 
> 
> On 2021/8/30 17:19, Richard Biener wrote:
> >>>>       bitmap_set_bit (work_set, loop->header->index);
> >>>> +     unsigned bb_index;
> >>>> -     for (i = 0; i < loop->num_nodes; i++)
> >>>> -       {
> >>>> -         edge_iterator ei;
> >>>> -         bb = bbs[i];
> >>>> +     unsigned array_size = last_basic_block_for_fn (cfun) + 1;
> >>>> +     int *bbd = XNEWVEC (int, array_size);
> >>>> +     bbd = XDUPVEC (int, bbi, array_size);
> >>> I don't think you need to copy 'bbi' but you can re-use the
> >>> state from the outer loop processing.  Did you run into any
> >>> issues with that?
> >> Yes.  For example, after adding a small if-else block to ssa-lim-19.c,
> >> the block "x->j += tem * i;" of bb 6 is always executed for loop 2.
> >> When fill_always_executed_in_1 is called for loop 1, bbi[6] is
> >> decreased from 2 to 1 to 0; if fill_always_executed_in_1 is then
> >> called again for loop 2, its value is not reset, so bbi[6] won't be
> >> set ALWAYS_EXECUTED, which is wrong.
> >> 
> >> 
> >> struct X { int i; int j; int k;};
> >> 
> >> void foo(struct X *x, int n, int l, int m)
> >> {
> >>   for (int j = 0; j < l; j++)   // loop 1
> >>     {
> >>       for (int i = 0; i < n; ++i)   // loop 2
> >>     {
> >>       if (m)
> >>         x->j++;
> >>       else
> >>         x->j = m+n+l;
> >> 
> >>       int *p = &x->j;   // bb 6
> >>       int tem = *p;
> >>       x->j += tem * i;
> >>     }
> >>       int *r = &x->k;
> >>       int tem2 = *r;
> >>       x->k += tem2 * j;
> >>     }
> >> }
> > Hmm, but if the outer loop processing reaches bb 6 then
> > it should have set it ALWAYS_EXECUTED in loop 1 already?
> 
> But bb 6 is NOT ALWAYS_EXECUTED for loop 1; it is only ALWAYS_EXECUTED
> for loop 2, as it requires n > 0.
> Please refer to the attached file ssa-lim-19.c.138t.lim2.
> 
> ;;
> ;; Loop 1
> ;;  header 8, latch 12
> ;;  depth 1, outer 0
> ;;  nodes: 8 12 7 6 4 5 3 13 11
> ;;
> ;; Loop 2
> ;;  header 3, latch 13
> ;;  depth 2, outer 1
> ;;  nodes: 3 13 6 4 5
> ;; 2 succs { 10 9 }
> ;; 10 succs { 8 }
> ;; 11 succs { 3 }
> ;; 3 succs { 4 5 }
> ;; 4 succs { 6 }
> ;; 5 succs { 6 }
> ;; 6 succs { 13 7 }
> ;; 13 succs { 3 }
> ;; 7 succs { 12 9 }
> ;; 12 succs { 8 }
> ;; 8 succs { 11 7 }
> ;; 9 succs { 1 }
> 
> always executed: bb->index: 8, loop->num: 1
> always executed: bb->index: 7, loop->num: 1
> always executed: bb->index: 3, loop->num: 2
> always executed: bb->index: 6, loop->num: 2
> 
>          8<---------------
>         / \              |
>       11   \             |
>       /     \            |
>      3<---   \           |
>     / \   |   \          |
>    4   5  |    \         |
>     \ /   |     \        |
>      6    |      \       |
>      |-->13       \      |
>      |----------> 7      |
>                  / \     |
>                 9   12---
> 
> (gdb) x /15x bbd
> 0x1354c9b0:  0x00000000  0x00000000  0x00000001  0x00000001
> 0x1354c9c0:  0x00000001  0x00000001  0x00000002  0x00000002
> 0x1354c9d0:  0x00000001  0x00000002  0x00000001  0x00000001
> 0x1354c9e0:  0x00000001  0x00000001  0x00000000
> 
> For loop 1, our algorithm will walk through 8->11->3->4->5->6->7 and
> exit at edge 7->9.
> 
> (gdb) x /15x bbd
> 0x1354c9b0:  0x00000000  0x00000000  0x00000001  0x00000000
> 0x1354c9c0:  0x00000000  0x00000000  0x00000000  0x00000000
> 0x1354c9d0:  0x00000001  0x00000002  0x00000001  0x00000000
> 0x1354c9e0:  0x00000001  0x00000000  0x00000000
> 
> If we don't reset bbd to the incoming-edge counts with memcpy, bbd[3],
> bbd[4], bbd[5] and bbd[6] are now 0 for loop 2, so
> fill_always_executed_in_1 couldn't set ALWAYS_EXECUTED correctly for
> loop 2 at bb 3 and bb 6.
> 
> > 
> >> 
> >>> 
> >>>> +     while (!bitmap_empty_p (work_set))
> >>>> +       {
> >>>> +         bb_index = bitmap_first_set_bit (work_set);
> >>>> +         bitmap_clear_bit (work_set, bb_index);
> >>>> +         bb = BASIC_BLOCK_FOR_FN (cfun, bb_index);
> >>>>          if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> >>>> -          last = bb;
> >>>> -
> >>>> +          SET_ALWAYS_EXECUTED_IN (bb, loop);
> >>>>          if (bitmap_bit_p (contains_call, bb->index))
> >>>>            break;
> >>> I think you want to continue; here (process remaining worklist
> >>> but not continue greedy walking this block)
> >> Same as above: if we use 'continue' instead of 'break', the algorithm
> >> again doesn't work.  If the inner loop contains a jump to the
> >> outermost loop, the blocks after the jump block will be set
> >> ALWAYS_EXECUTED incorrectly.
> >> 
> >>>> -
> >>>> +         edge_iterator ei;
> >>>>          FOR_EACH_EDGE (e, ei, bb->succs)
> >>>>            {
> >>>> -             /* If there is an exit from this BB.  */
> >>>>              if (!flow_bb_inside_loop_p (loop, e->dest))
> >>>>                break;
> >>> in particular this should keep the outer 'bbi' valid to re-use.
> >>> 
> >>> But again, you want 'continue;' the greedy walk to other edges.
> >>> If that's not valid (I'd need to think about this) then with
> >>> your patch whether we process an edge depends on the order
> >>> of the edge visit so you'd have to walk successors twice,
> >>> once to determine whether we can greedily walk any of it
> >>> and once to actually do the greedy walk.
> >>> 
> >>> So thinking about it, an exit edge is like a non-returning call
> >>> and thus we indeed should not process any outgoing edges of
> >>> this block.
> >>> 
> >>>> +
> >>>>              /* Or we enter a possibly non-finite loop.  */
> >>>>              if (flow_loop_nested_p (bb->loop_father,
> >>>>                                      e->dest->loop_father)
> >>>>                  && ! finite_loop_p (e->dest->loop_father))
> >>>>                break;
> >>> I think this is no longer necessary?  In any case it would
> >>> again be 'continue;', see also above.
> >>> 
> >>> I'm missing skipping of backedges in the walk.
> >>> 
> >>>> +
> >>>> +             if (bbd[e->dest->index] == 1)
> >>>> +               {
> >>>> +                 bitmap_set_bit (work_set, e->dest->index);
> >>>> +                 bbd[e->dest->index]--;
> >>>> +               }
> >>>> +             else if (bbd[e->dest->index] > 1)
> >>>> +               bbd[e->dest->index]--;
> >>> maybe just
> >>> 
> >>>   bbd[e->dest->index]--;
> >>>   if (bbd[e->dest->index] == 0)
> >>>     bitmap_set_bit (work_set, e->dest->index);
> >> They are not equivalent.  The former won't decrease if
> >> bbd[e->dest->index] is 0, but the latter does.
> > we should never be visiting in this case, so sth indeed doesn't
> > work as I expected.
> 
> I've verified this won't happen if we duplicate the bb_incoming_edges
> in fill_always_executed_in_1, so we could replace it.
> 
> > 
> >>>>            }
> >>>> +
> >>>>          if (e)
> >>>>            break;
> >>> continue; processing the worklist
> >>> 
> >>>> 
> >>>> @@ -3232,34 +3245,12 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
> >>>>          to disprove this if possible).  */
> >>>>       if (bb->flags & BB_IRREDUCIBLE_LOOP)
> >>>>         break;
> >>> continue; processing the worklist.  I think this has to
> >>> come early, before processing successors, next to
> >>> the contains_call processing.  We could think of ignoring
> >>> entering of irreducible regions by tracking exits from them,
> >>> but I'm not sure we're identifying the regions in a good
> >>> enough way to allow this.
> >>> 
> >>> Likewise possibly infinite inner loops can be processed
> >>> when we track exits from those which would be sth like
> >>> 
> >>>   /* When this is an exit from an inner loop that we cannot
> >>>      prove as finite do not follow this edge.  */
> >>>   if (bb->loop_father != loop
> >>>       && loop_exit_edge_p (bb->loop_father, e)
> >>>       && ! finite_loop_p (bb->loop_father))
> >>>     continue;
> >>> 
> >>>> -
> >>>> -     if (!flow_bb_inside_loop_p (inn_loop, bb))
> >>>> -       break;
> >>>> -
> >>>> -     if (bb->loop_father->header == bb)
> >>>> -       {
> >>>> -         if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> >>>> -           break;
> >>>> -
> >>>> -         /* In a loop that is always entered we may proceed anyway.
> >>>> -            But record that we entered it and stop once we leave it.  */
> >>>> -         inn_loop = bb->loop_father;
> >>>> -       }
> >>>> -   }
> >>>> -
> >>>> -      while (1)
> >>>> -   {
> >>>> -     SET_ALWAYS_EXECUTED_IN (last, loop);
> >>>> -     if (last == loop->header)
> >>>> -       break;
> >>>> -     last = get_immediate_dominator (CDI_DOMINATORS, last);
> >>>>    }
> >>>> -
> >>>> -      free (bbs);
> >>>> +      free (bbd);
> >>>>     }
> >>>> 
> >>>>   for (loop = loop->inner; loop; loop = loop->next)
> >>>> -    fill_always_executed_in_1 (loop, contains_call);
> >>>> +    fill_always_executed_in_1 (loop, contains_call, bbi);
> >>>> }
> >>>> 
> >>>> /* Fills ALWAYS_EXECUTED_IN information for basic blocks, i.e.
> >>>> @@ -3287,8 +3278,36 @@ fill_always_executed_in (void)
> >>>>      bitmap_set_bit (contains_call, bb->index);
> >>>>     }
> >>>> +  mark_dfs_back_edges ();
> >>> Please put this to loop_invariant_motion_in_fun right before
> >>> the call to fill_always_executed_in.
> >> Done.
> >> 
> >>>> +  unsigned array_size = last_basic_block_for_fn (cfun) + 1;
> >>>> +  int *bb_incoming_edges = XNEWVEC (int, array_size);
> >>> There's XCNEWVEC that also zeros the array.
> >> OK, thanks.
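As a side note, the difference the review points at is just allocation
vs. zero-initialized allocation.  A minimal sketch with the libiberty
macros, for illustration only; 'example' and its 'array_size' parameter
are made-up stand-ins, not patch code:

  /* Compiled as part of a GCC-style tree, where libiberty.h provides
     the XNEWVEC/XCNEWVEC/XDELETEVEC allocation macros.  */
  #include <string.h>
  #include "libiberty.h"

  void
  example (unsigned array_size)
  {
    /* What the patch does: xmalloc-based, then cleared by hand.  */
    int *a = XNEWVEC (int, array_size);
    memset (a, 0, array_size * sizeof (int));

    /* The suggestion: XCNEWVEC is xcalloc-based, so already zeroed.  */
    int *b = XCNEWVEC (int, array_size);

    XDELETEVEC (a);
    XDELETEVEC (b);
  }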
> >> 
> >>>> +  memset (bb_incoming_edges, 0, array_size * sizeof (int));
> >>>> +  edge_iterator ei;
> >>>> +  edge e;
> >>>> +
> >>>> +  FOR_EACH_BB_FN (bb, cfun)
> >>>>     {
> >>>> +      int len = bb->preds->length ();
> >>>> +      FOR_EACH_EDGE (e, ei, bb->preds)
> >>>> +        if (e->flags & EDGE_DFS_BACK)
> >>>> +          len--;
> >>>> +      bb_incoming_edges[bb->index] = len;
> >>>>     }
> >>> it would be nice to combine this with the preceding loop over
> >>> all blocks.
> >> Done.
> >> 
> >>>> +  static unsigned long diff = 0;
> >>>> +  struct timeval tv;
> >>>> +  gettimeofday (&tv, NULL);
> >>>> +  unsigned long begin = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec;
> >>>> +
> >>>> +  //for (loop : loops_list (cfun, 0))
> >>>>   for (loop = current_loops->tree_root->inner; loop; loop = loop->next)
> >>>> -    fill_always_executed_in_1 (loop, contains_call);
> >>>> +    fill_always_executed_in_1 (loop, contains_call, bb_incoming_edges);
> >>>> +
> >>>> +  gettimeofday (&tv, NULL);
> >>>> +  unsigned long end = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec;
> >>>> +  diff += end - begin;
> >>>> +  //printf ("%s,diff:%ld\n", current_function_name (), diff);
> >>>> +  free (bb_incoming_edges);
> >>> Eh, looks very much like work in progress.  From the O() complexity
> >>> point duplicating bb_incoming_edges for each loop makes it quadratic
> >>> again but as said I think we don't need this duplication.
> >> Sorry, I didn't quite follow your above comments about reusing
> >> bb_incoming_edges.  We are searching the bbs that dominate loop->latch
> >> loop by loop independently; if the outer loop changes the inner loop's
> >> bb_incoming_edges, won't it make the inner loop's ALWAYS_EXECUTED
> >> computation incorrect?
> > We're only supposed to greedily walk into always executed blocks.  I see
> > this probably breaks down for conditionally executed inner loops like
> > 
> >    for ()
> >      {
> >         if ()
> >           for ();
> >         A;
> >      }
> > 
> > where we want to process the inner loop separately but we still want
> > to make sure A is marked as always executed in the outer loop.
> 
> Yes, the problem is that ALWAYS_EXECUTED blocks are not successive in the
> CFG, so we have to check every bb and maintain bb_incoming_edges in each
> loop.  If any block goes to an outer loop, we have to 'break' instead of
> 'continue', as the following bbs are not ALWAYS_EXECUTED anymore.

But we're using a worklist (since the CFG branches) - if we don't
continue to pick items from the worklist then the outcome depends on the
order of the worklist processing.  That can't be correct either (similar
when walking successor edges).

> >> Seems duplicating bb_incoming_edges does not increase the complexity,
> >> but the recursive call does?
> > Since the bb_incoming_edges array has a size of N and we copy it
> > number_of_loop times (which is O(N) bound) the copying effectively
> > has O(N^2) cost.  So yes, it's the iteration over all loops,
> > whether by recursion or iteration.
> 
> Actually, the copying is outside of the greedy CFG walk loop.  I guess
> there will be some optimization of the bb_incoming_edges memcpy by the
> library?  But maybe that only works when most of the values are the
> same, as with copy_by_pieces?
> 
> And I modified the time-consuming case with many bbs to simulate extreme
> usage and didn't observe quadratic behavior from it; it shows an almost
> linear build-time increase for both the base and the V3 patch, but the
> V3 patch generates a more accurate ALWAYS_EXECUTED result.  The function
> ratio in the perf tool also looks not bad.
> At least, it looks like an improvement compared with the base code.
> 
> A function with 10K bbs would take more than 8 minutes to build, so it
> seems duplicating bb_incoming_edges for much fewer than 1000*1000*1000
> bbs will not be a performance bottleneck?  (While most values are
> different, the copying will only really slow down when the
> bb_incoming_edges array is quite large, like 1000*1000*1000 entries; if
> so, that's a really crazy automatic program generator...)
> 
> Base:
> number of bbs (lim2+lim4)  total build time (sec)  tree loop invariant motion (usr/sys/wall/mem)
>  7850+15615                123.99                  13.52 ( 11%)  2.16 ( 52%)  15.68 ( 11%)   60M ( 21%)
> 15338+30591                208.98                  24.55 ( 12%)  4.04 ( 52%)  28.57 ( 12%)  121M ( 21%)
> 30314+60543                511.54                  48.96 ( 10%)  7.70 ( 52%)  56.90 ( 10%)  241M ( 21%)
> 
> V3 patch:
> number of bbs (lim2+lim4)  total build time (sec)  tree loop invariant motion (usr/sys/wall/mem)
>  7850+15615                121.18                  12.26 ( 10%)  1.99 ( 49%)  14.42 ( 11%)   60M ( 21%)
> 15338+30591                204.99                  24.72 ( 12%)  3.98 ( 52%)  28.85 ( 13%)  121M ( 21%)
> 30314+60543                500.76                  48.26 ( 10%)  8.27 ( 51%)  57.12 ( 11%)  241M ( 21%)
> 
> Samples: 455K of event 'cycles', Event count (approx.): 426772700334
> Overhead  Command  Shared Object  Symbol
>    7.19%  cc1      cc1            [.] bitmap_set_bit
>    5.86%  cc1      cc1            [.] get_ref_base_and_extent
>    4.08%  cc1      cc1            [.] get_continuation_for_phi
>    3.48%  cc1      cc1            [.] hash_table<..., xcallocator>::find
>    3.41%  cc1      cc1            [.] dominated_by_p
>    3.28%  cc1      cc1            [.] compute_transp
>    2.98%  cc1      cc1            [.] bitmap_set_range
>    2.60%  cc1      cc1            [.] bitmap_list_insert_element_after
>    2.44%  cc1      cc1            [.] stmt_may_clobber_ref_p_1
>    2.27%  cc1      cc1            [.] refs_may_alias_p_2
>    2.26%  cc1      cc1            [.] hash_table<..., xcallocator>::fi
>    2.26%  cc1      cc1            [.] df_rd_transfer_function
>    2.07%  cc1      cc1            [.] bitmap_ior_into
>    1.84%  cc1      cc1            [.] find_base_term
>    1.82%  cc1      cc1            [.] get_alias_set
>    1.79%  cc1      cc1            [.] tree_strip_nop_conversions
>    1.73%  cc1      cc1            [.] dfs_enumerate_from
>    1.73%  cc1      cc1            [.] reference_alias_ptr_type_1
>    1.53%  cc1      cc1            [.] alias_sets_conflict_p
>    1.30%  cc1      cc1            [.] c_get_alias_set
>    1.16%  cc1      cc1            [.] bitmap_and_into
>    1.12%  cc1      cc1            [.] indirect_ref_may_alias_decl_p
>    1.04%  cc1      cc1            [.] et_splay
>    0.99%  cc1      libc-2.27.so   [.] __memset_power8
>    0.95%  cc1      cc1            [.] bitmap_ior_and_compl
>    0.86%  cc1      cc1            [.] bitmap_copy
>    0.78%  cc1      cc1            [.] ao_ref_from_mem
> 
> > I wonder if we can more cleverly handle this by picking up
> > conditionally executed inner loops on the fly, for example
> > by recording an 'outermost unconditionally executed loop'
> > for each loop header we visit and using that like
> > 
> >   SET_ALWAYS_EXECUTED_IN (bb, outermost_always_exec_in (bb->loop_father))
> > 
> > So make this a single greedy CFG walk starting at the outermost
> > loops.  Loop exits will need to be processed specially then.
> 
> Not sure whether you mean we only check loop headers and exits?  As said
> above, the ALWAYS_EXECUTED block chains are not successive; there may be
> blocks that are ALWAYS_EXECUTED or that go out to outer loops in the
> middle of the loop.  But could they only form two chains?  One from
> loop->header, and another from loop->exit->src for loops with a single
> exit?

For setting ALWAYS_EXECUTED we rely on dominance.  The greedy walk should
determine whether there are paths in the CFG that skip execution of the
block.  I made the assumption that if there exist "uninterrupted" paths
covering all incoming edges of a block then the block is always executed.
And since we only trace such "always executed" blocks this holds
transitively.
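To make the "uninterrupted paths" invariant concrete, here is a minimal
standalone model of the counting walk.  Illustration only: the names are
made up, and the dominance check, back-edge filtering and loop-exit
handling of the real pass are all left out:

  /* Each block keeps a count of its not-yet-walked incoming forward
     edges; a block enters the worklist once that count drops to zero,
     i.e. once no path from the header can bypass it.  */
  #include <stdio.h>

  #define NBLOCKS 4

  /* Successors per block, -1 terminated: 0 branches to 1 and 2 (the
     if/else), which join in 3 - a diamond like the testcase body.  */
  static const int succs[NBLOCKS][3] = {
    { 1, 2, -1 }, { 3, -1, -1 }, { 3, -1, -1 }, { -1, -1, -1 }
  };

  static void
  walk (int header, int *incoming, const int *stops_walk)
  {
    int worklist[NBLOCKS];
    int n = 0;

    worklist[n++] = header;
    while (n > 0)
      {
        int b = worklist[--n];
        printf ("bb %d is always executed\n", b);

        /* A block with a call (or an exit edge) ends the greedy walk
           out of it; per the review above we keep draining the worklist
           ('continue') instead of giving up entirely ('break').  */
        if (stops_walk[b])
          continue;

        for (int i = 0; succs[b][i] != -1; i++)
          if (--incoming[succs[b][i]] == 0)
            worklist[n++] = succs[b][i];
      }
  }

  int
  main (void)
  {
    /* Incoming forward-edge counts for blocks 0..3.  */
    int incoming[NBLOCKS] = { 0, 1, 1, 2 };
    const int stops_walk[NBLOCKS] = { 0, 0, 0, 0 };

    walk (0, incoming, stops_walk);
    return 0;
  }

The join block 3 is only reached after both arms of the diamond were
walked, and a completed walk leaves the counters of all visited blocks
at zero - which is exactly why a later walk over the same blocks needs
them restored, as the gdb dumps earlier in the thread show.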
I'll see if I can find some time this week to work on this problem and
maybe also create a testcase for the existing issue with contains_call
and the dominator order of BB processing.

Richard.