Date: Tue, 31 Aug 2021 13:29:12 +0200 (CEST)
From: Richard Biener
To: Xionghu Luo
Cc: gcc-patches@gcc.gnu.org, segher@kernel.crashing.org, dje.gcc@gmail.com,
    wschmidt@linux.ibm.com, guojiufu@linux.ibm.com, linkw@gcc.gnu.org
Subject: Re: [PATCH v3] Fix incomplete computation in fill_always_executed_in_1

On Tue, 31 Aug 2021, Xionghu Luo wrote:

> 
> 
> On 2021/8/30 17:19, Richard Biener wrote:
> >>>>       bitmap_set_bit (work_set, loop->header->index);
> >>>> +     unsigned bb_index;
> >>>> -     for (i = 0; i < loop->num_nodes; i++)
> >>>> -       {
> >>>> -         edge_iterator ei;
> >>>> -         bb = bbs[i];
> >>>> +     unsigned array_size = last_basic_block_for_fn (cfun) + 1;
> >>>> +     int *bbd = XNEWVEC (int, array_size);
> >>>> +     bbd = XDUPVEC (int, bbi, array_size);
> >>> I don't think you need to copy 'bbi' but you can re-use the
> >>> state from the outer loop processing.  Did you run into any
> >>> issues with that?
> >> Yes.  For example, after adding a small if-else block to ssa-lim-19.c,
> >> the block "x->j += tem * i;" of bb 6 is always executed for loop 2.
> >> When fill_always_executed_in_1 is called for loop 1, bbi[6] is
> >> decreased from 2 to 1 to 0; if fill_always_executed_in_1 is then
> >> called again for loop 2, its value is not reset, so bbi[6] won't be
> >> set ALWAYS_EXECUTED, which is wrong.
> >> 
> >> 
> >> struct X { int i; int j; int k;};
> >> 
> >> void foo(struct X *x, int n, int l, int m)
> >> {
> >>   for (int j = 0; j < l; j++)   // loop 1
> >>     {
> >>       for (int i = 0; i < n; ++i)   // loop 2
> >>     {
> >>       if (m)
> >>         x->j++;
> >>       else
> >>         x->j = m+n+l;
> >> 
> >>       int *p = &x->j;   // bb 6
> >>       int tem = *p;
> >>       x->j += tem * i;
> >>     }
> >>       int *r = &x->k;
> >>       int tem2 = *r;
> >>       x->k += tem2 * j;
> >>     }
> >> }
> > Hmm, but if the outer loop processing reaches bb 6 then
> > it should have set it ALWAYS_EXECUTED in loop 1 already?
> 
> But bb 6 is NOT ALWAYS_EXECUTED for loop 1; it is only ALWAYS_EXECUTED
> for loop 2, as it requires n > 0.
> Please refer to the attached file ssa-lim-19.c.138t.lim2.
> 
> ;;
> ;; Loop 1
> ;;  header 8, latch 12
> ;;  depth 1, outer 0
> ;;  nodes: 8 12 7 6 4 5 3 13 11
> ;;
> ;; Loop 2
> ;;  header 3, latch 13
> ;;  depth 2, outer 1
> ;;  nodes: 3 13 6 4 5
> ;; 2 succs { 10 9 }
> ;; 10 succs { 8 }
> ;; 11 succs { 3 }
> ;; 3 succs { 4 5 }
> ;; 4 succs { 6 }
> ;; 5 succs { 6 }
> ;; 6 succs { 13 7 }
> ;; 13 succs { 3 }
> ;; 7 succs { 12 9 }
> ;; 12 succs { 8 }
> ;; 8 succs { 11 7 }
> ;; 9 succs { 1 }
> 
> always executed: bb->index: 8, loop->num: 1
> always executed: bb->index: 7, loop->num: 1
> always executed: bb->index: 3, loop->num: 2
> always executed: bb->index: 6, loop->num: 2
> 
>          8<---------------
>         / \              |
>       11   \             |
>       /     \            |
>      3<---   \           |
>     / \   |   \          |
>    4   5  |    \         |
>     \ /   |     \        |
>      6    |      \       |
>      |-->13       \      |
>      |----------> 7      |
>                  / \     |
>                 9   12---
> 
> (gdb) x /15x bbd
> 0x1354c9b0:  0x00000000  0x00000000  0x00000001  0x00000001
> 0x1354c9c0:  0x00000001  0x00000001  0x00000002  0x00000002
> 0x1354c9d0:  0x00000001  0x00000002  0x00000001  0x00000001
> 0x1354c9e0:  0x00000001  0x00000001  0x00000000
> 
> For loop 1, our algorithm will walk through 8->11->3->4->5->6->7 and
> exit at edge 7->9.
> 
> (gdb) x /15x bbd
> 0x1354c9b0:  0x00000000  0x00000000  0x00000001  0x00000000
> 0x1354c9c0:  0x00000000  0x00000000  0x00000000  0x00000000
> 0x1354c9d0:  0x00000001  0x00000002  0x00000001  0x00000000
> 0x1354c9e0:  0x00000001  0x00000000  0x00000000
> 
> If we don't reset bbd to the incoming-edge counts with memcpy, bbd[3],
> bbd[4], bbd[5] and bbd[6] are now 0 for loop 2, so
> fill_always_executed_in_1 couldn't set ALWAYS_EXECUTED correctly for
> loop 2 at bb 3 and bb 6.
> 
> > 
> >> 
> >>> 
> >>>> +     while (!bitmap_empty_p (work_set))
> >>>> +       {
> >>>> +         bb_index = bitmap_first_set_bit (work_set);
> >>>> +         bitmap_clear_bit (work_set, bb_index);
> >>>> +         bb = BASIC_BLOCK_FOR_FN (cfun, bb_index);
> >>>>          if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> >>>> -          last = bb;
> >>>> -
> >>>> +          SET_ALWAYS_EXECUTED_IN (bb, loop);
> >>>>          if (bitmap_bit_p (contains_call, bb->index))
> >>>>            break;
> >>> I think you want to continue; here (process remaining worklist
> >>> but not continue greedy walking this block)
> >> Same as above: if we use 'continue' instead of 'break', the algorithm
> >> again doesn't work.  If the inner loop contains a jump to the
> >> outermost loop, the blocks after the jump block will be set
> >> ALWAYS_EXECUTED incorrectly.
> >> 
> >>>> -
> >>>> +         edge_iterator ei;
> >>>>          FOR_EACH_EDGE (e, ei, bb->succs)
> >>>>            {
> >>>> -             /* If there is an exit from this BB.  */
> >>>>              if (!flow_bb_inside_loop_p (loop, e->dest))
> >>>>                break;
> >>> in particular this should keep the outer 'bbi' valid to re-use.
> >>> 
> >>> But again, you want 'continue;' the greedy walk to other edges.
> >>> If that's not valid (I'd need to think about this) then with
> >>> your patch whether we process an edge depends on the order
> >>> of the edge visit so you'd have to walk successors twice,
> >>> once to determine whether we can greedily walk any of it
> >>> and once to actually do the greedy walk.
> >>> 
> >>> So thinking about it, an exit edge is like a non-returning call
> >>> and thus we indeed should not process any outgoing edges of
> >>> this block.
> >>> 
> >>>> +
> >>>>              /* Or we enter a possibly non-finite loop.  */
> >>>>              if (flow_loop_nested_p (bb->loop_father,
> >>>>                                      e->dest->loop_father)
> >>>>                  && ! finite_loop_p (e->dest->loop_father))
> >>>>                break;
> >>> I think this is no longer necessary?  In any case it would
> >>> again be 'continue;', see also above.
> >>> 
> >>> I'm missing skipping of backedges in the walk.
> >>> 
> >>>> +
> >>>> +             if (bbd[e->dest->index] == 1)
> >>>> +               {
> >>>> +                 bitmap_set_bit (work_set, e->dest->index);
> >>>> +                 bbd[e->dest->index]--;
> >>>> +               }
> >>>> +             else if (bbd[e->dest->index] > 1)
> >>>> +               bbd[e->dest->index]--;
> >>> maybe just
> >>> 
> >>>   bbd[e->dest->index]--;
> >>>   if (bbd[e->dest->index] == 0)
> >>>     bitmap_set_bit (work_set, e->dest->index);
> >> They are not equivalent.  The former won't decrease if
> >> bbd[e->dest->index] is 0, but the latter does.
> > we should never be visiting in this case, so sth indeed doesn't
> > work as I expected.
> 
> I've verified this won't happen if we duplicate the bb_incoming_edges
> in fill_always_executed_in_1, so we could replace it.
> 
> > 
> >>>>            }
> >>>> +
> >>>>          if (e)
> >>>>            break;
> >>> continue; processing the worklist
> >>> 
> >>>> 
> >>>> @@ -3232,34 +3245,12 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
> >>>>          to disprove this if possible).  */
> >>>>       if (bb->flags & BB_IRREDUCIBLE_LOOP)
> >>>>         break;
> >>> continue; processing the worklist.  I think this has to
> >>> come early, before processing successors, next to
> >>> the contains_call processing.  We could think of ignoring
> >>> entering of irreducible regions by tracking exits from them,
> >>> but I'm not sure we're identifying the regions in a good
> >>> enough way to allow this.
> >>> 
> >>> Likewise possibly infinite inner loops can be processed
> >>> when we track exits from those which would be sth like
> >>> 
> >>>   /* When this is an exit from an inner loop that we cannot
> >>>      prove as finite do not follow this edge.  */
> >>>   if (bb->loop_father != loop
> >>>       && loop_exit_edge_p (bb->loop_father, e)
> >>>       && ! finite_loop_p (bb->loop_father))
> >>>     continue;
> >>> 
> >>>> -
> >>>> -     if (!flow_bb_inside_loop_p (inn_loop, bb))
> >>>> -       break;
> >>>> -
> >>>> -     if (bb->loop_father->header == bb)
> >>>> -       {
> >>>> -         if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> >>>> -           break;
> >>>> -
> >>>> -         /* In a loop that is always entered we may proceed anyway.
> >>>> -            But record that we entered it and stop once we leave it.  */
> >>>> -         inn_loop = bb->loop_father;
> >>>> -       }
> >>>> -   }
> >>>> -
> >>>> -      while (1)
> >>>> -   {
> >>>> -     SET_ALWAYS_EXECUTED_IN (last, loop);
> >>>> -     if (last == loop->header)
> >>>> -       break;
> >>>> -     last = get_immediate_dominator (CDI_DOMINATORS, last);
> >>>>    }
> >>>> -
> >>>> -      free (bbs);
> >>>> +      free (bbd);
> >>>>     }
> >>>> 
> >>>>   for (loop = loop->inner; loop; loop = loop->next)
> >>>> -    fill_always_executed_in_1 (loop, contains_call);
> >>>> +    fill_always_executed_in_1 (loop, contains_call, bbi);
> >>>> }
> >>>> 
> >>>> /* Fills ALWAYS_EXECUTED_IN information for basic blocks, i.e.
> >>>> @@ -3287,8 +3278,36 @@ fill_always_executed_in (void)
> >>>>      bitmap_set_bit (contains_call, bb->index);
> >>>>     }
> >>>> +  mark_dfs_back_edges ();
> >>> Please put this to loop_invariant_motion_in_fun right before
> >>> the call to fill_always_executed_in.
> >> Done.
> >> 
> >>>> +  unsigned array_size = last_basic_block_for_fn (cfun) + 1;
> >>>> +  int *bb_incoming_edges = XNEWVEC (int, array_size);
> >>> There's XCNEWVEC that also zeros the array.
> >> OK, thanks.
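As a side note, the difference the review points at is just allocation
vs. zero-initialized allocation.  A minimal sketch with the libiberty
macros, for illustration only; 'example' and its 'array_size' parameter
are made-up stand-ins, not patch code:

  /* Compiled as part of a GCC-style tree, where libiberty.h provides
     the XNEWVEC/XCNEWVEC/XDELETEVEC allocation macros.  */
  #include <string.h>
  #include "libiberty.h"

  void
  example (unsigned array_size)
  {
    /* What the patch does: xmalloc-based, then cleared by hand.  */
    int *a = XNEWVEC (int, array_size);
    memset (a, 0, array_size * sizeof (int));

    /* The suggestion: XCNEWVEC is xcalloc-based, so already zeroed.  */
    int *b = XCNEWVEC (int, array_size);

    XDELETEVEC (a);
    XDELETEVEC (b);
  }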
> >> 
> >>>> +  memset (bb_incoming_edges, 0, array_size * sizeof (int));
> >>>> +  edge_iterator ei;
> >>>> +  edge e;
> >>>> +
> >>>> +  FOR_EACH_BB_FN (bb, cfun)
> >>>>     {
> >>>> +      int len = bb->preds->length ();
> >>>> +      FOR_EACH_EDGE (e, ei, bb->preds)
> >>>> +        if (e->flags & EDGE_DFS_BACK)
> >>>> +          len--;
> >>>> +      bb_incoming_edges[bb->index] = len;
> >>>>     }
> >>> it would be nice to combine this with the preceding loop over
> >>> all blocks.
> >> Done.
> >> 
> >>>> +  static unsigned long diff = 0;
> >>>> +  struct timeval tv;
> >>>> +  gettimeofday (&tv, NULL);
> >>>> +  unsigned long begin = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec;
> >>>> +
> >>>> +  //for (loop : loops_list (cfun, 0))
> >>>>   for (loop = current_loops->tree_root->inner; loop; loop = loop->next)
> >>>> -    fill_always_executed_in_1 (loop, contains_call);
> >>>> +    fill_always_executed_in_1 (loop, contains_call, bb_incoming_edges);
> >>>> +
> >>>> +  gettimeofday (&tv, NULL);
> >>>> +  unsigned long end = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec;
> >>>> +  diff += end - begin;
> >>>> +  //printf ("%s,diff:%ld\n", current_function_name (), diff);
> >>>> +  free (bb_incoming_edges);
> >>> Eh, looks very much like work in progress.  From the O() complexity
> >>> point duplicating bb_incoming_edges for each loop makes it quadratic
> >>> again but as said I think we don't need this duplication.
> >> Sorry, I didn't quite follow your above comments about reusing
> >> bb_incoming_edges.  We are searching the bbs that dominate loop->latch
> >> loop by loop independently; if the outer loop changes the inner loop's
> >> bb_incoming_edges, won't it make the inner loop's ALWAYS_EXECUTED
> >> computation incorrect?
> > We're only supposed to greedily walk into always executed blocks.  I see
> > this probably breaks down for conditionally executed inner loops like
> > 
> >    for ()
> >      {
> >         if ()
> >           for ();
> >         A;
> >      }
> > 
> > where we want to process the inner loop separately but we still want
> > to make sure A is marked as always executed in the outer loop.
> 
> Yes, the problem is that ALWAYS_EXECUTED blocks are not successive in the
> CFG, so we have to check every bb and maintain bb_incoming_edges in each
> loop.  If any block goes to an outer loop, we have to 'break' instead of
> 'continue', as the following bbs are not ALWAYS_EXECUTED anymore.

But we're using a worklist (since the CFG branches) - if we don't
continue to pick items from the worklist then the outcome depends on the
order of the worklist processing.  That can't be correct either (similar
when walking successor edges).

> >> Seems duplicating bb_incoming_edges does not increase the complexity,
> >> but the recursive call does?
> > Since the bb_incoming_edges array has a size of N and we copy it
> > number_of_loop times (which is O(N) bound) the copying effectively
> > has O(N^2) cost.  So yes, it's the iteration over all loops,
> > whether by recursion or iteration.
> 
> Actually, the copying is outside of the greedy CFG walk loop.  I guess
> there will be some optimization of the bb_incoming_edges memcpy by the
> library?  But maybe that only works when most of the values are the
> same, as with copy_by_pieces?
> 
> And I modified the time-consuming case with many bbs to simulate extreme
> usage and didn't observe quadratic behavior from it; it shows an almost
> linear build-time increase for both the base and the V3 patch, but the
> V3 patch generates a more accurate ALWAYS_EXECUTED result.  The function
> ratio in the perf tool also looks not bad.
> At least, it looks like an improvement compared with the base code.
> 
> A function with 10K bbs would take more than 8 minutes to build, so it
> seems duplicating bb_incoming_edges for much fewer than 1000*1000*1000
> bbs will not be a performance bottleneck?  (While most values are
> different, the copying will only really slow down when the
> bb_incoming_edges array is quite large, like 1000*1000*1000 entries; if
> so, that's a really crazy automatic program generator...)
> 
> Base:
> number of bbs (lim2+lim4)  total build time (sec)  tree loop invariant motion (usr/sys/wall/mem)
>  7850+15615                123.99                  13.52 ( 11%)  2.16 ( 52%)  15.68 ( 11%)   60M ( 21%)
> 15338+30591                208.98                  24.55 ( 12%)  4.04 ( 52%)  28.57 ( 12%)  121M ( 21%)
> 30314+60543                511.54                  48.96 ( 10%)  7.70 ( 52%)  56.90 ( 10%)  241M ( 21%)
> 
> V3 patch:
> number of bbs (lim2+lim4)  total build time (sec)  tree loop invariant motion (usr/sys/wall/mem)
>  7850+15615                121.18                  12.26 ( 10%)  1.99 ( 49%)  14.42 ( 11%)   60M ( 21%)
> 15338+30591                204.99                  24.72 ( 12%)  3.98 ( 52%)  28.85 ( 13%)  121M ( 21%)
> 30314+60543                500.76                  48.26 ( 10%)  8.27 ( 51%)  57.12 ( 11%)  241M ( 21%)
> 
> Samples: 455K of event 'cycles', Event count (approx.): 426772700334
> Overhead  Command  Shared Object  Symbol
>    7.19%  cc1      cc1            [.] bitmap_set_bit
>    5.86%  cc1      cc1            [.] get_ref_base_and_extent
>    4.08%  cc1      cc1            [.] get_continuation_for_phi
>    3.48%  cc1      cc1            [.] hash_table<..., xcallocator>::find
>    3.41%  cc1      cc1            [.] dominated_by_p
>    3.28%  cc1      cc1            [.] compute_transp
>    2.98%  cc1      cc1            [.] bitmap_set_range
>    2.60%  cc1      cc1            [.] bitmap_list_insert_element_after
>    2.44%  cc1      cc1            [.] stmt_may_clobber_ref_p_1
>    2.27%  cc1      cc1            [.] refs_may_alias_p_2
>    2.26%  cc1      cc1            [.] hash_table<..., xcallocator>::fi
>    2.26%  cc1      cc1            [.] df_rd_transfer_function
>    2.07%  cc1      cc1            [.] bitmap_ior_into
>    1.84%  cc1      cc1            [.] find_base_term
>    1.82%  cc1      cc1            [.] get_alias_set
>    1.79%  cc1      cc1            [.] tree_strip_nop_conversions
>    1.73%  cc1      cc1            [.] dfs_enumerate_from
>    1.73%  cc1      cc1            [.] reference_alias_ptr_type_1
>    1.53%  cc1      cc1            [.] alias_sets_conflict_p
>    1.30%  cc1      cc1            [.] c_get_alias_set
>    1.16%  cc1      cc1            [.] bitmap_and_into
>    1.12%  cc1      cc1            [.] indirect_ref_may_alias_decl_p
>    1.04%  cc1      cc1            [.] et_splay
>    0.99%  cc1      libc-2.27.so   [.] __memset_power8
>    0.95%  cc1      cc1            [.] bitmap_ior_and_compl
>    0.86%  cc1      cc1            [.] bitmap_copy
>    0.78%  cc1      cc1            [.] ao_ref_from_mem
> 
> > I wonder if we can more cleverly handle this by picking up
> > conditionally executed inner loops on the fly, for example
> > by recording an 'outermost unconditionally executed loop'
> > for each loop header we visit and using that like
> > 
> >   SET_ALWAYS_EXECUTED_IN (bb, outermost_always_exec_in (bb->loop_father))
> > 
> > So make this a single greedy CFG walk starting at the outermost
> > loops.  Loop exits will need to be processed specially then.
> 
> Not sure whether you mean we only check loop headers and exits?  As said
> above, the ALWAYS_EXECUTED block chains are not successive; there may be
> blocks that are ALWAYS_EXECUTED or that go out to outer loops in the
> middle of the loop.  But could they only form two chains?  One from
> loop->header, and another from loop->exit->src for loops with a single
> exit?

For setting ALWAYS_EXECUTED we rely on dominance.  The greedy walk should
determine whether there are paths in the CFG that skip execution of the
block.  I made the assumption that if there exist "uninterrupted" paths
covering all incoming edges of a block then the block is always executed.
And since we only trace such "always executed" blocks this holds
transitively.
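To make the "uninterrupted paths" invariant concrete, here is a minimal
standalone model of the counting walk.  Illustration only: the names are
made up, and the dominance check, back-edge filtering and loop-exit
handling of the real pass are all left out:

  /* Each block keeps a count of its not-yet-walked incoming forward
     edges; a block enters the worklist once that count drops to zero,
     i.e. once no path from the header can bypass it.  */
  #include <stdio.h>

  #define NBLOCKS 4

  /* Successors per block, -1 terminated: 0 branches to 1 and 2 (the
     if/else), which join in 3 - a diamond like the testcase body.  */
  static const int succs[NBLOCKS][3] = {
    { 1, 2, -1 }, { 3, -1, -1 }, { 3, -1, -1 }, { -1, -1, -1 }
  };

  static void
  walk (int header, int *incoming, const int *stops_walk)
  {
    int worklist[NBLOCKS];
    int n = 0;

    worklist[n++] = header;
    while (n > 0)
      {
        int b = worklist[--n];
        printf ("bb %d is always executed\n", b);

        /* A block with a call (or an exit edge) ends the greedy walk
           out of it; per the review above we keep draining the worklist
           ('continue') instead of giving up entirely ('break').  */
        if (stops_walk[b])
          continue;

        for (int i = 0; succs[b][i] != -1; i++)
          if (--incoming[succs[b][i]] == 0)
            worklist[n++] = succs[b][i];
      }
  }

  int
  main (void)
  {
    /* Incoming forward-edge counts for blocks 0..3.  */
    int incoming[NBLOCKS] = { 0, 1, 1, 2 };
    const int stops_walk[NBLOCKS] = { 0, 0, 0, 0 };

    walk (0, incoming, stops_walk);
    return 0;
  }

The join block 3 is only reached after both arms of the diamond were
walked, and a completed walk leaves the counters of all visited blocks
at zero - which is exactly why a later walk over the same blocks needs
them restored, as the gdb dumps earlier in the thread show.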
I'll see if I can find some time this week to work on this problem and
maybe also create a testcase for the existing issue with contains_call
and the dominator order of BB processing.

Richard.