From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <rguenther@suse.de>
Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.220.28])
 by sourceware.org (Postfix) with ESMTPS id E21F43857403
 for <gcc-patches@gcc.gnu.org>; Thu,  9 Sep 2021 10:55:13 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org E21F43857403
Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512)
 (No client certificate requested)
 by smtp-out1.suse.de (Postfix) with ESMTPS id BE97F22357;
 Thu,  9 Sep 2021 10:55:12 +0000 (UTC)
Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512)
 (No client certificate requested)
 by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id A493913B0C;
 Thu,  9 Sep 2021 10:55:12 +0000 (UTC)
Received: from dovecot-director2.suse.de ([192.168.254.65])
 by imap2.suse-dmz.suse.de with ESMTPSA id IRzUJhDoOWErRAAAMHmgww
 (envelope-from <rguenther@suse.de>); Thu, 09 Sep 2021 10:55:12 +0000
Date: Thu, 9 Sep 2021 12:55:12 +0200 (CEST)
From: Richard Biener <rguenther@suse.de>
To: Xionghu Luo <luoxhu@linux.ibm.com>
cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] tree-optimization/102155 - fix LIM fill_always_executed_in
 CFG walk
In-Reply-To: <ss794419-o9po-51n8-3n0-22ro37r7n360@fhfr.qr>
Message-ID: <8on1r65r-n4qr-9555-s3q6-75783p6622@fhfr.qr>
References: <47so1129-4rp-93pp-op88-4p649p42po80@fhfr.qr>
 <4a364fc3-7390-d3ca-5e71-6520061cf738@linux.ibm.com>
 <nycvar.YFH.7.76.2109020959030.11781@zhemvz.fhfr.qr>
 <nycvar.YFH.7.76.2109021016120.11781@zhemvz.fhfr.qr>
 <566e5d26-2c0e-8181-2249-211ebf369b73@linux.ibm.com>
 <nycvar.YFH.7.76.2109021226110.11781@zhemvz.fhfr.qr>
 <f0d27d79-56b2-d302-f1e5-33c2c4019657@linux.ibm.com>
 <ss794419-o9po-51n8-3n0-22ro37r7n360@fhfr.qr>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
X-Spam-Status: No, score=-11.6 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Thu, 09 Sep 2021 10:55:15 -0000

On Thu, 9 Sep 2021, Richard Biener wrote:

> On Thu, 9 Sep 2021, Xionghu Luo wrote:
> 
> > 
> > 
> > On 2021/9/2 18:37, Richard Biener wrote:
> > > On Thu, 2 Sep 2021, Xionghu Luo wrote:
> > > 
> > >>
> > >>
> > >> On 2021/9/2 16:50, Richard Biener wrote:
> > >>> On Thu, 2 Sep 2021, Richard Biener wrote:
> > >>>
> > >>>> On Thu, 2 Sep 2021, Xionghu Luo wrote:
> > >>>>
> > >>>>>
> > >>>>>
> > >>>>> On 2021/9/1 17:58, Richard Biener wrote:
> > >>>>>> This fixes the CFG walk order of fill_always_executed_in to use
> > >>>>>> RPO oder rather than the dominator based order computed by
> > >>>>>> get_loop_body_in_dom_order.  That fixes correctness issues with
> > >>>>>> unordered dominator children.
> > >>>>>>
> > >>>>>> The RPO order computed by rev_post_order_and_mark_dfs_back_seme in
> > >>>>>> its for-iteration mode is a good match for the algorithm.
> > >>>>>>
> > >>>>>> Xionghu, I've tried to only fix the CFG walk order issue and not
> > >>>>>> change anything else with this so we have a more correct base
> > >>>>>> to work against.  The code still walks inner loop bodies
> > >>>>>> up to loop depth times and thus is quadratic in the loop depth.
> > >>>>>>
> > >>>>>> Bootstrapped and tested on x86_64-unknown-linux-gnu, if you don't
> > >>>>>> have any comments I plan to push this and then revisit what we
> > >>>>>> were circling around.
> > >>>>>
> > >>>>> LGTM, thanks.
> > >>>>
> > >>>> I pushed it, thought again in the attempt to build a testcase and
> > >>>> concluded I was wrong with the appearant mishandling of
> > >>>> contains_call - get_loop_body_in_dom_order seems to be exactly
> > >>>> correct for this specific case.  So I reverted the commit again.
> > >>>
> > >>> And I figured what the
> > >>>
> > >>>                 /* In a loop that is always entered we may proceed anyway.
> > >>>                    But record that we entered it and stop once we leave
> > >>> it.
> > >>> */
> > >>>
> > >>> comment was about.  The code was present before the fix for PR78185
> > >>> and it was supposed to catch the case where the entered inner loop
> > >>> is not finite.  Just as the testcase from PR78185 shows the
> > >>> stopping was done too late when the exit block was already marked
> > >>> as to be always executed.  A simpler fix for PR78185 would have been
> > >>> to move
> > >>>
> > >>>             if (!flow_bb_inside_loop_p (inn_loop, bb))
> > >>>               break;
> > >>>
> > >>> before setting of last = bb.  In fact the installed fix was more
> > >>> pessimistic than that given it terminated already when entering
> > >>> a possibly infinite loop.  So we can improve that by doing
> > >>> sth like which should also improve the situation for some of
> > >>> the cases you were looking at?
> > >>>
> > >>> What remains is that we continue to stop when entering a
> > >>> not always executed loop:
> > >>>
> > >>>             if (bb->loop_father->header == bb)
> > >>>               {
> > >>>                 if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> > >>>                   break;
> > >>
> > >> Yes.  This will cause blocks after inner loop missed to be check
> > >> if they are actually ALWAYS_EXECUTED.   I am afraid O(N^2) is
> > >> inevitable here...
> > > 
> > > Yes.  What we can try is pre-computing whether a loop
> > > has a call or an inner loop that might not terminate and then
> > > when that's true for the loop to be entered continue to break;
> > > but when not, skip processing that loop blocks (but we still
> > > fill the blocks array, and we do need to do this in the order
> > > for the loop we're processing ...).
> > > 
> > > So what I was thinking was to somehow embed the dominator
> > > walk of get_loop_body_in_dom_order and instead of pre-recording
> > > the above info (call, infinite loop) for loops, pre-record
> > > it on the dominator tree so that we can ask "in any of our
> > > dominator children, is there a call or an infinite loop" and
> > > thus cut the dominator walk at loop header blocks that are
> > > not dominating the outer loop latch ...
> > > 
> > > Of course the simplistic solution might be to simply do
> > > 
> > >     if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)
> > >         && ((loop_depth (bb->loop_father) - loop_depth (loop))
> > >             > param_max_lim_loop_depth_lookahead)))
> > >       break;
> > > 
> > > and thus limit the processing of conditionally executed inner
> > > loops by relative depth ... as you say the actual processing
> > > is unlikely to be the bottleneck for the degenerate cases
> > > of a very deep nest of conditionally executed loops.
> > > 
> > > But still for this case get_loop_body_in_dom_order is
> > > doing quadratic processing so we can also say that
> > > another linear walk over the produced array does not
> > > increase complexity.> 
> > >>>
> > >>> volatile int flag, bar;
> > >>> double foo (double *valp)
> > >>> {
> > >>>     double sum = 0;
> > >>>     for (int i = 0; i < 256; ++i)
> > >>>       {
> > >>>         for (int j = 0; j < 256; ++j)
> > >>>           bar = flag;
> > >>>         if (flag)
> > >>>           sum += 1.;
> > >>>         sum += *valp;
> > >>>       }
> > >>>     return sum;
> > >>> }
> > >>
> > >> The patch still fails to handle cases like this:
> > >>
> > >>
> > >> struct X { int i; int j; int k;};
> > >> volatile int m;
> > >>
> > >> void bar (struct X *x, int n, int l, int k)
> > >> {
> > >>    for (int i = 0; i < l; i++)
> > >>      {
> > >>       if (k)
> > >>          for (int j = 0; j < l; j++)
> > >>            {
> > >>              if (m)
> > >>                x->i = m;
> > >>              else
> > >>                x->i = 1 - m;
> > >>
> > >>              int *r = &x->k;
> > >>              int tem2 = *r;
> > >>              x->k += tem2 * j;
> > >>          }
> > >>
> > >>        x->i = m;
> > >>      }
> > >> }
> > >>
> > >> x->i is still not marked ALWAYS_EXECUTED for outer loop.
> > > 
> > 
> > Collected data when build gcc stage1 and bootstrap.  There are still
> > about 9% bbs are missed to be marked with ALWAYS_EXECUTED.  Execution time
> > of fill_always_executed_in_1 is increased obvious in stage1 but it is in
> > mini-second level while bootstrap build doesn't show big time change.
> > 
> > base vs. incoming_edge_counter_patch:
> > 
> > 1. For stage1: 2214/2216 functions handled in both lim2 and lim4 with:
> >     -  bbs find:  base vs patched  "always executed":  17666  vs. 19278.
> >     - total cost time of fill_always_executed_in_1 (usec):  30574 vs. 51238.
> > 2. bootstrap: 33910/33951 functions handled in both lim2 and lim4 with:
> >     - bbs find:  base vs patched  "always executed":  201171 vs. 222942.
> >     - total cost time of fill_always_executed_in_1 (usec):  987054  vs.
> >     1061821.5
> > 
> > Do we still have plan to replace the get_loop_body_in_dom_order with
> > linear search or just "limit the processing of conditionally executed inner
> > loops by relative depth"? ;)
> 
> Now that I think I understand how it all works we're going to stay
> with the current dominator based approach.  As yet another
> incremental change I have bootstrapped and tested the patch below
> which cuts off some processing time from the current implementation
> making it O(n) as long as we are not walking conditionally
> executed subloops.
> 
> I don't see yet that the greedy linear walk patch is correct, so
> the option forward to reap all the remaining opportunities would
> be to either do limited processing of conditionally executed
> inner loops or process all of them but for example gated by
> flag_expensive_optimizations.  Another option is to ignore
> the complexity concern given there's other quadraticnesses
> in loop depth in LIM (but on other entities).

So I've created a testcase (PR102253) and indeed other issues pop
up and LIM is not even on the radar here with the patch below
which I've now bootstrapped and tested on x86_64-unknown-linux-gnu
and pushed to trunk.

Richard.

>From 013cfc648405a8a118d07436f103e4d70224fe00 Mon Sep 17 00:00:00 2001
From: Richard Biener <rguenther@suse.de>
Date: Thu, 9 Sep 2021 11:50:20 +0200
Subject: [PATCH] Improve LIM fill_always_executed_in computation
To: gcc-patches@gcc.gnu.org

Currently the DOM walk over a loop body does not walk into not
always executed subloops to avoid scalability issues since doing
so makes the walk quadratic in the loop depth.  It turns out this
is not an issue in practice and even with a loop depth of 1800
this function is way off the radar.

So the following patch removes the limitation, replacing it with
a comment.

2021-09-09  Richard Biener  <rguenther@suse.de>

	* tree-ssa-loop-im.c (fill_always_executed_in_1): Walk
	into all subloops.

	* gcc.dg/tree-ssa/ssa-lim-17.c: New testcase.
---
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c | 20 ++++++++++++++++++++
 gcc/tree-ssa-loop-im.c                     | 16 +++++++---------
 2 files changed, 27 insertions(+), 9 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c
new file mode 100644
index 00000000000..1c840e554b7
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+volatile int flag, bar;
+double foo (double *valp)
+{
+  double sum = 0;
+  for (int i = 0; i < 256; ++i)
+    {
+      if (flag)
+	for (int j = 0; j < 256; ++j)
+	  bar = flag;
+      if (flag)
+        sum += 1.;
+      sum += *valp; // we should move the load of *valp out of the loop
+    }
+  return sum;
+}
+
+/* { dg-final { scan-tree-dump-times "Moving statement" 1 "lim2" } } */
diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index 5d6845478e7..4b187c2cdaf 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -3074,15 +3074,13 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
 	    break;
 
 	  if (bb->loop_father->header == bb)
-	    {
-	      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
-		break;
-
-	      /* In a loop that is always entered we may proceed anyway.
-		 But record that we entered it and stop once we leave it
-		 since it might not be finite.  */
-	      inn_loop = bb->loop_father;
-	    }
+	    /* Record that we enter into a subloop since it might not
+	       be finite.  */
+	    /* ???  Entering into a not always executed subloop makes
+	       fill_always_executed_in quadratic in loop depth since
+	       we walk those loops N times.  This is not a problem
+	       in practice though, see PR102253 for a worst-case testcase.  */
+	    inn_loop = bb->loop_father;
 
 	  /* Walk the body of LOOP sorted by dominance relation.  Additionally,
 	     if a basic block S dominates the latch, then only blocks dominated
-- 
2.31.1