From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 8 Sep 2021 18:13:56 +0000 (UTC)
From: Michael Matz
To: Aldy Hernandez
cc: Richard Biener, Jeff Law, GCC Mailing List <gcc@gcc.gnu.org>,
    Andrew MacLeod
Subject: Re: More aggressive threading causing loop-interchange-9.c regression
List-Id: Gcc mailing list

Hello,

[lame answer to self]

On Wed, 8 Sep 2021, Michael Matz wrote:
> > > The forward threader guards against this by simply disallowing
> > > threadings that involve different loops.  As I see
> >
> > The thread in question (5->9->3) is all within the same outer loop,
> > though.  BTW, the backward threader also disallows threading across
> > different loops (see path_crosses_loops variable).

...

> Maybe it's possible to not disable threading over latches altogether in
> the backward threader (like it's tried now), but I haven't looked at the
> specific situation here in depth, so take my view only as opinion from a
> large distance :-)

I've now looked at the concrete situation.  So yeah, the whole path is in
the same loop, crosses the latch, _and there's code following the latch
on that path_.  (I.e. the latch isn't the last block in the path.)  In
particular, after loop_optimizer_init() (before any threading) we have:

   <bb 3> [local count: 118111600]:
   # j_19 = PHI <...>
   sum_11 = c[j_19];
   if (n_10(D) > 0)
     goto <bb 8>; [89.00%]
   else
     goto <bb 5>; [11.00%]

   <bb 8> [local count: 105119324]:
   ...

   <bb 5> [local count: 118111600]:
   # sum_21 = PHI <...>
   c[j_19] = sum_21;
   j_13 = j_19 + 1;
   if (n_10(D) > j_13)
     goto <bb 9>; [89.00%]
   else
     goto <bb ...>; [11.00%]

   <bb 9> [local count: 105119324]:
   goto <bb 3>; [100.00%]

With bb9 being the outer (empty) latch, bb3 the outer header, and bb8 the
pre-header of the inner loop -- but, more importantly, something that is
not at the start of the outer loop.

Now, any thread that includes the backedge 9->3 _including_ its
destination (i.e. where the backedge isn't the last to-be-redirected
edge) necessarily duplicates all code from that destination onto the back
edge.  Here it's the load from c[j] into sum_11.

The important part is that the code is emitted onto the back edge,
conceptually; in reality it's simply included into the (new) latch block
(the duplicate of bb9, which is bb12 intermediately, then named bb7 after
cfg_cleanup).

That's what we can't have for some of our structural loop optimizers:
there must be no code executed after the exit test (e.g. in the latch
block).
(This requirement makes it trivial to reason about which code is or isn't
executed completely for an iteration: simply everything in the body is
always executed.  Loop interchange, for example, uses this to check that
there are no memory references after the exit test, because those would
then be only conditional and hence make loop interchange very awkward.)

Note that this situation can't be rectified later: the duplicated
instructions (because they are memory refs) must remain after the exit
test.  Only by rerolling/unrotating the loop (i.e. noticing that the
memory refs on the loop-entry path and on the back edge are equivalent)
would that be possible, but that's something we aren't capable of.  Even
if we were, it would simply revert the whole work the threader did, so
it's better to not even do that to start with.

I believe something like the patch below would be appropriate: it
disables threading if the path contains a latch at a non-last position
(which, since the path is stored backwards, is a non-first position in
the array).  I.e. it disables rotating the loop if there's danger of
polluting the back edge.

It might be improved by allowing the case where the blocks following
(preceding!) the latch are themselves empty, because then no code is
duplicated.  It might also be improved by allowing the case where the
latch is already non-empty.  That code should probably only be active
before the loop optimizers, but currently the backward threader isn't
differentiating between before/after loop-optims.

I haven't tested this patch at all, except that it fixes the testcase :)

Ciao,
Michael.
diff --git a/gcc/tree-ssa-threadbackward.c b/gcc/tree-ssa-threadbackward.c
index 449232c7715..528a753b886 100644
--- a/gcc/tree-ssa-threadbackward.c
+++ b/gcc/tree-ssa-threadbackward.c
@@ -600,6 +600,7 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
   loop_p loop = m_path[0]->loop_father;
   bool path_crosses_loops = false;
   bool threaded_through_latch = false;
+  bool latch_within_path = false;
   bool multiway_branch_in_path = false;
   bool threaded_multiway_branch = false;
   bool contains_hot_bb = false;
@@ -725,7 +726,13 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
 	 the last entry in the array when determining if we thread
 	 through the loop latch.  */
       if (loop->latch == bb)
-	threaded_through_latch = true;
+	{
+	  threaded_through_latch = true;
+	  if (j != 0)
+	    latch_within_path = true;
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file, " (latch)");
+	}
     }

   gimple *stmt = get_gimple_control_stmt (m_path[0]);
@@ -845,6 +852,15 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
	     "a multiway branch.\n");
	  return false;
	}
+
+  if (latch_within_path)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file,
+		 " FAIL: FSM Thread through latch would create non-empty latch\n");
+      return false;
+
+    }
   return true;
 }