Date: Tue, 7 Sep 2021 14:45:34 +0000 (UTC)
From: Michael Matz
To: Aldy Hernandez
Cc: Jeff Law, Richard Biener, GCC Mailing List
Subject: Re: More aggressive threading causing loop-interchange-9.c regression

Hello,

On Tue, 7 Sep 2021, Aldy Hernandez via Gcc wrote:

> The regression comes from the simple_reduc_1() function in
> tree-ssa/loop-interchange-9.c, and it involves the following path:
>
> =========== BB 5 ============
> Imports: n_10(D)  j_19
> Exports: n_10(D)  j_13  j_19
>          j_13 : j_19(I)
> n_10(D)  int [1, 257]
> j_19     int [0, 256]
> Relational : (j_13 > j_19)
> <bb 5> [local count: 118111600]:
>     # sum_21 = PHI
>     c[j_19] = sum_21;
>     j_13 = j_19 + 1;
>     if (n_10(D) > j_13)
>       goto <bb 9>; [89.00%]
>     else
>       goto ; [11.00%]

So, this is the outer loop's exit block ...

> =========== BB 9 ============
> n_10(D)  int [2, 257]
> j_13     int [1, 256]
> Relational : (n_10(D) > j_19)
> Relational : (n_10(D) > j_13)
> <bb 9> [local count: 105119324]:
>     goto <bb 3>; [100.00%]

... this the outer loop's latch block ...

> =========== BB 3 ============
> Imports: n_10(D)
> Exports: n_10(D)
> n_10(D)  int [1, +INF]
> <bb 3> [local count: 118111600]:
>     # j_19 = PHI
>     sum_11 = c[j_19];
>     if (n_10(D) > 0)
>       goto <bb 8>; [89.00%]
>     else
>       goto ; [11.00%]

... and this the outer loop's header, as well as the inner loop's
pre-header.  The attempted thread hence involves a back edge (of the
outer loop) and a loop-entry edge (bb3->bb8) of the inner loop.  Doing
that threading would destroy some properties that our loop optimizers
rely on, e.g. that the loop header of a natural loop is reached by only
two edges (the entry edge and the back edge), that latch blocks are
empty, and that there's only a single latch.  (What exactly we require
depends on several flags in loops_state_satisfies_p.)
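(For reference, reconstructed from the GIMPLE above, the loop nest in
simple_reduc_1 has roughly this shape; array sizes and exact details
are my assumptions, not a quote of the testcase:)

  #define N 256
  int a[N][N], c[N];

  void
  simple_reduc_1 (int n)
  {
    for (int j = 0; j < n; j++)      /* outer loop; its header is BB 3 */
      {
	int sum = c[j];              /* sum_11 = c[j_19] in BB 3 */
	for (int i = 0; i < n; i++)  /* inner loop, entered via bb3->bb8 */
	  sum = sum + a[i][j];
	c[j] = sum;                  /* BB 5: c[j_19] = sum_21 */
      }
  }

Interchange wants to swap the i and j loops so that a[i][j] is walked
along rows instead of down columns, and for that it needs to see a
perfect nest of natural loops.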
> With knowledge from previous passes (SSA_NAME_RANGE_INFO), we know that
> the loop cannot legally iterate outside the size of c[256].  So, j_13
> lies within [1, 257] and n_10 is [2, +INF] at the end of the path.
> This allows us to thread BB3 to BB8.

So, IIUC doing this threading would create a new entry into BB8: it
would then be entered by its natural entry (bb3), by its natural back
edge (whatever bb that is now), and by the new entry from the threaded
outer back edge (bb9 or bb5?).  The inner loop then wouldn't be
recognized as natural anymore, the whole nest not as perfect, and hence
loop interchange couldn't easily happen anymore.  Most of our other
structural loop optimizers would have problems with that situation as
well.

> All the blocks lie within the same loop, and the path passes all the
> tests in path_profitable_p().
>
> Is there something about the shape of this path that should make it
> invalid in the backward threader, or should the test be marked with
> -fdisable-tree-threadN (or something else entirely)?

This is a functionality test checking that the interchange this testcase
requires does happen with the default options plus -floop-interchange
(with the intention of that flag being enabled by -O3 or with profile
info).  So no additional flags can be added without changing the meaning
of this test.

> Note that the backward threader is allowed to peel through loop
> headers.

Something needs to give way in the path threaders before loop
optimizations: either threading through back edges, through loop
latches, or through loop headers needs to be disabled.  I think
traditionally at least threading through latches should be disabled,
because doing so usually destroys simple loop structure.  I see that
profitable_path_p() of the backward threader wants to be careful in
some situations involving loops and latches; possibly it's not careful
enough yet for the additional power brought by ranger.

See also the additional checks that
tree-cfgcleanup.c:tree_forwarder_block_p does when loops are active.

After the structural loop optimizers have run, the threader can go wild
and thread everything it finds.

Ciao,
Michael.
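P.S.: To illustrate the kind of restriction I mean, here's a rough
sketch against GCC's internal loop representation.  The helper name,
its placement, and the exact conditions are made up; it's an idea, not
the actual threader code:

  /* Hypothetical helper (name and checks are assumptions): reject a
     candidate threading path that goes through a loop latch or that
     would create a new entry into a loop header.  */
  static bool
  path_crosses_loop_boundary_p (const vec<basic_block> &path)
  {
    for (unsigned i = 1; i < path.length (); ++i)
      {
	basic_block prev = path[i - 1];
	basic_block bb = path[i];

	/* Threading through a latch destroys the empty, single-latch
	   property that passes checking loops_state_satisfies_p
	   depend on.  */
	if (prev == prev->loop_father->latch)
	  return true;

	/* A threaded edge into a header coming from outside its loop
	   would be a third entry besides the natural entry edge and
	   the back edge.  */
	if (bb == bb->loop_father->header
	    && prev->loop_father != bb->loop_father)
	  return true;
      }
    return false;
  }

Something like profitable_path_p() could then reject such paths
outright while the loop pipeline still needs the structure, and skip
the check after the structural loop optimizers have run.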