Subject: Re: More aggressive threading causing loop-interchange-9.c regression
To: Michael Matz
Cc: Richard Biener, Jeff Law, GCC Mailing List, Andrew MacLeod
From: Aldy Hernandez
Message-ID: <56bd6a6c-0416-7123-c792-521495d69654@redhat.com>
Date: Thu, 9 Sep 2021 10:14:45 +0200

On 9/8/21 8:13 PM, Michael Matz wrote:
> Hello,
>
> [lame answer to self]
>
> On Wed, 8 Sep 2021, Michael Matz wrote:
>
>>>> The forward threader guards against this by simply disallowing
>>>> threadings that involve different loops.
>>>> As I see
>>>
>>> The thread in question (5->9->3) is all within the same outer loop,
>>> though.  BTW, the backward threader also disallows threading across
>>> different loops (see the path_crosses_loops variable).
> ...
>> Maybe it's possible to not disable threading over latches altogether in
>> the backward threader (like it's tried now), but I haven't looked at the
>> specific situation here in depth, so take my view only as an opinion from
>> a large distance :-)
>
> I've now looked at the concrete situation.  So yeah, the whole path is in
> the same loop, crosses the latch, _and there's code following the latch
> on that path_.  (I.e. the latch isn't the last block in the path.)  In
> particular, after loop_optimizer_init() (before any threading) we have:
>
>    [local count: 118111600]:
>   # j_19 = PHI
>   sum_11 = c[j_19];
>   if (n_10(D) > 0)
>     goto ; [89.00%]
>   else
>     goto ; [11.00%]
>
>    [local count: 105119324]:
>   ...
>
>    [local count: 118111600]:
>   # sum_21 = PHI
>   c[j_19] = sum_21;
>   j_13 = j_19 + 1;
>   if (n_10(D) > j_13)
>     goto ; [89.00%]
>   else
>     goto ; [11.00%]
>
>    [local count: 105119324]:
>   goto ; [100.00%]
>
> With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the
> pre-header of the inner loop, but more importantly something that's not
> at the start of the outer loop.
>
> Now, any thread that includes the backedge 9->3 _including_ its
> destination (i.e. where the backedge isn't the last to-be-redirected
> edge) necessarily duplicates all code from that destination onto the
> back edge.  Here it's the load from c[j] into sum_11.
>
> The important part is that the code is emitted onto the back edge,
> conceptually; in reality it's simply included in the (new) latch block
> (the duplicate of bb9, which is bb12 intermediately, then named bb7
> after cfg_cleanup).
>
> That's what we can't have for some of our structural loop optimizers:
> there must be no code executed after the exit test (e.g. in the latch
> block).  (This requirement makes reasoning about which code is or isn't
> executed completely for an iteration trivial: simply everything in the
> body is always executed.  E.g. loop interchange uses this to check that
> there are no memory references after the exit test, because those would
> then be only conditional and hence make loop interchange very awkward.)
>
> Note that this situation can't be rectified later: the duplicated
> instructions (because they are memory refs) must remain after the exit
> test.  Only by rerolling/unrotating the loop (i.e. noticing that the
> memory refs on the loop-entry path and on the back edge are equivalent)
> would that be possible, but that's something we aren't capable of.  Even
> if we were, it would simply revert the whole work that the threader did,
> so it's better not to do it to start with.
>
> I believe something like below would be appropriate: it disables
> threading if the path contains a latch at a non-last position (which,
> because the path is stored backwards, means a non-first position in the
> array).  I.e. it disables rotating the loop if there's danger of
> polluting the back edge.  It might be improved if the blocks following
> (preceding!) the latch are themselves empty, because then no code is
> duplicated.  It might also be improved if the latch is already
> non-empty.  That code should probably only be active before the loop
> optimizers, but currently the backward threader isn't differentiating
> between before/after the loop optimizers.
>
> I haven't tested this patch at all, except that it fixes the testcase :)

Thanks for looking at this.  I think you're onto something with this
approach, perhaps in addition to the loop header threading Richard
mentions.
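Just to make sure I follow the structural argument, here is the
source-level picture as I understand it.  This is a hand-written sketch
for illustration only, not the actual loop-interchange-9.c: the array
names, the inner-loop body, and the assumption that n > 0 are all mine.

  /* Before threading: the outer loop's header loads c[j] before the
     exit test, and the latch is empty.  */
  #define N 16
  static int b[N][N], c[N];

  void
  before_threading (int n)
  {
    for (int j = 0; j < n; j++)
      {
        int sum = c[j];          /* load in the header (bb3 above) */
        for (int i = 0; i < n; i++)
          sum += b[i][j];        /* the inner loop */
        c[j] = sum;              /* store, exit test, then empty latch */
      }
  }

  /* After threading 5->9->3 the loop is effectively rotated: the load
     for the next iteration is duplicated onto the back edge, i.e. it
     executes after the exit test of the current iteration, in the new
     (no longer empty) latch.  Assumes n > 0.  */
  void
  after_threading (int n)
  {
    int j = 0;
    int sum = c[0];
    for (;;)
      {
        for (int i = 0; i < n; i++)
          sum += b[i][j];
        c[j] = sum;
        j++;
        if (j >= n)
          break;
        sum = c[j];              /* memory reference after the exit test */
      }
  }

If that reading is right, it's clear why loop interchange has to give
up: the load is now only conditionally executed within an iteration.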
Your patch causes some regressions, but I think most of them are noise
from FSM tests that need to be adjusted.  However, there are some others
that are curious:

> FAIL: gcc.dg/tree-ssa/ldist-22.c scan-tree-dump ldist "generated memset zero"
> FAIL: gcc.dg/tree-ssa/pr66752-3.c scan-tree-dump-not dce2 "if .flag"
< XFAIL: gcc.dg/shrink-wrap-loop.c scan-rtl-dump pro_and_epilogue "Performing shrink-wrapping"
< XFAIL: gcc.dg/Warray-bounds-87.c pr101671 (test for bogus messages, line 36)
> FAIL: libgomp.graphite/force-parallel-4.c scan-tree-dump-times graphite "1 loops carried no dependency" 1
> FAIL: libgomp.graphite/force-parallel-4.c scan-tree-dump-times optimized "loopfn.1" 4
> FAIL: libgomp.graphite/force-parallel-8.c scan-tree-dump-times graphite "5 loops carried no dependency" 1

Interestingly, your patch fixes shrink-wrap-loop.c and Warray-bounds-87,
both of which started failing with the backward threader rewrite.  At
least the Warray-bounds failure was caused by the threader peeling off an
iteration, which triggered a bogus warning.

The ldist-22 regression is interesting though:

void foo ()
{
  int i;

   :
  goto ; [INV]

   :
  a[i_1] = 0;
  if (i_1 > 100)
    goto ; [INV]
  else
    goto ; [INV]

   :
  b[i_1] = i_1;

   :
  i_8 = i_1 + 1;

   :
  # i_1 = PHI <0(2), i_8(5)>
  if (i_1 <= 1023)
    goto ; [INV]
  else
    goto ; [INV]

   :
  return;
}

Here we fail to look past 5->6 because BB5 is the latch and is not the
last block in the path, so we fail to thread 3->5->6->3.  Doing so would
have split the function into two loops, one of which could use a memset:

void foo ()
{
  int i;

   :
  goto ; [INV]

   :
  # i_12 = PHI
  a[i_12] = 0;
  if (i_12 > 100)
    goto ; [INV]
  else
    goto ; [INV]

   :
  i_9 = i_12 + 1;
  goto ; [100.00%]

   :
  b[i_12] = i_12;
  i_8 = i_12 + 1;

   :
  # i_1 = PHI <0(2), i_8(5)>
  if (i_1 <= 1023)
    goto ; [INV]
  else
    goto ; [INV]

   :
  return;
}

I would have to agree that threading through latches is problematic.  For
that matter, the ldist-22 test shows that we're depending on the threader
to do work that seems to belong in the loop optimizer world.  Would it be
crazy to suggest that we disable threading through latches altogether,
and do whatever we're missing in the loop world?  It seems loop has all
the tools, cost model, and framework to do so.  Of course, I know 0 about
loop, and would hate to add work to others' plates.

Thanks.
Aldy

BTW, I haven't looked at the graphite regressions.
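P.S. For concreteness, the kind of check being discussed would be roughly
of this shape.  This is an untested sketch of mine, not Michael's actual
patch: the function name is made up, I'm pretending the path is in
execution order (the backward threader actually stores it in reverse),
and the surrounding plumbing in the backward threader is elided.

  /* Sketch only: return true if PATH goes through a loop latch at a
     non-final position, i.e. if threading it would rotate the loop and
     leave code in the new latch.  Dropping the position check would
     disable threading through latches altogether.  */

  static bool
  path_rotates_loop_p (const vec<basic_block> &path)
  {
    for (unsigned i = 0; i < path.length (); i++)
      {
        basic_block bb = path[i];
        if (bb->loop_father
            && bb == bb->loop_father->latch
            && i != path.length () - 1)
          return true;
      }
    return false;
  }

Whether that belongs in the threader at all, or whether the loop
optimizers should pick up the slack as suggested above, is of course the
real question.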