From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <aldyh@redhat.com>
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.133.124])
 by sourceware.org (Postfix) with ESMTP id 10D67384C002
 for <gcc@gcc.gnu.org>; Thu,  9 Sep 2021 09:21:18 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 10D67384C002
Received: from mail-wr1-f70.google.com (mail-wr1-f70.google.com
 [209.85.221.70]) (Using TLS) by relay.mimecast.com with ESMTP id
 us-mta-102-XNyQD3_KPMaKVSx-0QfL0A-1; Thu, 09 Sep 2021 05:21:16 -0400
X-MC-Unique: XNyQD3_KPMaKVSx-0QfL0A-1
Received: by mail-wr1-f70.google.com with SMTP id
 h1-20020adffd41000000b0015931e17ccfso304697wrs.18
 for <gcc@gcc.gnu.org>; Thu, 09 Sep 2021 02:21:16 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=x-gm-message-state:subject:to:cc:references:from:message-id:date
 :user-agent:mime-version:in-reply-to:content-language
 :content-transfer-encoding;
 bh=r7Z+1zE5FyGZeBInbrCmiZkBd3HHg1h+2tKdbUyb8W0=;
 b=1ZTxpqOjJ9rmaene9ScTGGiQl+V6Hl1nU5vuO87kiKk0/O8B+Cx/laY2a5J3lZIIXW
 LsuuPy7MkhIa7XM60JKAiwC9b1hh8AefA5bww4yrNgbaLhJjOEEkj4U6UxI2fiqvL1d3
 eESzwzAsPliJXdSiZ7v1mu+jj0j6+4TjK1zO1EJJNsyVTclQk5IIm9MtmwRbw+2IdZbK
 Ck/bRvirvH+uW7qIacPqoL5cBhi2hiAy4Sd9A/vR9B3LhQC7rUmmrnt4tuSuQpksrpfz
 +U/shJ94GUxCPes1mn+eDHbJMnFteOFJKGV8sKj8N/c8ypcITLv8dxXhXABr3P/LpWGP
 tMGg==
X-Gm-Message-State: AOAM533FAYSYJrA9zdbQ4WEEkW8iR5XeGBHHn6PlZ9g2tHcksIYPoi2q
 CGYxYV0DjTaGx2JVZSfRMnhTdZHIPXjRaEY+cWjnKjUianZwIReXJnpyyoeyRI8fW5pU7E7ErIB
 lVK2JM7c=
X-Received: by 2002:a1c:21c3:: with SMTP id h186mr673440wmh.18.1631179275267; 
 Thu, 09 Sep 2021 02:21:15 -0700 (PDT)
X-Google-Smtp-Source: ABdhPJxVGqwSGuh4S7Lg5fXdCrj/VO0xPfLZHKymH/RCkvWnL9MKc4/kfSXbkvwxrJTnWNfN9iaEsg==
X-Received: by 2002:a1c:21c3:: with SMTP id h186mr673411wmh.18.1631179274843; 
 Thu, 09 Sep 2021 02:21:14 -0700 (PDT)
Received: from abulafia.quesejoda.com ([139.47.33.227])
 by smtp.gmail.com with ESMTPSA id b1sm1332360wrh.85.2021.09.09.02.21.13
 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128);
 Thu, 09 Sep 2021 02:21:14 -0700 (PDT)
Subject: Re: More aggressive threading causing loop-interchange-9.c regression
To: Richard Biener <richard.guenther@gmail.com>
Cc: Michael Matz <matz@suse.de>, Jeff Law <jeffreyalaw@gmail.com>,
 GCC Mailing List <gcc@gcc.gnu.org>, Andrew MacLeod <amacleod@redhat.com>
References: <09e48b82-bc51-405e-7680-89a5f08e4e8f@redhat.com>
 <alpine.LSU.2.20.2109071406070.12583@wotan.suse.de>
 <cb4191f1-0432-8740-b327-078c2aaff0fa@redhat.com>
 <CAFiYyc3ZBApnN3ks3YExLTLQ8tvEdFV-uuU2MXVVqXFYn+cdRw@mail.gmail.com>
 <d8234eed-6c60-c5fb-8800-5e9c5b932c58@redhat.com>
 <CAFiYyc2KWNEMD31AdYuNJ-dP7ixMsWtTCtokQpcbRrZctTUqzA@mail.gmail.com>
 <6d5695e4-e4eb-14a5-46a6-f425d1514008@redhat.com>
 <alpine.LSU.2.20.2109081627081.12583@wotan.suse.de>
 <alpine.LSU.2.20.2109081736530.12583@wotan.suse.de>
 <CAFiYyc1Mfw8RJ+qSGzwRtA4vhWLnLiFYeuPLxagFu6H6e_mVhg@mail.gmail.com>
 <07fdd2bb-e6b7-fe66-f6a0-df6ec0704ae4@redhat.com>
 <CAFiYyc1oUQciFz12-70SvZw3e7h_p8TO3LKBVNuvAvPyjn9smQ@mail.gmail.com>
 <ee556546-f644-24e5-9570-94dec3676e5b@redhat.com>
 <CAFiYyc1=Bj3yBhvKJtYFHWqHU7WgW-4RWwZ6CXSwQiP-QuYGXw@mail.gmail.com>
From: Aldy Hernandez <aldyh@redhat.com>
Message-ID: <8c49db8d-3119-0dc2-2bbb-4062c8d5d53b@redhat.com>
Date: Thu, 9 Sep 2021 11:21:13 +0200
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
 Thunderbird/78.11.0
MIME-Version: 1.0
In-Reply-To: <CAFiYyc1=Bj3yBhvKJtYFHWqHU7WgW-4RWwZ6CXSwQiP-QuYGXw@mail.gmail.com>
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, score=-6.9 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, NICE_REPLY_A,
 RCVD_IN_DNSWL_LOW, RCVD_IN_MSPIKE_H2, SPF_HELO_NONE, SPF_NONE,
 TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: gcc@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc mailing list <gcc.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc>,
 <mailto:gcc-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc/>
List-Post: <mailto:gcc@gcc.gnu.org>
List-Help: <mailto:gcc-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc>,
 <mailto:gcc-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Thu, 09 Sep 2021 09:21:19 -0000


On 9/9/21 10:58 AM, Richard Biener wrote:
> On Thu, Sep 9, 2021 at 10:36 AM Aldy Hernandez <aldyh@redhat.com> wrote:
>>
>>
>>
>> On 9/9/21 9:45 AM, Richard Biener wrote:
>>> On Thu, Sep 9, 2021 at 9:37 AM Aldy Hernandez <aldyh@redhat.com> wrote:
>>>>
>>>>
>>>>
>>>> On 9/9/21 8:57 AM, Richard Biener wrote:
>>>>> On Wed, Sep 8, 2021 at 8:13 PM Michael Matz <matz@suse.de> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> [lame answer to self]
>>>>>>
>>>>>> On Wed, 8 Sep 2021, Michael Matz wrote:
>>>>>>
>>>>>>>>> The forward threader guards against this by simply disallowing
>>>>>>>>> threadings that involve different loops.  As I see
>>>>>>>>
>>>>>>>> The thread in question (5->9->3) is all within the same outer loop,
>>>>>>>> though. BTW, the backward threader also disallows threading across
>>>>>>>> different loops (see path_crosses_loops variable).
>>>>>> ...
>>>>>>> Maybe it's possible to not disable threading over latches altogether in
>>>>>>> the backward threader (like it's tried now), but I haven't looked at the
>>>>>>> specific situation here in depth, so take my view only as opinion from a
>>>>>>> large distance :-)
>>>>>>
>>>>>> I've now looked at the concrete situation.  So yeah, the whole path is in
>>>>>> the same loop, crosses the latch, _and there's code following the latch
>>>>>> on that path_.  (I.e. the latch isn't the last block in the path).  In
>>>>>> particular, after loop_optimizer_init() (before any threading) we have:
>>>>>>
>>>>>>      <bb 3> [local count: 118111600]:
>>>>>>      # j_19 = PHI <j_13(9), 0(7)>
>>>>>>      sum_11 = c[j_19];
>>>>>>      if (n_10(D) > 0)
>>>>>>        goto <bb 8>; [89.00%]
>>>>>>      else
>>>>>>        goto <bb 5>; [11.00%]
>>>>>>
>>>>>>         <bb 8> [local count: 105119324]:
>>>>>> ...
>>>>>>
>>>>>>      <bb 5> [local count: 118111600]:
>>>>>>      # sum_21 = PHI <sum_14(4), sum_11(3)>
>>>>>>      c[j_19] = sum_21;
>>>>>>      j_13 = j_19 + 1;
>>>>>>      if (n_10(D) > j_13)
>>>>>>        goto <bb 9>; [89.00%]
>>>>>>      else
>>>>>>        goto <bb 6>; [11.00%]
>>>>>>
>>>>>>      <bb 9> [local count: 105119324]:
>>>>>>      goto <bb 3>; [100.00%]
>>>>>>
>>>>>> With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the
>>>>>> pre-header of inner loop, but more importantly something that's not at the
>>>>>> start of the outer loop.
>>>>>>
>>>>>> Now, any thread that includes the backedge 9->3 _including_ its
>>>>>> destination (i.e. where the backedge isn't the last to-be-redirected edge)
>>>>>> necessarily duplicates all code from that destination onto the back edge.
>>>>>> Here it's the load from c[j] into sum_11.
>>>>>>
>>>>>> The important part is the code is emitted onto the back edge,
>>>>>> conceptually; in reality it's simply included into the (new) latch block
>>>>>> (the duplicate of bb9, which is bb12 intermediately, then named bb7 after
>>>>>> cfg_cleanup).
>>>>>>
>>>>>> That's what we can't have for some of our structural loop optimizers:
>>>>>> there must be no code executed after the exit test (e.g. in the latch
>>>>>> block).  (This requirement makes reasoning about which code is or isn't
>>>>>> executed completely for an iteration trivial; simply everything in the
>>>>>> body is always executed; e.g. loop interchange uses this to check that
>>>>>> there are no memory references after the exit test, because those would
>>>>>> then be only conditional and hence make loop interchange very awkward).
>>>>>>
>>>>>> Note that this situation can't be later rectified anymore: the duplicated
>>>>>> instructions (because they are memory refs) must remain after the exit
>>>>>> test.  Only by rerolling/unrotating the loop (i.e. noticing that the
>>>>>> memory refs on the loop-entry path and on the back edge are equivalent)
>>>>>> would that be possible, but that's something we aren't capable of.  Even
>>>>>> if we were that would simply just revert the whole work that the threader
>>>>>> did, so it's better to not even do that to start with.
>>>>>>
>>>>>> I believe something like below would be appropriate, it disables threading
>>>>>> if the path contains a latch at the non-last position (due to being
>>>>>> backwards on the non-first position in the array).  I.e. it disables
>>>>>> rotating the loop if there's danger of polluting the back edge.  It might
>>>>>> be improved if the blocks following (preceding!) the latch are themselves
>>>>>> empty because then no code is duplicated.  It might also be improved if
>>>>>> the latch is already non-empty.  That code should probably only be active
>>>>>> before the loop optimizers, but currently the backward threader isn't
>>>>>> differentiating between before/after loop-optims.
>>>>>>
>>>>>> I haven't tested this patch at all, except that it fixes the testcase :)
>>>>>
>>>>> Lame comment at the current end of the thread - it's not threading through the
>>>>
>>>> I don't know why y'all keep using the word "lame".  On the contrary,
>>>> these are incredibly useful explanations.  Thanks.
>>>>
>>>>> latch but threading through the loop header that's problematic, at least if the
>>>>> end of the threading path ends within the loop (threading through the header
>>>>> to the loop exit is fine).  Because in that situation you effectively created an
>>>>> alternate loop entry.  Threading through the latch into the loop header is
>>>>> fine but with simple latches that likely will never happen (if there are no
>>>>> simple latches then the latch can contain the loop exit test).
>>>>>
>>>>> See tree-ssa-threadupdate.c:thread_block_1
>>>>>
>>>>>          e2 = path->last ()->e;
>>>>>          if (!e2 || noloop_only)
>>>>>            {
>>>>>              /* If NOLOOP_ONLY is true, we only allow threading through the
>>>>>                 header of a loop to exit edges.  */
>>>>>
>>>>>              /* One case occurs when there was loop header buried in a jump
>>>>>                 threading path that crosses loop boundaries.  We do not try
>>>>>                 and thread this elsewhere, so just cancel the jump threading
>>>>>                 request by clearing the AUX field now.  */
>>>>>              if (bb->loop_father != e2->src->loop_father
>>>>>                  && (!loop_exit_edge_p (e2->src->loop_father, e2)
>>>>>                      || flow_loop_nested_p (bb->loop_father,
>>>>>                                             e2->dest->loop_father)))
>>>>>                {
>>>>>                  /* Since this case is not handled by our special code
>>>>>                     to thread through a loop header, we must explicitly
>>>>>                     cancel the threading request here.  */
>>>>>                  delete_jump_thread_path (path);
>>>>>                  e->aux = NULL;
>>>>>                  continue;
>>>>>                }
>>>>
>>>> But this is for a threading path that crosses loop boundaries, which is
>>>> not the case.  Perhaps we should restrict this further to threads within
>>>> a loop?
>>>>
>>>>>
>>>>> there are a lot of "useful" checks in this function and the backwards threader
>>>>> should adopt those.  Note the backwards threader originally only did
>>>>> FSM style threadings which are exactly those possibly "harmful" ones, forming
>>>>> irreducible regions at worst or sub-loops at best.  That might explain the
>>>>> lack of those checks.
>>>>
>>>> Also, the aforementioned checks are in jump_thread_path_registry, which
>>>> is also shared by the backward threader.  These are thread discards
>>>> _after_ a thread has been registered.
>>>
>>> Yeah, that's indeed unfortunate.
>>>
>>>>    The backward threader should also
>>>> be using these restrictions.  Unless, I'm missing some interaction with
>>>> the FSM/etc threading types as per the preamble to the snippet you provided:
>>>>
>>>>         if (((*path)[1]->type == EDGE_COPY_SRC_JOINER_BLOCK && !joiners)
>>>>             || ((*path)[1]->type == EDGE_COPY_SRC_BLOCK && joiners))
>>>>           continue;
>>>
>>> Indeed.  But I understand the backwards threader does not (only) do FSM
>>> threading now.
>>
>> If it does, it was not part of my rewrite.  I was careful to not touch
>> anything dealing with either path profitability or low-level path
>> registering.
>>
>> The path registering is in back_threader_registry::register_path().  We
>> only use EDGE_FSM_THREADs and then a final EDGE_NO_COPY.  ISTM that
>> those are only EDGE_FSM_THREADs??
> 
> Well, if the backwards threader classifies everything as FSM that's probably
> inaccurate since only threads through the loop latch are "FSM".  There is
> the comment
> 
>    /* If this path does not thread through the loop latch, then we are
>       using the FSM threader to find old style jump threads.  This
>       is good, except the FSM threader does not re-use an existing
>       threading path to reduce code duplication.
> 
>       So for that case, drastically reduce the number of statements
>       we are allowed to copy.  */

*blink*

Woah.  The backward threader has been using FSM threads indiscriminately 
as far as I can remember.  I wonder what would break if we "fixed it".
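
To check that I'm reading that comment right, here is a toy sketch of 
the distinction.  This is not GCC code: Block, ThreadKind, 
classify_path and max_copied_stmts are names I made up for 
illustration, and the limits are placeholders, not the real param 
values.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a basic block; not GCC's basic_block.
struct Block {
  int index;      // block number, e.g. 9 for bb9
  int loop;       // id of the innermost loop containing the block
  bool is_latch;  // true if this block is the latch of its loop
};

enum class ThreadKind { FSM, OldStyle };

// Per the quoted comment: only a path that threads through a loop
// latch counts as "FSM"; anything else is an old-style jump thread.
ThreadKind classify_path(const std::vector<Block>& path) {
  for (const Block& b : path)
    if (b.is_latch)
      return ThreadKind::FSM;
  return ThreadKind::OldStyle;
}

// Old-style threads found this way get a drastically smaller
// statement-copy budget.  Both numbers are assumed placeholders.
int max_copied_stmts(ThreadKind kind) {
  const int fsm_limit = 100;
  return kind == ThreadKind::FSM ? fsm_limit : fsm_limit / 10;
}
```

With bb9 marked as the latch, the 5->9->3 path from the testcase would 
classify as FSM here, whereas a path such as 3->8 that never touches 
the latch would get the reduced budget.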

> 
> so these cases should use the "old style" validity/costing metrics and thus
> classify threading opportunities in a different way?

Jeff, do you have any insight here?

> 
> I think today "backwards" vs. "forwards" only refers to the way we find
> threading opportunities.

Yes, it's a mess.

I ran some experiments a while back, and my current work on the enhanced 
solver/threader can fold virtually everything the DOM/threader gets 
(even with its use of const_and_copies, avail_exprs, and 
evrp_range_analyzer), while getting 5% more DOM threads and 1% more 
overall threads.  That is, I've been testing whether the path solver can 
solve everything the DOM threader needs (the hybrid approach I mentioned).

Unfortunately, replacing the forward threader right now is not feasible 
for a few reasons:

a) The const_and_copies/avail_exprs relation framework can handle floats, 
which ranger can't do yet; that's next year's ranger work.

b) Even though we can seemingly fold everything DOM/threader does, in 
order to replace it with a backward threader instance we'd have to merge 
the cost/profitability code scattered throughout the forward threader, 
as well as the EDGE_FSM* / EDGE_COPY* business.

c) DOM changes the IL as it goes.  Though we could conceivably divorce 
the two and do the threading after DOM is done.

But I digress...
Aldy