From: Aldy Hernandez <aldyh@redhat.com>
To: Richard Biener <richard.guenther@gmail.com>
Cc: Michael Matz <matz@suse.de>, Jeff Law <jeffreyalaw@gmail.com>,
	GCC Mailing List <gcc@gcc.gnu.org>,
	Andrew MacLeod <amacleod@redhat.com>
Subject: Re: More aggressive threading causing loop-interchange-9.c regression
Date: Thu, 9 Sep 2021 11:21:13 +0200
Message-ID: <8c49db8d-3119-0dc2-2bbb-4062c8d5d53b@redhat.com>
In-Reply-To: <CAFiYyc1=Bj3yBhvKJtYFHWqHU7WgW-4RWwZ6CXSwQiP-QuYGXw@mail.gmail.com>



On 9/9/21 10:58 AM, Richard Biener wrote:
> On Thu, Sep 9, 2021 at 10:36 AM Aldy Hernandez <aldyh@redhat.com> wrote:
>>
>>
>>
>> On 9/9/21 9:45 AM, Richard Biener wrote:
>>> On Thu, Sep 9, 2021 at 9:37 AM Aldy Hernandez <aldyh@redhat.com> wrote:
>>>>
>>>>
>>>>
>>>> On 9/9/21 8:57 AM, Richard Biener wrote:
>>>>> On Wed, Sep 8, 2021 at 8:13 PM Michael Matz <matz@suse.de> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> [lame answer to self]
>>>>>>
>>>>>> On Wed, 8 Sep 2021, Michael Matz wrote:
>>>>>>
>>>>>>>>> The forward threader guards against this by simply disallowing
>>>>>>>>> threadings that involve different loops.  As I see
>>>>>>>>
>>>>>>>> The thread in question (5->9->3) is all within the same outer loop,
>>>>>>>> though. BTW, the backward threader also disallows threading across
>>>>>>>> different loops (see path_crosses_loops variable).
>>>>>> ...
>>>>>>> Maybe it's possible to not disable threading over latches altogether in
>>>>>>> the backward threader (like it's tried now), but I haven't looked at the
>>>>>>> specific situation here in depth, so take my view only as opinion from a
>>>>>>> large distance :-)
>>>>>>
>>>>>> I've now looked at the concrete situation.  So yeah, the whole path is in
>>>>>> the same loop, crosses the latch, _and there's code following the latch
>>>>>> on that path_.  (I.e. the latch isn't the last block in the path).  In
>>>>>> particular, after loop_optimizer_init() (before any threading) we have:
>>>>>>
>>>>>>      <bb 3> [local count: 118111600]:
>>>>>>      # j_19 = PHI <j_13(9), 0(7)>
>>>>>>      sum_11 = c[j_19];
>>>>>>      if (n_10(D) > 0)
>>>>>>        goto <bb 8>; [89.00%]
>>>>>>      else
>>>>>>        goto <bb 5>; [11.00%]
>>>>>>
>>>>>>         <bb 8> [local count: 105119324]:
>>>>>> ...
>>>>>>
>>>>>>      <bb 5> [local count: 118111600]:
>>>>>>      # sum_21 = PHI <sum_14(4), sum_11(3)>
>>>>>>      c[j_19] = sum_21;
>>>>>>      j_13 = j_19 + 1;
>>>>>>      if (n_10(D) > j_13)
>>>>>>        goto <bb 9>; [89.00%]
>>>>>>      else
>>>>>>        goto <bb 6>; [11.00%]
>>>>>>
>>>>>>      <bb 9> [local count: 105119324]:
>>>>>>      goto <bb 3>; [100.00%]
>>>>>>
>>>>>> With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the
>>>>>> pre-header of inner loop, but more importantly something that's not at the
>>>>>> start of the outer loop.
>>>>>>
>>>>>> Now, any thread that includes the backedge 9->3 _including_ its
>>>>>> destination (i.e. where the backedge isn't the last to-be-redirected edge)
>>>>>> necessarily duplicates all code from that destination onto the back edge.
>>>>>> Here it's the load from c[j] into sum_11.
>>>>>>
>>>>>> The important part is the code is emitted onto the back edge,
>>>>>> conceptually; in reality it's simply included into the (new) latch block
>>>>>> (the duplicate of bb9, which is bb12 intermediately, then named bb7 after
>>>>>> cfg_cleanup).
>>>>>>
>>>>>> That's what we can't have for some of our structural loop optimizers:
>>>>>> there must be no code executed after the exit test (e.g. in the latch
>>>>>> block).  (This requirement makes reasoning about which code is or isn't
>>>>>> executed completely for an iteration trivial; simply everything in the
>>>>>> body is always executed; e.g. loop interchange uses this to check that
>>>>>> there are no memory references after the exit test, because those would
>>>>>> then be only conditional and hence make loop interchange very awkward).
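
(To make that concrete for myself: here is a rough C-level sketch of the
effect.  This is my own illustration, not taken from the dump above; the
names simply mirror the testcase.)

  /* Before threading: the latch (bb9) is empty, so nothing at all runs
     after the exit test.  */
  void
  foo_before (int *c, int n)
  {
    int j = 0, sum;
    do
      {
        sum = c[j];        /* header (bb3): load */
        /* ... inner loop (bb8 ...) may update sum ...  */
        c[j] = sum;        /* bb5: store */
        j = j + 1;
      }
    while (n > j);         /* exit test is the last thing executed */
  }

  /* After threading 5->9->3 including the destination bb3, the load
     from c[j] is duplicated onto the back edge, i.e. into the new
     latch, so a memory reference now executes after the exit test.  */
  void
  foo_after (int *c, int n)
  {
    int j = 0;
    int sum = c[j];        /* first iteration's load, peeled */
    do
      {
        /* ... inner loop ...  */
        c[j] = sum;
        j = j + 1;
        if (!(n > j))
          break;
        sum = c[j];        /* duplicated load, after the exit test */
      }
    while (1);
  }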
>>>>>>
>>>>>> Note that this situation can't be later rectified anymore: the duplicated
>>>>>> instructions (because they are memory refs) must remain after the exit
>>>>>> test.  Only by rerolling/unrotating the loop (i.e. noticing that the
>>>>>> memory refs on the loop-entry path and on the back edge are equivalent)
>>>>>> would that be possible, but that's something we aren't capable of.  Even
>>>>>> if we were that would simply just revert the whole work that the threader
>>>>>> did, so it's better to not even do that to start with.
>>>>>>
>>>>>> I believe something like below would be appropriate; it disables threading
>>>>>> if the path contains a latch at the non-last position (due to being
>>>>>> backwards on the non-first position in the array).  I.e. it disables
>>>>>> rotating the loop if there's danger of polluting the back edge.  It might
>>>>>> be improved if the blocks following (preceding!) the latch are themselves
>>>>>> empty because then no code is duplicated.  It might also be improved if
>>>>>> the latch is already non-empty.  That code should probably only be active
>>>>>> before the loop optimizers, but currently the backward threader isn't
>>>>>> differentiating between before/after loop-optims.
>>>>>>
>>>>>> I haven't tested this patch at all, except that it fixes the testcase :)
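
(For readers without the patch in front of them, the check being described
might look roughly like the sketch below.  This is my own untested
illustration, not Michael's actual patch; the function name and the exact
path representation are assumptions.)

  /* Sketch only: PATH is assumed to be the backward threader's block
     array, stored in reverse order, so path[0] is the final block of
     the path.  Return true if a loop latch appears anywhere but at the
     end of the path, i.e. anywhere but index 0; threading such a path
     would rotate the loop and emit code onto the back edge.  */
  static bool
  path_would_rotate_loop_p (const vec<basic_block> &path)
  {
    for (unsigned i = 1; i < path.length (); ++i)
      {
        basic_block bb = path[i];
        if (bb->loop_father && bb->loop_father->latch == bb)
          return true;
      }
    return false;
  }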
>>>>>
>>>>> Lame comment at the current end of the thread - it's not threading through the
>>>>
>>>> I don't know why y'all keep using the word "lame".  On the contrary,
>>>> these are incredibly useful explanations.  Thanks.
>>>>
>>>>> latch but threading through the loop header that's problematic, at least if the
>>>>> end of the threading path ends within the loop (threading through the header
>>>>> to the loop exit is fine).  Because in that situation you effectively created an
>>>>> alternate loop entry.  Threading through the latch into the loop header is
>>>>> fine but with simple latches that likely will never happen (if there are no
>>>>> simple latches then the latch can contain the loop exit test).
>>>>>
>>>>> See tree-ssa-threadupdate.c:thread_block_1
>>>>>
>>>>>          e2 = path->last ()->e;
>>>>>          if (!e2 || noloop_only)
>>>>>            {
>>>>>              /* If NOLOOP_ONLY is true, we only allow threading through the
>>>>>                 header of a loop to exit edges.  */
>>>>>
>>>>>              /* One case occurs when there was loop header buried in a jump
>>>>>                 threading path that crosses loop boundaries.  We do not try
>>>>>                 and thread this elsewhere, so just cancel the jump threading
>>>>>                 request by clearing the AUX field now.  */
>>>>>              if (bb->loop_father != e2->src->loop_father
>>>>>                  && (!loop_exit_edge_p (e2->src->loop_father, e2)
>>>>>                      || flow_loop_nested_p (bb->loop_father,
>>>>>                                             e2->dest->loop_father)))
>>>>>                {
>>>>>                  /* Since this case is not handled by our special code
>>>>>                     to thread through a loop header, we must explicitly
>>>>>                     cancel the threading request here.  */
>>>>>                  delete_jump_thread_path (path);
>>>>>                  e->aux = NULL;
>>>>>                  continue;
>>>>>                }
>>>>
>>>> But this is for a threading path that crosses loop boundaries, which is
>>>> not the case.  Perhaps we should restrict this further to threads within
>>>> a loop?
>>>>
>>>>>
>>>>> there are a lot of "useful" checks in this function and the backwards threader
>>>>> should adopt those.  Note the backwards threader originally only did
>>>>> FSM style threadings which are exactly those possibly "harmful" ones, forming
>>>>> irreducible regions at worst or sub-loops at best.  That might explain the
>>>>> lack of those checks.
>>>>
>>>> Also, the aforementioned checks are in jump_thread_path_registry, which
>>>> is also shared by the backward threader.  These are thread discards
>>>> _after_ a thread has been registered.
>>>
>>> Yeah, that's indeed unfortunate.
>>>
>>>>    The backward threader should also
>>>> be using these restrictions.  Unless I'm missing some interaction with
>>>> the FSM/etc threading types as per the preamble to the snippet you provided:
>>>>
>>>>         if (((*path)[1]->type == EDGE_COPY_SRC_JOINER_BLOCK && !joiners)
>>>>             || ((*path)[1]->type == EDGE_COPY_SRC_BLOCK && joiners))
>>>>           continue;
>>>
>>> Indeed.  But I understand the backwards threader does not (only) do FSM
>>> threading now.
>>
>> If it does, it was not part of my rewrite.  I was careful to not touch
>> anything dealing with either path profitability or low-level path
>> registering.
>>
>> The path registering is in back_threader_registry::register_path().  We
>> only use EDGE_FSM_THREADs and then a final EDGE_NO_COPY.  ISTM that
>> those are only EDGE_FSM_THREADs??
> 
> Well, if the backwards threader classifies everything as FSM that's probably
> inaccurate since only threads through the loop latch are "FSM".  There is
> the comment
> 
>    /* If this path does not thread through the loop latch, then we are
>       using the FSM threader to find old style jump threads.  This
>       is good, except the FSM threader does not re-use an existing
>       threading path to reduce code duplication.
> 
>       So for that case, drastically reduce the number of statements
>       we are allowed to copy.  */

*blink*

Woah.  The backward threader has been using FSM threads indiscriminately 
as far as I can remember.  I wonder what would break if we "fixed it".

> 
> so these cases should use the "old style" validity/costing metrics and thus
> classify threading opportunities in a different way?

Jeff, do you have any insight here?
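
(Concretely, I imagine "classifying differently" would be something along
the lines of the sketch below: purely my own illustration with a
hypothetical helper, glossing over the joiner-block distinction; as noted
above, register_path() currently marks every edge EDGE_FSM_THREAD.)

  /* Hypothetical sketch: choose the thread type based on whether the
     candidate path actually threads through its loop's latch, instead
     of always using EDGE_FSM_THREAD.  */
  static enum jump_thread_edge_type
  classify_candidate_edge (bool threads_through_latch)
  {
    /* Only "real" FSM threads cross the loop's back edge.  */
    if (threads_through_latch)
      return EDGE_FSM_THREAD;
    /* Otherwise this is an old-style jump thread and should be costed
       like one.  */
    return EDGE_COPY_SRC_BLOCK;
  }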

> 
> I think today "backwards" vs. "forwards" only refers to the way we find
> threading opportunities.

Yes, it's a mess.

I ran some experiments a while back, and my current work on the enhanced
solver/threader can fold virtually everything the DOM/threader can (even
with the latter's use of const_and_copies, avail_exprs, and
evrp_range_analyzer), while getting 5% more DOM threads and 1% more
overall threads.  That is, I've been testing whether the path solver can
solve everything the DOM threader needs (the hybrid approach I mentioned).

Unfortunately, replacing the forward threader right now is not feasible 
for a few reasons:

a) The const_and_copies/avail_exprs relation framework can handle floats,
and float support is next year's ranger work.

b) Even though we can seemingly fold everything DOM/threader does, in 
order to replace it with a backward threader instance we'd have to merge 
the cost/profitability code scattered throughout the forward threader, 
as well as the EDGE_FSM* / EDGE_COPY* business.

c) DOM changes the IL as it goes.  Though we could conceivably divorce
the two and do the threading after DOM is done.

But I digress...
Aldy

