From: Richard Biener
Date: Thu, 9 Sep 2021 12:15:57 +0200
Subject: Re: More aggressive threading causing loop-interchange-9.c regression
To: Aldy Hernandez
Cc: Michael Matz, Jeff Law, GCC Mailing List, Andrew MacLeod
On Thu, Sep 9, 2021 at 11:21 AM Aldy Hernandez wrote:
>
>
>
> On 9/9/21 10:58 AM, Richard Biener wrote:
> > On Thu, Sep 9, 2021 at 10:36 AM Aldy Hernandez wrote:
> >>
> >>
> >>
> >> On 9/9/21 9:45 AM, Richard Biener wrote:
> >>> On Thu, Sep 9, 2021 at 9:37 AM Aldy Hernandez wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 9/9/21 8:57 AM, Richard Biener wrote:
> >>>>> On Wed, Sep 8, 2021 at 8:13 PM Michael Matz wrote:
> >>>>>>
> >>>>>> Hello,
> >>>>>>
> >>>>>> [lame answer to self]
> >>>>>>
> >>>>>> On Wed, 8 Sep 2021, Michael Matz wrote:
> >>>>>>
> >>>>>>>>> The forward threader guards against this by simply disallowing
> >>>>>>>>> threadings that involve different loops. As I see
> >>>>>>>>
> >>>>>>>> The thread in question (5->9->3) is all within the same outer loop,
> >>>>>>>> though. BTW, the backward threader also disallows threading across
> >>>>>>>> different loops (see path_crosses_loops variable).
> >>>>>> ...
> >>>>>>> Maybe it's possible to not disable threading over latches altogether in
> >>>>>>> the backward threader (like it's tried now), but I haven't looked at the
> >>>>>>> specific situation here in depth, so take my view only as opinion from a
> >>>>>>> large distance :-)
> >>>>>>
> >>>>>> I've now looked at the concrete situation. So yeah, the whole path is in
> >>>>>> the same loop, crosses the latch, _and there's code following the latch
> >>>>>> on that path_. (I.e. the latch isn't the last block in the path.)
> >>>>>> In particular, after loop_optimizer_init() (before any threading) we have:
> >>>>>>
> >>>>>>   [local count: 118111600]:
> >>>>>>   # j_19 = PHI
> >>>>>>   sum_11 = c[j_19];
> >>>>>>   if (n_10(D) > 0)
> >>>>>>     goto ; [89.00%]
> >>>>>>   else
> >>>>>>     goto ; [11.00%]
> >>>>>>
> >>>>>>   [local count: 105119324]:
> >>>>>>   ...
> >>>>>>
> >>>>>>   [local count: 118111600]:
> >>>>>>   # sum_21 = PHI
> >>>>>>   c[j_19] = sum_21;
> >>>>>>   j_13 = j_19 + 1;
> >>>>>>   if (n_10(D) > j_13)
> >>>>>>     goto ; [89.00%]
> >>>>>>   else
> >>>>>>     goto ; [11.00%]
> >>>>>>
> >>>>>>   [local count: 105119324]:
> >>>>>>   goto ; [100.00%]
> >>>>>>
> >>>>>> With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the
> >>>>>> pre-header of the inner loop, but more importantly something that's not at
> >>>>>> the start of the outer loop.
> >>>>>>
> >>>>>> Now, any thread that includes the backedge 9->3 _including_ its
> >>>>>> destination (i.e. where the backedge isn't the last to-be-redirected edge)
> >>>>>> necessarily duplicates all code from that destination onto the back edge.
> >>>>>> Here it's the load from c[j] into sum_11.
> >>>>>>
> >>>>>> The important part is the code is emitted onto the back edge,
> >>>>>> conceptually; in reality it's simply included into the (new) latch block
> >>>>>> (the duplicate of bb9, which is bb12 intermediately, then named bb7 after
> >>>>>> cfg_cleanup).
> >>>>>>
> >>>>>> That's what we can't have for some of our structural loop optimizers:
> >>>>>> there must be no code executed after the exit test (e.g. in the latch
> >>>>>> block). (This requirement makes reasoning about which code is or isn't
> >>>>>> executed completely for an iteration trivial; simply everything in the
> >>>>>> body is always executed; e.g. loop interchange uses this to check that
> >>>>>> there are no memory references after the exit test, because those would
> >>>>>> then be only conditional and hence make loop interchange very awkward).
> >>>>>>
> >>>>>> Note that this situation can't be later rectified anymore: the duplicated
> >>>>>> instructions (because they are memory refs) must remain after the exit
> >>>>>> test. Only by rerolling/unrotating the loop (i.e. noticing that the
> >>>>>> memory refs on the loop-entry path and on the back edge are equivalent)
> >>>>>> would that be possible, but that's something we aren't capable of. Even
> >>>>>> if we were, that would simply revert the whole work that the threader
> >>>>>> did, so it's better to not even do that to start with.
> >>>>>>
> >>>>>> I believe something like below would be appropriate; it disables threading
> >>>>>> if the path contains a latch at the non-last position (due to being
> >>>>>> backwards, on the non-first position in the array). I.e. it disables
> >>>>>> rotating the loop if there's danger of polluting the back edge. It might
> >>>>>> be improved if the blocks following (preceding!) the latch are themselves
> >>>>>> empty, because then no code is duplicated. It might also be improved if
> >>>>>> the latch is already non-empty. That code should probably only be active
> >>>>>> before the loop optimizers, but currently the backward threader isn't
> >>>>>> differentiating between before/after loop-optims.
> >>>>>>
> >>>>>> I haven't tested this patch at all, except that it fixes the testcase :)
> >>>>>
> >>>>> Lame comment at the current end of the thread - it's not threading through the
> >>>>
> >>>> I don't know why y'all keep using the word "lame". On the contrary,
> >>>> these are incredibly useful explanations. Thanks.
> >>>>
> >>>>> latch but threading through the loop header that's problematic, at least if the
> >>>>> end of the threading path ends within the loop (threading through the header
> >>>>> to the loop exit is fine). Because in that situation you effectively created an
> >>>>> alternate loop entry.
> >>>>> Threading through the latch into the loop header is
> >>>>> fine, but with simple latches that likely will never happen (if there are no
> >>>>> simple latches then the latch can contain the loop exit test).
> >>>>>
> >>>>> See tree-ssa-threadupdate.c:thread_block_1
> >>>>>
> >>>>>       e2 = path->last ()->e;
> >>>>>       if (!e2 || noloop_only)
> >>>>>         {
> >>>>>           /* If NOLOOP_ONLY is true, we only allow threading through the
> >>>>>              header of a loop to exit edges.  */
> >>>>>
> >>>>>           /* One case occurs when there was loop header buried in a jump
> >>>>>              threading path that crosses loop boundaries.  We do not try
> >>>>>              and thread this elsewhere, so just cancel the jump threading
> >>>>>              request by clearing the AUX field now.  */
> >>>>>           if (bb->loop_father != e2->src->loop_father
> >>>>>               && (!loop_exit_edge_p (e2->src->loop_father, e2)
> >>>>>                   || flow_loop_nested_p (bb->loop_father,
> >>>>>                                          e2->dest->loop_father)))
> >>>>>             {
> >>>>>               /* Since this case is not handled by our special code
> >>>>>                  to thread through a loop header, we must explicitly
> >>>>>                  cancel the threading request here.  */
> >>>>>               delete_jump_thread_path (path);
> >>>>>               e->aux = NULL;
> >>>>>               continue;
> >>>>>             }
> >>>>
> >>>> But this is for a threading path that crosses loop boundaries, which is
> >>>> not the case. Perhaps we should restrict this further to threads within
> >>>> a loop?
> >>>>
> >>>>>
> >>>>> there are a lot of "useful" checks in this function and the backwards threader
> >>>>> should adopt those. Note the backwards threader originally only did
> >>>>> FSM style threadings, which are exactly those possibly "harmful" ones, forming
> >>>>> irreducible regions at worst or sub-loops at best. That might explain the
> >>>>> lack of those checks.
> >>>>
> >>>> Also, the aforementioned checks are in jump_thread_path_registry, which
> >>>> is also shared by the backward threader. These are thread discards
> >>>> _after_ a thread has been registered.
> >>>
> >>> Yeah, that's indeed unfortunate.
> >>>
> >>>> The backward threader should also
> >>>> be using these restrictions. Unless I'm missing some interaction with
> >>>> the FSM/etc threading types, as per the preamble to the snippet you provided:
> >>>>
> >>>>       if (((*path)[1]->type == EDGE_COPY_SRC_JOINER_BLOCK && !joiners)
> >>>>           || ((*path)[1]->type == EDGE_COPY_SRC_BLOCK && joiners))
> >>>>         continue;
> >>>
> >>> Indeed. But I understand the backwards threader does not (only) do FSM
> >>> threading now.
> >>
> >> If it does, it was not part of my rewrite. I was careful to not touch
> >> anything dealing with either path profitability or low-level path
> >> registering.
> >>
> >> The path registering is in back_threader_registry::register_path(). We
> >> only use EDGE_FSM_THREADs and then a final EDGE_NO_COPY. ISTM that
> >> those are only EDGE_FSM_THREADs??
> >
> > Well, if the backwards threader classifies everything as FSM that's probably
> > inaccurate, since only threads through the loop latch are "FSM". There is
> > the comment
> >
> >       /* If this path does not thread through the loop latch, then we are
> >          using the FSM threader to find old style jump threads.  This
> >          is good, except the FSM threader does not re-use an existing
> >          threading path to reduce code duplication.
> >
> >          So for that case, drastically reduce the number of statements
> >          we are allowed to copy.  */
>
> *blink*
>
> Woah. The backward threader has been using FSM threads indiscriminately
> as far as I can remember. I wonder what would break if we "fixed it".
>
> >
> > so these cases should use the "old style" validity/costing metrics and thus
> > classify threading opportunities in a different way?
>
> Jeff, do you have any insight here?
>
> >
> > I think today "backwards" vs. "forwards" only refers to the way we find
> > threading opportunities.
>
> Yes, it's a mess.
>
> I ran some experiments a while back, and my current work on the enhanced
> solver/threader can fold virtually everything the DOM/threader gets
> (even with its use of const_and_copies, avail_exprs, and
> evrp_range_analyzer), while getting 5% more DOM threads and 1% more
> overall threads. That is, I've been testing whether the path solver can
> solve everything the DOM threader needs (the hybrid approach I mentioned).
>
> Unfortunately, replacing the forward threader right now is not feasible
> for a few reasons:
>
> a) The const_and_copies/avail_exprs relation framework can do floats,
> and that's next year's ranger work.

Hmm, it sounds like relying fully on ranger is a bit premature then.

> b) Even though we can seemingly fold everything the DOM/threader does, in
> order to replace it with a backward threader instance we'd have to merge
> the cost/profitability code scattered throughout the forward threader,
> as well as the EDGE_FSM* / EDGE_COPY* business.
>
> c) DOM changes the IL as it goes. Though we could conceivably divorce
> the two and do the threading after DOM is done.

Yeah, it does not actually process/simplify the blocks copied by
threading. In fact I think it only registers jump threading opportunities
during the DOM walk and commits them only later. But it of course uses
its avail / copies stack to find them - that part you cannot easily
divorce.

DOM is also yet another value-numbering framework - one could think of
stripping it down from that side, keeping the threading bits only (but
you'd still have the avail / copies bits).

That said, it has one nice property it can leverage due to its
incredibly simple memory redundancy handling, in that it usually
performs way fewer alias queries than FRE (even when you run the latter
in non-iterative mode).

But in the same way that DOM can register jump threading opportunities,
FRE could do so as well.

Richard.

> But I digress...
> Aldy
>