From: Richard Biener
Date: Thu, 9 Sep 2021 09:45:29 +0200
Subject: Re: More aggressive threading causing loop-interchange-9.c regression
To: Aldy Hernandez
Cc: Michael Matz, Jeff Law, GCC Mailing List, Andrew MacLeod
List-Id: Gcc mailing list
On Thu, Sep 9, 2021 at 9:37 AM Aldy Hernandez wrote:
>
> On 9/9/21 8:57 AM, Richard Biener wrote:
> > On Wed, Sep 8, 2021 at 8:13 PM Michael Matz wrote:
> >>
> >> Hello,
> >>
> >> [lame answer to self]
> >>
> >> On Wed, 8 Sep 2021, Michael Matz wrote:
> >>
> >>>>> The forward threader guards against this by simply disallowing
> >>>>> threadings that involve different loops.  As I see
> >>>>
> >>>> The thread in question (5->9->3) is all within the same outer loop,
> >>>> though.  BTW, the backward threader also disallows threading across
> >>>> different loops (see path_crosses_loops variable).
> >> ...
> >>> Maybe it's possible to not disable threading over latches altogether in
> >>> the backward threader (like it's tried now), but I haven't looked at the
> >>> specific situation here in depth, so take my view only as opinion from a
> >>> large distance :-)
> >>
> >> I've now looked at the concrete situation.  So yeah, the whole path is in
> >> the same loop, crosses the latch, _and there's code following the latch
> >> on that path_.  (I.e. the latch isn't the last block in the path.)  In
> >> particular, after loop_optimizer_init() (before any threading) we have:
> >>
> >>   [local count: 118111600]:
> >>   # j_19 = PHI
> >>   sum_11 = c[j_19];
> >>   if (n_10(D) > 0)
> >>     goto ;  [89.00%]
> >>   else
> >>     goto ;  [11.00%]
> >>
> >>   [local count: 105119324]:
> >> ...
> >>
> >>   [local count: 118111600]:
> >>   # sum_21 = PHI
> >>   c[j_19] = sum_21;
> >>   j_13 = j_19 + 1;
> >>   if (n_10(D) > j_13)
> >>     goto ;  [89.00%]
> >>   else
> >>     goto ;  [11.00%]
> >>
> >>   [local count: 105119324]:
> >>   goto ;  [100.00%]
> >>
> >> With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the
> >> pre-header of the inner loop, but more importantly something that's not
> >> at the start of the outer loop.
> >>
> >> Now, any thread that includes the backedge 9->3 _including_ its
> >> destination (i.e. where the backedge isn't the last to-be-redirected
> >> edge) necessarily duplicates all code from that destination onto the
> >> back edge.  Here it's the load from c[j] into sum_11.
> >>
> >> The important part is that the code is emitted onto the back edge,
> >> conceptually; in reality it's simply included in the (new) latch block
> >> (the duplicate of bb9, which is bb12 intermediately, then named bb7
> >> after cfg_cleanup).
> >>
> >> That's what we can't have in some of our structural loop optimizers:
> >> there must be no code executed after the exit test (e.g. in the latch
> >> block).  (This requirement makes reasoning about which code is or isn't
> >> executed completely for an iteration trivial: simply everything in the
> >> body is always executed.  E.g. loop interchange uses this to check that
> >> there are no memory references after the exit test, because those would
> >> then be only conditionally executed and hence make loop interchange very
> >> awkward.)
> >>
> >> Note that this situation can't be rectified later anymore: the
> >> duplicated instructions (because they are memory refs) must remain after
> >> the exit test.  Only by rerolling/unrotating the loop (i.e. noticing
> >> that the memory refs on the loop-entry path and on the back edge are
> >> equivalent) would that be possible, but that's something we aren't
> >> capable of.
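The rotation Michael describes can be sketched at the source level in plain C. This is a hypothetical reconstruction (the function names, the 3x3 array, and the column-sum kernel are mine, only loosely modeled on a loop-interchange-9.c-style loop nest, not taken from the actual testcase), but it shows where the duplicated c[j] load ends up:

```c
#include <assert.h>

#define N 3
static int a[N][N] = { {1, 2, 3}, {4, 5, 6}, {7, 8, 9} };

/* Shape before threading: the c[j] load sits in the loop header,
   so it executes before any exit test (the sum_11 load above).  */
static void
kernel_header (int n, int *c)
{
  for (int j = 0; j < n; j++)
    {
      int sum = c[j];               /* load in the header */
      for (int i = 0; i < n; i++)
        sum += a[i][j];
      c[j] = sum;
    }
}

/* Shape after a thread like 5->9->3: the loop is rotated and the
   duplicated c[j] load now executes after the "j < n" exit test,
   i.e. conceptually on the back edge / in the latch.  */
static void
kernel_rotated (int n, int *c)
{
  if (n <= 0)
    return;
  int j = 0;
  int sum = c[0];                   /* peeled copy of the load */
  for (;;)
    {
      for (int i = 0; i < n; i++)
        sum += a[i][j];
      c[j] = sum;
      j++;
      if (j >= n)
        break;                      /* exit test */
      sum = c[j];                   /* memory ref after the exit test */
    }
}
```

Both variants compute the same column sums; the difference is purely structural, but the second shape is the one loop interchange refuses to handle, because the load is now only conditionally executed per iteration.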
Even
> >> if we were, that would simply revert the whole work that the threader
> >> did, so it's better not to even do that to start with.
> >>
> >> I believe something like the patch below would be appropriate: it
> >> disables threading if the path contains a latch at a non-last position
> >> (due to being backwards, at a non-first position in the array).  I.e. it
> >> disables rotating the loop if there's danger of polluting the back edge.
> >> It might be improved if the blocks following (preceding!) the latch are
> >> themselves empty, because then no code is duplicated.  It might also be
> >> improved if the latch is already non-empty.  That code should probably
> >> only be active before the loop optimizers, but currently the backward
> >> threader isn't differentiating between before/after loop-optims.
> >>
> >> I haven't tested this patch at all, except that it fixes the testcase :)
> >
> > Lame comment at the current end of the thread - it's not threading through the
>
> I don't know why y'all keep using the word "lame".  On the contrary,
> these are incredibly useful explanations.  Thanks.
>
> > latch but threading through the loop header that's problematic, at least
> > if the threading path ends within the loop (threading through the header
> > to the loop exit is fine).  Because in that situation you effectively
> > create an alternate loop entry.  Threading through the latch into the
> > loop header is fine, but with simple latches that likely will never
> > happen (if there are no simple latches then the latch can contain the
> > loop exit test).
> >
> > See tree-ssa-threadupdate.c:thread_block_1
> >
> >       e2 = path->last ()->e;
> >       if (!e2 || noloop_only)
> >         {
> >           /* If NOLOOP_ONLY is true, we only allow threading through the
> >              header of a loop to exit edges.  */
> >
> >           /* One case occurs when there was loop header buried in a jump
> >              threading path that crosses loop boundaries.
We do not try
> >              and thread this elsewhere, so just cancel the jump threading
> >              request by clearing the AUX field now.  */
> >           if (bb->loop_father != e2->src->loop_father
> >               && (!loop_exit_edge_p (e2->src->loop_father, e2)
> >                   || flow_loop_nested_p (bb->loop_father,
> >                                          e2->dest->loop_father)))
> >             {
> >               /* Since this case is not handled by our special code
> >                  to thread through a loop header, we must explicitly
> >                  cancel the threading request here.  */
> >               delete_jump_thread_path (path);
> >               e->aux = NULL;
> >               continue;
> >             }
>
> But this is for a threading path that crosses loop boundaries, which is
> not the case here.  Perhaps we should restrict this further to threads
> within a loop?
>
> > there are a lot of "useful" checks in this function and the backwards
> > threader should adopt those.  Note the backwards threader originally
> > only did FSM-style threadings, which are exactly those possibly
> > "harmful" ones, forming irreducible regions at worst or sub-loops at
> > best.  That might explain the lack of those checks.
>
> Also, the aforementioned checks are in jump_thread_path_registry, which
> is also shared by the backward threader.  These are thread discards
> _after_ a thread has been registered.

Yeah, that's indeed unfortunate.

> The backward threader should also
> be using these restrictions.  Unless I'm missing some interaction with
> the FSM/etc. threading types as per the preamble to the snippet you
> provided:
>
>        if (((*path)[1]->type == EDGE_COPY_SRC_JOINER_BLOCK && !joiners)
>            || ((*path)[1]->type == EDGE_COPY_SRC_BLOCK && joiners))
>          continue;

Indeed.  But I understand the backwards threader does not (only) do FSM
threading now.

> You are right though, there are a lot of checks throughout the entire
> forward threader that should be audited and somehow shared.  It's on my
> back burner, but I'm running out of cycles here :-/.
Yeah, it's quite a mess indeed, and merging the path
validity/costing/code-transform phases of both threaders would be
incredibly useful.

Richard.

> Thanks.
> Aldy
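Richard's "alternate loop entry" point earlier in the thread can be made concrete with a toy CFG. The sketch below is entirely hypothetical (the block numbering and helper names are mine, not taken from GCC): a natural loop has exactly one entry, its header, and a thread that copies the header and jumps straight into the body adds an edge from outside the loop to a non-header loop block.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy CFG: block 0 is the preheader, block 1 the loop header,
   block 2 the body/latch, block 3 the exit, and block 4 a copy
   of the header created by jump threading.  */
struct edge { int src, dst; };

/* The structural loop, as the loop optimizers see it.  */
static bool
in_loop (int bb)
{
  return bb == 1 || bb == 2;
}

/* Count side entries: edges from outside the loop into a loop block
   other than the header.  A natural (reducible) loop has zero.  */
static int
side_entries (const struct edge *e, int n)
{
  int count = 0;
  for (int i = 0; i < n; i++)
    if (!in_loop (e[i].src) && in_loop (e[i].dst) && e[i].dst != 1)
      count++;
  return count;
}
```

Before threading, the edges {0->1, 1->2, 2->1, 1->3} have no side entries. After a thread 0->1->2 duplicates the header into block 4, the edges {0->4, 4->2, 1->2, 2->1, 1->3} contain one (4->2): that is the alternate entry, which is why threading through the header is only safe when the path leaves through a loop exit.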