Subject: Re: More aggressive threading causing loop-interchange-9.c regression
To: Richard Biener
Cc: Michael Matz, Jeff Law, GCC Mailing List, Andrew MacLeod
From: Aldy Hernandez
Message-ID: <8c49db8d-3119-0dc2-2bbb-4062c8d5d53b@redhat.com>
Date: Thu, 9 Sep 2021 11:21:13 +0200

On 9/9/21 10:58 AM, Richard Biener wrote:
> On Thu, Sep 9, 2021 at 10:36 AM Aldy Hernandez wrote:
>>
>> On 9/9/21 9:45 AM, Richard Biener wrote:
>>> On Thu, Sep 9, 2021 at 9:37 AM Aldy Hernandez wrote:
>>>>
>>>> On 9/9/21 8:57 AM, Richard Biener wrote:
>>>>> On Wed, Sep 8, 2021 at 8:13 PM Michael Matz wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> [lame answer to self]
>>>>>>
>>>>>> On Wed, 8 Sep 2021, Michael Matz wrote:
>>>>>>
>>>>>>>>> The forward threader guards against this by simply disallowing
>>>>>>>>> threadings that involve different loops.
>>>>>>>>> As I see
>>>>>>>>
>>>>>>>> The thread in question (5->9->3) is all within the same outer loop,
>>>>>>>> though. BTW, the backward threader also disallows threading across
>>>>>>>> different loops (see path_crosses_loops variable).
>>>>>> ...
>>>>>>> Maybe it's possible to not disable threading over latches alltogether in
>>>>>>> the backward threader (like it's tried now), but I haven't looked at the
>>>>>>> specific situation here in depth, so take my view only as opinion from a
>>>>>>> large distance :-)
>>>>>>
>>>>>> I've now looked at the concrete situation. So yeah, the whole path is in
>>>>>> the same loop, crosses the latch, _and there's code following the latch
>>>>>> on that path_. (I.e. the latch isn't the last block in the path). In
>>>>>> particular, after loop_optimizer_init() (before any threading) we have:
>>>>>>
>>>>>>    [local count: 118111600]:
>>>>>>    # j_19 = PHI
>>>>>>    sum_11 = c[j_19];
>>>>>>    if (n_10(D) > 0)
>>>>>>      goto ; [89.00%]
>>>>>>    else
>>>>>>      goto ; [11.00%]
>>>>>>
>>>>>>    [local count: 105119324]:
>>>>>>    ...
>>>>>>
>>>>>>    [local count: 118111600]:
>>>>>>    # sum_21 = PHI
>>>>>>    c[j_19] = sum_21;
>>>>>>    j_13 = j_19 + 1;
>>>>>>    if (n_10(D) > j_13)
>>>>>>      goto ; [89.00%]
>>>>>>    else
>>>>>>      goto ; [11.00%]
>>>>>>
>>>>>>    [local count: 105119324]:
>>>>>>    goto ; [100.00%]
>>>>>>
>>>>>> With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the
>>>>>> pre-header of inner loop, but more importantly something that's not at the
>>>>>> start of the outer loop.
>>>>>>
>>>>>> Now, any thread that includes the backedge 9->3 _including_ its
>>>>>> destination (i.e. where the backedge isn't the last to-be-redirected edge)
>>>>>> necessarily duplicates all code from that destination onto the back edge.
>>>>>> Here it's the load from c[j] into sum_11.
>>>>>>
>>>>>> The important part is the code is emitted onto the back edge,
>>>>>> conceptually; in reality it's simply included into the (new) latch block
>>>>>> (the duplicate of bb9, which is bb12 intermediately, then named bb7 after
>>>>>> cfg_cleanup).
>>>>>>
>>>>>> That's what we can't have for some of our structural loop optimizers:
>>>>>> there must be no code executed after the exit test (e.g. in the latch
>>>>>> block). (This requirement makes reasoning about which code is or isn't
>>>>>> executed completely for an iteration trivial; simply everything in the
>>>>>> body is always executed; e.g. loop interchange uses this to check that
>>>>>> there are no memory references after the exit test, because those would
>>>>>> then be only conditional and hence make loop interchange very awkward).
>>>>>>
>>>>>> Note that this situation can't be later rectified anymore: the duplicated
>>>>>> instructions (because they are memory refs) must remain after the exit
>>>>>> test. Only by rerolling/unrotating the loop (i.e. noticing that the
>>>>>> memory refs on the loop-entry path and on the back edge are equivalent)
>>>>>> would that be possible, but that's something we aren't capable of. Even
>>>>>> if we were that would simply just revert the whole work that the threader
>>>>>> did, so it's better to not even do that to start with.
>>>>>>
>>>>>> I believe something like below would be appropriate, it disables threading
>>>>>> if the path contains a latch at the non-last position (due to being
>>>>>> backwards on the non-first position in the array). I.e. it disables
>>>>>> rotating the loop if there's danger of polluting the back edge. It might
>>>>>> be improved if the blocks following (preceding!) the latch are themself
>>>>>> empty because then no code is duplicated. It might also be improved if
>>>>>> the latch is already non-empty.
>>>>>> That code should probably only be active
>>>>>> before the loop optimizers, but currently the backward threader isn't
>>>>>> differentiating between before/after loop-optims.
>>>>>>
>>>>>> I haven't tested this patch at all, except that it fixes the testcase :)
>>>>>
>>>>> Lame comment at the current end of the thread - it's not threading through the
>>>>
>>>> I don't know why y'all keep using the word "lame". On the contrary,
>>>> these are incredibly useful explanations. Thanks.
>>>>
>>>>> latch but threading through the loop header that's problematic, at least if the
>>>>> end of the threading path ends within the loop (threading through the header
>>>>> to the loop exit is fine). Because in that situation you effectively created an
>>>>> alternate loop entry. Threading through the latch into the loop header is
>>>>> fine but with simple latches that likely will never happen (if there are no
>>>>> simple latches then the latch can contain the loop exit test).
>>>>>
>>>>> See tree-ssa-threadupdate.c:thread_block_1
>>>>>
>>>>>       e2 = path->last ()->e;
>>>>>       if (!e2 || noloop_only)
>>>>>         {
>>>>>           /* If NOLOOP_ONLY is true, we only allow threading through the
>>>>>              header of a loop to exit edges. */
>>>>>
>>>>>           /* One case occurs when there was loop header buried in a jump
>>>>>              threading path that crosses loop boundaries. We do not try
>>>>>              and thread this elsewhere, so just cancel the jump threading
>>>>>              request by clearing the AUX field now. */
>>>>>           if (bb->loop_father != e2->src->loop_father
>>>>>               && (!loop_exit_edge_p (e2->src->loop_father, e2)
>>>>>                   || flow_loop_nested_p (bb->loop_father,
>>>>>                                          e2->dest->loop_father)))
>>>>>             {
>>>>>               /* Since this case is not handled by our special code
>>>>>                  to thread through a loop header, we must explicitly
>>>>>                  cancel the threading request here.
>>>>>                  */
>>>>>               delete_jump_thread_path (path);
>>>>>               e->aux = NULL;
>>>>>               continue;
>>>>>             }
>>>>
>>>> But this is for a threading path that crosses loop boundaries, which is
>>>> not the case. Perhaps we should restrict this further to threads within
>>>> a loop?
>>>>
>>>>>
>>>>> there are a lot of "useful" checks in this function and the backwards threader
>>>>> should adopt those. Note the backwards threader originally only did
>>>>> FSM style threadings which are exactly those possibly "harmful" ones, forming
>>>>> irreducible regions at worst or sub-loops at best. That might explain the
>>>>> lack of those checks.
>>>>
>>>> Also, the aforementioned checks are in jump_thread_path_registry, which
>>>> is also shared by the backward threader. These are thread discards
>>>> _after_ a thread has been registered.
>>>
>>> Yeah, that's indeed unfortunate.
>>>
>>>> The backward threader should also
>>>> be using these restrictions. Unless, I'm missing some interaction with
>>>> the FSM/etc threading types as per the preamble to the snippet you provided:
>>>>
>>>>       if (((*path)[1]->type == EDGE_COPY_SRC_JOINER_BLOCK && !joiners)
>>>>           || ((*path)[1]->type == EDGE_COPY_SRC_BLOCK && joiners))
>>>>         continue;
>>>
>>> Indeed. But I understand the backwards threader does not (only) do FSM
>>> threading now.
>>
>> If it does, it was not part of my rewrite. I was careful to not touch
>> anything dealing with either path profitability or low-level path
>> registering.
>>
>> The path registering is in back_threader_registry::register_path(). We
>> only use EDGE_FSM_THREADs and then a final EDGE_NO_COPY. ISTM that
>> those are only EDGE_FSM_THREADs??
>
> Well, if the backwards threader classifies everything as FSM that's probably
> inaccurate since only threads through the loop latch are "FSM". There is
> the comment
>
>       /* If this path does not thread through the loop latch, then we are
>          using the FSM threader to find old style jump threads.
>          This is good, except the FSM threader does not re-use an existing
>          threading path to reduce code duplication.
>
>          So for that case, drastically reduce the number of statements
>          we are allowed to copy. */

*blink*

Woah. The backward threader has been using FSM threads indiscriminately
as far as I can remember. I wonder what would break if we "fixed it".

>
> so these cases should use the "old style" validity/costing metrics and thus
> classify threading opportunities in a different way?

Jeff, do you have any insight here?

>
> I think today "backwards" vs, "forwards" only refers to the way we find
> threading opportunities.

Yes, it's a mess.

I ran some experiments a while back, and my current work on the enhanced
solver/threader can fold virtually everything the DOM/threader gets
(even with its use of const_and_copies, avail_exprs, and
evrp_range_analyzer), while getting 5% more DOM threads and 1% more
overall threads. That is, I've been testing whether the path solver can
solve everything the DOM threader needs (the hybrid approach I mentioned).

Unfortunately, replacing the forward threader right now is not feasible
for a few reasons:

a) The const_and_copies/avail_exprs relation framework can do floats,
and that's next year's ranger work.

b) Even though we can seemingly fold everything DOM/threader does, in
order to replace it with a backward threader instance we'd have to merge
the cost/profitability code scattered throughout the forward threader,
as well as the EDGE_FSM* / EDGE_COPY* business.

c) DOM changes the IL as it goes. Though we could conceivably divorce
the two and do the threading after DOM is done.

But I digress...
Aldy
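
For readers following the thread, the rule Michael describes (reject a path whose latch is not the last block, keeping in mind that the backward threader stores the path in reverse) can be sketched as a small standalone model. This is only an illustration under stated assumptions: `Block`, `Path`, `latch_not_last`, `may_pollute_back_edge` and `example_thread` are hypothetical names, not GCC types or functions, and the statement counts are made up rather than taken from the dump.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a CFG block; this is NOT GCC's basic_block.
struct Block {
  int index;        // block number, e.g. 9 for bb9
  bool is_latch;    // true if the block is the latch of its loop
  int stmt_count;   // statements that would be copied with the block
};

// Assumed layout, following the description in the thread: the path is
// stored backwards, so path[0] is the last block of the thread and
// path.back() is the first.
using Path = std::vector<Block>;

// True when a latch sits anywhere except at the end of the thread,
// i.e. at any array position other than 0.
bool latch_not_last(const Path &path) {
  for (std::size_t i = 1; i < path.size(); ++i)
    if (path[i].is_latch)
      return true;
  return false;
}

// Possible refinement mentioned in the thread: only object when some
// block that follows the latch in execution order (lower array index)
// actually carries statements that would land on the back edge.
bool may_pollute_back_edge(const Path &path) {
  for (std::size_t i = 1; i < path.size(); ++i) {
    if (!path[i].is_latch)
      continue;
    for (std::size_t j = 0; j < i; ++j)
      if (path[j].stmt_count > 0)
        return true;
  }
  return false;
}

// The thread 5->9->3 from the discussion, stored backwards. Statement
// counts are illustrative only.
inline Path example_thread() {
  return {{3, false, 1}, {9, true, 0}, {5, false, 1}};
}
```

With this model, `latch_not_last(example_thread())` is true because bb9 sits at array position 1, while a path that ends at the latch (bb9 at position 0) passes. The second predicate only approximates the "blocks after the latch are empty" idea; whether a real implementation would want it, or where it would live, is left open in the thread.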