From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp-out2.suse.de (smtp-out2.suse.de [195.135.220.29]) by sourceware.org (Postfix) with ESMTPS id 717E83858D1E for ; Wed, 17 Aug 2022 07:53:59 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 717E83858D1E Received: from relay2.suse.de (relay2.suse.de [149.44.160.134]) by smtp-out2.suse.de (Postfix) with ESMTP id 58AE61FAD7; Wed, 17 Aug 2022 07:53:58 +0000 (UTC) Received: from wotan.suse.de (wotan.suse.de [10.160.0.1]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by relay2.suse.de (Postfix) with ESMTPS id 512482C174; Wed, 17 Aug 2022 07:53:58 +0000 (UTC) Date: Wed, 17 Aug 2022 07:53:58 +0000 (UTC) From: Richard Biener To: Aldy Hernandez cc: gcc-patches Subject: Re: [PATCH] Refactor back_threader_profitability In-Reply-To: Message-ID: References: <81970.122081610045900133@us-mta-26.us.mimecast.lan> User-Agent: Alpine 2.22 (LSU 394 2020-01-19) MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII X-Spam-Status: No, score=-5.0 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, SPF_HELO_NONE, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 17 Aug 2022 07:54:00 -0000 On Wed, 17 Aug 2022, Aldy Hernandez wrote: > I just have a few high level comments. > > On Tue, Aug 16, 2022 at 4:05 PM Richard Biener wrote: > > > > The following refactors profitable_path_p in the backward threader, > > splitting out parts that can be computed once the exit block is known, > > parts that contiguously update and that can be checked allowing > > for the path to be later identified as FSM with larger limits, > > possibly_profitable_path_p, and final checks done when the whole > > path is known, profitable_path_p. > > I thought we were removing references to FSM, as they were leftovers > from some previous incarnation. For that matter, I don't think I ever > understood what they are, so if we're gonna keep them, could you > comment what makes FSM threads different from other threads? I don't know exactly what 'FSM' stands for but the FSM threader specifically tried to cover for (;;) { switch (state) { case 1: /* sth */ state = 3; break; ... case 3: ... } } so optimizing state machine transitions. That is, these are all multiway (switch or goto, but goto support has been removed with the ranger rewrite it seems) and the thread path going through the loop backedge. This is why FSM has different size limits because it was thought those threadings are especially worthwhile and they tend to be larger. To make discovery cheaper a TODO item would be to skip to the loop backedge when we reach the regular thread limit and only continue the real search there (and pick the "smallest" ways through any diamonds on the way). > In your possibly_profitable_path_p function, could you document a bit > better what's the difference between profitable_path_p and > possibly_profitable_path_p? Sure. > > > > I've removed the back_threader_profitability instance from the > > back_threader class and instead instantiate it once per path > > discovery. I've kept the size compute non-incremental to simplify > > the patch and not worry about unwinding. > > > > There's key changes to previous behavior - namely we apply > > the param_max_jump_thread_duplication_stmts early only when > > we know the path cannot become an FSM one (multiway + thread through > > latch) but make sure to elide the path query when we we didn't > > yet discover that but are over this limit. Similarly the > > speed limit is now used even when we did not yet discover a > > hot BB on the path. Basically the idea is to only stop path > > discovery when we know the path will never become profitable > > but avoid the expensive path range query when we know it's > > currently not. > > > > I've done a few cleanups, merging functions, on the way. > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu. > > > > Statistics show an overall slight increase in threading but > > looking at different files theres noise up and down. That's > > somewhat expected since we now are applying the "more correct" > > limits in the end. Unless I made big mistakes of course. > > > > The next thing cost-wise would be to raise the backwards > > threading limit to the limit of DOM so we don't get > > artificial high counts for that. > > The DOM threader has limits? I thought most of those limits were just > due to the fact that it couldn't determine long enough paths? Either > way, I like that we're merging the necessary forward threader bits > here, in preparation for its demise ;-). Both use param_max_jump_thread_duplication_stmts, but the backwards threader applies this limit to non-FSM threads with /* The generic copier used by the backthreader does not re-use an existing threading path to reduce code duplication. So for that case, drastically reduce the number of statements we are allowed to copy. */ if (!(threaded_through_latch && threaded_multiway_branch) && (n_insns * param_fsm_scale_path_stmts >= param_max_jump_thread_duplication_stmts)) and param_fsm_scale_path_stmts happens to be two. I have no idea why we apply this scaling, the scaling is otherwise used in /* We avoid creating irreducible inner loops unless we thread through a multiway branch, in which case we have deemed it worth losing other loop optimizations later. We also consider it worth creating an irreducible inner loop if the number of copied statement is low relative to the length of the path -- in that case there's little the traditional loop optimizer would have done anyway, so an irreducible loop is not so bad. */ if (!threaded_multiway_branch && creates_irreducible_loop && *creates_irreducible_loop && (n_insns * (unsigned) param_fsm_scale_path_stmts > (m_path.length () * (unsigned) param_fsm_scale_path_blocks))) so my idea is to drop the scaling from applying the param_max_jump_thread_duplication_stmts limit which raises the effective default limit from 15 / 2 to 15, just like what DOM applies (where DOM also estimates some optimization on the path, reducing the stmt count). > Looks good. Thanks - I'll adjust the commentary and push. Richard.