From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by sourceware.org (Postfix) with ESMTPS id A81113858C50 for ; Wed, 17 Aug 2022 08:46:28 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org A81113858C50 Received: from mail-oa1-f72.google.com (mail-oa1-f72.google.com [209.85.160.72]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_128_GCM_SHA256) id us-mta-557-fQmFFQvhONai-0JpdsVO_g-1; Wed, 17 Aug 2022 04:46:25 -0400 X-MC-Unique: fQmFFQvhONai-0JpdsVO_g-1 Received: by mail-oa1-f72.google.com with SMTP id 586e51a60fabf-11c1fdd025cso1448676fac.0 for ; Wed, 17 Aug 2022 01:46:24 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc; bh=cgOLwUFvSjNayz6PqeGWW3NWga+fq7FdS5OCkUJDD/Q=; b=lsw8Y8LYH9GrxPSAbB6OHFsBv/Corb4wD8Nr2RbY12IPdS3sBMNk3YkFoBHLZCQohY 18cLEkIaILTZsNMUA+N1b8TRK7wvGiC+bwICOYDspqE/mq9pgn5nt69tL/jEVb930EHM 6UXNReH0K1WXQcvH9s9208V/MJRMNnd1Y0d+ECpXksVEnd3alF5H0W9EW1415Jo4DNpy utQQs9LBIu/MV8OkVVzksfBTXkHfib4qPB3iooZ1Py9JcwYGQG1VsLBy+T18ir0yDIfq xUsUZiUa7zv9FSu0Y8Zb1E2TaeFrH4V/mXswsiEAV87HUQk28kAvLniP+Uan+AA6GFVN UpoA== X-Gm-Message-State: ACgBeo3jMMjpymIVmLCsi+uJ95coOUSCMRbz/AwPWeAy8875JvvX4jwN uxNXx1nZ4yyhec9phYSXWVIn5ygnzGcUkEuewYujQeB1CjFBANiY9B7rR4oeIrzQo2hX1XFlnXq t2EtyuRv5te2iSfIpiw/IY78Agi8YN45nVw== X-Received: by 2002:a05:6870:210b:b0:10b:ed11:4e2d with SMTP id f11-20020a056870210b00b0010bed114e2dmr1208967oae.265.1660725984175; Wed, 17 Aug 2022 01:46:24 -0700 (PDT) X-Google-Smtp-Source: AA6agR4p855qCKBxK87j8AJGfYnxCtBvMMUAnN+GDc7WFY253J9x7sbpFm7tXjoFy/iOZATNZXm90gBQQ5jGAor127A= X-Received: by 2002:a05:6870:210b:b0:10b:ed11:4e2d with SMTP id f11-20020a056870210b00b0010bed114e2dmr1208957oae.265.1660725983896; Wed, 17 Aug 2022 01:46:23 -0700 (PDT) MIME-Version: 1.0 References: <81970.122081610045900133@us-mta-26.us.mimecast.lan> In-Reply-To: From: Aldy Hernandez Date: Wed, 17 Aug 2022 10:46:12 +0200 Message-ID: Subject: Re: [PATCH] Refactor back_threader_profitability To: Richard Biener Cc: gcc-patches X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-5.8 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, RCVD_IN_DNSWL_LOW, SPF_HELO_NONE, SPF_NONE, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 17 Aug 2022 08:46:31 -0000 On Wed, Aug 17, 2022 at 10:38 AM Richard Biener wrote: > > On Wed, 17 Aug 2022, Aldy Hernandez wrote: > > > On Wed, Aug 17, 2022 at 9:54 AM Richard Biener wrote: > > > > > > On Wed, 17 Aug 2022, Aldy Hernandez wrote: > > > > > > > I just have a few high level comments. > > > > > > > > On Tue, Aug 16, 2022 at 4:05 PM Richard Biener wrote: > > > > > > > > > > The following refactors profitable_path_p in the backward threader, > > > > > splitting out parts that can be computed once the exit block is known, > > > > > parts that contiguously update and that can be checked allowing > > > > > for the path to be later identified as FSM with larger limits, > > > > > possibly_profitable_path_p, and final checks done when the whole > > > > > path is known, profitable_path_p. > > > > > > > > I thought we were removing references to FSM, as they were leftovers > > > > from some previous incarnation. For that matter, I don't think I ever > > > > understood what they are, so if we're gonna keep them, could you > > > > comment what makes FSM threads different from other threads? > > > > > > I don't know exactly what 'FSM' stands for but the FSM threader > > > specifically tried to cover > > > > It probably stands for finite state machine then? > > > > > > > > for (;;) > > > { > > > switch (state) > > > { > > > case 1: > > > /* sth */ > > > state = 3; > > > break; > > > ... > > > case 3: > > > ... > > > } > > > } > > > > > > so optimizing state machine transitions. That is, these are all > > > multiway (switch or goto, but goto support has been removed with the > > > ranger rewrite it seems) and the thread path going through the > > > > ISTR going through the computed goto path in the old code, and it > > never got triggered. I just didn't get around to removing the > > references to GIMPLE_GOTO. At least, the old threader not once got a > > thread we didn't get with the new code, even through a full Fedora > > build. I could be wrong though, but that's my recollection. > > When one massages the threader to even consider gotos we eventually > run into find_taken_edge not handling them because range_of_expr > computes the range of 'state' as > > [irange] void * [1, +INF]$4 = void > > rather than &&E. > > A testcase would be > > unsigned g; > void FSM (int start) > { > void *states[] = { &&A, &&B, &&C, &&E }; > void *state = states[start]; > > do { > goto *state; > > A: > g += 1; > state = g & 1 ? &&B : &&E; > continue; > > B: > g += 2; > state = &&C; > continue; > > C: > g += 3; > state = states[g & 3]; > continue; > > E: > break; > } while (1); > } > > where we'd expect to thread the B -> C state transition. I checked > gcc 7 and it doesn't do that - not sure if it broke somewhen > on the way or if it was just never working. I'll file a bugreport, > OTOH &label and goto *p are both GNU extensions, so not sure if it > is worth optimizing. I'll attach the "simple" enabler I have. Yeah, that's what I thought. I seem to remember tracing through the old code and never finding a path that would handle computed gotos. Either way, thanks for distilling a testcase. If you could file a PR and CC me on it, that'd be great. When I contribute prange (irange for pointers), the plan is to have it handle pointer equivalences, and we should be able to get the address from the prange itself. Aldy > > > > loop backedge. This is why FSM has different size limits because > > > it was thought those threadings are especially worthwhile and > > > they tend to be larger. To make discovery cheaper a TODO item > > > would be to skip to the loop backedge when we reach the regular > > > thread limit and only continue the real search there (and > > > pick the "smallest" ways through any diamonds on the way). > > > > Ah, thanks. If you could, add some similar blurb, it'd be great. I'm > > sure we'll all forget it at some time (well, I will :)). > > Sure ;) > > Richard. > > > Aldy > > > > > > > > > In your possibly_profitable_path_p function, could you document a bit > > > > better what's the difference between profitable_path_p and > > > > possibly_profitable_path_p? > > > > > > Sure. > > > > > > > > > > > > > I've removed the back_threader_profitability instance from the > > > > > back_threader class and instead instantiate it once per path > > > > > discovery. I've kept the size compute non-incremental to simplify > > > > > the patch and not worry about unwinding. > > > > > > > > > > There's key changes to previous behavior - namely we apply > > > > > the param_max_jump_thread_duplication_stmts early only when > > > > > we know the path cannot become an FSM one (multiway + thread through > > > > > latch) but make sure to elide the path query when we we didn't > > > > > yet discover that but are over this limit. Similarly the > > > > > speed limit is now used even when we did not yet discover a > > > > > hot BB on the path. Basically the idea is to only stop path > > > > > discovery when we know the path will never become profitable > > > > > but avoid the expensive path range query when we know it's > > > > > currently not. > > > > > > > > > > I've done a few cleanups, merging functions, on the way. > > > > > > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu. > > > > > > > > > > Statistics show an overall slight increase in threading but > > > > > looking at different files theres noise up and down. That's > > > > > somewhat expected since we now are applying the "more correct" > > > > > limits in the end. Unless I made big mistakes of course. > > > > > > > > > > The next thing cost-wise would be to raise the backwards > > > > > threading limit to the limit of DOM so we don't get > > > > > artificial high counts for that. > > > > > > > > The DOM threader has limits? I thought most of those limits were just > > > > due to the fact that it couldn't determine long enough paths? Either > > > > way, I like that we're merging the necessary forward threader bits > > > > here, in preparation for its demise ;-). > > > > > > Both use param_max_jump_thread_duplication_stmts, but the backwards > > > threader applies this limit to non-FSM threads with > > > > > > /* The generic copier used by the backthreader does not re-use an > > > existing threading path to reduce code duplication. So for that > > > case, drastically reduce the number of statements we are allowed > > > to copy. */ > > > if (!(threaded_through_latch && threaded_multiway_branch) > > > && (n_insns * param_fsm_scale_path_stmts > > > >= param_max_jump_thread_duplication_stmts)) > > > > > > and param_fsm_scale_path_stmts happens to be two. I have no idea > > > why we apply this scaling, the scaling is otherwise used in > > > > > > /* We avoid creating irreducible inner loops unless we thread through > > > a multiway branch, in which case we have deemed it worth losing > > > other loop optimizations later. > > > > > > We also consider it worth creating an irreducible inner loop if > > > the number of copied statement is low relative to the length of > > > the path -- in that case there's little the traditional loop > > > optimizer would have done anyway, so an irreducible loop is not > > > so bad. */ > > > if (!threaded_multiway_branch > > > && creates_irreducible_loop > > > && *creates_irreducible_loop > > > && (n_insns * (unsigned) param_fsm_scale_path_stmts > > > > (m_path.length () * > > > (unsigned) param_fsm_scale_path_blocks))) > > > > > > so my idea is to drop the scaling from applying the > > > param_max_jump_thread_duplication_stmts limit which raises the > > > effective default limit from 15 / 2 to 15, just like what DOM > > > applies (where DOM also estimates some optimization on the path, > > > reducing the stmt count). > > > > > > > Looks good. > > > > > > Thanks - I'll adjust the commentary and push. > > > > > > Richard. > > > > > > > > > -- > Richard Biener > SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, > Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman; > HRB 36809 (AG Nuernberg) >