From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr1-x432.google.com (mail-wr1-x432.google.com [IPv6:2a00:1450:4864:20::432]) by sourceware.org (Postfix) with ESMTPS id E9238398EC08 for ; Wed, 9 Jun 2021 16:18:29 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org E9238398EC08 Received: by mail-wr1-x432.google.com with SMTP id q5so26171663wrm.1 for ; Wed, 09 Jun 2021 09:18:29 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:user-agent:in-reply-to:references :mime-version:content-transfer-encoding:subject:to:cc:from :message-id; bh=pG6OtV7zzO46ovhow4zvFXbCe3goU2kV8LmtQeyemoo=; b=oxvTMeauahnM1hR8WAdofrFyQFATWoSq+As7wEcrL6Z0dwreeiHpJt7ByExLWQaTNS 9Y2EHbAcuZZfdFFh7W3nqLhWhLFnxMirq+FKgLDE60QoufaZbC2lESNScJUR7dxiUI66 SCBhoqA0sM+SXV0/Qh0aWu/vXaO43PoMSOHrSKFQRHXxK1vzQTeF2bdO9JNXVL5Zk9EW 9laG3H3CdtCC7zzL/+DLk4hhYEKFwwYBqtO4hLmSWqzdlrbRK/xo17ErahfxN3IY/LSN C6yS+m+guGs7dPIzeBm4FcdCFvdxL4Dh9Phf11ucrTBqFyJ1pZhQ566HHstLrsWW+S3k xAAQ== X-Gm-Message-State: AOAM533FMp31736laQIKyc19FEAryzFD37OYeV1N+KcoI66qHfJrAclv kaVPM3OTS167cxY343qLppM= X-Google-Smtp-Source: ABdhPJwcbAG94VQ1zPIcBr+7OTyf4l4+QMqRNtFyPWlWZ35S1Li6PNT2vaT28yxH8KQh0jWbNTL2Cg== X-Received: by 2002:adf:e110:: with SMTP id t16mr601368wrz.359.1623255508833; Wed, 09 Jun 2021 09:18:28 -0700 (PDT) Received: from [192.168.178.32] (dynamic-095-115-086-022.95.115.pool.telefonica.de. [95.115.86.22]) by smtp.gmail.com with ESMTPSA id 31sm456675wrc.96.2021.06.09.09.18.28 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Wed, 09 Jun 2021 09:18:28 -0700 (PDT) Date: Wed, 09 Jun 2021 18:18:26 +0200 User-Agent: K-9 Mail for Android In-Reply-To: References: <07775b9d-b8eb-48cb-57ef-9cc278d38967@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Subject: Re: replacing the backwards threader and more To: Aldy Hernandez CC: Jeff Law ,GCC Mailing List From: Richard Biener Message-ID: <7A72F0E5-24DF-4DB1-9309-791870E69FC3@gmail.com> X-Spam-Status: No, score=-3.2 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org X-BeenThere: gcc@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 09 Jun 2021 16:18:32 -0000 On June 9, 2021 5:34:03 PM GMT+02:00, Aldy Hernandez w= rote: > > >On 6/9/21 2:09 PM, Richard Biener wrote: >> On Wed, Jun 9, 2021 at 1:50 PM Aldy Hernandez via Gcc > wrote: >>> >>> Hi Jeff=2E Hi folks=2E >>> >>> What started as a foray into severing the old (forward) threader's >>> dependency on evrp, turned into a rewrite of the backwards threader >>> code=2E I'd like to discuss the possibility of replacing the current >>> backwards threader with a new one that gets far more threads and can >>> potentially subsume all threaders in the future=2E >>> >>> I won't include code here, as it will just detract from the high >level >>> discussion=2E But if it helps, I could post what I have, which just >needs >>> some cleanups and porting to the latest trunk changes Andrew has >made=2E >>> >>> Currently the backwards threader works by traversing DEF chains >through >>> PHIs leading to possible paths that start in a constant=2E When such >a >>> path is found, it is checked to see if it is profitable, and if so, >the >>> constant path is threaded=2E The current implementation is rather >limited >>> since backwards paths must end in a constant=2E For example, the >>> backwards threader can't get any of the tests in >>> gcc=2Edg/tree-ssa/ssa-thread-14=2Ec: >>> >>> if (a && b) >>> foo (); >>> if (!b && c) >>> bar (); >>> >>> etc=2E >>> >>> After my refactoring patches to the threading code, it is now >possible >>> to drop in an alternate implementation that shares the profitability >>> code (is this path profitable?), the jump registry, and the actual >jump >>> threading code=2E I have leveraged this to write a ranger-based >threader >>> that gets every single thread the current code gets, plus 90-130% >more=2E >>> >>> Here are the details from the branch, which should be very similar >to >>> trunk=2E I'm presenting the branch numbers because they contain >Andrew's >>> upcoming relational query which significantly juices up the results=2E >>> >>> New threader: >>> ethread:65043 (+3=2E06%) >>> dom:32450 (-13=2E3%) >>> backwards threader:72482 (+89=2E6%) >>> vrp:40532 (-30=2E7%) >>> Total threaded: 210507 (+6=2E70%) >>> >>> This means that the new code gets 89=2E6% more jump threading >>> opportunities than the code I want to replace=2E In doing so, it >reduces >>> the amount of DOM threading opportunities by 13=2E3% and by 30=2E7% fr= om >the >>> VRP jump threader=2E The total improvement across the jump threading >>> opportunities in the compiler is 6=2E70%=2E >>> >>> However, these are pessimistic numbers=2E=2E=2E >>> >>> I have noticed that some of the threading opportunities that DOM and >VRP >>> now get are not because they're smarter, but because they're picking >up >>> opportunities that the new code exposes=2E I experimented with >running an >>> iterative threader, and then seeing what VRP and DOM could actually >get=2E >>> This is too expensive to do in real life, but it at least shows >what >>> the effect of the new code is on DOM/VRP's abilities: >>> >>> Iterative threader: >>> ethread:65043 (+3=2E06%) >>> dom:31170 (-16=2E7%) >>> thread:86717 (+127%) >>> vrp:33851 (-42=2E2%) >>> Total threaded: 216781 (+9=2E90%) >>> >>> This means that the new code not only gets 127% more cases, but it >>> reduces the DOM and VRP opportunities considerably (16=2E7% and 42=2E2= % >>> respectively)=2E The end result is that we have the possibility of >>> getting almost 10% more jump threading opportunities in the entire >>> compilation run=2E >>=20 >> Yeah, DOM once was iterating =2E=2E=2E >>=20 >> You probably have noticed that we have very man (way too many) >> 'thread' passes, often in close succession with each other or >> DOM or VRP=2E So in the above numbers I wonder if you can break >> down the numbers individually for the actual passes (in their order)? > >Sure, I can do that=2E Let me whip up the old branch and gather some >info=2E > >>=20 >>> (Note that the new code gets even more opportunities, but I'm only >>> reporting the profitable ones that made it all the way through to >the >>> threader backend, and actually eliminated a branch=2E) >>> >>> The overall compilation hit from this work is currently 1=2E38% as >>> measured by callgrind=2E We should be able to reduce this a bit, plus >we >>> could get some of that back if we can replace the DOM and VRP >threaders >>> (future work)=2E >>> >>> My proposed implementation should be able to get any threading >>> opportunity, and will get more as range-ops and ranger improve=2E >>> >>> I can go into the details if necessary, but the gist of it is that >we >>> leverage the import facility in the ranger to only look up paths >that >>> have a direct repercussion in the conditional being threaded, thus >>> reducing the search space=2E This enhanced path discovery, plus an >engine >>> to resolve conditionals based on knowledge from a CFG path, is all >that >>> is needed to register new paths=2E There is no limit to how far back >we >>> look, though in practice, we stop looking once a path is too >expensive >>> to continue the search in a given direction=2E >>> >>> The solver API is simple: >>> >>> // This class is a thread path solver=2E Given a set of BBs >indicating >>> // a path through the CFG, range_in_path() will return the range >>> // of an SSA as if the BBs in the path would have been executed in >>> // order=2E >>> // >>> // Note that the blocks are in reverse order, thus the exit block is >>> path[0]=2E >>> >>> class thread_solver : gori_compute >>> { >>> >>> public: >>> thread_solver (gimple_ranger &ranger); >>> virtual ~thread_solver (); >>> void set_path (const vec *, const bitmap_head >*imports); >>> void range_in_path (irange &, tree name); >>> void range_in_path (irange &, gimple *); >>> =2E=2E=2E >>> }; >>> >>> Basically, as we're discovering paths, we ask the solver what the >value >>> of the final conditional in a BB is in a given path=2E If it >resolves, we >>> register the path=2E >>> >>> A follow-up project would be to analyze what DOM/VRP are actually >>> getting that we don't, because in theory with an enhanced ranger, we >>> should be able to get everything they do (minus some float stuff, >and >>> some CSE things DOM does)=2E However, IMO, this is good enough to at >>> least replace the current backwards threading code=2E >>> >>> My suggestion would be to keep both implementations, defaulting to >the >>> ranger based, and running the old code immediately after-- trapping >if >>> it can find any threading opportunities=2E >>=20 >> But due to iteration uncovering new opportunities this will >inevitably >> break, no? > >No, actually because I'm comparing current and new backwards threader=20 >behavior on the same IL, something I can't do with VRP because the IL >is=20 >slightly different (VRP asserts)=2E > >So I can compare apples to apples in the backwards threader code=2E What > >I do is run the new code first, and then run the old code on the same >IL=20 >before the threader has altered the CFG, while asserting that the old=20 >code cannot register any "new" paths not already present in the >registry=2E > >I have tested this assertion throughout 200+ =2Eii files from a >bootstrap,=20 >and there has never been a case where the old code can get something we > >can't=2E Ah, that's nice!=20 >>=20 >>> After a few weeks, we could >>> kill the old code=2E >>=20 >> Note that for analyzing threadings done apart from looking at overall >> numbers the statistics infrastructure can be useful, likewise could >> be the opt-info one where you can diff stats based on file or >function >> (statistics) or even location of a participating jump (opt-info)=2E >>=20 >> If you are re-using tree-ssa-thread{edge,update}=2E{c,h} anyway you >> probably only have to amend one or two places=2E I'm personally >> breaking things down to file/function via statistics to spot gross >> differences more locally=2E > >I am indeed re-using all of tree-ssa-thread{edge,update}=2E That is=20 >unchanged=2E I've been using the overall statistics plus an awk script >to=20 >collate it all=2E But thanks for the opt-info tip=2E I didn't know abou= t >that=2E > >>=20 >> IMHO removing threading from VRP (as a step to make "VRP" >> into another EVRP run) should be part of the initial transition, >> there's always a thread pass nearby=2E Performing threading from >> EVRP itself might be another option to evaluate=2E Trimming down >> the number of (now backwards-)threaders would be another goal=2E > >Agreed=2E I will have to sit down and investigate what VRP is getting=2E= =20 >Last I peeked it was either stuff range-ops could be taught, or=20 >unconditional jumps to unconditional jumps that could be easily >handled=2E=20 >However, I think we can prove that the current backwards threader code=20 >cannot get anything in the presence of the new code, and could be >easily=20 >replaced before concentrating on DOM/VRP=2E > >That being said, figuring out exactly the discrepancies with DOM/VRP is > >on my short-term radar ;-)=2E > >Aldy