Date: Wed, 20 Jul 2016 15:06:00 -0000
From: Segher Boessenkool
To: Bernd Schmidt
Cc: gcc-patches@gcc.gnu.org, dje.gcc@gmail.com
Subject: Re: [PATCH 8/9] shrink-wrap: shrink-wrapping for separate concerns
Message-ID: <20160720150631.GA13687@gate.crashing.org>
In-Reply-To: <51e07d9c-8b54-43a7-19db-0f13962407c4@redhat.com>

On Wed, Jul 20, 2016 at 01:23:44PM +0200, Bernd Schmidt wrote:
> >>>But you need the profile to make even reasonably good
> >>>decisions.
> >>
> >>I'm not worried about making cost decisions: as far as I'm concerned
> >>it's perfectly fine for that.  I'm worried about correctness - you can't
> >>validly save registers inside a loop.
> >
> >Of course you can.  It needs to be paired with a restore; and we do
> >that just fine.

Pretty much *all* implementations in the literature do this, fwiw.

> I, however, fail to see where this happens.

See for example one of the better papers on shrink-wrapping, "Post
Register Allocation Spill Code Optimization", by Lupo and Wilken.
See the problem definition (section 2); figure 1, which clearly shows
multiple saves/restores, executed more than once; and section 4.2 for
why we don't need to look at loops.

[ In this paper prologue/epilogue pairs are only placed around SESE
regions, of which we do not have many in GCC that late in RTL (often
the CFG isn't even reducible); there is no reason to restrict to SESE
regions though. ]

> If you have references to
> somewhere where this algorithm is described, that would be helpful,

No, of course not, because I just made this up, as should be clear.

The problem definition is simple: we have a CFG, and some of the blocks
in that CFG need some things done by the prologue before they execute.
We don't want to run that prologue code more often than necessary,
because it can be expensive (compared to the parts of the function that
are executed at all).

Considering all possible combinations of blocks (or edges) where we can
place a prologue is not computationally feasible.  But there is a
shortcut: if a block X gets a prologue, all blocks dominated by it will
for zero cost have that prologue established as well (by simply not
doing an epilogue until they are reached).  So this algo does the
obvious thing, simply walking the dom tree (which is O(n)).
Then, from the prologue placement, we compute which blocks will execute
with that concern "active"; and we insert prologue/epilogue code to make
that assignment true (a prologue or epilogue for every edge that crosses
from "does not have" to "does have", or the other way around; and then
there is the head/tail thing because cross-jumping fails to unify many
of those *logues, so we take care of it manually).

> because at this stage I think I really don't understand what you're
> trying to achieve.

The submission lacks examples.  It says what it does right at the start
of the head comment:

"""
Instead of putting all of the prologue and epilogue in one spot, we
can put parts of it in places that are executed less frequently.  The
following code does this, for concerns that can have more than one
prologue and epilogue, and where those pro- and epilogues can be
executed more than once.
"""

followed by a bunch of detail.

> So I could see things could work if you place an epilogue part in the
> last block of a loop if the start of the loop contains a corresponding
> part of the prologue, but taking just the comment in the code:
>
>   Prologue concerns are placed in such a way that they are executed as
>   infrequently as possible.  Epilogue concerns are put everywhere where
>   there is an edge from a bb dominated by such a prologue concern to a
>   bb not dominated by one.
>
> this describes no mechanism by which such a thing would happen.

Sure it does.  The edge leaving the loop, for example.

You can have a prologue/epilogue pair within a loop (which is unusual,
but *can* happen, e.g. as part of a conditional that executes almost
never -- this is quite frequent btw: assertions, on-the-run
initialisation, etc.)

The situation you describe has all the blocks in the loop needing the
prologue (but, say, nothing outside the loop).  Then of course the
prologue is placed on the entry edge into the loop, and the epilogue
on the exit edge(s).
> And I
> fail to see how moving parts of the prologue into a loop would be
> beneficial as an optimization.

  for (i = 0; i < 10; i++)
    if (less_than_one_in_ten_times)
      do_something_that_needs_a_prologue;

or

  for (i = 0; i < 10; i++)
    if (whatever)
      do_something_that_needs_a_prologue_and_does_not_return;

or whatever other situation.  Often we do not even have natural loops.

The algorithm places prologues so that their dynamic execution frequency
is minimal, whatever the CFG looks like.


Segher