From: Richard Guenther
To: "William J. Schmidt"
Cc: gcc-patches@gcc.gnu.org, bergner@vnet.ibm.com, dje.gcc@gmail.com,
    steven@gcc.gnu.org, law@redhat.com
Subject: Re: [Design notes, RFC] Address-lowering prototype design (PR46556)
Date: Tue, 07 Jun 2011 10:07:00 -0000
In-Reply-To: <1307383631.4798.11.camel@L3G5336.ibm.com>

On Mon, Jun 6, 2011 at 8:07 PM, William J. Schmidt wrote:
> It's been some time since I last posted about the address lowering issue
> from PR46556 (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46556).  I've had
> a basic prototype in place for some time, but I ran into some issues that
> initially caused performance regressions; I've had to jump through several
> hoops to resolve these.  At this time, I'm now seeing decent performance
> improvements from address lowering for PowerPC (about 2% overall on SPEC
> cpu2000).  A couple of minor degradations still occur that I am looking
> into.
>
> I thought this would be a good time to put out another RFC.  However, the
> code has grown over time, and I doubt whether all you busy people are
> particularly interested in reviewing a 2700-line prototype patch in detail.
> Instead, I've written up some notes on the design, some of the challenges
> I ran into, and my current approach to handling those.
>
> I'd like to first get comments from interested parties on the prototype
> design, and get consensus on whether this is the appropriate overall
> approach.  If so, the next step would be to decide how to go about
> reviewing the code.  Since smaller patches are much easier to review, one
> possibility would be for me to gate the new pass on a flag that's off by
> default, and submit patches incrementally for review.  I.e., start with
> the bare-bones algorithm, and gradually add in the complexities as folks
> have time to review the code.  I'm quite new to the community, so any
> advice on how to handle this would be very welcome.
>
> In any case, here are my notes.  Thanks in advance for your views!
>
> Bill
>
>
> Basic algorithm
> ===============
> A new pass is added late in the gimple phases, currently between
> pass_phi_only_cprop and pass_cd_dce.  I chose this placement because the
> new pass can generate a fair amount of dead code.
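>
> To make the notes concrete, here is a made-up C fragment of the sort of
> access pattern the pass is aimed at.  The struct layout, field names, and
> the function "use" are invented purely for illustration, but the lowered
> gimple for a pattern much like this one appears under "Zero-offset memory
> references" below.
>
>   extern void foo (int, int, int);
>
>   struct x
>   {
>     int a[16];   /* offset   0 within struct x */
>     int b[16];   /* offset  64 */
>     int c[16];   /* offset 128 */
>   };
>
>   void
>   use (struct x *p, int n)
>   {
>     /* All three accesses share the address component p + n*4; address
>        lowering exposes that computation explicitly in gimple so it can
>        be CSEd rather than recomputed for each field reference.  */
>     foo (p->a[n], p->b[n], p->c[n]);
>   }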
>
> Address lowering walks the CFG in predominator order.  Common
> subexpression elimination is performed on operations that could
> participate in addressing expressions.  These operations may have been
> made available in the current block, or in any dominating block.
>
> Any statement that may contain a memory access (excluding calls) is
> considered for address lowering.  The memory access is expanded into an
> affine combination of trees by looking back at feeding statements, using
> logic originally developed for IVopts.  The affine combination is
> subjected to a number of tests to determine whether it's legal and
> profitable to lower the addressing expression into a TARGET_MEM_REF (or
> MEM_REF, when addressing is sufficiently simple).  If so, the conversion
> to a TMR is performed, again using logic originally developed for IVopts.
>
> The resulting memory reference is again subjected to a number of
> conditions testing for legality and profitability.  If these conditions
> are all met, the reference data is copied from the original expression to
> the TMR, and the TMR replaces the original expression in the gimple
> statement.
>
> The conversion of a memory access into a TMR generally creates a number
> of new gimple statements preceding the statement under analysis.  CSE is
> performed on the fly for these new statements as well, using the
> available expressions data discussed earlier.  This produces naturally
> shared addressability for members of the same structure, for example.
>
> Inhibitors
> ==========
> As mentioned, a number of conditions are checked before replacing a
> memory access with a TMR.  Conditions that inhibit replacement include
> the following.
>
>  * There is not yet any handling for bitfields.
>
>  * An expression containing a direct field reference from a small
>    nonvolatile structure is left alone.  It is assumed that these values
>    will be register-assigned, so exposing their addresses is
>    counterproductive.
>
>  * I have seen odd cases in C++ code such as
>
>      lhs = MEM[(struct foo *) 0B].bar;
>
>    This results in an affine combination of trees with no elements, so an
>    explicit check for such empty combinations is required.
>
>  * If the affine combination of trees introduces an ADDR_EXPR for any
>    decl that is not already marked as addressable, we refuse to make the
>    replacement.  Taking the address of something for the first time
>    defeats downstream optimizations that would otherwise be possible.
>
>  * create_mem_ref can occasionally produce a TMR or MEM_REF with a base
>    expression of constant zero.  One way this can happen is with
>    questionable coding practices, such as treating an integer parameter
>    as a pointer.  It's an open question whether create_mem_ref should
>    ever produce such a tree.  In any case, the base expression of zero
>    again confuses the may-aliasing machinery, so I refuse the replacement
>    in this case.
>
>  * If the original expression will be recognized as a "hidden global
>    store" in tree-ssa-sink.c:is_hidden_global_store, but the replacement
>    expression will not, it is possible for the dead code eliminator to
>    remove the modified statement.  It seems to me this shouldn't normally
>    happen in practice.
>    For now I detect this case and refuse to do the replacement, but I
>    suspect a possible front-end or upstream-optimization problem here.
>    The testcase that fails here is
>    libstdc++-v3/testsuite/23_containers/vector/ext_pointer_modifiers/insert.cc.
>    More investigation required.

That indeed sounds odd.

>
> Probably no longer inhibitors
> =============================
> I am currently checking for some cases that I think may no longer be
> necessary, following this change:
>
> ------------------------------------------------------------------------
> r174282 | rguenth | 2011-05-26 08:01:48 -0500 (Thu, 26 May 2011) | 10 lines
>
> 2011-05-26  Richard Guenther
>
>         PR tree-optimization/48702
>         * tree-ssa-address.c (create_mem_ref_raw): Create MEM_REFs
>         only when we know the base address is within bounds.
>         * tree-ssa-alias.c (indirect_ref_may_alias_decl_p): Do not
>         assume the base address of TARGET_MEM_REFs is in bounds.
>
>         * gcc.dg/torture/pr48702.c: New testcase.
>
> ------------------------------------------------------------------------
>
> I am in the process of checking whether removing these checks has any
> untoward consequences:
>
>  * There are several ways in which the affine combination of trees can
>    contain subexpressions with negative coefficients.  This is evil and
>    must be avoided, since it can result in a memory reference whose base
>    address pointer lies outside the object it references.  The
>    may-aliasing machinery cannot currently handle this.  So we refuse to
>    make the replacement when a negative coefficient is detected.
>
>  * Earlier induction variable optimization may also cause us to end up
>    with a base address below the beginning of an object.  (Ivars are
>    often initialized to the base of element -1 outside of a loop, and
>    incremented at the top of the loop.  Expanding into the affine
>    combination of trees collapses this to produce a base expression
>    pointing outside the object.)  I avoid this by checking whether the
>    base expression is an SSA name for an ivar that is directly or
>    transitively defined by a PHI expression.
>
> Zero-offset memory references
> =============================
> Following is a fairly common pattern that results from address lowering
> with CSE:
>
>       :
>         D.2154_8 = n_2(D) * 4;
>         D.2107_3 = MEM[base: p_1(D), index: D.2154_8, offset: 0B];
>         D.2156_10 = p_1(D) + D.2154_8;
>         D.2108_4 = MEM[(struct x *)D.2156_10 + 128B];
>         D.2109_5 = MEM[(struct x *)D.2156_10 + 64B];
>         foo (D.2107_3, D.2108_4, D.2109_5);
>         return;
>
> When the addressing expression consists of two tree addends and a
> non-zero offset, the addends are pre-added and used as the base
> expression, with the offset becoming the immediate operand in the storage
> op.  When the offset is zero, the base and index are added implicitly in
> an indexed-form storage op.
>
> The bias to the indexed form with a zero offset is a good choice, in the
> absence of knowledge of the surrounding code; it saves an instruction and
> a dependency with a possible small increase in register pressure.
> As can be seen here, though, we end up with a missed commoning
> opportunity, since the second instruction could become:
>
>        D.2107_3 = MEM[(struct x *)D.2156_10 + 0B];
>
> Since the add into D.2156_10 is going to occur anyway, the advantage of
> the bias to indexed form vanishes.  Instead, it leads to an increase in
> register pressure.  This is more noticeable as expressions are CSEd
> across blocks.
>
> As the example shows, when the zero-offset memory reference is first
> encountered, the sum of its base and index components may not be a known
> available expression.  To find all opportunities where local CSE is
> possible for these, the zero-offset expressions are recorded during the
> basic algorithm.  At the end of each block, the zero-offset expressions
> are post-processed.  If the opportunity exists, the base is replaced with
> the SSA representative of the sum, and the index is changed to constant
> zero.  If necessary, the definition of the SSA representative of the sum
> is moved ahead of the zero-offset expression.  This is safe since both
> addends are known to be available there.
>
> Challenge with zero-offset memory references
> ============================================
> Unfortunately, when I first implemented this post-processing, I didn't
> see any change in code generation.  Upon investigation, it turned out
> that fwprop.c was undoing this optimization in
> forward_propagate_and_simplify.  To circumvent this, I ended up adding
> some extra code to identify zero-offset memory references that fit the
> pattern of the example (only in RTL).  I constrained fwprop.c from
> propagating into these, since such propagations are locally poor
> replacements.
>
> I recognize this might be a controversial choice, and I'm open to other
> suggestions.  But I found CSEing these to be a useful optimization that
> makes a difference in common situations, so I didn't want to lose it.
>
> Loss of aliasing information
> ============================
> The most serious problem I've run into is degraded performance due to
> poorer instruction scheduling choices.  I tracked this down to
> alias.c:nonoverlapping_component_refs_p.
>
> This code proves that two memory accesses don't overlap by attempting to
> prove that they access different fields of the same structure.  This is
> done using the MEM_EXPRs of the two rtx's, which record the expression
> trees that were translated into the rtx's during expand.  When address
> lowering is not present, a simple COMPONENT_REF will appear in the
> MEM_EXPR: x.a, for example.  However, address lowering changes the simple
> COMPONENT_REF into a [TARGET_]MEM_REF that is no longer necessarily
> identifiable as a field reference.  Thus the aliasing machinery can no
> longer prove that two such field references are disjoint.
>
> This has severe consequences for performance, and has to be dealt with if
> address lowering is to be successful.
>
> I've worked around this with an admittedly fragile solution; I'll discuss
> the drawbacks below.  The idea is to construct a mapping from replacement
> mem_refs to the original expressions that they replaced.  When a MEM_EXPR
> is being set during expand, we first look up the mem_ref in the mapping.
> If present, the MEM_EXPR is set to the original expression, rather than
> to the mem_ref.
> This essentially duplicates the behavior in the absence of address
> lowering.

Ick.  We had this in the past via TMR_ORIGINAL, which caused all sorts of
problems.  Removing it didn't cause much degradation because we now
preserve points-to information.

Originally I played with lowering all memory accesses to MEM_REFs (see the
old mem-ref branch), and the loss of type-based alias disambiguation was
indeed an issue.  But I definitely do not like the idea of preserving
something similar to TMR_ORIGINAL.  Instead, we can try preserving some
information we derive from it.  We keep the original access type, which we
can use for TBAA, but do not retain knowledge of whether the type of the
MEM_REF is valid for TBAA or whether it is view-converted.

> This appears to work well in practice today.  However, it relies on
> there being no optimizations between the address lowering pass and
> expand that modify [TARGET_]MEM_REFs.  Dead store elimination may remove
> some, which is fine, of course.  But if a new pass were to be added that
> modified operands of mem_refs after address lowering, code generation
> would be quietly degraded.  We would need some specific scan tests in
> the test suite to check for this.
>
> Another (less serious) problem is having to introduce a data structure
> that needs to be maintained beyond the lifetime of its associated pass.
> It's a bit of a wart to have this side table that must be kept around
> until expand is finished.  It's workable, but not very clean.
>
> As an alternative, I went looking for ways to carry the original
> expression along with the mem_ref in the remaining gimple phases, but I
> wasn't able to find a good way to do that.  Perhaps others can suggest
> something.
>
> As noted, I recognize this is a fragile solution, and I'm very interested
> in other ideas for keeping the necessary information around until it's
> needed.
>
> Open issues
> ===========
> CSE of addressing expressions across blocks (or within very large blocks)
> can raise register pressure and potentially introduce spill code.  We may
> want some heuristics to constrain the distance across which expressions
> are considered available.
>
> I am investigating small degradations in 177.mesa, 187.facerec, 300.twolf,
> and 164.gzip.
>
> Performance testing on other platforms is needed.
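
For reference, here is a minimal C fragment of the kind that relies on the
field-based disambiguation discussed under "Loss of aliasing information"
above (the structure and function names are invented for illustration).
With plain COMPONENT_REFs in the MEM_EXPRs, nonoverlapping_component_refs_p
can see that the store and the load touch different fields of the same
structure type and therefore cannot overlap, even if p and q point to the
same object; once the accesses have been lowered to [TARGET_]MEM_REFs, that
proof is no longer available and the scheduler has to assume a dependence.

  struct s
  {
    int a;
    int b;
  };

  int
  store_and_load (struct s *p, struct s *q)
  {
    /* p and q may point to the same object, but p->a and q->b are
       distinct fields of struct s, so the two accesses cannot overlap.  */
    p->a = 1;
    return q->b;
  }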