Subject: [Design notes, RFC] Address-lowering prototype design (PR46556)
From: "William J. Schmidt"
To: gcc-patches@gcc.gnu.org
Cc: bergner@vnet.ibm.com, dje.gcc@gmail.com, richard.guenther@gmail.com, steven@gcc.gnu.org, law@redhat.com
Date: Mon, 06 Jun 2011 18:07:00 -0000

It's been some time since I last posted about the address lowering issue from PR46556 (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46556). I've had a basic prototype in place for a while, but I ran into some issues that initially caused performance regressions, and I've had to jump through several hoops to resolve them. I'm now seeing decent performance improvements from address lowering for PowerPC (about 2% overall on SPEC cpu2000). A couple of minor degradations still occur that I am looking into. I thought this would be a good time to put out another RFC.

However, the code has grown over time, and I doubt whether all you busy people are particularly interested in reviewing a 2700-line prototype patch in detail. Instead, I've written up some notes on the design, some of the challenges I ran into, and my current approach to handling those.

I'd like to first get comments from interested parties on the prototype design, and get consensus on whether this is the appropriate overall approach. If so, the next step would be to decide how to go about reviewing the code. Since smaller patches are much easier to review, one possibility would be for me to gate the new pass on a flag that's off by default, and submit patches incrementally for review; i.e., start with the bare-bones algorithm, and gradually add in the complexities as folks have time to review the code. I'm quite new to the community, so any advice on how to handle this would be very welcome.

In any case, here are my notes. Thanks in advance for your views!

Bill

Basic algorithm
===============

A new pass is added late in the gimple phases, currently between pass_phi_only_cprop and pass_cd_dce.
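Roughly speaking, the placement in passes.c would look like this (shown just for orientation; the pass name below is a placeholder, not the prototype's actual name):

  /* In init_optimization_passes, gcc/passes.c.  */
  NEXT_PASS (pass_phi_only_cprop);
  NEXT_PASS (pass_lower_address);   /* placeholder name for the new pass */
  NEXT_PASS (pass_cd_dce);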
I chose this placement because the new pass can generate a fair amount of dead code.

Address lowering walks the CFG in predominator order. Common subexpression elimination is performed on operations that could participate in addressing expressions. These operations may have been made available in the current block, or in any dominating block.

Any statement that may contain a memory access (excluding calls) is considered for address lowering. The memory access is expanded into an affine combination of trees by looking back at feeding statements, using logic originally developed for IVopts. The affine combination is subjected to a number of tests to determine whether it's legal and profitable to lower the addressing expression into a TARGET_MEM_REF (or MEM_REF, when the addressing is sufficiently simple). If so, the conversion to a TMR is performed, again using logic originally developed for IVopts. The resulting memory reference is again subjected to a number of conditions testing for legality and profitability. If these conditions are all met, the reference data is copied from the original expression to the TMR, and the TMR replaces the original expression in the gimple statement.

The conversion of a memory access into a TMR generally creates a number of new gimple statements preceding the statement under analysis. CSE is performed on the fly for these new statements as well, using the available expressions data discussed earlier. This produces naturally shared addressability for members of the same structure, for example.

Inhibitors
==========

As mentioned, a number of conditions are checked before replacing a memory access with a TMR. Conditions that inhibit replacement include the following (a much-simplified sketch of a couple of these checks appears after the list):

* There is not yet any handling for bitfields.

* An expression containing a direct field reference from a small nonvolatile structure is left alone. It is assumed that these values will be register-assigned, so exposing their addresses is counterproductive.

* I have seen odd cases in C++ code such as

    lhs = MEM[(struct foo *) 0B].bar;

  This results in an affine combination of trees with no elements, so an explicit check for such empty combinations is required.

* If the affine combination of trees introduces an ADDR_EXPR for any decl that is not already marked as addressable, we refuse to make the replacement. Taking the address of something for the first time defeats downstream optimizations that would otherwise be possible.

* create_mem_ref can occasionally produce a TMR or MEM_REF with a base expression of constant zero. One way this can happen is with questionable coding practices, such as treating an integer parameter as a pointer. It's an open question whether create_mem_ref should ever produce such a tree. In any case, a base expression of zero again confuses the may-aliasing machinery, so I refuse the replacement in this case.

* If the original expression will be recognized as a "hidden global store" in tree-ssa-sink.c:is_hidden_global_store, but the replacement expression will not be, it is possible for the dead code eliminator to remove the modified statement. It seems to me this shouldn't normally happen in practice. For now I detect this case and refuse to do the replacement, but I suspect a possible front-end or upstream-optimization problem here. The test case that fails here is libstdc++-v3/testsuite/23_containers/vector/ext_pointer_modifiers/insert.cc. More investigation required.
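To give a flavor of what these checks look like, here is a much-simplified sketch of two of them. The function name and structure are illustrative only, not the prototype's actual code, and the real tests are more involved:

  /* Illustrative sketch: return true if lowering REF should be refused.
     ADDR_PART is one element of the affine combination for REF's address.  */
  static bool
  lowering_inhibited_p (tree ref, tree addr_part)
  {
    /* Bitfield references are not handled yet.  */
    if (TREE_CODE (ref) == COMPONENT_REF
        && DECL_BIT_FIELD (TREE_OPERAND (ref, 1)))
      return true;

    /* Don't introduce the first address-taking of a decl.  */
    if (TREE_CODE (addr_part) == ADDR_EXPR
        && DECL_P (TREE_OPERAND (addr_part, 0))
        && !TREE_ADDRESSABLE (TREE_OPERAND (addr_part, 0)))
      return true;

    return false;
  }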
Probably no longer inhibitors
=============================

I am currently checking for some cases that I think may no longer be necessary, following this change:

------------------------------------------------------------------------
r174282 | rguenth | 2011-05-26 08:01:48 -0500 (Thu, 26 May 2011) | 10 lines

2011-05-26  Richard Guenther

	PR tree-optimization/48702
	* tree-ssa-address.c (create_mem_ref_raw): Create MEM_REFs
	only when we know the base address is within bounds.
	* tree-ssa-alias.c (indirect_ref_may_alias_decl_p): Do not
	assume the base address of TARGET_MEM_REFs is in bounds.

	* gcc.dg/torture/pr48702.c: New testcase.
------------------------------------------------------------------------

I am in the process of checking whether removing these checks has any untoward consequences:

* There are several ways in which the affine combination of trees can contain subexpressions with negative coefficients. This is evil and must be avoided, since it can result in a memory reference whose base address pointer lies outside the object it references. The may-aliasing machinery cannot currently handle this, so we refuse to make the replacement when a negative coefficient is detected.

* Earlier induction variable optimization may also cause us to end up with a base address below the beginning of an object. (Ivars are often initialized to the base of element -1 outside of a loop, and incremented at the top of the loop. Expanding into the affine combination of trees collapses this to produce a base expression pointing outside the object.) I avoid this by checking whether the base expression is an SSA name for an ivar that is directly or transitively defined by a PHI expression.

Zero-offset memory references
=============================

Following is a fairly common pattern that results from address lowering with CSE:

  <bb 2>:
  D.2154_8 = n_2(D) * 4;
  D.2107_3 = MEM[base: p_1(D), index: D.2154_8, offset: 0B];
  D.2156_10 = p_1(D) + D.2154_8;
  D.2108_4 = MEM[(struct x *)D.2156_10 + 128B];
  D.2109_5 = MEM[(struct x *)D.2156_10 + 64B];
  foo (D.2107_3, D.2108_4, D.2109_5);
  return;

When the addressing expression consists of two tree addends and a non-zero offset, the addends are pre-added and used as the base expression, with the offset becoming the immediate operand in the storage op. When the offset is zero, the base and index are added implicitly in an indexed-form storage op.

The bias to the indexed form with a zero offset is a good choice in the absence of knowledge of the surrounding code; it saves an instruction and a dependency, with a possible small increase in register pressure. As can be seen here, though, we end up with a missed commoning opportunity, since the second statement could become:

  D.2107_3 = MEM[(struct x *)D.2156_10 + 0B];

Since the add into D.2156_10 is going to occur anyway, the advantage of the bias to indexed form vanishes. Instead, it leads to an increase in register pressure. This is more noticeable as expressions are CSEd across blocks.

As the example shows, when the zero-offset memory reference is first encountered, the sum of its base and index components may not yet be a known available expression. To find all opportunities where local CSE is possible for these, the zero-offset expressions are recorded during the basic algorithm and post-processed at the end of each block. If the opportunity exists, the base is replaced with the SSA representative of the sum, the index is changed to constant zero, and, if necessary, the definition of the SSA representative of the sum is moved ahead of the zero-offset expression. Moving the definition is safe since both addends are known to be available there.
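After post-processing, the example block above then looks roughly like this, with the definition of D.2156_10 moved ahead of the first load and that load rewritten to a zero offset against the sum:

  <bb 2>:
  D.2154_8 = n_2(D) * 4;
  D.2156_10 = p_1(D) + D.2154_8;
  D.2107_3 = MEM[(struct x *)D.2156_10 + 0B];
  D.2108_4 = MEM[(struct x *)D.2156_10 + 128B];
  D.2109_5 = MEM[(struct x *)D.2156_10 + 64B];
  foo (D.2107_3, D.2108_4, D.2109_5);
  return;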
Challenge with zero-offset memory references
============================================

Unfortunately, when I first implemented this post-processing, I didn't see any change in code generation. Upon investigation, it turned out that fwprop.c was undoing the optimization in forward_propagate_and_simplify.

To circumvent this, I ended up adding some extra code at the RTL level to identify zero-offset memory references that fit the pattern of the example, and constrained fwprop.c from propagating into them, since such propagations are locally poor replacements. I recognize this might be a controversial choice, and I'm open to other suggestions. But I found CSEing these references to be a useful optimization that makes a difference in common situations, so I didn't want to lose it.

Loss of aliasing information
============================

The most serious problem I've run into is degraded performance due to poorer instruction scheduling choices. I tracked this down to alias.c:nonoverlapping_component_refs_p. This code proves that two memory accesses don't overlap by attempting to prove that they access different fields of the same structure. This is done using the MEM_EXPRs of the two rtx's, which record the expression trees that were translated into the rtx's during expand.

When address lowering is not present, a simple COMPONENT_REF will appear in the MEM_EXPR: x.a, for example. However, address lowering changes the simple COMPONENT_REF into a [TARGET_]MEM_REF that is no longer necessarily identifiable as a field reference, so the aliasing machinery can no longer prove that two such field references are disjoint. This has severe consequences for performance, and has to be dealt with if address lowering is to be successful.

I've worked around this with an admittedly fragile solution; I'll discuss the drawbacks below. The idea is to construct a mapping from replacement mem_refs to the original expressions they replaced. When a MEM_EXPR is being set during expand, we first look up the mem_ref in the mapping. If it is present, the MEM_EXPR is set to the original expression rather than to the mem_ref. This essentially duplicates the behavior in the absence of address lowering. (A minimal sketch of the side table appears at the end of this section.)

This appears to work well in practice today. However, it relies on there being no optimizations between the address lowering pass and expand that modify [TARGET_]MEM_REFs. Dead store elimination may remove some, which is fine, of course. But if a new pass were added that modified operands of mem_refs after address lowering, code generation would be quietly degraded. We would need some specific scan tests in the test suite to check for this.

Another (less serious) problem is having to introduce a data structure that must be maintained beyond the lifetime of its associated pass. It's a bit of a wart to have this side table that has to be kept around until expand is finished. It's workable, but not very clean. As an alternative, I went looking for ways to carry the original expression along with the mem_ref through the remaining gimple phases, but I wasn't able to find a good way to do that. Perhaps others can suggest something.

As noted, I recognize this is a fragile solution, and I'm very interested in other ideas for keeping the necessary information around until it's needed.
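Concretely, the side table could be as simple as the following sketch, built on the pointer_map interface from pointer-set.h. The function names and the exact hook point during expand are illustrative only, not the prototype's actual code:

  /* Map a replacement [TARGET_]MEM_REF to the original reference tree
     it replaced (illustrative names).  */
  static struct pointer_map_t *lowered_ref_origins;

  /* Called by the lowering pass when REF replaces ORIG in a statement.  */
  void
  record_lowered_ref (tree ref, tree orig)
  {
    if (!lowered_ref_origins)
      lowered_ref_origins = pointer_map_create ();
    *pointer_map_insert (lowered_ref_origins, ref) = orig;
  }

  /* Consulted where expand sets the MEM_EXPR for REF: return the original
     expression if one was recorded, otherwise REF itself.  */
  tree
  origin_for_mem_expr (tree ref)
  {
    void **slot;
    if (lowered_ref_origins
        && (slot = pointer_map_contains (lowered_ref_origins, ref)))
      return (tree) *slot;
    return ref;
  }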
Open issues
===========

CSE of addressing expressions across blocks (or within very large blocks) can raise register pressure and potentially introduce spill code. We may want some heuristics to constrain the distance across which expressions are considered available.

I am investigating small degradations in 177.mesa, 187.facerec, 300.twolf, and 164.gzip.

Performance testing on other platforms is needed.