Subject: [Design notes, RFC] Address-lowering prototype design (PR46556)
From: "William J. Schmidt"
To: gcc-patches@gcc.gnu.org
Cc: bergner@vnet.ibm.com, dje.gcc@gmail.com, richard.guenther@gmail.com, steven@gcc.gnu.org, law@redhat.com
Date: Mon, 06 Jun 2011 18:07:00 -0000

It's been some time since I last posted about the address lowering issue from PR46556 (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46556). I've had a basic prototype in place for a while, but I ran into some issues that initially caused performance regressions, and I've had to jump through several hoops to resolve them. I'm now seeing decent performance improvements from address lowering for PowerPC (about 2% overall on SPEC cpu2000). A couple of minor degradations still occur that I am looking into. I thought this would be a good time to put out another RFC.

However, the code has grown over time, and I doubt whether all you busy people are particularly interested in reviewing a 2700-line prototype patch in detail. Instead, I've written up some notes on the design, some of the challenges I ran into, and my current approach to handling those.

I'd like to first get comments from interested parties on the prototype design, and get consensus on whether this is the appropriate overall approach. If so, the next step would be to decide how to go about reviewing the code. Since smaller patches are much easier to review, one possibility would be for me to gate the new pass on a flag that's off by default, and submit patches incrementally for review; i.e., start with the bare-bones algorithm, and gradually add in the complexities as folks have time to review the code. I'm quite new to the community, so any advice on how to handle this would be very welcome.

In any case, here are my notes. Thanks in advance for your views!

Bill

Basic algorithm
===============

A new pass is added late in the gimple phases, currently between pass_phi_only_cprop and pass_cd_dce.
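Roughly speaking, the placement in passes.c would look like this (shown just for orientation; the pass name below is a placeholder, not the prototype's actual name):

  /* In init_optimization_passes, gcc/passes.c.  */
  NEXT_PASS (pass_phi_only_cprop);
  NEXT_PASS (pass_lower_address);   /* placeholder name for the new pass */
  NEXT_PASS (pass_cd_dce);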
I chose this placement because the new pass can generate a fair amount of dead code.

Address lowering walks the CFG in predominator order. Common subexpression elimination is performed on operations that could participate in addressing expressions. These operations may have been made available in the current block, or in any dominating block.

Any statement that may contain a memory access (excluding calls) is considered for address lowering. The memory access is expanded into an affine combination of trees by looking back at feeding statements, using logic originally developed for IVopts. The affine combination is subjected to a number of tests to determine whether it's legal and profitable to lower the addressing expression into a TARGET_MEM_REF (or MEM_REF, when the addressing is sufficiently simple). If so, the conversion to a TMR is performed, again using logic originally developed for IVopts. The resulting memory reference is again subjected to a number of conditions testing for legality and profitability. If these conditions are all met, the reference data is copied from the original expression to the TMR, and the TMR replaces the original expression in the gimple statement.

The conversion of a memory access into a TMR generally creates a number of new gimple statements preceding the statement under analysis. CSE is performed on the fly for these new statements as well, using the available expressions data discussed earlier. This produces naturally shared addressability for members of the same structure, for example.

Inhibitors
==========

As mentioned, a number of conditions are checked before replacing a memory access with a TMR. Conditions that inhibit replacement include the following (a much-simplified sketch of a couple of these checks appears after the list):

* There is not yet any handling for bitfields.

* An expression containing a direct field reference from a small nonvolatile structure is left alone. It is assumed that these values will be register-assigned, so exposing their addresses is counterproductive.

* I have seen odd cases in C++ code such as

    lhs = MEM[(struct foo *) 0B].bar;

  This results in an affine combination of trees with no elements, so an explicit check for such empty combinations is required.

* If the affine combination of trees introduces an ADDR_EXPR for any decl that is not already marked as addressable, we refuse to make the replacement. Taking the address of something for the first time defeats downstream optimizations that would otherwise be possible.

* create_mem_ref can occasionally produce a TMR or MEM_REF with a base expression of constant zero. One way this can happen is with questionable coding practices, such as treating an integer parameter as a pointer. It's an open question whether create_mem_ref should ever produce such a tree. In any case, a base expression of zero again confuses the may-aliasing machinery, so I refuse the replacement in this case.

* If the original expression will be recognized as a "hidden global store" in tree-ssa-sink.c:is_hidden_global_store, but the replacement expression will not be, it is possible for the dead code eliminator to remove the modified statement. It seems to me this shouldn't normally happen in practice. For now I detect this case and refuse to do the replacement, but I suspect a possible front-end or upstream-optimization problem here. The test case that fails here is libstdc++-v3/testsuite/23_containers/vector/ext_pointer_modifiers/insert.cc. More investigation required.
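To give a flavor of what these checks look like, here is a much-simplified sketch of two of them. The function name and structure are illustrative only, not the prototype's actual code, and the real tests are more involved:

  /* Illustrative sketch: return true if lowering REF should be refused.
     ADDR_PART is one element of the affine combination for REF's address.  */
  static bool
  lowering_inhibited_p (tree ref, tree addr_part)
  {
    /* Bitfield references are not handled yet.  */
    if (TREE_CODE (ref) == COMPONENT_REF
        && DECL_BIT_FIELD (TREE_OPERAND (ref, 1)))
      return true;

    /* Don't introduce the first address-taking of a decl.  */
    if (TREE_CODE (addr_part) == ADDR_EXPR
        && DECL_P (TREE_OPERAND (addr_part, 0))
        && !TREE_ADDRESSABLE (TREE_OPERAND (addr_part, 0)))
      return true;

    return false;
  }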
Probably no longer inhibitors
=============================

I am currently checking for some cases that I think may no longer be necessary, following this change:

------------------------------------------------------------------------
r174282 | rguenth | 2011-05-26 08:01:48 -0500 (Thu, 26 May 2011) | 10 lines

2011-05-26  Richard Guenther

	PR tree-optimization/48702
	* tree-ssa-address.c (create_mem_ref_raw): Create MEM_REFs
	only when we know the base address is within bounds.
	* tree-ssa-alias.c (indirect_ref_may_alias_decl_p): Do not
	assume the base address of TARGET_MEM_REFs is in bounds.

	* gcc.dg/torture/pr48702.c: New testcase.
------------------------------------------------------------------------

I am in the process of checking whether removing these checks has any untoward consequences:

* There are several ways in which the affine combination of trees can contain subexpressions with negative coefficients. This is evil and must be avoided, since it can result in a memory reference whose base address pointer lies outside the object it references. The may-aliasing machinery cannot currently handle this, so we refuse to make the replacement when a negative coefficient is detected.

* Earlier induction variable optimization may also cause us to end up with a base address below the beginning of an object. (Ivars are often initialized to the base of element -1 outside of a loop, and incremented at the top of the loop. Expanding into the affine combination of trees collapses this to produce a base expression pointing outside the object.) I avoid this by checking whether the base expression is an SSA name for an ivar that is directly or transitively defined by a PHI expression.

Zero-offset memory references
=============================

Following is a fairly common pattern that results from address lowering with CSE:

  <bb 2>:
  D.2154_8 = n_2(D) * 4;
  D.2107_3 = MEM[base: p_1(D), index: D.2154_8, offset: 0B];
  D.2156_10 = p_1(D) + D.2154_8;
  D.2108_4 = MEM[(struct x *)D.2156_10 + 128B];
  D.2109_5 = MEM[(struct x *)D.2156_10 + 64B];
  foo (D.2107_3, D.2108_4, D.2109_5);
  return;

When the addressing expression consists of two tree addends and a non-zero offset, the addends are pre-added and used as the base expression, with the offset becoming the immediate operand in the storage op. When the offset is zero, the base and index are added implicitly in an indexed-form storage op.

The bias to the indexed form with a zero offset is a good choice in the absence of knowledge of the surrounding code; it saves an instruction and a dependency, with a possible small increase in register pressure. As can be seen here, though, we end up with a missed commoning opportunity, since the second statement could become:

  D.2107_3 = MEM[(struct x *)D.2156_10 + 0B];

Since the add into D.2156_10 is going to occur anyway, the advantage of the bias to indexed form vanishes. Instead, it leads to an increase in register pressure. This is more noticeable as expressions are CSEd across blocks.

As the example shows, when the zero-offset memory reference is first encountered, the sum of its base and index components may not yet be a known available expression. To find all opportunities where local CSE is possible for these, the zero-offset expressions are recorded during the basic algorithm and post-processed at the end of each block. If the opportunity exists, the base is replaced with the SSA representative of the sum, the index is changed to constant zero, and, if necessary, the definition of the SSA representative of the sum is moved ahead of the zero-offset expression. Moving the definition is safe since both addends are known to be available there.
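After post-processing, the example block above then looks roughly like this, with the definition of D.2156_10 moved ahead of the first load and that load rewritten to a zero offset against the sum:

  <bb 2>:
  D.2154_8 = n_2(D) * 4;
  D.2156_10 = p_1(D) + D.2154_8;
  D.2107_3 = MEM[(struct x *)D.2156_10 + 0B];
  D.2108_4 = MEM[(struct x *)D.2156_10 + 128B];
  D.2109_5 = MEM[(struct x *)D.2156_10 + 64B];
  foo (D.2107_3, D.2108_4, D.2109_5);
  return;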
Challenge with zero-offset memory references
============================================

Unfortunately, when I first implemented this post-processing, I didn't see any change in code generation. Upon investigation, it turned out that fwprop.c was undoing the optimization in forward_propagate_and_simplify.

To circumvent this, I ended up adding some extra code at the RTL level to identify zero-offset memory references that fit the pattern of the example, and constrained fwprop.c from propagating into them, since such propagations are locally poor replacements. I recognize this might be a controversial choice, and I'm open to other suggestions. But I found CSEing these references to be a useful optimization that makes a difference in common situations, so I didn't want to lose it.

Loss of aliasing information
============================

The most serious problem I've run into is degraded performance due to poorer instruction scheduling choices. I tracked this down to alias.c:nonoverlapping_component_refs_p. This code proves that two memory accesses don't overlap by attempting to prove that they access different fields of the same structure. This is done using the MEM_EXPRs of the two rtx's, which record the expression trees that were translated into the rtx's during expand.

When address lowering is not present, a simple COMPONENT_REF will appear in the MEM_EXPR: x.a, for example. However, address lowering changes the simple COMPONENT_REF into a [TARGET_]MEM_REF that is no longer necessarily identifiable as a field reference, so the aliasing machinery can no longer prove that two such field references are disjoint. This has severe consequences for performance, and has to be dealt with if address lowering is to be successful.

I've worked around this with an admittedly fragile solution; I'll discuss the drawbacks below. The idea is to construct a mapping from replacement mem_refs to the original expressions they replaced. When a MEM_EXPR is being set during expand, we first look up the mem_ref in the mapping. If it is present, the MEM_EXPR is set to the original expression rather than to the mem_ref. This essentially duplicates the behavior in the absence of address lowering. (A minimal sketch of the side table appears at the end of this section.)

This appears to work well in practice today. However, it relies on there being no optimizations between the address lowering pass and expand that modify [TARGET_]MEM_REFs. Dead store elimination may remove some, which is fine, of course. But if a new pass were added that modified operands of mem_refs after address lowering, code generation would be quietly degraded. We would need some specific scan tests in the test suite to check for this.

Another (less serious) problem is having to introduce a data structure that must be maintained beyond the lifetime of its associated pass. It's a bit of a wart to have this side table that has to be kept around until expand is finished. It's workable, but not very clean. As an alternative, I went looking for ways to carry the original expression along with the mem_ref through the remaining gimple phases, but I wasn't able to find a good way to do that. Perhaps others can suggest something.

As noted, I recognize this is a fragile solution, and I'm very interested in other ideas for keeping the necessary information around until it's needed.
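Concretely, the side table could be as simple as the following sketch, built on the pointer_map interface from pointer-set.h. The function names and the exact hook point during expand are illustrative only, not the prototype's actual code:

  /* Map a replacement [TARGET_]MEM_REF to the original reference tree
     it replaced (illustrative names).  */
  static struct pointer_map_t *lowered_ref_origins;

  /* Called by the lowering pass when REF replaces ORIG in a statement.  */
  void
  record_lowered_ref (tree ref, tree orig)
  {
    if (!lowered_ref_origins)
      lowered_ref_origins = pointer_map_create ();
    *pointer_map_insert (lowered_ref_origins, ref) = orig;
  }

  /* Consulted where expand sets the MEM_EXPR for REF: return the original
     expression if one was recorded, otherwise REF itself.  */
  tree
  origin_for_mem_expr (tree ref)
  {
    void **slot;
    if (lowered_ref_origins
        && (slot = pointer_map_contains (lowered_ref_origins, ref)))
      return (tree) *slot;
    return ref;
  }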
Open issues
===========

CSE of addressing expressions across blocks (or within very large blocks) can raise register pressure and potentially introduce spill code. We may want some heuristics to constrain the distance across which expressions are considered available.

I am investigating small degradations in 177.mesa, 187.facerec, 300.twolf, and 164.gzip.

Performance testing on other platforms is needed.