Subject: Re: [PATCH] Address lowering [1/3] Main patch
From: "William J. Schmidt"
To: Richard Guenther
Cc: gcc-patches@gcc.gnu.org, bergner@vnet.ibm.com
Date: Wed, 06 Jul 2011 14:37:00 -0000
Message-ID: <1309962495.3758.29.camel@oc2474580526.ibm.com>
References: <1309444782.26980.52.camel@oc2474580526.ibm.com> <1309874343.5402.34.camel@oc2474580526.ibm.com>
List-Id: gcc-patches
X-SW-Source: 2011-07/txt/msg00348.txt.bz2

On Wed, 2011-07-06 at 15:16 +0200, Richard Guenther wrote:
> On Tue, Jul 5, 2011 at 3:59 PM, William J.
> Schmidt wrote:
> > (Sorry for the late response; yesterday was a holiday here.)
> >
> > On Mon, 2011-07-04 at 16:21 +0200, Richard Guenther wrote:
> >> On Thu, Jun 30, 2011 at 4:39 PM, William J. Schmidt wrote:
> >> > This is the first of three patches related to lowering addressing
> >> > expressions to MEM_REFs and TARGET_MEM_REFs in late gimple. This patch
> >> > contains the new pass together with supporting changes in existing
> >> > modules. The second patch contains an independent change to the RTL
> >> > forward propagator to keep it from undoing an optimization made in the
> >> > first patch. The third patch contains new test cases and changes to
> >> > existing test cases.
> >> >
> >> > Although I've broken it up into three patches to make the review easier,
> >> > it would be best to commit at least the first and third together to
> >> > avoid regressions. The second can stand alone.
> >> >
> >> > I've done regression tests on powerpc64 and x86_64, and have asked
> >> > Andreas Krebbel to test against the IBM z (390) platform. I've done
> >> > performance regression testing on powerpc64. The only performance
> >> > regression of note is the 2% degradation to 188.ammp due to loss of
> >> > field disambiguation information. As discussed in another thread,
> >> > fixing this introduces more complexity than it's worth.
> >>
> >> Are there also performance improvements? What about code size?
> >
> > Yes, there are performance improvements. I've been running cpu2000 on
> > 32- and 64-bit powerpc64. Thirteen tests show measurable improvements
> > between 1% and 9%, with 187.facerec showing the largest improvements for
> > both 32 and 64.
> >
> > I don't have formal code size results, but anecdotally from code
> > crawling, I have seen code size either neutral or slightly improved.
> > The largest code size improvements I've seen were on 32-bit code where
> > the commoning allowed removal of a number of sign-extend and zero-extend
> > instructions that were otherwise not seen to be redundant.
> >>
> >> I tried to get an understanding of what kind of optimizations this patch
> >> produces based on the testcases you added, but I have a hard
> >> time here. Can you outline some please?
> >>
> >
> > The primary goal is to clean up code such as is shown in the original
> > post of PR46556. In late 2007 there were some changes made to address
> > canonicalization that caused the code gen to be suboptimal on powerpc64.
> > At that time you and others suggested a pattern recognizer prior to
> > expand as probably the best solution, similar to what IVopts is doing.
>
> The PR46556 case looks quite simple.

It certainly is. I was personally curious whether there were other
suboptimal sequences hiding out there that a more general approach might
expose. There was a comment at the end of the bugzilla about a pass to
expose target addressing modes in gimple for this purpose. When I first
started looking at this, I asked the community for feedback about whether
that should be done, and got a few favorable comments along with one
negative one. So that's how we got on this road...

> > By using the same mem_ref generation machinery used by IVopts, together
> > with local CSE, the goal was to ensure base registers are properly
> > shared so that optimal code is generated, particularly for cases of
> > shared addressability to structures and arrays. I also observed cases
> > where it was useful to extend the sharing across the dominator tree.
>
> As you are doing IV selection per individual statement only, using
> the affine combination machinery looks quite a big hammer to me.
> Especially as it is hard to imagine what the side-effects are, apart
> from re-associating dependencies that do not fit the MEM-REF and
> making the MEM-REF as complicated as permitted by the target.
>
> What I thought originally when suggesting to do something similar
> to IVOPTs was to build a list of candidates and uses and optimize
> that set using a cost function similar to how IVOPTs does.

OK, reading back I can see that now...

> Doing addressing-mode selection locally per statement seems like
> more a task for a few pattern matchers, for example in tree-ssa-forwprop.c
> (for its last invocation). One pattern would be that of PR46556,
> MEM[(p + ((n + 16)*4))] which we can transform to
> TARGET_MEM_REF[x + 64] with x = p + n*4 if ((n + 16)*4) was
> a single use. The TARGET_MEM_REF generation can easily
> re-use the address-description and target-availability checks from
> tree-ssa-address.c. I would be at least interested in whether
> handling the pattern from PR46556 alone (or maybe with a few
> similar other cases) is responsible for the performance improvements.

Hm, but I don't think forwprop sees the code in this form. At the time
the last pass of forwprop runs, the gimple for the original problem is:

  D.1997_3 = p_1(D)->a[n_2(D)];
  D.1998_4 = p_1(D)->c[n_2(D)];
  D.1999_5 = p_1(D)->b[n_2(D)];
  foo (D.1997_3, D.1998_4, D.1999_5);

But perhaps this is just a matter of changing the patterns to recognize?
Without running experiments yet, my guess is that this would pick up some
of the benefit, but not all. As you say, the effects are difficult to
predict; I've seen some surprisingly good results from CSEing some complex
addressing, but I also had to implement several tweaks to avoid
detrimental effects (such as stepping on what IVopts already got right).
It may be that some of the surprising positives (such as removal of extra
sign-extend instructions) indicate some cleanup work that should be done
in the back end instead.
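[Editor's note: the arithmetic behind the transform discussed above can be sanity-checked outside the compiler. The sketch below is illustrative, not from the patch: the struct layout and the names `S`, `sum3`, `f`, `addr_orig`, and `addr_lowered` are assumptions loosely reconstructed from the GIMPLE dump and the PR46556 pattern quoted in this thread.]

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout matching the GIMPLE dump above: three int arrays
   accessed as p->a[n], p->c[n], p->b[n], so all three loads share the
   base pointer p and the index n.  After lowering, these become
   TARGET_MEM_REFs that can share one base register instead of three
   independently computed addresses. */
struct S { int a[8]; int b[8]; int c[8]; };

static int sum3(int x, int y, int z) { return x + y + z; }

static int f(struct S *p, int n)
{
  return sum3(p->a[n], p->c[n], p->b[n]);
}

/* The PR46556 pattern itself: (n + 16)*4 re-associates to n*4 + 64, so
   MEM[p + (n + 16)*4] becomes TARGET_MEM_REF[x + 64] with x = p + n*4,
   the constant 64 folding into the target's base-plus-displacement
   addressing mode. */
static uintptr_t addr_orig(uintptr_t p, uintptr_t n)
{
  return p + (n + 16) * 4;
}

static uintptr_t addr_lowered(uintptr_t p, uintptr_t n)
{
  uintptr_t x = p + n * 4;  /* shared, re-usable base part */
  return x + 64;            /* constant displacement */
}
```

Both address forms compute the same value for any p and n, which is all the re-association relies on; the win is that x = p + n*4 is a candidate for CSE across neighboring references.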
> Ideally we'd of course have a cost-driven machinery that considers
> (part of) the whole function.
>
> > Secondarily, I noticed that once this was cleaned up we still had the
> > suboptimal code generation for the zero-offset mem refs scenario
> > outlined in the code commentary. The extra logic to clean this up helps
> > keep register pressure down. I've observed some spill code reduction
> > when this is in place. Again, using expression availability from
> > dominating blocks helps here in some cases as well.
>
> Yeah, the case is quite odd and doesn't really fit existing optimizers
> given that the CSE opportunity is hidden within the TARGET_MEM_REF ...
>
> >> I still do not like the implementation of yet another CSE machinery
> >> given that we already have two. I think most of the need for CSE
> >> comes from the use of the affine combination framework and
> >> force_gimple_operand. In fact I'd be interested to see cases that
> >> are optimized that could not be handled by a combine-like
> >> pattern matcher?
> >>
> >
> > I'm somewhat puzzled by this comment. Back on 2/4, I posted a skeletal
> > outline for this pass and asked for comments. You indicated a concern
> > that the affine combination expansion would un-CSE a lot of expressions,
> > and that I should keep track of local candidates during this pass to
> > ensure that base registers, etc., would be properly shared. It seemed
> > to me that a quick-and-dirty CSE of addressing expressions would be the
> > best way to handle that. I posted another RFC on 2/24 with an early
> > implementation of CSE constrained to a single block, and indicated
> > shortly thereafter my intent to extend this to a dominator walk. I
> > didn't receive any negative comments about using CSE at that time; but
> > then I didn't get much response at all.
>
> Sorry about that - I agree that doing a local CSE is one possible
> solution, but when considering that we are only handling a statement
> at a time it looks like quite a bloated approach. A local CSE
> capability would indeed be useful for other passes in GCC too, like
> for example loop unrolling, which exposes lots of garbage to later
> passes. I thought of re-using the machinery that DOM provides, avoiding
> yet another CSE implementation.
>
> > I probably should have pushed harder to get comments at that point, but
> > I was very new to the community and was feeling my way through the
> > protocols; I didn't feel comfortable pushing.
>
> And pushing doesn't always help either.

That's been my general experience; I try to be careful about nagging busy
people. I think my timing was poor at the beginning of this project, as
you guys were quite busy with closing off the 4.6 release.

> > In any case, yes, the primary need for CSE is a result of the affine
> > combination framework, which creates a large amount of redundant code.
> > I can certainly take a look at Michael's suggestion to move the pass
> > earlier and allow pass-dominator to handle the cleanup, and add the
> > zero-offset mem-ref optimization to that or a subsequent pass. Do you
> > agree that this would be a preferable approach?
>
> It would definitely be preferable to a new CSE implementation. I'm not
> sure the zero-offset optimization easily fits in DOM though.

I took a look at this yesterday and I think it fits easily enough; about
the same as it fit into my prototype pass. The available-expressions
logic is pretty similar to what I was doing, so transplanting the code
wouldn't be difficult. The only issue I saw is that there isn't a pure
lookup function that doesn't place an expression on the avail-exprs
stack, but that's not hard to add.
>
> What I'd really like to better understand is what transformations are
> the profitable ones and if we can handle them statement-locally
> within our combine machinery (thus, tree-ssa-forwprop.c). I can
> cook up a forwprop patch to fix PR46556 in case you are interested
> in more details.

Thanks...let me study forwprop a bit first -- I'm trying to get my head
wrapped around as much of the middle end as I can, and this is a good
excuse to delve into another piece. I think perhaps the best way forward
may be to use my prototype as a baseline to compare against additions to
the combine logic, so we can tease out what some of the unexpected gains
are, and think about how best to achieve them more simply. I'll have a
look at forwprop and pester you (lightly :) if I have questions.

Thanks,
Bill

> Thanks,
> Richard.
>
> >> Thanks,
> >> Richard.