From: Richard Guenther <richard.guenther@gmail.com>
To: "William J. Schmidt" <wschmidt@linux.vnet.ibm.com>
Cc: gcc-patches@gcc.gnu.org, bergner@vnet.ibm.com
Subject: Re: [PATCH] Address lowering [1/3] Main patch
Date: Wed, 06 Jul 2011 15:02:00 -0000
Message-ID: <CAFiYyc11S1Zcx1LH9UCDSomGYTuiS+C5phvgtq-CyHZAtAHS-Q@mail.gmail.com>
In-Reply-To: <1309962495.3758.29.camel@oc2474580526.ibm.com>
On Wed, Jul 6, 2011 at 4:28 PM, William J. Schmidt
<wschmidt@linux.vnet.ibm.com> wrote:
> On Wed, 2011-07-06 at 15:16 +0200, Richard Guenther wrote:
>> On Tue, Jul 5, 2011 at 3:59 PM, William J. Schmidt
>> <wschmidt@linux.vnet.ibm.com> wrote:
>> > (Sorry for the late response; yesterday was a holiday here.)
>> >
>> > On Mon, 2011-07-04 at 16:21 +0200, Richard Guenther wrote:
>> >> On Thu, Jun 30, 2011 at 4:39 PM, William J. Schmidt
>> >> <wschmidt@linux.vnet.ibm.com> wrote:
>> >> > This is the first of three patches related to lowering addressing
>> >> > expressions to MEM_REFs and TARGET_MEM_REFs in late gimple. This patch
>> >> > contains the new pass together with supporting changes in existing
>> >> > modules. The second patch contains an independent change to the RTL
>> >> > forward propagator to keep it from undoing an optimization made in the
>> >> > first patch. The third patch contains new test cases and changes to
>> >> > existing test cases.
>> >> >
>> >> > Although I've broken it up into three patches to make the review easier,
>> >> > it would be best to commit at least the first and third together to
>> >> > avoid regressions. The second can stand alone.
>> >> >
>> >> > I've done regression tests on powerpc64 and x86_64, and have asked
>> >> > Andreas Krebbel to test against the IBM z (390) platform. I've done
>> >> > performance regression testing on powerpc64. The only performance
>> >> > regression of note is the 2% degradation to 188.ammp due to loss of
>> >> > field disambiguation information. As discussed in another thread,
>> >> > fixing this introduces more complexity than it's worth.
>> >>
>> >> Are there also performance improvements? What about code size?
>> >
>> > Yes, there are performance improvements. I've been running cpu2000 on
>> > 32- and 64-bit powerpc64. Thirteen tests show measurable improvements
>> > between 1% and 9%, with 187.facerec showing the largest improvements for
>> > both 32 and 64.
>> >
>> > I don't have formal code size results, but anecdotally from code
>> > crawling, I have seen code size either neutral or slightly improved.
>> > The largest code size improvements I've seen were on 32-bit code where
>> > the commoning allowed removal of a number of sign-extend and zero-extend
>> > instructions that were otherwise not seen to be redundant.
>> >>
>> >> I tried to get an understanding of what kind of optimizations this patch
>> >> produces based on the testcases you added, but I have a hard
>> >> time here. Can you outline some please?
>> >>
>> >
>> > The primary goal is to clean up code such as is shown in the original
>> > post of PR46556. In late 2007 there were some changes made to address
>> > canonicalization that caused the code gen to be suboptimal on powerpc64.
>> > At that time you and others suggested a pattern recognizer prior to
>> > expand as probably the best solution, similar to what IVopts is doing.
>>
>> The PR46556 case looks quite simple.
>
> It certainly is. I was personally curious whether there were other
> suboptimal sequences that might be hiding out there, that a more general
> approach might expose. There was a comment at the end of the bugzilla
> about a pass to expose target addressing modes in gimple for this
> purpose. When I first started looking at this, I looked for some
> feedback from the community about whether that should be done, and got a
> few favorable comments along with one negative one. So that's how we
> got on this road...
>
>>
>> > By using the same mem_ref generation machinery used by IVopts, together
>> > with local CSE, the goal was to ensure base registers are properly
>> > shared so that optimal code is generated, particularly for cases of
>> > shared addressability to structures and arrays. I also observed cases
>> > where it was useful to extend the sharing across the dominator tree.
>>
>> As you are doing IV selection per individual statement only, using
>> the affine combination machinery looks like quite a big hammer to me.
>> Especially as it is hard to imagine what the side-effects are, apart
>> from re-associating dependencies that do not fit the MEM-REF and
>> making the MEM-REF as complicated as permitted by the target.
>>
>> What I thought originally when suggesting to do something similar
>> to IVOPTs was to build a list of candidates and uses and optimize
>> that set using a cost function similar to how IVOPTs does.
>>
> OK, reading back I can see that now...
>
>> Doing addressing-mode selection locally per statement seems like
>> more a task for a few pattern matchers, for example in tree-ssa-forwprop.c
>> (for its last invocation). One pattern would be that of PR46556,
>> MEM[(p + ((n + 16)*4))] which we can transform to
>> TARGET_MEM_REF[x + 64] with x = p + n*4 if ((n + 16)*4) was
>> a single-use. The TARGET_MEM_REF generation can easily
>> re-use the address-description and target-availability checks from
>> tree-ssa-address.c. I would be at least interested in whether
>> handling the pattern from PR46556 alone (or maybe with a few
>> similar other cases) is responsible for the performance improvements.
>>
> Hm, but I don't think forwprop sees the code in this form. At the time
> the last pass of forwprop runs, the gimple for the original problem is:
>
> D.1997_3 = p_1(D)->a[n_2(D)];
> D.1998_4 = p_1(D)->c[n_2(D)];
> D.1999_5 = p_1(D)->b[n_2(D)];
> foo (D.1997_3, D.1998_4, D.1999_5);
>
> But perhaps this is just a matter of changing the patterns to recognize?
Ah, so we still have the ARRAY_REFs here. Yeah, well - then the
issue boils down to get_inner_reference canonicalizing the offset
according to what fold-const.c implements, and we simply emit
the offset in that form (if you look at the normal_inner_ref: case
in expand_expr_real*). Thus with the above form at expansion
time the matching would be applied there (or during our RTL
address-generation in fwprop.c ...).
Another idea is of course to lower all handled_component_p
memory accesses - something I did on the old mem-ref branch
and I believe I also suggested at some point. Without such lowering
all the address-generation isn't exposed to the GIMPLE optimizers.
The simplest way of doing the lowering is to do something like

  base = get_inner_reference (rhs, ..., &bitpos, &offset, ...);
  base = fold_build2 (POINTER_PLUS_EXPR, ..., base, offset);
  base = force_gimple_operand (base, ..., is_gimple_mem_ref_addr);
  tem = fold_build2 (MEM_REF, TREE_TYPE (rhs), base,
                     build_int_cst (get_alias_ptr_type (rhs), bitpos));
at some point. I didn't end up doing that because of the loss of
alias information.
> Without running experiments yet, my guess is that this would pick up
> some of the benefit, but not all. As you say, the effects are difficult
> to predict; I've seen some surprisingly good results from CSEing some
> complex addressing, but I also had to implement several tweaks to avoid
> detrimental effects (such as stepping on what IVopts already got right).
> It may be that some of the surprising positives (such as removal of
> extra sign-extend instructions) indicate some cleanup work that should
> be done in the back end instead.
Indeed - or at some other point in the tree middle-end. For example
we do not re-associate POINTER_PLUS_EXPR at all, which loses
quite some CSE opportunities for addresses (tree-ssa-reassoc.c
would be the place to enhance).
>> Ideally we'd of course have a cost driven machinery that considers
>> (part of) the whole function.
>>
>> > Secondarily, I noticed that once this was cleaned up we still had the
>> > suboptimal code generation for the zero-offset mem refs scenario
>> > outlined in the code commentary. The extra logic to clean this up helps
>> > keep register pressure down. I've observed some spill code reduction
>> > when this is in place. Again, using expression availability from
>> > dominating blocks helps here in some cases as well.
>>
>> Yeah, the case is quite odd and doesn't really fit existing optimizers
>> given that the CSE opportunity is hidden within the TARGET_MEM_REF ...
>>
>> >> I still do not like the implementation of yet another CSE machinery
>> >> given that we already have two. I think most of the need for CSE
>> >> comes from the use of the affine combination framework and
>> >> force_gimple_operand. In fact I'd be interested to see cases that
>> >> are optimized that could not be handled by a combine-like
>> >> pattern matcher?
>> >>
>> >
>> > I'm somewhat puzzled by this comment. Back on 2/4, I posted a skeletal
>> > outline for this pass and asked for comments. You indicated a concern
>> > that the affine combination expansion would un-CSE a lot of expressions,
>> > and that I should keep track of local candidates during this pass to
>> > ensure that base registers, etc., would be properly shared. It seemed
>> > to me that a quick and dirty CSE of addressing expressions would be the
>> > best way to handle that. I posted another RFC on 2/24 with an early
>> > implementation of CSE constrained to a single block, and indicated
>> > shortly thereafter my intent to extend this to a dominator walk. I
>> > didn't receive any negative comments about using CSE at that time; but
>> > then I didn't get much response at all.
>>
>> Sorry about that - I agree that doing a local CSE is one possible
>> solution, but given that we are only considering one statement
>> at a time it looks like quite a bloated approach. A local CSE
>> capability would be indeed useful also for other passes in GCC, like
>> for example loop unrolling, which expose lots of garbage to later
>> passes. I thought of re-using the machinery that DOM provides, avoiding
>> yet another CSE implementation.
>>
>> >
>> > I probably should have pushed harder to get comments at that point, but
>> > I was very new to the community and was feeling my way through the
>> > protocols; I didn't feel comfortable pushing.
>>
>> And pushing doesn't always help either.
>
> That's been my general experience; I try to be careful about nagging
> busy people. I think my timing was poor at the beginning of this
> project, as you guys were quite busy with closing off the 4.6 release.
>
>>
>> > In any case, yes, the primary need for CSE is as a result of the affine
>> > combination framework, which creates a large amount of redundant code.
>> > I can certainly take a look at Michael's suggestion to move the pass
>> > earlier and allow pass-dominator to handle the cleanup, and add the
>> > zero-offset mem-ref optimization to that or a subsequent pass. Do you
>> > agree that this would be a preferable approach?
>>
>> It would definitely be preferable to a new CSE implementation. I'm not
>> sure the zero-offset optimization easily fits in DOM though.
>
> I took a look at this yesterday and I think it fits easily enough; about
> the same as it fit into my prototype pass. The available expressions
> logic is pretty similar to what I was doing, so transplanting the code
> wouldn't be difficult. The only issue I saw is that there isn't a pure
> lookup function that doesn't place an expression on the avail-exprs
> stack, but that's not hard to add.
>
>>
>> What I'd really like to better understand is what transformations are
>> the profitable ones and if we can handle them statement-locally
>> within our combine machinery (thus, tree-ssa-forwprop.c). I can
>> cook up a forwprop patch to fix PR46556 in case you are interested
>> in more details.
>>
> Thanks...let me study forwprop a bit first -- I'm trying to get my head
> wrapped around as much of the middle end as I can, and this is a good
> excuse to delve into another piece.
>
> I think perhaps the best way forward may be to use my prototype as a
> baseline to compare against additions to the combine logic, so we can
> tease out what some of the unexpected gains are, and think about how
> best to achieve them more simply.
Yeah. I will resurrect the simple pass I had for lowering all memory
accesses; that will give us another toy to play with.
Thanks,
Richard.
> I'll have a look at forwprop and pester you (lightly :) if I have
> questions.
>
> Thanks,
> Bill
>
>> Thanks,
>> Richard.
>>
>> >> Thanks,
>> >> Richard.
>> >
>> >
>> >
>
>
>