Re: [PATCH ppc64,aarch64,alpha 00/15] Improve backend constant generation

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* Re: [PATCH ppc64,aarch64,alpha 00/15] Improve backend constant generation
       [not found] <A610E03AD50BFC4D95529A36D37FA55E7B006D6780@GEORGE.Emea.Arm.com>
@ 2015-08-12 15:59 ` Wilco Dijkstra
  2015-08-12 16:09   ` Richard Henderson
  2015-08-13  3:30   ` Segher Boessenkool
  0 siblings, 2 replies; 15+ messages in thread
From: Wilco Dijkstra @ 2015-08-12 15:59 UTC (permalink / raw)
  To: rth; +Cc: 'GCC Patches', Richard Earnshaw, Marcus Shawcroft, dje.gcc

Richard Henderson wrote:
> However, the way that aarch64 and alpha have done it hasn't
> been ideal, in that there's a fairly costly search that must
> be done every time.  I've thought before about changing this
> so that we would be able to cache results, akin to how we do
> it in expmed.c for multiplication.
> 
> I've implemented such a caching scheme for three targets, as
> a test of how much code could be shared.  The answer appears
> to be about 100 lines of boiler-plate.  Minimal, true, but it
> may still be worth it as a way of encouraging backends to do
> similar things in a similar way.

However it also creates new dependencies that may not be desirable
(such as hash table size, algorithm used etc).

> Some notes about ppc64 in particular:
> 
>   * Constants aren't split until quite late, preventing all hope of
>     CSE'ing portions of the generated code.  My gut feeling is that
>     this is in general a mistake, but...

Late split is best in general as you want to CSE the original constants,
not parts of the expansion (which would be very rarely possible).

>   * This is the only platform for which I bothered collecting any sort
>     of performance data:
> 
>     As best I can tell, there is a 9% improvement in bootstrap speed
>     for ppc64.  That is, 10 minutes off the original 109 minute build.
> 
>     For aarch64 and alpha, I simply assumed there would be no loss,
>     since the basic search algorithm is unchanged for each.
> 
> Comments?  Especially on the shared header?

I'm not convinced the amount of code that could be shared is enough to be
worthwhile. Also the way it is written makes the immediate generation more 
complex and likely consuming a lot of memory (each cached immediate requires
at least 64 bytes). It is not obvious to me why it is a good idea to hide the
simple/fast cases behind the hashing scheme - it seems better that the backend
explicitly decides which cases should be cached.

I looked at the statistics of AArch64 immediate generation a while ago. 
The interesting thing is ~95% of calls are queries, and the same query is on 
average repeated 10 times in a row. So (a) it is not important to cache the 
expansions, and (b) the high repetition rate means a single-entry cache
has a 90% hitrate. We already have a patch for this and could collect stats
comparing the approaches. If a single-entry cache can provide a similar 
benefit as caching all immediates then my preference would be to keep things
simple and just cache the last query.

Note the many repeated queries indicate a performance issue at a much higher 
level (repeated cost queries on the same unchanged RTL), and solving that 
problem will likely improve buildtime for all targets.

Wilco

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH ppc64,aarch64,alpha 00/15] Improve backend constant generation
  2015-08-12 15:59 ` [PATCH ppc64,aarch64,alpha 00/15] Improve backend constant generation Wilco Dijkstra
@ 2015-08-12 16:09   ` Richard Henderson
  2015-08-25 14:21     ` Wilco Dijkstra
  2015-08-13  3:30   ` Segher Boessenkool
  1 sibling, 1 reply; 15+ messages in thread
From: Richard Henderson @ 2015-08-12 16:09 UTC (permalink / raw)
  To: Wilco Dijkstra
  Cc: 'GCC Patches', Richard Earnshaw, Marcus Shawcroft, dje.gcc

On 08/12/2015 08:59 AM, Wilco Dijkstra wrote:
> I looked at the statistics of AArch64 immediate generation a while ago. 
> The interesting thing is ~95% of calls are queries, and the same query is on 
> average repeated 10 times in a row. So (a) it is not important to cache the 
> expansions, and (b) the high repetition rate means a single-entry cache
> has a 90% hitrate. We already have a patch for this and could collect stats
> comparing the approaches. If a single-entry cache can provide a similar 
> benefit as caching all immediates then my preference would be to keep things
> simple and just cache the last query.

Interesting.  That's already more detailed investigation than I'd done.  I had
no idea the queries were so clustered.  I assumed that the queries would be
scattered across various passes, and so the various constants across the
function would get checked in sequence.

I would be very interested in seeing those stats when you've done.


r~

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH ppc64,aarch64,alpha 00/15] Improve backend constant generation
  2015-08-12 15:59 ` [PATCH ppc64,aarch64,alpha 00/15] Improve backend constant generation Wilco Dijkstra
  2015-08-12 16:09   ` Richard Henderson
@ 2015-08-13  3:30   ` Segher Boessenkool
  1 sibling, 0 replies; 15+ messages in thread
From: Segher Boessenkool @ 2015-08-13  3:30 UTC (permalink / raw)
  To: Wilco Dijkstra
  Cc: rth, 'GCC Patches', Richard Earnshaw, Marcus Shawcroft, dje.gcc

On Wed, Aug 12, 2015 at 04:59:27PM +0100, Wilco Dijkstra wrote:
> However it also creates new dependencies that may not be desirable
> (such as hash table size, algorithm used etc).
> 
> > Some notes about ppc64 in particular:
> > 
> >   * Constants aren't split until quite late, preventing all hope of
> >     CSE'ing portions of the generated code.  My gut feeling is that
> >     this is in general a mistake, but...
> 
> Late split is best in general as you want to CSE the original constants,
> not parts of the expansion (which would be very rarely possible).

CSE handles the REG_EQ* notes just fine?  For rs6000 many constants
are split at expansion already, and I never saw problems from that?
Maybe I haven't looked at the right (wrong :-) ) testcases though.

> > Comments?  Especially on the shared header?
> 
> I'm not convinced the amount of code that could be shared is enough to be
> worthwhile.

I agree.

> Also the way it is written makes the immediate generation more 
> complex and likely consuming a lot of memory (each cached immediate requires
> at least 64 bytes).

It would take a bit less memory if it used pointers (to other cache elts)
for sub-expressions, but that makes cache invalidation more interesting,
and shared (within one constant) subexpressions non-trivial.

> It is not obvious to me why it is a good idea to hide the
> simple/fast cases behind the hashing scheme - it seems better that the backend
> explicitly decides which cases should be cached.

Without going through the abstraction you mean?  I'd rather there was
no such abstraction at all :-)

> I looked at the statistics of AArch64 immediate generation a while ago. 
> The interesting thing is ~95% of calls are queries, and the same query is on 
> average repeated 10 times in a row.

Huh.

> So (a) it is not important to cache the expansions,

It also stores the expansion from analysis until emission time.  This
simplifies the code a lot.

Maybe that can be done without caching everything (and still not be
too expensive), dunno.

> and (b) the high repetition rate means a single-entry cache
> has a 90% hitrate.

And a 10% miss rate...

> We already have a patch for this and could collect stats
> comparing the approaches. If a single-entry cache can provide a similar 
> benefit as caching all immediates then my preference would be to keep things
> simple and just cache the last query.
> 
> Note the many repeated queries indicate a performance issue at a much higher 
> level (repeated cost queries on the same unchanged RTL), and solving that 
> problem will likely improve buildtime for all targets.

You cannot see if RTL has changed from the pointer to it, so we'd have
to store that info (the "unchanged" info, or the "cost" info) somewhere.
Maybe it would be useful to store it in the RTL itself?


Segher

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [PATCH ppc64,aarch64,alpha 00/15] Improve backend constant generation
  2015-08-12 16:09   ` Richard Henderson
@ 2015-08-25 14:21     ` Wilco Dijkstra
  0 siblings, 0 replies; 15+ messages in thread
From: Wilco Dijkstra @ 2015-08-25 14:21 UTC (permalink / raw)
  To: 'Richard Henderson'
  Cc: 'GCC Patches', Richard Earnshaw, Marcus Shawcroft, dje.gcc

> Richard Henderson wrote:
> On 08/12/2015 08:59 AM, Wilco Dijkstra wrote:
> > I looked at the statistics of AArch64 immediate generation a while ago.
> > The interesting thing is ~95% of calls are queries, and the same query is on
> > average repeated 10 times in a row. So (a) it is not important to cache the
> > expansions, and (b) the high repetition rate means a single-entry cache
> > has a 90% hitrate. We already have a patch for this and could collect stats
> > comparing the approaches. If a single-entry cache can provide a similar
> > benefit as caching all immediates then my preference would be to keep things
> > simple and just cache the last query.
> 
> Interesting.  That's already more detailed investigation than I'd done.  I had
> no idea the queries were so clustered.  I assumed that the queries would be
> scattered across various passes, and so the various constants across the
> function would get checked in sequence.
> 
> I would be very interested in seeing those stats when you've done.

Caching improves average buildtime by 0.1-0.2% - your patch seems to be slightly
faster than caching just 1 query, so that suggests caching a few entries would be
beneficial.

However looking at the immediates that are generated by the loops, it's feasible
to avoid linear/quadratic search loops altogether. So I think a generic immediate 
caching scheme won't be useful for AArch64.

Wilco


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH ppc64,aarch64,alpha 00/15] Improve backend constant generation
  2015-08-13  3:10   ` Segher Boessenkool
@ 2015-08-13 11:32     ` David Edelsohn
  0 siblings, 0 replies; 15+ messages in thread
From: David Edelsohn @ 2015-08-13 11:32 UTC (permalink / raw)
  To: Segher Boessenkool, Richard Henderson
  Cc: GCC Patches, Marcus Shawcroft, Richard Earnshaw

On Wed, Aug 12, 2015 at 11:10 PM, Segher Boessenkool
<segher@kernel.crashing.org> wrote:
> On Wed, Aug 12, 2015 at 03:31:48AM -0500, Segher Boessenkool wrote:
>> >   * This is the only platform for which I bothered collecting any sort
>> >     of performance data:
>> >
>> >     As best I can tell, there is a 9% improvement in bootstrap speed
>> >     for ppc64.  That is, 10 minutes off the original 109 minute build.
>>
>> That is, wow.  Wow :-)
>
> Bootstrap + 4 regtests of a virgin trunk du jour took me 127m;
> with your patches, 130m.  So I'm not seeing that improvement (but
> no regression either).  This is gcc110 fwiw.

The patch series is fine with me.

As Segher mentioned, the performance impact is in the noise for my tests.

Thanks, David

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH ppc64,aarch64,alpha 00/15] Improve backend constant generation
  2015-08-13  3:07     ` Segher Boessenkool
@ 2015-08-13  5:36       ` Segher Boessenkool
  0 siblings, 0 replies; 15+ messages in thread
From: Segher Boessenkool @ 2015-08-13  5:36 UTC (permalink / raw)
  To: Richard Henderson
  Cc: gcc-patches, David Edelsohn, Marcus Shawcroft, Richard Earnshaw

On Wed, Aug 12, 2015 at 08:32:46AM -0700, Richard Henderson wrote:
> On 08/12/2015 01:31 AM, Segher Boessenkool wrote:
> > Is there something that makes the cache not get too big?  Do we
> > care, anyway?
> 
> No, nothing ever reduces the size of the cache.  I doubt we care, but I haven't
> instrumented to see how big it grows.
> 
> My intuition says the most important thing about managing this cache is not to
> put the most common trivial constants in, and I already do that.

Right.  And it seems to cache negative results too (the five-insn
sequence).

> >>     I did attempt to fix it, and got nothing for my troubles except
> >>     poorer code generation for AND/IOR/XOR with non-trivial constants.
> > 
> > Could you give an example of code that isn't split early enough?
> 
> I believe the examples I was seeing was in the libdecnumber code.  I'd have to
> go back and reproduce it now...

If you could, please do.

> >>     Perhaps there's some other predication or costing error that's
> >>     getting in the way, and it simply wasn't obvious to me.   In any
> >>     case, nothing in this patch set addresses this at all.
> > 
> > The instruction (set (reg) (const_int 0x12345678)) is costed as 4
> > (i.e. one insn).  That cannot be good.  This is alternative #5 in
> > *movsi_internal1_single (there are many more variants of that
> > pattern).
> 
> Yes, when I tried to fix it, I did adjust that costing, but still...

I misread it (it's alt #6, with cost 8).  Maybe Alan's patches would
fix this one?

> >>   * Constants are split *really* late.  In particular, after reload.
> > 
> > Yeah that is bad.  But I'm still not seeing it.  Hrm, maybe only
> > DImode ones?
> 
> Dunno.  I found it after I did add a recipe to use ADDI, which then triggered
> an ICE due to the r0 base register.

Ah!  Reload generates *new* addition insns out of thin air, and that
is exactly the case where ADDI won't work.  LRA works a bit better
there it seems (but it is still not the default for rs6000); if the
constraint ("b" in this case) would not match, it tries other things.

> > Have you looked at generated code quality?
> 
> I've looked at diffs of all of the object files in the target directory.  There
> were definite spots of improvement.  I wasn't able to spot any regressions.

[ I've looked now. ]

Some of the new combos help a bit, yes.  Total code size increased
a tiny bit; it looks to be all because of unfortunate scheduling and
less tail merging.  The usual.

32-bit code is identical, as it should be ;-)


Segher

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH ppc64,aarch64,alpha 00/15] Improve backend constant generation
  2015-08-12  8:32 ` Segher Boessenkool
  2015-08-12 15:32   ` Richard Henderson
@ 2015-08-13  3:10   ` Segher Boessenkool
  2015-08-13 11:32     ` David Edelsohn
  1 sibling, 1 reply; 15+ messages in thread
From: Segher Boessenkool @ 2015-08-13  3:10 UTC (permalink / raw)
  To: Richard Henderson
  Cc: gcc-patches, David Edelsohn, Marcus Shawcroft, Richard Earnshaw

On Wed, Aug 12, 2015 at 03:31:48AM -0500, Segher Boessenkool wrote:
> >   * This is the only platform for which I bothered collecting any sort
> >     of performance data:
> > 
> >     As best I can tell, there is a 9% improvement in bootstrap speed
> >     for ppc64.  That is, 10 minutes off the original 109 minute build.
> 
> That is, wow.  Wow :-)

Bootstrap + 4 regtests of a virgin trunk du jour took me 127m;
with your patches, 130m.  So I'm not seeing that improvement (but
no regression either).  This is gcc110 fwiw.


Segher

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH ppc64,aarch64,alpha 00/15] Improve backend constant generation
  2015-08-12 15:32   ` Richard Henderson
@ 2015-08-13  3:07     ` Segher Boessenkool
  2015-08-13  5:36       ` Segher Boessenkool
  0 siblings, 1 reply; 15+ messages in thread
From: Segher Boessenkool @ 2015-08-13  3:07 UTC (permalink / raw)
  To: Richard Henderson
  Cc: gcc-patches, David Edelsohn, Marcus Shawcroft, Richard Earnshaw

On Wed, Aug 12, 2015 at 08:32:46AM -0700, Richard Henderson wrote:
> On 08/12/2015 01:31 AM, Segher Boessenkool wrote:
> > Is there something that makes the cache not get too big?  Do we
> > care, anyway?
> 
> No, nothing ever reduces the size of the cache.  I doubt we care, but I haven't
> instrumented to see how big it grows.
> 
> My intuition says the most important thing about managing this cache is not to
> put the most common trivial constants in, and I already do that.

Right.  And it seems to cache negative results too (the five-insn
sequence).

> >>     I did attempt to fix it, and got nothing for my troubles except
> >>     poorer code generation for AND/IOR/XOR with non-trivial constants.
> > 
> > Could you give an example of code that isn't split early enough?
> 
> I believe the examples I was seeing was in the libdecnumber code.  I'd have to
> go back and reproduce it now...

If you could, please do.

> >>     Perhaps there's some other predication or costing error that's
> >>     getting in the way, and it simply wasn't obvious to me.   In any
> >>     case, nothing in this patch set addresses this at all.
> > 
> > The instruction (set (reg) (const_int 0x12345678)) is costed as 4
> > (i.e. one insn).  That cannot be good.  This is alternative #5 in
> > *movsi_internal1_single (there are many more variants of that
> > pattern).
> 
> Yes, when I tried to fix it, I did adjust that costing, but still...

I misread it (it's alt #6, with cost 8).  Maybe Alan's patches would
fix this one?

> >>   * Constants are split *really* late.  In particular, after reload.
> > 
> > Yeah that is bad.  But I'm still not seeing it.  Hrm, maybe only
> > DImode ones?
> 
> Dunno.  I found it after I did add a recipe to use ADDI, which then triggered
> an ICE due to the r0 base register.

Ah!  Reload generates *new* addition insns out of thin air, and that
is exactly the case where ADDI won't work.  LRA works a bit better
there it seems (but it is still not the default for rs6000); if the
constraint ("b" in this case) would not match, it tries other things.

> > Have you looked at generated code quality?
> 
> I've looked at diffs of all of the object files in the target directory.  There
> were definite spots of improvement.  I wasn't able to spot any regressions.

[ I've looked now. ]

Some of the new combos help a bit, yes.  Total code size increased
a tiny bit; it looks to be all because of unfortunate scheduling and
less tail merging.  The usual.

32-bit code is identical, as it should be ;-)


Segher

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH ppc64,aarch64,alpha 00/15] Improve backend constant generation
  2015-08-12  8:32 ` Richard Earnshaw
  2015-08-12  8:43   ` Richard Earnshaw
@ 2015-08-12 15:45   ` Richard Henderson
  1 sibling, 0 replies; 15+ messages in thread
From: Richard Henderson @ 2015-08-12 15:45 UTC (permalink / raw)
  To: Richard Earnshaw, gcc-patches
  Cc: David Edelsohn, Marcus Shawcroft, Richard Earnshaw

On 08/12/2015 01:32 AM, Richard Earnshaw wrote:
> How do we clear the cache, and when?  For example, on ARM, switching
> between ARM and Thumb state means we need to generate potentially
> radically different sequences?  We can do such splitting at function
> boundaries now.

At present I never clear the cache.  Maybe we'll find that's a mistake.

For arm vs thumb I would start with just using two different caches.  The way
the code is structured currently, that would mean two different classes.  Which
could just be trivial wrappers around a common base class containing the
generator code.

> Can we generate different sequences for hot/cold code within a single
> function?

Not without using different caches.

> Can we cache sequences with the context (eg use with AND, OR, ADD, etc)?

No.  At least not without...


r~

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH ppc64,aarch64,alpha 00/15] Improve backend constant generation
  2015-08-12  8:32 ` Segher Boessenkool
@ 2015-08-12 15:32   ` Richard Henderson
  2015-08-13  3:07     ` Segher Boessenkool
  2015-08-13  3:10   ` Segher Boessenkool
  1 sibling, 1 reply; 15+ messages in thread
From: Richard Henderson @ 2015-08-12 15:32 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: gcc-patches, David Edelsohn, Marcus Shawcroft, Richard Earnshaw

On 08/12/2015 01:31 AM, Segher Boessenkool wrote:
> Is there something that makes the cache not get too big?  Do we
> care, anyway?

No, nothing ever reduces the size of the cache.  I doubt we care, but I haven't
instrumented to see how big it grows.

My intuition says the most important thing about managing this cache is not to
put the most common trivial constants in, and I already do that.

>>     I did attempt to fix it, and got nothing for my troubles except
>>     poorer code generation for AND/IOR/XOR with non-trivial constants.
> 
> Could you give an example of code that isn't split early enough?

I believe the examples I was seeing was in the libdecnumber code.  I'd have to
go back and reproduce it now...

>>     Perhaps there's some other predication or costing error that's
>>     getting in the way, and it simply wasn't obvious to me.   In any
>>     case, nothing in this patch set addresses this at all.
> 
> The instruction (set (reg) (const_int 0x12345678)) is costed as 4
> (i.e. one insn).  That cannot be good.  This is alternative #5 in
> *movsi_internal1_single (there are many more variants of that
> pattern).

Yes, when I tried to fix it, I did adjust that costing, but still...

>>   * Constants are split *really* late.  In particular, after reload.
> 
> Yeah that is bad.  But I'm still not seeing it.  Hrm, maybe only
> DImode ones?

Dunno.  I found it after I did add a recipe to use ADDI, which then triggered
an ICE due to the r0 base register.

> Have you looked at generated code quality?

I've looked at diffs of all of the object files in the target directory.  There
were definite spots of improvement.  I wasn't able to spot any regressions.

r~

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH ppc64,aarch64,alpha 00/15] Improve backend constant generation
  2015-08-12  8:43   ` Richard Earnshaw
@ 2015-08-12  9:02     ` Richard Earnshaw
  0 siblings, 0 replies; 15+ messages in thread
From: Richard Earnshaw @ 2015-08-12  9:02 UTC (permalink / raw)
  To: Richard Henderson, gcc-patches; +Cc: David Edelsohn, Marcus Shawcroft

On 12/08/15 09:43, Richard Earnshaw wrote:
> On 12/08/15 09:32, Richard Earnshaw wrote:
>> On 12/08/15 02:11, Richard Henderson wrote:
>>>     I'm somewhat surprised that the operands to the logicals aren't
>>>     visible at rtl generation time, given all the work done in gimple.
>>>     And failing that, combine has enough REG_EQUAL notes that it ought
>>>     to be able to put things back together and see the simpler pattern.
>>>
>>
>> We've tried it in the past.  Exposing the individual steps prevents the
>> higher-level rtl-based optimizations since they can no-longer deal with
>> the complete sub-expression.
> 
> Eg. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63724
> 
> R.
> 
And https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65768

R.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH ppc64,aarch64,alpha 00/15] Improve backend constant generation
  2015-08-12  8:32 ` Richard Earnshaw
@ 2015-08-12  8:43   ` Richard Earnshaw
  2015-08-12  9:02     ` Richard Earnshaw
  2015-08-12 15:45   ` Richard Henderson
  1 sibling, 1 reply; 15+ messages in thread
From: Richard Earnshaw @ 2015-08-12  8:43 UTC (permalink / raw)
  To: Richard Earnshaw, Richard Henderson, gcc-patches
  Cc: David Edelsohn, Marcus Shawcroft

On 12/08/15 09:32, Richard Earnshaw wrote:
> On 12/08/15 02:11, Richard Henderson wrote:
>> Something last week had me looking at ppc64 code generation,
>> and some of what I saw was fairly bad.  Fixing it wasn't going
>> to be easy, due to the fact that the logic for generating
>> constants wasn't contained within a single function.
>>
>> Better is the way that aarch64 and alpha have done it in the
>> past, sharing a single function with all of the logical that
>> can be used for both cost calculation and the actual emission
>> of the constants.
>>
>> However, the way that aarch64 and alpha have done it hasn't
>> been ideal, in that there's a fairly costly search that must
>> be done every time.  I've thought before about changing this
>> so that we would be able to cache results, akin to how we do
>> it in expmed.c for multiplication.
>>
>> I've implemented such a caching scheme for three targets, as
>> a test of how much code could be shared.  The answer appears
>> to be about 100 lines of boiler-plate.  Minimal, true, but it
>> may still be worth it as a way of encouraging backends to do
>> similar things in a similar way.
>>
> 
> I've got a short week this week, so won't have time to look at this in
> detail for a while.  So a bunch of questions... but not necessarily
> objections :-)
> 
> How do we clear the cache, and when?  For example, on ARM, switching
> between ARM and Thumb state means we need to generate potentially
> radically different sequences?  We can do such splitting at function
> boundaries now.
> 
> Can we generate different sequences for hot/cold code within a single
> function?
> 
> Can we cache sequences with the context (eg use with AND, OR, ADD, etc)?
> 
> 
>> Some notes about ppc64 in particular:
>>
>>   * Constants aren't split until quite late, preventing all hope of
>>     CSE'ing portions of the generated code.  My gut feeling is that
>>     this is in general a mistake, but...
>>
>>     I did attempt to fix it, and got nothing for my troubles except
>>     poorer code generation for AND/IOR/XOR with non-trivial constants.
>>
> On AArch64 in particular, building complex constants is generally
> destructive on the source register (if you want to preserve intermediate
> values you have to make intermediate copies); that's clearly never going
> to be a win if you don't need at least 3 instructions to form the
> constant.
> 
> There might be some cases where you could form a second constant as a
> difference from an earlier one, but that then creates data-flow
> dependencies and in OoO machines that might not be worth-while.  Even
> for in-order machines it can restrict scheduling and result in worse code.
> 
> 
>>     I'm somewhat surprised that the operands to the logicals aren't
>>     visible at rtl generation time, given all the work done in gimple.
>>     And failing that, combine has enough REG_EQUAL notes that it ought
>>     to be able to put things back together and see the simpler pattern.
>>
> 
> We've tried it in the past.  Exposing the individual steps prevents the
> higher-level rtl-based optimizations since they can no-longer deal with
> the complete sub-expression.

Eg. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63724

R.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH ppc64,aarch64,alpha 00/15] Improve backend constant generation
  2015-08-12  1:11 Richard Henderson
@ 2015-08-12  8:32 ` Richard Earnshaw
  2015-08-12  8:43   ` Richard Earnshaw
  2015-08-12 15:45   ` Richard Henderson
  2015-08-12  8:32 ` Segher Boessenkool
  1 sibling, 2 replies; 15+ messages in thread
From: Richard Earnshaw @ 2015-08-12  8:32 UTC (permalink / raw)
  To: Richard Henderson, gcc-patches
  Cc: David Edelsohn, Marcus Shawcroft, Richard Earnshaw

On 12/08/15 02:11, Richard Henderson wrote:
> Something last week had me looking at ppc64 code generation,
> and some of what I saw was fairly bad.  Fixing it wasn't going
> to be easy, due to the fact that the logic for generating
> constants wasn't contained within a single function.
> 
> Better is the way that aarch64 and alpha have done it in the
> past, sharing a single function with all of the logical that
> can be used for both cost calculation and the actual emission
> of the constants.
> 
> However, the way that aarch64 and alpha have done it hasn't
> been ideal, in that there's a fairly costly search that must
> be done every time.  I've thought before about changing this
> so that we would be able to cache results, akin to how we do
> it in expmed.c for multiplication.
> 
> I've implemented such a caching scheme for three targets, as
> a test of how much code could be shared.  The answer appears
> to be about 100 lines of boiler-plate.  Minimal, true, but it
> may still be worth it as a way of encouraging backends to do
> similar things in a similar way.
> 

I've got a short week this week, so won't have time to look at this in
detail for a while.  So a bunch of questions... but not necessarily
objections :-)

How do we clear the cache, and when?  For example, on ARM, switching
between ARM and Thumb state means we need to generate potentially
radically different sequences?  We can do such splitting at function
boundaries now.

Can we generate different sequences for hot/cold code within a single
function?

Can we cache sequences with the context (eg use with AND, OR, ADD, etc)?


> Some notes about ppc64 in particular:
> 
>   * Constants aren't split until quite late, preventing all hope of
>     CSE'ing portions of the generated code.  My gut feeling is that
>     this is in general a mistake, but...
> 
>     I did attempt to fix it, and got nothing for my troubles except
>     poorer code generation for AND/IOR/XOR with non-trivial constants.
> 
On AArch64 in particular, building complex constants is generally
destructive on the source register (if you want to preserve intermediate
values you have to make intermediate copies); that's clearly never going
to be a win if you don't need at least 3 instructions to form the
constant.

There might be some cases where you could form a second constant as a
difference from an earlier one, but that then creates data-flow
dependencies and in OoO machines that might not be worth-while.  Even
for in-order machines it can restrict scheduling and result in worse code.


>     I'm somewhat surprised that the operands to the logicals aren't
>     visible at rtl generation time, given all the work done in gimple.
>     And failing that, combine has enough REG_EQUAL notes that it ought
>     to be able to put things back together and see the simpler pattern.
> 

We've tried it in the past.  Exposing the individual steps prevents the
higher-level rtl-based optimizations since they can no-longer deal with
the complete sub-expression.

>     Perhaps there's some other predication or costing error that's
>     getting in the way, and it simply wasn't obvious to me.   In any
>     case, nothing in this patch set addresses this at all.
> 
>   * I go on to add 4 new methods of generating a constant, each of
>     which typically saves 2 insns over the current algorithm.  There
>     are a couple more that might be useful but...
> 
>   * Constants are split *really* late.  In particular, after reload.
>     It would be awesome if we could at least have them all split before
>     register allocation so that we arrange to use ADDI and ADDIS when
>     that could save a few instructions.  But that does of course mean
>     avoiding r0 for the input.  Again, nothing here attempts to change
>     when constants are split.
> 

certainly in the ARM port we try to split immediately before register
allocation, that way we can be sure that we have scratch registers
available if that helps with generating more efficient sequences.

R.

>   * This is the only platform for which I bothered collecting any sort
>     of performance data:
> 
>     As best I can tell, there is a 9% improvement in bootstrap speed
>     for ppc64.  That is, 10 minutes off the original 109 minute build.
> 
>     For aarch64 and alpha, I simply assumed there would be no loss,
>     since the basic search algorithm is unchanged for each.
> 
> Comments?  Especially on the shared header?
> 
> 
> r~
> 
> Cc: David Edelsohn <dje.gcc@gmail.com>
> Cc: Marcus Shawcroft <marcus.shawcroft@arm.com>
> Cc: Richard Earnshaw <richard.earnshaw@arm.com>
> 
> Richard Henderson (15):
>   rs6000: Split out rs6000_is_valid_and_mask_wide
>   rs6000: Make num_insns_constant_wide static
>   rs6000: Tidy num_insns_constant vs CONST_DOUBLE
>   rs6000: Implement set_const_data infrastructure
>   rs6000: Move constant via mask into build_set_const_data
>   rs6000: Use rldiwi in constant construction
>   rs6000: Generalize left shift in constant generation
>   rs6000: Generalize masking in constant generation
>   rs6000: Use xoris in constant construction
>   rs6000: Use rotldi in constant generation
>   aarch64: Use hashing infrastructure for generating constants
>   aarch64: Test for duplicated 32-bit halves
>   alpha: Use hashing infrastructure for generating constants
>   alpha: Split out alpha_cost_set_const
>   alpha: Remove alpha_emit_set_long_const
> 
>  gcc/config/aarch64/aarch64.c      | 463 ++++++++++++++++------------
>  gcc/config/alpha/alpha.c          | 583 +++++++++++++++++------------------
>  gcc/config/rs6000/rs6000-protos.h |   1 -
>  gcc/config/rs6000/rs6000.c        | 617 ++++++++++++++++++++++++--------------
>  gcc/config/rs6000/rs6000.md       |  15 -
>  gcc/genimm-hash.h                 | 122 ++++++++
>  6 files changed, 1057 insertions(+), 744 deletions(-)
>  create mode 100644 gcc/genimm-hash.h
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH ppc64,aarch64,alpha 00/15] Improve backend constant generation
  2015-08-12  1:11 Richard Henderson
  2015-08-12  8:32 ` Richard Earnshaw
@ 2015-08-12  8:32 ` Segher Boessenkool
  2015-08-12 15:32   ` Richard Henderson
  2015-08-13  3:10   ` Segher Boessenkool
  1 sibling, 2 replies; 15+ messages in thread
From: Segher Boessenkool @ 2015-08-12  8:32 UTC (permalink / raw)
  To: Richard Henderson
  Cc: gcc-patches, David Edelsohn, Marcus Shawcroft, Richard Earnshaw

Hi!

This looks really nice.  I'll try it out soon :-)

Some comments now...


On Tue, Aug 11, 2015 at 06:11:29PM -0700, Richard Henderson wrote:
> However, the way that aarch64 and alpha have done it hasn't
> been ideal, in that there's a fairly costly search that must
> be done every time.  I've thought before about changing this
> so that we would be able to cache results, akin to how we do
> it in expmed.c for multiplication.

Is there something that makes the cache not get too big?  Do we
care, anyway?

> Some notes about ppc64 in particular:
> 
>   * Constants aren't split until quite late, preventing all hope of
>     CSE'ing portions of the generated code.  My gut feeling is that
>     this is in general a mistake, but...

Constant arguments to IOR/XOR/AND that can be done with two machine
insns are split at expand.  Then combine comes along and just loves
to recombine them, but then they are split again at split1 (before
RA).

For AND this was optimal in my experiments; for IOR/XOR it has been
this way since the dawn of time.

Simple SETs aren't split at expand, maybe they should be.  But they
are split at split1.

>     I did attempt to fix it, and got nothing for my troubles except
>     poorer code generation for AND/IOR/XOR with non-trivial constants.

Could you give an example of code that isn't split early enough?

>     I'm somewhat surprised that the operands to the logicals aren't
>     visible at rtl generation time, given all the work done in gimple.

So am I, because that is not what I'm seeing?  E.g.

int f(int x) { return x | 0x12345678; }

is expanded as two IORs already.  There must be something in your
testcases that prevents this?

>     And failing that, combine has enough REG_EQUAL notes that it ought
>     to be able to put things back together and see the simpler pattern.
> 
>     Perhaps there's some other predication or costing error that's
>     getting in the way, and it simply wasn't obvious to me.   In any
>     case, nothing in this patch set addresses this at all.

The instruction (set (reg) (const_int 0x12345678)) is costed as 4
(i.e. one insn).  That cannot be good.  This is alternative #5 in
*movsi_internal1_single (there are many more variants of that
pattern).

>   * I go on to add 4 new methods of generating a constant, each of
>     which typically saves 2 insns over the current algorithm.  There
>     are a couple more that might be useful but...

New methods look to be really simple to add with your framework,
very nice :-)

>   * Constants are split *really* late.  In particular, after reload.

Yeah that is bad.  But I'm still not seeing it.  Hrm, maybe only
DImode ones?

>     It would be awesome if we could at least have them all split before
>     register allocation

And before sched1, yeah.

>     so that we arrange to use ADDI and ADDIS when
>     that could save a few instructions.  But that does of course mean
>     avoiding r0 for the input.

That is no problem at all before RA.

>     Again, nothing here attempts to change
>     when constants are split.
> 
>   * This is the only platform for which I bothered collecting any sort
>     of performance data:
> 
>     As best I can tell, there is a 9% improvement in bootstrap speed
>     for ppc64.  That is, 10 minutes off the original 109 minute build.

That is, wow.  Wow :-)

Have you looked at generated code quality?


Segher

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH ppc64,aarch64,alpha 00/15] Improve backend constant generation
@ 2015-08-12  1:11 Richard Henderson
  2015-08-12  8:32 ` Richard Earnshaw
  2015-08-12  8:32 ` Segher Boessenkool
  0 siblings, 2 replies; 15+ messages in thread
From: Richard Henderson @ 2015-08-12  1:11 UTC (permalink / raw)
  To: gcc-patches; +Cc: David Edelsohn, Marcus Shawcroft, Richard Earnshaw

Something last week had me looking at ppc64 code generation,
and some of what I saw was fairly bad.  Fixing it wasn't going
to be easy, due to the fact that the logic for generating
constants wasn't contained within a single function.

Better is the way that aarch64 and alpha have done it in the
past, sharing a single function with all of the logical that
can be used for both cost calculation and the actual emission
of the constants.

However, the way that aarch64 and alpha have done it hasn't
been ideal, in that there's a fairly costly search that must
be done every time.  I've thought before about changing this
so that we would be able to cache results, akin to how we do
it in expmed.c for multiplication.

I've implemented such a caching scheme for three targets, as
a test of how much code could be shared.  The answer appears
to be about 100 lines of boiler-plate.  Minimal, true, but it
may still be worth it as a way of encouraging backends to do
similar things in a similar way.

Some notes about ppc64 in particular:

  * Constants aren't split until quite late, preventing all hope of
    CSE'ing portions of the generated code.  My gut feeling is that
    this is in general a mistake, but...

    I did attempt to fix it, and got nothing for my troubles except
    poorer code generation for AND/IOR/XOR with non-trivial constants.

    I'm somewhat surprised that the operands to the logicals aren't
    visible at rtl generation time, given all the work done in gimple.
    And failing that, combine has enough REG_EQUAL notes that it ought
    to be able to put things back together and see the simpler pattern.

    Perhaps there's some other predication or costing error that's
    getting in the way, and it simply wasn't obvious to me.   In any
    case, nothing in this patch set addresses this at all.

  * I go on to add 4 new methods of generating a constant, each of
    which typically saves 2 insns over the current algorithm.  There
    are a couple more that might be useful but...

  * Constants are split *really* late.  In particular, after reload.
    It would be awesome if we could at least have them all split before
    register allocation so that we arrange to use ADDI and ADDIS when
    that could save a few instructions.  But that does of course mean
    avoiding r0 for the input.  Again, nothing here attempts to change
    when constants are split.

  * This is the only platform for which I bothered collecting any sort
    of performance data:

    As best I can tell, there is a 9% improvement in bootstrap speed
    for ppc64.  That is, 10 minutes off the original 109 minute build.

    For aarch64 and alpha, I simply assumed there would be no loss,
    since the basic search algorithm is unchanged for each.

Comments?  Especially on the shared header?


r~

Cc: David Edelsohn <dje.gcc@gmail.com>
Cc: Marcus Shawcroft <marcus.shawcroft@arm.com>
Cc: Richard Earnshaw <richard.earnshaw@arm.com>

Richard Henderson (15):
  rs6000: Split out rs6000_is_valid_and_mask_wide
  rs6000: Make num_insns_constant_wide static
  rs6000: Tidy num_insns_constant vs CONST_DOUBLE
  rs6000: Implement set_const_data infrastructure
  rs6000: Move constant via mask into build_set_const_data
  rs6000: Use rldiwi in constant construction
  rs6000: Generalize left shift in constant generation
  rs6000: Generalize masking in constant generation
  rs6000: Use xoris in constant construction
  rs6000: Use rotldi in constant generation
  aarch64: Use hashing infrastructure for generating constants
  aarch64: Test for duplicated 32-bit halves
  alpha: Use hashing infrastructure for generating constants
  alpha: Split out alpha_cost_set_const
  alpha: Remove alpha_emit_set_long_const

 gcc/config/aarch64/aarch64.c      | 463 ++++++++++++++++------------
 gcc/config/alpha/alpha.c          | 583 +++++++++++++++++------------------
 gcc/config/rs6000/rs6000-protos.h |   1 -
 gcc/config/rs6000/rs6000.c        | 617 ++++++++++++++++++++++++--------------
 gcc/config/rs6000/rs6000.md       |  15 -
 gcc/genimm-hash.h                 | 122 ++++++++
 6 files changed, 1057 insertions(+), 744 deletions(-)
 create mode 100644 gcc/genimm-hash.h

-- 
2.4.3

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2015-08-25 14:19 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <A610E03AD50BFC4D95529A36D37FA55E7B006D6780@GEORGE.Emea.Arm.com>
2015-08-12 15:59 ` [PATCH ppc64,aarch64,alpha 00/15] Improve backend constant generation Wilco Dijkstra
2015-08-12 16:09   ` Richard Henderson
2015-08-25 14:21     ` Wilco Dijkstra
2015-08-13  3:30   ` Segher Boessenkool
2015-08-12  1:11 Richard Henderson
2015-08-12  8:32 ` Richard Earnshaw
2015-08-12  8:43   ` Richard Earnshaw
2015-08-12  9:02     ` Richard Earnshaw
2015-08-12 15:45   ` Richard Henderson
2015-08-12  8:32 ` Segher Boessenkool
2015-08-12 15:32   ` Richard Henderson
2015-08-13  3:07     ` Segher Boessenkool
2015-08-13  5:36       ` Segher Boessenkool
2015-08-13  3:10   ` Segher Boessenkool
2015-08-13 11:32     ` David Edelsohn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).