[PATCH][AArch64] Use LDP/STP in shrinkwrapping

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [PATCH][AArch64] Use LDP/STP in shrinkwrapping
@ 2018-01-05 12:22 Wilco Dijkstra
  2018-01-07 23:02 ` Segher Boessenkool
  2018-02-06  9:49 ` James Greenhalgh
  0 siblings, 2 replies; 14+ messages in thread
From: Wilco Dijkstra @ 2018-01-05 12:22 UTC (permalink / raw)
  To: GCC Patches; +Cc: nd

The shrinkwrap optimization added late in GCC 7 allows each callee-save to
be delayed and done only across blocks which need a particular callee-save.
Although this reduces unnecessary memory traffic on code paths that need
few callee-saves, it typically uses LDR/STR rather than LDP/STP.  The
number of LDP/STP instructions is reduced by ~7%. This means more memory
accesses and increased codesize, ~1.0% on average.

To improve this, if a particular callee-save must be saved/restored, also
add the adjacent callee-save to allow use of LDP/STP.  This significantly
reduces codesize (for example gcc_r, povray_r, parest_r, xalancbmk_r are
1% smaller).  This is a simple fix which can be backported.  A more advanced
approach would scan blocks for pairs of callee-saves, but that requires a
rewrite of all the callee-save code which is too late at this stage.

An example epilog in a shrinkwrapped function before:

ldp    x21, x22, [sp,#16]
ldr    x23, [sp,#32]
ldr    x24, [sp,#40]
ldp    x25, x26, [sp,#48]
ldr    x27, [sp,#64]
ldr    x28, [sp,#72]
ldr    x30, [sp,#80]
ldr    d8, [sp,#88]
ldp    x19, x20, [sp],#96
ret

And after this patch:

ldr    d8, [sp,#88]
ldp    x21, x22, [sp,#16]
ldp    x23, x24, [sp,#32]
ldp    x25, x26, [sp,#48]
ldp    x27, x28, [sp,#64]
ldr    x30, [sp,#80]
ldp    x19, x20, [sp],#96
ret

Passes bootstrap, OK for commit (and backport to GCC7)?

ChangeLog:
2018-01-05  Wilco Dijkstra  <wdijkstr@arm.com>

	* config/aarch64/aarch64.c (aarch64_components_for_bb):
	Increase LDP/STP opportunities by adding adjacent callee-saves.
--

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 9735fc18402dd8fe2fa4022eef4c0522814a0552..da21032b19413d0361b8d30b51a31124eaaa31a1 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -3503,7 +3503,22 @@ aarch64_components_for_bb (basic_block bb)
        && (bitmap_bit_p (in, regno)
 	   || bitmap_bit_p (gen, regno)
 	   || bitmap_bit_p (kill, regno)))
-	  bitmap_set_bit (components, regno);
+      {
+	unsigned regno2, offset, offset2;
+	bitmap_set_bit (components, regno);
+
+	/* If there is a callee-save at an adjacent offset, add it too
+	   to increase the use of LDP/STP.  */
+	offset = cfun->machine->frame.reg_offset[regno];
+	regno2 = ((offset & 8) == 0) ? regno + 1 : regno - 1;
+
+	if (regno2 <= LAST_SAVED_REGNUM)
+	  {
+	    offset2 = cfun->machine->frame.reg_offset[regno2];
+	    if ((offset & ~8) == (offset2 & ~8))
+	      bitmap_set_bit (components, regno2);
+	  }
+      }

   return components;
 }

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][AArch64] Use LDP/STP in shrinkwrapping
  2018-01-05 12:22 [PATCH][AArch64] Use LDP/STP in shrinkwrapping Wilco Dijkstra
@ 2018-01-07 23:02 ` Segher Boessenkool
  2018-01-08 13:31   ` Wilco Dijkstra
  2018-02-06  9:49 ` James Greenhalgh
  1 sibling, 1 reply; 14+ messages in thread
From: Segher Boessenkool @ 2018-01-07 23:02 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GCC Patches, nd

Hi Wilco,

On Fri, Jan 05, 2018 at 12:22:44PM +0000, Wilco Dijkstra wrote:
> An example epilog in a shrinkwrapped function before:
> 
> ldp    x21, x22, [sp,#16]
> ldr    x23, [sp,#32]
> ldr    x24, [sp,#40]
> ldp    x25, x26, [sp,#48]
> ldr    x27, [sp,#64]
> ldr    x28, [sp,#72]
> ldr    x30, [sp,#80]
> ldr    d8, [sp,#88]
> ldp    x19, x20, [sp],#96
> ret

In this example, the compiler already can make a ldp for both x23/x24 and
x27/x28 just fine (if not in emit_epilogue_components, then simply in a
peephole); why did that not work?  Or is this not the actual generated
machine code (and there are labels between the insns, for example)?


Segher

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][AArch64] Use LDP/STP in shrinkwrapping
  2018-01-07 23:02 ` Segher Boessenkool
@ 2018-01-08 13:31   ` Wilco Dijkstra
  2018-01-08 18:32     ` Segher Boessenkool
  0 siblings, 1 reply; 14+ messages in thread
From: Wilco Dijkstra @ 2018-01-08 13:31 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: GCC Patches, nd

Segher Boessenkool wrote:
> On Fri, Jan 05, 2018 at 12:22:44PM +0000, Wilco Dijkstra wrote:
>> An example epilog in a shrinkwrapped function before:
>> 
>> ldp    x21, x22, [sp,#16]
>> ldr    x23, [sp,#32]
>> ldr    x24, [sp,#40]
>> ldp    x25, x26, [sp,#48]
>> ldr    x27, [sp,#64]
>> ldr    x28, [sp,#72]
>> ldr    x30, [sp,#80]
>> ldr    d8, [sp,#88]
>> ldp    x19, x20, [sp],#96
>> ret
>
> In this example, the compiler already can make a ldp for both x23/x24 and
> x27/x28 just fine (if not in emit_epilogue_components, then simply in a
> peephole); why did that not work?  Or is this not the actual generated
> machine code (and there are labels between the insns, for example)?

This block originally had a label in it, 2 blocks emitted identical restores and
then branched to the final epilog. The final epilogue was then duplicated so
we end up with 2 almost identical epilogs of 10 instructions (almost since
there were 1-2 unrelated instructions in both blocks).

Peepholing is very conservative about instructions using SP and won't touch
anything frame related. If this was working better then the backend could just
emit single loads/stores and let peepholing generate LDP/STP.

However this is not the real issue. In the worst case the current code may
only emit LDR and STR. If there are multiple callee-saves in a block, we
want to use LDP/STP, and if there is an odd number of registers, we want
to add a callee-save from an inner block.

Another issue is that after pro_and_epilogue pass I see multiple restores
of the same registers and then a branch to the same block. We should try
to avoid the unnecessary duplication.

Wilco

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][AArch64] Use LDP/STP in shrinkwrapping
  2018-01-08 13:31   ` Wilco Dijkstra
@ 2018-01-08 18:32     ` Segher Boessenkool
  2018-01-08 20:43       ` Wilco Dijkstra
  0 siblings, 1 reply; 14+ messages in thread
From: Segher Boessenkool @ 2018-01-08 18:32 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GCC Patches, nd

On Mon, Jan 08, 2018 at 01:27:24PM +0000, Wilco Dijkstra wrote:
> Segher Boessenkool wrote:
> > On Fri, Jan 05, 2018 at 12:22:44PM +0000, Wilco Dijkstra wrote:
> >> An example epilog in a shrinkwrapped function before:
> >> 
> >> ldpÂ Â Â  x21, x22, [sp,#16]
> >> ldrÂ Â Â  x23, [sp,#32]
> >> ldrÂ Â Â  x24, [sp,#40]
> >> ldpÂ Â Â  x25, x26, [sp,#48]
> >> ldrÂ Â Â  x27, [sp,#64]
> >> ldrÂ Â Â  x28, [sp,#72]
> >> ldrÂ Â Â  x30, [sp,#80]
> >> ldrÂ Â Â  d8, [sp,#88]
> >> ldpÂ Â Â  x19, x20, [sp],#96
> >> ret
> >
> > In this example, the compiler already can make a ldp for both x23/x24 and
> > x27/x28 just fine (if not in emit_epilogue_components, then simply in a
> > peephole); why did that not work?Â  Or is this not the actual generated
> > machine code (and there are labels between the insns, for example)?
> 
> This block originally had a label in it, 2 blocks emitted identical restores and
> then branched to the final epilog. The final epilogue was then duplicated so
> we end up with 2 almost identical epilogs of 10 instructions (almost since
> there were 1-2 unrelated instructions in both blocks).
> 
> Peepholing is very conservative about instructions using SP and won't touch
> anything frame related. If this was working better then the backend could just
> emit single loads/stores and let peepholing generate LDP/STP.

How unfortunate; that should definitely be improved then.

Always pairing two registers together *also* degrades code quality.

> Another issue is that after pro_and_epilogue pass I see multiple restores
> of the same registers and then a branch to the same block. We should try
> to avoid the unnecessary duplication.

It already does that if *all* predecessors of that block do that.  If you
want to do it in other cases, you end up with more jumps.  That may be
beneficial in some cases, of course, but it is not an obvious win (and in
the general case it is, hrm let's use nice words, "terrible").


Segher

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][AArch64] Use LDP/STP in shrinkwrapping
  2018-01-08 18:32     ` Segher Boessenkool
@ 2018-01-08 20:43       ` Wilco Dijkstra
  2018-01-08 21:31         ` Segher Boessenkool
  0 siblings, 1 reply; 14+ messages in thread
From: Wilco Dijkstra @ 2018-01-08 20:43 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: GCC Patches, nd

Segher Boessenkool wrote:
> On Mon, Jan 08, 2018 at 01:27:24PM +0000, Wilco Dijkstra wrote:
>
>> Peepholing is very conservative about instructions using SP and won't touch
>> anything frame related. If this was working better then the backend could just
>> emit single loads/stores and let peepholing generate LDP/STP.
>
> How unfortunate; that should definitely be improved then.

Improving that helps indeed but won't fix the issue. The epilog may not
always be duplicated and merged like in my example. Any subset of saves
and restores may not be pairable, so the worst case still has twice as many
memory accesses.

> Always pairing two registers together *also* degrades code quality.

No, while it's not optimal, it means smaller code and fewer memory accesses.

>> Another issue is that after pro_and_epilogue pass I see multiple restores
>> of the same registers and then a branch to the same block. We should try
>> to avoid the unnecessary duplication.
>
> It already does that if *all* predecessors of that block do that.  If you
> want to do it in other cases, you end up with more jumps.  That may be
> beneficial in some cases, of course, but it is not an obvious win (and in
> the general case it is, hrm let's use nice words, "terrible").

That may well be the problem. So if there are N predecessors, of which N-1
need to restore the same set of callee saves, but one was shrinkwrapped,
N-1 copies of the same restores might be emitted. N could be the number
of blocks in a function - I really hope it doesn't work out like that...

Wilco

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][AArch64] Use LDP/STP in shrinkwrapping
  2018-01-08 20:43       ` Wilco Dijkstra
@ 2018-01-08 21:31         ` Segher Boessenkool
  2018-01-09 12:44           ` Wilco Dijkstra
  0 siblings, 1 reply; 14+ messages in thread
From: Segher Boessenkool @ 2018-01-08 21:31 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GCC Patches, nd

On Mon, Jan 08, 2018 at 08:25:47PM +0000, Wilco Dijkstra wrote:
> > Always pairing two registers together *also* degrades code quality.
> 
> No, while it's not optimal, it means smaller code and fewer memory accesses.

It means you execute *more* memory accesses.  Always.  This may be
sometimes hidden, sure.  I'm not saying you do not want more ldp's;
I'm saying this particular strategy is very far from ideal.

> >> Another issue is that after pro_and_epilogue pass I see multiple restores
> >> of the same registers and then a branch to the same block. We should try
> >> to avoid the unnecessary duplication.
> >
> > It already does that if *all* predecessors of that block do that.Â  If you
> > want to do it in other cases, you end up with more jumps.Â  That may be
> > beneficial in some cases, of course, but it is not an obvious win (and in
> > the general case it is, hrm let's use nice words, "terrible").
> 
> That may well be the problem. So if there are N predecessors, of which N-1
> need to restore the same set of callee saves, but one was shrinkwrapped,
> N-1 copies of the same restores might be emitted. N could be the number
> of blocks in a function - I really hope it doesn't work out like that...

In the worst case it would.  OTOH, joining every combo into blocks costs
O(2**C) (where C is the # components) bb's worst case.

It isn't a simple problem.  The current tuning works pretty well for us,
but no doubt it can be improved!


Segher

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][AArch64] Use LDP/STP in shrinkwrapping
  2018-01-08 21:31         ` Segher Boessenkool
@ 2018-01-09 12:44           ` Wilco Dijkstra
  2018-01-09 14:56             ` Segher Boessenkool
  0 siblings, 1 reply; 14+ messages in thread
From: Wilco Dijkstra @ 2018-01-09 12:44 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: GCC Patches, nd

Segher Boessenkool wrote:
> On Mon, Jan 08, 2018 at 0:25:47PM +0000, Wilco Dijkstra wrote:
>> > Always pairing two registers together *also* degrades code quality.
>> 
>> No, while it's not optimal, it means smaller code and fewer memory accesses.
>
> It means you execute *more* memory accesses.  Always.  This may be
> sometimes hidden, sure.  I'm not saying you do not want more ldp's;
> I'm saying this particular strategy is very far from ideal.

No it means less since the number of memory accesses reduces (memory
bandwidth may increase but that's not an issue). You can argue that different
pairings might work better as you may be able to push some LDP/STPs into
less frequently executed blocks, but finding optimal pairings is a hard problem
since the offsets need to be consecutive.

>> That may well be the problem. So if there are N predecessors, of which N-1
>> need to restore the same set of callee saves, but one was shrinkwrapped,
>> N-1 copies of the same restores might be emitted. N could be the number
>> of blocks in a function - I really hope it doesn't work out like that...
>
> In the worst case it would.  OTOH, joining every combo into blocks costs
> O(2**C) (where C is the # components) bb's worst case.
>
> It isn't a simple problem.  The current tuning works pretty well for us,
> but no doubt it can be improved!

Well if there are C components, we could limit the total number of saves/restores
inserted to say 4C. Similarly common cases could easily share the restores
without increasing the number of branches.

Wilco
    

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][AArch64] Use LDP/STP in shrinkwrapping
  2018-01-09 12:44           ` Wilco Dijkstra
@ 2018-01-09 14:56             ` Segher Boessenkool
  2018-01-10  5:15               ` Andrew Pinski
  0 siblings, 1 reply; 14+ messages in thread
From: Segher Boessenkool @ 2018-01-09 14:56 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GCC Patches, nd

On Tue, Jan 09, 2018 at 12:23:42PM +0000, Wilco Dijkstra wrote:
> Segher Boessenkool wrote:
> > On Mon, Jan 08, 2018 at 0:25:47PM +0000, Wilco Dijkstra wrote:
> >> > Always pairing two registers together *also* degrades code quality.
> >> 
> >> No, while it's not optimal, it means smaller code and fewer memory accesses.
> >
> > It means you execute *more* memory accesses.Â  Always.Â  This may be
> > sometimes hidden, sure.Â  I'm not saying you do not want more ldp's;
> > I'm saying this particular strategy is very far from ideal.
> 
> No it means less since the number of memory accesses reduces (memory
> bandwidth may increase but that's not an issue).

The problem is *more* memory accesses are executed at runtime.  Which is
why separate shrink-wrapping does what it does: to have *fewer* executed.
(It's not just the direct execution cost why that helps: more important
are latencies to dependent ops, microarchitectural traps, etc.).

If you make A always stored whenever B is, and the other way around, the
optimal place to do it will always store at least as often as either A
or B, _but can also store more often than either_.

> >> That may well be the problem. So if there are N predecessors, of which N-1
> >> need to restore the same set of callee saves, but one was shrinkwrapped,
> >> N-1 copies of the same restores might be emitted. N could be the number
> >> of blocks in a function - I really hope it doesn't work out like that...
> >
> > In the worst case it would.Â  OTOH, joining every combo into blocks costs
> > O(2**C) (where C is the # components) bb's worst case.
> >
> > It isn't a simple problem.Â  The current tuning works pretty well for us,
> > but no doubt it can be improved!
> 
> Well if there are C components, we could limit the total number of saves/restores
> inserted to say 4C. Similarly common cases could easily share the restores
> without increasing the number of branches.

It is common to see many saves/restores generated for the exceptional cases.


Segher

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][AArch64] Use LDP/STP in shrinkwrapping
  2018-01-09 14:56             ` Segher Boessenkool
@ 2018-01-10  5:15               ` Andrew Pinski
  2018-01-10 18:42                 ` Segher Boessenkool
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew Pinski @ 2018-01-10  5:15 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: Wilco Dijkstra, GCC Patches, nd

On Tue, Jan 9, 2018 at 6:54 AM, Segher Boessenkool
<segher@kernel.crashing.org> wrote:
> On Tue, Jan 09, 2018 at 12:23:42PM +0000, Wilco Dijkstra wrote:
>> Segher Boessenkool wrote:
>> > On Mon, Jan 08, 2018 at 0:25:47PM +0000, Wilco Dijkstra wrote:
>> >> > Always pairing two registers together *also* degrades code quality.
>> >>
>> >> No, while it's not optimal, it means smaller code and fewer memory accesses.
>> >
>> > It means you execute *more* memory accesses.  Always.  This may be
>> > sometimes hidden, sure.  I'm not saying you do not want more ldp's;
>> > I'm saying this particular strategy is very far from ideal.
>>
>> No it means less since the number of memory accesses reduces (memory
>> bandwidth may increase but that's not an issue).
>
> The problem is *more* memory accesses are executed at runtime.  Which is
> why separate shrink-wrapping does what it does: to have *fewer* executed.
> (It's not just the direct execution cost why that helps: more important
> are latencies to dependent ops, microarchitectural traps, etc.).

On most micro-arch of AARCH64, having one LDP/STP will take just as
long as one LDR/STR as long as it is on the same cache line.
So having one LDP/STP compared to two LDR?STR is much better.  LDP/STP
is considered one memory access really and that is where the confusion
is coming from.  We are reducing the overall number of memory accesses
or keeping it the same on that path.
Hope this explanation allows you to understand why pairing does not
degrade the code quality but improves it overall.

Thanks,
Andrew

>
> If you make A always stored whenever B is, and the other way around, the
> optimal place to do it will always store at least as often as either A
> or B, _but can also store more often than either_.
>
>> >> That may well be the problem. So if there are N predecessors, of which N-1
>> >> need to restore the same set of callee saves, but one was shrinkwrapped,
>> >> N-1 copies of the same restores might be emitted. N could be the number
>> >> of blocks in a function - I really hope it doesn't work out like that...
>> >
>> > In the worst case it would.  OTOH, joining every combo into blocks costs
>> > O(2**C) (where C is the # components) bb's worst case.
>> >
>> > It isn't a simple problem.  The current tuning works pretty well for us,
>> > but no doubt it can be improved!
>>
>> Well if there are C components, we could limit the total number of saves/restores
>> inserted to say 4C. Similarly common cases could easily share the restores
>> without increasing the number of branches.
>
> It is common to see many saves/restores generated for the exceptional cases.
>
>
> Segher

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][AArch64] Use LDP/STP in shrinkwrapping
  2018-01-10  5:15               ` Andrew Pinski
@ 2018-01-10 18:42                 ` Segher Boessenkool
  2018-01-11 15:56                   ` Wilco Dijkstra
  0 siblings, 1 reply; 14+ messages in thread
From: Segher Boessenkool @ 2018-01-10 18:42 UTC (permalink / raw)
  To: Andrew Pinski; +Cc: Wilco Dijkstra, GCC Patches, nd

On Tue, Jan 09, 2018 at 09:13:23PM -0800, Andrew Pinski wrote:
> On Tue, Jan 9, 2018 at 6:54 AM, Segher Boessenkool
> <segher@kernel.crashing.org> wrote:
> > On Tue, Jan 09, 2018 at 12:23:42PM +0000, Wilco Dijkstra wrote:
> >> Segher Boessenkool wrote:
> >> > On Mon, Jan 08, 2018 at 0:25:47PM +0000, Wilco Dijkstra wrote:
> >> >> > Always pairing two registers together *also* degrades code quality.
> >> >>
> >> >> No, while it's not optimal, it means smaller code and fewer memory accesses.
> >> >
> >> > It means you execute *more* memory accesses.  Always.  This may be
> >> > sometimes hidden, sure.  I'm not saying you do not want more ldp's;
> >> > I'm saying this particular strategy is very far from ideal.
> >>
> >> No it means less since the number of memory accesses reduces (memory
> >> bandwidth may increase but that's not an issue).
> >
> > The problem is *more* memory accesses are executed at runtime.  Which is
> > why separate shrink-wrapping does what it does: to have *fewer* executed.
> > (It's not just the direct execution cost why that helps: more important
> > are latencies to dependent ops, microarchitectural traps, etc.).
> 
> On most micro-arch of AARCH64, having one LDP/STP will take just as
> long as one LDR/STR as long as it is on the same cache line.
> So having one LDP/STP compared to two LDR?STR is much better.  LDP/STP
> is considered one memory access really and that is where the confusion
> is coming from.  We are reducing the overall number of memory accesses
> or keeping it the same on that path.
> Hope this explanation allows you to understand why pairing does not
> degrade the code quality but improves it overall.

Of course I see that ldp is useful.  I don't think that this particular
way of forcing more pairs is a good idea.  Needs testing / benchmarking /
instrumentation, and we haven't seen any of that.

Forcing pairs before separate shrink-wrapping reduces the effectiveness
of the latter by a lot.


Segher

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][AArch64] Use LDP/STP in shrinkwrapping
  2018-01-10 18:42                 ` Segher Boessenkool
@ 2018-01-11 15:56                   ` Wilco Dijkstra
  2018-01-11 21:37                     ` Segher Boessenkool
  0 siblings, 1 reply; 14+ messages in thread
From: Wilco Dijkstra @ 2018-01-11 15:56 UTC (permalink / raw)
  To: Segher Boessenkool, Andrew Pinski; +Cc: GCC Patches, nd

Segher Boessenkool wrote:

> Of course I see that ldp is useful.  I don't think that this particular
> way of forcing more pairs is a good idea.  Needs testing / benchmarking /
> instrumentation, and we haven't seen any of that.

I wouldn't propose a patch if it caused slowdowns. In fact I am seeing speedups,
particularly benchmarks which benefit from shrinkwrapping (eg. povray). Int is
flat, and there is an overall gain on fp.

> Forcing pairs before separate shrink-wrapping reduces the effectiveness
> of the latter by a lot.

That doesn't appear to be the case - or at least any reduction in effectiveness is
more than mitigated by the lower number of memory accesses and I-cache misses.
To do better still you'd need to compute an optimal set of pairs, and that is quite
difficult in the current infrastructure. I could dynamically create pairs just in the backend
but that won't be optimal either.

Wilco

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][AArch64] Use LDP/STP in shrinkwrapping
  2018-01-11 15:56                   ` Wilco Dijkstra
@ 2018-01-11 21:37                     ` Segher Boessenkool
  2018-02-03 13:06                       ` Richard Sandiford
  0 siblings, 1 reply; 14+ messages in thread
From: Segher Boessenkool @ 2018-01-11 21:37 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Andrew Pinski, GCC Patches, nd

On Thu, Jan 11, 2018 at 03:35:37PM +0000, Wilco Dijkstra wrote:
> Segher Boessenkool wrote:
> Â  
> > Of course I see that ldp is useful.Â  I don't think that this particular
> > way of forcing more pairs is a good idea.Â  Needs testing / benchmarking /
> > instrumentation, and we haven't seen any of that.
> 
> I wouldn't propose a patch if it caused slowdowns. In fact I am seeing speedups,
> particularly benchmarks which benefit from shrinkwrapping (eg. povray). Int is
> flat, and there is an overall gain on fp.
> 
> > Forcing pairs before separate shrink-wrapping reduces the effectiveness
> > of the latter by a lot.
> 
> That doesn't appear to be the case - or at least any reduction in effectiveness is
> more than mitigated by the lower number of memory accesses and I-cache misses.
> To do better still you'd need to compute an optimal set of pairs, and that is quite
> difficult in the current infrastructure. I could dynamically create pairs just in the backend
> but that won't be optimal either.

Right, I certainly believe forming more pairs before sws (as you do)
improves the code -- but I think forming the pairs only after sws will
work even better.

But yes that is more work to implement, and the benefit (if any) is
unknown.  I hoped I could convince you to try ;-)


Segher

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][AArch64] Use LDP/STP in shrinkwrapping
  2018-01-11 21:37                     ` Segher Boessenkool
@ 2018-02-03 13:06                       ` Richard Sandiford
  0 siblings, 0 replies; 14+ messages in thread
From: Richard Sandiford @ 2018-02-03 13:06 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Wilco Dijkstra, Andrew Pinski, GCC Patches, nd, james.greenhalgh

Segher Boessenkool <segher@kernel.crashing.org> writes:
> On Thu, Jan 11, 2018 at 03:35:37PM +0000, Wilco Dijkstra wrote:
>> Segher Boessenkool wrote:
>>   
>> > Of course I see that ldp is useful.  I don't think that this particular
>> > way of forcing more pairs is a good idea.  Needs testing / benchmarking /
>> > instrumentation, and we haven't seen any of that.
>> 
>> I wouldn't propose a patch if it caused slowdowns. In fact I am seeing
>> speedups,
>> particularly benchmarks which benefit from shrinkwrapping (eg. povray). Int is
>> flat, and there is an overall gain on fp.
>> 
>> > Forcing pairs before separate shrink-wrapping reduces the effectiveness
>> > of the latter by a lot.
>> 
>> That doesn't appear to be the case - or at least any reduction in
>> effectiveness is
>> more than mitigated by the lower number of memory accesses and I-cache misses.
>> To do better still you'd need to compute an optimal set of pairs, and
>> that is quite
>> difficult in the current infrastructure. I could dynamically create
>> pairs just in the backend
>> but that won't be optimal either.
>
> Right, I certainly believe forming more pairs before sws (as you do)
> improves the code -- but I think forming the pairs only after sws will
> work even better.
>
> But yes that is more work to implement, and the benefit (if any) is
> unknown.  I hoped I could convince you to try ;-)

Where we do stand with this patch?  I think it's legitimately a
regression fix; like Wilco, I'm seeing benchmarks in which epilogue
restores are no longer paired for important functions.

Even if things could be improved elsewhere, the patch seems to make
sense on its own and is also nicely localised.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][AArch64] Use LDP/STP in shrinkwrapping
  2018-01-05 12:22 [PATCH][AArch64] Use LDP/STP in shrinkwrapping Wilco Dijkstra
  2018-01-07 23:02 ` Segher Boessenkool
@ 2018-02-06  9:49 ` James Greenhalgh
  1 sibling, 0 replies; 14+ messages in thread
From: James Greenhalgh @ 2018-02-06  9:49 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GCC Patches, nd

On Fri, Jan 05, 2018 at 12:22:44PM +0000, Wilco Dijkstra wrote:
> The shrinkwrap optimization added late in GCC 7 allows each callee-save to
> be delayed and done only across blocks which need a particular callee-save.
> Although this reduces unnecessary memory traffic on code paths that need
> few callee-saves, it typically uses LDR/STR rather than LDP/STP.  The
> number of LDP/STP instructions is reduced by ~7%. This means more memory
> accesses and increased codesize, ~1.0% on average.
> 
> To improve this, if a particular callee-save must be saved/restored, also
> add the adjacent callee-save to allow use of LDP/STP.  This significantly
> reduces codesize (for example gcc_r, povray_r, parest_r, xalancbmk_r are
> 1% smaller).  This is a simple fix which can be backported.  A more advanced
> approach would scan blocks for pairs of callee-saves, but that requires a
> rewrite of all the callee-save code which is too late at this stage.
> 
> An example epilog in a shrinkwrapped function before:
> 
> ldp    x21, x22, [sp,#16]
> ldr    x23, [sp,#32]
> ldr    x24, [sp,#40]
> ldp    x25, x26, [sp,#48]
> ldr    x27, [sp,#64]
> ldr    x28, [sp,#72]
> ldr    x30, [sp,#80]
> ldr    d8, [sp,#88]
> ldp    x19, x20, [sp],#96
> ret
> 
> And after this patch:
> 
> ldr    d8, [sp,#88]
> ldp    x21, x22, [sp,#16]
> ldp    x23, x24, [sp,#32]
> ldp    x25, x26, [sp,#48]
> ldp    x27, x28, [sp,#64]
> ldr    x30, [sp,#80]
> ldp    x19, x20, [sp],#96
> ret
> 
> Passes bootstrap, OK for commit (and backport to GCC7)?

OK.

Thanks,
James

> 
> ChangeLog:
> 2018-01-05  Wilco Dijkstra  <wdijkstr@arm.com>
> 
> 	* config/aarch64/aarch64.c (aarch64_components_for_bb):
> 	Increase LDP/STP opportunities by adding adjacent callee-saves.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2018-02-06  9:49 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-05 12:22 [PATCH][AArch64] Use LDP/STP in shrinkwrapping Wilco Dijkstra
2018-01-07 23:02 ` Segher Boessenkool
2018-01-08 13:31   ` Wilco Dijkstra
2018-01-08 18:32     ` Segher Boessenkool
2018-01-08 20:43       ` Wilco Dijkstra
2018-01-08 21:31         ` Segher Boessenkool
2018-01-09 12:44           ` Wilco Dijkstra
2018-01-09 14:56             ` Segher Boessenkool
2018-01-10  5:15               ` Andrew Pinski
2018-01-10 18:42                 ` Segher Boessenkool
2018-01-11 15:56                   ` Wilco Dijkstra
2018-01-11 21:37                     ` Segher Boessenkool
2018-02-03 13:06                       ` Richard Sandiford
2018-02-06  9:49 ` James Greenhalgh

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).