public inbox for gcc-patches@gcc.gnu.org
* RFC: Rewriting auto-inc-dec.c
@ 2011-07-21 15:53 Richard Sandiford
  2011-07-21 16:01 ` Bernd Schmidt
  2011-08-05 12:27 ` Bernd Schmidt
  0 siblings, 2 replies; 5+ messages in thread
From: Richard Sandiford @ 2011-07-21 15:53 UTC (permalink / raw)
  To: gcc-patches

Many of the NEON load/store instructions only allow addresses of the form:

    (reg rN)
    (post_inc (reg rN))
    (post_modify (reg rN) (reg rM))

with no reg+const alternative.  If vectorised code has several
consecutive loads, it's often better to use a series of post_incs such as:

    *r1++
    *r1++
    *r1++
    *r1++

rather than:

    *r1
    r2 = r1 + size
    *r2
    r3 = r1 + size * 2
    *r3
    r4 = r1 + size * 3
    *r4
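
To make that concrete, here's the shape of kernel where it matters (an
illustrative example, not one of the libav loops; each intrinsic call
becomes one NEON load or store, all from consecutive addresses):

    #include <arm_neon.h>

    /* Illustrative only: each iteration issues four consecutive
       quad-word loads and four consecutive quad-word stores.  Since
       the vld1q/vst1q addresses have no reg+const alternative,
       chaining them with post_incs avoids a string of separate
       address calculations.  */
    void
    scale_block (float *__restrict d, const float *__restrict s, float k)
    {
      for (int i = 0; i < 1024; i += 16)
        {
          float32x4_t a0 = vld1q_f32 (s + i);
          float32x4_t a1 = vld1q_f32 (s + i + 4);
          float32x4_t a2 = vld1q_f32 (s + i + 8);
          float32x4_t a3 = vld1q_f32 (s + i + 12);
          vst1q_f32 (d + i,      vmulq_n_f32 (a0, k));
          vst1q_f32 (d + i + 4,  vmulq_n_f32 (a1, k));
          vst1q_f32 (d + i + 8,  vmulq_n_f32 (a2, k));
          vst1q_f32 (d + i + 12, vmulq_n_f32 (a3, k));
        }
    }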

At the moment, auto-inc-dec.c only considers pairs of instructions,
so it can't optimise this kind of sequence.  The attached patch is a
WIP (but almost complete) attempt to handle longer sequences too.
It's a rewrite of the pass, so I've attached the C file rather than
a diff.

The patch improves the performance of several libav loops on Cortex A8
by up to 8%.  On a popular but unnameable embedded benchmark suite,
it improves the score of three individual tests by 20%.

Like the current auto-inc-dec.c, the pass only considers single basic
blocks.  It might be interesting to relax that restriction in future,
but it wouldn't really help with the kind of cases that matter for NEON.

I haven't tried to implement anything as ambitious as the old
optimise-related-values pass, but I think the new pass structure
would make that kind of optimisation easier to add than it would
be at present.

Apart from the new cases above, the other main change is to use
insn_rtx_cost.  That only works well with an additional patch:

Index: gcc/gcc/rtlanal.c
===================================================================
--- gcc.orig/gcc/rtlanal.c
+++ gcc/gcc/rtlanal.c
@@ -4779,7 +4779,8 @@ insn_rtx_cost (rtx pat, bool speed)
   else
     return 0;
 
-  cost = rtx_cost (SET_SRC (set), SET, speed);
+  cost = (rtx_cost (SET_DEST (set), SET, speed)
+	  + rtx_cost (SET_SRC (set), SET, speed));
   return cost > 0 ? cost : COSTS_N_INSNS (1);
 }
 
which, when I tested it on a few targets a couple of weeks ago, showed
some small CSiBE improvements.
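
To spell out what the rtlanal.c change buys us: for a store such as

    (set (mem (post_inc (reg r1))) (reg r2))

the unpatched code costs only the SET_SRC (a bare register), so the MEM
(and any writeback in its address) never influences the result; with the
change, both sides are counted.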

The new pass should still handle the cases that the current one does,
and, due to value tracking, can handle a few pairs that the current
one can't.

I also have some ARM patches to change the MEM rtx costs (address
writeback is expensive for core Cortex A8 instructions) and to model
address writeback in the Cortex A8 and A9 schedulers.

Using rtx costs does make things worse on some targets, which I think
is due to dubious MEM costs.  m68k is a particularly bad case, because
for -Os, it uses byte counts rather than COSTS_N_INSNS.  The
insn_rtx_cost code above:

   return cost > 0 ? cost : COSTS_N_INSNS (1);

then makes register moves very expensive.

The new pass is still linear, if you take the total cost of the splay tree
lookups to be amortised linear.  I've borrowed the splay-tree.c approach
to these lookups; the outcome of the question I asked on gcc@ recently
was that the current code was arrived at after some experimentation
(and after bad experiences with previous versions).

I didn't see any significant increase in compile time for an extreme
case like:

    #define A *x++ = 1;
    #define B A A A A A A A A
    #define C B B B B B B B B
    #define D C C C C C C C C
    #define E D D D D D D D D
    #define F E E E E E E E E
    void foo (volatile int *x) { E E E }
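
(Each E expands to 8^4 = 4096 copies of "*x++ = 1;", so the function body
above contains 12,288 of them.)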

(it's actually slightly quicker after the patch, presumably because
there are fewer instructions for later passes to handle).

The new pass will take more memory than the old pass though.  What's the
best way of getting a figure?

Tested on arm-linux-gnueabi.  Thoughts?

Richard


[-- Attachment #2: auto-inc-dec.c.bz2 --]
[-- Type: application/x-bzip2, Size: 18409 bytes --]

* Re: RFC: Rewriting auto-inc-dec.c
  2011-07-21 15:53 RFC: Rewriting auto-inc-dec.c Richard Sandiford
@ 2011-07-21 16:01 ` Bernd Schmidt
  2011-07-21 16:13   ` Richard Sandiford
  2011-08-05 12:27 ` Bernd Schmidt
  1 sibling, 1 reply; 5+ messages in thread
From: Bernd Schmidt @ 2011-07-21 16:01 UTC (permalink / raw)
  To: gcc-patches, richard.sandiford

On 07/21/11 17:42, Richard Sandiford wrote:
> Tested on arm-linux-gnueabi.  Thoughts?

I'll try to find some time to look at it a bit. One thing I've always
wanted to do is move auto-inc-dec after reload, so that we can remove
inc_for_reload - do you think your new code could handle this?


Bernd

* Re: RFC: Rewriting auto-inc-dec.c
  2011-07-21 16:01 ` Bernd Schmidt
@ 2011-07-21 16:13   ` Richard Sandiford
  0 siblings, 0 replies; 5+ messages in thread
From: Richard Sandiford @ 2011-07-21 16:13 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc-patches

Bernd Schmidt <bernds@codesourcery.com> writes:
> On 07/21/11 17:42, Richard Sandiford wrote:
>> Tested on arm-linux-gnueabi.  Thoughts?
>
> I'll try to find some time to look at it a bit.

Thanks.

> One thing I've always
> wanted to do is move auto-inc-dec after reload, so that we can remove
> inc_for_reload - do you think your new code could handle this?

Not as things stand, I'm afraid.  In some cases, the new code tries to
create new pseudo registers if there's no obvious base register to reuse.

I suppose we could add some code to try to find a free hard register,
but for the kind of loops that motivated this change, it's probably
better to keep it before reload and get the more realistic sched1
output.

(I think the new code, like the old code, could handle other cases quite
happily after reload.  I did wonder whether we should try to run it twice,
once before reload and once after.)

Richard

* Re: RFC: Rewriting auto-inc-dec.c
  2011-07-21 15:53 RFC: Rewriting auto-inc-dec.c Richard Sandiford
  2011-07-21 16:01 ` Bernd Schmidt
@ 2011-08-05 12:27 ` Bernd Schmidt
  2011-08-10 10:40   ` Richard Sandiford
  1 sibling, 1 reply; 5+ messages in thread
From: Bernd Schmidt @ 2011-08-05 12:27 UTC (permalink / raw)
  To: gcc-patches, richard.sandiford

On 07/21/11 17:42, Richard Sandiford wrote:

> At the moment, auto-inc-dec.c only considers pairs of instructions,
> so it can't optimise this kind of sequence.  The attached patch is a
> WIP (but almost complete) attempt to handle longer sequences too.

So, I promised to look at it; I guess I'd better do so before disappearing
on vacation. It's a lot to take in; so far I think I get the overall
structure and some of the details.

On the whole I think it looks good. I've tried to come up with such
algorithms in the past, but I think I always tried too hard to find
"optimal" solutions. The algorithm here looks reasonable.

>       /* When optimizing for speed, don't introduce dependencies between
>          memory references in the chain and memory references outside of it,
>          since doing so would limit scheduling.  If the chain is truly
>          worthwhile, we can still try again using a new pseudo register.  */

Now, it's good that there is at least some consideration for this, but I
wonder whether it's possible to do better here. The basic principle is
that if something is achievable without autoinc, in the same number of
insns (or a factor 1.x thereof with x small), then it's preferable not
to use autoinc at all.

For example,

>            ... = *r1                ... = *rN++
>            ... = *(r1 + 4)          ... = *rN
>      *(r1 + 4) = ...              *rN++ = ...
>             r2 = r1 + 8              r2 = rN
>            ... = *r2          --->  ... = *rN++
>             r3 = r2 + 4              r3 = rN
>            ... = *r3                ... = *rN++
>             r4 = r1 + 16             r4 = rN
>            ... = *r4                ... = *rN

On many targets, this would best be handled as a series of (reg +
offset) addressing modes. If an update is required, on targets that
support both (reg + offset) and {pre,post}_modify, it would be good to
have an automodify in just one of the addresses; even if a single add insn
is still required, this may be better when the number of memory references
is large enough. Do you think this is something that can be done in the
context of this pass?

Minor details:

I'm not sure all the code is completely safe if a hardreg is used with a
mix of single- and multi-word accesses. Maybe just mark a hardreg that
occurs in a multi-word mode as unavailable for the optimization?

It would be nice not to have semi-identical pairs of functions like
other_uses_before and other_uses_after or valid_mem_add_pair and
valid_add_mem_pair. Can these be merged sensibly?

Just to make sure, my understanding is that this will also generate
(automod (reg1) (plus (reg1) (reg2))). Right?

> /* Represents an offset that is applied to a valno base.  */
> union valno_offset {
>   HOST_WIDE_INT constant;
>   struct valno_def *valno;
> };

Should the comment mention that this is used as a splay tree key? For some
reason I find it slightly scary to use a union as a key; I see that things
like valno_splay have a constant_p parameter - maybe you want two separate
trees instead?

> /* Create a new entry with key KEY in ROOT's derived tree.  */
> 
> static struct valno_def *
> new_splay_valno (bool constant_p, struct valno_def *base,
>                  union valno_offset *offset)

Comment doesn't match definition.

> Using rtx costs does make things worse on some targets, which I think
> is due to dubious MEM costs.  m68k is a particularly bad case, because
> for -Os, it uses byte counts rather than COSTS_N_INSNS.

Let the m68k maintainers worry about that particular problem. The basic
principle must be to have the optimizers use costs reasonably and expect
targets to define them reasonably.


Bernd

* Re: RFC: Rewriting auto-inc-dec.c
  2011-08-05 12:27 ` Bernd Schmidt
@ 2011-08-10 10:40   ` Richard Sandiford
  0 siblings, 0 replies; 5+ messages in thread
From: Richard Sandiford @ 2011-08-10 10:40 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc-patches

Thanks for looking at this.

Bernd Schmidt <bernds@codesourcery.com> writes:
> On 07/21/11 17:42, Richard Sandiford wrote:
>>       /* When optimizing for speed, don't introduce dependencies between
>>          memory references in the chain and memory references outside of it,
>>          since doing so would limit scheduling.  If the chain is truly
>>          worthwhile, we can still try again using a new pseudo register.  */
>
> Now, it's good that there is at least some consideration for this, but I
> wonder whether it's possible to do better here. The basic principle is
> that if something is achievable without autoinc, in the same number of
> insns (or a factor 1.x thereof with x small), then it's preferable not
> to use autoinc at all.

Yeah.  The pass tries to reject changes that have a net zero cost unless
there are specific reasons to believe that later passes would benefit.
(This is the max_cost_delta argument to test_replacements.)

> For example,
>
>>            ... = *r1                ... = *rN++
>>            ... = *(r1 + 4)          ... = *rN
>>      *(r1 + 4) = ...              *rN++ = ...
>>             r2 = r1 + 8              r2 = rN
>>            ... = *r2          --->  ... = *rN++
>>             r3 = r2 + 4              r3 = rN
>>            ... = *r3                ... = *rN++
>>             r4 = r1 + 16             r4 = rN
>>            ... = *r4                ... = *rN
>
> On many targets, this would best be handled as a series of (reg +
> offset) addressing modes.

The idea is that we run after fwprop, so all addresses that are better
expressed as (reg + offset) should already have that form.  I.e. the
pass should be run at a stage where it doesn't have to process unoptimised
addresses (and thus have to "second guess" what later address optimisations
might do).  So the fact that the addresses above are not:

	... = *(r1 + 8)
	... = *(r1 + 12)
	... = *(r1 + 16)

or at least:

	... = *(r1 + 8)
	... = *(r2 + 4)
	... = *(r1 + 16)

should mean that those addresses aren't acceptable on the target,
or at least are more expensive than plain *rN.  This is the kind
of thing we'd see for NEON.

I should explain that in the comment -- thanks for bringing it up.

> If an update is required, on targets that
> support both (reg + offset) and {pre,post}_modify, it would be good to
> have an automodify in just one of the addresses; if a single add insn is
> required this may still be better if the number of memory references is
> large enough. Do you think this is something that can be done in the
> context of this pass?

So something like:

         ... = *r1                ... = *rN, rN += 16
         ... = *(r1 + 100)        ... = *(rN + 84)
         ... = *(r1 + 80)         ... = *(rN + 64)
         ... = *(r1 - 20)         ... = *(rN - 36)
         r1 = r1 + 16

?  Yeah, that could be added.

> I'm not sure all the code is completely safe if a hardreg is used with a
> mix of single- and multi-word accesses.

The code was supposed to handle that.  Is there anything specific that
worries you?  FWIW, the principles were (with an example after the list):

  - in update_liveness, when a multi-register value (rN...rM) is used or
    defined, behave as though each individual register in the range [rN, rM]
    had been used or defined.  (The DF machinery handles this for us;
    there's a separate df_ref for each register in the range.)

  - If we're tracking (rN...rM) as a single register rtx, then any
    non-ref_point use of any register in the range [rN, rM] is recorded
    as an untracked use.

  - If we're tracking (rN...rM) as a single register rtx, then any
    definition of any register in the range [rN, rM] clobbers the
    tracked value, and brings its live range to an end.
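
For example, if we're tracking (reg:DI r4), which occupies r4 and r5,
then a later write to just (reg:SI r5) clobbers the tracked value and
ends its live range, and a use of (reg:SI r4) at anything other than a
ref_point is recorded as an untracked use.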

> Maybe just mark a hardreg that occurs in a multi-word mode as
> unavailable for the optimization?

Well, I'd rather not if we can help it.  AVR in particular benefits from
handling wide hard regs.

> It would be nice not to have semi-identical pairs of functions like
> other_uses_before and other_uses_after or valid_mem_add_pair and
> valid_add_mem_pair. Can these be merged sensibly?

Although the functions are deliberately similar in interface, they're
not really that similar in implementation.  They test different live
ranges and apply different liveness checks.  So I'm not sure there's
much that can be merged.

> Just to make sure, my understanding is that this will also generate
> (automod (reg1) (plus (reg1) (reg2))). Right?

Yeah, that's right.

>> /* Represents an offset that is applied to a valno base.  */
>> union valno_offset {
>>   HOST_WIDE_INT constant;
>>   struct valno_def *valno;
>> };
>
> Should mention that this is used as splay tree key?

Well, it is, but it's also an important part of the valno record.
I mentioned that the offset was the key here:

  /* A splay tree of valnos that use this one as their base.  The splay
     key is the offset.  */

but I agree it's probably worth saying it above the "offset" field itself
as well.

> For some reason I find it slightly scary to use a union as a key; I
> see that things like valno_splay have a constant_p parameter - maybe
> you want two separate trees instead?

Yeah, I can certainly use two trees here.  I was probably overly
influenced by the fact that the current valno_def was a nice round
8 words on 64-bit hosts, so didn't want to add an extra field.

Even so, I think the splay function itself would remain largely
the same.  The offset would still be stored as a union, and we'd
still be using that as the key.  Even if we replace the constant_p
argument with a function callback, that callback would still need to
assert that the valno's constant_p field has the right value, then
extract the relevant field from the union.
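
Something along these lines (very much a sketch: valno_def, valno_offset
and the constant_p/offset fields are the ones from the attached file, but
this particular interface is made up for illustration):

    static int
    compare_valno_offset (bool constant_p, const union valno_offset *key,
                          const struct valno_def *node)
    {
      /* The caller decides which kind of key it is searching for;
         the node must agree.  */
      gcc_assert (node->constant_p == constant_p);
      if (constant_p)
        return (key->constant < node->offset.constant ? -1
                : key->constant > node->offset.constant ? 1 : 0);
      /* Order valno bases by address; identity is all that matters.  */
      return (key->valno < node->offset.valno ? -1
              : key->valno > node->offset.valno ? 1 : 0);
    }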

>> /* Create a new entry with key KEY in ROOT's derived tree.  */
>> 
>> static struct valno_def *
>> new_splay_valno (bool constant_p, struct valno_def *base,
>>                  union valno_offset *offset)
>
> Comment doesn't match definition.

Doh!  Fixed, thanks.

Richard
