public inbox for gcc-patches@gcc.gnu.org
* [RFC][PATCH] Canonicalize address multiplies
@ 2016-10-04 12:53 Wilco Dijkstra
  2016-10-04 14:29 ` Oleg Endo
  2016-10-05  8:07 ` Richard Biener
  0 siblings, 2 replies; 3+ messages in thread
From: Wilco Dijkstra @ 2016-10-04 12:53 UTC (permalink / raw)
  To: GCC Patches; +Cc: nd

GCC currently doesn't canonicalize address expressions. As a result
inefficient code is generated even for trivial index address expressions,
blocking CSE and other optimizations:

int f(int *p, int i) { return p[i+2] + p[i+1]; }

	sxtw	x1, w1
	add	x1, x1, 2
	add	x2, x0, x1, lsl 2
	ldr	w0, [x0, x1, lsl 2]
	ldr	w1, [x2, -4]
	add	w0, w1, w0
	ret

After this patch:

	add	x1, x0, x1, sxtw 2
	ldp	w0, w2, [x1, 4]
	add	w0, w2, w0
	ret

The reason for this is that array index expressions are preferably kept in
the *(p + (i + C0) * C1) form even though on most targets it is better to
make use of an offset in the memory access itself, i.e. *(p + i * C1 + (C0*C1)).

This patch disables the folding in fold_plusminus_mult_expr that changes
the latter form into the former.  Unfortunately it isn't possible to know
it is an address expression, and neither is there a way to decide when
C0*C1 is too complex. 

So is there a better way/place to do this, or do we need an address
canonicalization phase in the tree that ensures we expand addresses in an
efficient manner, taking into account target offsets?


ChangeLog:
2016-10-04  Wilco Dijkstra  <wdijkstr@arm.com>

    gcc/
	* fold-const.c (fold_plusminus_mult_expr): Block folding of immediates
	into multiply.
--

diff --git a/gcc/fold-const.c b/gcc/fold-const.c
index e71ce5e0f23adbb1d4a73506769f7243900cfd2d..bc9fb1e8ff3e33c94e66a2d1282235b71fac2730 100644
--- a/gcc/fold-const.c
+++ b/gcc/fold-const.c
@@ -6912,7 +6912,9 @@ fold_plusminus_mult_expr (location_t loc, enum tree_code code, tree type,
      (A * C) +- A -> A * (C+-1).
      We are most concerned about the case where C is a constant,
      but other combinations show up during loop reduction.  Since
-     it is not difficult, try all four possibilities.  */
+     it is not difficult, try all four possibilities.
+     However avoid moving integer constants into the multiply:
+     (A * C0) +- C1 is better than (A +- (C1/C0)) * C0.  */
 
   if (TREE_CODE (arg0) == MULT_EXPR)
     {
@@ -6920,10 +6922,7 @@ fold_plusminus_mult_expr (location_t loc, enum tree_code code, tree type,
       arg01 = TREE_OPERAND (arg0, 1);
     }
   else if (TREE_CODE (arg0) == INTEGER_CST)
-    {
-      arg00 = build_one_cst (type);
-      arg01 = arg0;
-    }
+    return NULL_TREE;
   else
     {
       /* We cannot generate constant 1 for fract.  */
@@ -6938,20 +6937,7 @@ fold_plusminus_mult_expr (location_t loc, enum tree_code code, tree type,
       arg11 = TREE_OPERAND (arg1, 1);
     }
   else if (TREE_CODE (arg1) == INTEGER_CST)
-    {
-      arg10 = build_one_cst (type);
-      /* As we canonicalize A - 2 to A + -2 get rid of that sign for
-	 the purpose of this canonicalization.  */
-      if (wi::neg_p (arg1, TYPE_SIGN (TREE_TYPE (arg1)))
-	  && negate_expr_p (arg1)
-	  && code == PLUS_EXPR)
-	{
-	  arg11 = negate_expr (arg1);
-	  code = MINUS_EXPR;
-	}
-      else
-	arg11 = arg1;
-    }
+    return NULL_TREE;
   else
     {
       /* We cannot generate constant 1 for fract.  */


* Re: [RFC][PATCH] Canonicalize address multiplies
  2016-10-04 12:53 [RFC][PATCH] Canonicalize address multiplies Wilco Dijkstra
@ 2016-10-04 14:29 ` Oleg Endo
  2016-10-05  8:07 ` Richard Biener
  1 sibling, 0 replies; 3+ messages in thread
From: Oleg Endo @ 2016-10-04 14:29 UTC (permalink / raw)
  To: Wilco Dijkstra, GCC Patches; +Cc: nd, Erik Varga

On Tue, 2016-10-04 at 12:53 +0000, Wilco Dijkstra wrote:
> GCC currently doesn't canonicalize address expressions. As a result
> inefficient code is generated even for trivial index address
> expressions,
> blocking CSE and other optimizations:
> 
> int f(int *p, int i) { return p[i+2] + p[i+1]; }
> 
> 	sxtw	x1, w1
> 	add	x1, x1, 2
> 	add	x2, x0, x1, lsl 2
> 	ldr	w0, [x0, x1, lsl 2]
> 	ldr	w1, [x2, -4]
> 	add	w0, w1, w0
> 	ret
> 
> After this patch:
> 
> 	add	x1, x0, x1, sxtw 2
> 	ldp	w0, w2, [x1, 4]
> 	add	w0, w2, w0
> 	ret
> 
> The reason for this is that array index expressions are preferably
> kept in the *(p + (i + C0) * C1) form even though on most targets it
> is better to make use of an offset in the memory access itself, i.e.
> *(p + i * C1 + (C0*C1)).
> 
> This patch disables the folding in fold_plusminus_mult_expr that
> changes the latter form into the former.  Unfortunately it isn't
> possible to know it is an address expression, and neither is there a
> way to decide when C0*C1 is too complex. 
> 
> So is there a better way/place to do this, or do we need an address
> canonicalization phase in the tree that ensures we expand addresses
> in an efficient manner, taking into account target offsets?

There's been an effort to implement address mode selection (AMS)
optimization in GCC as part of the GSoC program.  It hasn't been
mainlined yet and is currently SH-only, but I'd like to move it
forward and make it available to other backends, too.

It's an RTL pass and works by analyzing memory accesses inside basic
blocks, figuring out the effective address expressions, querying the
backend for address mode alternatives for each memory access and the
associated costs.  With that information it tries to find a minimal
solution (minimizing address register calculations and minimizing
address mode alternative costs), which is currently implemented with
backtracking.

For SH, the AMS pass can convert your example above from this

_f:
	mov	r5,r0
	add	#2,r0
	shll2	r0
	mov	r4,r1
	add	r0,r1
	mov.l	@(r0,r4),r0
	add	#-4,r1
	mov.l	@r1,r2
	rts	
	add	r2,r0

into this:
_f:
	shll2	r5
	add	r5,r4
	mov.l	@(4,r4),r0
	mov.l	@(8,r4),r1
	rts	
	add	r1,r0

.. which is minimal on SH.


It also fixes several missed auto-inc opportunities and was meant to
allow further address mode related optimizations like displacement
range fitting or access reordering.

Although not yet ready for mainline, the current code can be found on
github:

https://github.com/erikvarga/gcc/commits/master

https://github.com/erikvarga/gcc/blob/master/gcc/ams.h
https://github.com/erikvarga/gcc/blob/master/gcc/ams.cc

The way AMS gets address mode information from the backend is different
to GCC's current approach:

https://github.com/erikvarga/gcc/blob/master/gcc/config/sh/sh.c#L11946

Since the SH ISA is a bit irregular, there are a bunch of exceptions
and special cases in the cost calculations which take surrounding insns
and mem accesses into account.  I guess a more regular or less
restrictive ISA wouldn't need too many special cases.


Cheers,
Oleg


* Re: [RFC][PATCH] Canonicalize address multiplies
  2016-10-04 12:53 [RFC][PATCH] Canonicalize address multiplies Wilco Dijkstra
  2016-10-04 14:29 ` Oleg Endo
@ 2016-10-05  8:07 ` Richard Biener
  1 sibling, 0 replies; 3+ messages in thread
From: Richard Biener @ 2016-10-05  8:07 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GCC Patches, nd

On Tue, Oct 4, 2016 at 2:53 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> GCC currently doesn't canonicalize address expressions. As a result
> inefficient code is generated even for trivial index address expressions,
> blocking CSE and other optimizations:
>
> int f(int *p, int i) { return p[i+2] + p[i+1]; }
>
>         sxtw    x1, w1
>         add     x1, x1, 2
>         add     x2, x0, x1, lsl 2
>         ldr     w0, [x0, x1, lsl 2]
>         ldr     w1, [x2, -4]
>         add     w0, w1, w0
>         ret
>
> After this patch:
>
>         add     x1, x0, x1, sxtw 2
>         ldp     w0, w2, [x1, 4]
>         add     w0, w2, w0
>         ret
>
> The reason for this is that array index expressions are preferably kept in
> the *(p + (i + C0) * C1) form even though on most targets it is better to
> make use of an offset in the memory access itself, i.e. *(p + i * C1 + (C0*C1)).
>
> This patch disables the folding in fold_plusminus_mult_expr that changes
> the latter form into the former.  Unfortunately it isn't possible to know
> it is an address expression, and neither is there a way to decide when
> C0*C1 is too complex.
>
> So is there a better way/place to do this, or do we need an address
> canonicalization phase in the tree that ensures we expand addresses in an
> efficient manner, taking into account target offsets?

Note there is also the case where the address expression is not exposed in
GIMPLE until after RTL expansion which is if the array reference is not
based on a pointer but on an actual array.  Then we keep the ARRAY_REF
form in GIMPLE.

The more general issue that needs to be addressed here is expression
re-association to maximize CSE opportunities (or minimize addressing
cost) across all expressions that are "close enough".

We have the re-association pass which does decisions based on
a single expression only (not considering CSE opportunities with others).

Then we have the IVOPTs pass which should do what you want but it
only runs on memory references in loops.  Then we have the SLSR
pass which was supposed to cover the IVOPTs in scalar code case
but it doesn't consider addressing mode costs.

The folding patch disables a (maybe premature) canonicalization
(it's really supposed to be a canonicalization, not an optimization)
which will end up helping a testcase like yours but regressing another
case where the CSE opportunity arises with the other canonicalization.

If you disable this canonicalization, saying A * C0 +- C1 is better than
(A +- (C1/C0)) * C0 then you should at least implement the reverse
canonicalization.  These days the appropriate place to do that is
match.pd (ISTR having a patch moving fold_plusminus_mult_expr to
match.pd, I don't remember what happened to it).
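A rough, untested sketch of what such a reverse pattern might look like in
match.pd syntax (purely illustrative; a real pattern would need guards for
overflow and non-integer types):

```
/* (A +- C1) * C0  ->  A * C0 +- C1 * C0
   Sketch only -- not an actual submitted pattern.  */
(for op (plus minus)
 (simplify
  (mult (op @0 INTEGER_CST@1) INTEGER_CST@2)
  (op (mult @0 @2) (mult @1 @2))))
```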

Richard.



