public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* Re: [PATCH][GCC] Simplification of 1U << (31 - x)
@ 2017-04-13 11:16 Wilco Dijkstra
  2017-04-13 11:21 ` Jakub Jelinek
  0 siblings, 1 reply; 24+ messages in thread
From: Wilco Dijkstra @ 2017-04-13 11:16 UTC (permalink / raw)
  To: Jakub Jelinek, Sudi Das
  Cc: GCC Patches, nd, Richard Earnshaw, James Greenhalgh

>On Wed, Apr 12, 2017 at 09:29:55AM +0000, Sudi Das wrote:
> > Hi all
> > 
> > This is a fix for PR 80131 
> > Currently the code A << (B - C) is not simplified.
>> However at least a more specific case of 1U << (C -x) where C = precision(type) - 1 can be simplified to (1 << C) >> x.
>
> Is that always a win though?

Yes assuming left and right shift have the same performance.

> Some constants have higher costs than others on various targets, some
> significantly higher.  This change might be beneficial only
> if if C is as expensive as 1, then you get rid of a one (typically cheap)
> operation.

Most targets can create the constant cheaply. Targets that can't would need 2
instructions (move+shift) or a literal load. That's not worse compared to the
original sequence (3 operations). The constant can be shared or lifted out of
a loop, so we're saving 1 subtract per use of the sequence.

> Which makes me wonder whether this should be done at GIMPLE time and not
> at RTL time (expansion or combine etc.) when one can compare the RTX costs.
> Or do this at match.pd as canonicalization and then have RTL transformation
> to rewrite such (1U << C) >> X as 1U << (C - X) if the latter is faster (or
> shorter).

I can't see why that would be useful for just this pattern. There are lots more useful
cases where GCC should simplify immediates to fit the target but it doesn't currently.
For example it changes x < 0x100000 to x <= 0xfffff which is worse on many targets.

Wilco

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH][GCC] Simplification of 1U << (31 - x)
  2017-04-13 11:16 [PATCH][GCC] Simplification of 1U << (31 - x) Wilco Dijkstra
@ 2017-04-13 11:21 ` Jakub Jelinek
  2017-04-13 11:33   ` Wilco Dijkstra
  0 siblings, 1 reply; 24+ messages in thread
From: Jakub Jelinek @ 2017-04-13 11:21 UTC (permalink / raw)
  To: Wilco Dijkstra
  Cc: Sudi Das, GCC Patches, nd, Richard Earnshaw, James Greenhalgh

On Thu, Apr 13, 2017 at 11:16:08AM +0000, Wilco Dijkstra wrote:
> >On Wed, Apr 12, 2017 at 09:29:55AM +0000, Sudi Das wrote:
> > > Hi all
> > > 
> > > This is a fix for PR 80131 
> > > Currently the code A << (B - C) is not simplified.
> >> However at least a more specific case of 1U << (C -x) where C = precision(type) - 1 can be simplified to (1 << C) >> x.
> >
> > Is that always a win though?
> 
> Yes assuming left and right shift have the same performance.
> 
> > Some constants have higher costs than others on various targets, some
> > significantly higher.  This change might be beneficial only
> > if if C is as expensive as 1, then you get rid of a one (typically cheap)
> > operation.
> 
> Most targets can create the constant cheaply. Targets that can't would need 2

No.  Some constants sometimes even 7 instructions (e.g. sparc64; not talking
in particular about 1ULL << 63 constant), or have one instruction
that is more expensive than normal small constant load.  Compare say x86_64
movl/movq vs. movabsq, I think the latter has 3 times longer latency on many
CPUs.  So no, I think it isn't an unconditional win.

	Jakub

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH][GCC] Simplification of 1U << (31 - x)
  2017-04-13 11:21 ` Jakub Jelinek
@ 2017-04-13 11:33   ` Wilco Dijkstra
  2017-04-13 11:41     ` Jakub Jelinek
  0 siblings, 1 reply; 24+ messages in thread
From: Wilco Dijkstra @ 2017-04-13 11:33 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Sudi Das, GCC Patches, nd, Richard Earnshaw, James Greenhalgh

Jakub Jelinek wrote:
    
> No.  Some constants sometimes even 7 instructions (e.g. sparc64; not talking
> in particular about 1ULL << 63 constant), or have one instruction
> that is more expensive than normal small constant load.  Compare say x86_64
> movl/movq vs. movabsq, I think the latter has 3 times longer latency on many
> CPUs.  So no, I think it isn't an unconditional win.

We're specifically only talking about the constants (1L << 63), (1 << 31) and (1 << 15).
On all targets these need at most 2 simple instructions. That makes it an unconditional win.

Wilco

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH][GCC] Simplification of 1U << (31 - x)
  2017-04-13 11:33   ` Wilco Dijkstra
@ 2017-04-13 11:41     ` Jakub Jelinek
  2017-04-13 11:49       ` Richard Biener
  0 siblings, 1 reply; 24+ messages in thread
From: Jakub Jelinek @ 2017-04-13 11:41 UTC (permalink / raw)
  To: Wilco Dijkstra
  Cc: Sudi Das, GCC Patches, nd, Richard Earnshaw, James Greenhalgh

On Thu, Apr 13, 2017 at 11:33:12AM +0000, Wilco Dijkstra wrote:
> Jakub Jelinek wrote:
>     
> > No.  Some constants sometimes even 7 instructions (e.g. sparc64; not talking
> > in particular about 1ULL << 63 constant), or have one instruction
> > that is more expensive than normal small constant load.  Compare say x86_64
> > movl/movq vs. movabsq, I think the latter has 3 times longer latency on many
> > CPUs.  So no, I think it isn't an unconditional win.
> 
> We're specifically only talking about the constants (1L << 63), (1 << 31) and (1 << 15).
> On all targets these need at most 2 simple instructions. That makes it an unconditional win.

It is not a win on at least Haswell-E:
__attribute__((noinline, noclone)) unsigned long long int
foo (int x)
{
  asm volatile ("" : : : "memory");
  return 1ULL << (63 - x);
}

__attribute__((noinline, noclone)) unsigned long long int
bar (int x)
{
  asm volatile ("" : : : "memory");
  return (1ULL << 63) >> x;
}

int
main (int argc, const char **argv)
{
  int i;
  if (argc == 1)
    for (i = 0; i < 1000000000; i++)
      asm volatile ("" : : "r" (foo (13)));
  else
    for (i = 0; i < 1000000000; i++)
      asm volatile ("" : : "r" (bar (13)));
  return 0;
}

$ time /tmp/test

real	0m1.290s
user	0m1.288s
sys	0m0.002s
$ time /tmp/test 1

real	0m1.542s
user	0m1.540s
sys	0m0.002s

As I said, movabsq is expensive.

	Jakub

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH][GCC] Simplification of 1U << (31 - x)
  2017-04-13 11:41     ` Jakub Jelinek
@ 2017-04-13 11:49       ` Richard Biener
  2017-04-13 11:55         ` Jakub Jelinek
  2017-04-13 12:01         ` Wilco Dijkstra
  0 siblings, 2 replies; 24+ messages in thread
From: Richard Biener @ 2017-04-13 11:49 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Wilco Dijkstra, Sudi Das, GCC Patches, nd, Richard Earnshaw,
	James Greenhalgh

On Thu, Apr 13, 2017 at 1:41 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Thu, Apr 13, 2017 at 11:33:12AM +0000, Wilco Dijkstra wrote:
>> Jakub Jelinek wrote:
>>
>> > No.  Some constants sometimes even 7 instructions (e.g. sparc64; not talking
>> > in particular about 1ULL << 63 constant), or have one instruction
>> > that is more expensive than normal small constant load.  Compare say x86_64
>> > movl/movq vs. movabsq, I think the latter has 3 times longer latency on many
>> > CPUs.  So no, I think it isn't an unconditional win.
>>
>> We're specifically only talking about the constants (1L << 63), (1 << 31) and (1 << 15).
>> On all targets these need at most 2 simple instructions. That makes it an unconditional win.
>
> It is not a win on at least Haswell-E:
> __attribute__((noinline, noclone)) unsigned long long int
> foo (int x)
> {
>   asm volatile ("" : : : "memory");
>   return 1ULL << (63 - x);
> }
>
> __attribute__((noinline, noclone)) unsigned long long int
> bar (int x)
> {
>   asm volatile ("" : : : "memory");
>   return (1ULL << 63) >> x;
> }
>
> int
> main (int argc, const char **argv)
> {
>   int i;
>   if (argc == 1)
>     for (i = 0; i < 1000000000; i++)
>       asm volatile ("" : : "r" (foo (13)));
>   else
>     for (i = 0; i < 1000000000; i++)
>       asm volatile ("" : : "r" (bar (13)));
>   return 0;
> }
>
> $ time /tmp/test
>
> real    0m1.290s
> user    0m1.288s
> sys     0m0.002s
> $ time /tmp/test 1
>
> real    0m1.542s
> user    0m1.540s
> sys     0m0.002s
>
> As I said, movabsq is expensive.

It is IMHO a valid GIMPLE optimization / canonicalization.

        movabsq $-9223372036854775808, %rax

so this should then have been generated as 1<<63?

At some point variable shifts were quite expensive as well..

Richard.

>         Jakub

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH][GCC] Simplification of 1U << (31 - x)
  2017-04-13 11:49       ` Richard Biener
@ 2017-04-13 11:55         ` Jakub Jelinek
  2017-04-13 12:01         ` Wilco Dijkstra
  1 sibling, 0 replies; 24+ messages in thread
From: Jakub Jelinek @ 2017-04-13 11:55 UTC (permalink / raw)
  To: Richard Biener, Uros Bizjak
  Cc: Wilco Dijkstra, Sudi Das, GCC Patches, nd, Richard Earnshaw,
	James Greenhalgh

On Thu, Apr 13, 2017 at 01:49:01PM +0200, Richard Biener wrote:
> It is IMHO a valid GIMPLE optimization / canonicalization.

As I said, we can do it as GIMPLE canonicalization, but
we should have code to undo it if beneficial at RTL level.
And the patch has not included that.
> 
>         movabsq $-9223372036854775808, %rax
> 
> so this should then have been generated as 1<<63?

Maybe.  But it seems to be still more expensive than the original code,
though better than movabsq:

__attribute__((noinline, noclone)) unsigned long long int
foo (int x)
{
  asm volatile ("" : : : "memory");
  return 1ULL << (63 - x);
}

__attribute__((noinline, noclone)) unsigned long long int
bar (int x)
{
  asm volatile ("" : : : "memory");
  return (1ULL << 63) >> x;
}

__attribute__((noinline, noclone)) unsigned long long int
baz (int x)
{
  unsigned long long int y = 1;
  asm volatile ("" : "+r" (y) : : "memory");
  return (y << 63) >> x;
}

int
main (int argc, const char **argv)
{
  int i;
  if (argc == 1)
    for (i = 0; i < 1000000000; i++)
      asm volatile ("" : : "r" (foo (13)));
  else if (argc == 2)
    for (i = 0; i < 1000000000; i++)
      asm volatile ("" : : "r" (bar (13)));
  else if (argc == 3)
    for (i = 0; i < 1000000000; i++)
      asm volatile ("" : : "r" (baz (13)));
  return 0;
}

	Jakub

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH][GCC] Simplification of 1U << (31 - x)
  2017-04-13 11:49       ` Richard Biener
  2017-04-13 11:55         ` Jakub Jelinek
@ 2017-04-13 12:01         ` Wilco Dijkstra
  2017-08-01  9:15           ` Sudi Das
  1 sibling, 1 reply; 24+ messages in thread
From: Wilco Dijkstra @ 2017-04-13 12:01 UTC (permalink / raw)
  To: Richard Biener, Jakub Jelinek
  Cc: Sudi Das, GCC Patches, nd, Richard Earnshaw, James Greenhalgh

Richard Biener wrote:
> It is IMHO a valid GIMPLE optimization / canonicalization.
>
>        movabsq $-9223372036854775808, %rax
>
> so this should then have been generated as 1<<63?
>
> At some point variable shifts were quite expensive as well..

Yes I don't see a major difference between movabsq and

        movl    $1, %eax
        salq    $63, %rax

on my Sandy Bridge, but if the above is faster then that is what the x64
backend should emit - it's 1 byte smaller as well, so probably better in all
cases.

Wilco

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH][GCC] Simplification of 1U << (31 - x)
  2017-04-13 12:01         ` Wilco Dijkstra
@ 2017-08-01  9:15           ` Sudi Das
  2017-08-04 10:16             ` Richard Biener
  0 siblings, 1 reply; 24+ messages in thread
From: Sudi Das @ 2017-08-01  9:15 UTC (permalink / raw)
  To: Wilco Dijkstra, Richard Biener, Jakub Jelinek
  Cc: GCC Patches, nd, Richard Earnshaw, James Greenhalgh



 

Sorry about the delayed response but looking at the above discussion, should I conclude that this is a valid tree simplification?
I am pasting the diff of the assembly that AArch64 generates with the test case that I added. I see fewer instructions generated with the patch.

--- pr80131-1.s    2017-08-01 10:02:43.243374174 +0100
+++ pr80131-1.s-patched    2017-08-01 10:00:54.776455630 +0100
@@ -24,10 +24,8 @@
     str    x0, [sp, 8]
     ldr    x0, [sp, 8]
     mov    w1, w0
-    mov    w0, 63
-    sub    w0, w0, w1
-    mov    x1, 1
-    lsl    x0, x1, x0
+    mov    x0, -9223372036854775808
+    lsr    x0, x0, x1
     add    sp, sp, 16
     ret
     .size    f2, .-f2
@@ -39,10 +37,8 @@
     str    x0, [sp, 8]
     ldr    x0, [sp, 8]
     mov    w1, w0
-    mov    w0, 63
-    sub    w0, w0, w1
-    mov    x1, 1
-    lsl    x0, x1, x0
+    mov    x0, -9223372036854775808
+    lsr    x0, x0, x1
     add    sp, sp, 16
     ret
     .size    f3, .-f3
@@ -52,11 +48,9 @@
 f4:
     sub    sp, sp, #16
     str    w0, [sp, 12]
-    mov    w1, 31
     ldr    w0, [sp, 12]
-    sub    w0, w1, w0
-    mov    w1, 1
-    lsl    w0, w1, w0
+    mov    w1, -2147483648
+    lsr    w0, w1, w0
     add    sp, sp, 16
     ret
     .size    f4, .-f4


Thanks

Sudi 




From: Wilco Dijkstra
Sent: Thursday, April 13, 2017 1:01 PM
To: Richard Biener; Jakub Jelinek
Cc: Sudi Das; GCC Patches; nd; Richard Earnshaw; James Greenhalgh
Subject: Re: [PATCH][GCC] Simplification of 1U << (31 - x)
    
Richard Biener wrote:
> It is IMHO a valid GIMPLE optimization / canonicalization.
>
>        movabsq $-9223372036854775808, %rax
>
> so this should then have been generated as 1<<63?
>
> At some point variable shifts were quite expensive as well..

Yes I don't see a major difference between movabsq and

        movl    $1, %eax
        salq    $63, %rax

on my Sandy Bridge, but if the above is faster then that is what the x64
backend should emit - it's 1 byte smaller as well, so probably better in all
cases.

Wilco     

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH][GCC] Simplification of 1U << (31 - x)
  2017-08-01  9:15           ` Sudi Das
@ 2017-08-04 10:16             ` Richard Biener
  2017-09-26 12:44               ` Sudi Das
  0 siblings, 1 reply; 24+ messages in thread
From: Richard Biener @ 2017-08-04 10:16 UTC (permalink / raw)
  To: Sudi Das
  Cc: Wilco Dijkstra, Jakub Jelinek, GCC Patches, nd, Richard Earnshaw,
	James Greenhalgh

On Tue, Aug 1, 2017 at 11:14 AM, Sudi Das <Sudi.Das@arm.com> wrote:
>
>
>
>
> Sorry about the delayed response but looking at the above discussion, should I conclude that this is a valid tree simplification?

Yes, I think so.  Jakub requested code to undo this at RTL expansion
based on target costs, not sure if we really should
require that from you given the user could have written the target
sequence himself.

Few comments about the patch:

+/* Fold (1 << (C - x)) where C = precision(type) - 1
+   into ((1 << C) >> x). */
+(simplify
+ (lshift integer_onep@0 (minus INTEGER_CST@1 @2))

I think this warrants a single_use check on the minus (note :s isn't enough
as with the unsigned case we'd happily ignore it by design).

+  (if (INTEGRAL_TYPE_P (type)
+       && TYPE_PRECISION (type) <= HOST_BITS_PER_WIDE_INT
+       && tree_to_uhwi (@1) == (unsigned)(TYPE_PRECISION (type) - 1))

You can relax this with using

          && wi::eq_p (@1, TYPE_PRECISION (type) - 1)

+   (if (TYPE_UNSIGNED(type))
+     (rshift (lshift @0 @1) @2)
+   (with
+    { tree utype = unsigned_type_for (type); }
+    (convert:type (rshift (lshift (convert:utype @0) @1) @2))))))
+

You can write (convert (rshift ...)), without the :type.

I'm leaving it to Jakub whether you need to write that RTL expansion tweak.

Thanks,
Richard.

> I am pasting the diff of the assembly that AArch64 generates with the test case that I added. I see fewer instructions generated with the patch.
>
> --- pr80131-1.s    2017-08-01 10:02:43.243374174 +0100
> +++ pr80131-1.s-patched    2017-08-01 10:00:54.776455630 +0100
> @@ -24,10 +24,8 @@
>      str    x0, [sp, 8]
>      ldr    x0, [sp, 8]
>      mov    w1, w0
> -    mov    w0, 63
> -    sub    w0, w0, w1
> -    mov    x1, 1
> -    lsl    x0, x1, x0
> +    mov    x0, -9223372036854775808
> +    lsr    x0, x0, x1
>      add    sp, sp, 16
>      ret
>      .size    f2, .-f2
> @@ -39,10 +37,8 @@
>      str    x0, [sp, 8]
>      ldr    x0, [sp, 8]
>      mov    w1, w0
> -    mov    w0, 63
> -    sub    w0, w0, w1
> -    mov    x1, 1
> -    lsl    x0, x1, x0
> +    mov    x0, -9223372036854775808
> +    lsr    x0, x0, x1
>      add    sp, sp, 16
>      ret
>      .size    f3, .-f3
> @@ -52,11 +48,9 @@
>  f4:
>      sub    sp, sp, #16
>      str    w0, [sp, 12]
> -    mov    w1, 31
>      ldr    w0, [sp, 12]
> -    sub    w0, w1, w0
> -    mov    w1, 1
> -    lsl    w0, w1, w0
> +    mov    w1, -2147483648
> +    lsr    w0, w1, w0
>      add    sp, sp, 16
>      ret
>      .size    f4, .-f4
>
>
> Thanks
>
> Sudi
>
>
>
>
> From: Wilco Dijkstra
> Sent: Thursday, April 13, 2017 1:01 PM
> To: Richard Biener; Jakub Jelinek
> Cc: Sudi Das; GCC Patches; nd; Richard Earnshaw; James Greenhalgh
> Subject: Re: [PATCH][GCC] Simplification of 1U << (31 - x)
>
> Richard Biener wrote:
>> It is IMHO a valid GIMPLE optimization / canonicalization.
>>
>>        movabsq $-9223372036854775808, %rax
>>
>> so this should then have been generated as 1<<63?
>>
>> At some point variable shifts were quite expensive as well..
>
> Yes I don't see a major difference between movabsq and
>
>         movl    $1, %eax
>         salq    $63, %rax
>
> on my Sandy Bridge, but if the above is faster then that is what the x64
> backend should emit - it's 1 byte smaller as well, so probably better in all
> cases.
>
> Wilco

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH][GCC] Simplification of 1U << (31 - x)
  2017-08-04 10:16             ` Richard Biener
@ 2017-09-26 12:44               ` Sudi Das
  2017-09-26 13:06                 ` Jakub Jelinek
  2017-10-09 12:05                 ` Richard Biener
  0 siblings, 2 replies; 24+ messages in thread
From: Sudi Das @ 2017-09-26 12:44 UTC (permalink / raw)
  To: Richard Biener
  Cc: Wilco Dijkstra, Jakub Jelinek, GCC Patches, nd, Richard Earnshaw,
	James Greenhalgh

[-- Attachment #1: Type: text/plain, Size: 3899 bytes --]


Still waiting on Jakub's comment on whether there are more things needed at the backend. But I have updated the patch according to Richard's comments.

Thanks
Sudi



From: Richard Biener <richard.guenther@gmail.com>
Sent: Friday, August 4, 2017 11:16 AM
To: Sudi Das
Cc: Wilco Dijkstra; Jakub Jelinek; GCC Patches; nd; Richard Earnshaw; James Greenhalgh
Subject: Re: [PATCH][GCC] Simplification of 1U << (31 - x)
    
On Tue, Aug 1, 2017 at 11:14 AM, Sudi Das <Sudi.Das@arm.com> wrote:
>
>
>
>
> Sorry about the delayed response but looking at the above discussion, should I conclude that this is a valid tree simplification?

Yes, I think so.  Jakub requested code to undo this at RTL expansion
based on target costs, not sure if we really should
require that from you given the user could have written the target
sequence himself.

Few comments about the patch:

+/* Fold (1 << (C - x)) where C = precision(type) - 1
+   into ((1 << C) >> x). */
+(simplify
+ (lshift integer_onep@0 (minus INTEGER_CST@1 @2))

I think this warrants a single_use check on the minus (note :s isn't enough
as with the unsigned case we'd happily ignore it by design).

+  (if (INTEGRAL_TYPE_P (type)
+       && TYPE_PRECISION (type) <= HOST_BITS_PER_WIDE_INT
+       && tree_to_uhwi (@1) == (unsigned)(TYPE_PRECISION (type) - 1))

You can relax this with using

          && wi::eq_p (@1, TYPE_PRECISION (type) - 1)

+   (if (TYPE_UNSIGNED(type))
+     (rshift (lshift @0 @1) @2)
+   (with
+    { tree utype = unsigned_type_for (type); }
+    (convert:type (rshift (lshift (convert:utype @0) @1) @2))))))
+

You can write (convert (rshift ...)), without the :type.

I'm leaving it to Jakub whether you need to write that RTL expansion tweak.

Thanks,
Richard.

> I am pasting the diff of the assembly that AArch64 generates with the test case that I added. I see fewer instructions generated with the patch.
>
> --- pr80131-1.s    2017-08-01 10:02:43.243374174 +0100
> +++ pr80131-1.s-patched    2017-08-01 10:00:54.776455630 +0100
> @@ -24,10 +24,8 @@
>      str    x0, [sp, 8]
>      ldr    x0, [sp, 8]
>      mov    w1, w0
> -    mov    w0, 63
> -    sub    w0, w0, w1
> -    mov    x1, 1
> -    lsl    x0, x1, x0
> +    mov    x0, -9223372036854775808
> +    lsr    x0, x0, x1
>      add    sp, sp, 16
>      ret
>      .size    f2, .-f2
> @@ -39,10 +37,8 @@
>      str    x0, [sp, 8]
>      ldr    x0, [sp, 8]
>      mov    w1, w0
> -    mov    w0, 63
> -    sub    w0, w0, w1
> -    mov    x1, 1
> -    lsl    x0, x1, x0
> +    mov    x0, -9223372036854775808
> +    lsr    x0, x0, x1
>      add    sp, sp, 16
>      ret
>      .size    f3, .-f3
> @@ -52,11 +48,9 @@
>  f4:
>      sub    sp, sp, #16
>      str    w0, [sp, 12]
> -    mov    w1, 31
>      ldr    w0, [sp, 12]
> -    sub    w0, w1, w0
> -    mov    w1, 1
> -    lsl    w0, w1, w0
> +    mov    w1, -2147483648
> +    lsr    w0, w1, w0
>      add    sp, sp, 16
>      ret
>      .size    f4, .-f4
>
>
> Thanks
>
> Sudi
>
>
>
>
> From: Wilco Dijkstra
> Sent: Thursday, April 13, 2017 1:01 PM
> To: Richard Biener; Jakub Jelinek
> Cc: Sudi Das; GCC Patches; nd; Richard Earnshaw; James Greenhalgh
> Subject: Re: [PATCH][GCC] Simplification of 1U << (31 - x)
>
> Richard Biener wrote:
>> It is IMHO a valid GIMPLE optimization / canonicalization.
>>
>>        movabsq $-9223372036854775808, %rax
>>
>> so this should then have been generated as 1<<63?
>>
>> At some point variable shifts were quite expensive as well..
>
> Yes I don't see a major difference between movabsq and
>
>         movl    $1, %eax
>         salq    $63, %rax
>
> on my Sandy Bridge, but if the above is faster then that is what the x64
> backend should emit - it's 1 byte smaller as well, so probably better in all
> cases.
>
> Wilco
    

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: patch-7259-5.diff --]
[-- Type: text/x-patch; name="patch-7259-5.diff", Size: 1762 bytes --]

diff --git a/gcc/match.pd b/gcc/match.pd
index e9017e4..160c12d 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -600,6 +600,19 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
        && tree_nop_conversion_p (type, TREE_TYPE (@1)))
    (lshift @0 @2)))
 
+/* Fold (1 << (C - x)) where C = precision(type) - 1
+   into ((1 << C) >> x). */
+(simplify
+ (lshift integer_onep@0 (minus@1 INTEGER_CST@2 @3))
+  (if (INTEGRAL_TYPE_P (type)
+       && wi::eq_p (@2, TYPE_PRECISION (type) - 1)
+       && single_use (@1))
+   (if (TYPE_UNSIGNED(type))
+     (rshift (lshift @0 @2) @3)
+   (with
+    { tree utype = unsigned_type_for (type); }
+    (convert (rshift (lshift (convert:utype @0) @2) @3))))))
+
 /* Fold (C1/X)*C2 into (C1*C2)/X.  */
 (simplify
  (mult (rdiv@3 REAL_CST@0 @1) REAL_CST@2)
diff --git a/gcc/testsuite/gcc.dg/pr80131-1.c b/gcc/testsuite/gcc.dg/pr80131-1.c
new file mode 100644
index 0000000..317ea3e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr80131-1.c
@@ -0,0 +1,30 @@
+/* { dg-do compile } */
+/* { dg-options "-fdump-tree-gimple" } */
+
+/* Checks the simplification of:
+   1 << (C - x) to (1 << C) >> x, where C = precision (type) - 1
+   f1 is not simplified but f2, f3 and f4 are. */
+
+__INT64_TYPE__ f1 (__INT64_TYPE__ i)
+{
+  return (__INT64_TYPE__)1 << (31 - i);
+}
+
+__INT64_TYPE__ f2 (__INT64_TYPE__ i)
+{
+  return (__INT64_TYPE__)1 << (63 - i);
+}
+
+__UINT64_TYPE__ f3 (__INT64_TYPE__ i)
+{
+  return (__UINT64_TYPE__)1 << (63 - i);
+}
+
+__INT32_TYPE__ f4 (__INT32_TYPE__ i)
+{
+  return (__INT32_TYPE__)1 << (31 - i);
+}
+
+/* { dg-final { scan-tree-dump-times "= 31 -"  1 "gimple" } } */
+/* { dg-final { scan-tree-dump-times "9223372036854775808 >>" 2 "gimple" } } */
+/* { dg-final { scan-tree-dump "2147483648 >>" "gimple" } } */

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH][GCC] Simplification of 1U << (31 - x)
  2017-09-26 12:44               ` Sudi Das
@ 2017-09-26 13:06                 ` Jakub Jelinek
  2017-09-26 13:20                   ` Wilco Dijkstra
  2017-10-09 12:05                 ` Richard Biener
  1 sibling, 1 reply; 24+ messages in thread
From: Jakub Jelinek @ 2017-09-26 13:06 UTC (permalink / raw)
  To: Sudi Das
  Cc: Richard Biener, Wilco Dijkstra, GCC Patches, nd,
	Richard Earnshaw, James Greenhalgh

On Tue, Sep 26, 2017 at 12:44:10PM +0000, Sudi Das wrote:
> 
> Still waiting on Jakub's comment on whether there are more things needed
> at the backend.  But I have updated the patch according to Richard's
> comments.

Well, we don't want to regress performance wise on one of the most important
primary targets.  I don't care that much if the RTL/backend work is done
together with the patch, or as a follow-up during stage1/3, but it should be
done, the testcases I've posted can be used as a basis of a P1 runtime
performance regression.

	Jakub

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH][GCC] Simplification of 1U << (31 - x)
  2017-09-26 13:06                 ` Jakub Jelinek
@ 2017-09-26 13:20                   ` Wilco Dijkstra
  2017-10-06 17:00                     ` Sudi Das
  0 siblings, 1 reply; 24+ messages in thread
From: Wilco Dijkstra @ 2017-09-26 13:20 UTC (permalink / raw)
  To: Sudi Das, Jakub Jelinek
  Cc: Richard Biener, GCC Patches, nd, Richard Earnshaw, James Greenhalgh

Jakub Jelinek wrote:

> Well, we don't want to regress performance wise on one of the most important
> primary targets.  I don't care that much if the RTL/backend work is done
> together with the patch, or as a follow-up during stage1/3, but it should be
> done, the testcases I've posted can be used as a basis of a P1 runtime
> performance regression.

It should be sufficient to file a bug about inefficient 64-bit constant expansions on
x64. I didn't see a significant difference in my benchmarking of it on x64, so I'd say
it's only a performance regression if large benchmarks regress measurably (quite
unlikely).

Wilco

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH][GCC] Simplification of 1U << (31 - x)
  2017-09-26 13:20                   ` Wilco Dijkstra
@ 2017-10-06 17:00                     ` Sudi Das
  0 siblings, 0 replies; 24+ messages in thread
From: Sudi Das @ 2017-10-06 17:00 UTC (permalink / raw)
  To: Wilco Dijkstra, Jakub Jelinek
  Cc: Richard Biener, GCC Patches, nd, Richard Earnshaw, James Greenhalgh


Hi Jakub

As per the discussions, I have a created a bug report for the possible regression this may cause.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82454

Sudi


From: Wilco Dijkstra
Sent: Tuesday, September 26, 2017 2:20 PM
To: Sudi Das; Jakub Jelinek
Cc: Richard Biener; GCC Patches; nd; Richard Earnshaw; James Greenhalgh
Subject: Re: [PATCH][GCC] Simplification of 1U << (31 - x)
    
Jakub Jelinek wrote:

> Well, we don't want to regress performance wise on one of the most important
> primary targets.  I don't care that much if the RTL/backend work is done
> together with the patch, or as a follow-up during stage1/3, but it should be
> done, the testcases I've posted can be used as a basis of a P1 runtime
> performance regression.

It should be sufficient to file a bug about inefficient 64-bit constant expansions on
x64. I didn't see a significant difference in my benchmarking of it on x64, so I'd say
it's only a performance regression if large benchmarks regress measurably (quite
unlikely).

Wilco    

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH][GCC] Simplification of 1U << (31 - x)
  2017-09-26 12:44               ` Sudi Das
  2017-09-26 13:06                 ` Jakub Jelinek
@ 2017-10-09 12:05                 ` Richard Biener
  2017-10-09 13:04                   ` Wilco Dijkstra
  1 sibling, 1 reply; 24+ messages in thread
From: Richard Biener @ 2017-10-09 12:05 UTC (permalink / raw)
  To: Sudi Das
  Cc: Wilco Dijkstra, Jakub Jelinek, GCC Patches, nd, Richard Earnshaw,
	James Greenhalgh

On Tue, Sep 26, 2017 at 2:44 PM, Sudi Das <Sudi.Das@arm.com> wrote:
>
> Still waiting on Jakub's comment on whether there are more things needed at the backend. But I have updated the patch according to Richard's comments.

+   (if (TYPE_UNSIGNED(type))

space before '(type)'.

+     (rshift (lshift @0 @2) @3)
+   (with
+    { tree utype = unsigned_type_for (type); }
+    (convert (rshift (lshift (convert:utype @0) @2) @3))))))

I think your testcase needs a

{ dg-require-effective-target int32plus }

as otherwise it'll fail on AVR.

I think the patch is ok with these changes but obviously we should try
to address
the code-generation issue on x86 at RTL expansion time.  They are sort-of
existing missing optimizations.

Richard.

> Thanks
> Sudi
>
>
>
> From: Richard Biener <richard.guenther@gmail.com>
> Sent: Friday, August 4, 2017 11:16 AM
> To: Sudi Das
> Cc: Wilco Dijkstra; Jakub Jelinek; GCC Patches; nd; Richard Earnshaw; James Greenhalgh
> Subject: Re: [PATCH][GCC] Simplification of 1U << (31 - x)
>
> On Tue, Aug 1, 2017 at 11:14 AM, Sudi Das <Sudi.Das@arm.com> wrote:
>>
>>
>>
>>
>> Sorry about the delayed response but looking at the above discussion, should I conclude that this is a valid tree simplification?
>
> Yes, I think so.  Jakub requested code to undo this at RTL expansion
> based on target costs, not sure if we really should
> require that from you given the user could have written the target
> sequence himself.
>
> Few comments about the patch:
>
> +/* Fold (1 << (C - x)) where C = precision(type) - 1
> +   into ((1 << C) >> x). */
> +(simplify
> + (lshift integer_onep@0 (minus INTEGER_CST@1 @2))
>
> I think this warrants a single_use check on the minus (note :s isn't enough
> as with the unsigned case we'd happily ignore it by design).
>
> +  (if (INTEGRAL_TYPE_P (type)
> +       && TYPE_PRECISION (type) <= HOST_BITS_PER_WIDE_INT
> +       && tree_to_uhwi (@1) == (unsigned)(TYPE_PRECISION (type) - 1))
>
> You can relax this with using
>
>           && wi::eq_p (@1, TYPE_PRECISION (type) - 1)
>
> +   (if (TYPE_UNSIGNED(type))
> +     (rshift (lshift @0 @1) @2)
> +   (with
> +    { tree utype = unsigned_type_for (type); }
> +    (convert:type (rshift (lshift (convert:utype @0) @1) @2))))))
> +
>
> You can write (convert (rshift ...)), without the :type.
>
> I'm leaving it to Jakub whether you need to write that RTL expansion tweak.
>
> Thanks,
> Richard.
>
>> I am pasting the diff of the assembly that AArch64 generates with the test case that I added. I see fewer instructions generated with the patch.
>>
>> --- pr80131-1.s    2017-08-01 10:02:43.243374174 +0100
>> +++ pr80131-1.s-patched    2017-08-01 10:00:54.776455630 +0100
>> @@ -24,10 +24,8 @@
>>      str    x0, [sp, 8]
>>      ldr    x0, [sp, 8]
>>      mov    w1, w0
>> -    mov    w0, 63
>> -    sub    w0, w0, w1
>> -    mov    x1, 1
>> -    lsl    x0, x1, x0
>> +    mov    x0, -9223372036854775808
>> +    lsr    x0, x0, x1
>>      add    sp, sp, 16
>>      ret
>>      .size    f2, .-f2
>> @@ -39,10 +37,8 @@
>>      str    x0, [sp, 8]
>>      ldr    x0, [sp, 8]
>>      mov    w1, w0
>> -    mov    w0, 63
>> -    sub    w0, w0, w1
>> -    mov    x1, 1
>> -    lsl    x0, x1, x0
>> +    mov    x0, -9223372036854775808
>> +    lsr    x0, x0, x1
>>      add    sp, sp, 16
>>      ret
>>      .size    f3, .-f3
>> @@ -52,11 +48,9 @@
>>  f4:
>>      sub    sp, sp, #16
>>      str    w0, [sp, 12]
>> -    mov    w1, 31
>>      ldr    w0, [sp, 12]
>> -    sub    w0, w1, w0
>> -    mov    w1, 1
>> -    lsl    w0, w1, w0
>> +    mov    w1, -2147483648
>> +    lsr    w0, w1, w0
>>      add    sp, sp, 16
>>      ret
>>      .size    f4, .-f4
>>
>>
>> Thanks
>>
>> Sudi
>>
>>
>>
>>
>> From: Wilco Dijkstra
>> Sent: Thursday, April 13, 2017 1:01 PM
>> To: Richard Biener; Jakub Jelinek
>> Cc: Sudi Das; GCC Patches; nd; Richard Earnshaw; James Greenhalgh
>> Subject: Re: [PATCH][GCC] Simplification of 1U << (31 - x)
>>
>> Richard Biener wrote:
>>> It is IMHO a valid GIMPLE optimization / canonicalization.
>>>
>>>        movabsq $-9223372036854775808, %rax
>>>
>>> so this should then have been generated as 1<<63?
>>>
>>> At some point variable shifts were quite expensive as well..
>>
>> Yes I don't see a major difference between movabsq and
>>
>>         movl    $1, %eax
>>         salq    $63, %rax
>>
>> on my Sandy Bridge, but if the above is faster then that is what the x64
>> backend should emit - it's 1 byte smaller as well, so probably better in all
>> cases.
>>
>> Wilco
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH][GCC] Simplification of 1U << (31 - x)
  2017-10-09 12:05                 ` Richard Biener
@ 2017-10-09 13:04                   ` Wilco Dijkstra
  2017-10-10 11:12                     ` Sudi Das
  0 siblings, 1 reply; 24+ messages in thread
From: Wilco Dijkstra @ 2017-10-09 13:04 UTC (permalink / raw)
  To: Richard Biener, Sudi Das
  Cc: Jakub Jelinek, GCC Patches, nd, Richard Earnshaw, James Greenhalgh

Richard Biener wrote:

> I think the patch is ok with these changes but obviously we should try
> to address
> the code-generation issue on x86 at RTL expansion time.  They are sort-of
> existing missing optimizations.

Note the only x64 specific issue is the backend expansion of 64-bit immediates
which could be improved like I suggested. However what we're really missing
is a generic optimization pass that tries to simplify immediates using accurate
target costs. In eg. x >= C or x > C we can use either C or C-1 - on many targets
one option may be a single instruction, while the other might take 2 or even 3.

Wilco

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH][GCC] Simplification of 1U << (31 - x)
  2017-10-09 13:04                   ` Wilco Dijkstra
@ 2017-10-10 11:12                     ` Sudi Das
  2017-11-07 12:37                       ` Wilco Dijkstra
  0 siblings, 1 reply; 24+ messages in thread
From: Sudi Das @ 2017-10-10 11:12 UTC (permalink / raw)
  To: Wilco Dijkstra, Richard Biener
  Cc: Jakub Jelinek, GCC Patches, nd, Richard Earnshaw, James Greenhalgh

[-- Attachment #1: Type: text/plain, Size: 1467 bytes --]



Thanks, I have made the changes to the patch.
Also can someone please apply it for me. I do not have commit access.

2017-10-10  Sudakshina Das  <sudi.das@arm.com>

	PR middle-end/80131
	* match.pd: Simplify 1 << (C - x) where C = precision (x) - 1.

2017-10-10  Sudakshina Das  <sudi.das@arm.com>

	PR middle-end/80131
	* testsuite/gcc.dg/pr80131-1.c: New Test.


With regards to the existing missed optimizations needed to the x86 RTL expansion, I think the discussions can take place on the bug report that I created and maybe someone will pick it up.

Thanks
Sudi




From: Wilco Dijkstra
Sent: Monday, October 9, 2017 2:02 PM
To: Richard Biener; Sudi Das
Cc: Jakub Jelinek; GCC Patches; nd; Richard Earnshaw; James Greenhalgh
Subject: Re: [PATCH][GCC] Simplification of 1U << (31 - x)
    
Richard Biener wrote:

> I think the patch is ok with these changes but obviously we should try
> to address
> the code-generation issue on x86 at RTL expansion time.  They are sort-of
> existing missing optimizations.

Note the only x64 specific issue is the backend expansion of 64-bit immediates
which could be improved like I suggested. However what we're really missing
is a generic optimization pass that tries to simplify immediates using accurate
target costs. In eg. x >= C or x > C we can use either C or C-1 - on many targets
one option may be a single instruction, while the other might take 2 or even 3.

Wilco    

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: patch-7259-6.diff --]
[-- Type: text/x-patch; name="patch-7259-6.diff", Size: 1812 bytes --]

diff --git a/gcc/match.pd b/gcc/match.pd
index e58a65a..7a25a1b 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -600,6 +600,19 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
        && tree_nop_conversion_p (type, TREE_TYPE (@1)))
    (lshift @0 @2)))
 
+/* Fold (1 << (C - x)) where C = precision(type) - 1
+   into ((1 << C) >> x). */
+(simplify
+ (lshift integer_onep@0 (minus@1 INTEGER_CST@2 @3))
+  (if (INTEGRAL_TYPE_P (type)
+       && wi::eq_p (@2, TYPE_PRECISION (type) - 1)
+       && single_use (@1))
+   (if (TYPE_UNSIGNED (type))
+     (rshift (lshift @0 @2) @3)
+   (with
+    { tree utype = unsigned_type_for (type); }
+    (convert (rshift (lshift (convert:utype @0) @2) @3))))))
+
 /* Fold (C1/X)*C2 into (C1*C2)/X.  */
 (simplify
  (mult (rdiv@3 REAL_CST@0 @1) REAL_CST@2)
diff --git a/gcc/testsuite/gcc.dg/pr80131-1.c b/gcc/testsuite/gcc.dg/pr80131-1.c
new file mode 100644
index 0000000..0bfe1f4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr80131-1.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target int32plus } */
+/* { dg-options "-fdump-tree-gimple" } */
+
+/* Checks the simplification of:
+   1 << (C - x) to (1 << C) >> x, where C = precision (type) - 1
+   f1 is not simplified but f2, f3 and f4 are. */
+
+__INT64_TYPE__ f1 (__INT64_TYPE__ i)
+{
+  return (__INT64_TYPE__)1 << (31 - i);
+}
+
+__INT64_TYPE__ f2 (__INT64_TYPE__ i)
+{
+  return (__INT64_TYPE__)1 << (63 - i);
+}
+
+__UINT64_TYPE__ f3 (__INT64_TYPE__ i)
+{
+  return (__UINT64_TYPE__)1 << (63 - i);
+}
+
+__INT32_TYPE__ f4 (__INT32_TYPE__ i)
+{
+  return (__INT32_TYPE__)1 << (31 - i);
+}
+
+/* { dg-final { scan-tree-dump-times "= 31 -"  1 "gimple" } } */
+/* { dg-final { scan-tree-dump-times "9223372036854775808 >>" 2 "gimple" } } */
+/* { dg-final { scan-tree-dump "2147483648 >>" "gimple" } } */

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH][GCC] Simplification of 1U << (31 - x)
  2017-10-10 11:12                     ` Sudi Das
@ 2017-11-07 12:37                       ` Wilco Dijkstra
  2017-11-07 14:48                         ` Christophe Lyon
  0 siblings, 1 reply; 24+ messages in thread
From: Wilco Dijkstra @ 2017-11-07 12:37 UTC (permalink / raw)
  To: Sudi Das, Richard Biener
  Cc: Jakub Jelinek, GCC Patches, nd, Richard Earnshaw, James Greenhalgh

Sudi Das wrote:

> Thanks, I have made the changes to the patch.
> Also can someone please apply it for me. I do not have commit access.
>
> 2017-10-10  Sudakshina Das  <sudi.das@arm.com>
>
>        PR middle-end/80131
>        * match.pd: Simplify 1 << (C - x) where C = precision (x) - 1.
>
> 2017-10-10  Sudakshina Das  <sudi.das@arm.com>
>
>        PR middle-end/80131
>        * testsuite/gcc.dg/pr80131-1.c: New Test.
>
>
> With regards to the existing missed optimizations needed to the x86 RTL expansion, 
> I think the discussions can take place on the bug report that I created and maybe someone will pick it up.

I've committed this as r254496.

Wilco

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH][GCC] Simplification of 1U << (31 - x)
  2017-11-07 12:37                       ` Wilco Dijkstra
@ 2017-11-07 14:48                         ` Christophe Lyon
  2017-11-07 14:49                           ` Wilco Dijkstra
  0 siblings, 1 reply; 24+ messages in thread
From: Christophe Lyon @ 2017-11-07 14:48 UTC (permalink / raw)
  To: Wilco Dijkstra
  Cc: Sudi Das, Richard Biener, Jakub Jelinek, GCC Patches, nd,
	Richard Earnshaw, James Greenhalgh

Hi Wilco

On 7 November 2017 at 13:28, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> Sudi Das wrote:
>
>> Thanks, I have made the changes to the patch.
>> Also can someone please apply it for me. I do not have commit access.
>>
>> 2017-10-10  Sudakshina Das  <sudi.das@arm.com>
>>
>>        PR middle-end/80131
>>        * match.pd: Simplify 1 << (C - x) where C = precision (x) - 1.
>>
>> 2017-10-10  Sudakshina Das  <sudi.das@arm.com>
>>
>>        PR middle-end/80131
>>        * testsuite/gcc.dg/pr80131-1.c: New Test.
>>
>>
>> With regards to the existing missed optimizations needed to the x86 RTL expansion,
>> I think the discussions can take place on the bug report that I created and maybe someone will pick it up.
>
> I've committed this as r254496.

This causes my builds (all arm and aarch64 targets) to fail:
g++ -fno-PIE -c   -g -O2 -DIN_GCC  -DCROSS_DIRECTORY_STRUCTURE
-fno-exceptions -fno-rtti -fasynchronous-unwind-tables -W -Wall
-Wwrite-strings -Wcast-qual -Wmissing-format-attribute
-Woverloaded-virtual -pedantic -Wno-long-long -Wno-variadic-macros
-Wno-overlength-strings -fno-common -Wno-unused -DHAVE_CONFIG_H -I.
-I. -I/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc
-I/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/.
-I/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/../include
-I/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/../libcpp/include
-I/tmp/1717606_6.tmpdir/aci-gcc-fsf/builds/gcc-fsf-gccsrc/obj-aarch64-none-linux-gnu/gcc1/./gmp
-I/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gmp
-I/tmp/1717606_6.tmpdir/aci-gcc-fsf/builds/gcc-fsf-gccsrc/obj-aarch64-none-linux-gnu/gcc1/./mpfr/src
-I/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/mpfr/src
-I/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/mpc/src
-I/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/../libdecnumber
-I/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/../libdecnumber/dpd
-I../libdecnumber
-I/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/../libbacktrace
  -o gimple-match.o -MT gimple-match.o -MMD -MP -MF
./.deps/gimple-match.TPo gimple-match.c
In file included from
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/coretypes.h:397,
                 from
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/gimple-match-head.c:22,
                 from gimple-match.c:4:
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:
In function ‘bool wi::eq_p(const T1&, const T2&) [with T1 =
tree_node*, T2 = int]’:
gimple-match.c:49533:   instantiated from here
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1764:
error: incomplete type ‘wi::int_traits<tree_node*>’ used in nested
name specifier
gimple-match.c:49533:   instantiated from here
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1764:
error: incomplete type ‘wi::int_traits<tree_node*>’ used in nested
name specifier
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1764:
error: incomplete type ‘wi::int_traits<tree_node*>’ used in nested
name specifier
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1764:
error: incomplete type ‘wi::int_traits<tree_node*>’ used in nested
name specifier
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1764:
error: incomplete type ‘wi::int_traits<tree_node*>’ used in nested
name specifier
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1764:
error: incomplete type ‘wi::int_traits<tree_node*>’ used in nested
name specifier
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1764:
error: incomplete type ‘wi::int_traits<tree_node*>’ used in nested
name specifier
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1764:
error: incomplete type ‘wi::int_traits<tree_node*>’ used in nested
name specifier
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1764:
error: incomplete type ‘wi::int_traits<tree_node*>’ used in nested
name specifier
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1764:
error: incomplete type ‘wi::int_traits<tree_node*>’ used in nested
name specifier
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:
In function ‘unsigned int wi::get_binary_precision(const T1&, const
T2&) [with T1 = tree_node*, T2 = int]’:
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1763:
  instantiated from ‘bool wi::eq_p(const T1&, const T2&) [with T1 =
tree_node*, T2 = int]’
gimple-match.c:49533:   instantiated from here
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1696:
error: incomplete type ‘wi::int_traits<tree_node*>’ used in nested
name specifier
make[2]: *** [gimple-match.o] Error 1

Can you have a look?

>
> Wilco

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH][GCC] Simplification of 1U << (31 - x)
  2017-11-07 14:48                         ` Christophe Lyon
@ 2017-11-07 14:49                           ` Wilco Dijkstra
  0 siblings, 0 replies; 24+ messages in thread
From: Wilco Dijkstra @ 2017-11-07 14:49 UTC (permalink / raw)
  To: Christophe Lyon
  Cc: Sudi Das, Richard Biener, Jakub Jelinek, GCC Patches, nd,
	Richard Earnshaw, James Greenhalgh

Christophe Lyon wrote:

> This causes my builds (all arm and aarch64 targets) to fail:

Richard Biener already committed a fix in r254498 (thanks).
It seems constants in match.pd now need wi::to_wide.

Wilco

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH][GCC] Simplification of 1U << (31 - x)
  2017-04-12 18:59     ` Jakub Jelinek
@ 2017-04-12 19:34       ` Segher Boessenkool
  0 siblings, 0 replies; 24+ messages in thread
From: Segher Boessenkool @ 2017-04-12 19:34 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Sudi Das, gcc-patches, Marcus Shawcroft, Richard Earnshaw,
	James Greenhalgh, nd

On Wed, Apr 12, 2017 at 08:59:34PM +0200, Jakub Jelinek wrote:
> On Wed, Apr 12, 2017 at 01:15:56PM -0500, Segher Boessenkool wrote:
> > On Wed, Apr 12, 2017 at 07:06:38PM +0200, Jakub Jelinek wrote:
> > > On Wed, Apr 12, 2017 at 09:29:55AM +0000, Sudi Das wrote:
> > > > This is a fix for PR 80131 
> > > > Currently the code A << (B - C) is not simplified.
> > > > However at least a more specific case of 1U << (C -x) where C = precision(type) - 1 can be simplified to (1 << C) >> x.
> > > 
> > > Is that always a win though?
> > > Some constants have higher costs than others on various targets, some
> > > significantly higher.  This change might be beneficial only
> > > if if C is as expensive as 1, then you get rid of a one (typically cheap)
> > > operation.
> > > Which makes me wonder whether this should be done at GIMPLE time and not
> > > at RTL time (expansion or combine etc.) when one can compare the RTX costs.
> > 
> > Yeah, either combine or simplify-rtx I'd say.
> > 
> > The transform    A << (B - C)  --->   (A << B) >> C
> > only works if A << B does not overflow but A << (B + 1) does (and then
> 
> Is that really true?  The A << B does not overflow is obvious precondition.
> 
> But consider say A 5U, B 29 and C (not compile time known) -2.

Ah yes, A unsigned.  Bah.

> 5U << 31 is valid 0x80000000U, but (5U << 29) >> (-2) is UB.
> Isn't the other condition instead that either C must be non-negative, or
> B must be number of bits in A's type - 1, i.e. that for negative C
> A << (B - C) would already be always UB?
> But then unless we know C is non-negative, A must be really just 1,
> otherwise A << B overflows.

Yeah.


Segher

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH][GCC] Simplification of 1U << (31 - x)
  2017-04-12 18:16   ` Segher Boessenkool
@ 2017-04-12 18:59     ` Jakub Jelinek
  2017-04-12 19:34       ` Segher Boessenkool
  0 siblings, 1 reply; 24+ messages in thread
From: Jakub Jelinek @ 2017-04-12 18:59 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Sudi Das, gcc-patches, Marcus Shawcroft, Richard Earnshaw,
	James Greenhalgh, nd

On Wed, Apr 12, 2017 at 01:15:56PM -0500, Segher Boessenkool wrote:
> Hi,
> 
> On Wed, Apr 12, 2017 at 07:06:38PM +0200, Jakub Jelinek wrote:
> > On Wed, Apr 12, 2017 at 09:29:55AM +0000, Sudi Das wrote:
> > > This is a fix for PR 80131 
> > > Currently the code A << (B - C) is not simplified.
> > > However at least a more specific case of 1U << (C -x) where C = precision(type) - 1 can be simplified to (1 << C) >> x.
> > 
> > Is that always a win though?
> > Some constants have higher costs than others on various targets, some
> > significantly higher.  This change might be beneficial only
> > if if C is as expensive as 1, then you get rid of a one (typically cheap)
> > operation.
> > Which makes me wonder whether this should be done at GIMPLE time and not
> > at RTL time (expansion or combine etc.) when one can compare the RTX costs.
> 
> Yeah, either combine or simplify-rtx I'd say.
> 
> The transform    A << (B - C)  --->   (A << B) >> C
> only works if A << B does not overflow but A << (B + 1) does (and then

Is that really true?  The A << B does not overflow is obvious precondition.

But consider say A 5U, B 29 and C (not compile time known) -2.
5U << 31 is valid 0x80000000U, but (5U << 29) >> (-2) is UB.
Isn't the other condition instead that either C must be non-negative, or
B must be number of bits in A's type - 1, i.e. that for negative C
A << (B - C) would already be always UB?
But then unless we know C is non-negative, A must be really just 1,
otherwise A << B overflows.

> always does work afaics).  Or if we know C is non-negative and A << B
> does not overflow.  So realistically A and B need to be constant.
> 
> > Or do this at match.pd as canonicalization and then have RTL transformation
> > to rewrite such (1U << C) >> X as 1U << (C - X) if the latter is faster (or
> > shorter).
> 
> The inverse transform only works for A=1, not for the more general case.

	Jakub

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH][GCC] Simplification of 1U << (31 - x)
  2017-04-12 17:06 ` Jakub Jelinek
@ 2017-04-12 18:16   ` Segher Boessenkool
  2017-04-12 18:59     ` Jakub Jelinek
  0 siblings, 1 reply; 24+ messages in thread
From: Segher Boessenkool @ 2017-04-12 18:16 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Sudi Das, gcc-patches, Marcus Shawcroft, Richard Earnshaw,
	James Greenhalgh, nd

Hi,

On Wed, Apr 12, 2017 at 07:06:38PM +0200, Jakub Jelinek wrote:
> On Wed, Apr 12, 2017 at 09:29:55AM +0000, Sudi Das wrote:
> > This is a fix for PR 80131 
> > Currently the code A << (B - C) is not simplified.
> > However at least a more specific case of 1U << (C -x) where C = precision(type) - 1 can be simplified to (1 << C) >> x.
> 
> Is that always a win though?
> Some constants have higher costs than others on various targets, some
> significantly higher.  This change might be beneficial only
> if if C is as expensive as 1, then you get rid of a one (typically cheap)
> operation.
> Which makes me wonder whether this should be done at GIMPLE time and not
> at RTL time (expansion or combine etc.) when one can compare the RTX costs.

Yeah, either combine or simplify-rtx I'd say.

The transform    A << (B - C)  --->   (A << B) >> C
only works if A << B does not overflow but A << (B + 1) does (and then
always does work afaics).  Or if we know C is non-negative and A << B
does not overflow.  So realistically A and B need to be constant.

> Or do this at match.pd as canonicalization and then have RTL transformation
> to rewrite such (1U << C) >> X as 1U << (C - X) if the latter is faster (or
> shorter).

The inverse transform only works for A=1, not for the more general case.


Segher

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH][GCC] Simplification of 1U << (31 - x)
  2017-04-12  9:30 Sudi Das
@ 2017-04-12 17:06 ` Jakub Jelinek
  2017-04-12 18:16   ` Segher Boessenkool
  0 siblings, 1 reply; 24+ messages in thread
From: Jakub Jelinek @ 2017-04-12 17:06 UTC (permalink / raw)
  To: Sudi Das
  Cc: gcc-patches, Marcus Shawcroft, Richard Earnshaw, James Greenhalgh, nd

On Wed, Apr 12, 2017 at 09:29:55AM +0000, Sudi Das wrote:
> Hi all
> 
> This is a fix for PR 80131 
> Currently the code A << (B - C) is not simplified.
> However at least a more specific case of 1U << (C -x) where C = precision(type) - 1 can be simplified to (1 << C) >> x.

Is that always a win though?
Some constants have higher costs than others on various targets, some
significantly higher.  This change might be beneficial only
if if C is as expensive as 1, then you get rid of a one (typically cheap)
operation.
Which makes me wonder whether this should be done at GIMPLE time and not
at RTL time (expansion or combine etc.) when one can compare the RTX costs.
Or do this at match.pd as canonicalization and then have RTL transformation
to rewrite such (1U << C) >> X as 1U << (C - X) if the latter is faster (or
shorter).

	Jakub

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH][GCC] Simplification of 1U << (31 - x)
@ 2017-04-12  9:30 Sudi Das
  2017-04-12 17:06 ` Jakub Jelinek
  0 siblings, 1 reply; 24+ messages in thread
From: Sudi Das @ 2017-04-12  9:30 UTC (permalink / raw)
  To: gcc-patches; +Cc: Marcus Shawcroft, Richard Earnshaw, James Greenhalgh, nd

[-- Attachment #1: Type: text/plain, Size: 1006 bytes --]

Hi all

This is a fix for PR 80131 
Currently the code A << (B - C) is not simplified.
However at least a more specific case of 1U << (C -x) where C = precision(type) - 1 can be simplified to (1 << C) >> x.

This is done by adding a new simplification rule in match.pd

So for a test case :

unsigned int f1(unsigned int i)
{
  return 1U << (31 - i);
}

We see a gimple dump of 

f1 (unsigned int i)
{
  unsigned int D.3121;

  D.3121 = 2147483648 >> i;
  return D.3121;
}

instead of 

f1 (unsigned int i)
{
  unsigned int D.3121;

  _1 = 31 - i;
  D.3121 = 1 << _1;
  return D.3121;
}


Add a new test case and checked for regressions on bootstrapped aarch64-none-linux-gnu.
Ok for stage 1?

Thanks
Sudi

2017-03-23  Sudakshina Das  <sudi.das@arm.com>

	PR middle-end/80131
	* match.pd: Simplify 1 << (C - x) where C = precision (type) - 1.

2017-03-23  Sudakshina Das  <sudi.das@arm.com>

	PR middle-end/80131
	* testsuite/gcc.dg/pr80131-1.c: New Test.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: patch-7259-4(1).diff --]
[-- Type: text/x-patch; name="patch-7259-4(1).diff", Size: 1815 bytes --]

diff --git a/gcc/match.pd b/gcc/match.pd
index 7b96800..be20fb7 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -508,6 +508,19 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
        && tree_nop_conversion_p (type, TREE_TYPE (@1)))
    (lshift @0 @2)))
 
+/* Fold (1 << (C - x)) where C = precision(type) - 1
+   into ((1 << C) >> x). */
+(simplify
+ (lshift integer_onep@0 (minus INTEGER_CST@1 @2))
+  (if (INTEGRAL_TYPE_P (type)
+       && TYPE_PRECISION (type) <= HOST_BITS_PER_WIDE_INT
+       && tree_to_uhwi (@1) == (unsigned)(TYPE_PRECISION (type) - 1))
+   (if (TYPE_UNSIGNED(type))
+     (rshift (lshift @0 @1) @2)
+   (with
+    { tree utype = unsigned_type_for (type); }
+    (convert:type (rshift (lshift (convert:utype @0) @1) @2))))))
+
 /* Fold (C1/X)*C2 into (C1*C2)/X.  */
 (simplify
  (mult (rdiv@3 REAL_CST@0 @1) REAL_CST@2)
diff --git a/gcc/testsuite/gcc.dg/pr80131-1.c b/gcc/testsuite/gcc.dg/pr80131-1.c
new file mode 100644
index 0000000..2bb6ff3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr80131-1.c
@@ -0,0 +1,30 @@
+/* { dg-do compile } */
+/* { dg-options "-fdump-tree-gimple" } */
+
+/* Checks the simplification of:
+   1 << (C - x) to (1 << C) >> x, where C = precision (type) - 1
+   f1 is not simplified but f2, f3 and f4 are. */
+
+__INT64_TYPE__ f1 (__INT64_TYPE__ i)
+{
+  return (__INT64_TYPE__)1 << (31 - i);
+}
+
+__INT64_TYPE__ f2 (__INT64_TYPE__ i)
+{
+  return (__INT64_TYPE__)1 << (63 - i);
+}
+
+__UINT64_TYPE__ f3 (__INT64_TYPE__ i)
+{
+  return (__UINT64_TYPE__)1 << (63 - i);
+}
+
+__INT32_TYPE__ f4 (__INT32_TYPE__ i)
+{
+  return (__INT32_TYPE__)1 << (31 - i);
+}
+
+/* { dg-final { scan-tree-dump-times "= 31 -"  1 "gimple" } } */
+/* { dg-final { scan-tree-dump-times "9223372036854775808 >>" 2 "gimple" } } */
+/* { dg-final { scan-tree-dump "2147483648 >>" "gimple" } } */

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2017-11-07 13:42 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-13 11:16 [PATCH][GCC] Simplification of 1U << (31 - x) Wilco Dijkstra
2017-04-13 11:21 ` Jakub Jelinek
2017-04-13 11:33   ` Wilco Dijkstra
2017-04-13 11:41     ` Jakub Jelinek
2017-04-13 11:49       ` Richard Biener
2017-04-13 11:55         ` Jakub Jelinek
2017-04-13 12:01         ` Wilco Dijkstra
2017-08-01  9:15           ` Sudi Das
2017-08-04 10:16             ` Richard Biener
2017-09-26 12:44               ` Sudi Das
2017-09-26 13:06                 ` Jakub Jelinek
2017-09-26 13:20                   ` Wilco Dijkstra
2017-10-06 17:00                     ` Sudi Das
2017-10-09 12:05                 ` Richard Biener
2017-10-09 13:04                   ` Wilco Dijkstra
2017-10-10 11:12                     ` Sudi Das
2017-11-07 12:37                       ` Wilco Dijkstra
2017-11-07 14:48                         ` Christophe Lyon
2017-11-07 14:49                           ` Wilco Dijkstra
  -- strict thread matches above, loose matches on Subject: below --
2017-04-12  9:30 Sudi Das
2017-04-12 17:06 ` Jakub Jelinek
2017-04-12 18:16   ` Segher Boessenkool
2017-04-12 18:59     ` Jakub Jelinek
2017-04-12 19:34       ` Segher Boessenkool

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).