public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH] PR target/97312: Tweak gcc.target/aarch64/pr90838.c
@ 2020-10-08 10:33 Wilco Dijkstra
  0 siblings, 0 replies; 12+ messages in thread
From: Wilco Dijkstra @ 2020-10-08 10:33 UTC (permalink / raw)
  To: aldyh; +Cc: GCC Patches

Hi,

> I am quoting my analysis from the PR.  Could an aarch64 expert 
> pontificate here?
>
> This test is checking the final assembly for a specific sequence.  I 
> don't speak aarch64 assembly, but the IL is different coming out of evrp.

The code currently generated is incorrect - you really need to preserve the AND.

The issue seems that evrp doesn't seem to take CLZ_DEFINED_VALUE_AT_ZERO
macros into account. When it is 2, the value of CLZ/CTZ is defined even in the
mid-end and you need to include it in the range (so if the CLZ/CTZ value is the
number of bits you should use ranges 0..32 and 0..64).

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] PR target/97312: Tweak gcc.target/aarch64/pr90838.c
  2020-10-08 11:40         ` Jakub Jelinek
@ 2020-10-08 12:57           ` Wilco Dijkstra
  0 siblings, 0 replies; 12+ messages in thread
From: Wilco Dijkstra @ 2020-10-08 12:57 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches

Hi Jakub,

> On Thu, Oct 08, 2020 at 11:37:24AM +0000, Wilco Dijkstra via Gcc-patches wrote:
>> Which optimizations does it enable that aren't possible if the value is defined?
>
> See bugzilla.  Note other compilers heavily optimize on those builtins
> undefined at value zero.

You mean the PR94801, PR94793, PR95863 you mentioned before? The first doesn't
seem to be a useful optimization (would anyone ever write that?), the other 2 would
benefit from clz(0) being well defined. In particular, x86 without BMI would greatly
benefit from setting CTZ_DEFINED_VALUE_AT_ZERO to 2.

So I fail to see any "heavy" optimizations here that show a benefit of keeping the value
undefined at zero.

>> > We just should make sure that we optimize code like x ? __builtin_c[lt]z (x) : 32;
>> > etc. properly (and I believe we do).
>> 
>> I think we do, but both the external and internal documentation are not clear
>> enough that most targets actually do define a value and will optimize for it.
>> Otherwise we wouldn't have this bug now...
>
> The documentation is very clear that the builtins are undefined at zero,
> that is all that matters for users.

If we don't change the undefinedness, at least we should try to explain the above
idiom as a way to get a well-defined range that still results in a single instruction
on most targets.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] PR target/97312: Tweak gcc.target/aarch64/pr90838.c
  2020-10-08 11:37       ` Wilco Dijkstra
@ 2020-10-08 11:40         ` Jakub Jelinek
  2020-10-08 12:57           ` Wilco Dijkstra
  0 siblings, 1 reply; 12+ messages in thread
From: Jakub Jelinek @ 2020-10-08 11:40 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GCC Patches

On Thu, Oct 08, 2020 at 11:37:24AM +0000, Wilco Dijkstra via Gcc-patches wrote:
> Which optimizations does it enable that aren't possible if the value is defined?

See bugzilla.  Note other compilers heavily optimize on those builtins
undefined at value zero.

> > We just should make sure that we optimize code like x ? __builtin_c[lt]z (x) : 32;
> > etc. properly (and I believe we do).
> 
> I think we do, but both the external and internal documentation are not clear
> enough that most targets actually do define a value and will optimize for it.
> Otherwise we wouldn't have this bug now...

The documentation is very clear that the builtins are undefined at zero,
that is all that matters for users.

	Jakub


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] PR target/97312: Tweak gcc.target/aarch64/pr90838.c
  2020-10-08 11:25     ` Jakub Jelinek
@ 2020-10-08 11:37       ` Wilco Dijkstra
  2020-10-08 11:40         ` Jakub Jelinek
  0 siblings, 1 reply; 12+ messages in thread
From: Wilco Dijkstra @ 2020-10-08 11:37 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches, aldyh

Hi Jakub, 

> Having it undefined allows optimizations, and has been that way for years.

Which optimizations does it enable that aren't possible if the value is defined?

> We just should make sure that we optimize code like x ? __builtin_c[lt]z (x) : 32;
> etc. properly (and I believe we do).

I think we do, but both the external and internal documentation are not clear
enough that most targets actually do define a value and will optimize for it.
Otherwise we wouldn't have this bug now...

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] PR target/97312: Tweak gcc.target/aarch64/pr90838.c
  2020-10-08 11:09 ` Jakub Jelinek
  2020-10-08 11:22   ` Wilco Dijkstra
@ 2020-10-08 11:30   ` Wilco Dijkstra
  1 sibling, 0 replies; 12+ messages in thread
From: Wilco Dijkstra @ 2020-10-08 11:30 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches, aldyh


Btw for PowerPC is 0..32:

https://www.ibm.com/support/knowledgecenter/ssw_aix_72/assembler/idalangref_cntlzw_instrs.html

Wilco

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] PR target/97312: Tweak gcc.target/aarch64/pr90838.c
  2020-10-08 11:22   ` Wilco Dijkstra
@ 2020-10-08 11:25     ` Jakub Jelinek
  2020-10-08 11:37       ` Wilco Dijkstra
  0 siblings, 1 reply; 12+ messages in thread
From: Jakub Jelinek @ 2020-10-08 11:25 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GCC Patches, aldyh

On Thu, Oct 08, 2020 at 11:22:34AM +0000, Wilco Dijkstra wrote:
> >> I think a better way forward would be to make the builtin_clz/ctz more defined.
> >> Having undefined values is a source of unnecessary bugs given practically all
> >> modern targets return the number of bits for the zero input - it is relatively
> >> easy to ensure this on the few targets that don't.
> >
> > Well, e.g. i?86/x86_64 in most commonly used CPU flags is really undefined
> > (the register is unchanged).  And -1 is also quite commonly used value,
> > e.g. powerpc, gcn, xtensa.
> 
> So wouldn't it be easy to initialize the register before you do the bsr to get
> the same result as with BMI? I don't think an extra mov can affect performance
> in actual code (and GCC could still optimize the zero case if the input range
> doesn't include zero).
> 
> -1 is more complex, if these targets don't want to add extra instructions to fix
> it up, we could define the zero result either -1 or #bits depending on the target
> (still better than completely undefined).

Having it undefined allows optimizations, and has been that way for years.
We just should make sure that we optimize code like x ? __builtin_c[lt]z (x) : 32;
etc. properly (and I believe we do).

	Jakub


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] PR target/97312: Tweak gcc.target/aarch64/pr90838.c
  2020-10-08 11:09 ` Jakub Jelinek
@ 2020-10-08 11:22   ` Wilco Dijkstra
  2020-10-08 11:25     ` Jakub Jelinek
  2020-10-08 11:30   ` Wilco Dijkstra
  1 sibling, 1 reply; 12+ messages in thread
From: Wilco Dijkstra @ 2020-10-08 11:22 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches, aldyh

Hi Jakub,

>> I think a better way forward would be to make the builtin_clz/ctz more defined.
>> Having undefined values is a source of unnecessary bugs given practically all
>> modern targets return the number of bits for the zero input - it is relatively
>> easy to ensure this on the few targets that don't.
>
> Well, e.g. i?86/x86_64 in most commonly used CPU flags is really undefined
> (the register is unchanged).  And -1 is also quite commonly used value,
> e.g. powerpc, gcn, xtensa.

So wouldn't it be easy to initialize the register before you do the bsr to get
the same result as with BMI? I don't think an extra mov can affect performance
in actual code (and GCC could still optimize the zero case if the input range
doesn't include zero).

-1 is more complex, if these targets don't want to add extra instructions to fix
it up, we could define the zero result either -1 or #bits depending on the target
(still better than completely undefined).

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] PR target/97312: Tweak gcc.target/aarch64/pr90838.c
  2020-10-08 11:04 Wilco Dijkstra
@ 2020-10-08 11:09 ` Jakub Jelinek
  2020-10-08 11:22   ` Wilco Dijkstra
  2020-10-08 11:30   ` Wilco Dijkstra
  0 siblings, 2 replies; 12+ messages in thread
From: Jakub Jelinek @ 2020-10-08 11:09 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GCC Patches, aldyh

On Thu, Oct 08, 2020 at 11:04:01AM +0000, Wilco Dijkstra wrote:
> > Perhaps another way out of this would be document and enforce that
> > __builtin_c[lt]z{,l,ll} etc calls are undefined at zero, but C[TL]Z ifn
> > calls are defined there based on *_DEFINED_VALUE_AT_ZERO (*) == 2, and then
> > we would need to make sure that e.g. in simplify_count_trailing_zeroes
> > we emit a .CTZ or __builtin_ctz call depending on whether it is undefined
> > there or not; or give .C[TL]Z an additional argument (0/1) which would tell
> > if it is defined at zero or not.
> 
> I think a better way forward would be to make the builtin_clz/ctz more defined.
> Having undefined values is a source of unnecessary bugs given practically all
> modern targets return the number of bits for the zero input - it is relatively
> easy to ensure this on the few targets that don't.

Well, e.g. i?86/x86_64 in most commonly used CPU flags is really undefined
(the register is unchanged).  And -1 is also quite commonly used value,
e.g. powerpc, gcn, xtensa.

	Jakub


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH] PR target/97312: Tweak gcc.target/aarch64/pr90838.c
@ 2020-10-08 11:04 Wilco Dijkstra
  2020-10-08 11:09 ` Jakub Jelinek
  0 siblings, 1 reply; 12+ messages in thread
From: Wilco Dijkstra @ 2020-10-08 11:04 UTC (permalink / raw)
  To: jakub; +Cc: GCC Patches, aldyh

Hi Jakub,

> Perhaps another way out of this would be document and enforce that
> __builtin_c[lt]z{,l,ll} etc calls are undefined at zero, but C[TL]Z ifn
> calls are defined there based on *_DEFINED_VALUE_AT_ZERO (*) == 2, and then
> we would need to make sure that e.g. in simplify_count_trailing_zeroes
> we emit a .CTZ or __builtin_ctz call depending on whether it is undefined
> there or not; or give .C[TL]Z an additional argument (0/1) which would tell
> if it is defined at zero or not.

I think a better way forward would be to make the builtin_clz/ctz more defined.
Having undefined values is a source of unnecessary bugs given practically all
modern targets return the number of bits for the zero input - it is relatively
easy to ensure this on the few targets that don't.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] PR target/97312: Tweak gcc.target/aarch64/pr90838.c
  2020-10-08 10:22 ` Jakub Jelinek
@ 2020-10-08 10:27   ` Jakub Jelinek
  0 siblings, 0 replies; 12+ messages in thread
From: Jakub Jelinek @ 2020-10-08 10:27 UTC (permalink / raw)
  To: Aldy Hernandez, Richard Biener, gcc-patches

On Thu, Oct 08, 2020 at 12:22:11PM +0200, Jakub Jelinek via Gcc-patches wrote:
> We have several enhancement reports in bugzilla from Gabriel Ravier on this
> topic I think.

See e.g. PR94801, PR94793, PR95863 on the topic.

	Jakub


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] PR target/97312: Tweak gcc.target/aarch64/pr90838.c
  2020-10-08  9:58 Aldy Hernandez
@ 2020-10-08 10:22 ` Jakub Jelinek
  2020-10-08 10:27   ` Jakub Jelinek
  0 siblings, 1 reply; 12+ messages in thread
From: Jakub Jelinek @ 2020-10-08 10:22 UTC (permalink / raw)
  To: Aldy Hernandez, Richard Biener; +Cc: gcc-patches

On Thu, Oct 08, 2020 at 11:58:21AM +0200, Aldy Hernandez via Gcc-patches wrote:
> I am quoting my analysis from the PR.  Could an aarch64 expert pontificate
> here?
> 
> This test is checking the final assembly for a specific sequence.  I don't
> speak aarch64 assembly, but the IL is different coming out of evrp.
> 
> The first culprit is this difference in the mergephi1 dump:
> 
>    _9 = .CTZ (x_6(D));
> -  _10 = _9 & 31;
> +  _10 = _9;
> 
> These are unsigned ints, so assuming they are 32 bits on aarch64,
> __builtin_ctz is always less than 32.  This is because a CTZ of 0 is
> undefined according to the GCC manual:
> 
> [[
> Built-in Function: int __builtin_ctz (unsigned int x)
> 
>     Returns the number of trailing 0-bits in x, starting at the least
> significant bit position. If x is 0, the result is undefined.

In user __builtin_ctz yes, but it isn't that easy.

Aarch64 defines CTZ_DEFINED_VALUE_AT_ZERO to 2 and sets value to bitsize
of the mode (i.e. 32 in this case), 

And documentation says:
 -- Macro: CLZ_DEFINED_VALUE_AT_ZERO (MODE, VALUE)
 -- Macro: CTZ_DEFINED_VALUE_AT_ZERO (MODE, VALUE)
     A C expression that indicates whether the architecture defines a
     value for 'clz' or 'ctz' with a zero operand.  A result of '0'
     indicates the value is undefined.  If the value is defined for only
     the RTL expression, the macro should evaluate to '1'; if the value
     applies also to the corresponding optab entry (which is normally
     the case if it expands directly into the corresponding RTL), then
     the macro should evaluate to '2'.  In the cases where the value is
     defined, VALUE should be set to this value.

At least for *_DEFINED_VALUE_AT_ZERO (*) == 2, I'm afraid VRP can't
assume it is undefined at zero, but needs to use a range that covers
both the values for non-zero operand and the VALUE from
CTZ_DEFINED_VALUE_AT_ZERO, so [0, 32] range in this case, because
earlier GIMPLE passes could have already optimized away e.g. a != 0 check.
See simplify_count_trailing_zeroes in forwprop.

Perhaps another way out of this would be document and enforce that
__builtin_c[lt]z{,l,ll} etc calls are undefined at zero, but C[TL]Z ifn
calls are defined there based on *_DEFINED_VALUE_AT_ZERO (*) == 2, and then
we would need to make sure that e.g. in simplify_count_trailing_zeroes
we emit a .CTZ or __builtin_ctz call depending on whether it is undefined
there or not; or give .C[TL]Z an additional argument (0/1) which would tell
if it is defined at zero or not.

We have several enhancement reports in bugzilla from Gabriel Ravier on this
topic I think.

	Jakub


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH] PR target/97312: Tweak gcc.target/aarch64/pr90838.c
@ 2020-10-08  9:58 Aldy Hernandez
  2020-10-08 10:22 ` Jakub Jelinek
  0 siblings, 1 reply; 12+ messages in thread
From: Aldy Hernandez @ 2020-10-08  9:58 UTC (permalink / raw)
  To: gcc-patches

I am quoting my analysis from the PR.  Could an aarch64 expert 
pontificate here?

This test is checking the final assembly for a specific sequence.  I 
don't speak aarch64 assembly, but the IL is different coming out of evrp.

The first culprit is this difference in the mergephi1 dump:

    _9 = .CTZ (x_6(D));
-  _10 = _9 & 31;
+  _10 = _9;

These are unsigned ints, so assuming they are 32 bits on aarch64, 
__builtin_ctz is always less than 32.  This is because a CTZ of 0 is 
undefined according to the GCC manual:

[[
Built-in Function: int __builtin_ctz (unsigned int x)

     Returns the number of trailing 0-bits in x, starting at the least 
significant bit position. If x is 0, the result is undefined.
]]

So a bitwise AND of anything less than 32 with 0x1f (31) is a no-op.

Here are the full IL differences:

--- legacy-evrp/pr90838.c.038t.mergephi1        2020-10-07 
08:44:12.152358885 -0400
+++ ranger/pr90838.c.038t.mergephi1     2020-10-07 08:39:12.339296502 -0400
@@ -1,41 +1,41 @@

  ;; Function ctz1 (ctz1, funcdef_no=0, decl_uid=3587, cgraph_uid=1, 
symbol_order=0)

  ctz1 (unsigned int x)
  {
    static const char table[32] = 
"\x00\x01\x1c\x02\x1d\x0e\x18\x03\x1e\x16\x14\x0f\x19\x11\x04\b\x1f\x1b\r\x17\x15\x13\x10\x07\x1a\f\x12\x06\v\x05\n\t";
    unsigned int _1;
    unsigned int _2;
    unsigned int _3;
    unsigned int _4;
    char _5;
    int _9;
    int _10;

    <bb 2> :
    _1 = -x_6(D);
    _2 = _1 & x_6(D);
    _3 = _2 * 125613361;
    _4 = _3 >> 27;
    _9 = .CTZ (x_6(D));
-  _10 = _9 & 31;
+  _10 = _9;
    _5 = (char) _10;
    return _10;

  }



  ;; Function ctz2 (ctz2, funcdef_no=1, decl_uid=3591, cgraph_uid=2, 
symbol_order=1)

  ctz2 (unsigned int x)
  {
    static short int table[64] = {32, 0, 1, 12, 2, 6, 0, 13, 3, 0, 7, 0, 
0, 0, 0, 14, 10, 4, 0, 0, 8, 0, 0, 25, 0, 0, 0, 0, 0, 21, 27, 15, 31, 
11, 5, 0, 0, 0, 0, 0, 9, 0, 0,
24, 0, 0, 20, 26, 30, 0, 0, 0, 0, 23, 0, 19, 29, 0, 22, 18, 28, 17, 16, 0};
    unsigned int _1;
    unsigned int _2;
    unsigned int _3;
    short int _4;
    int _8;

    <bb 2> :
    _1 = -x_5(D);
@@ -87,27 +87,27 @@


  ;; Function ctz4 (ctz4, funcdef_no=3, decl_uid=3601, cgraph_uid=4, 
symbol_order=5)

  ctz4 (long unsigned int x)
  {
    long unsigned int lsb;
    long unsigned int _1;
    long long unsigned int _2;
    long long unsigned int _3;
    char _4;
    int _9;
    int _10;

    <bb 2> :
    _1 = -x_5(D);
    lsb_6 = _1 & x_5(D);
    _2 = lsb_6 * 283881067100198605;
    _3 = _2 >> 58;
    _9 = .CTZ (x_5(D));
-  _10 = _9 & 63;
+  _10 = _9;
    _4 = (char) _10;
    return _10;

  }

The difference in assembly matches.  We have 2 less AND's in the final 
output:

$ diff -u legacy.s ranger.s
--- legacy.s    2020-10-07 09:06:13.420446783 -0400
+++ ranger.s    2020-10-07 09:06:42.646646949 -0400
@@ -8,7 +8,6 @@
  ctz1:
         rbit    w0, w0
         clz     w0, w0
-       and     w0, w0, 31
         ret
         .size   ctz1, .-ctz1
         .align  2
@@ -36,7 +35,6 @@
  ctz4:
         rbit    x0, x0
         clz     x0, x0
-       and     w0, w0, 63
         ret
         .size   ctz4, .-ctz4

If my analysis is correct, we could just remove the line checking for 
"and", or perhaps check that we don't have any and's.

OK for trunk?
Aldy

     gcc/testsuite/ChangeLog:

             PR target/97312
             * gcc.target/aarch64/pr90838.c: Remove scan for AND.

diff --git a/gcc/testsuite/gcc.target/aarch64/pr90838.c 
b/gcc/testsuite/gcc.target/aarch64/pr90838.c
index e1e19ac6a61..76cd5e18d2e 100644
--- a/gcc/testsuite/gcc.target/aarch64/pr90838.c
+++ b/gcc/testsuite/gcc.target/aarch64/pr90838.c
@@ -60,5 +60,4 @@ int ctz4 (unsigned long x)
  }

  /* { dg-final { scan-assembler-times "clz\t" 4 } } */
-/* { dg-final { scan-assembler-times "and\t" 2 } } */
  /* { dg-final { scan-assembler-not "cmp\t.*0" } } */


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-10-08 12:57 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-08 10:33 [PATCH] PR target/97312: Tweak gcc.target/aarch64/pr90838.c Wilco Dijkstra
  -- strict thread matches above, loose matches on Subject: below --
2020-10-08 11:04 Wilco Dijkstra
2020-10-08 11:09 ` Jakub Jelinek
2020-10-08 11:22   ` Wilco Dijkstra
2020-10-08 11:25     ` Jakub Jelinek
2020-10-08 11:37       ` Wilco Dijkstra
2020-10-08 11:40         ` Jakub Jelinek
2020-10-08 12:57           ` Wilco Dijkstra
2020-10-08 11:30   ` Wilco Dijkstra
2020-10-08  9:58 Aldy Hernandez
2020-10-08 10:22 ` Jakub Jelinek
2020-10-08 10:27   ` Jakub Jelinek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).