public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug c/50168] New: __builtin_ctz() and intrinsics __bsr(), __bsf() generate suboptimal code on x86_64
@ 2011-08-23 16:40 gpiez at web dot de
  2011-08-23 17:00 ` [Bug c/50168] " jakub at gcc dot gnu.org
                   ` (9 more replies)
  0 siblings, 10 replies; 11+ messages in thread
From: gpiez at web dot de @ 2011-08-23 16:40 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50168

             Bug #: 50168
           Summary: __builtin_ctz() and intrinsics __bsr(), __bsf()
                    generate suboptimal code on x86_64
    Classification: Unclassified
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: gpiez@web.de


Testcase:

--------------------
#include <x86intrin.h>

static inline long my_bsfq(long x) __attribute__((__always_inline__));
static inline long my_bsfq(long x) {
    long result;
    asm(" bsfq %1, %0 \n"
        : "=r"(result)
        : "r"(x)
    );
    return result;
}

long c[64];

long f(long i) {
    return c[ __bsfq(i) ];
}

long g(long i) {
    return c[ __builtin_ctzll(i) ];
}

long h(long i) {
    return c[ my_bsfq(i) ];
}
----------------------



When I compile this with 'gcc -O3 -g testcase.c -c -o testcase.o
&& objdump -d testcase', I get



----------------------
0000000000000000 <f>:
   0:   48 0f bc ff             bsf    %rdi,%rdi
   4:   48 63 ff                movslq %edi,%rdi
   7:   48 8b 04 fd 00 00 00    mov    0x0(,%rdi,8),%rax
   e:   00 
   f:   c3                      retq   

0000000000000010 <g>:
  10:   48 0f bc ff             bsf    %rdi,%rdi
  14:   48 63 ff                movslq %edi,%rdi
  17:   48 8b 04 fd 00 00 00    mov    0x0(,%rdi,8),%rax
  1e:   00 
  1f:   c3                      retq   

0000000000000020 <h>:
  20:   48 0f bc ff             bsf    %rdi,%rdi
  24:   48 8b 04 fd 00 00 00    mov    0x0(,%rdi,8),%rax
  2b:   00 
  2c:   c3                      retq   
-----------------------



Please note the unneeded 32 to 64 bit conversion 'movslq ...' inserted by the
compiler in functions f() and g(). It should look like h() instead.

I suspect the source is the prototype of the builtin, whose return type 'int'
does not match the "natural" return type on x86_64, which is 64 bit, the same
register size as the input register.

If I replace the builtin/intrinsic with the selfmade asm one, I get a nice
speedup of 2% in my chessengine.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug c/50168] __builtin_ctz() and intrinsics __bsr(), __bsf() generate suboptimal code on x86_64
  2011-08-23 16:40 [Bug c/50168] New: __builtin_ctz() and intrinsics __bsr(), __bsf() generate suboptimal code on x86_64 gpiez at web dot de
@ 2011-08-23 17:00 ` jakub at gcc dot gnu.org
  2011-08-23 18:07 ` jakub at gcc dot gnu.org
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: jakub at gcc dot gnu.org @ 2011-08-23 17:00 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50168

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #1 from Jakub Jelinek <jakub at gcc dot gnu.org> 2011-08-23 16:41:34 UTC ---
I'll look at this.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug c/50168] __builtin_ctz() and intrinsics __bsr(), __bsf() generate suboptimal code on x86_64
  2011-08-23 16:40 [Bug c/50168] New: __builtin_ctz() and intrinsics __bsr(), __bsf() generate suboptimal code on x86_64 gpiez at web dot de
  2011-08-23 17:00 ` [Bug c/50168] " jakub at gcc dot gnu.org
@ 2011-08-23 18:07 ` jakub at gcc dot gnu.org
  2011-08-23 22:01 ` gpiez at web dot de
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: jakub at gcc dot gnu.org @ 2011-08-23 18:07 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50168

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |uros at gcc dot gnu.org

--- Comment #2 from Jakub Jelinek <jakub at gcc dot gnu.org> 2011-08-23 17:58:52 UTC ---
Those aren't equivalent unfortunately, because bsf and bsr insns on x86 have
undefined value if the source is zero.  While __builtin_c[lt]z* documentation
says that the result is undefined in that case, I wonder if it would be fine
even if long l = (int) __builtin_c[lt]z* (x); gave a value that wasn't actually
sign-extended to 64 bits.
The combiner already simplifies zero or sign extension of popcount/parity/ffs
and, if ctz or clz value is defined at zero, also those, but if it is undefined
it assumes anything in any of the bits and thus can't optimize the sign/zero
extension away.  With -mbmi it will be optimized just fine, because for tzcnt
(and lzcnt for -mlzcnt) insns are well defined even for source operand zero.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug c/50168] __builtin_ctz() and intrinsics __bsr(), __bsf() generate suboptimal code on x86_64
  2011-08-23 16:40 [Bug c/50168] New: __builtin_ctz() and intrinsics __bsr(), __bsf() generate suboptimal code on x86_64 gpiez at web dot de
  2011-08-23 17:00 ` [Bug c/50168] " jakub at gcc dot gnu.org
  2011-08-23 18:07 ` jakub at gcc dot gnu.org
@ 2011-08-23 22:01 ` gpiez at web dot de
  2011-08-23 23:53 ` gpiez at web dot de
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: gpiez at web dot de @ 2011-08-23 22:01 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50168

--- Comment #3 from Gunther Piez <gpiez at web dot de> 2011-08-23 21:54:40 UTC ---
On 23.08.2011 19:58, jakub at gcc dot gnu.org wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50168
>
> Jakub Jelinek <jakub at gcc dot gnu.org> changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |uros at gcc dot gnu.org
>
> --- Comment #2 from Jakub Jelinek <jakub at gcc dot gnu.org> 2011-08-23 17:58:52 UTC ---
> Those aren't equivalent unfortunately, because bsf and bsr insns on x86 have
> undefined value if the source is zero.  While __builtin_c[lt]z* documentation
> says that the result is undefined in that case, I wonder if it would be fine
> even if long l = (int) __builtin_c[lt]z* (x); gave a value that wasn't actually
> sign-extended to 64 bits.
> The combiner already simplifies zero or sign extension of popcount/parity/ffs
> and, if ctz or clz value is defined at zero, also those, but if it is undefined
> it assumes anything in any of the bits and thus can't optimize the sign/zero
> extension away.  With -mbmi it will be optimized just fine, because for tzcnt
> (and lzcnt for -mlzcnt) insns are well defined even for source operand zero.
>


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug c/50168] __builtin_ctz() and intrinsics __bsr(), __bsf() generate suboptimal code on x86_64
  2011-08-23 16:40 [Bug c/50168] New: __builtin_ctz() and intrinsics __bsr(), __bsf() generate suboptimal code on x86_64 gpiez at web dot de
                   ` (2 preceding siblings ...)
  2011-08-23 22:01 ` gpiez at web dot de
@ 2011-08-23 23:53 ` gpiez at web dot de
  2011-08-24  8:18 ` [Bug middle-end/50168] " rguenth at gcc dot gnu.org
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: gpiez at web dot de @ 2011-08-23 23:53 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50168

--- Comment #4 from Gunther Piez <gpiez at web dot de> 2011-08-23 22:00:31 UTC ---
On 23.08.2011 19:58, jakub at gcc dot gnu.org wrote:
> While __builtin_c[lt]z* documentation
> says that the result is undefined in that case, I wonder if it would be fine
> even if long l = (int) __builtin_c[lt]z* (x); gave a value that wasn't actually
> sign-extended to 64 bits.

So that software operating on the assumption that the value return by
__builtin_c[lt]z* is always int, even in the undefined case, would break
as soon at it sees a value outside the int range. Which could very well
be the case, AFAIK in the zero case the value of the target register is
just unchanged.

IMHO this is ok, I doubt that such code exists and even if, it is very
broken by design :-)
 Just my 2 cent.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug middle-end/50168] __builtin_ctz() and intrinsics __bsr(), __bsf() generate suboptimal code on x86_64
  2011-08-23 16:40 [Bug c/50168] New: __builtin_ctz() and intrinsics __bsr(), __bsf() generate suboptimal code on x86_64 gpiez at web dot de
                   ` (3 preceding siblings ...)
  2011-08-23 23:53 ` gpiez at web dot de
@ 2011-08-24  8:18 ` rguenth at gcc dot gnu.org
  2011-08-24  9:41 ` jakub at gcc dot gnu.org
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: rguenth at gcc dot gnu.org @ 2011-08-24  8:18 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50168

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|c                           |middle-end

--- Comment #5 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-08-24 08:16:06 UTC ---
I'd say "undefined" is a good enough reason to optimize the case (as opposed
to "target implementation defined" whose behavior we'd need to preserve).


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug middle-end/50168] __builtin_ctz() and intrinsics __bsr(), __bsf() generate suboptimal code on x86_64
  2011-08-23 16:40 [Bug c/50168] New: __builtin_ctz() and intrinsics __bsr(), __bsf() generate suboptimal code on x86_64 gpiez at web dot de
                   ` (4 preceding siblings ...)
  2011-08-24  8:18 ` [Bug middle-end/50168] " rguenth at gcc dot gnu.org
@ 2011-08-24  9:41 ` jakub at gcc dot gnu.org
  2011-08-24  9:47 ` rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: jakub at gcc dot gnu.org @ 2011-08-24  9:41 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50168

--- Comment #6 from Jakub Jelinek <jakub at gcc dot gnu.org> 2011-08-24 09:21:11 UTC ---
But it is just undefined value, not undefined behavior.  If you do:
--- rtlanal.c 2011-08-23 19:46:13.000000000 +0200
+++ rtlanal.c 2011-08-24 11:18:01.720582231 +0200
@@ -4256,21 +4256,17 @@ nonzero_bits1 (const_rtx x, enum machine
     case CLZ:
       /* If CLZ has a known value at zero, then the nonzero bits are
          that value, plus the number of bits in the mode minus one.  */
-      if (CLZ_DEFINED_VALUE_AT_ZERO (mode, nonzero))
-        nonzero
-          |= ((unsigned HOST_WIDE_INT) 1 << (floor_log2 (mode_width))) - 1;
-      else
-        nonzero = -1;
+      if (!CLZ_DEFINED_VALUE_AT_ZERO (mode, nonzero))
+        nonzero = 0;
+      nonzero |= ((unsigned HOST_WIDE_INT) 1 << (floor_log2 (mode_width))) -
1;
       break;

     case CTZ:
       /* If CTZ has a known value at zero, then the nonzero bits are
          that value, plus the number of bits in the mode minus one.  */
-      if (CTZ_DEFINED_VALUE_AT_ZERO (mode, nonzero))
-        nonzero
-          |= ((unsigned HOST_WIDE_INT) 1 << (floor_log2 (mode_width))) - 1;
-      else
-        nonzero = -1;
+      if (!CTZ_DEFINED_VALUE_AT_ZERO (mode, nonzero))
+        nonzero = 0;
+      nonzero |= ((unsigned HOST_WIDE_INT) 1 << (floor_log2 (mode_width))) -
1;
       break;

     case CLRSB:

then __builtin_clzl (x) & 0xff will happily give 0xdeadbeef as result if x
happens to be 0.  Similarly long y = (short) __builtin_clzl (x) will result in
y being any value from LONG_MIN to LONG_MAX, instead of any value from
SHORT_MIN to SHORT_MAX.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug middle-end/50168] __builtin_ctz() and intrinsics __bsr(), __bsf() generate suboptimal code on x86_64
  2011-08-23 16:40 [Bug c/50168] New: __builtin_ctz() and intrinsics __bsr(), __bsf() generate suboptimal code on x86_64 gpiez at web dot de
                   ` (5 preceding siblings ...)
  2011-08-24  9:41 ` jakub at gcc dot gnu.org
@ 2011-08-24  9:47 ` rguenth at gcc dot gnu.org
  2011-08-24  9:49 ` jakub at gcc dot gnu.org
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: rguenth at gcc dot gnu.org @ 2011-08-24  9:47 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50168

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2011-08-24
     Ever Confirmed|0                           |1

--- Comment #7 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-08-24 09:40:44 UTC ---
Hm, so the best thing we can do is a peephole recognizing that

  10:   48 0f bc ff             bsf    %rdi,%rdi
  14:   48 63 ff                movslq %edi,%rdi

when rdi is zero it will stay so and the movslq is redundant?


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug middle-end/50168] __builtin_ctz() and intrinsics __bsr(), __bsf() generate suboptimal code on x86_64
  2011-08-23 16:40 [Bug c/50168] New: __builtin_ctz() and intrinsics __bsr(), __bsf() generate suboptimal code on x86_64 gpiez at web dot de
                   ` (6 preceding siblings ...)
  2011-08-24  9:47 ` rguenth at gcc dot gnu.org
@ 2011-08-24  9:49 ` jakub at gcc dot gnu.org
  2011-08-24  9:53 ` jakub at gcc dot gnu.org
  2021-08-08 22:48 ` [Bug middle-end/50168] __builtin_ctz() and intrinsics __bsr(), __bsf() generate extra sign extend " pinskia at gcc dot gnu.org
  9 siblings, 0 replies; 11+ messages in thread
From: jakub at gcc dot gnu.org @ 2011-08-24  9:49 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50168

--- Comment #8 from Jakub Jelinek <jakub at gcc dot gnu.org> 2011-08-24 09:45:56 UTC ---
What we IMHO should optimize and don't currently is the redundant sign
extension when using __builtin_ffsl - as it internally uses bsf + cmove,
nonzero_bits isn't able to figure out that the result of the sequence is
guaranteed to have nonzero-bits.  Perhaps we should in that case add a
REG_EQUAL note to the last insn in the sequence and perhaps nonzero_bits could
also look at REG_EQUAL notes.  Doing that could perhaps help even testcases
like:
int foo (long x)
{
  return __builtin_popcountl (x) & 0xff;
}
where the andl $0x255, %eax could be optimized away, etc.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug middle-end/50168] __builtin_ctz() and intrinsics __bsr(), __bsf() generate suboptimal code on x86_64
  2011-08-23 16:40 [Bug c/50168] New: __builtin_ctz() and intrinsics __bsr(), __bsf() generate suboptimal code on x86_64 gpiez at web dot de
                   ` (7 preceding siblings ...)
  2011-08-24  9:49 ` jakub at gcc dot gnu.org
@ 2011-08-24  9:53 ` jakub at gcc dot gnu.org
  2021-08-08 22:48 ` [Bug middle-end/50168] __builtin_ctz() and intrinsics __bsr(), __bsf() generate extra sign extend " pinskia at gcc dot gnu.org
  9 siblings, 0 replies; 11+ messages in thread
From: jakub at gcc dot gnu.org @ 2011-08-24  9:53 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50168

--- Comment #9 from Jakub Jelinek <jakub at gcc dot gnu.org> 2011-08-24 09:48:20 UTC ---
(In reply to comment #7)
> Hm, so the best thing we can do is a peephole recognizing that
> 
>   10:   48 0f bc ff             bsf    %rdi,%rdi
>   14:   48 63 ff                movslq %edi,%rdi
> 
> when rdi is zero it will stay so and the movslq is redundant?

I don't think so.  While the SandyBridge CPU seems to behave that way, there is
no such guarantee in the Intel manuals which say that the destination value is
undefined if zero-flag is cleared (i.e. the source was zero) and thus some
older or future CPUs might behave differently.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug middle-end/50168] __builtin_ctz() and intrinsics __bsr(), __bsf() generate extra sign extend on x86_64
  2011-08-23 16:40 [Bug c/50168] New: __builtin_ctz() and intrinsics __bsr(), __bsf() generate suboptimal code on x86_64 gpiez at web dot de
                   ` (8 preceding siblings ...)
  2011-08-24  9:53 ` jakub at gcc dot gnu.org
@ 2021-08-08 22:48 ` pinskia at gcc dot gnu.org
  9 siblings, 0 replies; 11+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-08-08 22:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50168

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |DUPLICATE

--- Comment #12 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Dup of bug 29776.

*** This bug has been marked as a duplicate of bug 29776 ***

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2021-08-08 22:48 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-08-23 16:40 [Bug c/50168] New: __builtin_ctz() and intrinsics __bsr(), __bsf() generate suboptimal code on x86_64 gpiez at web dot de
2011-08-23 17:00 ` [Bug c/50168] " jakub at gcc dot gnu.org
2011-08-23 18:07 ` jakub at gcc dot gnu.org
2011-08-23 22:01 ` gpiez at web dot de
2011-08-23 23:53 ` gpiez at web dot de
2011-08-24  8:18 ` [Bug middle-end/50168] " rguenth at gcc dot gnu.org
2011-08-24  9:41 ` jakub at gcc dot gnu.org
2011-08-24  9:47 ` rguenth at gcc dot gnu.org
2011-08-24  9:49 ` jakub at gcc dot gnu.org
2011-08-24  9:53 ` jakub at gcc dot gnu.org
2021-08-08 22:48 ` [Bug middle-end/50168] __builtin_ctz() and intrinsics __bsr(), __bsf() generate extra sign extend " pinskia at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).