[Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized
       [not found] <bug-40210-4@http.gcc.gnu.org/bugzilla/>
@ 2011-10-07 18:23 ` pluto at agmk dot net
  2011-10-07 18:29 ` pinskia at gcc dot gnu.org
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: pluto at agmk dot net @ 2011-10-07 18:23 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210

--- Comment #9 from Pawel Sikora <pluto at agmk dot net> 2011-10-07 18:22:25 UTC ---
Created attachment 25442
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25442
testcase


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized
       [not found] <bug-40210-4@http.gcc.gnu.org/bugzilla/>
  2011-10-07 18:23 ` [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized pluto at agmk dot net
@ 2011-10-07 18:29 ` pinskia at gcc dot gnu.org
  2011-10-07 18:46 ` pluto at agmk dot net
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: pinskia at gcc dot gnu.org @ 2011-10-07 18:29 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210

--- Comment #10 from Andrew Pinski <pinskia at gcc dot gnu.org> 2011-10-07 18:29:07 UTC ---
(In reply to comment #9)
> Created attachment 25442 [details]
> testcase

I think those are hard to optimize really since those are inline-asm really. 
And the unsigned short one gets optimized correctly.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized
       [not found] <bug-40210-4@http.gcc.gnu.org/bugzilla/>
  2011-10-07 18:23 ` [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized pluto at agmk dot net
  2011-10-07 18:29 ` pinskia at gcc dot gnu.org
@ 2011-10-07 18:46 ` pluto at agmk dot net
  2021-07-08 10:48 ` cvs-commit at gcc dot gnu.org
  2021-07-10  8:34 ` roger at nextmovesoftware dot com
  4 siblings, 0 replies; 9+ messages in thread
From: pluto at agmk dot net @ 2011-10-07 18:46 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210

--- Comment #11 from Pawel Sikora <pluto at agmk dot net> 2011-10-07 18:45:57 UTC ---
(In reply to comment #10)
> (In reply to comment #9)
> > Created attachment 25442 [details]
> > testcase
> 
> I think those are hard to optimize really since those are inline-asm really. 
> And the unsigned short one gets optimized correctly.

ahhh,
glibc uses a generic i386 implementation (ror+xchg) in <byteorder.h>
while it has an optimized <byteswap.h> for i486+ :(


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized
       [not found] <bug-40210-4@http.gcc.gnu.org/bugzilla/>
                   ` (2 preceding siblings ...)
  2011-10-07 18:46 ` pluto at agmk dot net
@ 2021-07-08 10:48 ` cvs-commit at gcc dot gnu.org
  2021-07-10  8:34 ` roger at nextmovesoftware dot com
  4 siblings, 0 replies; 9+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-07-08 10:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210

--- Comment #12 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:

https://gcc.gnu.org/g:4c619132b3f14dc5e672a7f2f0e09cb784193559

commit r12-2137-g4c619132b3f14dc5e672a7f2f0e09cb784193559
Author: Roger Sayle <roger@nextmovesoftware.com>
Date:   Thu Jul 8 11:46:14 2021 +0100

    PR tree-optimization/40210: Fold (bswap(X)>>C1)&C2 to (X>>C3)&C2 in
match.pd

    All of the optimizations/transformations mentioned in bugzilla for
    PR tree-optimization/40210 are already implemented in mainline GCC,
    with one exception.  In comment #5, there's a suggestion that
    (bswap64(x)>>56)&0xff can be implemented without the bswap as
    (unsigned char)x, or equivalently x&0xff.

    This patch implements the above optimization, and closely related
    variants.  For any single bit, (bswap(X)>>C1)&1 can be simplified
    to (X>>C2)&1, where bit position C2 is the appropriate permutation
    of C1.  Similarly, the bswap can eliminated if the desired set of
    bits all lie within the same byte, hence (bswap(x)>>8)&255 can
    always be optimized, as can (bswap(x)>>8)&123.

    Previously,
    int foo(long long x) {
      return (__builtin_bswap64(x) >> 56) & 0xff;
    }

    compiled with -O2 to
    foo:    movq    %rdi, %rax
            bswap   %rax
            shrq    $56, %rax
            ret

    with this patch, it now compiles to
    foo:    movzbl  %dil, %eax
            ret

    2021-07-08  Roger Sayle  <roger@nextmovesoftware.com>
                Richard Biener  <rguenther@suse.de>

    gcc/ChangeLog
            PR tree-optimization/40210
            * match.pd (bswap optimizations): Simplify (bswap(x)>>C1)&C2 as
            (x>>C3)&C2 when possible.  Simplify bswap(x)>>C1 as ((T)x)>>C2
            when possible.  Simplify bswap(x)&C1 as (x>>C2)&C1 when 0<=C1<=255.

    gcc/testsuite/ChangeLog
            PR tree-optimization/40210
            * gcc.dg/builtin-bswap-13.c: New test.
            * gcc.dg/builtin-bswap-14.c: New test.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized
       [not found] <bug-40210-4@http.gcc.gnu.org/bugzilla/>
                   ` (3 preceding siblings ...)
  2021-07-08 10:48 ` cvs-commit at gcc dot gnu.org
@ 2021-07-10  8:34 ` roger at nextmovesoftware dot com
  4 siblings, 0 replies; 9+ messages in thread
From: roger at nextmovesoftware dot com @ 2021-07-10  8:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210

Roger Sayle <roger at nextmovesoftware dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |roger at nextmovesoftware dot com
         Resolution|---                         |FIXED
             Status|UNCONFIRMED                 |RESOLVED
   Target Milestone|---                         |12.0

--- Comment #13 from Roger Sayle <roger at nextmovesoftware dot com> ---
All of the (bswap-related) optimizations mentioned in PR
tree-optimization/40210 are now implemented in mainline GCC.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug tree-optimization/40210]  New: gcc needs byte swap builtins
@ 2009-05-20 18:48 eric-bugs at omnifarious dot org
  2009-05-20 19:39 ` [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized eric-bugs at omnifarious dot org
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: eric-bugs at omnifarious dot org @ 2009-05-20 18:48 UTC (permalink / raw)
  To: gcc-bugs

gcc needs some built in functions for byte swapping.

I've been experimenting with the various versions of byte swapping functions
out there, and they either result in code that's opaque to the optimizer (i.e.
swapping something twice is not considered a null operation) or the optimizer
doesn't recognize that a byte swap is what's happening and renders it as a
complex series of shift, and and or instructions.

I know very little about the internals of gcc, but my ignorant preference would
be to make tree-ssa recognize that code like this:

inline uint64_t byteswap_64(const uint64_t x)
{
   return
      ((((x) & 0xff00000000000000ull) >> 56)
      | (((x) & 0x00ff000000000000ull) >> 40)
      | (((x) & 0x0000ff0000000000ull) >> 24)
      | (((x) & 0x000000ff00000000ull) >> 8)
      | (((x) & 0x00000000ff000000ull) << 8)
      | (((x) & 0x0000000000ff0000ull) << 24)
      | (((x) & 0x000000000000ff00ull) << 40)
      | (((x) & 0x00000000000000ffull) << 56));
}

is a byte swap and optimize appropriately.  If this were being done to an
entire array, it might even be possible to vectorize it efficiently.

This would also mean that code to pull specific bits out of a pre or post swap
value could be moved around and fiddled to get the value out of a different
place if it made for more efficient register usage.

-- 
           Summary: gcc needs byte swap builtins
           Product: gcc
           Version: 4.3.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: eric-bugs at omnifarious dot org

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized
  2009-05-20 18:48 [Bug tree-optimization/40210] New: gcc needs byte swap builtins eric-bugs at omnifarious dot org
@ 2009-05-20 19:39 ` eric-bugs at omnifarious dot org
  2009-05-20 20:05 ` jakub at gcc dot gnu dot org
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: eric-bugs at omnifarious dot org @ 2009-05-20 19:39 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #5 from eric-bugs at omnifarious dot org  2009-05-20 19:39 -------
This code:

#include <stdint.h>
#include <stddef.h>

inline uint64_t byteswap_64(const uint64_t x)
{
   return __builtin_bswap64(x);
}

inline uint32_t byteswap_32(const uint32_t x)
{
   return __builtin_bswap32(x);
}

extern void random_function(uint32_t a, uint64_t b, uint32_t c, uint64_t d);

void swapping(const uint32_t x32, const uint64_t x64)
{
   random_function(byteswap_32(x32), byteswap_64(x64),
byteswap_32(byteswap_32(x32)), byteswap_64(byteswap_64(x64)));
}

void swaparray(uint64_t outvals[], char outtop[], const uint64_t invals[],
const size_t size)
{
   size_t i = 0;
   for (i = 0; i < size; ++i) {
      outvals[i] = byteswap_64(invals[i]);
      outtop[i] = (byteswap_64(invals[i]) >> 56) & 0xffull;
   }
}

results in this assembly:

.globl swaparray
        .type   swaparray, @function
swaparray:
.LFB5:
        testq   %rcx, %rcx
        je      .L8
        xorl    %r8d, %r8d
        .p2align 4,,7
        .p2align 3
.L7:
        movq    (%rdx,%r8,8), %rax
        bswap   %rax
        movq    %rax, (%rdi,%r8,8)
        movq    (%rdx,%r8,8), %rax
        bswap   %rax
        shrq    $56, %rax
        movb    %al, (%rsi,%r8)
        incq    %r8
        cmpq    %r8, %rcx
        ja      .L7
.L8:
        rep
        ret
.LFE5:
        .size   swaparray, .-swaparray
        .p2align 4,,15
.globl swapping
        .type   swapping, @function
swapping:
.LFB4:
        bswap   %rsi
        bswap   %edi
        movq    %rsi, %rcx
        movl    %edi, %edx
        bswap   %rcx
        bswap   %edx
        jmp     random_function
.LFE4:
        .size   swapping, .-swapping

when compiled with gcc -O3 -mtune=native -march=native on an Opteron system.

Notice that in swapping bswap is used twice rather than having two move
instructions and two bswap instructions.  The optimizer is apparently unaware
that bswap is its own inverse.

In swaparray the bswap operation is not subject to an obvious CSE optimization,
nor is it realized that the latter line might be more efficiently implemented
by movb   %al, (%rsi,%r8) before the bswap operation.


-- 

eric-bugs at omnifarious dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|gcc needs byte swap builtins|gcc byte swap builtins
                   |                            |inadequately optimized


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized
  2009-05-20 18:48 [Bug tree-optimization/40210] New: gcc needs byte swap builtins eric-bugs at omnifarious dot org
  2009-05-20 19:39 ` [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized eric-bugs at omnifarious dot org
@ 2009-05-20 20:05 ` jakub at gcc dot gnu dot org
  2009-05-20 20:22 ` eric-bugs at omnifarious dot org
  2009-06-10 17:37 ` hp at gcc dot gnu dot org
  3 siblings, 0 replies; 9+ messages in thread
From: jakub at gcc dot gnu dot org @ 2009-05-20 20:05 UTC (permalink / raw)
  To: gcc-bugs

------- Comment #6 from jakub at gcc dot gnu dot org  2009-05-20 20:05 -------
There are plenty other possible builtin bswap optimizations.  E.g.
extern void bar (void);

void foo (int x)
{
  if (__builtin_bswap32 (x) == __builtin_bswap32 (0x1234567))
    bar ();
}

should be optimized into if (x == 0x1234567) (only for EQ_EXPR/NE_EXPR),
similarly __builtin_bswap32 (x) == 0x1234567 should be optimized into
x == __builtin_bswap32 (0x1234567) because the latter can be swapped at compile
time, similarly __builtin_bswap32 (__builtin_bswap32 (x) | 0x1234) could
be optimized into x | __builtin_bswap32 (0x1234) (similarly for ^ or & or ~),
etc.  The question is if enough projects start using these builtins to make it
worth spending compile time on it and what exact optimizations are useful on
real-world code.

-- 

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized
  2009-05-20 18:48 [Bug tree-optimization/40210] New: gcc needs byte swap builtins eric-bugs at omnifarious dot org
  2009-05-20 19:39 ` [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized eric-bugs at omnifarious dot org
  2009-05-20 20:05 ` jakub at gcc dot gnu dot org
@ 2009-05-20 20:22 ` eric-bugs at omnifarious dot org
  2009-06-10 17:37 ` hp at gcc dot gnu dot org
  3 siblings, 0 replies; 9+ messages in thread
From: eric-bugs at omnifarious dot org @ 2009-05-20 20:22 UTC (permalink / raw)
  To: gcc-bugs

------- Comment #7 from eric-bugs at omnifarious dot org  2009-05-20 20:22 -------
I've been playing around a bit more, and I've noticed that gcc in general does
not do a spectacular job of optimizing bitwise operations of any kind.

Some kind of general framework for tracking the movements of individual bits
and details like "16 bit values only have 16 bits, so using & to ensure this in
various ways is a null operation." might actually do a lot to speed up a lot of
code.

I distinctly remember a time long past when I and a co-worker fiddled some
complex bit operations this way and that to get the assembly out we knew was
close to optimal for a tight inner loop.  The resulting expression was
significantly less clear than the most obvious way of stating the same thing
and I also knew that if DEC changed their compiler in certain ways we'd have to
do it all over again.

As an example, there is no reason that:

(x << 8) | (x >> 8) should result in better code than ((x & 0xffu) << 8) | ((x
& 0xff00u) >> 8) when x is of type uint16_t, but it does.  And recognizing that
either can be done in one instruction on an x86 would be even better.

So, while I think you are likely correct that the byteswap builtins do not need
a lot of extensive optimization, I do think that bit operations in general
could be handled a lot better, and that would help out a whole lot of code. 
Once that framework was in place optimizing the byteswap builtin would be
trivial.

-- 

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized
  2009-05-20 18:48 [Bug tree-optimization/40210] New: gcc needs byte swap builtins eric-bugs at omnifarious dot org
                   ` (2 preceding siblings ...)
  2009-05-20 20:22 ` eric-bugs at omnifarious dot org
@ 2009-06-10 17:37 ` hp at gcc dot gnu dot org
  3 siblings, 0 replies; 9+ messages in thread
From: hp at gcc dot gnu dot org @ 2009-06-10 17:37 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #8 from hp at gcc dot gnu dot org  2009-06-10 17:36 -------
(In reply to comment #7)
> I do think that bit operations in general
> could be handled a lot better, and that would help out a whole lot of code. 

If you add (compilable) test-cases with enhancement requests, carefully noting
target and gcc version where you observed the lack of optimization, this would
help a lot.

> Once that framework was in place optimizing the byteswap builtin would be
> trivial.

We already have a "framework".


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2021-07-10  8:34 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <bug-40210-4@http.gcc.gnu.org/bugzilla/>
2011-10-07 18:23 ` [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized pluto at agmk dot net
2011-10-07 18:29 ` pinskia at gcc dot gnu.org
2011-10-07 18:46 ` pluto at agmk dot net
2021-07-08 10:48 ` cvs-commit at gcc dot gnu.org
2021-07-10  8:34 ` roger at nextmovesoftware dot com
2009-05-20 18:48 [Bug tree-optimization/40210] New: gcc needs byte swap builtins eric-bugs at omnifarious dot org
2009-05-20 19:39 ` [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized eric-bugs at omnifarious dot org
2009-05-20 20:05 ` jakub at gcc dot gnu dot org
2009-05-20 20:22 ` eric-bugs at omnifarious dot org
2009-06-10 17:37 ` hp at gcc dot gnu dot org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).