public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized
[not found] <bug-40210-4@http.gcc.gnu.org/bugzilla/>
@ 2011-10-07 18:23 ` pluto at agmk dot net
2011-10-07 18:29 ` pinskia at gcc dot gnu.org
` (3 subsequent siblings)
4 siblings, 0 replies; 9+ messages in thread
From: pluto at agmk dot net @ 2011-10-07 18:23 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210
--- Comment #9 from Pawel Sikora <pluto at agmk dot net> 2011-10-07 18:22:25 UTC ---
Created attachment 25442
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25442
testcase
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized
[not found] <bug-40210-4@http.gcc.gnu.org/bugzilla/>
2011-10-07 18:23 ` [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized pluto at agmk dot net
@ 2011-10-07 18:29 ` pinskia at gcc dot gnu.org
2011-10-07 18:46 ` pluto at agmk dot net
` (2 subsequent siblings)
4 siblings, 0 replies; 9+ messages in thread
From: pinskia at gcc dot gnu.org @ 2011-10-07 18:29 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210
--- Comment #10 from Andrew Pinski <pinskia at gcc dot gnu.org> 2011-10-07 18:29:07 UTC ---
(In reply to comment #9)
> Created attachment 25442 [details]
> testcase
I think those are hard to optimize really since those are inline-asm really.
And the unsigned short one gets optimized correctly.
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized
[not found] <bug-40210-4@http.gcc.gnu.org/bugzilla/>
2011-10-07 18:23 ` [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized pluto at agmk dot net
2011-10-07 18:29 ` pinskia at gcc dot gnu.org
@ 2011-10-07 18:46 ` pluto at agmk dot net
2021-07-08 10:48 ` cvs-commit at gcc dot gnu.org
2021-07-10 8:34 ` roger at nextmovesoftware dot com
4 siblings, 0 replies; 9+ messages in thread
From: pluto at agmk dot net @ 2011-10-07 18:46 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210
--- Comment #11 from Pawel Sikora <pluto at agmk dot net> 2011-10-07 18:45:57 UTC ---
(In reply to comment #10)
> (In reply to comment #9)
> > Created attachment 25442 [details]
> > testcase
>
> I think those are hard to optimize really since those are inline-asm really.
> And the unsigned short one gets optimized correctly.
ahhh,
glibc uses a generic i386 implementation (ror+xchg) in <byteorder.h>
while it has an optimized <byteswap.h> for i486+ :(
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized
[not found] <bug-40210-4@http.gcc.gnu.org/bugzilla/>
` (2 preceding siblings ...)
2011-10-07 18:46 ` pluto at agmk dot net
@ 2021-07-08 10:48 ` cvs-commit at gcc dot gnu.org
2021-07-10 8:34 ` roger at nextmovesoftware dot com
4 siblings, 0 replies; 9+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-07-08 10:48 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210
--- Comment #12 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:
https://gcc.gnu.org/g:4c619132b3f14dc5e672a7f2f0e09cb784193559
commit r12-2137-g4c619132b3f14dc5e672a7f2f0e09cb784193559
Author: Roger Sayle <roger@nextmovesoftware.com>
Date: Thu Jul 8 11:46:14 2021 +0100
PR tree-optimization/40210: Fold (bswap(X)>>C1)&C2 to (X>>C3)&C2 in
match.pd
All of the optimizations/transformations mentioned in bugzilla for
PR tree-optimization/40210 are already implemented in mainline GCC,
with one exception. In comment #5, there's a suggestion that
(bswap64(x)>>56)&0xff can be implemented without the bswap as
(unsigned char)x, or equivalently x&0xff.
This patch implements the above optimization, and closely related
variants. For any single bit, (bswap(X)>>C1)&1 can be simplified
to (X>>C2)&1, where bit position C2 is the appropriate permutation
of C1. Similarly, the bswap can eliminated if the desired set of
bits all lie within the same byte, hence (bswap(x)>>8)&255 can
always be optimized, as can (bswap(x)>>8)&123.
Previously,
int foo(long long x) {
return (__builtin_bswap64(x) >> 56) & 0xff;
}
compiled with -O2 to
foo: movq %rdi, %rax
bswap %rax
shrq $56, %rax
ret
with this patch, it now compiles to
foo: movzbl %dil, %eax
ret
2021-07-08 Roger Sayle <roger@nextmovesoftware.com>
Richard Biener <rguenther@suse.de>
gcc/ChangeLog
PR tree-optimization/40210
* match.pd (bswap optimizations): Simplify (bswap(x)>>C1)&C2 as
(x>>C3)&C2 when possible. Simplify bswap(x)>>C1 as ((T)x)>>C2
when possible. Simplify bswap(x)&C1 as (x>>C2)&C1 when 0<=C1<=255.
gcc/testsuite/ChangeLog
PR tree-optimization/40210
* gcc.dg/builtin-bswap-13.c: New test.
* gcc.dg/builtin-bswap-14.c: New test.
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized
[not found] <bug-40210-4@http.gcc.gnu.org/bugzilla/>
` (3 preceding siblings ...)
2021-07-08 10:48 ` cvs-commit at gcc dot gnu.org
@ 2021-07-10 8:34 ` roger at nextmovesoftware dot com
4 siblings, 0 replies; 9+ messages in thread
From: roger at nextmovesoftware dot com @ 2021-07-10 8:34 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210
Roger Sayle <roger at nextmovesoftware dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |roger at nextmovesoftware dot com
Resolution|--- |FIXED
Status|UNCONFIRMED |RESOLVED
Target Milestone|--- |12.0
--- Comment #13 from Roger Sayle <roger at nextmovesoftware dot com> ---
All of the (bswap-related) optimizations mentioned in PR
tree-optimization/40210 are now implemented in mainline GCC.
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized
2009-05-20 18:48 [Bug tree-optimization/40210] New: gcc needs byte swap builtins eric-bugs at omnifarious dot org
` (2 preceding siblings ...)
2009-05-20 20:22 ` eric-bugs at omnifarious dot org
@ 2009-06-10 17:37 ` hp at gcc dot gnu dot org
3 siblings, 0 replies; 9+ messages in thread
From: hp at gcc dot gnu dot org @ 2009-06-10 17:37 UTC (permalink / raw)
To: gcc-bugs
------- Comment #8 from hp at gcc dot gnu dot org 2009-06-10 17:36 -------
(In reply to comment #7)
> I do think that bit operations in general
> could be handled a lot better, and that would help out a whole lot of code.
If you add (compilable) test-cases with enhancement requests, carefully noting
target and gcc version where you observed the lack of optimization, this would
help a lot.
> Once that framework was in place optimizing the byteswap builtin would be
> trivial.
We already have a "framework".
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized
2009-05-20 18:48 [Bug tree-optimization/40210] New: gcc needs byte swap builtins eric-bugs at omnifarious dot org
2009-05-20 19:39 ` [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized eric-bugs at omnifarious dot org
2009-05-20 20:05 ` jakub at gcc dot gnu dot org
@ 2009-05-20 20:22 ` eric-bugs at omnifarious dot org
2009-06-10 17:37 ` hp at gcc dot gnu dot org
3 siblings, 0 replies; 9+ messages in thread
From: eric-bugs at omnifarious dot org @ 2009-05-20 20:22 UTC (permalink / raw)
To: gcc-bugs
------- Comment #7 from eric-bugs at omnifarious dot org 2009-05-20 20:22 -------
I've been playing around a bit more, and I've noticed that gcc in general does
not do a spectacular job of optimizing bitwise operations of any kind.
Some kind of general framework for tracking the movements of individual bits
and details like "16 bit values only have 16 bits, so using & to ensure this in
various ways is a null operation." might actually do a lot to speed up a lot of
code.
I distinctly remember a time long past when I and a co-worker fiddled some
complex bit operations this way and that to get the assembly out we knew was
close to optimal for a tight inner loop. The resulting expression was
significantly less clear than the most obvious way of stating the same thing
and I also knew that if DEC changed their compiler in certain ways we'd have to
do it all over again.
As an example, there is no reason that:
(x << 8) | (x >> 8) should result in better code than ((x & 0xffu) << 8) | ((x
& 0xff00u) >> 8) when x is of type uint16_t, but it does. And recognizing that
either can be done in one instruction on an x86 would be even better.
So, while I think you are likely correct that the byteswap builtins do not need
a lot of extensive optimization, I do think that bit operations in general
could be handled a lot better, and that would help out a whole lot of code.
Once that framework was in place optimizing the byteswap builtin would be
trivial.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized
2009-05-20 18:48 [Bug tree-optimization/40210] New: gcc needs byte swap builtins eric-bugs at omnifarious dot org
2009-05-20 19:39 ` [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized eric-bugs at omnifarious dot org
@ 2009-05-20 20:05 ` jakub at gcc dot gnu dot org
2009-05-20 20:22 ` eric-bugs at omnifarious dot org
2009-06-10 17:37 ` hp at gcc dot gnu dot org
3 siblings, 0 replies; 9+ messages in thread
From: jakub at gcc dot gnu dot org @ 2009-05-20 20:05 UTC (permalink / raw)
To: gcc-bugs
------- Comment #6 from jakub at gcc dot gnu dot org 2009-05-20 20:05 -------
There are plenty other possible builtin bswap optimizations. E.g.
extern void bar (void);
void foo (int x)
{
if (__builtin_bswap32 (x) == __builtin_bswap32 (0x1234567))
bar ();
}
should be optimized into if (x == 0x1234567) (only for EQ_EXPR/NE_EXPR),
similarly __builtin_bswap32 (x) == 0x1234567 should be optimized into
x == __builtin_bswap32 (0x1234567) because the latter can be swapped at compile
time, similarly __builtin_bswap32 (__builtin_bswap32 (x) | 0x1234) could
be optimized into x | __builtin_bswap32 (0x1234) (similarly for ^ or & or ~),
etc. The question is if enough projects start using these builtins to make it
worth spending compile time on it and what exact optimizations are useful on
real-world code.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized
2009-05-20 18:48 [Bug tree-optimization/40210] New: gcc needs byte swap builtins eric-bugs at omnifarious dot org
@ 2009-05-20 19:39 ` eric-bugs at omnifarious dot org
2009-05-20 20:05 ` jakub at gcc dot gnu dot org
` (2 subsequent siblings)
3 siblings, 0 replies; 9+ messages in thread
From: eric-bugs at omnifarious dot org @ 2009-05-20 19:39 UTC (permalink / raw)
To: gcc-bugs
------- Comment #5 from eric-bugs at omnifarious dot org 2009-05-20 19:39 -------
This code:
#include <stdint.h>
#include <stddef.h>
inline uint64_t byteswap_64(const uint64_t x)
{
return __builtin_bswap64(x);
}
inline uint32_t byteswap_32(const uint32_t x)
{
return __builtin_bswap32(x);
}
extern void random_function(uint32_t a, uint64_t b, uint32_t c, uint64_t d);
void swapping(const uint32_t x32, const uint64_t x64)
{
random_function(byteswap_32(x32), byteswap_64(x64),
byteswap_32(byteswap_32(x32)), byteswap_64(byteswap_64(x64)));
}
void swaparray(uint64_t outvals[], char outtop[], const uint64_t invals[],
const size_t size)
{
size_t i = 0;
for (i = 0; i < size; ++i) {
outvals[i] = byteswap_64(invals[i]);
outtop[i] = (byteswap_64(invals[i]) >> 56) & 0xffull;
}
}
results in this assembly:
.globl swaparray
.type swaparray, @function
swaparray:
.LFB5:
testq %rcx, %rcx
je .L8
xorl %r8d, %r8d
.p2align 4,,7
.p2align 3
.L7:
movq (%rdx,%r8,8), %rax
bswap %rax
movq %rax, (%rdi,%r8,8)
movq (%rdx,%r8,8), %rax
bswap %rax
shrq $56, %rax
movb %al, (%rsi,%r8)
incq %r8
cmpq %r8, %rcx
ja .L7
.L8:
rep
ret
.LFE5:
.size swaparray, .-swaparray
.p2align 4,,15
.globl swapping
.type swapping, @function
swapping:
.LFB4:
bswap %rsi
bswap %edi
movq %rsi, %rcx
movl %edi, %edx
bswap %rcx
bswap %edx
jmp random_function
.LFE4:
.size swapping, .-swapping
when compiled with gcc -O3 -mtune=native -march=native on an Opteron system.
Notice that in swapping bswap is used twice rather than having two move
instructions and two bswap instructions. The optimizer is apparently unaware
that bswap is its own inverse.
In swaparray the bswap operation is not subject to an obvious CSE optimization,
nor is it realized that the latter line might be more efficiently implemented
by movb %al, (%rsi,%r8) before the bswap operation.
--
eric-bugs at omnifarious dot org changed:
What |Removed |Added
----------------------------------------------------------------------------
Summary|gcc needs byte swap builtins|gcc byte swap builtins
| |inadequately optimized
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2021-07-10 8:34 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <bug-40210-4@http.gcc.gnu.org/bugzilla/>
2011-10-07 18:23 ` [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized pluto at agmk dot net
2011-10-07 18:29 ` pinskia at gcc dot gnu.org
2011-10-07 18:46 ` pluto at agmk dot net
2021-07-08 10:48 ` cvs-commit at gcc dot gnu.org
2021-07-10 8:34 ` roger at nextmovesoftware dot com
2009-05-20 18:48 [Bug tree-optimization/40210] New: gcc needs byte swap builtins eric-bugs at omnifarious dot org
2009-05-20 19:39 ` [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized eric-bugs at omnifarious dot org
2009-05-20 20:05 ` jakub at gcc dot gnu dot org
2009-05-20 20:22 ` eric-bugs at omnifarious dot org
2009-06-10 17:37 ` hp at gcc dot gnu dot org
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).