public inbox for gcc-bugs@sourceware.org help / color / mirror / Atom feed
* [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized [not found] <bug-40210-4@http.gcc.gnu.org/bugzilla/> @ 2011-10-07 18:23 ` pluto at agmk dot net 2011-10-07 18:29 ` pinskia at gcc dot gnu.org ` (3 subsequent siblings) 4 siblings, 0 replies; 9+ messages in thread From: pluto at agmk dot net @ 2011-10-07 18:23 UTC (permalink / raw) To: gcc-bugs http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210 --- Comment #9 from Pawel Sikora <pluto at agmk dot net> 2011-10-07 18:22:25 UTC --- Created attachment 25442 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25442 testcase ^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized [not found] <bug-40210-4@http.gcc.gnu.org/bugzilla/> 2011-10-07 18:23 ` [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized pluto at agmk dot net @ 2011-10-07 18:29 ` pinskia at gcc dot gnu.org 2011-10-07 18:46 ` pluto at agmk dot net ` (2 subsequent siblings) 4 siblings, 0 replies; 9+ messages in thread From: pinskia at gcc dot gnu.org @ 2011-10-07 18:29 UTC (permalink / raw) To: gcc-bugs http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210 --- Comment #10 from Andrew Pinski <pinskia at gcc dot gnu.org> 2011-10-07 18:29:07 UTC --- (In reply to comment #9) > Created attachment 25442 [details] > testcase I think those are hard to optimize really since those are inline-asm really. And the unsigned short one gets optimized correctly. ^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized [not found] <bug-40210-4@http.gcc.gnu.org/bugzilla/> 2011-10-07 18:23 ` [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized pluto at agmk dot net 2011-10-07 18:29 ` pinskia at gcc dot gnu.org @ 2011-10-07 18:46 ` pluto at agmk dot net 2021-07-08 10:48 ` cvs-commit at gcc dot gnu.org 2021-07-10 8:34 ` roger at nextmovesoftware dot com 4 siblings, 0 replies; 9+ messages in thread From: pluto at agmk dot net @ 2011-10-07 18:46 UTC (permalink / raw) To: gcc-bugs http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210 --- Comment #11 from Pawel Sikora <pluto at agmk dot net> 2011-10-07 18:45:57 UTC --- (In reply to comment #10) > (In reply to comment #9) > > Created attachment 25442 [details] > > testcase > > I think those are hard to optimize really since those are inline-asm really. > And the unsigned short one gets optimized correctly. ahhh, glibc uses a generic i386 implementation (ror+xchg) in <byteorder.h> while it has an optimized <byteswap.h> for i486+ :( ^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized [not found] <bug-40210-4@http.gcc.gnu.org/bugzilla/> ` (2 preceding siblings ...) 2011-10-07 18:46 ` pluto at agmk dot net @ 2021-07-08 10:48 ` cvs-commit at gcc dot gnu.org 2021-07-10 8:34 ` roger at nextmovesoftware dot com 4 siblings, 0 replies; 9+ messages in thread From: cvs-commit at gcc dot gnu.org @ 2021-07-08 10:48 UTC (permalink / raw) To: gcc-bugs https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210 --- Comment #12 from CVS Commits <cvs-commit at gcc dot gnu.org> --- The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>: https://gcc.gnu.org/g:4c619132b3f14dc5e672a7f2f0e09cb784193559 commit r12-2137-g4c619132b3f14dc5e672a7f2f0e09cb784193559 Author: Roger Sayle <roger@nextmovesoftware.com> Date: Thu Jul 8 11:46:14 2021 +0100 PR tree-optimization/40210: Fold (bswap(X)>>C1)&C2 to (X>>C3)&C2 in match.pd All of the optimizations/transformations mentioned in bugzilla for PR tree-optimization/40210 are already implemented in mainline GCC, with one exception. In comment #5, there's a suggestion that (bswap64(x)>>56)&0xff can be implemented without the bswap as (unsigned char)x, or equivalently x&0xff. This patch implements the above optimization, and closely related variants. For any single bit, (bswap(X)>>C1)&1 can be simplified to (X>>C2)&1, where bit position C2 is the appropriate permutation of C1. Similarly, the bswap can eliminated if the desired set of bits all lie within the same byte, hence (bswap(x)>>8)&255 can always be optimized, as can (bswap(x)>>8)&123. Previously, int foo(long long x) { return (__builtin_bswap64(x) >> 56) & 0xff; } compiled with -O2 to foo: movq %rdi, %rax bswap %rax shrq $56, %rax ret with this patch, it now compiles to foo: movzbl %dil, %eax ret 2021-07-08 Roger Sayle <roger@nextmovesoftware.com> Richard Biener <rguenther@suse.de> gcc/ChangeLog PR tree-optimization/40210 * match.pd (bswap optimizations): Simplify (bswap(x)>>C1)&C2 as (x>>C3)&C2 when possible. Simplify bswap(x)>>C1 as ((T)x)>>C2 when possible. Simplify bswap(x)&C1 as (x>>C2)&C1 when 0<=C1<=255. gcc/testsuite/ChangeLog PR tree-optimization/40210 * gcc.dg/builtin-bswap-13.c: New test. * gcc.dg/builtin-bswap-14.c: New test. ^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized [not found] <bug-40210-4@http.gcc.gnu.org/bugzilla/> ` (3 preceding siblings ...) 2021-07-08 10:48 ` cvs-commit at gcc dot gnu.org @ 2021-07-10 8:34 ` roger at nextmovesoftware dot com 4 siblings, 0 replies; 9+ messages in thread From: roger at nextmovesoftware dot com @ 2021-07-10 8:34 UTC (permalink / raw) To: gcc-bugs https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210 Roger Sayle <roger at nextmovesoftware dot com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |roger at nextmovesoftware dot com Resolution|--- |FIXED Status|UNCONFIRMED |RESOLVED Target Milestone|--- |12.0 --- Comment #13 from Roger Sayle <roger at nextmovesoftware dot com> --- All of the (bswap-related) optimizations mentioned in PR tree-optimization/40210 are now implemented in mainline GCC. ^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug tree-optimization/40210] New: gcc needs byte swap builtins @ 2009-05-20 18:48 eric-bugs at omnifarious dot org 2009-05-20 19:39 ` [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized eric-bugs at omnifarious dot org ` (3 more replies) 0 siblings, 4 replies; 9+ messages in thread From: eric-bugs at omnifarious dot org @ 2009-05-20 18:48 UTC (permalink / raw) To: gcc-bugs gcc needs some built in functions for byte swapping. I've been experimenting with the various versions of byte swapping functions out there, and they either result in code that's opaque to the optimizer (i.e. swapping something twice is not considered a null operation) or the optimizer doesn't recognize that a byte swap is what's happening and renders it as a complex series of shift, and and or instructions. I know very little about the internals of gcc, but my ignorant preference would be to make tree-ssa recognize that code like this: inline uint64_t byteswap_64(const uint64_t x) { return ((((x) & 0xff00000000000000ull) >> 56) | (((x) & 0x00ff000000000000ull) >> 40) | (((x) & 0x0000ff0000000000ull) >> 24) | (((x) & 0x000000ff00000000ull) >> 8) | (((x) & 0x00000000ff000000ull) << 8) | (((x) & 0x0000000000ff0000ull) << 24) | (((x) & 0x000000000000ff00ull) << 40) | (((x) & 0x00000000000000ffull) << 56)); } is a byte swap and optimize appropriately. If this were being done to an entire array, it might even be possible to vectorize it efficiently. This would also mean that code to pull specific bits out of a pre or post swap value could be moved around and fiddled to get the value out of a different place if it made for more efficient register usage. -- Summary: gcc needs byte swap builtins Product: gcc Version: 4.3.0 Status: UNCONFIRMED Severity: enhancement Priority: P3 Component: tree-optimization AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: eric-bugs at omnifarious dot org http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210 ^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized 2009-05-20 18:48 [Bug tree-optimization/40210] New: gcc needs byte swap builtins eric-bugs at omnifarious dot org @ 2009-05-20 19:39 ` eric-bugs at omnifarious dot org 2009-05-20 20:05 ` jakub at gcc dot gnu dot org ` (2 subsequent siblings) 3 siblings, 0 replies; 9+ messages in thread From: eric-bugs at omnifarious dot org @ 2009-05-20 19:39 UTC (permalink / raw) To: gcc-bugs ------- Comment #5 from eric-bugs at omnifarious dot org 2009-05-20 19:39 ------- This code: #include <stdint.h> #include <stddef.h> inline uint64_t byteswap_64(const uint64_t x) { return __builtin_bswap64(x); } inline uint32_t byteswap_32(const uint32_t x) { return __builtin_bswap32(x); } extern void random_function(uint32_t a, uint64_t b, uint32_t c, uint64_t d); void swapping(const uint32_t x32, const uint64_t x64) { random_function(byteswap_32(x32), byteswap_64(x64), byteswap_32(byteswap_32(x32)), byteswap_64(byteswap_64(x64))); } void swaparray(uint64_t outvals[], char outtop[], const uint64_t invals[], const size_t size) { size_t i = 0; for (i = 0; i < size; ++i) { outvals[i] = byteswap_64(invals[i]); outtop[i] = (byteswap_64(invals[i]) >> 56) & 0xffull; } } results in this assembly: .globl swaparray .type swaparray, @function swaparray: .LFB5: testq %rcx, %rcx je .L8 xorl %r8d, %r8d .p2align 4,,7 .p2align 3 .L7: movq (%rdx,%r8,8), %rax bswap %rax movq %rax, (%rdi,%r8,8) movq (%rdx,%r8,8), %rax bswap %rax shrq $56, %rax movb %al, (%rsi,%r8) incq %r8 cmpq %r8, %rcx ja .L7 .L8: rep ret .LFE5: .size swaparray, .-swaparray .p2align 4,,15 .globl swapping .type swapping, @function swapping: .LFB4: bswap %rsi bswap %edi movq %rsi, %rcx movl %edi, %edx bswap %rcx bswap %edx jmp random_function .LFE4: .size swapping, .-swapping when compiled with gcc -O3 -mtune=native -march=native on an Opteron system. Notice that in swapping bswap is used twice rather than having two move instructions and two bswap instructions. The optimizer is apparently unaware that bswap is its own inverse. In swaparray the bswap operation is not subject to an obvious CSE optimization, nor is it realized that the latter line might be more efficiently implemented by movb %al, (%rsi,%r8) before the bswap operation. -- eric-bugs at omnifarious dot org changed: What |Removed |Added ---------------------------------------------------------------------------- Summary|gcc needs byte swap builtins|gcc byte swap builtins | |inadequately optimized http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210 ^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized 2009-05-20 18:48 [Bug tree-optimization/40210] New: gcc needs byte swap builtins eric-bugs at omnifarious dot org 2009-05-20 19:39 ` [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized eric-bugs at omnifarious dot org @ 2009-05-20 20:05 ` jakub at gcc dot gnu dot org 2009-05-20 20:22 ` eric-bugs at omnifarious dot org 2009-06-10 17:37 ` hp at gcc dot gnu dot org 3 siblings, 0 replies; 9+ messages in thread From: jakub at gcc dot gnu dot org @ 2009-05-20 20:05 UTC (permalink / raw) To: gcc-bugs ------- Comment #6 from jakub at gcc dot gnu dot org 2009-05-20 20:05 ------- There are plenty other possible builtin bswap optimizations. E.g. extern void bar (void); void foo (int x) { if (__builtin_bswap32 (x) == __builtin_bswap32 (0x1234567)) bar (); } should be optimized into if (x == 0x1234567) (only for EQ_EXPR/NE_EXPR), similarly __builtin_bswap32 (x) == 0x1234567 should be optimized into x == __builtin_bswap32 (0x1234567) because the latter can be swapped at compile time, similarly __builtin_bswap32 (__builtin_bswap32 (x) | 0x1234) could be optimized into x | __builtin_bswap32 (0x1234) (similarly for ^ or & or ~), etc. The question is if enough projects start using these builtins to make it worth spending compile time on it and what exact optimizations are useful on real-world code. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210 ^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized 2009-05-20 18:48 [Bug tree-optimization/40210] New: gcc needs byte swap builtins eric-bugs at omnifarious dot org 2009-05-20 19:39 ` [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized eric-bugs at omnifarious dot org 2009-05-20 20:05 ` jakub at gcc dot gnu dot org @ 2009-05-20 20:22 ` eric-bugs at omnifarious dot org 2009-06-10 17:37 ` hp at gcc dot gnu dot org 3 siblings, 0 replies; 9+ messages in thread From: eric-bugs at omnifarious dot org @ 2009-05-20 20:22 UTC (permalink / raw) To: gcc-bugs ------- Comment #7 from eric-bugs at omnifarious dot org 2009-05-20 20:22 ------- I've been playing around a bit more, and I've noticed that gcc in general does not do a spectacular job of optimizing bitwise operations of any kind. Some kind of general framework for tracking the movements of individual bits and details like "16 bit values only have 16 bits, so using & to ensure this in various ways is a null operation." might actually do a lot to speed up a lot of code. I distinctly remember a time long past when I and a co-worker fiddled some complex bit operations this way and that to get the assembly out we knew was close to optimal for a tight inner loop. The resulting expression was significantly less clear than the most obvious way of stating the same thing and I also knew that if DEC changed their compiler in certain ways we'd have to do it all over again. As an example, there is no reason that: (x << 8) | (x >> 8) should result in better code than ((x & 0xffu) << 8) | ((x & 0xff00u) >> 8) when x is of type uint16_t, but it does. And recognizing that either can be done in one instruction on an x86 would be even better. So, while I think you are likely correct that the byteswap builtins do not need a lot of extensive optimization, I do think that bit operations in general could be handled a lot better, and that would help out a whole lot of code. Once that framework was in place optimizing the byteswap builtin would be trivial. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210 ^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized 2009-05-20 18:48 [Bug tree-optimization/40210] New: gcc needs byte swap builtins eric-bugs at omnifarious dot org ` (2 preceding siblings ...) 2009-05-20 20:22 ` eric-bugs at omnifarious dot org @ 2009-06-10 17:37 ` hp at gcc dot gnu dot org 3 siblings, 0 replies; 9+ messages in thread From: hp at gcc dot gnu dot org @ 2009-06-10 17:37 UTC (permalink / raw) To: gcc-bugs ------- Comment #8 from hp at gcc dot gnu dot org 2009-06-10 17:36 ------- (In reply to comment #7) > I do think that bit operations in general > could be handled a lot better, and that would help out a whole lot of code. If you add (compilable) test-cases with enhancement requests, carefully noting target and gcc version where you observed the lack of optimization, this would help a lot. > Once that framework was in place optimizing the byteswap builtin would be > trivial. We already have a "framework". -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210 ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2021-07-10 8:34 UTC | newest] Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <bug-40210-4@http.gcc.gnu.org/bugzilla/> 2011-10-07 18:23 ` [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized pluto at agmk dot net 2011-10-07 18:29 ` pinskia at gcc dot gnu.org 2011-10-07 18:46 ` pluto at agmk dot net 2021-07-08 10:48 ` cvs-commit at gcc dot gnu.org 2021-07-10 8:34 ` roger at nextmovesoftware dot com 2009-05-20 18:48 [Bug tree-optimization/40210] New: gcc needs byte swap builtins eric-bugs at omnifarious dot org 2009-05-20 19:39 ` [Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized eric-bugs at omnifarious dot org 2009-05-20 20:05 ` jakub at gcc dot gnu dot org 2009-05-20 20:22 ` eric-bugs at omnifarious dot org 2009-06-10 17:37 ` hp at gcc dot gnu dot org
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).