public inbox for gcc-bugs@sourceware.org
* [Bug middle-end/114449] New: bswap64 not optimized
@ 2024-03-24 15:09 pali at kernel dot org
2024-03-24 15:30 ` [Bug middle-end/114449] " xry111 at gcc dot gnu.org
` (5 more replies)
0 siblings, 6 replies; 7+ messages in thread
From: pali at kernel dot org @ 2024-03-24 15:09 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114449
Bug ID: 114449
Summary: bswap64 not optimized
Product: gcc
Version: 13.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: pali at kernel dot org
Target Milestone: ---
https://godbolt.org/z/dc3br9dYT
gcc 13.2 with -O3 does not recognize a straightforward loop implementation of
bswap64 and generates unoptimized code for it.
uint64_t bswap64_1(uint64_t num) {
    uint64_t ret = 0;
    for (size_t i = 0; i < sizeof(num); i++) {
        ret |= ((num >> (8*(sizeof(num)-1-i))) & 0xff) << (8*i);
    }
    return ret;
}
Rewriting the code to unroll the loop manually makes gcc produce optimized
code with a single "bswap" instruction on x86-64.
uint64_t bswap64_2(uint64_t num) {
    uint64_t ret = 0;
    ret |= (((num >> 56) & 0xff) << 0);
    ret |= (((num >> 48) & 0xff) << 8);
    ret |= (((num >> 40) & 0xff) << 16);
    ret |= (((num >> 32) & 0xff) << 24);
    ret |= (((num >> 24) & 0xff) << 32);
    ret |= (((num >> 16) & 0xff) << 40);
    ret |= (((num >> 8) & 0xff) << 48);
    ret |= (((num >> 0) & 0xff) << 56);
    return ret;
}
Adding the -funroll-all-loops argument for the first example does not help; gcc
still produces unoptimized code.
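As a sanity check that the loop form really is a byte swap, here is a minimal sketch comparing it against the compiler builtin. It assumes a GCC- or clang-compatible compiler that provides __builtin_bswap64; the helper name check_bswap is hypothetical and not part of the report.

```c
#include <stddef.h>
#include <stdint.h>

/* Loop form from the report: move byte (7 - i) of num into byte i of ret. */
static uint64_t bswap64_1(uint64_t num) {
    uint64_t ret = 0;
    for (size_t i = 0; i < sizeof(num); i++) {
        ret |= ((num >> (8 * (sizeof(num) - 1 - i))) & 0xff) << (8 * i);
    }
    return ret;
}

/* Hypothetical helper: returns 1 when the loop form agrees with the
   builtin for the given input, 0 otherwise. */
static int check_bswap(uint64_t x) {
    return bswap64_1(x) == __builtin_bswap64(x);
}
```

This only checks that the results match; whether the loop compiles down to a single bswap instruction still has to be checked in the generated assembly.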
* [Bug middle-end/114449] bswap64 not optimized
From: xry111 at gcc dot gnu.org @ 2024-03-24 15:30 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114449
Xi Ruoyao <xry111 at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |xry111 at gcc dot gnu.org
--- Comment #1 from Xi Ruoyao <xry111 at gcc dot gnu.org> ---
Adding #pragma GCC unroll 8 before the loop makes it optimized.
IIRC by default GCC only unrolls loops by a factor of 4, so the loop is not
"fully" unrolled w/o the pragma.
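A minimal sketch of that suggestion, applied to the loop from the original report. The function name bswap64_unrolled is hypothetical; the pragma has to sit directly before the loop it applies to.

```c
#include <stddef.h>
#include <stdint.h>

/* Same loop as bswap64_1 in the report, with an explicit unroll request. */
uint64_t bswap64_unrolled(uint64_t num) {
    uint64_t ret = 0;
    /* Ask GCC to unroll all 8 iterations so that the straight-line
       shift/mask/or sequence is visible to later passes. */
#pragma GCC unroll 8
    for (size_t i = 0; i < sizeof(num); i++) {
        ret |= ((num >> (8 * (sizeof(num) - 1 - i))) & 0xff) << (8 * i);
    }
    return ret;
}
```

The pragma does not change the result of the function, only how the loop is compiled, so the generated assembly still needs to be inspected to confirm the effect.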
* [Bug middle-end/114449] bswap64 not optimized
From: pali at kernel dot org @ 2024-03-24 15:40 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114449
--- Comment #2 from Pali Rohár <pali at kernel dot org> ---
Interesting... I was expecting that some -O3 or better -Ofast option tells gcc
to optimize the code as much as possible.
I added that pragma before the for loop in the first example, and gcc then
really did optimize the code to just a bswap instruction.
* [Bug middle-end/114449] bswap64 not optimized
From: pali at kernel dot org @ 2024-03-24 15:43 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114449
--- Comment #3 from Pali Rohár <pali at kernel dot org> ---
Note that clang optimizes it with just -O2 and does not require any special
pragma.
* [Bug middle-end/114449] bswap64 not optimized
From: xry111 at gcc dot gnu.org @ 2024-03-24 15:45 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114449
Xi Ruoyao <xry111 at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2024-03-24
--- Comment #4 from Xi Ruoyao <xry111 at gcc dot gnu.org> ---
(In reply to Pali Rohár from comment #2)
> Interesting... I was expecting that some -O3 or better -Ofast option tells
> gcc to optimize the code as much as possible.
Yes, this is a bug. I'm only providing a clue about why this happens; I don't
mean that we should rely on a #pragma to get the expected result.
But unrolling loops too much generally pessimizes the code, so increasing the
default unrolling factor is a no-go. We need to come up with something cleverer
for this.
* [Bug middle-end/114449] bswap64 not optimized
From: pinskia at gcc dot gnu.org @ 2024-03-24 18:17 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114449
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pinskia at gcc dot gnu.org
           Severity|normal                      |enhancement
* [Bug middle-end/114449] bswap64 not optimized
From: rguenth at gcc dot gnu.org @ 2024-03-25 9:17 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114449
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note we do unroll the loop with -O3, but only late, and after that we do not
re-do bswap recognition (which happens before loop optimization). At -O2 we
don't unroll because that increases code size too much.
Recognition of "final value computation" is done in the sccp pass, which
could be amended for this (final_value_replacement_loop,
tree-scalar-evolution.cc).