[Bug middle-end/114449] New: bswap64 not optimized

* [Bug middle-end/114449] New: bswap64 not optimized
@ 2024-03-24 15:09 pali at kernel dot org
  2024-03-24 15:30 ` [Bug middle-end/114449] " xry111 at gcc dot gnu.org
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: pali at kernel dot org @ 2024-03-24 15:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114449

            Bug ID: 114449
           Summary: bswap64 not optimized
           Product: gcc
           Version: 13.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pali at kernel dot org
  Target Milestone: ---

https://godbolt.org/z/dc3br9dYT

GCC 13.2 with -O3 does not recognize the following straightforward loop as a
64-bit byte swap (bswap64) and generates unoptimized code for it.

    #include <stddef.h>  /* size_t */
    #include <stdint.h>  /* uint64_t */

    uint64_t bswap64_1(uint64_t num) {
        uint64_t ret = 0;
        for (size_t i = 0; i < sizeof(num); i++) {
            ret |= ((num >> (8*(sizeof(num)-1-i))) & 0xff) << (8*i);
        }
        return ret;
    }


Manually unrolling the loop makes gcc produce optimized code with a single
"bswap" instruction on x86-64.

    uint64_t bswap64_2(uint64_t num) {
        uint64_t ret = 0;
        ret |= (((num >> 56) & 0xff) <<  0);
        ret |= (((num >> 48) & 0xff) <<  8);
        ret |= (((num >> 40) & 0xff) << 16);
        ret |= (((num >> 32) & 0xff) << 24);
        ret |= (((num >> 24) & 0xff) << 32);
        ret |= (((num >> 16) & 0xff) << 40);
        ret |= (((num >>  8) & 0xff) << 48);
        ret |= (((num >>  0) & 0xff) << 56);
        return ret;
    }


Adding -funroll-all-loops for the first example does not help; gcc still
produces unoptimized code.
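
The two variants above are meant to compute the same value, so the difference
is only in the generated code. A minimal sketch of a check, assuming the
GCC/clang builtin __builtin_bswap64 as the reference; the names bswap64_loop,
bswap64_unrolled and bswap64_selfcheck are hypothetical and not part of the
report:

```c
#include <stddef.h>
#include <stdint.h>

/* Loop form, same expression as bswap64_1 in the report. */
uint64_t bswap64_loop(uint64_t num) {
    uint64_t ret = 0;
    for (size_t i = 0; i < sizeof(num); i++) {
        ret |= ((num >> (8 * (sizeof(num) - 1 - i))) & 0xff) << (8 * i);
    }
    return ret;
}

/* Manually unrolled form, same expressions as bswap64_2 in the report. */
uint64_t bswap64_unrolled(uint64_t num) {
    uint64_t ret = 0;
    ret |= (((num >> 56) & 0xff) <<  0);
    ret |= (((num >> 48) & 0xff) <<  8);
    ret |= (((num >> 40) & 0xff) << 16);
    ret |= (((num >> 32) & 0xff) << 24);
    ret |= (((num >> 24) & 0xff) << 32);
    ret |= (((num >> 16) & 0xff) << 40);
    ret |= (((num >>  8) & 0xff) << 48);
    ret |= (((num >>  0) & 0xff) << 56);
    return ret;
}

/* Returns 1 if both forms agree with the compiler builtin on a few
   sample inputs, 0 otherwise. The samples are arbitrary. */
int bswap64_selfcheck(void) {
    static const uint64_t samples[] = {
        0, 1, 0xff, 0x0123456789abcdefULL, 0xff00000000000000ULL, UINT64_MAX
    };
    for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
        uint64_t expected = __builtin_bswap64(samples[i]);
        if (bswap64_loop(samples[i]) != expected) return 0;
        if (bswap64_unrolled(samples[i]) != expected) return 0;
    }
    return 1;
}
```

This only checks results; whether either function compiles to a single bswap
still has to be read off the assembly, as in the godbolt link above.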

* [Bug middle-end/114449] bswap64 not optimized
From: xry111 at gcc dot gnu.org @ 2024-03-24 15:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114449

Xi Ruoyao <xry111 at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |xry111 at gcc dot gnu.org

--- Comment #1 from Xi Ruoyao <xry111 at gcc dot gnu.org> ---
Adding #pragma GCC unroll 8 before the loop makes it optimized.

IIRC by default GCC only unrolls loops by a factor of 4, so the loop is not
"fully" unrolled w/o the pragma.
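
A minimal sketch of where the pragma goes, assuming the loop from the original
report; the function name bswap64_unroll_pragma is hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

uint64_t bswap64_unroll_pragma(uint64_t num) {
    uint64_t ret = 0;
    /* Ask GCC to unroll all 8 iterations; the pragma must directly
       precede the loop it applies to. */
#pragma GCC unroll 8
    for (size_t i = 0; i < sizeof(num); i++) {
        ret |= ((num >> (8 * (sizeof(num) - 1 - i))) & 0xff) << (8 * i);
    }
    return ret;
}
```

The pragma changes only how the loop is compiled, not what it computes; a
compiler that does not know it should ignore it, possibly with a warning.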

* [Bug middle-end/114449] bswap64 not optimized
From: pali at kernel dot org @ 2024-03-24 15:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114449

--- Comment #2 from Pali Rohár <pali at kernel dot org> ---
Interesting... I was expecting that some -O3 or better -Ofast option tells gcc
to optimize the code as much as possible.

I added that pragma before the for loop in the first example, and gcc then
optimized the code down to just a bswap instruction.

* [Bug middle-end/114449] bswap64 not optimized
From: pali at kernel dot org @ 2024-03-24 15:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114449

--- Comment #3 from Pali Rohár <pali at kernel dot org> ---
Note that clang optimizes it already at -O2 and does not require any special
pragma.

* [Bug middle-end/114449] bswap64 not optimized
From: xry111 at gcc dot gnu.org @ 2024-03-24 15:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114449

Xi Ruoyao <xry111 at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2024-03-24

--- Comment #4 from Xi Ruoyao <xry111 at gcc dot gnu.org> ---
(In reply to Pali Rohár from comment #2)
> Interesting... I was expecting that some -O3 or better -Ofast option tells
> gcc to optimize the code as much as possible.

Yes, this is a bug.  I'm only providing a clue about why this happens; I don't
mean that we should rely on a #pragma to get the expected result.

But generally unrolling loops too much pessimizes the code, so increasing the
default unrolling factor is a no-go.  We need to invent something cleverer for
this.

* [Bug middle-end/114449] bswap64 not optimized
From: pinskia at gcc dot gnu.org @ 2024-03-24 18:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114449

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pinskia at gcc dot gnu.org
           Severity|normal                      |enhancement

* [Bug middle-end/114449] bswap64 not optimized
From: rguenth at gcc dot gnu.org @ 2024-03-25  9:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114449

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note we do unroll the loop with -O3, but only late, and after that we do not
re-do bswap recognition (which happens before loop optimization).  At -O2 we
don't unroll because that increases code size too much.

Recognition of "final value computation" is done in the sccp pass, which
could be amended for this (final_value_replacement_loop,
tree-scalar-evolution.cc).
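
As a rough illustration of what "final value computation" means in general
(this is a generic sketch, not GCC source, and the function names are
hypothetical): when the only result of a loop that is used afterwards is the
last value of a variable, the loop can in principle be replaced by a
closed-form expression for that value. For the bswap64 loop, the closed form
would be the byte swap of num.

```c
#include <stdint.h>

/* A loop whose only live result is the final value of acc. */
uint32_t final_value_loop(uint32_t n) {
    uint32_t acc = 0;
    for (uint32_t i = 0; i < n; i++) {
        acc += 3;
    }
    return acc;
}

/* The same result as a closed-form expression. Unsigned arithmetic
   wraps modulo 2^32 in both versions, so they agree for every n. */
uint32_t final_value_closed_form(uint32_t n) {
    return 3u * n;
}
```

Whether a particular GCC version performs exactly this rewrite is not claimed
here; the sketch only shows the kind of equivalence such a pass looks for.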
