public inbox for gcc-bugs@sourceware.org
* [Bug target/114319] New: htobe64-like function is not optimized on 32-bit x86
@ 2024-03-12 17:45 pali at kernel dot org
  2024-03-12 17:49 ` [Bug middle-end/114319] " pinskia at gcc dot gnu.org
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: pali at kernel dot org @ 2024-03-12 17:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114319

            Bug ID: 114319
           Summary: htobe64-like function is not optimized on 32-bit x86
           Product: gcc
           Version: 12.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pali at kernel dot org
  Target Milestone: ---
            Target: x86

Here is a very simple and straightforward implementation of an htobe64 function
which takes a 64-bit number stored in an unsigned long long variable and encodes
it into a byte buffer (unsigned char[]).

void test1(unsigned long long val, unsigned char *buf) {
  buf[0] = val >> 56;
  buf[1] = val >> 48;
  buf[2] = val >> 40;
  buf[3] = val >> 32;
  buf[4] = val >> 24;
  buf[5] = val >> 16;
  buf[6] = val >> 8;
  buf[7] = val;
}
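
To make the intended behaviour concrete, here is a minimal sketch that checks that test1 stores the most significant byte first, independent of the host byte order. The helper check_test1 and the sample value are not part of the report; they are assumptions added for illustration.

```c
#include <assert.h>
#include <string.h>

/* test1 exactly as given in the report. */
void test1(unsigned long long val, unsigned char *buf) {
  buf[0] = val >> 56;
  buf[1] = val >> 48;
  buf[2] = val >> 40;
  buf[3] = val >> 32;
  buf[4] = val >> 24;
  buf[5] = val >> 16;
  buf[6] = val >> 8;
  buf[7] = val;
}

/* Hypothetical self-check (not in the report): aborts unless test1
   writes the bytes of the sample value in big-endian order. */
void check_test1(void) {
  static const unsigned char expected[8] = {
    0x01, 0x23, 0x45, 0x67, 0x89, 0xab, 0xcd, 0xef
  };
  unsigned char buf[8];

  test1(0x0123456789abcdefULL, buf);
  assert(memcmp(buf, expected, sizeof expected) == 0);
}
```

Calling check_test1() from main should pass on both little-endian and big-endian hosts, since test1 only uses shifts and never reinterprets memory.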

Compiling it for 64-bit x86 via "gcc -m64 -O2" produces optimized code:

0000000000000000 <test1>:
   0:   48 0f cf                bswap  %rdi
   3:   48 89 3e                mov    %rdi,(%rsi)
   6:   c3                      retq

But compiling it for 32-bit x86 via "gcc -m32 -O2" produces less optimized
code:

00000000 <test1>:
   0:   8b 54 24 08             mov    0x8(%esp),%edx
   4:   8b 44 24 0c             mov    0xc(%esp),%eax
   8:   89 d1                   mov    %edx,%ecx
   a:   88 70 02                mov    %dh,0x2(%eax)
   d:   c1 e9 18                shr    $0x18,%ecx
  10:   88 50 03                mov    %dl,0x3(%eax)
  13:   88 08                   mov    %cl,(%eax)
  15:   89 d1                   mov    %edx,%ecx
  17:   8b 54 24 04             mov    0x4(%esp),%edx
  1b:   c1 e9 10                shr    $0x10,%ecx
  1e:   0f ca                   bswap  %edx
  20:   88 48 01                mov    %cl,0x1(%eax)
  23:   89 50 04                mov    %edx,0x4(%eax)
  26:   c3                      ret


I tried to compile it for 32-bit powerpc via "powerpc-linux-gnu-gcc -m32 -O2"
and it produces optimized code:

00000000 <test1>:
   0:   90 65 00 00     stw     r3,0(r5)
   4:   90 85 00 04     stw     r4,4(r5)
   8:   4e 80 00 20     blr

Same for 64-bit powerpc via "powerpc-linux-gnu-gcc -m64 -O2":

0000000000000000 <.test1>:
   0:   f8 64 00 00     std     r3,0(r4)
   4:   4e 80 00 20     blr


As a next experiment I tried to rewrite the simple implementation to use gcc
builtins.

void test2(unsigned long long val, unsigned char *buf) {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
  val = __builtin_bswap64(val);
#endif
  __builtin_memcpy(buf, &val, sizeof(val));
}

If I compile it for 32-bit x86 then I get optimized code:

00000030 <test2>:
  30:   8b 4c 24 0c             mov    0xc(%esp),%ecx
  34:   8b 44 24 04             mov    0x4(%esp),%eax
  38:   8b 54 24 08             mov    0x8(%esp),%edx
  3c:   0f c8                   bswap  %eax
  3e:   89 41 04                mov    %eax,0x4(%ecx)
  41:   0f ca                   bswap  %edx
  43:   89 11                   mov    %edx,(%ecx)
  45:   c3                      ret

If I compile it for 64-bit x86 then I get exactly the same code as for test1:

0000000000000010 <test2>:
  10:   48 0f cf                bswap  %rdi
  13:   48 89 3e                mov    %rdi,(%rsi)
  16:   c3                      retq

I tried compiling it for powerpc too, and the results for test1 and test2 were
the same.
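
Since the comparison relies on test1 and test2 being functionally equivalent, here is a small sketch that checks this at run time. The helpers same_output and check_equivalence and the sample values are assumptions made for this example. The sketch also assumes a GCC-compatible compiler (for the builtins and __BYTE_ORDER__) and an 8-byte unsigned long long.

```c
#include <assert.h>
#include <string.h>

/* test1 and test2 as given in the report. */
void test1(unsigned long long val, unsigned char *buf) {
  buf[0] = val >> 56;
  buf[1] = val >> 48;
  buf[2] = val >> 40;
  buf[3] = val >> 32;
  buf[4] = val >> 24;
  buf[5] = val >> 16;
  buf[6] = val >> 8;
  buf[7] = val;
}

void test2(unsigned long long val, unsigned char *buf) {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
  val = __builtin_bswap64(val);
#endif
  __builtin_memcpy(buf, &val, sizeof(val));
}

/* Hypothetical helper: returns 1 if both functions write identical
   bytes for val.  Assumes sizeof(unsigned long long) == 8. */
int same_output(unsigned long long val) {
  unsigned char a[8], b[8];

  test1(val, a);
  test2(val, b);
  return memcmp(a, b, sizeof a) == 0;
}

/* Hypothetical self-check over a few arbitrary sample values. */
void check_equivalence(void) {
  static const unsigned long long samples[] = {
    0ULL, 1ULL, 0x0123456789abcdefULL, 0xffffffff00000000ULL, ~0ULL
  };

  for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
    assert(same_output(samples[i]));
}
```

This only confirms that the two functions agree on the written bytes; it says nothing about the quality of the generated code, which still has to be inspected in the disassembly.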



So it looks like the issue is specific to 32-bit x86: gcc does not detect that
the test1 function is doing a bswap64 there.

All tests were done with Debian's gcc on amd64; for the powerpc target I used
Debian's powerpc-linux-gnu-gcc cross compiler.


* [Bug middle-end/114319] htobe64-like function is not optimized on 32-bit x86
  2024-03-12 17:45 [Bug target/114319] New: htobe64-like function is not optimized on 32-bit x86 pali at kernel dot org
@ 2024-03-12 17:49 ` pinskia at gcc dot gnu.org
  2024-03-12 17:56 ` pinskia at gcc dot gnu.org
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-03-12 17:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114319

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|target                      |middle-end
           Severity|normal                      |enhancement
           Keywords|                            |missed-optimization


* [Bug middle-end/114319] htobe64-like function is not optimized on 32-bit x86
  2024-03-12 17:45 [Bug target/114319] New: htobe64-like function is not optimized on 32-bit x86 pali at kernel dot org
  2024-03-12 17:49 ` [Bug middle-end/114319] " pinskia at gcc dot gnu.org
@ 2024-03-12 17:56 ` pinskia at gcc dot gnu.org
  2024-03-12 18:04 ` pinskia at gcc dot gnu.org
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-03-12 17:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114319

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Known to fail|                            |11.4.0

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
>But compiling it for 32-bit x86 via "gcc -m32 -O2" produces not so optimized code:


I get that code generation for GCC 11.4.0 and before.

For GCC 12.1.0 and above I get:
```
        movl    8(%esp), %ecx
        bswap   %ecx
        movl    %ecx, %eax
        movl    4(%esp), %ecx
        bswap   %ecx
        movl    %ecx, %edx
        movl    12(%esp), %ecx
        movl    %eax, (%ecx)
        movl    %edx, 4(%ecx)
        ret
```

Which just has a few extra moves.

But with -mno-sse added, GCC 12 produces worse code.


* [Bug middle-end/114319] htobe64-like function is not optimized on 32-bit x86
  2024-03-12 17:45 [Bug target/114319] New: htobe64-like function is not optimized on 32-bit x86 pali at kernel dot org
  2024-03-12 17:49 ` [Bug middle-end/114319] " pinskia at gcc dot gnu.org
  2024-03-12 17:56 ` pinskia at gcc dot gnu.org
@ 2024-03-12 18:04 ` pinskia at gcc dot gnu.org
  2024-03-12 18:07 ` pali at kernel dot org
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-03-12 18:04 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114319

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2024-03-12
             Target|x86                         |ILP32
             Blocks|                            |94094
     Ever confirmed|0                           |1

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Confirmed. I see that trunk, even without -mno-sse, does not produce the 2 bswaps.

Looks like the store-merging pass is not recognizing bswap<<32 for some reason.

Also I thought there was a dup somewhere ...


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94094
[Bug 94094] [meta-bug] store-merging and/or bswap load/store-merging missed
optimizations


* [Bug middle-end/114319] htobe64-like function is not optimized on 32-bit x86
  2024-03-12 17:45 [Bug target/114319] New: htobe64-like function is not optimized on 32-bit x86 pali at kernel dot org
                   ` (2 preceding siblings ...)
  2024-03-12 18:04 ` pinskia at gcc dot gnu.org
@ 2024-03-12 18:07 ` pali at kernel dot org
  2024-03-12 18:10 ` pinskia at gcc dot gnu.org
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: pali at kernel dot org @ 2024-03-12 18:07 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114319

--- Comment #3 from Pali Rohár <pali at kernel dot org> ---
For details, here is the compiler which produces the mentioned code:

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/12/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 12.2.0-14'
--with-bugurl=file:///usr/share/doc/gcc-12/README.Bugs
--enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2 --prefix=/usr
--with-gcc-major-version-only --program-suffix=-12
--program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug
--enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new
--enable-gnu-unique-object --disable-vtable-verify --enable-plugin
--enable-default-pie --with-system-zlib --enable-libphobos-checking=release
--with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch
--disable-werror --enable-cet --with-arch-32=i686 --with-abi=m64
--with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic
--enable-offload-targets=nvptx-none=/build/gcc-12-bTRWOB/gcc-12-12.2.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-12-bTRWOB/gcc-12-12.2.0/debian/tmp-gcn/usr
--enable-offload-defaulted --without-cuda-driver --enable-checking=release
--build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 12.2.0 (Debian 12.2.0-14)

I guess that with these configure options you should be able to build a gcc
that produces the mentioned code.


* [Bug middle-end/114319] htobe64-like function is not optimized on 32-bit x86
  2024-03-12 17:45 [Bug target/114319] New: htobe64-like function is not optimized on 32-bit x86 pali at kernel dot org
                   ` (3 preceding siblings ...)
  2024-03-12 18:07 ` pali at kernel dot org
@ 2024-03-12 18:10 ` pinskia at gcc dot gnu.org
  2024-03-13  9:47 ` rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-03-12 18:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114319

--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Pali Rohár from comment #3)
> --with-arch-32=i686

This basically causes SSE to be disabled for 32bit by default ...
With the default options to configure GCC, -m32 for x86_64 still enables sse
...


* [Bug middle-end/114319] htobe64-like function is not optimized on 32-bit x86
  2024-03-12 17:45 [Bug target/114319] New: htobe64-like function is not optimized on 32-bit x86 pali at kernel dot org
                   ` (4 preceding siblings ...)
  2024-03-12 18:10 ` pinskia at gcc dot gnu.org
@ 2024-03-13  9:47 ` rguenth at gcc dot gnu.org
  2024-03-13 14:35 ` cvs-commit at gcc dot gnu.org
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-03-13  9:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114319

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Coalescing successful!
Merged into 1 stores
32 bit bswap implementation found at: _37

Looks like we are only merging one store.  Note that we cannot recognize
a bswap to memory; this is a known issue.  So for the bswap64 we need to
merge to a 64bit store, which we never do on a 32bit platform.  We
could with SSE, but apparently we don't try, at least not with the bswap
trick.  The bswap trick also doesn't seem to consider the split
64bit bswap.  Oddly enough, we also fail to merge the other store
(maybe missing a val >> 32 pre-shift "trick").
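
To illustrate the split 64bit bswap mentioned here, a minimal sketch in C: a 64-bit byte swap is the same as byte-swapping each 32-bit half and exchanging the halves. The function names bswap64_split and check_bswap64_split are made up for this example, and the code is not taken from GCC's implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: a 64-bit byte swap built from two 32-bit swaps,
   which is the shape a 32-bit target can implement with two word-sized
   bswap instructions. */
uint64_t bswap64_split(uint64_t x) {
  uint32_t lo = (uint32_t)x;
  uint32_t hi = (uint32_t)(x >> 32);

  /* The swapped low half becomes the new high half and vice versa. */
  return ((uint64_t)__builtin_bswap32(lo) << 32) | __builtin_bswap32(hi);
}

/* Hypothetical self-check against the GCC builtin on a few samples. */
void check_bswap64_split(void) {
  static const uint64_t samples[] = {
    0, 1, UINT64_C(0x0123456789abcdef), UINT64_C(0xffffffff00000000)
  };

  for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
    assert(bswap64_split(samples[i]) == __builtin_bswap64(samples[i]));
}
```

This matches the 32-bit x86 output shown for test2 in the original report: each argument word gets its own bswap, and the swapped low word is stored at offset 4 while the swapped high word is stored at offset 0.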

Possibly this could be shown to be a similar issue with a 128bit bswap
on x86_64, which we could emulate with two 64bit bswaps.
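
The same idea one level up, as a sketch: a 128-bit byte swap emulated with two 64-bit byte swaps. GCC has no __builtin_bswap128, so the struct u128, its field names and the function bswap128_split are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit value held as two 64-bit halves. */
struct u128 {
  uint64_t hi;  /* most significant 64 bits */
  uint64_t lo;  /* least significant 64 bits */
};

/* Illustrative only: swap the bytes of each half and exchange the halves. */
struct u128 bswap128_split(struct u128 x) {
  struct u128 r;

  r.hi = __builtin_bswap64(x.lo);
  r.lo = __builtin_bswap64(x.hi);
  return r;
}

/* Hypothetical self-check: swapping twice must give back the input. */
void check_bswap128_split(void) {
  struct u128 x = { UINT64_C(0x0011223344556677), UINT64_C(0x8899aabbccddeeff) };
  struct u128 y = bswap128_split(bswap128_split(x));

  assert(y.hi == x.hi && y.lo == x.lo);
}
```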


* [Bug middle-end/114319] htobe64-like function is not optimized on 32-bit x86
  2024-03-12 17:45 [Bug target/114319] New: htobe64-like function is not optimized on 32-bit x86 pali at kernel dot org
                   ` (5 preceding siblings ...)
  2024-03-13  9:47 ` rguenth at gcc dot gnu.org
@ 2024-03-13 14:35 ` cvs-commit at gcc dot gnu.org
  2024-03-13 14:38 ` jakub at gcc dot gnu.org
  2024-03-13 21:31 ` pali at kernel dot org
  8 siblings, 0 replies; 10+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2024-03-13 14:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114319

--- Comment #6 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:74bca21db31e3f4ab6543b56c3f26b4dfe586fef

commit r14-9453-g74bca21db31e3f4ab6543b56c3f26b4dfe586fef
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Wed Mar 13 15:34:59 2024 +0100

    store-merging: Match bswap64 on 32-bit targets with bswapsi2 [PR114319]

    gimple-ssa-store-merging.cc tests bswap_optab in 3 different places,
    in 2 of them it has special exception for double-word bswap using pair
    of word-mode bswap optabs, but in the last one it doesn't.

    The following patch changes even the last spot.
    We don't handle 128-bit bswaps in the passes at all, because currently we
    just use uint64_t to represent the byte reshuffling (we'd need to use
    offset_int or something like that instead) and we don't have
    __builtin_bswap128 nor type-generic __builtin_bswap, so there is nothing
    for 64-bit targets there.

    2024-03-13  Jakub Jelinek  <jakub@redhat.com>

            PR middle-end/114319
            * gimple-ssa-store-merging.cc
            (imm_store_chain_info::try_coalesce_bswap): For 32-bit targets
            allow matching __builtin_bswap64 if there is bswapsi2 optab.

            * gcc.target/i386/pr114319.c: New test.
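
The commit message says the passes use uint64_t to represent the byte reshuffling. A simplified sketch of that idea follows: each byte of a 64-bit word is a marker naming the source byte that ends up in that position. This is only loosely modeled on the approach; the macro and function names are invented here, and GCC's real representation tracks more states than this sketch does. It does show why 128-bit swaps do not fit: 16 one-byte markers need 128 bits.

```c
#include <assert.h>
#include <stdint.h>

/* One 8-bit marker per result byte: marker k (1..8) means "this result
   byte comes from source byte k-1", counting from the least significant
   byte.  Names are hypothetical. */
#define MARKERS_IDENTITY UINT64_C(0x0807060504030201)
#define MARKERS_BSWAP64  UINT64_C(0x0102030405060708)

/* Hypothetical helper: apply a marker word to an actual value. */
uint64_t apply_markers(uint64_t markers, uint64_t src) {
  uint64_t out = 0;

  for (int i = 0; i < 8; i++) {
    unsigned m = (unsigned)((markers >> (8 * i)) & 0xff);  /* marker for result byte i */

    assert(m >= 1 && m <= 8);  /* this sketch only handles plain permutations */
    out |= ((src >> (8 * (m - 1))) & 0xff) << (8 * i);
  }
  return out;
}
```

With this encoding, recognizing a byte swap amounts to comparing the marker word computed for an expression against MARKERS_BSWAP64, and recognizing a plain copy amounts to comparing against MARKERS_IDENTITY.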


* [Bug middle-end/114319] htobe64-like function is not optimized on 32-bit x86
  2024-03-12 17:45 [Bug target/114319] New: htobe64-like function is not optimized on 32-bit x86 pali at kernel dot org
                   ` (6 preceding siblings ...)
  2024-03-13 14:35 ` cvs-commit at gcc dot gnu.org
@ 2024-03-13 14:38 ` jakub at gcc dot gnu.org
  2024-03-13 21:31 ` pali at kernel dot org
  8 siblings, 0 replies; 10+ messages in thread
From: jakub at gcc dot gnu.org @ 2024-03-13 14:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114319

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
                 CC|                            |jakub at gcc dot gnu.org
             Status|NEW                         |RESOLVED

--- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Fixed for GCC 14.


* [Bug middle-end/114319] htobe64-like function is not optimized on 32-bit x86
  2024-03-12 17:45 [Bug target/114319] New: htobe64-like function is not optimized on 32-bit x86 pali at kernel dot org
                   ` (7 preceding siblings ...)
  2024-03-13 14:38 ` jakub at gcc dot gnu.org
@ 2024-03-13 21:31 ` pali at kernel dot org
  8 siblings, 0 replies; 10+ messages in thread
From: pali at kernel dot org @ 2024-03-13 21:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114319

--- Comment #8 from Pali Rohár <pali at kernel dot org> ---
Thanks for the quick response and the fix for this issue.
