public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/104344] New: Suboptimal -Os code for manually unrolled loop
@ 2022-02-02 13:59 charles.nicholson at gmail dot com
  2022-02-02 14:14 ` [Bug tree-optimization/104344] " rguenth at gcc dot gnu.org
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: charles.nicholson at gmail dot com @ 2022-02-02 13:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104344

            Bug ID: 104344
           Summary: Suboptimal -Os code for manually unrolled loop
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: charles.nicholson at gmail dot com
  Target Milestone: ---

Manually copying the byte representation of a float to a uint32_t emits optimal
code with -Os when the copy is performed in a loop. When the copy is
hand-unrolled, -Os generates much larger code than -O3.

It's unclear to me why -Os doesn't generate the same code; to my eye the two
functions are equivalent and well-formed, and -O3 seems to agree. Apologies in
advance if I'm simply misunderstanding the C11 Standard and what transformations
are legal here.

This happens at least on x64 on gcc 12.0.1, and on ARMv7-M on armgcc 10.2.
Various architecture-specific flags like "-mcpu" and "-march" do not appear to
make a difference.

Code:
=====
#include <stdint.h>

_Static_assert(sizeof(uint32_t) == sizeof(float), "");
_Static_assert(sizeof(uint32_t) == 4, "");

uint32_t cast_through_char_unrolled(float f) {
  uint32_t u;
  char const *src = (char const *)&f;
  char *dst = (char *)&u;
  *dst++ = *src++;
  *dst++ = *src++;
  *dst++ = *src++;
  *dst++ = *src++;
  return u;
}

uint32_t cast_through_char_loop(float f) {
  uint32_t u;
  char const *src = (char const *)&f;
  char *dst = (char *)&u;
  for (int i = 0; i < 4; ++i) {
    *dst++ = *src++;
  }
  return u;
}

-Os output (flags: "--std=c11 -Wall -Wextra -Os")
=======================================
cast_through_char_unrolled:
  movd eax, xmm0
  xor edx, edx
  mov dl, al
  mov dh, ah
  xor ax, ax
  movzx edx, dx
  or eax, edx
  ret

cast_through_char_loop:
  movd eax, xmm0
  ret


-O3 output (flags: "--std=c11 -Wall -Wextra -O3")
=======================================
cast_through_char_unrolled:
  movd eax, xmm0
  ret

cast_through_char_loop:
  movd eax, xmm0
  ret

Godbolt example:
================
https://gcc.godbolt.org/z/bn9xq1b56
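
Editorial sketch, not part of the original report: for comparison, the same
byte copy can also be written with memcpy, which is the form compilers are
generally expected to fold into a single register move. The function name
cast_through_memcpy is hypothetical, and no assembly output is claimed for it
here; it is only meant as a third data point to try alongside the two
functions above.

```c
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(uint32_t) == sizeof(float), "");

/* Hypothetical third variant (not in the original report): copy the
 * object representation of f into u with a fixed-size memcpy. */
uint32_t cast_through_memcpy(float f) {
  uint32_t u;
  memcpy(&u, &f, sizeof u);
  return u;
}
```

Like the char-pointer versions, this does not depend on any aliasing
exception beyond what memcpy itself guarantees, so it should return the same
value as both functions in the report.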

Gcc details follow, captured by adding "-v" to my command-line:
===============================================================
Using built-in specs.
COLLECT_GCC=/opt/compiler-explorer/gcc-snapshot/bin/gcc
Target: x86_64-linux-gnu
Configured with: ../gcc-trunk-20220202/configure
--prefix=/opt/compiler-explorer/gcc-build/staging --build=x86_64-linux-gnu
--host=x86_64-linux-gnu --target=x86_64-linux-gnu --disable-bootstrap
--enable-multiarch --with-abi=m64 --with-multilib-list=m32,m64,mx32
--enable-multilib --enable-clocale=gnu --enable-languages=c,c++,fortran,ada,d
--enable-ld=yes --enable-gold=yes --enable-libstdcxx-debug
--enable-libstdcxx-time=yes --enable-linker-build-id --enable-lto
--enable-plugins --enable-threads=posix
--with-pkgversion=Compiler-Explorer-Build-gcc-756eabacfcd767e39eea63257a026f61a4c4e661-binutils-2.36.1
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 12.0.1 20220202 (experimental)
(Compiler-Explorer-Build-gcc-756eabacfcd767e39eea63257a026f61a4c4e661-binutils-2.36.1) 
COLLECT_GCC_OPTIONS='-fdiagnostics-color=always' '-g' '-o' '/app/output.s'
'-masm=intel' '-S' '-std=c11' '-O3' '-Wall' '-Wextra' '-v' '-mtune=generic'
'-march=x86-64' '-dumpdir' '/app/'

/opt/compiler-explorer/gcc-trunk-20220202/bin/../libexec/gcc/x86_64-linux-gnu/12.0.1/cc1
-quiet -v -imultiarch x86_64-linux-gnu -iprefix
/opt/compiler-explorer/gcc-trunk-20220202/bin/../lib/gcc/x86_64-linux-gnu/12.0.1/
<source> -quiet -dumpdir /app/ -dumpbase output.c -dumpbase-ext .c -masm=intel
-mtune=generic -march=x86-64 -g -O3 -Wall -Wextra -std=c11 -version
-fdiagnostics-color=always -o /app/output.s
GNU C11
(Compiler-Explorer-Build-gcc-756eabacfcd767e39eea63257a026f61a4c4e661-binutils-2.36.1)
version 12.0.1 20220202 (experimental) (x86_64-linux-gnu)
        compiled by GNU C version 7.5.0, GMP version 6.2.1, MPFR version 4.1.0,
MPC version 1.2.1, isl version isl-0.24-GMP

GGC heuristics: --param ggc-min-expand=30 --param ggc-min-heapsize=4096
ignoring nonexistent directory
"/opt/compiler-explorer/gcc-trunk-20220202/bin/../lib/gcc/x86_64-linux-gnu/12.0.1/../../../../x86_64-linux-gnu/include"
ignoring duplicate directory
"/opt/compiler-explorer/gcc-trunk-20220202/bin/../lib/gcc/../../lib/gcc/x86_64-linux-gnu/12.0.1/include"
ignoring nonexistent directory "/usr/local/include/x86_64-linux-gnu"
ignoring duplicate directory
"/opt/compiler-explorer/gcc-trunk-20220202/bin/../lib/gcc/../../lib/gcc/x86_64-linux-gnu/12.0.1/include-fixed"
ignoring nonexistent directory
"/opt/compiler-explorer/gcc-trunk-20220202/bin/../lib/gcc/../../lib/gcc/x86_64-linux-gnu/12.0.1/../../../../x86_64-linux-gnu/include"
#include "..." search starts here:
#include <...> search starts here:

/opt/compiler-explorer/gcc-trunk-20220202/bin/../lib/gcc/x86_64-linux-gnu/12.0.1/include

/opt/compiler-explorer/gcc-trunk-20220202/bin/../lib/gcc/x86_64-linux-gnu/12.0.1/include-fixed
 /usr/local/include
 /opt/compiler-explorer/gcc-trunk-20220202/bin/../lib/gcc/../../include
 /usr/include/x86_64-linux-gnu
 /usr/include
End of search list.
GNU C11
(Compiler-Explorer-Build-gcc-756eabacfcd767e39eea63257a026f61a4c4e661-binutils-2.36.1)
version 12.0.1 20220202 (experimental) (x86_64-linux-gnu)
        compiled by GNU C version 7.5.0, GMP version 6.2.1, MPFR version 4.1.0,
MPC version 1.2.1, isl version isl-0.24-GMP

GGC heuristics: --param ggc-min-expand=30 --param ggc-min-heapsize=4096
Compiler executable checksum: 85eab4743b9643508f1adb2d853127bf
COMPILER_PATH=/opt/compiler-explorer/gcc-trunk-20220202/bin/../libexec/gcc/x86_64-linux-gnu/12.0.1/:/opt/compiler-explorer/gcc-trunk-20220202/bin/../libexec/gcc/x86_64-linux-gnu/:/opt/compiler-explorer/gcc-trunk-20220202/bin/../libexec/gcc/:/opt/compiler-explorer/gcc-trunk-20220202/bin/../lib/gcc/x86_64-linux-gnu/12.0.1/../../../../x86_64-linux-gnu/bin/
LIBRARY_PATH=/opt/compiler-explorer/gcc-trunk-20220202/bin/../lib/gcc/x86_64-linux-gnu/12.0.1/:/opt/compiler-explorer/gcc-trunk-20220202/bin/../lib/gcc/x86_64-linux-gnu/:/opt/compiler-explorer/gcc-trunk-20220202/bin/../lib/gcc/:/opt/compiler-explorer/gcc-trunk-20220202/bin/../lib/gcc/x86_64-linux-gnu/12.0.1/../../../../lib64/:/lib/x86_64-linux-gnu/:/lib/../lib64/:/usr/lib/x86_64-linux-gnu/:/usr/lib/../lib64/:/opt/compiler-explorer/gcc-trunk-20220202/bin/../lib/gcc/x86_64-linux-gnu/12.0.1/../../../../x86_64-linux-gnu/lib/:/opt/compiler-explorer/gcc-trunk-20220202/bin/../lib/gcc/x86_64-linux-gnu/12.0.1/../../../:/lib/:/usr/lib/
COLLECT_GCC_OPTIONS='-fdiagnostics-color=always' '-g' '-o' '/app/output.s'
'-masm=intel' '-S' '-std=c11' '-O3' '-Wall' '-Wextra' '-v' '-mtune=generic'
'-march=x86-64' '-dumpdir' '/app/output.'
Compiler returned: 0


* [Bug tree-optimization/104344] Suboptimal -Os code for manually unrolled loop
  2022-02-02 13:59 [Bug tree-optimization/104344] New: Suboptimal -Os code for manually unrolled loop charles.nicholson at gmail dot com
@ 2022-02-02 14:14 ` rguenth at gcc dot gnu.org
  2022-02-02 14:18 ` charles.nicholson at gmail dot com
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-02-02 14:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104344

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hubicka at gcc dot gnu.org,
                   |                            |rguenth at gcc dot gnu.org,
                   |                            |rsandifo at gcc dot gnu.org
           Keywords|                            |missed-optimization
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2022-02-02
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
If you add -fopt-info to the compiler command you can see what happens:

> ./cc1 -quiet t.c -Os -fopt-info
t.c:21:21: optimized: Loop 1 distributed: split to 0 loops and 1 library calls.

We recognized the loop and turned it into a memmove, which is then optimally
expanded.

> ./cc1 -quiet t.c -O3 -fopt-info
t.c:10:10: optimized: basic block part vectorized using 16 byte vectors
t.c:21:21: optimized: Loop 1 distributed: split to 0 loops and 1 library calls.

Here the same happens, but in addition the unrolled statements are
vectorized, producing the same effective result.

Starting with GCC 12, vectorization will also happen at -O2 but not at -Os.
You can add -ftree-vectorize -fvect-cost-model=very-cheap to -Os to mimic
what we do at -O2.
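
Editorial sketch of how one might try this: the unrolled function from the
report is kept in a file of its own, with the two command lines as comments.
The file name t.c is borrowed from the commands above; using the gcc driver
rather than ./cc1, and the -S switch, are assumptions. Whether the unrolled
copy actually collapses with these flags is something to check in the
-fopt-info output, not something this sketch guarantees.

```c
/* t.c: the hand-unrolled copy from the report.
 *
 * Baseline, as in the report:
 *   gcc -std=c11 -Os -fopt-info -S t.c
 * With the vectorizer flags suggested above added to -Os:
 *   gcc -std=c11 -Os -ftree-vectorize -fvect-cost-model=very-cheap \
 *       -fopt-info -S t.c
 */
#include <stdint.h>

_Static_assert(sizeof(uint32_t) == sizeof(float), "");

uint32_t cast_through_char_unrolled(float f) {
  uint32_t u;
  char const *src = (char const *)&f;
  char *dst = (char *)&u;
  *dst++ = *src++;
  *dst++ = *src++;
  *dst++ = *src++;
  *dst++ = *src++;
  return u;
}
```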

Note that vectorization can also increase code size, on x86 for example
because the instructions, even if there end up being fewer of them, have
a larger encoding.

Maybe this is also something to consider for -Oz vs. -Os, and for the general
question of enabling vectorization at -O2 vs. -Os.

The pattern itself could also be recognized by store-merging/bswap detection
(but the interleaving accesses can make the separate analysis of merging stores
and replacing the load difficult).
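
Editorial sketch, for readers unfamiliar with that kind of detection: the
shape below, a value assembled from individual bytes with shifts and ORs, is
to my understanding the sort of pattern such a pass looks for and turns into
a single wide load. The names load_le32 and cast_through_shifts are
hypothetical, and no claim is made about what GCC emits for them at -Os.

```c
#include <stdint.h>

/* Assemble a 32-bit value from four bytes, least significant byte first. */
uint32_t load_le32(const unsigned char *p) {
  return (uint32_t)p[0]
       | ((uint32_t)p[1] << 8)
       | ((uint32_t)p[2] << 16)
       | ((uint32_t)p[3] << 24);
}

/* Hypothetical variant of the reporter's function built on load_le32.
 * It returns the same value as the byte-copy versions only on
 * little-endian targets such as x86-64. */
uint32_t cast_through_shifts(float f) {
  return load_le32((const unsigned char *)&f);
}
```

Unlike the reporter's code, this form has a single combined result rather
than interleaved byte loads and byte stores, which is the difficulty the
paragraph above points out.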


* [Bug tree-optimization/104344] Suboptimal -Os code for manually unrolled loop
  2022-02-02 13:59 [Bug tree-optimization/104344] New: Suboptimal -Os code for manually unrolled loop charles.nicholson at gmail dot com
  2022-02-02 14:14 ` [Bug tree-optimization/104344] " rguenth at gcc dot gnu.org
@ 2022-02-02 14:18 ` charles.nicholson at gmail dot com
  2022-02-02 14:30 ` rguenth at gcc dot gnu.org
  2022-02-02 22:41 ` pinskia at gcc dot gnu.org
  3 siblings, 0 replies; 5+ messages in thread
From: charles.nicholson at gmail dot com @ 2022-02-02 14:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104344

--- Comment #2 from Charles Nicholson <charles.nicholson at gmail dot com> ---
For whatever it's worth, clang does recognize both forms and emit optimal
assembly at -Os:

https://gcc.godbolt.org/z/sehxYb97E

cast_through_char_unrolled: # @cast_through_char_unrolled
  movd eax, xmm0
  ret
cast_through_char_loop: # @cast_through_char_loop
  movd eax, xmm0
  ret


* [Bug tree-optimization/104344] Suboptimal -Os code for manually unrolled loop
  2022-02-02 13:59 [Bug tree-optimization/104344] New: Suboptimal -Os code for manually unrolled loop charles.nicholson at gmail dot com
  2022-02-02 14:14 ` [Bug tree-optimization/104344] " rguenth at gcc dot gnu.org
  2022-02-02 14:18 ` charles.nicholson at gmail dot com
@ 2022-02-02 14:30 ` rguenth at gcc dot gnu.org
  2022-02-02 22:41 ` pinskia at gcc dot gnu.org
  3 siblings, 0 replies; 5+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-02-02 14:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104344

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Charles Nicholson from comment #2)
> For whatever it's worth, clang does recognize both forms and emit optimal
> assembly at -Os:
> 
> https://gcc.godbolt.org/z/sehxYb97E
> 
> cast_through_char_unrolled: # @cast_through_char_unrolled
>   movd eax, xmm0
>   ret
> cast_through_char_loop: # @cast_through_char_loop
>   movd eax, xmm0
>   ret

Yes, even with -fno-vectorize, but I'm not sure whether that disables all of
it; that is, I'm not sure in which optimization phase clang pattern-matches this.


* [Bug tree-optimization/104344] Suboptimal -Os code for manually unrolled loop
  2022-02-02 13:59 [Bug tree-optimization/104344] New: Suboptimal -Os code for manually unrolled loop charles.nicholson at gmail dot com
                   ` (2 preceding siblings ...)
  2022-02-02 14:30 ` rguenth at gcc dot gnu.org
@ 2022-02-02 22:41 ` pinskia at gcc dot gnu.org
  3 siblings, 0 replies; 5+ messages in thread
From: pinskia at gcc dot gnu.org @ 2022-02-02 22:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104344

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement

