* [Bug c/102294] New: structure assignment slower than memberwise initialization
@ 2021-09-12 21:59 bart.vanassche at gmail dot com
  2021-09-12 22:17 ` [Bug middle-end/102294] " pinskia at gcc dot gnu.org
                   ` (12 more replies)
  0 siblings, 13 replies; 14+ messages in thread
From: bart.vanassche at gmail dot com @ 2021-09-12 21:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294

            Bug ID: 102294
           Summary: structure assignment slower than memberwise
                    initialization
           Product: gcc
           Version: 11.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bart.vanassche at gmail dot com
  Target Milestone: ---

Created attachment 51444
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51444&action=edit
Test program that illustrates the issue

The output of the attached test program is as follows for an Intel Core i7-4790
CPU (3.6 GHz) when compiled with -O2:
$ ~/test/bio_init 
Elapsed time: 0.874763 s
Elapsed time: 0.480335 s
Elapsed time: 0.733273 s

The above output shows that bio_init2() runs faster than bio_init3() and that
bio_init3() runs faster than bio_init1(). bio_init3() uses structure assignment
to initialize struct bio while bio_init2() uses memberwise initialization.
bio_init1() uses memset(). To me it was a big surprise to see that bio_init3()
is slower than bio_init2(). Apparently clang generates better code:

$ clang -O2 -o bio_init-clang bio_init.c
$ ./bio_init-clang 

Elapsed time: 0.446804 s
Elapsed time: 0.455009 s
Elapsed time: 0.407392 s

Can gcc be modified such that bio_init3() runs at least as fast as bio_init2()?

The bio_init[123]() source code comes from the Linux kernel. Optimization level
-O2 has been chosen because that is what the Linux kernel uses.
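
For readers without the attachment, the three initialization styles being compared can be sketched roughly as follows. This is only an illustration: `struct demo_bio`, its members, and the `demo_init[123]()` names are hypothetical stand-ins, not the kernel's struct bio and not the attached test program.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct bio; members are illustrative only. */
struct demo_bio {
    void *next;
    void *bdev;
    unsigned int flags;
    int remaining;
    unsigned short max_vecs;
    int cnt;
    void *io_vec;
};

/* Style 1: zero the whole object with memset(), then set non-zero members. */
void demo_init1(struct demo_bio *bio, void *table, unsigned short max_vecs)
{
    memset(bio, 0, sizeof(*bio));
    bio->remaining = 1;
    bio->cnt = 1;
    bio->io_vec = table;
    bio->max_vecs = max_vecs;
}

/* Style 2: memberwise initialization, every member assigned explicitly. */
void demo_init2(struct demo_bio *bio, void *table, unsigned short max_vecs)
{
    bio->next = NULL;
    bio->bdev = NULL;
    bio->flags = 0;
    bio->remaining = 1;
    bio->max_vecs = max_vecs;
    bio->cnt = 1;
    bio->io_vec = table;
}

/* Style 3: structure assignment from a compound literal; members that are
   not named in the initializer are zero-initialized. */
void demo_init3(struct demo_bio *bio, void *table, unsigned short max_vecs)
{
    *bio = (struct demo_bio){
        .remaining = 1,
        .max_vecs = max_vecs,
        .cnt = 1,
        .io_vec = table,
    };
}
```

All three functions leave the named members in the same state; what differs is how the compiler chooses to emit the zeroing.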


* [Bug middle-end/102294] structure assignment slower than memberwise initialization
  2021-09-12 21:59 [Bug c/102294] New: structure assignment slower than memberwise initialization bart.vanassche at gmail dot com
@ 2021-09-12 22:17 ` pinskia at gcc dot gnu.org
  2021-09-13  1:41 ` bart.vanassche at gmail dot com
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-09-12 22:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Note that the Linux kernel is compiled with -mno-sse, where the results are
hugely different from what can be done in userspace, where SSE can provide a
speed boost.

The easiest way to test this is to use -m32 -mno-sse.

GCC:

bio_init1 Elapsed time: 2.427399 s
bio_init2 Elapsed time: 1.757616 s
bio_init3 Elapsed time: 1.959703 s

clang:
bio_init1 Elapsed time: 1.409902 s
bio_init2 Elapsed time: 1.796531 s
bio_init3 Elapsed time: 2.016957 s


So I think this might be a wash for the kernel.


* [Bug middle-end/102294] structure assignment slower than memberwise initialization
  2021-09-12 21:59 [Bug c/102294] New: structure assignment slower than memberwise initialization bart.vanassche at gmail dot com
  2021-09-12 22:17 ` [Bug middle-end/102294] " pinskia at gcc dot gnu.org
@ 2021-09-13  1:41 ` bart.vanassche at gmail dot com
  2021-09-13  1:43 ` bart.vanassche at gmail dot com
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: bart.vanassche at gmail dot com @ 2021-09-13  1:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294

Bart Van Assche <bart.vanassche at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #51444|0                           |1
        is obsolete|                            |

--- Comment #2 from Bart Van Assche <bart.vanassche at gmail dot com> ---
Created attachment 51445
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51445&action=edit
Test program that illustrates the issue


* [Bug middle-end/102294] structure assignment slower than memberwise initialization
  2021-09-12 21:59 [Bug c/102294] New: structure assignment slower than memberwise initialization bart.vanassche at gmail dot com
  2021-09-12 22:17 ` [Bug middle-end/102294] " pinskia at gcc dot gnu.org
  2021-09-13  1:41 ` bart.vanassche at gmail dot com
@ 2021-09-13  1:43 ` bart.vanassche at gmail dot com
  2021-09-13  1:59 ` pinskia at gcc dot gnu.org
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: bart.vanassche at gmail dot com @ 2021-09-13  1:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294

--- Comment #3 from Bart Van Assche <bart.vanassche at gmail dot com> ---
Thanks for the quick feedback. I have modified the test program and added
target("no-sse") to the bio_init[123]() functions. With that change applied the
results are as follows:

$ gcc -O2 -o bio_init bio_init.c && ./bio_init
Elapsed time: 0.965606 s
Elapsed time: 0.529943 s
Elapsed time: 0.734645 s
$ clang -O2 -o bio_init-clang bio_init.c && ./bio_init-clang
Elapsed time: 0.633179 s
Elapsed time: 0.605532 s
Elapsed time: 0.504315 s

It seems like clang still generates significantly better code for bio_init3()
than gcc?
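
As a rough sketch of what adding target("no-sse") to a single function looks like (the macro `DEMO_NO_SSE`, `struct demo_obj`, and `demo_init()` are hypothetical names, and this assumes an x86 target with a compiler that accepts the GCC-style target attribute):

```c
/* Apply the attribute only on x86, where "no-sse" is meaningful. */
#if defined(__x86_64__) || defined(__i386__)
#define DEMO_NO_SSE __attribute__((target("no-sse")))
#else
#define DEMO_NO_SSE
#endif

/* Hypothetical small structure, not the one from the attachment. */
struct demo_obj {
    long a[8];
    int state;
};

/* With the attribute, the compiler may not use SSE registers to expand
   the zeroing implied by this structure assignment. */
DEMO_NO_SSE void demo_init(struct demo_obj *obj)
{
    *obj = (struct demo_obj){ .state = 1 };
}
```

This restricts only the annotated function, so the rest of the test program can still be built with the default options.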


* [Bug middle-end/102294] structure assignment slower than memberwise initialization
  2021-09-12 21:59 [Bug c/102294] New: structure assignment slower than memberwise initialization bart.vanassche at gmail dot com
                   ` (2 preceding siblings ...)
  2021-09-13  1:43 ` bart.vanassche at gmail dot com
@ 2021-09-13  1:59 ` pinskia at gcc dot gnu.org
  2021-09-13  2:16 ` bart.vanassche at gmail dot com
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-09-13  1:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294

--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
First, atomic_set has a volatile store in it, which messes things up a lot.


* [Bug middle-end/102294] structure assignment slower than memberwise initialization
  2021-09-12 21:59 [Bug c/102294] New: structure assignment slower than memberwise initialization bart.vanassche at gmail dot com
                   ` (3 preceding siblings ...)
  2021-09-13  1:59 ` pinskia at gcc dot gnu.org
@ 2021-09-13  2:16 ` bart.vanassche at gmail dot com
  2021-09-13  2:25 ` pinskia at gcc dot gnu.org
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: bart.vanassche at gmail dot com @ 2021-09-13  2:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294

--- Comment #5 from Bart Van Assche <bart.vanassche at gmail dot com> ---
Please note that bio_init3() does not use atomic_set() but ATOMIC_INIT(). The
definition of ATOMIC_INIT() is as follows:

#define ATOMIC_INIT(v) (atomic_t){.counter = (v)}
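
To make the distinction concrete, here is a minimal sketch under stated assumptions: `atomic_t` is reduced to a single int member, `demo_atomic_set()` is a simplified stand-in that performs a volatile store (as comment #4 describes for atomic_set()), and `struct demo_holder` is hypothetical. None of this is copied from the kernel headers.

```c
#include <string.h>

/* Simplified stand-in for the kernel type. */
typedef struct { int counter; } atomic_t;

#define ATOMIC_INIT(v) (atomic_t){.counter = (v)}

/* Stand-in for atomic_set(): the store goes through a volatile lvalue,
   so the compiler must emit it as a separate access. */
static inline void demo_atomic_set(atomic_t *v, int i)
{
    *(volatile int *)&v->counter = i;
}

/* Hypothetical structure embedding an atomic_t. */
struct demo_holder {
    long before[4];
    atomic_t refs;
    long after[4];
};

/* memset() followed by a volatile store. */
void init_with_set(struct demo_holder *h)
{
    memset(h, 0, sizeof(*h));
    demo_atomic_set(&h->refs, 1);
}

/* Structure assignment using ATOMIC_INIT(): no volatile access involved. */
void init_with_literal(struct demo_holder *h)
{
    *h = (struct demo_holder){ .refs = ATOMIC_INIT(1) };
}
```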


* [Bug middle-end/102294] structure assignment slower than memberwise initialization
  2021-09-12 21:59 [Bug c/102294] New: structure assignment slower than memberwise initialization bart.vanassche at gmail dot com
                   ` (4 preceding siblings ...)
  2021-09-13  2:16 ` bart.vanassche at gmail dot com
@ 2021-09-13  2:25 ` pinskia at gcc dot gnu.org
  2021-09-13  2:57 ` [Bug middle-end/102294] memset expansion is sometimes slow for small sizes bart.vanassche at gmail dot com
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-09-13  2:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294

--- Comment #6 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
This is literally just measuring memset times of a small structure.

-mtune=intel changes the timings too.
Doing -mstringop-strategy=libcall also changes the timing to the point where
they are about the same as clang.

So this is a target issue and not a middle-end one.

Changing the -mtune=generic behavior would require timings on many more
processors.
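
A minimal timing sketch for collecting such numbers might look like the following. It is not the attached test program: `struct small_obj`, `clear_small()`, and `time_calls()` are hypothetical names, the 120-byte size is chosen to match the 15 quadwords zeroed in the rep stos listings posted elsewhere in this thread, and the empty asm statement is a GCC/clang extension used here as a compiler barrier.

```c
#include <string.h>
#include <time.h>

/* Hypothetical small object, 120 bytes. */
struct small_obj {
    unsigned char bytes[120];
};

/* The operation under test: a small fixed-size memset(), which the
   compiler expands according to its stringop strategy. */
void clear_small(struct small_obj *obj)
{
    memset(obj, 0, sizeof(*obj));
}

/* Call fn() repeatedly and return the elapsed CPU time in seconds. */
double time_calls(void (*fn)(struct small_obj *), long iterations)
{
    struct small_obj obj;
    clock_t start = clock();

    for (long i = 0; i < iterations; i++) {
        fn(&obj);
        /* Compiler barrier so the stores are not optimized away. */
        __asm__ volatile("" : : "r"(&obj) : "memory");
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

A caller could print the result with `printf("Elapsed time: %f s\n", time_calls(clear_small, iterations))` and rebuild with different options, for example -mtune=intel or -mstringop-strategy=libcall, to compare the timings.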


* [Bug middle-end/102294] memset expansion is sometimes slow for small sizes
  2021-09-12 21:59 [Bug c/102294] New: structure assignment slower than memberwise initialization bart.vanassche at gmail dot com
                   ` (5 preceding siblings ...)
  2021-09-13  2:25 ` pinskia at gcc dot gnu.org
@ 2021-09-13  2:57 ` bart.vanassche at gmail dot com
  2021-09-13  3:06 ` crazylht at gmail dot com
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: bart.vanassche at gmail dot com @ 2021-09-13  2:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294

--- Comment #7 from Bart Van Assche <bart.vanassche at gmail dot com> ---
Initializing small data structures via structure assignment is a common
approach in the Linux kernel.

This is the code gcc generates with the no-sse option applied:

(gdb) disas bio_init3
Dump of assembler code for function bio_init3:
   0x00000000004011b0 <+0>:     mov    %rdi,%r8
   0x00000000004011b3 <+3>:     mov    $0xf,%ecx
   0x00000000004011b8 <+8>:     xor    %eax,%eax
   0x00000000004011ba <+10>:    rep stos %rax,%es:(%rdi)
   0x00000000004011bd <+13>:    movl   $0x1,0x20(%r8)
   0x00000000004011c5 <+21>:    mov    %dx,0x62(%r8)
   0x00000000004011ca <+26>:    movl   $0x1,0x64(%r8)
   0x00000000004011d2 <+34>:    mov    %rsi,0x68(%r8)
   0x00000000004011d6 <+38>:    ret    

This is the code clang generates with the no-sse option applied:

(gdb) disas bio_init3
Dump of assembler code for function bio_init3:
   0x00000000004012c0 <+0>:     movq   $0x0,0x18(%rdi)
   0x00000000004012c8 <+8>:     movq   $0x0,0x10(%rdi)
   0x00000000004012d0 <+16>:    movq   $0x0,0x8(%rdi)
   0x00000000004012d8 <+24>:    movq   $0x0,(%rdi)
   0x00000000004012df <+31>:    movl   $0x1,0x20(%rdi)
   0x00000000004012e6 <+38>:    movq   $0x0,0x24(%rdi)
   0x00000000004012ee <+46>:    movq   $0x0,0x2c(%rdi)
   0x00000000004012f6 <+54>:    movq   $0x0,0x34(%rdi)
   0x00000000004012fe <+62>:    movq   $0x0,0x3c(%rdi)
   0x0000000000401306 <+70>:    movq   $0x0,0x44(%rdi)
   0x000000000040130e <+78>:    movq   $0x0,0x4c(%rdi)
   0x0000000000401316 <+86>:    movq   $0x0,0x54(%rdi)
   0x000000000040131e <+94>:    movq   $0x0,0x5a(%rdi)
   0x0000000000401326 <+102>:   mov    %dx,0x62(%rdi)
   0x000000000040132a <+106>:   movl   $0x1,0x64(%rdi)
   0x0000000000401331 <+113>:   mov    %rsi,0x68(%rdi)
   0x0000000000401335 <+117>:   movq   $0x0,0x70(%rdi)
   0x000000000040133d <+125>:   ret    

Is there any x86_64 CPU on which the latter code runs slower than the former?
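
For reference, both listings perform the same work: clear 120 bytes and store four values at offsets 0x20, 0x62, 0x64 and 0x68. The sketch below restates that in C. The layout is inferred only from the store offsets in the listings; `struct demo_layout`, its member names, and `demo_init_layout()` are hypothetical and do not describe the real struct bio.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout reconstructed from the store offsets; names are made up. */
struct demo_layout {
    unsigned char head[0x20];
    uint32_t      first_one;            /* 0x20: movl $0x1 */
    unsigned char middle[0x62 - 0x24];
    uint16_t      from_dx;              /* 0x62: mov %dx   */
    uint32_t      second_one;           /* 0x64: movl $0x1 */
    uint64_t      from_rsi;             /* 0x68: mov %rsi  */
    unsigned char tail[8];              /* 0x70: last quadword */
};

_Static_assert(offsetof(struct demo_layout, first_one) == 0x20, "offset 0x20");
_Static_assert(offsetof(struct demo_layout, from_dx) == 0x62, "offset 0x62");
_Static_assert(offsetof(struct demo_layout, second_one) == 0x64, "offset 0x64");
_Static_assert(offsetof(struct demo_layout, from_rsi) == 0x68, "offset 0x68");
_Static_assert(sizeof(struct demo_layout) == 15 * 8, "15 quadwords");

void demo_init_layout(struct demo_layout *p, uint64_t rsi, uint16_t dx)
{
    /* gcc listing: rep stos with %ecx = 0xf; clang listing: movq $0x0 stores */
    memset(p, 0, sizeof(*p));
    p->first_one = 1;
    p->from_dx = dx;
    p->second_one = 1;
    p->from_rsi = rsi;
}
```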


* [Bug middle-end/102294] memset expansion is sometimes slow for small sizes
  2021-09-12 21:59 [Bug c/102294] New: structure assignment slower than memberwise initialization bart.vanassche at gmail dot com
                   ` (6 preceding siblings ...)
  2021-09-13  2:57 ` [Bug middle-end/102294] memset expansion is sometimes slow for small sizes bart.vanassche at gmail dot com
@ 2021-09-13  3:06 ` crazylht at gmail dot com
  2021-09-13  3:21 ` bart.vanassche at gmail dot com
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: crazylht at gmail dot com @ 2021-09-13  3:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294

Hongtao.liu <crazylht at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |crazylht at gmail dot com

--- Comment #8 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Andrew Pinski from comment #6)
> This is literally just measuring memset times of a small structure.
> 
> -mtune=intel changes the timings too.
> Doing -mstringop-strategy=libcall also changes the timing to the point where
> they are about the same as clang.
> 
> So this is a target issue and not a middle-end.
> 
> You need to do timings on many more processors to have the -mtune=generic
> changed.

Yes, it's related to the stringop strategy. With -mtune=skylake:

gcc -O2 -march=x86-64 test.c  -mtune=skylake

Elapsed time: 0.353267 s
Elapsed time: 0.515796 s
Elapsed time: 0.352953 s

gcc -O2 -march=x86-64 test.c

Elapsed time: 0.892582 s
Elapsed time: 0.515735 s
Elapsed time: 0.843342 s

With -mtune=skylake, xmm moves are used.

bio_init3:
.LFB30:
        .cfi_startproc
        pxor    %xmm15, %xmm15
        movups  %xmm15, 96(%rdi)
        movups  %xmm15, 32(%rdi)
        movw    %dx, 98(%rdi)
        movl    $1, 32(%rdi)
        movl    $1, 100(%rdi)
        movq    %rsi, 104(%rdi)
        movups  %xmm15, (%rdi)
        movups  %xmm15, 16(%rdi)
        movups  %xmm15, 48(%rdi)
        movups  %xmm15, 64(%rdi)
        movups  %xmm15, 80(%rdi)
        movq    %xmm15, 112(%rdi)
        ret
        .cfi_endproc

With -mtune=generic, rep stosq is used.

bio_init3:
.LFB30:
        .cfi_startproc
        movq    %rdi, %r8
        movl    $15, %ecx
        xorl    %eax, %eax
        rep stosq
        movl    $1, 32(%r8)
        movw    %dx, 98(%r8)
        movl    $1, 100(%r8)
        movq    %rsi, 104(%r8)
        ret


* [Bug middle-end/102294] memset expansion is sometimes slow for small sizes
  2021-09-12 21:59 [Bug c/102294] New: structure assignment slower than memberwise initialization bart.vanassche at gmail dot com
                   ` (7 preceding siblings ...)
  2021-09-13  3:06 ` crazylht at gmail dot com
@ 2021-09-13  3:21 ` bart.vanassche at gmail dot com
  2021-09-13  3:28 ` crazylht at gmail dot com
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: bart.vanassche at gmail dot com @ 2021-09-13  3:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294

--- Comment #9 from Bart Van Assche <bart.vanassche at gmail dot com> ---
Hmm ... isn't movups a floating-point instruction? I want to avoid floating
point instructions since my understanding is that it is not allowed to use
these in kernel code. See e.g.
https://stackoverflow.com/questions/13886338/use-of-floating-point-in-the-linux-kernel.


* [Bug middle-end/102294] memset expansion is sometimes slow for small sizes
  2021-09-12 21:59 [Bug c/102294] New: structure assignment slower than memberwise initialization bart.vanassche at gmail dot com
                   ` (8 preceding siblings ...)
  2021-09-13  3:21 ` bart.vanassche at gmail dot com
@ 2021-09-13  3:28 ` crazylht at gmail dot com
  2021-09-13  3:40 ` [Bug target/102294] " pinskia at gcc dot gnu.org
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: crazylht at gmail dot com @ 2021-09-13  3:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294

--- Comment #10 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Bart Van Assche from comment #9)
> Hmm ... isn't movups a floating-point instruction? I want to avoid floating
> point instructions since my understanding is that it is not allowed to use
> these in kernel code. See e.g.
> https://stackoverflow.com/questions/13886338/use-of-floating-point-in-the-linux-kernel.

Then, as Pinski mentioned in comment #1, -mno-sse is needed. I guess that in
the clang case xmm moves are also used, which likewise shouldn't be available
in the kernel.


* [Bug target/102294] memset expansion is sometimes slow for small sizes
  2021-09-12 21:59 [Bug c/102294] New: structure assignment slower than memberwise initialization bart.vanassche at gmail dot com
                   ` (9 preceding siblings ...)
  2021-09-13  3:28 ` crazylht at gmail dot com
@ 2021-09-13  3:40 ` pinskia at gcc dot gnu.org
  2021-09-13 13:37 ` hjl.tools at gmail dot com
  2021-09-14  2:04 ` bart.vanassche at gmail dot com
  12 siblings, 0 replies; 14+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-09-13  3:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|middle-end                  |target
           Keywords|                            |missed-optimization
             Target|                            |x86_64

--- Comment #11 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
With the target("no-sse") attribute, clang turns off SSE but uses a bunch of
64-bit stores for the memset, while GCC uses rep stos.

I don't know which one is better on which processors, so someone will need to
do timings on that. My bet is that clang is tuned towards Intel processors more
than, say, a generic AMD processor.


* [Bug target/102294] memset expansion is sometimes slow for small sizes
  2021-09-12 21:59 [Bug c/102294] New: structure assignment slower than memberwise initialization bart.vanassche at gmail dot com
                   ` (10 preceding siblings ...)
  2021-09-13  3:40 ` [Bug target/102294] " pinskia at gcc dot gnu.org
@ 2021-09-13 13:37 ` hjl.tools at gmail dot com
  2021-09-14  2:04 ` bart.vanassche at gmail dot com
  12 siblings, 0 replies; 14+ messages in thread
From: hjl.tools at gmail dot com @ 2021-09-13 13:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294

H.J. Lu <hjl.tools at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hjl.tools at gmail dot com

--- Comment #12 from H.J. Lu <hjl.tools at gmail dot com> ---
Please try

https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577889.html

On Intel i7-8559U, I got

[hjl@gnu-cfl-2 gcc]$  ./xgcc -B./ -O2 -o gcc-sse /tmp/x.c 
[hjl@gnu-cfl-2 gcc]$ ./gcc-sse
Elapsed time: 0.534094 s
Elapsed time: 0.502171 s
Elapsed time: 0.454463 s
[hjl@gnu-cfl-2 gcc]$ clang-12 -O2 -o clang-sse /tmp/x.c
[hjl@gnu-cfl-2 gcc]$ ./clang-sse
Elapsed time: 0.608078 s
Elapsed time: 0.575241 s
Elapsed time: 0.454897 s
[hjl@gnu-cfl-2 gcc]$


* [Bug target/102294] memset expansion is sometimes slow for small sizes
  2021-09-12 21:59 [Bug c/102294] New: structure assignment slower than memberwise initialization bart.vanassche at gmail dot com
                   ` (11 preceding siblings ...)
  2021-09-13 13:37 ` hjl.tools at gmail dot com
@ 2021-09-14  2:04 ` bart.vanassche at gmail dot com
  12 siblings, 0 replies; 14+ messages in thread
From: bart.vanassche at gmail dot com @ 2021-09-14  2:04 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294

--- Comment #13 from Bart Van Assche <bart.vanassche at gmail dot com> ---
Hi H.J. Lu, thank you for having taken a look. I would like to try your patch.
However, I'm not a gcc developer so I don't have a gcc tree checked out on my
development workstation. It may take some time before I can test the patch that
you shared.
