* [Bug middle-end/102294] structure assignment slower than memberwise initialization
2021-09-12 21:59 [Bug c/102294] New: structure assignment slower than memberwise initialization bart.vanassche at gmail dot com
@ 2021-09-12 22:17 ` pinskia at gcc dot gnu.org
2021-09-13 1:41 ` bart.vanassche at gmail dot com
` (11 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-09-12 22:17 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Note that the Linux kernel compiles with -mno-sse, where the results are hugely
different from what can be done in userspace, where SSE can provide a speed
boost.
The easiest way to test this is to use -m32 -mno-sse
GCC:
bio_init1 Elapsed time: 2.427399 s
bio_init2 Elapsed time: 1.757616 s
bio_init3 Elapsed time: 1.959703 s
clang:
bio_init1 Elapsed time: 1.409902 s
bio_init2 Elapsed time: 1.796531 s
bio_init3 Elapsed time: 2.016957 s
So I think this might be a wash for the kernel.
* [Bug middle-end/102294] structure assignment slower than memberwise initialization
From: bart.vanassche at gmail dot com @ 2021-09-13 1:41 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294
Bart Van Assche <bart.vanassche at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Attachment #51444|0 |1
is obsolete| |
--- Comment #2 from Bart Van Assche <bart.vanassche at gmail dot com> ---
Created attachment 51445
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51445&action=edit
Test program that illustrates the issue
* [Bug middle-end/102294] structure assignment slower than memberwise initialization
From: bart.vanassche at gmail dot com @ 2021-09-13 1:43 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294
--- Comment #3 from Bart Van Assche <bart.vanassche at gmail dot com> ---
Thanks for the quick feedback. I have modified the test program and added
target("no-sse") to the bio_init[123]() functions. With that change applied the
results are as follows:
$ gcc -O2 -o bio_init bio_init.c && ./bio_init
Elapsed time: 0.965606 s
Elapsed time: 0.529943 s
Elapsed time: 0.734645 s
$ clang -O2 -o bio_init-clang bio_init.c && ./bio_init-clang
Elapsed time: 0.633179 s
Elapsed time: 0.605532 s
Elapsed time: 0.504315 s
It seems like clang still generates significantly better code for bio_init3()
than gcc?
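
For readers without the attachment, the following is a minimal sketch of what applying target("no-sse") to a single function can look like. It is not the attached test program: `struct small_obj`, `small_obj_init()` and the `NO_SSE` macro are hypothetical names chosen for illustration, and the attribute is only applied when building for x86.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the structure being initialized; the real
 * test program is attachment 51445 and is not reproduced here. */
struct small_obj {
    void *owner;
    unsigned long pad[13];
    unsigned short flags;
    int refcount;
};

/* Apply the per-function target attribute only on x86, where it exists. */
#if defined(__x86_64__) || defined(__i386__)
#define NO_SSE __attribute__((target("no-sse")))
#else
#define NO_SSE
#endif

/* Structure assignment from a compound literal: every member that is not
 * named is zero-initialized, so the compiler typically emits a zero fill
 * of the object followed by a few member stores. */
NO_SSE void small_obj_init(struct small_obj *obj, void *owner,
                           unsigned short flags)
{
    assert(obj != NULL);
    *obj = (struct small_obj){
        .owner = owner,
        .flags = flags,
        .refcount = 1,
    };
}
```

With the attribute in place the compiler may not use SSE registers inside that function, which approximates how kernel code is built without passing -mno-sse for the whole translation unit.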
* [Bug middle-end/102294] structure assignment slower than memberwise initialization
From: pinskia at gcc dot gnu.org @ 2021-09-13 1:59 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294
--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
First, atomic_set has a volatile store in it, which messes things up a lot.
* [Bug middle-end/102294] structure assignment slower than memberwise initialization
From: bart.vanassche at gmail dot com @ 2021-09-13 2:16 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294
--- Comment #5 from Bart Van Assche <bart.vanassche at gmail dot com> ---
Please note that bio_init3() does not use atomic_set() but ATOMIC_INIT(). The
definition of ATOMIC_INIT() is as follows:
#define ATOMIC_INIT(v) (atomic_t){.counter = (v)}
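
To make the distinction concrete, here is a small userspace sketch. The `atomic_t` typedef, `atomic_set_sketch()`, `struct holder` and both init functions are simplified stand-ins assumed for illustration; they are not the kernel's definitions and not the code from the attachment. Only the ATOMIC_INIT() macro is taken from the definition quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-in, assumed for illustration only. */
typedef struct { int counter; } atomic_t;

/* As quoted above: a plain compound literal, no volatile access. */
#define ATOMIC_INIT(v) (atomic_t){.counter = (v)}

/* Sketch of an atomic_set()-style helper: the store goes through a
 * volatile-qualified lvalue, so the compiler must emit it as written
 * and cannot merge it with neighbouring stores. */
static inline void atomic_set_sketch(atomic_t *a, int v)
{
    *(volatile int *)&a->counter = v;
}

/* Hypothetical structure with an atomic member between other fields. */
struct holder {
    long before;
    atomic_t cnt;
    long after;
};

/* Memberwise initialization using the volatile store. */
void holder_init_set(struct holder *h)
{
    assert(h != NULL);
    h->before = 0;
    atomic_set_sketch(&h->cnt, 1);
    h->after = 0;
}

/* Structure assignment using ATOMIC_INIT(): no volatile store involved. */
void holder_init_literal(struct holder *h)
{
    assert(h != NULL);
    *h = (struct holder){ .cnt = ATOMIC_INIT(1) };
}
```

Both functions leave the object in the same state; the difference is only in how much freedom the compiler has when choosing the stores.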
* [Bug middle-end/102294] structure assignment slower than memberwise initialization
From: pinskia at gcc dot gnu.org @ 2021-09-13 2:25 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294
--- Comment #6 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
This is literally just measuring memset times of a small structure.
-mtune=intel changes the timings too.
Doing -mstringop-strategy=libcall also changes the timing to the point where
they are about the same as clang.
So this is a target issue and not a middle-end one.
You need to do timings on many more processors to get the -mtune=generic
defaults changed.
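
A micro-benchmark of this kind can be sketched roughly as follows. This is not the attached test program: `struct blob`, its 120-byte size and `time_small_memset()` are assumptions made for illustration, and the sketch uses the standard clock() function (CPU time) instead of whatever timer the attachment uses. The empty asm statement is a GCC/clang compiler barrier that keeps the fill from being optimized away.

```c
#include <assert.h>
#include <string.h>
#include <time.h>

/* Hypothetical small object; the size is an assumption for this sketch. */
struct blob {
    unsigned char bytes[120];
};

/* Zero the object `iterations` times and return the CPU time spent,
 * in seconds, as measured by clock(). */
double time_small_memset(struct blob *b, long iterations)
{
    assert(b != NULL && iterations >= 0);

    clock_t start = clock();
    for (long i = 0; i < iterations; i++) {
        memset(b, 0, sizeof(*b));
        /* Compiler barrier: forces the zero fill to happen each iteration. */
        __asm__ volatile("" : : "r"(b) : "memory");
    }
    clock_t end = clock();

    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Building the same file with different -mtune or -mstringop-strategy values and comparing the returned times on several machines is the kind of data the comment above asks for.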
* [Bug middle-end/102294] memset expansion is sometimes slow for small sizes
From: bart.vanassche at gmail dot com @ 2021-09-13 2:57 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294
--- Comment #7 from Bart Van Assche <bart.vanassche at gmail dot com> ---
Initializing small data structures via structure assignment is a common
approach in the Linux kernel.
This is the code gcc generates with the no-sse option applied:
(gdb) disas bio_init3
Dump of assembler code for function bio_init3:
0x00000000004011b0 <+0>: mov %rdi,%r8
0x00000000004011b3 <+3>: mov $0xf,%ecx
0x00000000004011b8 <+8>: xor %eax,%eax
0x00000000004011ba <+10>: rep stos %rax,%es:(%rdi)
0x00000000004011bd <+13>: movl $0x1,0x20(%r8)
0x00000000004011c5 <+21>: mov %dx,0x62(%r8)
0x00000000004011ca <+26>: movl $0x1,0x64(%r8)
0x00000000004011d2 <+34>: mov %rsi,0x68(%r8)
0x00000000004011d6 <+38>: ret
This is the code clang generates with the no-sse option applied:
(gdb) disas bio_init3
Dump of assembler code for function bio_init3:
0x00000000004012c0 <+0>: movq $0x0,0x18(%rdi)
0x00000000004012c8 <+8>: movq $0x0,0x10(%rdi)
0x00000000004012d0 <+16>: movq $0x0,0x8(%rdi)
0x00000000004012d8 <+24>: movq $0x0,(%rdi)
0x00000000004012df <+31>: movl $0x1,0x20(%rdi)
0x00000000004012e6 <+38>: movq $0x0,0x24(%rdi)
0x00000000004012ee <+46>: movq $0x0,0x2c(%rdi)
0x00000000004012f6 <+54>: movq $0x0,0x34(%rdi)
0x00000000004012fe <+62>: movq $0x0,0x3c(%rdi)
0x0000000000401306 <+70>: movq $0x0,0x44(%rdi)
0x000000000040130e <+78>: movq $0x0,0x4c(%rdi)
0x0000000000401316 <+86>: movq $0x0,0x54(%rdi)
0x000000000040131e <+94>: movq $0x0,0x5a(%rdi)
0x0000000000401326 <+102>: mov %dx,0x62(%rdi)
0x000000000040132a <+106>: movl $0x1,0x64(%rdi)
0x0000000000401331 <+113>: mov %rsi,0x68(%rdi)
0x0000000000401335 <+117>: movq $0x0,0x70(%rdi)
0x000000000040133d <+125>: ret
Is there any x86_64 CPU on which the latter code runs slower than the former?
* [Bug middle-end/102294] memset expansion is sometimes slow for small sizes
From: crazylht at gmail dot com @ 2021-09-13 3:06 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294
Hongtao.liu <crazylht at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |crazylht at gmail dot com
--- Comment #8 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Andrew Pinski from comment #6)
> This is literally just measuring memset times of a small structure.
>
> -mtune=intel changes the timings too.
> Doing -mstringop-strategy=libcall also changes the timing to the point where
> they are about the same as clang.
>
> So this is a target issue and not a middle-end.
>
> You need to do timings on many more processors to have the -mtune=generic
> changed.
Yes, it's related to the stringop strategy. With -mtune=skylake:
gcc -O2 -march=x86-64 test.c -mtune=skylake
Elapsed time: 0.353267 s
Elapsed time: 0.515796 s
Elapsed time: 0.352953 s
gcc -O2 -march=x86-64 test.c
Elapsed time: 0.892582 s
Elapsed time: 0.515735 s
Elapsed time: 0.843342 s
w/ -mtune=skylake, xmm mov is used.
bio_init3:
.LFB30:
.cfi_startproc
pxor %xmm15, %xmm15
movups %xmm15, 96(%rdi)
movups %xmm15, 32(%rdi)
movw %dx, 98(%rdi)
movl $1, 32(%rdi)
movl $1, 100(%rdi)
movq %rsi, 104(%rdi)
movups %xmm15, (%rdi)
movups %xmm15, 16(%rdi)
movups %xmm15, 48(%rdi)
movups %xmm15, 64(%rdi)
movups %xmm15, 80(%rdi)
movq %xmm15, 112(%rdi)
ret
.cfi_endproc
w/ -mtune=generic, rep stosq is used.
bio_init3:
.LFB30:
.cfi_startproc
movq %rdi, %r8
movl $15, %ecx
xorl %eax, %eax
rep stosq
movl $1, 32(%r8)
movw %dx, 98(%r8)
movl $1, 100(%r8)
movq %rsi, 104(%r8)
ret
* [Bug middle-end/102294] memset expansion is sometimes slow for small sizes
From: bart.vanassche at gmail dot com @ 2021-09-13 3:21 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294
--- Comment #9 from Bart Van Assche <bart.vanassche at gmail dot com> ---
Hmm ... isn't movups a floating-point instruction? I want to avoid floating
point instructions since my understanding is that it is not allowed to use
these in kernel code. See e.g.
https://stackoverflow.com/questions/13886338/use-of-floating-point-in-the-linux-kernel.
* [Bug middle-end/102294] memset expansion is sometimes slow for small sizes
From: crazylht at gmail dot com @ 2021-09-13 3:28 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294
--- Comment #10 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Bart Van Assche from comment #9)
> Hmm ... isn't movups a floating-point instruction? I want to avoid floating
> point instructions since my understanding is that it is not allowed to use
> these in kernel code. See e.g.
> https://stackoverflow.com/questions/13886338/use-of-floating-point-in-the-linux-kernel.
Then, as Pinski mentioned in comment #1, -mno-sse is needed. I guess that in the
clang case xmm moves are also used, which likewise shouldn't be available in the
kernel.
* [Bug target/102294] memset expansion is sometimes slow for small sizes
From: pinskia at gcc dot gnu.org @ 2021-09-13 3:40 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Component|middle-end |target
Keywords| |missed-optimization
Target| |x86_64
--- Comment #11 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
With the target("no-sse") attribute, clang turns off SSE but uses a bunch of
64-bit stores for the memset, while GCC uses rep;stos.
I don't know which one is better on which processors, so someone will need to
do timings on that. My bet is that clang is tuned more towards Intel processors
than, say, a generic AMD processor.
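
The two strategies can be written out in C as a sketch for anyone who wants to run such timings. The names `zero_with_word_stores()`, `zero_with_memset()` and `OBJ_BYTES` are hypothetical; the 120-byte size matches the 15 quadwords that the rep stos sequence in comment #7 writes. Note that an optimizing compiler is free to turn either function into the other form, so the generated assembly should be checked rather than assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 15 quadwords, i.e. 120 bytes, as in the disassembly from comment #7. */
enum { OBJ_BYTES = 120, OBJ_WORDS = OBJ_BYTES / 8 };

/* Zero the object with explicit 8-byte stores. memcpy() of a uint64_t is
 * used so that the code stays valid for any alignment of dst; compilers
 * normally lower each call to a single 64-bit store. */
void zero_with_word_stores(void *dst)
{
    assert(dst != NULL);
    unsigned char *p = dst;
    const uint64_t zero = 0;

    for (int i = 0; i < OBJ_WORDS; i++)
        memcpy(p + 8 * i, &zero, sizeof zero);
}

/* Zero the object with memset() and leave the expansion strategy
 * (inline stores, rep stos, or a library call) to the compiler. */
void zero_with_memset(void *dst)
{
    assert(dst != NULL);
    memset(dst, 0, OBJ_BYTES);
}
```

Timing both functions in a loop on a range of Intel and AMD processors, while confirming in the disassembly which instruction sequence was actually emitted, would provide the data requested above.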
* [Bug target/102294] memset expansion is sometimes slow for small sizes
From: hjl.tools at gmail dot com @ 2021-09-13 13:37 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294
H.J. Lu <hjl.tools at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |hjl.tools at gmail dot com
--- Comment #12 from H.J. Lu <hjl.tools at gmail dot com> ---
Please try
https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577889.html
On Intel i7-8559U, I got
[hjl@gnu-cfl-2 gcc]$ ./xgcc -B./ -O2 -o gcc-sse /tmp/x.c
[hjl@gnu-cfl-2 gcc]$ ./gcc-sse
Elapsed time: 0.534094 s
Elapsed time: 0.502171 s
Elapsed time: 0.454463 s
[hjl@gnu-cfl-2 gcc]$ clang-12 -O2 -o clang-sse /tmp/x.c
[hjl@gnu-cfl-2 gcc]$ ./clang-sse
Elapsed time: 0.608078 s
Elapsed time: 0.575241 s
Elapsed time: 0.454897 s
[hjl@gnu-cfl-2 gcc]$
* [Bug target/102294] memset expansion is sometimes slow for small sizes
From: bart.vanassche at gmail dot com @ 2021-09-14 2:04 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294
--- Comment #13 from Bart Van Assche <bart.vanassche at gmail dot com> ---
Hi H.J. Lu, thank you for having taken a look. I would like to try your patch.
However, I'm not a gcc developer so I don't have a gcc tree checked out on my
development workstation. It may take some time before I can test the patch that
you shared.