public inbox for gcc-bugs@sourceware.org
* [Bug target/111332] New: Using GCC7.3.0 and GCC10.3.0 to compile the same test case, assembler file instructions are different and performance fallback is obvious.
@ 2023-09-08 1:49 d_vampile at 163 dot com
2023-09-08 1:54 ` [Bug target/111332] " d_vampile at 163 dot com
` (8 more replies)
0 siblings, 9 replies; 10+ messages in thread
From: d_vampile at 163 dot com @ 2023-09-08 1:49 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111332
Bug ID: 111332
Summary: Using GCC7.3.0 and GCC10.3.0 to compile the same test
case, assembler file instructions are different and
performance fallback is obvious.
Product: gcc
Version: 10.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: d_vampile at 163 dot com
Target Milestone: ---
Created attachment 55850
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55850&action=edit
test case
Created Attachment
Test platform: x86_64
Compiler Options:
gcc main.c -g -o main -O2 -msse4.2 -mavx2 -fno-inline
Runtime with gcc7.3.0:
$ time ./main_gcc7.3 2000
start to run 2000.
end to run 2000.
real 6m30.461s
user 6m26.587s
sys 0m0.814s
Runtime with gcc10.3.0:
$ time ./main_gcc10.3 2000
start to run 2000.
end to run 2000.
real 7m18.696s
user 7m13.912s
sys 0m1.098s
The program compiled with GCC 10.3.0 runs about 12% longer than the one compiled with GCC 7.3.0 (7m18.696s vs. 6m30.461s wall-clock time).
* [Bug target/111332] Using GCC7.3.0 and GCC10.3.0 to compile the same test case, assembler file instructions are different and performance fallback is obvious.
From: d_vampile at 163 dot com @ 2023-09-08 1:54 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111332
d_vampile <d_vampile at 163 dot com> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |d_vampile at 163 dot com
--- Comment #1 from d_vampile <d_vampile at 163 dot com> ---
Created attachment 55851
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55851&action=edit
Assembly Instruction Differences
This figure shows the assembly instructions. The left one is gcc7.3.0, and the
right one is gcc10.3.0.
* [Bug target/111332] Using GCC7.3.0 and GCC10.3.0 to compile the same test case, assembler file instructions are different and performance fallback is obvious.
From: d_vampile at 163 dot com @ 2023-09-08 1:58 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111332
--- Comment #2 from d_vampile <d_vampile at 163 dot com> ---
The GCC 7.3.0 program uses vmovups and vextracti128 instructions, but the GCC 10.3.0
program only uses vmovups instructions. In addition, the order of the assembly
instructions differs between the two.
* [Bug target/111332] Using GCC7.3.0 and GCC10.3.0 to compile the same test case, assembler file instructions are different and performance fallback is obvious.
From: pinskia at gcc dot gnu.org @ 2023-09-08 2:11 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111332
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
      Known to work|                            |6.4.0
      Known to fail|                            |7.3.0, 7.5.0, 8.5.0, 9.5.0
--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
GCC 11+ produces:
.L3:
vmovdqu (%rsi), %ymm2
vmovdqu 32(%rsi), %ymm1
subq $-128, %rdi
subq $-128, %rsi
vmovdqu -64(%rsi), %ymm0
vmovdqu -32(%rsi), %ymm3
vmovdqu %ymm2, -128(%rdi)
vmovdqu %ymm3, -32(%rdi)
vmovdqu %ymm1, -96(%rdi)
vmovdqu %ymm0, -64(%rdi)
cmpq %rax, %rdi
jne .L3
Which is the best code ...
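For readers without the attachment, the loop above corresponds roughly to the following C. This is only a minimal sketch, not the attached test case: the function name copy128_blocks is hypothetical, the body is modeled on the source lines that appear interleaved in the GCC 7.3.0 disassembly elsewhere in this thread, and the target("avx") attribute is an assumption added so the snippet builds without extra -m flags.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; copies only whole 128-byte blocks and leaves any
 * tail (n % 128 bytes) untouched. */
__attribute__((target("avx")))
void copy128_blocks(uint8_t *dst, const uint8_t *src, size_t n)
{
    while (n >= 128) {
        /* Four unaligned 256-bit loads ... */
        __m256i v0 = _mm256_loadu_si256((const __m256i *)(src + 0));
        __m256i v1 = _mm256_loadu_si256((const __m256i *)(src + 32));
        __m256i v2 = _mm256_loadu_si256((const __m256i *)(src + 64));
        __m256i v3 = _mm256_loadu_si256((const __m256i *)(src + 96));

        /* ... followed by four unaligned 256-bit stores. */
        _mm256_storeu_si256((__m256i *)(dst + 0), v0);
        _mm256_storeu_si256((__m256i *)(dst + 32), v1);
        _mm256_storeu_si256((__m256i *)(dst + 64), v2);
        _mm256_storeu_si256((__m256i *)(dst + 96), v3);

        src += 128;
        dst += 128;
        n -= 128;
    }
}
```

Which instructions the compiler actually emits for this source depends on the GCC version and tuning in effect, which is the subject of this report.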
* [Bug target/111332] Using GCC7.3.0 and GCC10.3.0 to compile the same test case, assembler file instructions are different and performance fallback is obvious.
From: pinskia at gcc dot gnu.org @ 2023-09-08 2:16 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111332
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |FIXED
--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Anyways the tuning was fixed for GCC 11. GCC 10 is no longer supported so
closing as fixed for GCC 11.
* [Bug target/111332] Using GCC7.3.0 and GCC10.3.0 to compile the same test case, assembler file instructions are different and performance fallback is obvious.
From: d_vampile at 163 dot com @ 2023-09-08 2:34 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111332
--- Comment #5 from d_vampile <d_vampile at 163 dot com> ---
According to my analysis, the following two commits (called PR1 and PR2 below)
may cause the problem described above:
PR1: https://github.com/gcc-mirror/gcc/commit/dd9b529f08c3c6064c37234922d298336d78caf7
PR2: https://github.com/gcc-mirror/gcc/commit/e7bf9583fa2a16e9edd5d5347407ad8acc8f9794
After reverting PR1 on GCC 10.3.0, the generated assembly uses vmovups and
vextracti128 again.
After reverting PR2, the order of the assembly instructions matches the code
generated by GCC 7.3.0. However, the purpose of these two commits and the
potential risks of reverting them are still unclear.
* [Bug target/111332] Using GCC7.3.0 and GCC10.3.0 to compile the same test case, assembler file instructions are different and performance fallback is obvious.
From: d_vampile at 163 dot com @ 2023-09-08 2:38 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111332
--- Comment #6 from d_vampile <d_vampile at 163 dot com> ---
GCC 7.3.0 produces:
extern __inline __m256i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm256_loadu_si256 (__m256i_u const *__P)
{
return *__P;
401170: c5 fa 6f 1e vmovdqu (%rsi),%xmm3
dst = (uint8_t *)dst + 128;
401174: 48 83 ef 80 sub $0xffffffffffffff80,%rdi
src = (const uint8_t *)src + 128;
401178: 48 83 ee 80 sub $0xffffffffffffff80,%rsi
40117c: c5 fa 6f 56 a0 vmovdqu -0x60(%rsi),%xmm2
401181: c4 e3 65 38 5e 90 01 vinserti128 $0x1,-0x70(%rsi),%ymm3,%ymm3
401188: c5 fa 6f 4e c0 vmovdqu -0x40(%rsi),%xmm1
40118d: c4 e3 6d 38 56 b0 01 vinserti128 $0x1,-0x50(%rsi),%ymm2,%ymm2
401194: c5 fa 6f 46 e0 vmovdqu -0x20(%rsi),%xmm0
401199: c4 e3 75 38 4e d0 01 vinserti128 $0x1,-0x30(%rsi),%ymm1,%ymm1
4011a0: c4 e3 7d 38 46 f0 01 vinserti128 $0x1,-0x10(%rsi),%ymm0,%ymm0
}
extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm256_storeu_si256 (__m256i_u *__P, __m256i __A)
{
*__P = __A;
4011a7: c5 f8 11 5f 80 vmovups %xmm3,-0x80(%rdi)
4011ac: c4 e3 7d 39 5f 90 01 vextracti128 $0x1,%ymm3,-0x70(%rdi)
4011b3: c5 f8 11 57 a0 vmovups %xmm2,-0x60(%rdi)
4011b8: c4 e3 7d 39 57 b0 01 vextracti128 $0x1,%ymm2,-0x50(%rdi)
4011bf: c5 f8 11 4f c0 vmovups %xmm1,-0x40(%rdi)
4011c4: c4 e3 7d 39 4f d0 01 vextracti128 $0x1,%ymm1,-0x30(%rdi)
4011cb: c5 f8 11 47 e0 vmovups %xmm0,-0x20(%rdi)
4011d0: c4 e3 7d 39 47 f0 01 vextracti128 $0x1,%ymm0,-0x10(%rdi)
while (n >= 128) {
4011d7: 48 39 c7 cmp %rax,%rdi
4011da: 75 94 jne 401170 <rte_mov128blocks+0x20>
4011dc: c5 f8 77 vzeroupper
In terms of runtime, this code is the best.
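The split pattern in this listing (a 128-bit vmovdqu plus vinserti128 for each load, and a 128-bit vmovups plus vextracti128 for each store) can be written out by hand at the intrinsics level. The following is a sketch under stated assumptions, not the attached test case: the names copy32_split and copy128_blocks_split are hypothetical, the target("avx2") attribute is added only so the snippet builds without extra flags, and the compiler remains free to recombine the halves into full 256-bit accesses.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy 32 bytes as two explicit 128-bit halves. */
__attribute__((target("avx2")))
void copy32_split(uint8_t *dst, const uint8_t *src)
{
    /* Load the low half, then insert the high half into lane 1. */
    __m128i lo = _mm_loadu_si128((const __m128i *)src);
    __m128i hi = _mm_loadu_si128((const __m128i *)(src + 16));
    __m256i v = _mm256_inserti128_si256(_mm256_castsi128_si256(lo), hi, 1);

    /* Store the low half, then the extracted high half. */
    _mm_storeu_si128((__m128i *)dst, _mm256_castsi256_si128(v));
    _mm_storeu_si128((__m128i *)(dst + 16), _mm256_extracti128_si256(v, 1));
}

/* Hypothetical name; copies only whole 128-byte blocks. */
__attribute__((target("avx2")))
void copy128_blocks_split(uint8_t *dst, const uint8_t *src, size_t n)
{
    while (n >= 128) {
        for (size_t off = 0; off < 128; off += 32)
            copy32_split(dst + off, src + off);
        src += 128;
        dst += 128;
        n -= 128;
    }
}
```

GCC also documents the options -mavx256-split-unaligned-load and -mavx256-split-unaligned-store, which control whether 256-bit unaligned accesses are split like this. Whether either form is faster on a given core would have to be measured.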
* [Bug target/111332] Using GCC7.3.0 and GCC10.3.0 to compile the same test case, assembler file instructions are different and performance fallback is obvious.
From: d_vampile at 163 dot com @ 2023-09-08 2:40 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111332
--- Comment #7 from d_vampile <d_vampile at 163 dot com> ---
(In reply to Andrew Pinski from comment #3)
> GCC 11+ produces:
> .L3:
> vmovdqu (%rsi), %ymm2
> vmovdqu 32(%rsi), %ymm1
> subq $-128, %rdi
> subq $-128, %rsi
> vmovdqu -64(%rsi), %ymm0
> vmovdqu -32(%rsi), %ymm3
> vmovdqu %ymm2, -128(%rdi)
> vmovdqu %ymm3, -32(%rdi)
> vmovdqu %ymm1, -96(%rdi)
> vmovdqu %ymm0, -64(%rdi)
> cmpq %rax, %rdi
> jne .L3
>
> Which is the best code ...
GCC 7.3.0 produces:
extern __inline __m256i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm256_loadu_si256 (__m256i_u const *__P)
{
return *__P;
401170: c5 fa 6f 1e vmovdqu (%rsi),%xmm3
dst = (uint8_t *)dst + 128;
401174: 48 83 ef 80 sub $0xffffffffffffff80,%rdi
src = (const uint8_t *)src + 128;
401178: 48 83 ee 80 sub $0xffffffffffffff80,%rsi
40117c: c5 fa 6f 56 a0 vmovdqu -0x60(%rsi),%xmm2
401181: c4 e3 65 38 5e 90 01 vinserti128 $0x1,-0x70(%rsi),%ymm3,%ymm3
401188: c5 fa 6f 4e c0 vmovdqu -0x40(%rsi),%xmm1
40118d: c4 e3 6d 38 56 b0 01 vinserti128 $0x1,-0x50(%rsi),%ymm2,%ymm2
401194: c5 fa 6f 46 e0 vmovdqu -0x20(%rsi),%xmm0
401199: c4 e3 75 38 4e d0 01 vinserti128 $0x1,-0x30(%rsi),%ymm1,%ymm1
4011a0: c4 e3 7d 38 46 f0 01 vinserti128 $0x1,-0x10(%rsi),%ymm0,%ymm0
}
extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm256_storeu_si256 (__m256i_u *__P, __m256i __A)
{
*__P = __A;
4011a7: c5 f8 11 5f 80 vmovups %xmm3,-0x80(%rdi)
4011ac: c4 e3 7d 39 5f 90 01 vextracti128 $0x1,%ymm3,-0x70(%rdi)
4011b3: c5 f8 11 57 a0 vmovups %xmm2,-0x60(%rdi)
4011b8: c4 e3 7d 39 57 b0 01 vextracti128 $0x1,%ymm2,-0x50(%rdi)
4011bf: c5 f8 11 4f c0 vmovups %xmm1,-0x40(%rdi)
4011c4: c4 e3 7d 39 4f d0 01 vextracti128 $0x1,%ymm1,-0x30(%rdi)
4011cb: c5 f8 11 47 e0 vmovups %xmm0,-0x20(%rdi)
4011d0: c4 e3 7d 39 47 f0 01 vextracti128 $0x1,%ymm0,-0x10(%rdi)
while (n >= 128) {
4011d7: 48 39 c7 cmp %rax,%rdi
4011da: 75 94 jne 401170 <rte_mov128blocks+0x20>
4011dc: c5 f8 77 vzeroupper
In terms of runtime, this code is the best.
* [Bug target/111332] Using GCC7.3.0 and GCC10.3.0 to compile the same test case, assembler file instructions are different and performance fallback is obvious.
From: pinskia at gcc dot gnu.org @ 2023-09-08 4:00 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111332
--- Comment #8 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to d_vampile from comment #7)
> In terms of runtime, this code is the best.
Depends on the core ....
What does -mtune=native provide for the core which you are running on?
Also what core are you testing with?
* [Bug target/111332] Using GCC7.3.0 and GCC10.3.0 to compile the same test case, assembler file instructions are different and performance fallback is obvious.
From: d_vampile at 163 dot com @ 2023-09-08 4:09 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111332
--- Comment #9 from d_vampile <d_vampile at 163 dot com> ---
(In reply to Andrew Pinski from comment #8)
> (In reply to d_vampile from comment #7)
> > In terms of runtime, this code is the best.
>
> Depends on the core ....
> What does -mtune=native provide for the core which you are running on?
> Also what core are you testing with?
I also tried GCC 11 and GCC 12 with the same compilation options. Neither emits
the vextracti128 instruction, and the resulting program runs longer and performs
worse.
Using -mtune=native does not change the generated assembly, and the test
results are still worse than with GCC 7.
CPU info:
Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz
-mtune=generic