[Bug target/111332] New: Using GCC7.3.0 and GCC10.3.0 to compile the same test case, assembler file instructions are different and performance fallback is obvious.


* [Bug target/111332] New: Using GCC7.3.0 and GCC10.3.0 to compile the same test case, assembler file instructions are different and performance fallback is obvious.
@ 2023-09-08  1:49 d_vampile at 163 dot com
  2023-09-08  1:54 ` [Bug target/111332] " d_vampile at 163 dot com
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: d_vampile at 163 dot com @ 2023-09-08  1:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111332

            Bug ID: 111332
           Summary: Using GCC7.3.0 and GCC10.3.0 to compile the same test
                    case, assembler file instructions are different and
                    performance fallback is obvious.
           Product: gcc
           Version: 10.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: d_vampile at 163 dot com
  Target Milestone: ---

Created attachment 55850
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55850&action=edit
test case

Created Attachment

Test platform: x86_64

Compiler Options:  
gcc main.c -g -o main -O2 -msse4.2 -mavx2 -fno-inline

Runtime with gcc7.3.0:
$ time ./main_gcc7.3 2000
start to run 2000.
end to run 2000.

real    6m30.461s
user    6m26.587s
sys     0m0.814s


Runtime with gcc10.3.0:
$ time ./main_gcc10.3 2000
start to run 2000.
end to run 2000.

real    7m18.696s
user    7m13.912s
sys     0m1.098s

Programs compiled with gcc10.3.0 run significantly longer than those compiled
with gcc7.3.0: about 12% more wall-clock time (7m18.7s versus 6m30.5s).

From: d_vampile at 163 dot com @ 2023-09-08  1:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111332

d_vampile <d_vampile at 163 dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |d_vampile at 163 dot com

--- Comment #1 from d_vampile <d_vampile at 163 dot com> ---
Created attachment 55851
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55851&action=edit
Assembly Instruction Differences

This figure shows the assembly instructions. The left one is gcc7.3.0, and the
right one is gcc10.3.0.

From: d_vampile at 163 dot com @ 2023-09-08  1:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111332

--- Comment #2 from d_vampile <d_vampile at 163 dot com> ---
The gcc7.3.0 program uses vmovups and vmovups instructions, but the gcc10.3.0
program uses only vmovups instructions. In addition, the order of the
instructions differs between the two.

From: pinskia at gcc dot gnu.org @ 2023-09-08  2:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111332

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Known to work|                            |6.4.0
      Known to fail|                            |7.3.0, 7.5.0, 8.5.0, 9.5.0

--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
GCC 11+ produces:
.L3:
        vmovdqu (%rsi), %ymm2
        vmovdqu 32(%rsi), %ymm1
        subq    $-128, %rdi
        subq    $-128, %rsi
        vmovdqu -64(%rsi), %ymm0
        vmovdqu -32(%rsi), %ymm3
        vmovdqu %ymm2, -128(%rdi)
        vmovdqu %ymm3, -32(%rdi)
        vmovdqu %ymm1, -96(%rdi)
        vmovdqu %ymm0, -64(%rdi)
        cmpq    %rax, %rdi
        jne     .L3

Which is the best code ...

From: pinskia at gcc dot gnu.org @ 2023-09-08  2:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111332

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |FIXED

--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Anyways the tuning was fixed for GCC 11. GCC 10 is no longer supported so
closing as fixed for GCC 11.

From: d_vampile at 163 dot com @ 2023-09-08  2:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111332

--- Comment #5 from d_vampile <d_vampile at 163 dot com> ---
According to my analysis, the following two PRs may have caused the problems
described above:
PR1: https://github.com/gcc-mirror/gcc/commit/dd9b529f08c3c6064c37234922d298336d78caf7
PR2: https://github.com/gcc-mirror/gcc/commit/e7bf9583fa2a16e9edd5d5347407ad8acc8f9794

I reverted PR1 on gcc10.3.0 and found that the assembly instructions changed to
vmovups and vmovups.

After reverting PR2, the order of the assembly instructions is consistent with
that of the instructions generated by gcc7.3.0. However, the effects of these
two PRs and the potential risks of rolling back the code are still unclear.

From: d_vampile at 163 dot com @ 2023-09-08  2:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111332

--- Comment #6 from d_vampile <d_vampile at 163 dot com> ---
GCC 7.3.0 produces:
extern __inline __m256i __attribute__((__gnu_inline__, __always_inline__,
__artificial__))
_mm256_loadu_si256 (__m256i_u const *__P)
{
  return *__P;
  401170:       c5 fa 6f 1e             vmovdqu (%rsi),%xmm3
                dst = (uint8_t *)dst + 128;
  401174:       48 83 ef 80             sub    $0xffffffffffffff80,%rdi
                src = (const uint8_t *)src + 128;
  401178:       48 83 ee 80             sub    $0xffffffffffffff80,%rsi
  40117c:       c5 fa 6f 56 a0          vmovdqu -0x60(%rsi),%xmm2
  401181:       c4 e3 65 38 5e 90 01    vinserti128 $0x1,-0x70(%rsi),%ymm3,%ymm3
  401188:       c5 fa 6f 4e c0          vmovdqu -0x40(%rsi),%xmm1
  40118d:       c4 e3 6d 38 56 b0 01    vinserti128 $0x1,-0x50(%rsi),%ymm2,%ymm2
  401194:       c5 fa 6f 46 e0          vmovdqu -0x20(%rsi),%xmm0
  401199:       c4 e3 75 38 4e d0 01    vinserti128 $0x1,-0x30(%rsi),%ymm1,%ymm1
  4011a0:       c4 e3 7d 38 46 f0 01    vinserti128 $0x1,-0x10(%rsi),%ymm0,%ymm0
}

extern __inline void __attribute__((__gnu_inline__, __always_inline__,
__artificial__))
_mm256_storeu_si256 (__m256i_u *__P, __m256i __A)
{
  *__P = __A;
  4011a7:       c5 f8 11 5f 80          vmovups %xmm3,-0x80(%rdi)
  4011ac:       c4 e3 7d 39 5f 90 01    vextracti128 $0x1,%ymm3,-0x70(%rdi)
  4011b3:       c5 f8 11 57 a0          vmovups %xmm2,-0x60(%rdi)
  4011b8:       c4 e3 7d 39 57 b0 01    vextracti128 $0x1,%ymm2,-0x50(%rdi)
  4011bf:       c5 f8 11 4f c0          vmovups %xmm1,-0x40(%rdi)
  4011c4:       c4 e3 7d 39 4f d0 01    vextracti128 $0x1,%ymm1,-0x30(%rdi)
  4011cb:       c5 f8 11 47 e0          vmovups %xmm0,-0x20(%rdi)
  4011d0:       c4 e3 7d 39 47 f0 01    vextracti128 $0x1,%ymm0,-0x10(%rdi)
        while (n >= 128) {
  4011d7:       48 39 c7                cmp    %rax,%rdi
  4011da:       75 94                   jne    401170 <rte_mov128blocks+0x20>
  4011dc:       c5 f8 77                vzeroupper

In terms of runtime, this code is the best.
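
The source lines interleaved with the disassembly above (`dst = (uint8_t *)dst + 128`, `while (n >= 128)`, and the inlined `_mm256_loadu_si256` / `_mm256_storeu_si256`) suggest a loop of roughly the following shape. This is only a minimal sketch, not the attached test case: the name `copy_128_blocks`, its signature, and the `target("avx2")` attribute (used so the sketch builds without `-mavx2`) are assumptions.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reconstruction of the hot loop; not the attached test case.
 * Copies n bytes in 128-byte blocks using four unaligned 256-bit loads
 * followed by four unaligned 256-bit stores. A tail shorter than 128 bytes
 * is left untouched. */
__attribute__((target("avx2")))
static void copy_128_blocks(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);

    while (n >= 128) {
        /* Four 32-byte unaligned loads cover one 128-byte block. */
        __m256i y0 = _mm256_loadu_si256((const __m256i *)(src + 0));
        __m256i y1 = _mm256_loadu_si256((const __m256i *)(src + 32));
        __m256i y2 = _mm256_loadu_si256((const __m256i *)(src + 64));
        __m256i y3 = _mm256_loadu_si256((const __m256i *)(src + 96));

        /* Four 32-byte unaligned stores write the block back out. */
        _mm256_storeu_si256((__m256i *)(dst + 0), y0);
        _mm256_storeu_si256((__m256i *)(dst + 32), y1);
        _mm256_storeu_si256((__m256i *)(dst + 64), y2);
        _mm256_storeu_si256((__m256i *)(dst + 96), y3);

        src += 128;
        dst += 128;
        n -= 128;
    }
}
```

Compiling a loop like this with the options from the report and inspecting the generated assembly should show whether a given GCC version keeps the 256-bit accesses whole or splits them into 128-bit halves.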

From: d_vampile at 163 dot com @ 2023-09-08  2:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111332

--- Comment #7 from d_vampile <d_vampile at 163 dot com> ---
(In reply to Andrew Pinski from comment #3)
> GCC 11+ produces:
> .L3:
>         vmovdqu (%rsi), %ymm2
>         vmovdqu 32(%rsi), %ymm1
>         subq    $-128, %rdi
>         subq    $-128, %rsi
>         vmovdqu -64(%rsi), %ymm0
>         vmovdqu -32(%rsi), %ymm3
>         vmovdqu %ymm2, -128(%rdi)
>         vmovdqu %ymm3, -32(%rdi)
>         vmovdqu %ymm1, -96(%rdi)
>         vmovdqu %ymm0, -64(%rdi)
>         cmpq    %rax, %rdi
>         jne     .L3
> 
> Which is the best code ...

GCC 7.3.0 produces:
extern __inline __m256i __attribute__((__gnu_inline__, __always_inline__,
__artificial__))
_mm256_loadu_si256 (__m256i_u const *__P)
{
  return *__P;
  401170:       c5 fa 6f 1e             vmovdqu (%rsi),%xmm3
                dst = (uint8_t *)dst + 128;
  401174:       48 83 ef 80             sub    $0xffffffffffffff80,%rdi
                src = (const uint8_t *)src + 128;
  401178:       48 83 ee 80             sub    $0xffffffffffffff80,%rsi
  40117c:       c5 fa 6f 56 a0          vmovdqu -0x60(%rsi),%xmm2
  401181:       c4 e3 65 38 5e 90 01    vinserti128 $0x1,-0x70(%rsi),%ymm3,%ymm3
  401188:       c5 fa 6f 4e c0          vmovdqu -0x40(%rsi),%xmm1
  40118d:       c4 e3 6d 38 56 b0 01    vinserti128 $0x1,-0x50(%rsi),%ymm2,%ymm2
  401194:       c5 fa 6f 46 e0          vmovdqu -0x20(%rsi),%xmm0
  401199:       c4 e3 75 38 4e d0 01    vinserti128 $0x1,-0x30(%rsi),%ymm1,%ymm1
  4011a0:       c4 e3 7d 38 46 f0 01    vinserti128 $0x1,-0x10(%rsi),%ymm0,%ymm0
}

extern __inline void __attribute__((__gnu_inline__, __always_inline__,
__artificial__))
_mm256_storeu_si256 (__m256i_u *__P, __m256i __A)
{
  *__P = __A;
  4011a7:       c5 f8 11 5f 80          vmovups %xmm3,-0x80(%rdi)
  4011ac:       c4 e3 7d 39 5f 90 01    vextracti128 $0x1,%ymm3,-0x70(%rdi)
  4011b3:       c5 f8 11 57 a0          vmovups %xmm2,-0x60(%rdi)
  4011b8:       c4 e3 7d 39 57 b0 01    vextracti128 $0x1,%ymm2,-0x50(%rdi)
  4011bf:       c5 f8 11 4f c0          vmovups %xmm1,-0x40(%rdi)
  4011c4:       c4 e3 7d 39 4f d0 01    vextracti128 $0x1,%ymm1,-0x30(%rdi)
  4011cb:       c5 f8 11 47 e0          vmovups %xmm0,-0x20(%rdi)
  4011d0:       c4 e3 7d 39 47 f0 01    vextracti128 $0x1,%ymm0,-0x10(%rdi)
        while (n >= 128) {
  4011d7:       48 39 c7                cmp    %rax,%rdi
  4011da:       75 94                   jne    401170 <rte_mov128blocks+0x20>
  4011dc:       c5 f8 77                vzeroupper

In terms of runtime, this code is the best.

From: pinskia at gcc dot gnu.org @ 2023-09-08  4:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111332

--- Comment #8 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to d_vampile from comment #7)
> In terms of runtime, this code is the best.

Depends on the core ....
What does -mtune=native provide for the core which you are running on?
Also what core are you testing with?

From: d_vampile at 163 dot com @ 2023-09-08  4:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111332

--- Comment #9 from d_vampile <d_vampile at 163 dot com> ---
(In reply to Andrew Pinski from comment #8)
> (In reply to d_vampile from comment #7)
> > In terms of runtime, this code is the best.
> 
> Depends on the core ....
> What does -mtune=native provide for the core which you are running on?
> Also what core are you testing with?

I also tried GCC11 and GCC12 with the same compilation options, but the
generated code does not even contain the 'vextracti128' instruction, so the
program runs longer and performs worse.

The assembly instructions do not change when using -mtune=native, and the test
results are still worse than with gcc7.

CPU info:
Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz
-mtune=generic
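
The GCC 7.3.0 pattern quoted earlier in this thread (a 128-bit `vmovdqu` plus `vinserti128` per load, and `vmovups` plus `vextracti128` per store) can also be written out by hand with 128-bit intrinsics, which may help when comparing the two forms on a given core without relying on the compiler's tuning defaults. The following is a minimal sketch under assumptions: the helper names `loadu_256_split`, `storeu_256_split`, and `copy_128_blocks_split` are hypothetical, and the compiler remains free to merge or reorder these accesses, so the generated assembly should still be checked. GCC also documents the x86 options `-mavx256-split-unaligned-load` and `-mavx256-split-unaligned-store`, which ask for the split forms directly.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: load 32 unaligned bytes as two 16-byte halves and
 * combine them into one 256-bit value (low half first, then high half). */
__attribute__((target("avx2")))
static inline __m256i loadu_256_split(const uint8_t *p)
{
    __m128i lo = _mm_loadu_si128((const __m128i *)p);
    __m128i hi = _mm_loadu_si128((const __m128i *)(p + 16));
    return _mm256_inserti128_si256(_mm256_castsi128_si256(lo), hi, 1);
}

/* Hypothetical helper: store a 256-bit value as two unaligned 16-byte halves. */
__attribute__((target("avx2")))
static inline void storeu_256_split(uint8_t *p, __m256i v)
{
    _mm_storeu_si128((__m128i *)p, _mm256_castsi256_si128(v));
    _mm_storeu_si128((__m128i *)(p + 16), _mm256_extracti128_si256(v, 1));
}

/* Copies n bytes in 128-byte blocks using the split helpers above.
 * A tail shorter than 128 bytes is left untouched. */
__attribute__((target("avx2")))
static void copy_128_blocks_split(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);

    while (n >= 128) {
        __m256i y0 = loadu_256_split(src + 0);
        __m256i y1 = loadu_256_split(src + 32);
        __m256i y2 = loadu_256_split(src + 64);
        __m256i y3 = loadu_256_split(src + 96);

        storeu_256_split(dst + 0, y0);
        storeu_256_split(dst + 32, y1);
        storeu_256_split(dst + 64, y2);
        storeu_256_split(dst + 96, y3);

        src += 128;
        dst += 128;
        n -= 128;
    }
}
```

Timing this variant against a plain `_mm256_loadu_si256` / `_mm256_storeu_si256` loop on the same machine would be one way to check whether the split form is what accounts for the difference reported here.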
