public inbox for gcc-bugs@sourceware.org
* [Bug target/111354] New: [7/10/12 regression] The instructions of the DPDK demo program are different and run time increases.
From: d_vampile at 163 dot com @ 2023-09-09  4:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111354

            Bug ID: 111354
           Summary: [7/10/12 regression] The instructions of the DPDK demo
                    program are different and run time increases.
           Product: gcc
           Version: 10.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: d_vampile at 163 dot com
  Target Milestone: ---

Created attachment 55863
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55863&action=edit
test case

Test platform: x86_64
The test platform supports avx2 and sse4.2
Default mtune=generic
Compiler Options:  
gcc main.c -g -o main -O2 -msse4.2 -mavx2 -fno-inline

GCC 7.3.0 produces:
.L3:
        vmovdqu (%rsi), %xmm3
        subq    $-128, %rdi
        subq    $-128, %rsi
        vmovdqu -96(%rsi), %xmm2
        vinserti128     $0x1, -112(%rsi), %ymm3, %ymm3
        vmovdqu -64(%rsi), %xmm1
        vinserti128     $0x1, -80(%rsi), %ymm2, %ymm2
        vmovdqu -32(%rsi), %xmm0
        vinserti128     $0x1, -48(%rsi), %ymm1, %ymm1
        vinserti128     $0x1, -16(%rsi), %ymm0, %ymm0
        vmovups %xmm3, -128(%rdi)
        vextracti128    $0x1, %ymm3, -112(%rdi)
        vmovups %xmm2, -96(%rdi)
        vextracti128    $0x1, %ymm2, -80(%rdi)
        vmovups %xmm1, -64(%rdi)
        vextracti128    $0x1, %ymm1, -48(%rdi)
        vmovups %xmm0, -32(%rdi)
        vextracti128    $0x1, %ymm0, -16(%rdi)
        cmpq    %rax, %rdi
        jne     .L3
        vzeroupper

Runtime with gcc7.3.0:
$ time ./main_gcc7.3 2000
start to run 2000.
end to run 2000.

real    6m30.461s
user    6m26.587s
sys     0m0.814s

GCC 10.3.0 produces:
.L3:
        vmovdqu (%rsi), %xmm4
        vmovdqu 32(%rsi), %xmm5
        subq    $-128, %rdi
        subq    $-128, %rsi
        vmovdqu -64(%rsi), %xmm6
        vmovdqu -32(%rsi), %xmm7
        vinserti128     $0x1, -112(%rsi), %ymm4, %ymm3
        vinserti128     $0x1, -80(%rsi), %ymm5, %ymm2
        vinserti128     $0x1, -48(%rsi), %ymm6, %ymm1
        vinserti128     $0x1, -16(%rsi), %ymm7, %ymm0
        vmovdqu %xmm3, -128(%rdi)
        vextracti128    $0x1, %ymm3, -112(%rdi)
        vextracti128    $0x1, %ymm2, -80(%rdi)
        vmovdqu %xmm2, -96(%rdi)
        vextracti128    $0x1, %ymm1, -48(%rdi)
        vextracti128    $0x1, %ymm0, -16(%rdi)
        vmovdqu %xmm1, -64(%rdi)
        vmovdqu %xmm0, -32(%rdi)
        cmpq    %rax, %rdi
        jne     .L3
        vzeroupper

Runtime with gcc10.3.0:
$ time ./main_gcc10.3 2000
start to run 2000.
end to run 2000.

real    7m18.696s
user    7m13.912s
sys     0m1.098s


GCC 12.3.0 produces:
.L3:
        vmovdqu (%rsi), %ymm2
        vmovdqu 32(%rsi), %ymm1
        subq    $-128, %rdi
        subq    $-128, %rsi
        vmovdqu -64(%rsi), %ymm0
        vmovdqu -32(%rsi), %ymm3
        vmovdqu %ymm2, -128(%rdi)
        vmovdqu %ymm3, -32(%rdi)
        vmovdqu %ymm1, -96(%rdi)
        vmovdqu %ymm0, -64(%rdi)
        cmpq    %rax, %rdi
        jne     .L3
        vzeroupper

Runtime with gcc12.3.0:
$ time ./main_gcc12.3 2000
start to run 2000.
end to run 2000.

real    10m1.303s
user    9m52.527s
sys     0m2.253s

Why are the instructions generated by GCC 12 simpler, yet the run time is
significantly longer in the same test environment with the same compilation
options?

Why do these three versions of GCC generate different instructions?
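
The `time ./main 2000` figures above cover the whole process, and the attached test case is not reproduced in this thread. As a rough sketch only, the copy loop could be timed in isolation along the following lines. The names `copy_fn`, `copy_bytewise`, and `time_copy` are hypothetical and do not come from DPDK or the attachment; `copy_bytewise` is a placeholder to be swapped for the routine under test.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Signature shared by the copy routines being compared (hypothetical). */
typedef void (*copy_fn)(uint8_t *dst, const uint8_t *src, size_t n);

/* Placeholder routine: a plain byte loop standing in for the real
 * function under test. */
static void copy_bytewise(uint8_t *dst, const uint8_t *src, size_t n)
{
        for (size_t i = 0; i < n; i++)
                dst[i] = src[i];
}

/* Call fn `iters` times on `len` bytes and return the CPU seconds spent
 * in the loop, or -1.0 if len is zero or allocation fails. */
static double time_copy(copy_fn fn, size_t len, unsigned long iters)
{
        if (len == 0)
                return -1.0;

        uint8_t *src = calloc(len, 1);
        uint8_t *dst = calloc(len, 1);
        volatile uint8_t sink = 0;  /* discourages dropping the copies */
        double secs = -1.0;

        if (src != NULL && dst != NULL) {
                clock_t t0 = clock();
                for (unsigned long i = 0; i < iters; i++) {
                        src[0] = (uint8_t)i;  /* vary the input each round */
                        fn(dst, src, len);
                        sink ^= dst[0];
                }
                secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
        }
        free(src);
        free(dst);
        return secs;
}
```

A call such as `time_copy(copy_bytewise, 1 << 20, 2000)` returns the elapsed CPU time in seconds; the buffer size and iteration count here are arbitrary and are not taken from the report.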

From: pinskia at gcc dot gnu.org @ 2023-09-09  5:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111354

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
First off, the performance difference is due to micro-arch issues with
unaligned stores of 256 bits.

Also, IIRC, rte_mov128blocks is tuned for copying blocks that are aligned to
at least 32 bytes. But you are better off asking the DPDK forum why they don't
just use memcpy here.
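
For illustration, a memcpy-based counterpart of the block copy might look like the sketch below. The function name `mov128blocks_memcpy` and its return value are assumptions made for this example and are not part of DPDK; it only mirrors the "copy whole 128-byte blocks, leave the tail" behaviour of rte_mov128blocks.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical memcpy-based counterpart of rte_mov128blocks: copies the
 * largest multiple of 128 bytes that fits in n and returns the number of
 * bytes copied. The remaining n % 128 bytes are left to the caller. */
static size_t mov128blocks_memcpy(uint8_t *dst, const uint8_t *src, size_t n)
{
        size_t bulk = n - (n % 128);

        /* memcpy requires that the two regions do not overlap. */
        memcpy(dst, src, bulk);
        return bulk;
}
```

With this shape the choice of load and store instructions is left to the C library's memcpy rather than fixed in the source.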

From: pinskia at gcc dot gnu.org @ 2023-09-09  5:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111354

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
It might also be the case that DPDK is tuned towards Xeons rather than the
i series.

From: d_vampile at 163 dot com @ 2023-09-09  6:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111354

--- Comment #3 from d_vampile <d_vampile at 163 dot com> ---
(In reply to Andrew Pinski from comment #1)
> First off, the performance difference is due to micro-arch issues with
> unaligned stores of 256 bits.
> 
> Also, IIRC, rte_mov128blocks is tuned for copying blocks that are aligned to
> at least 32 bytes. But you are better off asking the DPDK forum why they
> don't just use memcpy here.

The 'movdqu' instruction does not require the memory address to be aligned on
a natural vector-length byte boundary. Why does rte_mov128blocks need to be
aligned to 32 bytes?
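
The unaligned move instructions accept any address, so alignment here is a matter of speed rather than correctness: an access that straddles a cache-line boundary can cost more on some hardware. A small sketch of how one might check a buffer follows. The helper names are hypothetical, and the 64-byte line size is an assumption (typical for x86_64, but not stated in this thread).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache-line size; not taken from the thread. */
#define CACHE_LINE 64u

/* True if p is aligned to `align` bytes (align must be a power of two). */
static bool is_aligned(const void *p, size_t align)
{
        return ((uintptr_t)p & (align - 1)) == 0;
}

/* True if an access of `width` bytes starting at p spans two cache lines. */
static bool crosses_cache_line(const void *p, size_t width)
{
        size_t offset = (uintptr_t)p & (CACHE_LINE - 1);

        return offset + width > CACHE_LINE;
}
```

Under the 64-byte assumption, a 32-byte access at a 32-byte-aligned address never crosses a line, whereas one starting at offset 40 within a line does.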

The test platform is Xeon.

From: rguenth at gcc dot gnu.org @ 2023-09-12 12:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111354

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |INVALID
             Status|UNCONFIRMED                 |RESOLVED
             Target|                            |x86_64-*-*

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
On a Zen4 machine the code produced by GCC 12 (which btw matches what the
source intrinsics do) is faster.  Btw, both src and dst are aligned so that
shouldn't be the issue here.

From: crazylht at gmail dot com @ 2023-09-13  9:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111354

Hongtao.liu <crazylht at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |crazylht at gmail dot com

--- Comment #5 from Hongtao.liu <crazylht at gmail dot com> ---
#include <immintrin.h>  /* AVX intrinsics; build with -mavx2 as in the report */
#include <stddef.h>
#include <stdint.h>

/* Copy n bytes in 128-byte blocks: four unaligned 256-bit loads followed by
 * four unaligned 256-bit stores per iteration. A tail shorter than 128 bytes
 * is not copied. */
void
rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
{
        __m256i ymm0, ymm1, ymm2, ymm3;

        while (n >= 128) {
                /* Load all four 32-byte chunks before storing any of them. */
                ymm0 = _mm256_loadu_si256((const __m256i *)(const void *)
                                          ((const uint8_t *)src + 0 * 32));
                n -= 128;
                ymm1 = _mm256_loadu_si256((const __m256i *)(const void *)
                                          ((const uint8_t *)src + 1 * 32));
                ymm2 = _mm256_loadu_si256((const __m256i *)(const void *)
                                          ((const uint8_t *)src + 2 * 32));
                ymm3 = _mm256_loadu_si256((const __m256i *)(const void *)
                                          ((const uint8_t *)src + 3 * 32));
                src = (const uint8_t *)src + 128;
                _mm256_storeu_si256((__m256i *)(void *)
                                    ((uint8_t *)dst + 0 * 32), ymm0);
                _mm256_storeu_si256((__m256i *)(void *)
                                    ((uint8_t *)dst + 1 * 32), ymm1);
                _mm256_storeu_si256((__m256i *)(void *)
                                    ((uint8_t *)dst + 2 * 32), ymm2);
                _mm256_storeu_si256((__m256i *)(void *)
                                    ((uint8_t *)dst + 3 * 32), ymm3);
                dst = (uint8_t *)dst + 128;
        }
}

I'm curious whether we can distribute the loop above as a memmove? (Of course,
the compiler needs to know that the two arrays don't alias each other.)
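
As a sketch of the aliasing point only: in C, the no-overlap guarantee can be stated with `restrict`. The variant below is hypothetical (the name `mov128blocks_restrict` is not from DPDK) and uses a plain byte loop instead of intrinsics. Whether a given GCC version then replaces the loop with a library call depends on the version and flags, and has not been verified here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variant of the loop above. `restrict` promises the
 * compiler that dst and src do not overlap, which is the precondition
 * for treating the loop as a plain block copy. Like the original, a tail
 * shorter than 128 bytes is not copied. */
static void mov128blocks_restrict(uint8_t *restrict dst,
                                  const uint8_t *restrict src, size_t n)
{
        while (n >= 128) {
                for (size_t i = 0; i < 128; i++)
                        dst[i] = src[i];
                dst += 128;
                src += 128;
                n -= 128;
        }
}
```

Calling this with overlapping buffers would be undefined behaviour, so the promise moves from the compiler's analysis to the caller.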
