public inbox for gcc-bugs@sourceware.org
From: "hubicka at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake
Date: Sun, 28 May 2023 17:29:08 +0000
Message-ID: <bug-109812-4-6YJRIRJOdH@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-109812-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|GraphicsMagick resize is a  |GraphicsMagick resize is a
                   |lot slower in GCC 13.1 vs   |lot slower in GCC 13.1 vs
                   |Clang 16                    |Clang 16 on Intel Raptor
                   |                            |Lake

--- Comment #7 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
On zen3 hardware I get the following with GCC:

GraphicsMagick 1.3.38:
    pts/graphics-magick-2.1.0 [Operation: Resizing]
    Test 1 of 1
    Estimated Trial Run Count:    3                     
    Estimated Time To Completion: 4 Minutes [17:00 UTC] 
        Started Run 1 @ 16:57:17
        Started Run 2 @ 16:58:22
        Started Run 3 @ 16:59:26

    Operation: Resizing:
        1390
        1386
        1383

    Average: 1386 Iterations Per Minute
    Deviation: 0.25%

clang16:

GraphicsMagick 1.3.38:
    pts/graphics-magick-2.1.0 [Operation: Resizing]
    Test 1 of 1
    Estimated Trial Run Count:    3
    Estimated Time To Completion: 4 Minutes [16:54 UTC]
        Started Run 1 @ 16:51:48
        Started Run 2 @ 16:52:52
        Started Run 3 @ 16:53:56

    Operation: Resizing:
        180
        180
        180

    Average: 180 Iterations Per Minute
    Deviation: 0.00%


GCC profile:
  52.07%  VerticalFilter._omp_fn.0                                              
  24.59%  HorizontalFilter._omp_fn.0                                            
  11.78%  ReadCachePixels.isra.0                                                

The Clang build does not seem to have OpenMP in it, so to get comparable runs
I set OMP_THREAD_LIMIT=1.

With this, the GCC build gives:
GraphicsMagick 1.3.38:
    pts/graphics-magick-2.1.0 [Operation: Resizing]
    Test 1 of 1
    Estimated Trial Run Count:    3
    Estimated Time To Completion: 4 Minutes [17:17 UTC]
        Started Run 1 @ 17:14:14
        Started Run 2 @ 17:15:18
        Started Run 3 @ 17:16:22

    Operation: Resizing:
        184
        186
        186

    Average: 185 Iterations Per Minute
    Deviation: 0.62%

So the GCC build is still a bit faster (185 vs. 180 iterations per minute).
The inner loop of VerticalFilter is:
  0.00 │4a0:┌─→mov          0x8(%rdx),%rax                                  ▒
  1.33 │    │  vmovsd       (%rdx),%xmm1                                    ▒
  1.58 │    │  add          $0x10,%rdx                                      ▒
  0.00 │    │  sub          %r13,%rax                                       ▒
  4.77 │    │  imul         %r11,%rax                                       ▒
  1.01 │    │  add          %rcx,%rax                                       ▒
  0.04 │    │  movzbl       0x2(%r15,%rax,4),%r10d                          ▒
  8.38 │    │  vcvtsi2sd    %r10d,%xmm2,%xmm0                               ▒
  2.44 │    │  movzbl       0x1(%r15,%rax,4),%r10d                          ◆
  1.55 │    │  movzbl       (%r15,%rax,4),%eax                              ▒
  0.00 │    │  vfmadd231sd  %xmm0,%xmm1,%xmm4                               ▒
 13.91 │    │  vcvtsi2sd    %r10d,%xmm2,%xmm0                               ▒
  1.86 │    │  vfmadd231sd  %xmm0,%xmm1,%xmm5                               ▒
 13.00 │    │  vcvtsi2sd    %eax,%xmm2,%xmm0                                ▒
  2.02 │    │  vfmadd231sd  %xmm0,%xmm1,%xmm3                               ▒
 12.54 │    ├──cmp          %rdx,%rdi                                       ▒
  0.00 │    └──jne          4a0                                             ▒

HorizontalFilter:
  0.01 │520:┌─→mov          0x8(%r8),%rdx                         ▒
  0.96 │    │  vmovsd       (%r8),%xmm1                           ▒
  1.93 │    │  add          $0x10,%r8                             ▒
  0.50 │    │  sub          %r15,%rdx                             ▒
  4.02 │    │  add          %r11,%rdx                             ▒
  2.26 │    │  movzbl       0x2(%r14,%rdx,4),%ebx                 ▒
  0.09 │    │  vcvtsi2sd    %ebx,%xmm2,%xmm0                      ▒
 10.10 │    │  movzbl       0x1(%r14,%rdx,4),%ebx                 ◆
  0.92 │    │  movzbl       (%r14,%rdx,4),%edx                    ▒
  1.84 │    │  vfmadd231sd  %xmm0,%xmm1,%xmm4                     ▒
  6.82 │    │  vcvtsi2sd    %ebx,%xmm2,%xmm0                      ▒
 11.15 │    │  vfmadd231sd  %xmm0,%xmm1,%xmm3                     ▒
 13.81 │    │  vcvtsi2sd    %edx,%xmm2,%xmm0                      ▒
  6.16 │    │  vfmadd231sd  %xmm0,%xmm1,%xmm5                     ▒
  8.61 │    ├──cmp          %rsi,%r8                              ▒
  1.56 │    └──jne          520                                   ▒
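
Read back into C, the two GCC filter loops above appear to do roughly the following: fetch a 16-byte entry holding a weight and a source pixel index, turn the index into an address, then perform one integer-to-double conversion and one multiply-add per channel. This is only a sketch reconstructed from the disassembly. The names `Contribution`, `Pixel` and `accumulate_scalar`, and the struct layouts, are assumptions inferred from the offsets in the listing, not taken from the GraphicsMagick source.

```c
#include <stddef.h>

/* Assumed layout: 16-byte entries, weight at offset 0, pixel index at offset 8
   (matches the vmovsd (%rdx) / mov 0x8(%rdx) / add $0x10 pattern). */
typedef struct {
    double weight;
    long   pixel;
} Contribution;

/* Assumed 4-byte pixel with 8-bit channels; the loops read bytes 0, 1 and 2. */
typedef struct {
    unsigned char c0, c1, c2, c3;
} Pixel;

/* Scalar accumulation: three conversions and three multiply-adds per entry.
   In the VerticalFilter listing the index is scaled by a stride (imul);
   in the HorizontalFilter listing only an offset is added (stride 1 here). */
void accumulate_scalar(const Contribution *contrib, size_t n,
                       const Pixel *src, long start, long stride,
                       long offset, double sum[3])
{
    for (size_t i = 0; i < n; i++) {
        long idx = (contrib[i].pixel - start) * stride + offset;
        double w = contrib[i].weight;

        sum[0] += w * (double)src[idx].c0;   /* vcvtsi2sd + vfmadd231sd */
        sum[1] += w * (double)src[idx].c1;   /* vcvtsi2sd + vfmadd231sd */
        sum[2] += w * (double)src[idx].c2;   /* vcvtsi2sd + vfmadd231sd */
    }
}
```

Each of the three accumulators forms its own dependency chain through the multiply-add, and the samples land mostly on the conversion instructions and the instructions right after them.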

ReadCachePixels:
       │2e0:┌─→mov    (%rbx,%rax,4),%edx                          ▒
 83.03 │    │  mov    %edx,(%r12,%rax,4)                          ▒
 12.34 │    │  inc    %rax                                        ▒
  0.02 │    ├──cmp    %rsi,%rax                                   ▒
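
For reference, the ReadCachePixels loop above moves one 32-bit element per iteration. In C that corresponds to a plain element-wise copy such as the sketch below. The function name `copy_pixels` and the use of `uint32_t` for a pixel are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Element-wise copy of n 32-bit pixels, one load and one store per iteration
   (mov (%rbx,%rax,4),%edx / mov %edx,(%r12,%rax,4) in the listing). */
void copy_pixels(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

For non-overlapping buffers this has the same effect as `memcpy(dst, src, n * sizeof *dst)`.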

With Clang I get:
  49.08% VerticalFilter                                                         
  24.66% HorizontalFilter                                                       
  18.41% ReadCachePixels                                                        
   6.75% SyncCacheViewPixels

VerticalFilter:
  0.00 │1c50:┌─→mov          (%rdx,%rsi,1),%r9                    ▒
  0.09 │     │  vmovddup     -0x8(%rdx,%rsi,1),%xmm3              ▒
  0.00 │     │  add          $0x10,%rsi                           ▒
  0.75 │     │  sub          %rdi,%r9                             ▒
  0.00 │     │  imul         %rcx,%r9                             ▒
  1.07 │     │  add          %r11,%r9                             ▒
  0.81 │     │  movzbl       0x2(%r14,%r9,4),%r10d                ▒
  3.73 │     │  movzwl       (%r14,%r9,4),%r9d                    ▒
  0.00 │     │  vcvtsi2sd    %r10d,%xmm14,%xmm2                   ▒
  0.11 │     │  vfmadd231sd  %xmm2,%xmm3,%xmm1                    ▒
  2.57 │     │  vmovd        %r9d,%xmm2                           ▒
  0.00 │     │  vpmovzxbd    %xmm2,%xmm2                          ▒
  0.95 │     │  vcvtdq2pd    %xmm2,%xmm2                          ▒
  0.74 │     │  vfmadd231pd  %xmm2,%xmm3,%xmm0                    ▒
  11.46 │     ├──cmp          %rsi,%r8                             ▒

HorizontalFilter:
       │1b50:┌─→mov          (%r10,%rdi,1),%rcx                   ▒
  0.76 │     │  vmovddup     -0x8(%r10,%rdi,1),%xmm3              ▒
  0.00 │     │  add          $0x10,%rdi                           ▒
  0.05 │     │  sub          %r8,%rcx                             ▒
  0.30 │     │  add          %rsi,%rcx                            ▒
  0.27 │     │  movzbl       0x2(%r14,%rcx,4),%ebp                ▒
  0.28 │     │  movzwl       (%r14,%rcx,4),%ecx                   ▒
  4.51 │     │  vcvtsi2sd    %ebp,%xmm13,%xmm2                    ▒
  0.75 │     │  vfmadd231sd  %xmm2,%xmm3,%xmm1                    ▒
  0.99 │     │  vmovd        %ecx,%xmm2                           ▒
  0.00 │     │  vpmovzxbd    %xmm2,%xmm2                          ▒
  0.29 │     │  vcvtdq2pd    %xmm2,%xmm2                          ▒
  0.27 │     │  vfmadd231pd  %xmm2,%xmm3,%xmm0                    ▒
 12.37 │     ├──cmp          %rdi,%r9                             ▒
  0.16 │     └──jne          1b50                                 ▒

ReadCachePixels:
  0.01 │        test    %r10,%r10                                 ▒
  0.01 │      ↓ jle     28b4                                      ▒
       │        lea     0x0(,%r15,4),%rcx                         ▒
  0.01 │        mov     0xd8(%rsp),%r10                           ▒
  0.00 │        lea     (%rcx,%r8,4),%rcx                         ▒
  0.01 │        lea     (%rcx,%rbp,4),%rcx                        ▒
  0.01 │        lea     (%rcx,%rdi,4),%rcx                        ▒
  0.01 │        lea     (%rcx,%rax,4),%rcx                        ▒
  0.02 │        lea     (%rcx,%rdx,4),%rcx                        ▒
  0.01 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.02 │        mov     0xc8(%rsp),%r10                           ▒
  0.01 │        lea     (%rcx,%r9,4),%rcx                         ▒
  0.01 │        lea     (%rcx,%r13,4),%rcx                        ▒
  0.01 │        lea     (%rcx,%r11,4),%rcx                        ▒
  0.01 │        lea     (%rcx,%r12,4),%rcx                        ▒
  0.01 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.02 │        mov     0xb8(%rsp),%r10                           ▒
  0.01 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.03 │        mov     0xb0(%rsp),%r10                           ▒
  0.01 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.01 │        mov     0xa8(%rsp),%r10                           ▒
  0.00 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.02 │        mov     0x98(%rsp),%r10                           ▒
  0.00 │        lea     (%rcx,%rsi,4),%rcx                        ▒
  0.03 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.02 │        mov     0x88(%rsp),%r10                           ▒
  0.01 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.02 │        mov     0xa0(%rsp),%r10                           ▒
  0.00 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.01 │        mov     0x90(%rsp),%r10                           ▒
  0.00 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.02 │        mov     0x58(%rsp),%r10                           ▒
  0.02 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.03 │        mov     0x50(%rsp),%r10                           ▒
  0.00 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.02 │        mov     0x48(%rsp),%r10                           ▒
  0.01 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.02 │        mov     0x40(%rsp),%r10                           ▒
  0.00 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.03 │        mov     0x38(%rsp),%r10                           ▒
  0.02 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.03 │        mov     0x60(%rsp),%r10                           ▒
  0.01 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.02 │        mov     0x68(%rsp),%r10                           ▒
  0.00 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.03 │        mov     0x70(%rsp),%r10                           ▒
  0.00 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.02 │        mov     0x78(%rsp),%r10                           ▒
  0.03 │        lea     (%rcx,%r10,4),%rcx                        ◆
  0.03 │        mov     0x80(%rsp),%r10                           ▒
  0.01 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.03 │        add     0x28(%rsp),%rcx                           ▒
  0.03 │        mov     %rcx,0xf0(%rsp)                           ▒
  0.00 │        xor     %ecx,%ecx                                 ▒
       │2584:   mov     0xf0(%rsp),%r10                           ▒
  0.01 │        mov     (%r10,%rcx,4),%r10d                       ▒
  3.58 │        inc     %rcx                                      ▒
  0.03 │        mov     %r10d,(%r14)                              ▒
  0.02 │        mov     0x30(%rsp),%r10                           ▒
  0.01 │        add     $0x4,%r14                                 ▒
  0.01 │        mov     (%r10),%r10                               ▒
  0.06 │        cmp     %r10,%rcx                                 ▒
  0.05 │      ↑ jl      2584                                      ▒

So I suppose the filter loops are vectorized while the memcpy is unrolled (in a
very odd way).  I guess the vectorization does not help on zen3 but may help on
Raptor Lake.
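
As a rough illustration of the difference, the Clang filter loops above load channel bytes 0 and 1 with a single 16-bit load (movzwl), widen and convert them together (vpmovzxbd, vcvtdq2pd) and accumulate both with one packed multiply-add (vfmadd231pd) against a broadcast weight (vmovddup), while byte 2 stays on the scalar path. The sketch below models that 2+1 split in plain C without intrinsics. The function name `accumulate_two_plus_one` and the flat weight/index/pixel arrays are assumptions made for illustration, not the GraphicsMagick data structures.

```c
#include <stddef.h>

/* Models the split seen in the Clang listing: two channels share one
   "packed" accumulator pair, the third channel uses a scalar accumulator.
   pixels holds 4 bytes per pixel; index[i] selects the pixel for weight[i]. */
void accumulate_two_plus_one(const double *weight, const long *index,
                             size_t n, const unsigned char *pixels,
                             double sum[3])
{
    double pair[2] = { sum[0], sum[1] };  /* stands in for the two lanes of xmm0 */
    double third   = sum[2];              /* stands in for scalar xmm1 */

    for (size_t i = 0; i < n; i++) {
        const unsigned char *p = pixels + 4 * index[i];
        /* One 16-bit quantity carrying bytes 0 and 1 (the movzwl load). */
        unsigned two = (unsigned)p[0] | ((unsigned)p[1] << 8);
        double w = weight[i];

        /* Both lanes use the same weight (vmovddup + vfmadd231pd). */
        pair[0] += w * (double)(two & 0xffu);
        pair[1] += w * (double)(two >> 8);
        /* Third channel: scalar conversion and multiply-add. */
        third   += w * (double)p[2];
    }

    sum[0] = pair[0];
    sum[1] = pair[1];
    sum[2] = third;
}
```

The arithmetic is the same as in the scalar form. Only the number of conversion and multiply-add instructions per iteration changes, from three each to two each.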


