public inbox for gcc-bugs@sourceware.org
From: "hubicka at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org> To: gcc-bugs@gcc.gnu.org Subject: [Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake Date: Sun, 28 May 2023 17:29:08 +0000 [thread overview] Message-ID: <bug-109812-4-6YJRIRJOdH@http.gcc.gnu.org/bugzilla/> (raw) In-Reply-To: <bug-109812-4@http.gcc.gnu.org/bugzilla/> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812 Jan Hubicka <hubicka at gcc dot gnu.org> changed: What |Removed |Added ---------------------------------------------------------------------------- Summary|GraphicsMagick resize is a |GraphicsMagick resize is a |lot slower in GCC 13.1 vs |lot slower in GCC 13.1 vs |Clang 16 |Clang 16 on Intel Raptor | |Lake --- Comment #7 from Jan Hubicka <hubicka at gcc dot gnu.org> --- On zen3 hardware I get GCC: GraphicsMagick 1.3.38: pts/graphics-magick-2.1.0 [Operation: Resizing] Test 1 of 1 Estimated Trial Run Count: 3 Estimated Time To Completion: 4 Minutes [17:00 UTC] Started Run 1 @ 16:57:17 Started Run 2 @ 16:58:22 Started Run 3 @ 16:59:26 Operation: Resizing: 1390 1386 1383 Average: 1386 Iterations Per Minute Deviation: 0.25% clang16: GraphicsMagick 1.3.38: pts/graphics-magick-2.1.0 [Operation: Resizing] Test 1 of 1 Estimated Trial Run Count: 3 Estimated Time To Completion: 4 Minutes [16:54 UTC] Started Run 1 @ 16:51:48 Started Run 2 @ 16:52:52 Started Run 3 @ 16:53:56 Operation: Resizing: 180 180 180 Average: 180 Iterations Per Minute Deviation: 0.00% GCC profile: 52.07% VerticalFilter._omp_fn.0 24.59% HorizontalFilter._omp_fn.0 11.78% ReadCachePixels.isra.0 Clang does not seem to have openmp in it, so to get comparable runs I added OMP_THREAD_LIMIT=1 With this I get: GraphicsMagick 1.3.38: pts/graphics-magick-2.1.0 [Operation: Resizing] Test 1 of 1 Estimated Trial Run Count: 3 Estimated Time To Completion: 4 Minutes [17:17 UTC] Started Run 1 @ 17:14:14 Started Run 2 @ 17:15:18 Started Run 3 @ 17:16:22 Operation: Resizing: 184 186 186 Average: 185 Iterations Per Minute Deviation: 0.62% so GCC build is still bit faster. 
Internal loop of VerticalFilter is:

  0.00 │4a0:┌─→mov    0x8(%rdx),%rax
  1.33 │    │  vmovsd (%rdx),%xmm1
  1.58 │    │  add    $0x10,%rdx
  0.00 │    │  sub    %r13,%rax
  4.77 │    │  imul   %r11,%rax
  1.01 │    │  add    %rcx,%rax
  0.04 │    │  movzbl 0x2(%r15,%rax,4),%r10d
  8.38 │    │  vcvtsi2sd %r10d,%xmm2,%xmm0
  2.44 │    │  movzbl 0x1(%r15,%rax,4),%r10d
  1.55 │    │  movzbl (%r15,%rax,4),%eax
  0.00 │    │  vfmadd231sd %xmm0,%xmm1,%xmm4
 13.91 │    │  vcvtsi2sd %r10d,%xmm2,%xmm0
  1.86 │    │  vfmadd231sd %xmm0,%xmm1,%xmm5
 13.00 │    │  vcvtsi2sd %eax,%xmm2,%xmm0
  2.02 │    │  vfmadd231sd %xmm0,%xmm1,%xmm3
 12.54 │    ├──cmp    %rdx,%rdi
  0.00 │    └──jne    4a0

HorizontalFilter:

  0.01 │520:┌─→mov    0x8(%r8),%rdx
  0.96 │    │  vmovsd (%r8),%xmm1
  1.93 │    │  add    $0x10,%r8
  0.50 │    │  sub    %r15,%rdx
  4.02 │    │  add    %r11,%rdx
  2.26 │    │  movzbl 0x2(%r14,%rdx,4),%ebx
  0.09 │    │  vcvtsi2sd %ebx,%xmm2,%xmm0
 10.10 │    │  movzbl 0x1(%r14,%rdx,4),%ebx
  0.92 │    │  movzbl (%r14,%rdx,4),%edx
  1.84 │    │  vfmadd231sd %xmm0,%xmm1,%xmm4
  6.82 │    │  vcvtsi2sd %ebx,%xmm2,%xmm0
 11.15 │    │  vfmadd231sd %xmm0,%xmm1,%xmm3
 13.81 │    │  vcvtsi2sd %edx,%xmm2,%xmm0
  6.16 │    │  vfmadd231sd %xmm0,%xmm1,%xmm5
  8.61 │    ├──cmp    %rsi,%r8
  1.56 │    └──jne    520

ReadCachePixels:

       │2e0:┌─→mov    (%rbx,%rax,4),%edx
 83.03 │    │  mov    %edx,(%r12,%rax,4)
 12.34 │    │  inc    %rax
  0.02 │    ├──cmp    %rsi,%rax

With Clang I get:

  49.08%  VerticalFilter
  24.66%  HorizontalFilter
  18.41%  ReadCachePixels
   6.75%  SyncCacheViewPixels

  0.00 │1c50:┌─→mov    (%rdx,%rsi,1),%r9
  0.09 │     │  vmovddup -0x8(%rdx,%rsi,1),%xmm3
  0.00 │     │  add    $0x10,%rsi
  0.75 │     │  sub    %rdi,%r9
  0.00 │     │  imul   %rcx,%r9
  1.07 │     │  add    %r11,%r9
  0.81 │     │  movzbl 0x2(%r14,%r9,4),%r10d
  3.73 │     │  movzwl (%r14,%r9,4),%r9d
  0.00 │     │  vcvtsi2sd %r10d,%xmm14,%xmm2
  0.11 │     │  vfmadd231sd %xmm2,%xmm3,%xmm1
  2.57 │     │  vmovd  %r9d,%xmm2
  0.00 │     │  vpmovzxbd %xmm2,%xmm2
  0.95 │     │  vcvtdq2pd %xmm2,%xmm2
  0.74 │     │  vfmadd231pd %xmm2,%xmm3,%xmm0
 11.46 │     ├──cmp    %rsi,%r8

       │1b50:┌─→mov    (%r10,%rdi,1),%rcx
  0.76 │     │  vmovddup -0x8(%r10,%rdi,1),%xmm3
  0.00 │     │  add    $0x10,%rdi
  0.05 │     │  sub    %r8,%rcx
  0.30 │     │  add    %rsi,%rcx
  0.27 │     │  movzbl 0x2(%r14,%rcx,4),%ebp
  0.28 │     │  movzwl (%r14,%rcx,4),%ecx
  4.51 │     │  vcvtsi2sd %ebp,%xmm13,%xmm2
  0.75 │     │  vfmadd231sd %xmm2,%xmm3,%xmm1
  0.99 │     │  vmovd  %ecx,%xmm2
  0.00 │     │  vpmovzxbd %xmm2,%xmm2
  0.29 │     │  vcvtdq2pd %xmm2,%xmm2
  0.27 │     │  vfmadd231pd %xmm2,%xmm3,%xmm0
 12.37 │     ├──cmp    %rdi,%r9
  0.16 │     └──jne    1b50
  0.01 │        test   %r10,%r10
  0.01 │      ↓ jle    28b4
       │        lea    0x0(,%r15,4),%rcx
  0.01 │        mov    0xd8(%rsp),%r10
  0.00 │        lea    (%rcx,%r8,4),%rcx
  0.01 │        lea    (%rcx,%rbp,4),%rcx
  0.01 │        lea    (%rcx,%rdi,4),%rcx
  0.01 │        lea    (%rcx,%rax,4),%rcx
  0.02 │        lea    (%rcx,%rdx,4),%rcx
  0.01 │        lea    (%rcx,%r10,4),%rcx
  0.02 │        mov    0xc8(%rsp),%r10
  0.01 │        lea    (%rcx,%r9,4),%rcx
  0.01 │        lea    (%rcx,%r13,4),%rcx
  0.01 │        lea    (%rcx,%r11,4),%rcx
  0.01 │        lea    (%rcx,%r12,4),%rcx
  0.01 │        lea    (%rcx,%r10,4),%rcx
  0.02 │        mov    0xb8(%rsp),%r10
  0.01 │        lea    (%rcx,%r10,4),%rcx
  0.03 │        mov    0xb0(%rsp),%r10
  0.01 │        lea    (%rcx,%r10,4),%rcx
  0.01 │        mov    0xa8(%rsp),%r10
  0.00 │        lea    (%rcx,%r10,4),%rcx
  0.02 │        mov    0x98(%rsp),%r10
  0.00 │        lea    (%rcx,%rsi,4),%rcx
  0.03 │        lea    (%rcx,%r10,4),%rcx
  0.02 │        mov    0x88(%rsp),%r10
  0.01 │        lea    (%rcx,%r10,4),%rcx
  0.02 │        mov    0xa0(%rsp),%r10
  0.00 │        lea    (%rcx,%r10,4),%rcx
  0.01 │        mov    0x90(%rsp),%r10
  0.00 │        lea    (%rcx,%r10,4),%rcx
  0.02 │        mov    0x58(%rsp),%r10
  0.02 │        lea    (%rcx,%r10,4),%rcx
  0.03 │        mov    0x50(%rsp),%r10
  0.00 │        lea    (%rcx,%r10,4),%rcx
  0.02 │        mov    0x48(%rsp),%r10
  0.01 │        lea    (%rcx,%r10,4),%rcx
  0.02 │        mov    0x40(%rsp),%r10
  0.00 │        lea    (%rcx,%r10,4),%rcx
  0.03 │        mov    0x38(%rsp),%r10
  0.02 │        lea    (%rcx,%r10,4),%rcx
  0.03 │        mov    0x60(%rsp),%r10
  0.01 │        lea    (%rcx,%r10,4),%rcx
  0.02 │        mov    0x68(%rsp),%r10
  0.00 │        lea    (%rcx,%r10,4),%rcx
  0.03 │        mov    0x70(%rsp),%r10
  0.00 │        lea    (%rcx,%r10,4),%rcx
  0.02 │        mov    0x78(%rsp),%r10
  0.03 │        lea    (%rcx,%r10,4),%rcx
  0.03 │        mov    0x80(%rsp),%r10
  0.01 │        lea    (%rcx,%r10,4),%rcx
  0.03 │        add    0x28(%rsp),%rcx
  0.03 │        mov    %rcx,0xf0(%rsp)
  0.00 │        xor    %ecx,%ecx
  0.00 │        xor    %ecx,%ecx
       │2584:   mov    0xf0(%rsp),%r10
  0.01 │        mov    (%r10,%rcx,4),%r10d
  3.58 │        inc    %rcx
  0.03 │        mov    %r10d,(%r14)
  0.02 │        mov    0x30(%rsp),%r10
  0.01 │        add    $0x4,%r14
  0.01 │        mov    (%r10),%r10
  0.06 │        cmp    %r10,%rcx
  0.05 │      ↑ jl     2584

So I suppose with Clang the filter loops are vectorized while the memcpy is
unrolled (in a very odd way).  I guess the vectorization does not help on zen3
but may help on Raptor Lake.
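For reference, the hot loops above correspond roughly to code of the following
shape (a simplified sketch with hypothetical names and layout, not the actual
GraphicsMagick source).  The filter loops load three byte channels of a 4-byte
pixel, convert each to double and accumulate weight * channel; that is the
three vcvtsi2sd + vfmadd231sd pairs in the GCC code, while Clang 16 converts
two of the channels together (vpmovzxbd/vcvtdq2pd) and uses a packed
vfmadd231pd, i.e. it partially vectorizes the accumulation.  ReadCachePixels
boils down to the word-by-word pixel copy in the second function.

/* Simplified sketches -- hypothetical names, not the real source.  */

/* 16-byte contribution entries: vmovsd (%rdx) loads the weight,
   mov 0x8(%rdx) the pixel index, add $0x10 steps to the next entry.  */
typedef struct { double weight; long pixel; } contrib_t;

/* Filter accumulation: one convert + FMA per channel per contribution.  */
void
accumulate_sketch (const unsigned char *pixels, long stride,
                   const contrib_t *contrib, long n, double sums[3])
{
  long i;
  for (i = 0; i < n; i++)
    {
      /* 4 bytes per pixel; channels at offsets 0, 1 and 2 as in the movzbl's */
      const unsigned char *p = pixels + 4 * (contrib[i].pixel * stride);
      sums[0] += contrib[i].weight * (double) p[0];
      sums[1] += contrib[i].weight * (double) p[1];
      sums[2] += contrib[i].weight * (double) p[2];
    }
}

/* ReadCachePixels is essentially a 32-bit word copy; GCC keeps the plain
   mov/mov/inc/cmp loop above, Clang unrolls it behind the long lea chain.  */
void
copy_pixels_sketch (unsigned int *dst, const unsigned int *src, long n)
{
  long i;
  for (i = 0; i < n; i++)
    dst[i] = src[i];
}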
Thread overview: 23+ messages

2023-05-11 14:25 [Bug tree-optimization/109812] New: GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16  aros at gmx dot com
2023-05-11 14:26 ` [Bug tree-optimization/109812] "  aros at gmx dot com
2023-05-11 15:20 ` [Bug target/109812] "  pinskia at gcc dot gnu.org
2023-05-11 15:50 `  aros at gmx dot com
2023-05-12  8:47 `  aros at gmx dot com
2023-05-16 22:43 `  juzhe.zhong at rivai dot ai
2023-05-17  0:08 `  sjames at gcc dot gnu.org
2023-05-28 16:46 `  hubicka at gcc dot gnu.org
2023-05-28 17:29 `  hubicka at gcc dot gnu.org  [this message]
2023-05-28 17:39 ` [Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake  hubicka at gcc dot gnu.org
2023-05-28 18:11 `  hubicka at gcc dot gnu.org
2023-05-28 18:50 `  hubicka at gcc dot gnu.org
2023-05-30  0:05 `  zhangjungcc at gmail dot com
2023-05-31 12:42 `  hubicka at ucw dot cz
2023-05-31 16:11 `  hubicka at gcc dot gnu.org
2023-05-31 16:52 `  jamborm at gcc dot gnu.org
2023-06-01  9:38 `  jamborm at gcc dot gnu.org
2023-06-01 11:19 `  jakub at gcc dot gnu.org
2023-06-01 12:28 `  hubicka at gcc dot gnu.org
2023-06-21  9:46 `  ubizjak at gmail dot com
2023-10-12  4:48 `  cvs-commit at gcc dot gnu.org
2023-11-24 23:38 `  hubicka at gcc dot gnu.org
2023-11-25 10:21 `  liuhongt at gcc dot gnu.org