From: "hubicka at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake
Date: Sun, 28 May 2023 17:29:08 +0000
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 13.1.1
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P3
X-Bugzilla-Changed-Fields: short_desc

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

Jan Hubicka changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|GraphicsMagick resize is a  |GraphicsMagick resize is a
                   |lot slower in GCC 13.1 vs   |lot slower in GCC 13.1 vs
                   |Clang 16                    |Clang 16 on Intel Raptor
                   |                            |Lake

--- Comment #7 from Jan Hubicka ---
On zen3 hardware I get

GCC:

GraphicsMagick 1.3.38:
    pts/graphics-magick-2.1.0 [Operation: Resizing]
    Test 1 of 1
    Estimated Trial Run Count:    3
    Estimated Time To Completion: 4 Minutes [17:00 UTC]
        Started Run 1 @ 16:57:17
        Started Run 2 @ 16:58:22
        Started Run 3 @ 16:59:26

    Operation: Resizing:
        1390
        1386
        1383

    Average: 1386 Iterations Per Minute
    Deviation: 0.25%

clang16:

GraphicsMagick 1.3.38:
    pts/graphics-magick-2.1.0 [Operation: Resizing]
    Test 1 of 1
    Estimated Trial Run Count:    3
    Estimated Time To Completion: 4 Minutes [16:54 UTC]
        Started Run 1 @ 16:51:48
        Started Run 2 @ 16:52:52
        Started Run 3 @ 16:53:56

    Operation: Resizing:
        180
        180
        180

    Average: 180 Iterations Per Minute
    Deviation: 0.00%

GCC profile:

  52.07%  VerticalFilter._omp_fn.0
  24.59%  HorizontalFilter._omp_fn.0
  11.78%  ReadCachePixels.isra.0

The Clang build does not seem to have OpenMP enabled, so to get comparable
runs I added
OMP_THREAD_LIMIT=1

With this I get:

GraphicsMagick 1.3.38:
    pts/graphics-magick-2.1.0 [Operation: Resizing]
    Test 1 of 1
    Estimated Trial Run Count:    3
    Estimated Time To Completion: 4 Minutes [17:17 UTC]
        Started Run 1 @ 17:14:14
        Started Run 2 @ 17:15:18
        Started Run 3 @ 17:16:22

    Operation: Resizing:
        184
        186
        186

    Average: 185 Iterations Per Minute
    Deviation: 0.62%

so the GCC build is still a bit faster.

The inner loop of VerticalFilter is:

  0.00 │4a0:┌─→mov         0x8(%rdx),%rax
  1.33 │    │  vmovsd      (%rdx),%xmm1
  1.58 │    │  add         $0x10,%rdx
  0.00 │    │  sub         %r13,%rax
  4.77 │    │  imul        %r11,%rax
  1.01 │    │  add         %rcx,%rax
  0.04 │    │  movzbl      0x2(%r15,%rax,4),%r10d
  8.38 │    │  vcvtsi2sd   %r10d,%xmm2,%xmm0
  2.44 │    │  movzbl      0x1(%r15,%rax,4),%r10d
  1.55 │    │  movzbl      (%r15,%rax,4),%eax
  0.00 │    │  vfmadd231sd %xmm0,%xmm1,%xmm4
 13.91 │    │  vcvtsi2sd   %r10d,%xmm2,%xmm0
  1.86 │    │  vfmadd231sd %xmm0,%xmm1,%xmm5
 13.00 │    │  vcvtsi2sd   %eax,%xmm2,%xmm0
  2.02 │    │  vfmadd231sd %xmm0,%xmm1,%xmm3
 12.54 │    ├──cmp         %rdx,%rdi
  0.00 │    └──jne         4a0

HorizontalFilter:

  0.01 │520:┌─→mov         0x8(%r8),%rdx
  0.96 │    │  vmovsd      (%r8),%xmm1
  1.93 │    │  add         $0x10,%r8
  0.50 │    │  sub         %r15,%rdx
  4.02 │    │  add         %r11,%rdx
  2.26 │    │  movzbl      0x2(%r14,%rdx,4),%ebx
  0.09 │    │  vcvtsi2sd   %ebx,%xmm2,%xmm0
 10.10 │    │  movzbl      0x1(%r14,%rdx,4),%ebx
  0.92 │    │  movzbl      (%r14,%rdx,4),%edx
  1.84 │    │  vfmadd231sd %xmm0,%xmm1,%xmm4
  6.82 │    │  vcvtsi2sd   %ebx,%xmm2,%xmm0
 11.15 │    │  vfmadd231sd %xmm0,%xmm1,%xmm3
 13.81 │    │  vcvtsi2sd   %edx,%xmm2,%xmm0
  6.16 │    │  vfmadd231sd %xmm0,%xmm1,%xmm5
  8.61 │    ├──cmp         %rsi,%r8
  1.56 │    └──jne         520

ReadCachePixels:

       │2e0:┌─→mov         (%rbx,%rax,4),%edx
 83.03 │    │  mov         %edx,(%r12,%rax,4)
 12.34 │    │  inc         %rax
  0.02 │    ├──cmp         %rsi,%rax

With Clang I get:

  49.08%  VerticalFilter
  24.66%  HorizontalFilter
  18.41%  ReadCachePixels
   6.75%  SyncCacheViewPixels

  0.00 │1c50:┌─→mov         (%rdx,%rsi,1),%r9
  0.09 │     │  vmovddup    -0x8(%rdx,%rsi,1),%xmm3
  0.00 │     │  add         $0x10,%rsi
  0.75 │     │  sub         %rdi,%r9
  0.00 │     │  imul        %rcx,%r9
  1.07 │     │  add         %r11,%r9
  0.81 │     │  movzbl      0x2(%r14,%r9,4),%r10d
  3.73 │     │  movzwl      (%r14,%r9,4),%r9d
  0.00 │     │  vcvtsi2sd   %r10d,%xmm14,%xmm2
  0.11 │     │  vfmadd231sd %xmm2,%xmm3,%xmm1
  2.57 │     │  vmovd       %r9d,%xmm2
  0.00 │     │  vpmovzxbd   %xmm2,%xmm2
  0.95 │     │  vcvtdq2pd   %xmm2,%xmm2
  0.74 │     │  vfmadd231pd %xmm2,%xmm3,%xmm0
 11.46 │     ├──cmp         %rsi,%r8

       │1b50:┌─→mov         (%r10,%rdi,1),%rcx
  0.76 │     │  vmovddup    -0x8(%r10,%rdi,1),%xmm3
  0.00 │     │  add         $0x10,%rdi
  0.05 │     │  sub         %r8,%rcx
  0.30 │     │  add         %rsi,%rcx
  0.27 │     │  movzbl      0x2(%r14,%rcx,4),%ebp
  0.28 │     │  movzwl      (%r14,%rcx,4),%ecx
  4.51 │     │  vcvtsi2sd   %ebp,%xmm13,%xmm2
  0.75 │     │  vfmadd231sd %xmm2,%xmm3,%xmm1
  0.99 │     │  vmovd       %ecx,%xmm2
  0.00 │     │  vpmovzxbd   %xmm2,%xmm2
  0.29 │     │  vcvtdq2pd   %xmm2,%xmm2
  0.27 │     │  vfmadd231pd %xmm2,%xmm3,%xmm0
 12.37 │     ├──cmp         %rdi,%r9
  0.16 │     └──jne         1b50

  0.01 │        test        %r10,%r10
  0.01 │      ↓ jle         28b4
       │        lea         0x0(,%r15,4),%rcx
  0.01 │        mov         0xd8(%rsp),%r10
  0.00 │        lea         (%rcx,%r8,4),%rcx
  0.01 │        lea         (%rcx,%rbp,4),%rcx
  0.01 │        lea         (%rcx,%rdi,4),%rcx
  0.01 │        lea         (%rcx,%rax,4),%rcx
  0.02 │        lea         (%rcx,%rdx,4),%rcx
  0.01 │        lea         (%rcx,%r10,4),%rcx
  0.02 │        mov         0xc8(%rsp),%r10
  0.01 │        lea         (%rcx,%r9,4),%rcx
  0.01 │        lea         (%rcx,%r13,4),%rcx
  0.01 │        lea         (%rcx,%r11,4),%rcx
  0.01 │        lea         (%rcx,%r12,4),%rcx
  0.01 │        lea         (%rcx,%r10,4),%rcx
  0.02 │        mov         0xb8(%rsp),%r10
  0.01 │        lea         (%rcx,%r10,4),%rcx
  0.03 │        mov         0xb0(%rsp),%r10
  0.01 │        lea         (%rcx,%r10,4),%rcx
  0.01 │        mov         0xa8(%rsp),%r10
  0.00 │        lea         (%rcx,%r10,4),%rcx
  0.02 │        mov         0x98(%rsp),%r10
  0.00 │        lea         (%rcx,%rsi,4),%rcx
  0.03 │        lea         (%rcx,%r10,4),%rcx
  0.02 │        mov         0x88(%rsp),%r10
  0.01 │        lea         (%rcx,%r10,4),%rcx
  0.02 │        mov         0xa0(%rsp),%r10
  0.00 │        lea         (%rcx,%r10,4),%rcx
  0.01 │        mov         0x90(%rsp),%r10
  0.00 │        lea         (%rcx,%r10,4),%rcx
  0.02 │        mov         0x58(%rsp),%r10
  0.02 │        lea         (%rcx,%r10,4),%rcx
  0.03 │        mov         0x50(%rsp),%r10
  0.00 │        lea         (%rcx,%r10,4),%rcx
  0.02 │        mov         0x48(%rsp),%r10
  0.01 │        lea         (%rcx,%r10,4),%rcx
  0.02 │        mov         0x40(%rsp),%r10
  0.00 │        lea         (%rcx,%r10,4),%rcx
  0.03 │        mov         0x38(%rsp),%r10
  0.02 │        lea         (%rcx,%r10,4),%rcx
  0.03 │        mov         0x60(%rsp),%r10
  0.01 │        lea         (%rcx,%r10,4),%rcx
  0.02 │        mov         0x68(%rsp),%r10
  0.00 │        lea         (%rcx,%r10,4),%rcx
  0.03 │        mov         0x70(%rsp),%r10
  0.00 │        lea         (%rcx,%r10,4),%rcx
  0.02 │        mov         0x78(%rsp),%r10
  0.03 │        lea         (%rcx,%r10,4),%rcx
  0.03 │        mov         0x80(%rsp),%r10
  0.01 │        lea         (%rcx,%r10,4),%rcx
  0.03 │        add         0x28(%rsp),%rcx
  0.03 │        mov         %rcx,0xf0(%rsp)
  0.00 │        xor         %ecx,%ecx
  0.00 │        xor         %ecx,%ecx
       │2584:   mov         0xf0(%rsp),%r10
  0.01 │        mov         (%r10,%rcx,4),%r10d
  3.58 │        inc         %rcx
  0.03 │        mov         %r10d,(%r14)
  0.02 │        mov         0x30(%rsp),%r10
  0.01 │        add         $0x4,%r14
  0.01 │        mov         (%r10),%r10
  0.06 │        cmp         %r10,%rcx
  0.05 │      ↑ jl          2584

So I suppose Clang vectorizes the filter loops while memcpy is unrolled (in a
very odd way).
I guess the vectorization does not help on zen3 but may help on Raptor Lake.
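
For readers who want to experiment with the loops discussed above, here is a rough C sketch of source code that has the same shape as the annotated assembly. It is only an illustration under stated assumptions: the type and function names (Pixel, Contribution, Accum, filter_column, copy_pixels) and the field order are made up here and are not taken from the GraphicsMagick sources. The layout is chosen to match what the disassembly shows: a 16-byte contribution record with a double weight at offset 0 and an integer pixel index at offset 8 (assuming an LP64 target), and 4-byte pixels whose first three bytes are read and converted to double.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 8-bit pixel: three colour bytes at offsets 0..2, then opacity.
   The field order is an assumption made for this sketch. */
typedef struct { uint8_t blue, green, red, opacity; } Pixel;

/* Hypothetical filter tap: weight at offset 0, source pixel index at offset 8
   (on an LP64 target), 16 bytes per entry, matching the 0x10 stride above. */
typedef struct { double weight; long pixel; } Contribution;

typedef struct { double red, green, blue; } Accum;

/* Weighted sum of n source pixels. The "stride" multiply corresponds to the
   imul seen in the vertical loop; pass stride = 1 for the horizontal case. */
Accum filter_column(const Pixel *p, const Contribution *c, size_t n,
                    long stride, long x)
{
    assert(p != NULL && c != NULL);
    Accum acc = {0.0, 0.0, 0.0};
    for (size_t i = 0; i < n; i++) {
        long j = (c[i].pixel - c[0].pixel) * stride + x;
        double w = c[i].weight;
        acc.red   += w * p[j].red;    /* byte load, int->double, FMA */
        acc.green += w * p[j].green;
        acc.blue  += w * p[j].blue;
    }
    return acc;
}

/* Plain element-wise pixel copy, the shape of the 4-byte copy loop above. */
void copy_pixels(Pixel *dst, const Pixel *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

Compiling such a sketch with both compilers at the same optimization level and -march setting is one way to check whether the same difference shows up: three scalar vcvtsi2sd/vfmadd231sd chains in one case, versus a 2-byte load widened with vpmovzxbd/vcvtdq2pd and accumulated with vfmadd231pd plus one scalar chain in the other. Whether a reduced loop like this reproduces the exact code generation of the real benchmark is not guaranteed.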