From: "ubizjak at gmail dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
Date: Fri, 17 Dec 2021 15:34:28 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #9 from Uroš Bizjak ---
(In reply to Thiago Macieira from comment #0)
> Testcase:
...
> The assembly for this produces:
>
>         vmovdqu16       (%rdi), %ymm1
>         vmovdqu16       32(%rdi), %ymm2
>         vpcmpuw $0, %ymm0, %ymm1, %k0
>         vpcmpuw $0, %ymm0, %ymm2, %k1
>         kmovw   %k0, %edx
>         kmovw   %k1, %eax
>         kortestw        %k1, %k0
>         je      .L10
>
> Those two KMOVW instructions aren't required for the check that follows.
> They're also dispatched on port 0, same as the KORTESTW, meaning the KORTEST
> can't be dispatched until those two have executed, thus introducing a
> 2-cycle delay in this loop.

These are not NOP moves but zero-extensions.

        vmovdqu16       (%rdi), %ymm1   # 93    [c=17 l=6]  movv16hi_internal/2
        vmovdqu16       32(%rdi), %ymm2 # 94    [c=21 l=7]  movv16hi_internal/2
        vpcmpuw $0, %ymm0, %ymm1, %k0   # 21    [c=4 l=7]  avx512vl_ucmpv16hi3
        vpcmpuw $0, %ymm0, %ymm2, %k1   # 27    [c=4 l=7]  avx512vl_ucmpv16hi3
        kmovw   %k0, %edx       # 30    [c=4 l=4]  *zero_extendhisi2/1
        kmovw   %k1, %eax       # 29    [c=4 l=4]  *zero_extendhisi2/1
        kortestw        %k1, %k0        # 31    [c=4 l=4]  kortesthi

since for some reason tree optimizers give us:

  _28 = VIEW_CONVERT_EXPR<__v16hi>(_31);
  _29 = __builtin_ia32_ucmpw256_mask (_28, _20, 0, 65535);
  _26 = VIEW_CONVERT_EXPR<__v16hi>(_30);
  _27 = __builtin_ia32_ucmpw256_mask (_26, _20, 0, 65535);
  _2 = (int) _27;
  _3 = (int) _29;
  _15 = __builtin_ia32_kortestzhi (_3, _2);

> Clang generates:
>
> .LBB0_2:                                # =>This Inner Loop Header: Depth=1
>         vpcmpeqw        (%rdi), %ymm0, %k0
>         vpcmpeqw        32(%rdi), %ymm0, %k1
>         kortestw        %k0, %k1
>         jne     .LBB0_3
>
> ICC inserts one KMOVW, but not the other.
>
> Godbolt build link: https://gcc.godbolt.org/z/cc3heo48M
>
> LLVM-MCA analysis: https://analysis.godbolt.org/z/dGvY1Wj78
> It shows the Clang loop runs on average 2.0 cycles per loop, whereas the GCC
> code is 3 cycles/loop.
>
> LLVM-MCA says the ICC loop with one of the two KMOV also runs at 2.0 cycles
> per loop, because it can run in parallel with the second load, given that
> the loads are ports 2 and 3.
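
To illustrate the zero-extension point in the reply, here is a minimal portable C model. It is only a sketch: it is not the elided testcase and not GCC's implementation. The function names kortestz16 and kortestz16_via_int are hypothetical, and plain uint16_t values stand in for the k-mask registers. It assumes the (int) conversions in the GIMPLE act on an unsigned 16-bit mask, so they are zero-extensions, and that the KORTESTW check only looks at the low 16 bits of each operand.

```c
#include <stdint.h>

/* Model of the zero-flag result of KORTESTW: 1 when the OR of the two
   16-bit masks is zero, 0 otherwise. */
static int kortestz16(uint16_t a, uint16_t b)
{
    return (uint16_t)(a | b) == 0;
}

/* Model of the quoted GIMPLE: each mask is first widened to int
   (assumed to be what the *zero_extendhisi2 KMOVW instructions
   implement), then the 16-bit test is applied to the widened values. */
static int kortestz16_via_int(uint16_t k0, uint16_t k1)
{
    int w0 = (int)k0;   /* zero-extension: uint16_t always fits in int */
    int w1 = (int)k1;
    return kortestz16((uint16_t)w0, (uint16_t)w1);
}
```

In this model both functions return the same value for every pair of masks, because widening and then truncating a 16-bit value is an identity. That matches the original report's observation that the check itself does not need the two KMOVW instructions, while the reply explains why they nevertheless appear: they are real zero-extensions requested by the GIMPLE, not no-op moves.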