From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
 id 6F91A3858428; Fri, 17 Dec 2021 15:34:28 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 6F91A3858428
From: "ubizjak at gmail dot com" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/103750] [i386] GCC schedules KMOV instructions that
 destroys performance in loop
Date: Fri, 17 Dec 2021 15:34:28 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 12.0
X-Bugzilla-Keywords: missed-optimization, ra
X-Bugzilla-Severity: normal
X-Bugzilla-Who: ubizjak at gmail dot com
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-103750-4-VUSJJoHcLl@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-103750-4@http.gcc.gnu.org/bugzilla/>
References: <bug-103750-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: gcc-bugs@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-bugs mailing list <gcc-bugs.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Fri, 17 Dec 2021 15:34:28 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750
--- Comment #9 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Thiago Macieira from comment #0)
> Testcase:
...
> The assembly for this produces:
>
>         vmovdqu16       (%rdi), %ymm1
>         vmovdqu16       32(%rdi), %ymm2
>         vpcmpuw $0, %ymm0, %ymm1, %k0
>         vpcmpuw $0, %ymm0, %ymm2, %k1
>         kmovw   %k0, %edx
>         kmovw   %k1, %eax
>         kortestw        %k1, %k0
>         je      .L10
>
> Those two KMOVW instructions aren't required for the check that follows.
> They're also dispatched on port 0, same as the KORTESTW, meaning the KORTEST
> can't be dispatched until those two have executed, thus introducing a
> 2-cycle delay in this loop.

These are not NOP moves but zero-extensions.

        vmovdqu16       (%rdi), %ymm1   # 93    [c=17 l=6]  movv16hi_internal/2
        vmovdqu16       32(%rdi), %ymm2 # 94    [c=21 l=7]  movv16hi_internal/2
        vpcmpuw $0, %ymm0, %ymm1, %k0   # 21    [c=4 l=7]  avx512vl_ucmpv16hi3
        vpcmpuw $0, %ymm0, %ymm2, %k1   # 27    [c=4 l=7]  avx512vl_ucmpv16hi3
        kmovw   %k0, %edx       # 30    [c=4 l=4]  *zero_extendhisi2/1
        kmovw   %k1, %eax       # 29    [c=4 l=4]  *zero_extendhisi2/1
        kortestw        %k1, %k0        # 31    [c=4 l=4]  kortesthi

since, for some reason, the tree optimizers give us:

  _28 = VIEW_CONVERT_EXPR<__v16hi>(_31);
  _29 = __builtin_ia32_ucmpw256_mask (_28, _20, 0, 65535);
  _26 = VIEW_CONVERT_EXPR<__v16hi>(_30);
  _27 = __builtin_ia32_ucmpw256_mask (_26, _20, 0, 65535);
  _2 = (int) _27;
  _3 = (int) _29;
  _15 = __builtin_ia32_kortestzhi (_3, _2);
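
To illustrate the dump above, here is a minimal portable C model of the
value flow. It is not the AVX-512 code itself, and the helper names
(zext16, kortestz16, kortestz16_via_int) are hypothetical. It only
assumes that KORTESTW looks at the low 16 bits of each mask. The (int)
conversions produce real zero-extended values, which is why they show up
as *zero_extendhisi2, but as far as the zero test is concerned they do
not change the result:

```c
#include <stdint.h>

/* Zero-extension of a 16-bit mask to int, as in "_3 = (int) _29".
   In the assembly this is the kmovw %k, %r32 instruction. */
static int zext16(uint16_t mask)
{
    return (int)mask;            /* upper 16 bits are always zero */
}

/* Model of the ZF result of KORTESTW: 1 when the OR of both
   16-bit masks is zero, 0 otherwise. */
static int kortestz16(uint16_t a, uint16_t b)
{
    return (uint16_t)(a | b) == 0;
}

/* The shape of the GIMPLE above: widen both masks to int first, then
   run the 16-bit test on the widened values. Only the low 16 bits of
   each operand take part in the test (assumption of this model). */
static int kortestz16_via_int(uint16_t a, uint16_t b)
{
    int wa = zext16(a);          /* _3 = (int) _29 */
    int wb = zext16(b);          /* _2 = (int) _27 */
    return kortestz16((uint16_t)wa, (uint16_t)wb);
}
```

In this model kortestz16_via_int(a, b) equals kortestz16(a, b) for every
pair of masks, so the widened copies are only needed if something else
consumes them as integers.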


> Clang generates:
>
> .LBB0_2:                                # =>This Inner Loop Header: Depth=1
>         vpcmpeqw        (%rdi), %ymm0, %k0
>         vpcmpeqw        32(%rdi), %ymm0, %k1
>         kortestw        %k0, %k1
>         jne     .LBB0_3
>
> ICC inserts one KMOVW, but not the other.
>
> Godbolt build link: https://gcc.godbolt.org/z/cc3heo48M
>
> LLVM-MCA analysis: https://analysis.godbolt.org/z/dGvY1Wj78
> It shows the Clang loop runs on average 2.0 cycles per loop, whereas the GCC
> code is 3 cycles/loop.
>
> LLVM-MCA says the ICC loop with one of the two KMOV also runs at 2.0 cycles
> per loop, because it can run in parallel with the second load, given that
> the loads are ports 2 and 3.
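
The cycle counts quoted above can be reproduced with a rough
port-pressure estimate. The sketch below is a simplification, not what
LLVM-MCA actually computes: it assumes one uop per instruction, one
fixed port per uop, and one uop per port per cycle, and it ignores the
branch. KMOVW and KORTESTW on port 0 and the loads on ports 2 and 3 come
from the report; putting the compares on port 5 is an assumption of this
sketch, and icc_loop is reconstructed only from the "one KMOVW"
description. All names are hypothetical:

```c
#include <stddef.h>

enum { NUM_PORTS = 8 };

struct insn {
    const char *mnemonic;
    int port;                    /* single execution port (model assumption) */
};

/* Lower bound on cycles per iteration: the uop count on the busiest port. */
static int port_bound(const struct insn *loop, size_t n)
{
    int busy[NUM_PORTS] = {0};
    int worst = 0;
    for (size_t i = 0; i < n; i++) {
        int p = loop[i].port;
        if (++busy[p] > worst)
            worst = busy[p];
    }
    return worst;
}

/* GCC loop from the report: two KMOVW plus KORTESTW all land on port 0. */
static const struct insn gcc_loop[] = {
    {"vmovdqu16 (load)", 2}, {"vmovdqu16 (load)", 3},
    {"vpcmpuw", 5}, {"vpcmpuw", 5},          /* port 5: assumption */
    {"kmovw", 0}, {"kmovw", 0}, {"kortestw", 0},
};

/* Clang loop: loads folded into the compares, no KMOVW. */
static const struct insn clang_loop[] = {
    {"load (folded)", 2}, {"load (folded)", 3},
    {"vpcmpeqw", 5}, {"vpcmpeqw", 5},        /* port 5: assumption */
    {"kortestw", 0},
};

/* ICC loop, reconstructed: same as GCC with only one KMOVW. */
static const struct insn icc_loop[] = {
    {"load", 2}, {"load", 3},
    {"vpcmpuw", 5}, {"vpcmpuw", 5},          /* port 5: assumption */
    {"kmovw", 0}, {"kortestw", 0},
};
```

Under these assumptions port_bound() gives 3 for gcc_loop and 2 for both
clang_loop and icc_loop, which matches the 3 versus 2.0 cycles per loop
quoted from LLVM-MCA.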