From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 327C73858C60; Mon, 24 Jul 2023 08:21:37 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 327C73858C60
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1690186897;
	bh=vB3sKt0ljqABv8QwntuV0ifZDyr8TCTGYEaNktn1xYA=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=N+M2+xprxUL5ybT2Iwfv3RNrI3WSKEl9mNwxaky9pGkIdw4fYAZhn6axlkC7xMNJO
	 2C8egXuwSPXz0cuc2SOhNaPtSAnCIQ6Bb4EKGrxRObD7W20FXv/MAoXLgLv4FKwK5P
	 Hgt/tzKaI36TBZrJbDlB4n1vb207tihoekEjePBE=
From: "rguenth at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/107093] AVX512 mask operations not simplified in fully
 masked loop
Date: Mon, 24 Jul 2023 08:21:37 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 13.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rguenth at gcc dot gnu.org
X-Bugzilla-Status: RESOLVED
X-Bugzilla-Resolution: FIXED
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: resolution bug_status
Message-ID: <bug-107093-4-K89aVO2ctS@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-107093-4@http.gcc.gnu.org/bugzilla/>
References: <bug-107093-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107093

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|UNCONFIRMED                 |RESOLVED
--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
icelake is able to forward a masked store with an all-ones mask; Zen4 isn't
able to do that.  Other masked stores indeed do not forward.

There's also a related problem when an outer loop causes a low-trip-count
inner loop to use masked loads/stores on overlapping vectors:

outer iteration 1
   ... = .MASK_LOAD (p, {-1, -1, -1, -1, 0, 0, 0, 0});
   ...
   .MASK_STORE (p, val, {-1, -1, -1, -1, 0, 0, 0, 0});

outer iteration 2
   ... = .MASK_LOAD (p + delta, {-1, -1, -1, -1, 0, 0, 0, 0});
   ...
   .MASK_STORE (p + delta, val, {-1, -1, -1, -1, 0, 0, 0, 0});

with delta causing the next outer iteration to access the masked-out values
from the previous iteration.  That gets an STLF (store-to-load forwarding)
failure (obviously), but we now also need to wait for the masked store to
retire before the masked load of iteration 2 can be carried out.

We are hitting this case in SPEC CPU 2017 with masked epilogues (the
inner loop just iterates 4 times, vectorized with V8DFmode vectors).
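
To make the overlap concrete, here is a minimal scalar sketch (not taken
from GCC or SPEC; all names are hypothetical).  It models the store of outer
iteration 1 at element offset 0 and the load of outer iteration 2 at element
offset delta, using the 8-lane vector with 4 active lanes from the masks
above.  It only compares address ranges; it does not model any particular
CPU's forwarding logic.

```c
#include <stdbool.h>
#include <stddef.h>

enum { LANES = 8, ACTIVE = 4 };  /* V8DF vector, 4 active lanes (from the masks above) */

/* True when the half-open element ranges [a, a+alen) and [b, b+blen) intersect. */
static bool ranges_overlap(size_t a, size_t alen, size_t b, size_t blen)
{
    return a < b + blen && b < a + alen;
}

/* Whole-vector footprints: what the hardware sees if it ignores the mask. */
static bool full_footprints_overlap(size_t delta)
{
    return ranges_overlap(0, LANES, delta, LANES);
}

/* Active lanes only: the elements that are actually stored and loaded. */
static bool active_lanes_overlap(size_t delta)
{
    return ranges_overlap(0, ACTIVE, delta, ACTIVE);
}
```

For delta = 4 the whole-vector footprints overlap although no active element
is shared, which is the case described above; for delta >= 8 neither overlaps.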

Ideally the implementation (the CPU) would "shorten" loads/stores for
trailing sequences of zeros so this hazard doesn't occur.  Not sure if
that would be allowed by the x86 memory model though (I didn't find
anything specific there with respect to load/store masking).  ISTR store
buffer entries are usually assigned at instruction issue time, I'm not
sure if the mask is resolved there or whether the size of the store could
be adjusted later when it is.  The implementation could also somehow
ignore the "conflict".

Note I didn't yet fully benchmark masked epilogues with
-mprefer-vector-width=512 on icelake or sapphire rapids; maybe Intel CPUs
are not affected by this issue.

The original issue in the description seems solved; we now generate the
following with the code generation variant that's now on trunk:

.L3:
        vmovapd b(%rax), %ymm0{%k1}
        movl    %edx, %ecx
        subl    $4, %edx
        kmovw   %edx, %k0
        vmulpd  %ymm3, %ymm0, %ymm1{%k1}{z}
        vmovapd %ymm1, a(%rax){%k1}
        vpbroadcastmw2d %k0, %xmm1
        addq    $32, %rax
        vpcmpud $6, %xmm2, %xmm1, %k1
        cmpw    $4, %cx
        ja      .L3

that avoids using the slow mask ops for loop control.  It oddly does

        subl    $4, %edx
        kmovw   %edx, %k0
        vpbroadcastmw2d %k0, %xmm1

with -march=cascadelake; with -march=znver4 I get the expected

        subl    $8, %edx
        vpbroadcastw    %edx, %xmm1

but I can reproduce the mask register "spill" with -mprefer-vector-width=256.
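
For reference, a scalar sketch of what the broadcast plus vpcmpud $6 pair in
the .L3 loop computes for loop control.  Predicate 6 is unsigned
"not less or equal", i.e. greater-than.  The sketch assumes %xmm2 holds the
lane indices {0, 1, 2, 3}, which the listing does not show; the function name
and the 4-lane width are illustrative only.

```c
#include <stdint.h>

enum { VF = 4 };  /* lanes per vector in the .L3 loop (assumed from $4 / %xmm) */

/* Bit i of the result is set iff lane i still has work: remaining > i (unsigned). */
static uint8_t loop_mask(uint32_t remaining)
{
    uint8_t mask = 0;
    for (uint32_t lane = 0; lane < VF; ++lane)
        if (remaining > lane)            /* models vpcmpud $6 against lane index */
            mask |= (uint8_t)(1u << lane);
    return mask;
}
```

With 3 elements left this yields 0x7, and with 4 or more it yields the
all-ones mask 0xf.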

We expand to

(insn 14 13 15 (set (reg:V4SI 96)
        (vec_duplicate:V4SI (reg:SI 93 [ _27 ]))) 8167 {*avx512vl_vec_dup_gprv4si}
     (nil))

I'll file a separate bug report for this.