From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
 id 0F8323858429; Mon, 13 Sep 2021 01:16:46 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 0F8323858429
From: "crazylht at gmail dot com" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/91103] AVX512 vector element extract uses more than 1
 shuffle instruction; VALIGND can grab any element
Date: Mon, 13 Sep 2021 01:16:45 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 10.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: enhancement
X-Bugzilla-Who: crazylht at gmail dot com
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-91103-4-pGALMqgY46@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-91103-4@http.gcc.gnu.org/bugzilla/>
References: <bug-91103-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: gcc-bugs@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-bugs mailing list <gcc-bugs.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Mon, 13 Sep 2021 01:16:46 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103
--- Comment #10 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Peter Cordes from comment #9)
> Thanks for implementing my idea :)
>
> (In reply to Hongtao.liu from comment #6)
> > For elements located above 128bits, it seems always better(?) to use
> > valign{d,q}
>
> TL;DR:
>  I think we should still use vextracti* / vextractf* when that can get the
> job done in a single instruction, especially when the VEX-encoded
> vextracti/f128 can save a byte of code size for v[4].
>
> Extracts are simpler shuffles that might have better throughput on some
> future CPUs, especially the upcoming Zen4, so even without code-size savings
> we should use them when possible.  Tiger Lake has a 256-bit shuffle unit on
> port 1 that supports some common shuffles (like vpshufb); a future Intel
> might add 256->128-bit extracts to that.
>
> It might also save a tiny bit of power, allowing on-average higher turbo
> clocks.
>
> ---
>
> On current CPUs with AVX-512, valignd is about equal to a single vextract,
Yes, they're equal, but considering the comments below, I think it's reasonable
to use vextract instead of valign when byte_offset % 16 == 0.
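
To make the rule concrete, here is a minimal C sketch of the selection logic
discussed in this thread. It is not GCC's implementation: the enum and the
choose_extract helper are hypothetical names used only for illustration, and
the sketch assumes 4- or 8-byte elements (the sizes valign{d,q} can handle).

```c
/* Hypothetical illustration only, not code from GCC's i386 backend. */
enum extract_kind {
    EXTRACT_NONE,            /* element is already at position 0 */
    EXTRACT_LOW128_SHUFFLE,  /* inside the low 128 bits: existing special cases */
    EXTRACT_VEXTRACT,        /* 16-byte aligned offset: one vextract suffices */
    EXTRACT_VALIGN           /* anything else above 128 bits: valign{d,q} */
};

/* idx is the element index, elt_bytes the element size (assumed 4 or 8). */
static enum extract_kind choose_extract(unsigned idx, unsigned elt_bytes)
{
    unsigned byte_offset = idx * elt_bytes;

    if (byte_offset == 0)
        return EXTRACT_NONE;
    if (byte_offset < 16)
        return EXTRACT_LOW128_SHUFFLE;
    if (byte_offset % 16 == 0)
        return EXTRACT_VEXTRACT;
    return EXTRACT_VALIGN;
}
```

With 4-byte elements, v[4] and v[8] would take the vextract path, while v[5]
would still need valignd.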

> and better than multiple instructions.  It doesn't really have downsides on
> current Intel, since I think Intel has continued to not have int/FP bypass
> delays for shuffles.
>
> We don't know yet what AMD's Zen4 implementation of AVX-512 will look like.
> If it's like Zen1 was AVX2 (i.e. if it decodes 512-bit instructions other
> than insert/extract into at least 2x 256-bit uops) a lane-crossing shuffle
> like valignd probably costs more than 2 uops.  (vpermq is more than 2 uops
> on Piledriver/Zen1).  But a 128-bit extract will probably cost just one uop.
> (And especially an extract of the high 256 might be very cheap and low
> latency, like vextracti128 on Zen1, so we might prefer vextracti64x4 for
> v[8].)
>
> So this change is good, but using a vextracti64x2 or vextracti64x4 could be
> a useful peephole optimization when byte_offset % 16 == 0.  Or of course
> vextracti128 when possible (x/ymm0..15, not 16..31 which are only accessible
> with an EVEX-encoded instruction).
>
> vextractf-whatever allows an FP shuffle on FP data in case some future CPU
> cares about that for shuffles.
>=20
> An extract is a simpler shuffle that might have better throughput on some
> future CPU even with full-width execution units.  Some future Intel CPU
> might add support for vextract uops to the extra shuffle unit on port 1.
> (Which is available when no 512-bit uops are in flight.)  Currently (Ice
> Lake / Tiger Lake) it can only run some common shuffles like vpshufb ymm,
> but not including any vextract or valign.  Of course port 1 vector ALUs are
> shut down when 512-bit uops are in flight, but could be relevant for __m256
> vectors on these hypothetical future CPUs.
>
> When we can get the job done with a single vextract-something, we should use
> that instead of valignd.  Otherwise use valignd.
>
> We already check the index for low-128 special cases to use vunpckhqdq vs.
> vpshufd (or vpsrldq) or similar FP shuffles.
>
> -----
>
> On current Intel, with clean YMM/ZMM uppers (known by the CPU hardware to be
> zero), an extract that only writes a 128-bit register will keep them clean
> (even if it reads a ZMM), not needing a VZEROUPPER.  VZEROUPPER is
> only needed for dirty y/zmm0..15, not for dirty zmm16..31, so a function
> like
>
> float foo(float *p) {
>   some vector stuff that can use high zmm regs;
>   return scalar that happens to be from the middle of a vector;
> }
>=20
> could vextract into XMM0, but would need vzeroupper if it used valignd into
> ZMM0.
>
> (Also related
> https://stackoverflow.com/questions/58568514/does-skylake-need-vzeroupper-for-turbo-clocks-to-recover-after-a-512-bit-instruc
> re reading a ZMM at all and turbo clock).
>
> ---
>
> Having known zeros outside the low 128 bits (from writing an xmm instead of
> rotating a zmm) is unlikely to matter, although for FP stuff copying fewer
> elements that might be subnormal could happen to be an advantage, maybe
> saving an FP assist for denormal.  We're unlikely to be able to take
> advantage of it to save instructions/uops (like OR instead of blend).  But
> it's not worse to use a single extract instruction instead of a single
> valignd.
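
For reference, the difference between the two single-instruction options can
be sketched with a scalar C model. This is only an illustration of the
documented instruction semantics, not compiler code: model_valignd and
model_vextract128 are hypothetical helpers, and a ZMM register is modelled
as an array of 16 dwords.

```c
#include <stdint.h>
#include <string.h>

#define ZMM_DWORDS 16

/* Model of valignd zmm: concatenate hi:lo, shift right by imm dwords,
   keep the low 16 dwords.  With hi == lo this is a rotate. */
static void model_valignd(uint32_t dst[ZMM_DWORDS],
                          const uint32_t hi[ZMM_DWORDS],
                          const uint32_t lo[ZMM_DWORDS], unsigned imm)
{
    uint32_t cat[2 * ZMM_DWORDS];
    uint32_t out[ZMM_DWORDS];

    memcpy(cat, lo, sizeof(uint32_t) * ZMM_DWORDS);
    memcpy(cat + ZMM_DWORDS, hi, sizeof(uint32_t) * ZMM_DWORDS);
    imm &= ZMM_DWORDS - 1;              /* only the low 4 bits are used */
    for (unsigned i = 0; i < ZMM_DWORDS; i++)
        out[i] = cat[i + imm];
    memcpy(dst, out, sizeof out);       /* copy last, so dst may alias inputs */
}

/* Model of a 128-bit lane extract into an XMM destination: the selected
   lane lands in the low 4 dwords and the rest of the register is zeroed. */
static void model_vextract128(uint32_t dst[ZMM_DWORDS],
                              const uint32_t src[ZMM_DWORDS], unsigned lane)
{
    uint32_t out[ZMM_DWORDS] = {0};

    lane &= 3;                          /* a ZMM has four 128-bit lanes */
    memcpy(out, src + 4 * lane, 4 * sizeof(uint32_t));
    memcpy(dst, out, sizeof out);
}
```

For v[4], model_valignd(dst, v, v, 4) and model_vextract128(dst, v, 1) both
leave v[4] in dst[0]; the extract additionally leaves zeros above the low
128 bits, while the rotate keeps the remaining elements of v there.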