From: "crazylht at gmail dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/91103] AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element
Date: Mon, 13 Sep 2021 01:16:45 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103

--- Comment #10 from Hongtao.liu ---
(In reply to Peter Cordes from comment #9)
> Thanks for implementing my idea :)
>
> (In reply to Hongtao.liu from comment #6)
> > For elements located above 128bits, it seems always better(?) to use
> > valign{d,q}
>
> TL:DR:
> I think we should still use vextracti* / vextractf* when that can get the
> job done in a single instruction, especially when the VEX-encoded
> vextracti/f128 can save a byte of code size for v[4].
>
> Extracts are simpler shuffles that might have better throughput on some
> future CPUs, especially the upcoming Zen4, so even without code-size savings
> we should use them when possible.  Tiger Lake has a 256-bit shuffle unit on
> port 1 that supports some common shuffles (like vpshufb); a future Intel
> might add 256->128-bit extracts to that.
>
> It might also save a tiny bit of power, allowing on-average higher turbo
> clocks.
>
> ---
>
> On current CPUs with AVX-512, valignd is about equal to a single vextract,

Yes, they're equal, but considering the comments below, I think it's
reasonable to use vextract instead of valign for byte_offset % 16 == 0.

> and better than multiple instruction.  It doesn't really have downsides on
> current Intel, since I think Intel has continued to not have int/FP bypass
> delays for shuffles.
>
> We don't know yet what AMD's Zen4 implementation of AVX-512 will look like.
> If it's like Zen1 was AVX2 (i.e. if it decodes 512-bit instructions other
> than insert/extract into at least 2x 256-bit uops) a lane-crossing shuffle
> like valignd probably costs more than 2 uops.  (vpermq is more than 2 uops
> on Piledriver/Zen1).  But a 128-bit extract will probably cost just one uop.
> (And especially an extract of the high 256 might be very cheap and low
> latency, like vextracti128 on Zen1, so we might prefer vextracti64x4 for
> v[8].)
>
> So this change is good, but using a vextracti64x2 or vextracti64x4 could be
> a useful peephole optimization when byte_offset % 16 == 0.  Or of course
> vextracti128 when possible (x/ymm0..15, not 16..31 which are only accessible
> with an EVEX-encoded instruction).
>
> vextractf-whatever allows an FP shuffle on FP data in case some future CPU
> cares about that for shuffles.
>
> An extract is a simpler shuffle that might have better throughput on some
> future CPU even with full-width execution units.
> Some future Intel CPU might add support for vextract uops to the extra
> shuffle unit on port 1.  (Which is available when no 512-bit uops are in
> flight.)  Currently (Ice Lake / Tiger Lake) it can only run some common
> shuffles like vpshufb ymm, but not including any vextract or valign.  Of
> course port 1 vector ALUs are shut down when 512-bit uops are in flight,
> but could be relevant for __m256 vectors on these hypothetical future CPUs.
>
> When we can get the job done with a single vextract-something, we should use
> that instead of valignd.  Otherwise use valignd.
>
> We already check the index for low-128 special cases to use vunpckhqdq vs.
> vpshufd (or vpsrldq) or similar FP shuffles.
>
> -----
>
> On current Intel, with clean YMM/ZMM uppers (known by the CPU hardware to be
> zero), an extract that only writes a 128-bit register will keep them clean
> (even if it reads a ZMM), not needing a VZEROUPPER.  Since VZEROUPPER is
> only needed for dirty y/zmm0..15, not with dirty zmm16..31, so a function
> like
>
> float foo(float *p) {
>   some vector stuff that can use high zmm regs;
>   return scalar that happens to be from the middle of a vector;
> }
>
> could vextract into XMM0, but would need vzeroupper if it used valignd into
> ZMM0.
>
> (Also related
> https://stackoverflow.com/questions/58568514/does-skylake-need-vzeroupper-for-turbo-clocks-to-recover-after-a-512-bit-instruc
> re reading a ZMM at all and turbo clock).
>
> ---
>
> Having known zeros outside the low 128 bits (from writing an xmm instead of
> rotating a zmm) is unlikely to matter, although for FP stuff copying fewer
> elements that might be subnormal could happen to be an advantage, maybe
> saving an FP assist for denormal.  We're unlikely to be able to take
> advantage of it to save instructions/uops (like OR instead of blend).  But
> it's not worse to use a single extract instruction instead of a single
> valignd.
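
To make the rule discussed above concrete, here is a rough sketch of the
selection for 32-bit elements of a 512-bit vector.  It is only an
illustration, not GCC's actual expander code: pick_extract_insn, its
parameters, and the returned strings are hypothetical names, and the
vextracti64x4 choice for v[8] follows the "might prefer" suggestion in
comment #9 rather than any measured result.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not GCC code: name a single-instruction strategy
   for extracting 32-bit element IDX (0..15) from a 512-bit vector held
   in vector register number REGNO (0..31).  */
const char *
pick_extract_insn (unsigned idx, unsigned regno)
{
  unsigned byte_offset = idx * 4;

  if (byte_offset == 0)
    return "none";              /* Element 0 is already in place.  */
  if (byte_offset < 16)
    return "low-128 shuffle";   /* Existing special cases (vpshufd, vpsrldq, ...).  */
  if (byte_offset % 16 != 0)
    return "valignd";           /* Not lane-aligned: one lane-crossing shuffle.  */
  if (byte_offset == 16 && regno < 16)
    return "vextracti128";      /* VEX encoding, only reaches x/ymm0..15.  */
  if (byte_offset == 32)
    return "vextracti64x4";     /* High 256 bits (assumed preference for v[8]).  */
  return "vextracti32x4";       /* EVEX 128-bit lane extract.  */
}

/* Small self-check covering one index per branch.  */
void
pick_extract_insn_self_check (void)
{
  assert (strcmp (pick_extract_insn (0, 0), "none") == 0);
  assert (strcmp (pick_extract_insn (5, 0), "valignd") == 0);
  assert (strcmp (pick_extract_insn (4, 0), "vextracti128") == 0);
  assert (strcmp (pick_extract_insn (4, 16), "vextracti32x4") == 0);
}
```

With this sketch, v[4] in zmm0..15 maps to the VEX-encoded vextracti128,
v[4] in zmm16..31 falls back to an EVEX lane extract, and any index that
is not a multiple of 4 still uses valignd.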