public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
From: "crazylht at gmail dot com" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/91103] AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element
Date: Mon, 13 Sep 2021 01:16:45 +0000	[thread overview]
Message-ID: <bug-91103-4-pGALMqgY46@http.gcc.gnu.org/bugzilla/> (raw)
In-Reply-To: <bug-91103-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103

--- Comment #10 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Peter Cordes from comment #9)
> Thanks for implementing my idea :)
> 
> (In reply to Hongtao.liu from comment #6)
> > For elements located above 128bits, it seems always better(?) to use
> > valign{d,q}
> 
> TL:DR:
>  I think we should still use vextracti* / vextractf* when that can get the
> job done in a single instruction, especially when the VEX-encoded
> vextracti/f128 can save a byte of code size for v[4].
> 
> Extracts are simpler shuffles that might have better throughput on some
> future CPUs, especially the upcoming Zen4, so even without code-size savings
> we should use them when possible.  Tiger Lake has a 256-bit shuffle unit on
> port 1 that supports some common shuffles (like vpshufb); a future Intel
> might add 256->128-bit extracts to that.
> 
> It might also save a tiny bit of power, allowing on-average higher turbo
> clocks.
> 
> ---
> 
> On current CPUs with AVX-512, valignd is about equal to a single vextract,
Yes, they're equal but consider the below comments, i thinks it's reasonable to
use vextract instead of valign for byte_offset % 16 == 0.

> and better than multiple instruction.  It doesn't really have downsides on
> current Intel, since I think Intel has continued to not have int/FP bypass
> delays for shuffles.
> 
> We don't know yet what AMD's Zen4 implementation of AVX-512 will look like. 
> If it's like Zen1 was AVX2 (i.e. if it decodes 512-bit instructions other
> than insert/extract into at least 2x 256-bit uops) a lane-crossing shuffle
> like valignd probably costs more than 2 uops.  (vpermq is more than 2 uops
> on Piledriver/Zen1).  But a 128-bit extract will probably cost just one uop.
> (And especially an extract of the high 256 might be very cheap and low
> latency, like vextracti128 on Zen1, so we might prefer vextracti64x4 for
> v[8].)
> 
> So this change is good, but using a vextracti64x2 or vextracti64x4 could be
> a useful peephole optimization when byte_offset % 16 == 0.  Or of course
> vextracti128 when possible (x/ymm0..15, not 16..31 which are only accessible
> with an EVEX-encoded instruction).
> 
> vextractf-whatever allows an FP shuffle on FP data in case some future CPU
> cares about that for shuffles.
> 
> An extract is a simpler shuffle that might have better throughput on some
> future CPU even with full-width execution units.  Some future Intel CPU
> might add support for vextract uops to the extra shuffle unit on port 1. 
> (Which is available when no 512-bit uops are in flight.)  Currently (Ice
> Lake / Tiger Lake) it can only run some common shuffles like vpshufb ymm,
> but not including any vextract or valign.  Of course port 1 vector ALUs are
> shut down when 512-bit uops are in flight, but could be relevant for __m256
> vectors on these hypothetical future CPUs.
> 
> When we can get the job done with a single vextract-something, we should use
> that instead of valignd.  Otherwise use valignd.
> 
> We already check the index for low-128 special cases to use vunpckhqdq vs.
> vpshufd (or vpsrldq) or similar FP shuffles.
> 
> -----
> 
> On current Intel, with clean YMM/ZMM uppers (known by the CPU hardware to be
> zero), an extract that only writes a 128-bit register will keep them clean
> (even if it reads a ZMM), not needing a VZEROUPPER.  Since VZEROUPPER is
> only needed for dirty y/zmm0..15, not with dirty zmm16..31, so a function
> like
> 
> float foo(float *p) {
>   some vector stuff that can use high zmm regs;
>   return scalar that happens to be from the middle of a vector;
> }
> 
> could vextract into XMM0, but would need vzeroupper if it used valignd into
> ZMM0.
> 
> (Also related
> https://stackoverflow.com/questions/58568514/does-skylake-need-vzeroupper-
> for-turbo-clocks-to-recover-after-a-512-bit-instruc re reading a ZMM at all
> and turbo clock).
> 
> ---
> 
> Having known zeros outside the low 128 bits (from writing an xmm instead of
> rotating a zmm) is unlikely to matter, although for FP stuff copying fewer
> elements that might be subnormal could happen to be an advantage, maybe
> saving an FP assist for denormal.  We're unlikely to be able to take
> advantage of it to save instructions/uops (like OR instead of blend).  But
> it's not worse to use a single extract instruction instead of a single
> valignd.

  parent reply	other threads:[~2021-09-13  1:16 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <bug-91103-4@http.gcc.gnu.org/bugzilla/>
2021-09-05  4:32 ` pinskia at gcc dot gnu.org
2021-09-08 10:13 ` crazylht at gmail dot com
2021-09-09  1:33 ` cvs-commit at gcc dot gnu.org
2021-09-09  1:35 ` crazylht at gmail dot com
2021-09-11  7:54 ` peter at cordes dot ca
2021-09-13  1:16 ` crazylht at gmail dot com [this message]
2021-09-15  8:38 ` cvs-commit at gcc dot gnu.org
2023-07-12 11:20 ` rguenth at gcc dot gnu.org

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=bug-91103-4-pGALMqgY46@http.gcc.gnu.org/bugzilla/ \
    --to=gcc-bugzilla@gcc.gnu.org \
    --cc=gcc-bugs@gcc.gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).