public inbox for gcc-patches@gcc.gnu.org
From: Richard Biener <richard.guenther@gmail.com>
To: Richard Sandiford <richard.sandiford@arm.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>
Subject: Re: RFC/A: Add a targetm.vectorize.related_mode hook
Date: Wed, 30 Oct 2019 14:26:00 -0000
Message-ID: <CAFiYyc2-DJX1YzCy6Dya8T-Ks3zrs_xneUU3V3wZ2vcTObtEnw@mail.gmail.com>
In-Reply-To: <mpt4kzq1scj.fsf@arm.com>

On Wed, Oct 30, 2019 at 10:43 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> The series posted so far now shows how the hook would be used in practice.
> Just wanted to follow up on some points here:
>
> Richard Sandiford <richard.sandiford@arm.com> writes:
> > Richard Biener <richard.guenther@gmail.com> writes:
> >> On Wed, Oct 23, 2019 at 2:12 PM Richard Sandiford
> >> <richard.sandiford@arm.com> wrote:
> >>>
> >>> Richard Biener <richard.guenther@gmail.com> writes:
> >>> > On Wed, Oct 23, 2019 at 1:51 PM Richard Sandiford
> >>> > <richard.sandiford@arm.com> wrote:
> >>> >>
> >>> >> Richard Biener <richard.guenther@gmail.com> writes:
> >>> >> > On Wed, Oct 23, 2019 at 1:00 PM Richard Sandiford
> >>> >> > <richard.sandiford@arm.com> wrote:
> >>> >> >>
> >>> >> >> This patch is the first of a series that tries to remove two
> >>> >> >> assumptions:
> >>> >> >>
> >>> >> >> (1) that all vectors involved in vectorisation must be the same size
> >>> >> >>
> >>> >> >> (2) that there is only one vector mode for a given element mode and
> >>> >> >>     number of elements
> >>> >> >>
> >>> >> >> Relaxing (1) helps with targets that support multiple vector sizes or
> >>> >> >> that require the number of elements to stay the same.  E.g. if we're
> >>> >> >> vectorising code that operates on narrow and wide elements, and the
> >>> >> >> narrow elements use 64-bit vectors, then on AArch64 it would normally
> >>> >> >> be better to use 128-bit vectors rather than pairs of 64-bit vectors
> >>> >> >> for the wide elements.
> >>> >> >>
> >>> >> >> Relaxing (2) makes it possible for -msve-vector-bits=128 to produce
> >>> >> >> fixed-length code for SVE.  It also allows unpacked/half-size SVE
> >>> >> >> vectors to work with -msve-vector-bits=256.
> >>> >> >>
> >>> >> >> The patch adds a new hook that targets can use to control how we
> >>> >> >> move from one vector mode to another.  The hook takes a starting vector
> >>> >> >> mode, a new element mode, and (optionally) a new number of elements.
> >>> >> >> The flexibility needed for (1) comes in when the number of elements
> >>> >> >> isn't specified.
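> >>> >> >>
> >>> >> >> As a rough sketch (the names and exact signature below are just
> >>> >> >> for illustration, not lifted from the patch), a simple
> >>> >> >> implementation of the hook might look something like:
> >>> >> >>
> >>> >> >>   /* Return a vector mode with element mode ELEMENT_MODE and
> >>> >> >>      NUNITS elements, using VECTOR_MODE as the starting point.
> >>> >> >>      If NUNITS is zero, this version simply keeps the original
> >>> >> >>      vector size; targets can choose a different policy.  */
> >>> >> >>   static opt_machine_mode
> >>> >> >>   example_related_mode (machine_mode vector_mode,
> >>> >> >>                         scalar_mode element_mode,
> >>> >> >>                         poly_uint64 nunits)
> >>> >> >>   {
> >>> >> >>     machine_mode m;
> >>> >> >>     if ((maybe_ne (nunits, 0U)
> >>> >> >>          || multiple_p (GET_MODE_SIZE (vector_mode),
> >>> >> >>                         GET_MODE_SIZE (element_mode), &nunits))
> >>> >> >>         && mode_for_vector (element_mode, nunits).exists (&m)
> >>> >> >>         && VECTOR_MODE_P (m)
> >>> >> >>         && targetm.vector_mode_supported_p (m))
> >>> >> >>       return m;
> >>> >> >>     return opt_machine_mode ();
> >>> >> >>   }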
> >>> >> >>
> >>> >> >> All callers in this patch specify the number of elements, but a later
> >>> >> >> vectoriser patch doesn't.  I won't be posting the vectoriser patch
> >>> >> >> for a few days, hence the RFC/A tag.
> >>> >> >>
> >>> >> >> Tested individually on aarch64-linux-gnu and as a series on
> >>> >> >> x86_64-linux-gnu.  OK to install?  Or if not yet, does the idea
> >>> >> >> look OK?
> >>> >> >
> >>> >> > In isolation the idea looks good but maybe a bit limited?  I see
> >>> >> > how it works for the same-size case but if you consider x86
> >>> >> > where we have SSE, AVX256 and AVX512 what would it return
> >>> >> > for related_vector_mode (V4SImode, SImode, 0)?  Or is this
> >>> >> > kind of query not intended (where the component modes match
> >>> >> > but nunits is zero)?
> >>> >>
> >>> >> In that case we'd normally get V4SImode back.  It's an allowed
> >>> >> combination, but not very useful.
> >>> >>
> >>> >> > How do you get from SVE fixed 128bit to NEON fixed 128bit then?  Or is
> >>> >> > it just used to stay in the same register set for different component
> >>> >> > modes?
> >>> >>
> >>> >> Yeah, the idea is to use the original vector mode as essentially
> >>> >> a base architecture.
> >>> >>
> >>> >> The follow-on patches replace vec_info::vector_size with
> >>> >> vec_info::vector_mode and targetm.vectorize.autovectorize_vector_sizes
> >>> >> with targetm.vectorize.autovectorize_vector_modes.  These are the
> >>> >> starting modes that would be passed to the hook in the nunits==0 case.
> >>> >>
> >>> >> E.g. for Advanced SIMD on AArch64, it would make more sense for
> >>> >> related_mode (V4HImode, SImode, 0) to be V4SImode rather than V2SImode.
> >>> >> I think things would work in a similar way for the x86_64 vector archs.
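> >>> >>
> >>> >> In hook terms (assuming the hook hands back an opt_machine_mode,
> >>> >> like the other mode queries), that's roughly:
> >>> >>
> >>> >>   machine_mode wider;
> >>> >>   if (targetm.vectorize.related_mode (V4HImode, SImode, 0).exists (&wider))
> >>> >>     /* On AArch64 we'd want WIDER to be V4SImode (same number of
> >>> >>        lanes) rather than V2SImode (same vector size).  */
> >>> >>     gcc_checking_assert (wider == V4SImode);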
> >>> >>
> >>> >> For SVE we'd add both VNx16QImode (the SVE mode) and V16QImode (the
> >>> >> Advanced SIMD mode) to autovectorize_vector_modes, even though they
> >>> >> happen to be the same size for 128-bit SVE.  We can then compare
> >>> >> 128-bit SVE with 128-bit Advanced SIMD, with related_mode ensuring
> >>> >> that we consistently use all-SVE modes or all-Advanced SIMD modes
> >>> >> for each attempt.
> >>> >>
> >>> >> The plan for SVE is to add 4(!) modes to autovectorize_vector_modes:
> >>> >>
> >>> >> - VNx16QImode (full vector)
> >>> >> - VNx8QImode (half vector)
> >>> >> - VNx4QImode (quarter vector)
> >>> >> - VNx2QImode (eighth vector)
> >>> >>
> >>> >> and then pick the one with the lowest cost.  related_mode would
> >>> >> keep the number of units the same for nunits==0, within the limit
> >>> >> of the vector size.  E.g.:
> >>> >>
> >>> >> - related_mode (VNx16QImode, HImode, 0) == VNx8HImode (full vector)
> >>> >> - related_mode (VNx8QImode, HImode, 0) == VNx8HImode (full vector)
> >>> >> - related_mode (VNx4QImode, HImode, 0) == VNx4HImode (half vector)
> >>> >> - related_mode (VNx2QImode, HImode, 0) == VNx2HImode (quarter vector)
> >>> >>
> >>> >> and:
> >>> >>
> >>> >> - related_mode (VNx16QImode, SImode, 0) == VNx4SImode (full vector)
> >>> >> - related_mode (VNx8QImode, SImode, 0) == VNx4SImode (full vector)
> >>> >> - related_mode (VNx4QImode, SImode, 0) == VNx4SImode (full vector)
> >>> >> - related_mode (VNx2QImode, SImode, 0) == VNx2SImode (half vector)
> >>> >>
> >>> >> So when operating on multiple element sizes, the tradeoff is between
> >>> >> trying to make full use of the vector size (higher base nunits) vs.
> >>> >> trying to remove packs and unpacks between multiple vector copies
> >>> >> (lower base nunits).  The latter is useful because extending within
> >>> >> a vector is an in-lane rather than cross-lane operation and truncating
> >>> >> within a vector is a no-op.
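> >>> >>
> >>> >> In code terms, the nunits==0 policy above would be roughly the
> >>> >> following (hand-waving over the exact helpers and names):
> >>> >>
> >>> >>   /* Keep BASE_MODE's number of elements, but cap it at however
> >>> >>      many ELEMENT_MODE elements fit in a full SVE vector.  */
> >>> >>   poly_uint64 nunits = GET_MODE_NUNITS (base_mode);
> >>> >>   poly_uint64 vector_bytes = BYTES_PER_SVE_VECTOR;
> >>> >>   poly_uint64 full_nunits = exact_div (vector_bytes,
> >>> >>                                        GET_MODE_SIZE (element_mode));
> >>> >>   if (known_gt (nunits, full_nunits))
> >>> >>     nunits = full_nunits;
> >>> >>   /* E.g. VNx4QImode + HImode keeps 4+4x lanes (VNx4HImode), while
> >>> >>      VNx16QImode + SImode is capped at 4+4x lanes (VNx4SImode).  */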
> >>> >>
> >>> >> With a couple of tweaks, we seem to do a good job of guessing which
> >>> >> version has the lowest cost, at least for the simple cases I've tried
> >>> >> so far.
> >>> >>
> >>> >> Obviously there's going to be a bit of a compile-time cost
> >>> >> for SVE targets, but I think it's worth paying for.
> >>> >
> >>> > I would guess that immediate benefit could be seen with
> >>> > basic-block vectorization which simply fails when conversions
> >>> > are involved.  x86_64 should now always support V4SImode
> >>> > and V2SImode so eventually a testcase can be crafted for that
> >>> > as well.
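> >>> >
> >>> > Something along these lines might do as a testcase sketch (untested,
> >>> > just to show the shape):
> >>> >
> >>> >   /* BB vectorization with a conversion: four ints produced from
> >>> >      four shorts, wanting V4SImode next to the short vectors.  */
> >>> >   void
> >>> >   foo (int *__restrict a, short *__restrict b)
> >>> >   {
> >>> >     a[0] = b[0];
> >>> >     a[1] = b[1];
> >>> >     a[2] = b[2];
> >>> >     a[3] = b[3];
> >>> >   }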
> >>>
> >>> I'd hoped so too, but the problem is that if the cost saving is good
> >>> enough, BB vectorisation simply stops at the conversion frontiers and
> >>> vectorises the rest, rather than considering other vector mode
> >>> combinations that might be able to do more.
> >>
> >> Sure, but when SLP build fails because it thinks it needs to unroll,
> >> it could ask for a vector type with the same number of lanes
> >
> > Do you mean for loop or BB vectorisation?  For loop vectorisation
> > the outer loop iterating over vector sizes/modes works fine: if we
> > need to unroll beyond the maximum VF, we'll just retry vectorisation
> > with a different vector size/mode combination, with the narrowest
> > element having smaller vectors.
> >
> > But yeah, I guess if get_vectype_for_scalar_type returns a vector
> > with too many units for BB vectorisation, we could try asking for
> > a different type with the "right" number of units, rather than
> > failing and falling back to the next iteration of the outer loop.
> > I'll give that a go.
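> >
> > Roughly something like this, I think (names purely illustrative):
> >
> >   /* If the natural choice of vector type has too many lanes for this
> >      SLP group, ask the target for one with GROUP_SIZE lanes instead.  */
> >   if (known_gt (TYPE_VECTOR_SUBPARTS (vectype), group_size))
> >     {
> >       machine_mode m;
> >       if (targetm.vectorize.related_mode (TYPE_MODE (vectype),
> >                                           SCALAR_TYPE_MODE (scalar_type),
> >                                           group_size).exists (&m))
> >         vectype = build_vector_type_for_mode (scalar_type, m);
> >     }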
>
> This is 16/n: https://gcc.gnu.org/ml/gcc-patches/2019-10/msg02063.html
> It was easier than I'd expected and cleans up some nasty codegen for
> mixed-size SLP on AArch64, so thanks :-)
>
> On iterating over VFs:
>
> >> (that's probably what we should do from the start - get vector types
> >> with the same number of lanes in BB vectorization).
> >
> > It's still useful to have different numbers of lanes for both loop
> > and BB vectorisation.  E.g. if we're applying SLP to an operation on
> > 8 ints and 8 shorts, it's still better to mix V4SI and V8HI for 128-bit
> > vector archs.
> >
> >> It's when you introduce multiple sizes that the outer loop over all
> >> sizes comparing costs becomes somewhat obsolete...  this outer
> >> loop should then instead compare different VFs (which also means
> >> possible extra unrolling beyond the vector size need?).
> >
> > This is effectively what we're doing for the SVE arrangement above.
> > But replacing targetm.vectorize.autovectorize_vector_sizes with
> > targetm.vectorize.autovectorize_vector_modes means that we can
> > also try multiple vector archs with the same VF (e.g. 128-bit SVE
> > vs. 128-bit Advanced SIMD).
>
> See 12/n: https://gcc.gnu.org/ml/gcc-patches/2019-10/msg01831.html
> for how this iteration works in practice.  The AArch64 port specifies
> four ways of picking vector types, depending on which elements (if any)
> use 64 bits instead of 128 bits.  Each of these approaches then determines
> a VF, based on the element types that are actually used.  Sometimes we
> can end up trying the same VF and vector types multiple times, but that's
> easy to fix -- will post a patch soon as preparation for picking the
> "best" size.
>
> I think iterating on these approaches is better than iterating directly
> on candidate VFs for the reason mentioned above: full-vector 128-bit SVE
> and full-vector 128-bit Advanced SIMD both provide the same VF, but we
> want to try them both.
>
> And like you say, if base vector mode M1 gives a VF of X and base vector
> mode M2 gives a VF of X/2, and if 2*M2 appears cheaper than M1, we could
> in future have the option of doubling the VF for M2 artificially as a
> way of getting interleave-based unrolling.  So in that case we'd have
> two different approaches for vectorising at VF X even if we stick to the
> same vector architecture.
>
> So I don't think starting with the VF and calculating other things from
> that is intrinsically better.  We'd then have to have a subloop that
> iterates over Advanced SIMD vs. SVE, and potentially a further subloop
> that iterates over unroll factors.

OK, fair enough.

The patch is OK.

Thanks,
Richard.

> Thanks,
> Richard
