public inbox for gcc-patches@gcc.gnu.org
From: Richard Sandiford <richard.sandiford@arm.com>
To: Richard Biener <richard.guenther@gmail.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>
Subject: Re: RFC/A: Add a targetm.vectorize.related_mode hook
Date: Wed, 30 Oct 2019 09:58:00 -0000	[thread overview]
Message-ID: <mpt4kzq1scj.fsf@arm.com> (raw)
In-Reply-To: <mpteez1b6gt.fsf@arm.com> (Richard Sandiford's message of "Fri, 25 Oct 2019 09:00:02 +0100")

The series posted so far now shows how the hook would be used in practice.
Just wanted to follow up on some points here:

Richard Sandiford <richard.sandiford@arm.com> writes:
> Richard Biener <richard.guenther@gmail.com> writes:
>> On Wed, Oct 23, 2019 at 2:12 PM Richard Sandiford
>> <richard.sandiford@arm.com> wrote:
>>>
>>> Richard Biener <richard.guenther@gmail.com> writes:
>>> > On Wed, Oct 23, 2019 at 1:51 PM Richard Sandiford
>>> > <richard.sandiford@arm.com> wrote:
>>> >>
>>> >> Richard Biener <richard.guenther@gmail.com> writes:
>>> >> > On Wed, Oct 23, 2019 at 1:00 PM Richard Sandiford
>>> >> > <richard.sandiford@arm.com> wrote:
>>> >> >>
>>> >> >> This patch is the first of a series that tries to remove two
>>> >> >> assumptions:
>>> >> >>
>>> >> >> (1) that all vectors involved in vectorisation must be the same size
>>> >> >>
>>> >> >> (2) that there is only one vector mode for a given element mode and
>>> >> >>     number of elements
>>> >> >>
>>> >> >> Relaxing (1) helps with targets that support multiple vector sizes or
>>> >> >> that require the number of elements to stay the same.  E.g. if we're
>>> >> >> vectorising code that operates on narrow and wide elements, and the
>>> >> >> narrow elements use 64-bit vectors, then on AArch64 it would normally
>>> >> >> be better to use 128-bit vectors rather than pairs of 64-bit vectors
>>> >> >> for the wide elements.
>>> >> >>
>>> >> >> Relaxing (2) makes it possible for -msve-vector-bits=128 to produce
>>> >> >> fixed-length code for SVE.  It also allows unpacked/half-size SVE
>>> >> >> vectors to work with -msve-vector-bits=256.
>>> >> >>
>>> >> >> The patch adds a new hook that targets can use to control how we
>>> >> >> move from one vector mode to another.  The hook takes a starting vector
>>> >> >> mode, a new element mode, and (optionally) a new number of elements.
>>> >> >> The flexibility needed for (1) comes in when the number of elements
>>> >> >> isn't specified.
>>> >> >>
>>> >> >> All callers in this patch specify the number of elements, but a later
>>> >> >> vectoriser patch doesn't.  I won't be posting the vectoriser patch
>>> >> >> for a few days, hence the RFC/A tag.
>>> >> >>
>>> >> >> Tested individually on aarch64-linux-gnu and as a series on
>>> >> >> x86_64-linux-gnu.  OK to install?  Or if not yet, does the idea
>>> >> >> look OK?
>>> >> >
>>> >> > In isolation the idea looks good but maybe a bit limited?  I see
>>> >> > how it works for the same-size case but if you consider x86
>>> >> > where we have SSE, AVX256 and AVX512 what would it return
>>> >> > for related_vector_mode (V4SImode, SImode, 0)?  Or is this
>>> >> > kind of query not intended (where the component modes match
>>> >> > but nunits is zero)?
>>> >>
>>> >> In that case we'd normally get V4SImode back.  It's an allowed
>>> >> combination, but not very useful.
>>> >>
>>> >> > How do you get from SVE fixed 128bit to NEON fixed 128bit then?  Or is
>>> >> > it just used to stay in the same register set for different component
>>> >> > modes?
>>> >>
>>> >> Yeah, the idea is to use the original vector mode as essentially
>>> >> a base architecture.
>>> >>
>>> >> The follow-on patches replace vec_info::vector_size with
>>> >> vec_info::vector_mode and targetm.vectorize.autovectorize_vector_sizes
>>> >> with targetm.vectorize.autovectorize_vector_modes.  These are the
>>> >> starting modes that would be passed to the hook in the nunits==0 case.
>>> >>
>>> >> E.g. for Advanced SIMD on AArch64, it would make more sense for
>>> >> related_mode (V4HImode, SImode, 0) to be V4SImode rather than V2SImode.
>>> >> I think things would work in a similar way for the x86_64 vector archs.
>>> >>
>>> >> For SVE we'd add both VNx16QImode (the SVE mode) and V16QImode (the
>>> >> Advanced SIMD mode) to autovectorize_vector_modes, even though they
>>> >> happen to be the same size for 128-bit SVE.  We can then compare
>>> >> 128-bit SVE with 128-bit Advanced SIMD, with related_mode ensuring
>>> >> that we consistently use all-SVE modes or all-Advanced SIMD modes
>>> >> for each attempt.
>>> >>
>>> >> The plan for SVE is to add 4(!) modes to autovectorize_vector_modes:
>>> >>
>>> >> - VNx16QImode (full vector)
>>> >> - VNx8QImode (half vector)
>>> >> - VNx4QImode (quarter vector)
>>> >> - VNx2QImode (eighth vector)
>>> >>
>>> >> and then pick the one with the lowest cost.  related_mode would
>>> >> keep the number of units the same for nunits==0, within the limit
>>> >> of the vector size.  E.g.:
>>> >>
>>> >> - related_mode (VNx16QImode, HImode, 0) == VNx8HImode (full vector)
>>> >> - related_mode (VNx8QImode, HImode, 0) == VNx8HImode (full vector)
>>> >> - related_mode (VNx4QImode, HImode, 0) == VNx4HImode (half vector)
>>> >> - related_mode (VNx2QImode, HImode, 0) == VNx2HImode (quarter vector)
>>> >>
>>> >> and:
>>> >>
>>> >> - related_mode (VNx16QImode, SImode, 0) == VNx4SImode (full vector)
>>> >> - related_mode (VNx8QImode, SImode, 0) == VNx4SImode (full vector)
>>> >> - related_mode (VNx4QImode, SImode, 0) == VNx4SImode (full vector)
>>> >> - related_mode (VNx2QImode, SImode, 0) == VNx2SImode (half vector)
>>> >>
>>> >> So when operating on multiple element sizes, the tradeoff is between
>>> >> trying to make full use of the vector size (higher base nunits) vs.
>>> >> trying to remove packs and unpacks between multiple vector copies
>>> >> (lower base nunits).  The latter is useful because extending within
>>> >> a vector is an in-lane rather than cross-lane operation and truncating
>>> >> within a vector is a no-op.
>>> >>
>>> >> With a couple of tweaks, we seem to do a good job of guessing which
>>> >> version has the lowest cost, at least for the simple cases I've tried
>>> >> so far.
>>> >>
>>> >> Obviously there's going to be a bit of a compile-time cost
>>> >> for SVE targets, but I think it's worth paying for.
>>> >
>>> > I would guess that immediate benefit could be seen with
>>> > basic-block vectorization which simply fails when conversions
>>> > are involved.  x86_64 should now always support V4SImode
>>> > and V2SImode so eventually a testcase can be crafted for that
>>> > as well.
>>>
>>> I'd hoped so too, but the problem is that if the cost saving is good
>>> enough, BB vectorisation simply stops at the conversion frontiers and
>>> vectorises the rest, rather than considering other vector mode
>>> combinations that might be able to do more.
>>
>> Sure, but when SLP build fails because it thinks it needs to unroll
>> it could ask for a vector type with the same number of lanes
>
> Do you mean for loop or BB vectorisation?  For loop vectorisation
> the outer loop iterating over vector sizes/modes works fine: if we
> need to unroll beyond the maximum VF, we'll just retry vectorisation
> with a different vector size/mode combination, with the narrowest
> element having smaller vectors.
>
> But yeah, I guess if get_vectype_for_scalar_type returns a vector
> with too many units for BB vectorisation, we could try asking for
> a different type with the "right" number of units, rather than
> failing and falling back to the next iteration of the outer loop.
> I'll give that a go.

This is 16/n: https://gcc.gnu.org/ml/gcc-patches/2019-10/msg02063.html
It was easier than I'd expected and cleans up some nasty codegen for
mixed-size SLP on AArch64, so thanks :-)

On iterating over VFs:

>> (that's probably what we should do from the start - get same number
>> of lane vector types in BB vectorization).
>
> It's still useful to have different numbers of lanes for both loop
> and BB vectorisation.  E.g. if we're applying SLP to an operation on
> 8 ints and 8 shorts, it's still better to mix V4SI and V8HI for 128-bit
> vector archs.
>
>> It's when you introduce multiple sizes that the outer loop over all
>> sizes comparing costs becomes somewhat obsolete...  this outer
>> loop should then instead compare different VFs (which also means
>> possible extra unrolling beyond the vector size need?).
>
> This is effectively what we're doing for the SVE arrangement above.
> But replacing targetm.vectorize.autovectorize_vector_sizes with
> targetm.vectorize.autovectorize_vector_modes means that we can
> also try multiple vector archs with the same VF (e.g. 128-bit SVE
> vs. 128-bit Advanced SIMD).

See 12/n: https://gcc.gnu.org/ml/gcc-patches/2019-10/msg01831.html
for how this iteration works in practice.  The AArch64 port specifies
four ways of picking vector types, depending on which elements (if any)
use 64 bits instead of 128 bits.  Each of these approaches then determines
a VF, based on the element types that are actually used.  Sometimes we
can end up trying the same VF and vector types multiple times, but that's
easy to fix -- will post a patch soon as preparation for picking the
"best" size.

I think iterating on these approaches is better than iterating directly
on candidate VFs for the reason mentioned above: full-vector 128-bit SVE
and full-vector 128-bit Advanced SIMD both provide the same VF, but we
want to try them both.

And like you say, if base vector mode M1 gives a VF of X and base vector
mode M2 gives a VF of X/2, and if 2*M2 appears cheaper than M1, we could
in future have the option of doubling the VF for M2 artificially as a
way of getting interleave-based unrolling.  So in that case we'd have
two different approaches for vectorising at VF X even if we stick to the
same vector architecture.

So I don't think starting with the VF and calculating other things from
that is intrinsically better.  We'd then have to have a subloop that
iterates over Advanced SIMD vs. SVE, and potentially a further subloop
that iterates over unroll factors.

Thanks,
Richard


Thread overview: 13+ messages
2019-10-23 11:01 Richard Sandiford
2019-10-23 11:16 ` Richard Biener
2019-10-23 12:00   ` Richard Sandiford
2019-10-23 12:07     ` Richard Biener
2019-10-23 12:29       ` Richard Sandiford
2019-10-25  7:21         ` Richard Biener
2019-10-25  8:01           ` Richard Sandiford
2019-10-30  9:58             ` Richard Sandiford [this message]
2019-10-30 14:26               ` Richard Biener
2019-10-23 22:35     ` H.J. Lu
2019-10-24  8:14       ` Richard Sandiford
2019-10-24 15:18         ` H.J. Lu
2019-10-23 21:42   ` Jim Wilson
