From: Richard Sandiford
To: Richard Biener
Cc: GCC Patches
Subject: Re: RFC/A: Add a targetm.vectorize.related_mode hook
Date: Fri, 25 Oct 2019 08:01:00 -0000

Richard Biener writes:
> On Wed, Oct 23, 2019 at 2:12 PM Richard Sandiford
> wrote:
>>
>> Richard Biener writes:
>> > On Wed, Oct 23, 2019 at 1:51 PM Richard Sandiford
>> > wrote:
>> >>
>> >> Richard Biener writes:
>> >> > On Wed, Oct 23, 2019 at 1:00 PM Richard Sandiford
>> >> > wrote:
>> >> >>
>> >> >> This patch is the first of a series that tries to remove two
>> >> >> assumptions:
>> >> >>
>> >> >> (1) that all vectors involved in vectorisation must be the same size
>> >> >>
>> >> >> (2) that there is only one vector mode for a given element mode and
>> >> >> number of elements
>> >> >>
>> >> >> Relaxing (1) helps with targets that support multiple vector sizes or
>> >> >> that require the number of elements to stay the same. E.g. if we're
>> >> >> vectorising code that operates on narrow and wide elements, and the
>> >> >> narrow elements use 64-bit vectors, then on AArch64 it would normally
>> >> >> be better to use 128-bit vectors rather than pairs of 64-bit vectors
>> >> >> for the wide elements.
>> >> >>
>> >> >> Relaxing (2) makes it possible for -msve-vector-bits=128 to produce
>> >> >> fixed-length code for SVE. It also allows unpacked/half-size SVE
>> >> >> vectors to work with -msve-vector-bits=256.
>> >> >>
>> >> >> The patch adds a new hook that targets can use to control how we
>> >> >> move from one vector mode to another. The hook takes a starting vector
>> >> >> mode, a new element mode, and (optionally) a new number of elements.
>> >> >> The flexibility needed for (1) comes in when the number of elements
>> >> >> isn't specified.
>> >> >>
>> >> >> All callers in this patch specify the number of elements, but a later
>> >> >> vectoriser patch doesn't. I won't be posting the vectoriser patch
>> >> >> for a few days, hence the RFC/A tag.
>> >> >>
>> >> >> Tested individually on aarch64-linux-gnu and as a series on
>> >> >> x86_64-linux-gnu. OK to install? Or if not yet, does the idea
>> >> >> look OK?
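(To make the interface concrete, here's a very rough model of the shape
of the hook. The names and plain integers below are made up for the
sketch, standing in for the real machine_mode/poly_uint64 types, so
treat it as an illustration of the idea rather than code from the patch:)

  #include <optional>

  /* Illustrative model only: a vector "mode" reduced to an element
     width in bits and a lane count.  */
  struct vec_mode
  {
    unsigned elt_bits;  /* element size in bits */
    unsigned nunits;    /* number of lanes */
  };

  /* One possible policy, purely for illustration: honour NUNITS when
     it is given, otherwise keep the total size of START and adjust the
     lane count to suit the new element size.  */
  std::optional<vec_mode>
  related_mode (vec_mode start, unsigned new_elt_bits, unsigned nunits)
  {
    if (nunits == 0)
      {
        unsigned total_bits = start.elt_bits * start.nunits;
        if (total_bits % new_elt_bits != 0)
          return std::nullopt;
        nunits = total_bits / new_elt_bits;
      }
    return vec_mode { new_elt_bits, nunits };
  }

  /* E.g. related_mode ({16, 4}, 32, 0) gives {32, 2} ("V4HI" -> "V2SI"),
     whereas a target might prefer {32, 4} ("V4SI") instead; that choice
     is exactly what the hook leaves to the target.  */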
>> >> >
>> >> > In isolation the idea looks good but maybe a bit limited? I see
>> >> > how it works for the same-size case but if you consider x86
>> >> > where we have SSE, AVX256 and AVX512 what would it return
>> >> > for related_vector_mode (V4SImode, SImode, 0)? Or is this
>> >> > kind of query not intended (where the component modes match
>> >> > but nunits is zero)?
>> >>
>> >> In that case we'd normally get V4SImode back. It's an allowed
>> >> combination, but not very useful.
>> >>
>> >> > How do you get from SVE fixed 128bit to NEON fixed 128bit then? Or is
>> >> > it just used to stay in the same register set for different component
>> >> > modes?
>> >>
>> >> Yeah, the idea is to use the original vector mode as essentially
>> >> a base architecture.
>> >>
>> >> The follow-on patches replace vec_info::vector_size with
>> >> vec_info::vector_mode and targetm.vectorize.autovectorize_vector_sizes
>> >> with targetm.vectorize.autovectorize_vector_modes. These are the
>> >> starting modes that would be passed to the hook in the nunits==0 case.
>> >>
>> >> E.g. for Advanced SIMD on AArch64, it would make more sense for
>> >> related_mode (V4HImode, SImode, 0) to be V4SImode rather than V2SImode.
>> >> I think things would work in a similar way for the x86_64 vector archs.
>> >>
>> >> For SVE we'd add both VNx16QImode (the SVE mode) and V16QImode (the
>> >> Advanced SIMD mode) to autovectorize_vector_modes, even though they
>> >> happen to be the same size for 128-bit SVE. We can then compare
>> >> 128-bit SVE with 128-bit Advanced SIMD, with related_mode ensuring
>> >> that we consistently use all-SVE modes or all-Advanced SIMD modes
>> >> for each attempt.
>> >>
>> >> The plan for SVE is to add 4(!) modes to autovectorize_vector_modes:
>> >>
>> >> - VNx16QImode (full vector)
>> >> - VNx8QImode (half vector)
>> >> - VNx4QImode (quarter vector)
>> >> - VNx2QImode (eighth vector)
>> >>
>> >> and then pick the one with the lowest cost. related_mode would
>> >> keep the number of units the same for nunits==0, within the limit
>> >> of the vector size. E.g.:
>> >>
>> >> - related_mode (VNx16QImode, HImode, 0) == VNx8HImode (full vector)
>> >> - related_mode (VNx8QImode, HImode, 0) == VNx8HImode (full vector)
>> >> - related_mode (VNx4QImode, HImode, 0) == VNx4HImode (half vector)
>> >> - related_mode (VNx2QImode, HImode, 0) == VNx2HImode (quarter vector)
>> >>
>> >> and:
>> >>
>> >> - related_mode (VNx16QImode, SImode, 0) == VNx4SImode (full vector)
>> >> - related_mode (VNx8QImode, SImode, 0) == VNx4SImode (full vector)
>> >> - related_mode (VNx4QImode, SImode, 0) == VNx4SImode (full vector)
>> >> - related_mode (VNx2QImode, SImode, 0) == VNx2SImode (half vector)
>> >>
>> >> So when operating on multiple element sizes, the tradeoff is between
>> >> trying to make full use of the vector size (higher base nunits) vs.
>> >> trying to remove packs and unpacks between multiple vector copies
>> >> (lower base nunits). The latter is useful because extending within
>> >> a vector is an in-lane rather than cross-lane operation and truncating
>> >> within a vector is a no-op.
>> >>
>> >> With a couple of tweaks, we seem to do a good job of guessing which
>> >> version has the lowest cost, at least for the simple cases I've tried
>> >> so far.
>> >>
>> >> Obviously there's going to be a bit of a compile-time cost
>> >> for SVE targets, but I think it's worth paying for.
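(As a way to picture the nunits==0 rule, sticking with the same toy
vec_mode model as in the earlier sketch rather than real target code,
and using 128 for the bits per SVE granule: the result keeps the
starting lane count, capped at whatever fits in a full vector.)

  /* Toy sketch of the rule described above; VECTOR_BITS is just the
     128-bit granule used in the examples, not a real GCC macro.  */
  const unsigned VECTOR_BITS = 128;

  vec_mode
  sve_related_mode (vec_mode start, unsigned new_elt_bits)
  {
    /* Keep the starting lane count, but never exceed a full vector.  */
    unsigned max_nunits = VECTOR_BITS / new_elt_bits;
    unsigned nunits = start.nunits < max_nunits ? start.nunits : max_nunits;
    return vec_mode { new_elt_bits, nunits };
  }

  /* Reading VNxNQI as N 8-bit lanes, this reproduces the tables above:
       sve_related_mode ({8, 16}, 16) -> {16, 8}   i.e. VNx8HI (full vector)
       sve_related_mode ({8, 4}, 16)  -> {16, 4}   i.e. VNx4HI (half vector)
       sve_related_mode ({8, 2}, 32)  -> {32, 2}   i.e. VNx2SI (half vector)  */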
>> >
>> > I would guess that immediate benefit could be seen with
>> > basic-block vectorization which simply fails when conversions
>> > are involved. x86_64 should now always support V4SImode
>> > and V2SImode so eventually a testcase can be crafted for that
>> > as well.
>>
>> I'd hoped so too, but the problem is that if the cost saving is good
>> enough, BB vectorisation simply stops at the conversion frontiers and
>> vectorises the rest, rather than considering other vector mode
>> combinations that might be able to do more.
>
> Sure, but when SLP build fails because it thinks it needs to unroll
> it could ask for a vector type with the same number of lanes

Do you mean for loop or BB vectorisation? For loop vectorisation the
outer loop iterating over vector sizes/modes works fine: if we need to
unroll beyond the maximum VF, we'll just retry vectorisation with a
different vector size/mode combination, with the narrowest element
having smaller vectors.

But yeah, I guess if get_vectype_for_scalar_type returns a vector with
too many units for BB vectorisation, we could try asking for a
different type with the "right" number of units, rather than failing
and falling back to the next iteration of the outer loop. I'll give
that a go.

> (that's probably what we should do from the start - get same number
> of lane vector types in BB vectorization).

It's still useful to have different numbers of lanes for both loop and
BB vectorisation. E.g. if we're applying SLP to an operation on 8 ints
and 8 shorts, it's still better to mix V4SI and V8HI for 128-bit
vector archs.

> It's when you introduce multiple sizes then the outer loop over all
> sizes comparing costs becomes somewhat obsolete... this outer
> loop should then instead compare different VFs (which also means
> possible extra unrolling beyond the vector size need?).

This is effectively what we're doing for the SVE arrangement above.
But replacing targetm.vectorize.autovectorize_vector_sizes with
targetm.vectorize.autovectorize_vector_modes means that we can also
try multiple vector archs with the same VF (e.g. 128-bit SVE vs.
128-bit Advanced SIMD).

Thanks,
Richard
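P.S. In case a concrete example helps: the sort of straight-line code
I have in mind for the mixed V4SI/V8HI case is along these lines (a
made-up snippet rather than a testcase from the series). With 128-bit
vectors the int group can become two V4SI additions and the short
group a single V8HI addition:

  void
  f (int *a, const int *b, short *c, const short *d)
  {
    /* Eight int lanes: two V4SI additions on a 128-bit target.  */
    a[0] = b[0] + 1;  a[1] = b[1] + 1;  a[2] = b[2] + 1;  a[3] = b[3] + 1;
    a[4] = b[4] + 1;  a[5] = b[5] + 1;  a[6] = b[6] + 1;  a[7] = b[7] + 1;
    /* Eight short lanes: a single V8HI addition.  */
    c[0] = d[0] + 1;  c[1] = d[1] + 1;  c[2] = d[2] + 1;  c[3] = d[3] + 1;
    c[4] = d[4] + 1;  c[5] = d[5] + 1;  c[6] = d[6] + 1;  c[7] = d[7] + 1;
  }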