From: Richard Sandiford
To: Richard Biener
Cc: GCC Patches
Subject: Re: RFC/A: Add a targetm.vectorize.related_mode hook
Date: Fri, 25 Oct 2019 08:01:00 -0000

Richard Biener writes:
> On Wed, Oct 23, 2019 at 2:12 PM Richard Sandiford
> wrote:
>>
>> Richard Biener writes:
>> > On Wed, Oct 23, 2019 at 1:51 PM Richard Sandiford
>> > wrote:
>> >>
>> >> Richard Biener writes:
>> >> > On Wed, Oct 23, 2019 at 1:00 PM Richard Sandiford
>> >> > wrote:
>> >> >>
>> >> >> This patch is the first of a series that tries to remove two
>> >> >> assumptions:
>> >> >>
>> >> >> (1) that all vectors involved in vectorisation must be the same size
>> >> >>
>> >> >> (2) that there is only one vector mode for a given element mode and
>> >> >> number of elements
>> >> >>
>> >> >> Relaxing (1) helps with targets that support multiple vector sizes or
>> >> >> that require the number of elements to stay the same. E.g. if we're
>> >> >> vectorising code that operates on narrow and wide elements, and the
>> >> >> narrow elements use 64-bit vectors, then on AArch64 it would normally
>> >> >> be better to use 128-bit vectors rather than pairs of 64-bit vectors
>> >> >> for the wide elements.
>> >> >>
>> >> >> Relaxing (2) makes it possible for -msve-vector-bits=128 to produce
>> >> >> fixed-length code for SVE. It also allows unpacked/half-size SVE
>> >> >> vectors to work with -msve-vector-bits=256.
>> >> >>
>> >> >> The patch adds a new hook that targets can use to control how we
>> >> >> move from one vector mode to another. The hook takes a starting vector
>> >> >> mode, a new element mode, and (optionally) a new number of elements.
>> >> >> The flexibility needed for (1) comes in when the number of elements
>> >> >> isn't specified.
>> >> >>
>> >> >> All callers in this patch specify the number of elements, but a later
>> >> >> vectoriser patch doesn't. I won't be posting the vectoriser patch
>> >> >> for a few days, hence the RFC/A tag.
>> >> >>
>> >> >> Tested individually on aarch64-linux-gnu and as a series on
>> >> >> x86_64-linux-gnu. OK to install? Or if not yet, does the idea
>> >> >> look OK?
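(To make the interface concrete, here's a very rough model of the shape
of the hook. The names and plain integers below are made up for the
sketch, standing in for the real machine_mode/poly_uint64 types, so
treat it as an illustration of the idea rather than code from the patch:)

  #include <optional>

  /* Illustrative model only: a vector "mode" reduced to an element
     width in bits and a lane count.  */
  struct vec_mode
  {
    unsigned elt_bits;  /* element size in bits */
    unsigned nunits;    /* number of lanes */
  };

  /* One possible policy, purely for illustration: honour NUNITS when
     it is given, otherwise keep the total size of START and adjust the
     lane count to suit the new element size.  */
  std::optional<vec_mode>
  related_mode (vec_mode start, unsigned new_elt_bits, unsigned nunits)
  {
    if (nunits == 0)
      {
        unsigned total_bits = start.elt_bits * start.nunits;
        if (total_bits % new_elt_bits != 0)
          return std::nullopt;
        nunits = total_bits / new_elt_bits;
      }
    return vec_mode { new_elt_bits, nunits };
  }

  /* E.g. related_mode ({16, 4}, 32, 0) gives {32, 2} ("V4HI" -> "V2SI"),
     whereas a target might prefer {32, 4} ("V4SI") instead; that choice
     is exactly what the hook leaves to the target.  */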
>> >> >
>> >> > In isolation the idea looks good but maybe a bit limited? I see
>> >> > how it works for the same-size case but if you consider x86
>> >> > where we have SSE, AVX256 and AVX512 what would it return
>> >> > for related_vector_mode (V4SImode, SImode, 0)? Or is this
>> >> > kind of query not intended (where the component modes match
>> >> > but nunits is zero)?
>> >>
>> >> In that case we'd normally get V4SImode back. It's an allowed
>> >> combination, but not very useful.
>> >>
>> >> > How do you get from SVE fixed 128bit to NEON fixed 128bit then? Or is
>> >> > it just used to stay in the same register set for different component
>> >> > modes?
>> >>
>> >> Yeah, the idea is to use the original vector mode as essentially
>> >> a base architecture.
>> >>
>> >> The follow-on patches replace vec_info::vector_size with
>> >> vec_info::vector_mode and targetm.vectorize.autovectorize_vector_sizes
>> >> with targetm.vectorize.autovectorize_vector_modes. These are the
>> >> starting modes that would be passed to the hook in the nunits==0 case.
>> >>
>> >> E.g. for Advanced SIMD on AArch64, it would make more sense for
>> >> related_mode (V4HImode, SImode, 0) to be V4SImode rather than V2SImode.
>> >> I think things would work in a similar way for the x86_64 vector archs.
>> >>
>> >> For SVE we'd add both VNx16QImode (the SVE mode) and V16QImode (the
>> >> Advanced SIMD mode) to autovectorize_vector_modes, even though they
>> >> happen to be the same size for 128-bit SVE. We can then compare
>> >> 128-bit SVE with 128-bit Advanced SIMD, with related_mode ensuring
>> >> that we consistently use all-SVE modes or all-Advanced SIMD modes
>> >> for each attempt.
>> >>
>> >> The plan for SVE is to add 4(!) modes to autovectorize_vector_modes:
>> >>
>> >> - VNx16QImode (full vector)
>> >> - VNx8QImode (half vector)
>> >> - VNx4QImode (quarter vector)
>> >> - VNx2QImode (eighth vector)
>> >>
>> >> and then pick the one with the lowest cost. related_mode would
>> >> keep the number of units the same for nunits==0, within the limit
>> >> of the vector size. E.g.:
>> >>
>> >> - related_mode (VNx16QImode, HImode, 0) == VNx8HImode (full vector)
>> >> - related_mode (VNx8QImode, HImode, 0) == VNx8HImode (full vector)
>> >> - related_mode (VNx4QImode, HImode, 0) == VNx4HImode (half vector)
>> >> - related_mode (VNx2QImode, HImode, 0) == VNx2HImode (quarter vector)
>> >>
>> >> and:
>> >>
>> >> - related_mode (VNx16QImode, SImode, 0) == VNx4SImode (full vector)
>> >> - related_mode (VNx8QImode, SImode, 0) == VNx4SImode (full vector)
>> >> - related_mode (VNx4QImode, SImode, 0) == VNx4SImode (full vector)
>> >> - related_mode (VNx2QImode, SImode, 0) == VNx2SImode (half vector)
>> >>
>> >> So when operating on multiple element sizes, the tradeoff is between
>> >> trying to make full use of the vector size (higher base nunits) vs.
>> >> trying to remove packs and unpacks between multiple vector copies
>> >> (lower base nunits). The latter is useful because extending within
>> >> a vector is an in-lane rather than cross-lane operation and truncating
>> >> within a vector is a no-op.
>> >>
>> >> With a couple of tweaks, we seem to do a good job of guessing which
>> >> version has the lowest cost, at least for the simple cases I've tried
>> >> so far.
>> >>
>> >> Obviously there's going to be a bit of a compile-time cost
>> >> for SVE targets, but I think it's worth paying for.
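(As a way to picture the nunits==0 rule, sticking with the same toy
vec_mode model as in the earlier sketch rather than real target code,
and using 128 for the bits per SVE granule: the result keeps the
starting lane count, capped at whatever fits in a full vector.)

  /* Toy sketch of the rule described above; VECTOR_BITS is just the
     128-bit granule used in the examples, not a real GCC macro.  */
  const unsigned VECTOR_BITS = 128;

  vec_mode
  sve_related_mode (vec_mode start, unsigned new_elt_bits)
  {
    /* Keep the starting lane count, but never exceed a full vector.  */
    unsigned max_nunits = VECTOR_BITS / new_elt_bits;
    unsigned nunits = start.nunits < max_nunits ? start.nunits : max_nunits;
    return vec_mode { new_elt_bits, nunits };
  }

  /* Reading VNxNQI as N 8-bit lanes, this reproduces the tables above:
       sve_related_mode ({8, 16}, 16) -> {16, 8}   i.e. VNx8HI (full vector)
       sve_related_mode ({8, 4}, 16)  -> {16, 4}   i.e. VNx4HI (half vector)
       sve_related_mode ({8, 2}, 32)  -> {32, 2}   i.e. VNx2SI (half vector)  */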
>> >
>> > I would guess that immediate benefit could be seen with
>> > basic-block vectorization which simply fails when conversions
>> > are involved. x86_64 should now always support V4SImode
>> > and V2SImode so eventually a testcase can be crafted for that
>> > as well.
>>
>> I'd hoped so too, but the problem is that if the cost saving is good
>> enough, BB vectorisation simply stops at the conversion frontiers and
>> vectorises the rest, rather than considering other vector mode
>> combinations that might be able to do more.
>
> Sure, but when SLP build fails because it thinks it needs to unroll
> it could ask for a vector type with the same number of lanes

Do you mean for loop or BB vectorisation? For loop vectorisation the
outer loop iterating over vector sizes/modes works fine: if we need to
unroll beyond the maximum VF, we'll just retry vectorisation with a
different vector size/mode combination, with the narrowest element
having smaller vectors.

But yeah, I guess if get_vectype_for_scalar_type returns a vector with
too many units for BB vectorisation, we could try asking for a
different type with the "right" number of units, rather than failing
and falling back to the next iteration of the outer loop. I'll give
that a go.

> (that's probably what we should do from the start - get same number
> of lane vector types in BB vectorization).

It's still useful to have different numbers of lanes for both loop and
BB vectorisation. E.g. if we're applying SLP to an operation on 8 ints
and 8 shorts, it's still better to mix V4SI and V8HI for 128-bit
vector archs.

> It's when you introduce multiple sizes then the outer loop over all
> sizes comparing costs becomes somewhat obsolete... this outer
> loop should then instead compare different VFs (which also means
> possible extra unrolling beyond the vector size need?).

This is effectively what we're doing for the SVE arrangement above.
But replacing targetm.vectorize.autovectorize_vector_sizes with
targetm.vectorize.autovectorize_vector_modes means that we can also
try multiple vector archs with the same VF (e.g. 128-bit SVE vs.
128-bit Advanced SIMD).

Thanks,
Richard
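P.S. In case a concrete example helps: the sort of straight-line code
I have in mind for the mixed V4SI/V8HI case is along these lines (a
made-up snippet rather than a testcase from the series). With 128-bit
vectors the int group can become two V4SI additions and the short
group a single V8HI addition:

  void
  f (int *a, const int *b, short *c, const short *d)
  {
    /* Eight int lanes: two V4SI additions on a 128-bit target.  */
    a[0] = b[0] + 1;  a[1] = b[1] + 1;  a[2] = b[2] + 1;  a[3] = b[3] + 1;
    a[4] = b[4] + 1;  a[5] = b[5] + 1;  a[6] = b[6] + 1;  a[7] = b[7] + 1;
    /* Eight short lanes: a single V8HI addition.  */
    c[0] = d[0] + 1;  c[1] = d[1] + 1;  c[2] = d[2] + 1;  c[3] = d[3] + 1;
    c[4] = d[4] + 1;  c[5] = d[5] + 1;  c[6] = d[6] + 1;  c[7] = d[7] + 1;
  }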