From: Richard Sandiford
To: Prathamesh Kulkarni
Cc: gcc Patches
Subject: Re: [RFC] [v2] Extend fold_vec_perm to handle VLA vectors
Date: Thu, 03 Aug 2023 14:16:02 +0100
In-Reply-To: (Richard Sandiford's message of "Thu, 03 Aug 2023 14:01:36 +0100")

Richard Sandiford writes:
> Prathamesh Kulkarni writes:
>> On Tue, 25 Jul 2023 at 18:25, Richard Sandiford
>> wrote:
>>>
>>> Hi,
>>>
>>> Thanks for the rework and sorry for the slow review.
>> Hi Richard,
>> Thanks for the suggestions! Please find my responses inline below.
>>>
>>> Prathamesh Kulkarni writes:
>>> > Hi Richard,
>>> > This is a reworking of the patch to extend fold_vec_perm to handle
>>> > VLA vectors.
>>> > The attached patch unifies handling of VLS and VLA vector_csts, while
>>> > using fallback code for ctors.
>>> >
>>> > For a VLS vector, the patch ignores the underlying encoding, and
>>> > uses npatterns = nelts and nelts_per_pattern = 1.
>>> >
>>> > For VLA patterns, if sel has a stepped sequence, then it
>>> > only chooses elements from a particular pattern of a particular
>>> > input vector.
>>> >
>>> > To make things simpler, the patch imposes the following constraints:
>>> > (a) op0_npatterns, op1_npatterns and sel_npatterns are powers of 2.
>>> > (b) The step size for a stepped sequence is a power of 2, and a
>>> > multiple of the npatterns of the chosen input vector.
>>> > (c) The runtime vector length of sel is a multiple of sel_npatterns.
>>> > So, we don't handle sel.length = 2 + 2x and npatterns = 4.
>>> >
>>> > Eg:
>>> > op0, op1: npatterns = 2, nelts_per_pattern = 3
>>> > op0_len = op1_len = 16 + 16x.
>>> > sel = { 0, 0, 2, 0, 4, 0, ... }
>>> > npatterns = 2, nelts_per_pattern = 3.
>>> >
>>> > For pattern {0, 2, 4, ...}
>>> > Let,
>>> > a1 = 2
>>> > S = step size = 2
>>> >
>>> > Let Esel denote the number of elements per pattern in sel at runtime.
>>> > Esel = (16 + 16x) / npatterns_sel
>>> >      = (16 + 16x) / 2
>>> >      = (8 + 8x)
>>> >
>>> > So, the last element of the pattern is:
>>> > ae = a1 + (Esel - 2) * S
>>> >    = 2 + (8 + 8x - 2) * 2
>>> >    = 14 + 16x
>>> >
>>> > a1 /trunc arg0_len = 2 / (16 + 16x) = 0
>>> > ae /trunc arg0_len = (14 + 16x) / (16 + 16x) = 0
>>> > Since both are equal with quotient = 0, we select elements from op0.
>>> >
>>> > Since the step size (S) is a multiple of npatterns (op0), we select
>>> > all elements from the same pattern of op0.
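As a sanity check, the arithmetic in the quoted example can be replayed in plain Python by evaluating each poly_int "c0 + c1*x" at sample runtime values of x. This is only an illustrative model of the reasoning above, not GCC code; the helper `ev` is made up for this sketch:

```python
# Toy model: a poly_int "c0 + c1*x" is a pair (c0, c1), evaluated at
# sample runtime values of the indeterminate x.

def ev(p, x):
    """Evaluate c0 + c1*x at a concrete runtime value of x."""
    c0, c1 = p
    return c0 + c1 * x

arg0_len = (16, 16)        # op0_len = op1_len = 16 + 16x
sel_npatterns = 2
a1, step = (2, 0), 2       # pattern { 0, 2, 4, ... }: a1 = 2, S = 2

for x in range(8):
    length = ev(arg0_len, x)
    esel = length // sel_npatterns        # Esel = 8 + 8x
    ae = ev(a1, x) + (esel - 2) * step    # ae = a1 + (Esel - 2) * S
    assert esel == ev((8, 8), x)
    assert ae == ev((14, 16), x)          # ae = 14 + 16x, as derived above
    # Both endpoints of the pattern divide by the vector length with
    # quotient 0, so the whole stepped pattern stays inside op0.
    assert ev(a1, x) // length == 0
    assert ae // length == 0
```

Here the division check is done by brute-force sampling rather than by GCC's compile-time poly_int test, but it confirms that the pattern selects from op0 for every sampled length.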
>>> >
>>> > res_npatterns = max (op0_npatterns, max (op1_npatterns, sel_npatterns))
>>> >               = max (2, max (2, 2))
>>> >               = 2
>>> >
>>> > res_nelts_per_pattern = max (op0_nelts_per_pattern,
>>> >                              max (op1_nelts_per_pattern,
>>> >                                   sel_nelts_per_pattern))
>>> >                       = max (3, max (3, 3))
>>> >                       = 3
>>> >
>>> > So res has encoding with npatterns = 2, nelts_per_pattern = 3.
>>> > res: { op0[0], op0[0], op0[2], op0[0], op0[4], op0[0], ... }
>>> >
>>> > Unfortunately, this results in an issue for a poly_int_cst index:
>>> > For example,
>>> > op0, op1: npatterns = 1, nelts_per_pattern = 3
>>> > op0_len = op1_len = 4 + 4x
>>> >
>>> > sel: { 4 + 4x, 5 + 4x, 6 + 4x, ... } // should choose op1
>>> >
>>> > In this case,
>>> > a1 = 5 + 4x
>>> > S = (6 + 4x) - (5 + 4x) = 1
>>> > Esel = 4 + 4x
>>> >
>>> > ae = a1 + (Esel - 2) * S
>>> >    = (5 + 4x) + (4 + 4x - 2) * 1
>>> >    = 7 + 8x
>>> >
>>> > IIUC, 7 + 8x will always be the index of the last element of op1?
>>> > if x = 0, len = 4, 7 + 8x = 7
>>> > if x = 1, len = 8, 7 + 8x = 15, etc.
>>> > So the stepped sequence will always choose elements
>>> > from op1 regardless of vector length for the above case?
>>> >
>>> > However,
>>> > ae /trunc op0_len
>>> > = (7 + 8x) / (4 + 4x)
>>> > is not defined, because 7/4 != 8/4,
>>> > and we return NULL_TREE, but I suppose the expected result would be:
>>> > res: { op1[0], op1[1], op1[2], ... } ?
>>> >
>>> > The patch passes bootstrap+test on aarch64-linux-gnu with and without SVE,
>>> > and on x86_64-unknown-linux-gnu.
>>> > I would be grateful for suggestions on how to proceed.
>>> >
>>> > Thanks,
>>> > Prathamesh
>>> >
>>> > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
>>> > index a02ede79fed..8028b3e8e9a 100644
>>> > --- a/gcc/fold-const.cc
>>> > +++ b/gcc/fold-const.cc
>>> > @@ -85,6 +85,10 @@ along with GCC; see the file COPYING3.  If not see
>>> > #include "vec-perm-indices.h"
>>> > #include "asan.h"
>>> > #include "gimple-range.h"
>>> > +#include 
>>> > +#include "tree-pretty-print.h"
>>> > +#include "gimple-pretty-print.h"
>>> > +#include "print-tree.h"
>>> >
>>> > /* Nonzero if we are folding constants inside an initializer or a C++
>>> > manifestly-constant-evaluated context; zero otherwise.
>>> > @@ -10493,15 +10497,9 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
>>> > static bool
>>> > vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
>>> > {
>>> > -  unsigned HOST_WIDE_INT i, nunits;
>>> > +  unsigned HOST_WIDE_INT i;
>>> >
>>> > -  if (TREE_CODE (arg) == VECTOR_CST
>>> > -      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
>>> > -    {
>>> > -      for (i = 0; i < nunits; ++i)
>>> > -	elts[i] = VECTOR_CST_ELT (arg, i);
>>> > -    }
>>> > -  else if (TREE_CODE (arg) == CONSTRUCTOR)
>>> > +  if (TREE_CODE (arg) == CONSTRUCTOR)
>>> >     {
>>> >       constructor_elt *elt;
>>> >
>>> > @@ -10519,6 +10517,230 @@ vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
>>> >    return true;
>>> > }
>>> >
>>> > +/* Return a vector with (NPATTERNS, NELTS_PER_PATTERN) encoding.  */
>>> > +
>>> > +static tree
>>> > +vector_cst_reshape (tree vec, unsigned npatterns, unsigned nelts_per_pattern)
>>> > +{
>>> > +  gcc_assert (pow2p_hwi (npatterns));
>>> > +
>>> > +  if (VECTOR_CST_NPATTERNS (vec) == npatterns
>>> > +      && VECTOR_CST_NELTS_PER_PATTERN (vec) == nelts_per_pattern)
>>> > +    return vec;
>>> > +
>>> > +  tree v = make_vector (exact_log2 (npatterns), nelts_per_pattern);
>>> > +  TREE_TYPE (v) = TREE_TYPE (vec);
>>> > +
>>> > +  unsigned nelts = npatterns * nelts_per_pattern;
>>> > +  for (unsigned i = 0; i < nelts; i++)
>>> > +    VECTOR_CST_ENCODED_ELT (v, i) = vector_cst_elt (vec, i);
>>> > +  return v;
>>> > +}
>>> > +
>>> > +/* Helper routine for fold_vec_perm_vla to check if ARG is a suitable
>>> > +   operand for VLA vec_perm folding.
If ARG is VLS, then set
>>> > +   NPATTERNS = nelts and NELTS_PER_PATTERN = 1.  */
>>> > +
>>> > +static tree
>>> > +valid_operand_for_fold_vec_perm_cst_p (tree arg)
>>> > +{
>>> > +  if (TREE_CODE (arg) != VECTOR_CST)
>>> > +    return NULL_TREE;
>>> > +
>>> > +  unsigned HOST_WIDE_INT nelts;
>>> > +  unsigned npatterns, nelts_per_pattern;
>>> > +  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg)).is_constant (&nelts))
>>> > +    {
>>> > +      npatterns = nelts;
>>> > +      nelts_per_pattern = 1;
>>> > +    }
>>> > +  else
>>> > +    {
>>> > +      npatterns = VECTOR_CST_NPATTERNS (arg);
>>> > +      nelts_per_pattern = VECTOR_CST_NELTS_PER_PATTERN (arg);
>>> > +    }
>>> > +
>>> > +  if (!pow2p_hwi (npatterns))
>>> > +    return NULL_TREE;
>>> > +
>>> > +  return vector_cst_reshape (arg, npatterns, nelts_per_pattern);
>>> > +}
>>>
>>> I don't think we should reshape the vectors for VLS, since it would
>>> create more nodes for GC to clean up later.  Also, the "compact" encoding
>>> is canonical even for VLS, so the reshaping would effectively create
>>> noncanonical constants (even if only temporarily).
>> Removed in the attached patch.
>>>
>>> Instead, I think we should change the later:
>>>
>>> > +  if (!valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, sel_npatterns,
>>> > +					    sel_nelts_per_pattern, reason, verbose))
>>> > +    return NULL_TREE;
>>>
>>> so that it comes after the computation of res_npatterns and
>>> res_nelts_per_pattern.  Then, if valid_mask_for_fold_vec_perm_cst_p
>>> returns false, and if the result type has a constant number of elements,
>>> we should:
>>>
>>> * set res_npatterns to that number of elements
>>> * set res_nelts_per_pattern to 1
>>> * continue instead of returning null
>> Assuming we don't enforce only VLA or only VLS for the input vectors and sel,
>> won't that still be an issue if res (and sel) is VLS, and the input
>> vectors are VLA?
>> For eg:
>> arg0, arg1 are of type VNx4SI with npatterns = 2, nelts_per_pattern = 3, step = 2
>> sel is a V4SI constant with encoding { 0, 2, 4, ... }
>> and res_type is V4SI.
>> In this case, when it comes to index 4, the vector selection becomes ambiguous,
>> since it can be arg1 for len = 4 + 4x, and arg0 for lengths > 4 + 4x?
>
> Ah, right.  So the condition is whether the result type and the data
> input type have a constant number of elements, rather than just the result.

Actually, I take that back.  The reason:

>>> The loop that follows will then do the correct thing for each element.

is true is that:

+      if (!can_div_trunc_p (sel[i], len, &q, &r))
+	{
+	  if (reason)
+	    strcpy (reason, "cannot divide selector element by arg len");
+	  return NULL_TREE;
+	}

will return false if q isn't computable at compile time (that is, if we
can't decide at compile time which input the element comes from).
So I think checking the result is enough.

Thanks,
Richard
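For illustration, the compile-time quotient question discussed above can be modelled in a few lines of Python. This sketch samples runtime values of x rather than reproducing GCC's actual poly-int.h logic, and the helper `quotients` is made up for the sketch:

```python
# Toy model (not GCC's poly-int.h): a poly_int "c0 + c1*x" is a pair
# (c0, c1).  A selector element sel[i] / len has a usable compile-time
# quotient only if every runtime value of x gives the same quotient.

def quotients(a, b, samples=32):
    """Set of quotients of (a0 + a1*x) // (b0 + b1*x) over sampled x."""
    return {(a[0] + a[1] * x) // (b[0] + b[1] * x) for x in range(samples)}

length = (4, 4)                       # arg0_len = arg1_len = 4 + 4x

# sel element 4 + 4x: the quotient is 1 for every sampled x, so the
# element provably comes from op1.
assert quotients((4, 4), length) == {1}

# sel element 7 + 8x: the quotient is also 1 for every sampled x
# (7 + 8x always lies in [4 + 4x, 8 + 8x)) ...
assert quotients((7, 8), length) == {1}

# ... but the per-coefficient comparison quoted in the thread
# (7/4 != 8/4) cannot prove a single quotient, which is why the
# folding code bails out with NULL_TREE for this case.
assert 7 // 4 != 8 // 4
```

So the sampling model agrees with Prathamesh's observation that the 7 + 8x case is well-defined mathematically, while the per-coefficient test described in the quoted code is what rejects it.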