From: Richard Sandiford
To: Prathamesh Kulkarni
Cc: gcc Patches
Subject: Re: [RFC] [v2] Extend fold_vec_perm to handle VLA vectors
Date: Thu, 03 Aug 2023 14:16:02 +0100
In-Reply-To: (Richard Sandiford's message of "Thu, 03 Aug 2023 14:01:36 +0100")

Richard Sandiford writes:
> Prathamesh Kulkarni writes:
>> On Tue, 25 Jul 2023 at 18:25, Richard Sandiford
>> wrote:
>>>
>>> Hi,
>>>
>>> Thanks for the rework and sorry for the slow review.
>> Hi Richard,
>> Thanks for the suggestions! Please find my responses inline below.
>>>
>>> Prathamesh Kulkarni writes:
>>> > Hi Richard,
>>> > This is a reworking of the patch to extend fold_vec_perm to handle
>>> > VLA vectors.
>>> > The attached patch unifies handling of VLS and VLA vector_csts, while
>>> > using fallback code for ctors.
>>> >
>>> > For a VLS vector, the patch ignores the underlying encoding, and
>>> > uses npatterns = nelts and nelts_per_pattern = 1.
>>> >
>>> > For VLA patterns, if sel has a stepped sequence, then it
>>> > only chooses elements from a particular pattern of a particular
>>> > input vector.
>>> >
>>> > To make things simpler, the patch imposes the following constraints:
>>> > (a) op0_npatterns, op1_npatterns and sel_npatterns are powers of 2.
>>> > (b) The step size for a stepped sequence is a power of 2, and a
>>> > multiple of the npatterns of the chosen input vector.
>>> > (c) The runtime vector length of sel is a multiple of sel_npatterns.
>>> > So, we don't handle sel.length = 2 + 2x and npatterns = 4.
>>> >
>>> > Eg:
>>> > op0, op1: npatterns = 2, nelts_per_pattern = 3
>>> > op0_len = op1_len = 16 + 16x.
>>> > sel = { 0, 0, 2, 0, 4, 0, ... }
>>> > npatterns = 2, nelts_per_pattern = 3.
>>> >
>>> > For pattern {0, 2, 4, ...}
>>> > Let,
>>> > a1 = 2
>>> > S = step size = 2
>>> >
>>> > Let Esel denote the number of elements per pattern in sel at runtime.
>>> > Esel = (16 + 16x) / npatterns_sel
>>> >      = (16 + 16x) / 2
>>> >      = (8 + 8x)
>>> >
>>> > So, the last element of the pattern is:
>>> > ae = a1 + (Esel - 2) * S
>>> >    = 2 + (8 + 8x - 2) * 2
>>> >    = 14 + 16x
>>> >
>>> > a1 /trunc arg0_len = 2 / (16 + 16x) = 0
>>> > ae /trunc arg0_len = (14 + 16x) / (16 + 16x) = 0
>>> > Since both are equal with quotient = 0, we select elements from op0.
>>> >
>>> > Since the step size (S) is a multiple of npatterns (op0), we select
>>> > all elements from the same pattern of op0.
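As a sanity check, the arithmetic in the quoted example can be replayed in plain Python by evaluating each poly_int "c0 + c1*x" at sample runtime values of x. This is only an illustrative model of the reasoning above, not GCC code; the helper `ev` is made up for this sketch:

```python
# Toy model: a poly_int "c0 + c1*x" is a pair (c0, c1), evaluated at
# sample runtime values of the indeterminate x.

def ev(p, x):
    """Evaluate c0 + c1*x at a concrete runtime value of x."""
    c0, c1 = p
    return c0 + c1 * x

arg0_len = (16, 16)        # op0_len = op1_len = 16 + 16x
sel_npatterns = 2
a1, step = (2, 0), 2       # pattern { 0, 2, 4, ... }: a1 = 2, S = 2

for x in range(8):
    length = ev(arg0_len, x)
    esel = length // sel_npatterns        # Esel = 8 + 8x
    ae = ev(a1, x) + (esel - 2) * step    # ae = a1 + (Esel - 2) * S
    assert esel == ev((8, 8), x)
    assert ae == ev((14, 16), x)          # ae = 14 + 16x, as derived above
    # Both endpoints of the pattern divide by the vector length with
    # quotient 0, so the whole stepped pattern stays inside op0.
    assert ev(a1, x) // length == 0
    assert ae // length == 0
```

Here the division check is done by brute-force sampling rather than by GCC's compile-time poly_int test, but it confirms that the pattern selects from op0 for every sampled length.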
>>> >
>>> > res_npatterns = max (op0_npatterns, max (op1_npatterns, sel_npatterns))
>>> >               = max (2, max (2, 2))
>>> >               = 2
>>> >
>>> > res_nelts_per_pattern = max (op0_nelts_per_pattern,
>>> >                              max (op1_nelts_per_pattern,
>>> >                                   sel_nelts_per_pattern))
>>> >                       = max (3, max (3, 3))
>>> >                       = 3
>>> >
>>> > So res has encoding with npatterns = 2, nelts_per_pattern = 3.
>>> > res: { op0[0], op0[0], op0[2], op0[0], op0[4], op0[0], ... }
>>> >
>>> > Unfortunately, this results in an issue for a poly_int_cst index:
>>> > For example,
>>> > op0, op1: npatterns = 1, nelts_per_pattern = 3
>>> > op0_len = op1_len = 4 + 4x
>>> >
>>> > sel: { 4 + 4x, 5 + 4x, 6 + 4x, ... } // should choose op1
>>> >
>>> > In this case,
>>> > a1 = 5 + 4x
>>> > S = (6 + 4x) - (5 + 4x) = 1
>>> > Esel = 4 + 4x
>>> >
>>> > ae = a1 + (Esel - 2) * S
>>> >    = (5 + 4x) + (4 + 4x - 2) * 1
>>> >    = 7 + 8x
>>> >
>>> > IIUC, 7 + 8x will always be the index of the last element of op1?
>>> > if x = 0, len = 4, 7 + 8x = 7
>>> > if x = 1, len = 8, 7 + 8x = 15, etc.
>>> > So the stepped sequence will always choose elements
>>> > from op1 regardless of vector length for the above case?
>>> >
>>> > However,
>>> > ae /trunc op0_len
>>> > = (7 + 8x) / (4 + 4x)
>>> > is not defined, because 7/4 != 8/4,
>>> > and we return NULL_TREE, but I suppose the expected result would be:
>>> > res: { op1[0], op1[1], op1[2], ... } ?
>>> >
>>> > The patch passes bootstrap+test on aarch64-linux-gnu with and without SVE,
>>> > and on x86_64-unknown-linux-gnu.
>>> > I would be grateful for suggestions on how to proceed.
>>> >
>>> > Thanks,
>>> > Prathamesh
>>> >
>>> > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
>>> > index a02ede79fed..8028b3e8e9a 100644
>>> > --- a/gcc/fold-const.cc
>>> > +++ b/gcc/fold-const.cc
>>> > @@ -85,6 +85,10 @@ along with GCC; see the file COPYING3.  If not see
>>> > #include "vec-perm-indices.h"
>>> > #include "asan.h"
>>> > #include "gimple-range.h"
>>> > +#include 
>>> > +#include "tree-pretty-print.h"
>>> > +#include "gimple-pretty-print.h"
>>> > +#include "print-tree.h"
>>> >
>>> > /* Nonzero if we are folding constants inside an initializer or a C++
>>> > manifestly-constant-evaluated context; zero otherwise.
>>> > @@ -10493,15 +10497,9 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
>>> > static bool
>>> > vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
>>> > {
>>> > -  unsigned HOST_WIDE_INT i, nunits;
>>> > +  unsigned HOST_WIDE_INT i;
>>> >
>>> > -  if (TREE_CODE (arg) == VECTOR_CST
>>> > -      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
>>> > -    {
>>> > -      for (i = 0; i < nunits; ++i)
>>> > -	elts[i] = VECTOR_CST_ELT (arg, i);
>>> > -    }
>>> > -  else if (TREE_CODE (arg) == CONSTRUCTOR)
>>> > +  if (TREE_CODE (arg) == CONSTRUCTOR)
>>> >     {
>>> >       constructor_elt *elt;
>>> >
>>> > @@ -10519,6 +10517,230 @@ vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
>>> >    return true;
>>> > }
>>> >
>>> > +/* Return a vector with (NPATTERNS, NELTS_PER_PATTERN) encoding.  */
>>> > +
>>> > +static tree
>>> > +vector_cst_reshape (tree vec, unsigned npatterns, unsigned nelts_per_pattern)
>>> > +{
>>> > +  gcc_assert (pow2p_hwi (npatterns));
>>> > +
>>> > +  if (VECTOR_CST_NPATTERNS (vec) == npatterns
>>> > +      && VECTOR_CST_NELTS_PER_PATTERN (vec) == nelts_per_pattern)
>>> > +    return vec;
>>> > +
>>> > +  tree v = make_vector (exact_log2 (npatterns), nelts_per_pattern);
>>> > +  TREE_TYPE (v) = TREE_TYPE (vec);
>>> > +
>>> > +  unsigned nelts = npatterns * nelts_per_pattern;
>>> > +  for (unsigned i = 0; i < nelts; i++)
>>> > +    VECTOR_CST_ENCODED_ELT (v, i) = vector_cst_elt (vec, i);
>>> > +  return v;
>>> > +}
>>> > +
>>> > +/* Helper routine for fold_vec_perm_vla to check if ARG is a suitable
>>> > +   operand for VLA vec_perm folding.
If ARG is VLS, then set
>>> > +   NPATTERNS = nelts and NELTS_PER_PATTERN = 1.  */
>>> > +
>>> > +static tree
>>> > +valid_operand_for_fold_vec_perm_cst_p (tree arg)
>>> > +{
>>> > +  if (TREE_CODE (arg) != VECTOR_CST)
>>> > +    return NULL_TREE;
>>> > +
>>> > +  unsigned HOST_WIDE_INT nelts;
>>> > +  unsigned npatterns, nelts_per_pattern;
>>> > +  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg)).is_constant (&nelts))
>>> > +    {
>>> > +      npatterns = nelts;
>>> > +      nelts_per_pattern = 1;
>>> > +    }
>>> > +  else
>>> > +    {
>>> > +      npatterns = VECTOR_CST_NPATTERNS (arg);
>>> > +      nelts_per_pattern = VECTOR_CST_NELTS_PER_PATTERN (arg);
>>> > +    }
>>> > +
>>> > +  if (!pow2p_hwi (npatterns))
>>> > +    return NULL_TREE;
>>> > +
>>> > +  return vector_cst_reshape (arg, npatterns, nelts_per_pattern);
>>> > +}
>>>
>>> I don't think we should reshape the vectors for VLS, since it would
>>> create more nodes for GC to clean up later.  Also, the "compact" encoding
>>> is canonical even for VLS, so the reshaping would effectively create
>>> noncanonical constants (even if only temporarily).
>> Removed in the attached patch.
>>>
>>> Instead, I think we should change the later:
>>>
>>> > +  if (!valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, sel_npatterns,
>>> > +					    sel_nelts_per_pattern, reason, verbose))
>>> > +    return NULL_TREE;
>>>
>>> so that it comes after the computation of res_npatterns and
>>> res_nelts_per_pattern.  Then, if valid_mask_for_fold_vec_perm_cst_p
>>> returns false, and if the result type has a constant number of elements,
>>> we should:
>>>
>>> * set res_npatterns to that number of elements
>>> * set res_nelts_per_pattern to 1
>>> * continue instead of returning null
>> Assuming we don't enforce only VLA or only VLS for the input vectors and sel,
>> won't that still be an issue if res (and sel) is VLS, and the input
>> vectors are VLA?
>> For eg:
>> arg0, arg1 are of type VNx4SI with npatterns = 2, nelts_per_pattern = 3, step = 2
>> sel is a V4SI constant with encoding { 0, 2, 4, ... }
>> and res_type is V4SI.
>> In this case, when it comes to index 4, the vector selection becomes ambiguous,
>> since it can be arg1 for len = 4 + 4x, and arg0 for lengths > 4 + 4x?
>
> Ah, right.  So the condition is whether the result type and the data
> input type have a constant number of elements, rather than just the result.

Actually, I take that back.  The reason:

>>> The loop that follows will then do the correct thing for each element.

is true is that:

+      if (!can_div_trunc_p (sel[i], len, &q, &r))
+	{
+	  if (reason)
+	    strcpy (reason, "cannot divide selector element by arg len");
+	  return NULL_TREE;
+	}

will return false if q isn't computable at compile time (that is, if we
can't decide at compile time which input the element comes from).
So I think checking the result is enough.

Thanks,
Richard
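For illustration, the compile-time quotient question discussed above can be modelled in a few lines of Python. This sketch samples runtime values of x rather than reproducing GCC's actual poly-int.h logic, and the helper `quotients` is made up for the sketch:

```python
# Toy model (not GCC's poly-int.h): a poly_int "c0 + c1*x" is a pair
# (c0, c1).  A selector element sel[i] / len has a usable compile-time
# quotient only if every runtime value of x gives the same quotient.

def quotients(a, b, samples=32):
    """Set of quotients of (a0 + a1*x) // (b0 + b1*x) over sampled x."""
    return {(a[0] + a[1] * x) // (b[0] + b[1] * x) for x in range(samples)}

length = (4, 4)                       # arg0_len = arg1_len = 4 + 4x

# sel element 4 + 4x: the quotient is 1 for every sampled x, so the
# element provably comes from op1.
assert quotients((4, 4), length) == {1}

# sel element 7 + 8x: the quotient is also 1 for every sampled x
# (7 + 8x always lies in [4 + 4x, 8 + 8x)) ...
assert quotients((7, 8), length) == {1}

# ... but the per-coefficient comparison quoted in the thread
# (7/4 != 8/4) cannot prove a single quotient, which is why the
# folding code bails out with NULL_TREE for this case.
assert 7 // 4 != 8 // 4
```

So the sampling model agrees with Prathamesh's observation that the 7 + 8x case is well-defined mathematically, while the per-coefficient test described in the quoted code is what rejects it.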