From: Richard Sandiford <richard.sandiford@arm.com>
To: Prathamesh Kulkarni
Cc: gcc Patches
Subject: Re: [RFC] [v2] Extend fold_vec_perm to handle VLA vectors
Date: Tue, 08 Aug 2023 10:57:03 +0100

Prathamesh Kulkarni writes:
> On Fri, 4 Aug 2023 at 20:36, Richard Sandiford wrote:
>>
>> Full review this time, sorry for skipping the tests earlier.
> Thanks for the detailed review! Please find my responses inline below.
>>
>> Prathamesh Kulkarni writes:
>> > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
>> > index 7e5494dfd39..680d0e54fd4 100644
>> > --- a/gcc/fold-const.cc
>> > +++ b/gcc/fold-const.cc
>> > @@ -85,6 +85,10 @@ along with GCC; see the file COPYING3.  If not see
>> >  #include "vec-perm-indices.h"
>> >  #include "asan.h"
>> >  #include "gimple-range.h"
>> > +#include <algorithm>
>>
>> This should be included by defining INCLUDE_ALGORITHM instead.
> Done. Just curious, why do we use this macro instead of directly
> including <algorithm>?

AIUI, one of the reasons for having every file start with includes of
config.h and (b)system.h, in that order, is to ensure that a small and
predictable amount of GCC-specific stuff happens before including the
system header files.  That helps to avoid OS-specific clashes between
GCC code and system headers.

But another major reason is that system.h ends by poisoning a lot of
stuff that system headers would be entitled to use.
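
The convention looks roughly like this (a minimal, illustrative sketch;
real files have longer include lists):

  /* At the top of a GCC source file that wants std::sort etc.,
     before any other include.  */
  #define INCLUDE_ALGORITHM
  #include "config.h"
  #include "system.h"   /* Includes <algorithm> because of the macro,
			   then poisons identifiers GCC code must not use.  */
  #include "coretypes.h"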
>> > +  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
>> > +
>> > +  // Fill a0 for each pattern
>> > +  for (unsigned i = 0; i < npatterns; i++)
>> > +    builder.quick_push (build_int_cst (inner_type, rand () % 100));
>> > +
>> > +  if (nelts_per_pattern == 1)
>> > +    return builder.build ();
>> > +
>> > +  // Fill a1 for each pattern
>> > +  for (unsigned i = 0; i < npatterns; i++)
>> > +    builder.quick_push (build_int_cst (inner_type, rand () % 100));
>> > +
>> > +  if (nelts_per_pattern == 2)
>> > +    return builder.build ();
>> > +
>> > +  for (unsigned i = npatterns * 2; i < npatterns * nelts_per_pattern; i++)
>> > +    {
>> > +      tree prev_elem = builder[i - npatterns];
>> > +      int prev_elem_val = TREE_INT_CST_LOW (prev_elem);
>> > +      int val = prev_elem_val + S;
>> > +      builder.quick_push (build_int_cst (inner_type, val));
>> > +    }
>> > +
>> > +  return builder.build ();
>> > +}
>> > +
>> > +static void
>> > +validate_res (unsigned npatterns, unsigned nelts_per_pattern,
>> > +              tree res, tree *expected_res)
>> > +{
>> > +  ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == npatterns);
>> > +  ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == nelts_per_pattern);
>>
>> I don't think this is safe when the inputs are randomised.  E.g. we
>> could by chance end up with a vector of all zeros, which would have
>> a single pattern and a single element per pattern, regardless of the
>> shapes of the inputs.
>>
>> Given the way that vector_builder::finalize
>> canonicalises the encoding, it should be safe to use:
>>
>> * VECTOR_CST_NPATTERNS (res) <= npatterns
>> * vector_cst_encoded_nelts (res) <= npatterns * nelts_per_pattern
>>
>> If we do that then...
>>
>> > +
>> > +  for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++)
>>
>> ...this loop bound should be npatterns * nelts_per_pattern instead.
> Ah indeed. Fixed, thanks.

The patch instead does:

  ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) <= npatterns);
  ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) <= nelts_per_pattern);

I think the version I suggested is safer.  It's not the goal of the
canonicalisation algorithm to reduce both npatterns and
nelts_per_pattern individually.  The algorithm can increase
nelts_per_pattern in order to decrease npatterns.
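
In other words, roughly the following (just a sketch; the element
comparison via operand_equal_p is an assumption about the rest of
validate_res, not something fixed by the suggestion itself):

  ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) <= npatterns);
  ASSERT_TRUE (vector_cst_encoded_nelts (res)
	       <= npatterns * nelts_per_pattern);
  /* Compare the encoded elements against the expected values.  */
  for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++)
    ASSERT_TRUE (operand_equal_p (vector_cst_elt (res, i),
				  expected_res[i], 0));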
>> > +  {
>> > +    tree arg0 = build_vec_cst_rand (integer_type_node, 1, 3, 2);
>> > +    tree arg1 = build_vec_cst_rand (integer_type_node, 1, 3, 2);
>> > +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
>> > +
>> > +    vec_perm_builder builder (arg0_len, 1, 3);
>> > +    builder.quick_push (arg0_len);
>> > +    builder.quick_push (arg0_len + 1);
>> > +    builder.quick_push (arg0_len + 2);
>> > +
>> > +    vec_perm_indices sel (builder, 2, arg0_len);
>> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, NULL, true);
>> > +    tree expected_res[] = { vector_cst_elt (arg1, 0), vector_cst_elt (arg1, 1),
>> > +			    vector_cst_elt (arg1, 2) };
>> > +    validate_res (1, 3, res, expected_res);
>> > +  }
>> > +
>> > +  /* Case 3: Leading element of arg1, stepped sequence: pattern 0 of arg0.
>> > +     sel = {len, 0, 0, 0, 2, 0, ...}
>> > +     npatterns = 2, nelts_per_pattern = 3.
>> > +     Use extra pattern {0, ...} to lower number of elements per pattern.  */
>> > +  {
>> > +    tree arg0 = build_vec_cst_rand (char_type_node, 1, 3, 2);
>> > +    tree arg1 = build_vec_cst_rand (char_type_node, 1, 3, 2);
>> > +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
>> > +
>> > +    vec_perm_builder builder (arg0_len, 2, 3);
>> > +    builder.quick_push (arg0_len);
>> > +    int mask_elems[] = { 0, 0, 0, 2, 0 };
>> > +    for (int i = 0; i < 5; i++)
>> > +      builder.quick_push (mask_elems[i]);
>>
>> This leaves one of the elements unspecified.
> Sorry, I didn't understand.
> It first pushes len in:
>   builder.quick_push (arg0_len)
> and then pushes the remaining indices in the loop:
>   for (int i = 0; i < 5; i++)
>     builder.quick_push (mask_elems[i])
> So overall, builder will have 6 elements: {len, 0, 0, 0, 2, 0}

Ah, right.  But in that case I think it would be clearer to put
arg0_len in mask_elems.

> +/* Try to fold permutation of ARG0 and ARG1 with SEL selector when
> +   the input vectors are VECTOR_CST.  Return NULL_TREE otherwise.
> +   REASON has same purpose as described in
> +   valid_mask_for_fold_vec_perm_cst_p.  */
> +
> +

Nit: too much vertical space.

> +/* Invoke tests for fold_vec_perm_cst.  */
> +
> +static void
> +test ()
> +{
> +  /* Conditionally execute fold_vec_perm_cst tests, if target supports
> +     VLA vectors.  Use a compile time check so we avoid instantiating
> +     poly_uint64 with N > 1 on targets that do not support VLA vectors.  */
> +  if constexpr (poly_int_traits<poly_uint64>::num_coeffs > 1)

That's a C++17 feature, so we can't use it.  Instead, how about:

  static bool
  is_simple_vla_size (poly_uint64 size)
  {
    if (size.is_constant ())
      return false;
    for (int i = 1; i < ARRAY_SIZE (size.coeffs); ++i)
      if (size[i] != (i <= 1 ? size[0] : 0))
	return false;
    return true;
  }

  FOR_EACH_MODE_IN_CLASS (mode, MODE_VECTOR_INT)
    {
      auto nunits = GET_MODE_NUNITS (mode);
      if (!is_simple_vla_size (nunits))
	continue;
      if (nunits[0] ...)
	test_... (mode);
      ...
    }

test_vnx4si_v4si and test_v4si_vnx4si look good.  But with the loop
structure above, I think we can apply the test_vnx4si and test_vnx16qi
to more cases.  So the classification isn't the exact number of
elements, but instead a limit (see the rough sketch at the end of this
mail).

I think the nunits[0] conditions for test_vnx4si are as follows
(inspection only, so could be wrong):

> +/* Test cases where result and input vectors are VNx4SI  */
> +
> +static void
> +test_vnx4si (machine_mode vmode)
> +{
> +  /* Case 1: mask = {0, ...} */
> +  {
> +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (len, 1, 1);
> +    builder.quick_push (0);
> +    vec_perm_indices sel (builder, 2, len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +    tree expected_res[] = { vector_cst_elt (res, 0) };
> +    validate_res (1, 1, res, expected_res);
> +  }

nunits[0] >= 2 (could be all nunits if the inputs had
nelts_per_pattern==1, which I think would be better)

> +
> +  /* Case 2: mask = {len, ...} */
> +  {
> +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (len, 1, 1);
> +    builder.quick_push (len);
> +    vec_perm_indices sel (builder, 2, len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +    tree expected_res[] = { vector_cst_elt (arg1, 0) };
> +    validate_res (1, 1, res, expected_res);
> +  }

same

> +
> +  /* Case 3: mask = { 0, len, ... } */
> +  {
> +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (len, 2, 1);
> +    builder.quick_push (0);
> +    builder.quick_push (len);
> +    vec_perm_indices sel (builder, 2, len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0) };
> +    validate_res (2, 1, res, expected_res);
> +  }

nunits[0] >= 2

> +
> +  /* Case 4: mask = { 0, len, 1, len+1, ... } */
> +  {
> +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (len, 2, 2);
> +    builder.quick_push (0);
> +    builder.quick_push (len);
> +    builder.quick_push (1);
> +    builder.quick_push (len + 1);
> +    vec_perm_indices sel (builder, 2, len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
> +			    vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
> +			  };
> +    validate_res (2, 2, res, expected_res);
> +  }

nunits[0] >= 2

> +
> +  /* Case 5: mask = { 0, len, 1, len+1, .... }
> +     npatterns = 4, nelts_per_pattern = 1  */
> +  {
> +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (len, 4, 1);
> +    builder.quick_push (0);
> +    builder.quick_push (len);
> +    builder.quick_push (1);
> +    builder.quick_push (len + 1);
> +    vec_perm_indices sel (builder, 2, len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
> +			    vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
> +			  };
> +    validate_res (4, 1, res, expected_res);
> +  }

nunits[0] >= 4

> +
> +  /* Case 6: mask = {0, 4, ...}
> +     npatterns = 1, nelts_per_pattern = 2.
> +     This should return NULL_TREE because the index 4 may choose
> +     from either arg0 or arg1 depending on vector length.  */
> +  {
> +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (len, 1, 2);
> +    builder.quick_push (0);
> +    builder.quick_push (4);
> +    vec_perm_indices sel (builder, 2, len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +    ASSERT_TRUE (res == NULL_TREE);
> +  }

nunits[0] == 2 or == 4 (could be <= 4 if the inputs had
nelts_per_pattern==1)

> +
> +  /* Case 7: npatterns(arg0) = 4 > npatterns(sel) = 2
> +     mask = {0, len, 1, len + 1, ...}
> +     sel_npatterns = 2, sel_nelts_per_pattern = 2.  */
> +  {
> +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (arg0_len, 2, 2);
> +    builder.quick_push (0);
> +    builder.quick_push (arg0_len);
> +    builder.quick_push (1);
> +    builder.quick_push (arg0_len + 1);
> +    vec_perm_indices sel (builder, 2, arg0_len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
> +			    vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
> +			  };
> +    validate_res (2, 2, res, expected_res);
> +  }

nunits[0] >= 2

> +
> +  /* Case 8: sel = {0, 1, 2, ...}
> +     npatterns = 1, nelts_per_pattern = 3
> +     expected res: { arg0[0], arg0[1], arg0[2], ... } */
> +  {
> +    tree arg0 = build_vec_cst_rand (vmode, 1, 3, 2);
> +    tree arg1 = build_vec_cst_rand (vmode, 1, 3, 2);
> +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (arg0_len, 1, 3);
> +    builder.quick_push (0);
> +    builder.quick_push (1);
> +    builder.quick_push (2);
> +
> +    vec_perm_indices sel (builder, 2, arg0_len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg0, 1),
> +			    vector_cst_elt (arg0, 2) };
> +    validate_res (1, 3, res, expected_res);
> +  }

all nunits[0]

> +
> +  /* Case 9: sel = {len, len + 1, len + 2, ... }
> +     npatterns = 1, nelts_per_pattern = 3
> +     expected res: { op1[0], op1[1], op1[2], ... } */
> +  {
> +    tree arg0 = build_vec_cst_rand (vmode, 1, 3, 2);
> +    tree arg1 = build_vec_cst_rand (vmode, 1, 3, 2);
> +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (arg0_len, 1, 3);
> +    builder.quick_push (arg0_len);
> +    builder.quick_push (arg0_len + 1);
> +    builder.quick_push (arg0_len + 2);
> +
> +    vec_perm_indices sel (builder, 2, arg0_len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, NULL);
> +    tree expected_res[] = { vector_cst_elt (arg1, 0), vector_cst_elt (arg1, 1),
> +			    vector_cst_elt (arg1, 2) };
> +    validate_res (1, 3, res, expected_res);
> +  }

all nunits[0]

Same idea for the others.

Thanks,
Richard
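
Rough sketch of the dispatch-by-limit idea mentioned above, with the
per-case nunits[0] notes folded in.  is_simple_vla_size is as suggested
earlier; the test_nunits_min_* / test_all_nunits helpers are hypothetical
groupings of the cases, so treat this as illustrative only:

  FOR_EACH_MODE_IN_CLASS (mode, MODE_VECTOR_INT)
    {
      auto nunits = GET_MODE_NUNITS (mode);
      if (!is_simple_vla_size (nunits))
	continue;
      /* Hypothetical helpers, grouped by the smallest nunits[0]
	 that the cases in each group require, rather than by an
	 exact element count.  */
      if (nunits[0] >= 4)
	test_nunits_min_4 (mode);
      if (nunits[0] >= 2)
	test_nunits_min_2 (mode);
      test_all_nunits (mode);
    }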