From: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>
To: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>,
	gcc Patches <gcc-patches@gcc.gnu.org>,
	 Richard Biener <rguenther@suse.de>,
	richard.sandiford@arm.com
Subject: Re: PR111754
Date: Mon, 27 Nov 2023 20:43:34 +0530
Message-ID: <CAAgBjMmbbcEeasB1NrDpEb8EvZ9iW9SO1Tkr39X7ggGD5iqRvg@mail.gmail.com>
In-Reply-To: <mptzfz4403f.fsf@arm.com>

[-- Attachment #1: Type: text/plain, Size: 18911 bytes --]

On Fri, 24 Nov 2023 at 03:13, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > On Thu, 26 Oct 2023 at 09:43, Prathamesh Kulkarni
> > <prathamesh.kulkarni@linaro.org> wrote:
> >>
> >> On Thu, 26 Oct 2023 at 04:09, Richard Sandiford
> >> <richard.sandiford@arm.com> wrote:
> >> >
> >> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > > On Wed, 25 Oct 2023 at 02:58, Richard Sandiford
> >> > > <richard.sandiford@arm.com> wrote:
> >> > >> So I think the PR could be solved by something like the attached.
> >> > >> Do you agree?  If so, could you base the patch on this instead?
> >> > >>
> >> > >> Only tested against the self-tests.
> >> > >>
> >> > >> Thanks,
> >> > >> Richard
> >> > >>
> >> > >> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> >> > >> index 40767736389..00fce4945a7 100644
> >> > >> --- a/gcc/fold-const.cc
> >> > >> +++ b/gcc/fold-const.cc
> >> > >> @@ -10743,27 +10743,37 @@ fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
> >> > >>    unsigned res_npatterns, res_nelts_per_pattern;
> >> > >>    unsigned HOST_WIDE_INT res_nelts;
> >> > >>
> >> > >> -  /* (1) If SEL is a suitable mask as determined by
> >> > >> -     valid_mask_for_fold_vec_perm_cst_p, then:
> >> > >> -     res_npatterns = max of npatterns between ARG0, ARG1, and SEL
> >> > >> -     res_nelts_per_pattern = max of nelts_per_pattern between
> >> > >> -                            ARG0, ARG1 and SEL.
> >> > >> -     (2) If SEL is not a suitable mask, and TYPE is VLS then:
> >> > >> -     res_npatterns = nelts in result vector.
> >> > >> -     res_nelts_per_pattern = 1.
> >> > >> -     This exception is made so that VLS ARG0, ARG1 and SEL work as before.  */
> >> > >> -  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> >> > >> -    {
> >> > >> -      res_npatterns
> >> > >> -       = std::max (VECTOR_CST_NPATTERNS (arg0),
> >> > >> -                   std::max (VECTOR_CST_NPATTERNS (arg1),
> >> > >> -                             sel.encoding ().npatterns ()));
> >> > >> +  /* First try to implement the fold in a VLA-friendly way.
> >> > >> +
> >> > >> +     (1) If the selector is simply a duplication of N elements, the
> >> > >> +        result is likewise a duplication of N elements.
> >> > >> +
> >> > >> +     (2) If the selector is N elements followed by a duplication
> >> > >> +        of N elements, the result is too.
> >> > >>
> >> > >> -      res_nelts_per_pattern
> >> > >> -       = std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> >> > >> -                   std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
> >> > >> -                             sel.encoding ().nelts_per_pattern ()));
> >> > >> +     (3) If the selector is N elements followed by an interleaving
> >> > >> +        of N linear series, the situation is more complex.
> >> > >>
> >> > >> +        valid_mask_for_fold_vec_perm_cst_p detects whether we
> >> > >> +        can handle this case.  If we can, then each of the N linear
> >> > >> +        series either (a) selects the same element each time or
> >> > >> +        (b) selects a linear series from one of the input patterns.
> >> > >> +
> >> > >> +        If (b) holds for one of the linear series, the result
> >> > >> +        will contain a linear series, and so the result will have
> >> > >> +        the same shape as the selector.  If (a) holds for all of
> >> > >> +        the lienar series, the result will be the same as (2) above.
> >> > >> +
> >> > >> +        (b) can only hold if one of the input patterns has a
> >> > >> +        stepped encoding.  */
> >> > >> +  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> >> > >> +    {
> >> > >> +      res_npatterns = sel.encoding ().npatterns ();
> >> > >> +      res_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> >> > >> +      if (res_nelts_per_pattern == 3
> >> > >> +         && VECTOR_CST_NELTS_PER_PATTERN (arg0) < 3
> >> > >> +         && VECTOR_CST_NELTS_PER_PATTERN (arg1) < 3)
> >> > >> +       res_nelts_per_pattern = 2;
> >> > > Um, in this case, should we set:
> >> > > res_nelts_per_pattern = max (nelts_per_pattern (arg0), nelts_per_pattern (arg1))
> >> > > if both have nelts_per_pattern == 1?
> >> >
> >> > No, it still needs to be 2 even if arg0 and arg1 are duplicates.
> >> > E.g. consider a selector that picks the first element of arg0
> >> > followed by a duplicate of the first element of arg1.
> >> >
> >> > > Also I suppose this matters only for non-integral element type, since
> >> > > for integral element type,
> >> > > vector_cst_elt will return the correct value even if the element is
> >> > > not explicitly encoded and the input vector is dup?
> >> >
> >> > Yeah, but it might help even for integers.  If we build fewer
> >> > elements explicitly, and so read fewer implicitly-encoded inputs,
> >> > there's less risk of running into:
> >> >
> >> >       if (!can_div_trunc_p (sel[i], len, &q, &r))
> >> >         {
> >> >           if (reason)
> >> >             *reason = "cannot divide selector element by arg len";
> >> >           return NULL_TREE;
> >> >         }
> >> Ah right, thanks for the clarification!
> >> I am currently away on vacation and will return next Thursday, and
> >> will post a follow-up patch based on your patch.
> >> Sorry for the delay.
> > Hi,
> > Sorry for the slow response.  I have rebased your patch and added a couple of tests.
> > The attached patch resulted in fallout for aarch64/sve/slp_3.c and
> > aarch64/sve/slp_4.c.
> >
> > Specifically for slp_3.c, we didn't fold the following case:
> > arg0 and arg1 are dup vectors.
> > sel = { 0, len, 1, len + 1, 2, len + 2, ... } // (npatterns = 2,
> > nelts_per_pattern = 3)
> > because res_nelts_per_pattern was set to 3, and upon encountering the
> > selector element 2, fold_vec_perm_cst returned NULL_TREE.
> >
> > With the patch, we set res_nelts_per_pattern = 2 (since the input
> > vectors are dups), and the permutation thus gets folded to:
> > res = { arg0[0], arg1[0], ... } // (2, 1)
> >
> > This results in using ld1rqd to load the result instead of doing
> > the permutation at runtime with mov and zip1.
> > I have adjusted the tests for the new code-gen.
> > Does it look OK?
> >
> > There's also this strange failure observed on x86_64, as well as on aarch64:
> > New tests that FAIL (1 tests):
> > libitm.c++/dropref.C -B
> > /home/prathamesh.kulkarni/gnu-toolchain/gcc/gnu-964-5/bootstrap-build-after/aarch64-unknown-linux-gnu/./libitm/../libstdc++-v3/src/.libs
> > execution test
> >
> > Looking at dropref.C:
> > /* { dg-xfail-run-if "unsupported" { *-*-* } } */
> > #include <libitm.h>
> >
> > char *pp;
> >
> > int main()
> > {
> >   __transaction_atomic {
> >     _ITM_dropReferences (pp, 555);
> >   }
> >   return 0;
> > }
> >
> > It doesn't seem relevant to VEC_PERM_EXPR folding?
> > The patch otherwise passes bootstrap+test on aarch64-linux-gnu with
> > and without SVE, and on x86_64-linux-gnu.
> >
> > Thanks,
> > Prathamesh
> >>
> >> Thanks,
> >> Prathamesh
> >> >
> >> > Thanks,
> >> > Richard
> >
> > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> > index 40767736389..75410869796 100644
> > --- a/gcc/fold-const.cc
> > +++ b/gcc/fold-const.cc
> > @@ -10743,27 +10743,38 @@ fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
> >    unsigned res_npatterns, res_nelts_per_pattern;
> >    unsigned HOST_WIDE_INT res_nelts;
> >
> > -  /* (1) If SEL is a suitable mask as determined by
> > -     valid_mask_for_fold_vec_perm_cst_p, then:
> > -     res_npatterns = max of npatterns between ARG0, ARG1, and SEL
> > -     res_nelts_per_pattern = max of nelts_per_pattern between
> > -                          ARG0, ARG1 and SEL.
> > -     (2) If SEL is not a suitable mask, and TYPE is VLS then:
> > -     res_npatterns = nelts in result vector.
> > -     res_nelts_per_pattern = 1.
> > -     This exception is made so that VLS ARG0, ARG1 and SEL work as before.  */
> > -  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> > -    {
> > -      res_npatterns
> > -     = std::max (VECTOR_CST_NPATTERNS (arg0),
> > -                 std::max (VECTOR_CST_NPATTERNS (arg1),
> > -                           sel.encoding ().npatterns ()));
> > +  /* First try to implement the fold in a VLA-friendly way.
> > +
> > +     (1) If the selector is simply a duplication of N elements, the
> > +      result is likewise a duplication of N elements.
> > +
> > +     (2) If the selector is N elements followed by a duplication
> > +      of N elements, the result is too.
> > +
> > +     (3) If the selector is N elements followed by an interleaving
> > +      of N linear series, the situation is more complex.
> > +
> > +      valid_mask_for_fold_vec_perm_cst_p detects whether we
> > +      can handle this case.  If we can, then each of the N linear
> > +      series either (a) selects the same element each time or
> > +      (b) selects a linear series from one of the input patterns.
> > +
> > +      If (b) holds for one of the linear series, the result
> > +      will contain a linear series, and so the result will have
> > +      the same shape as the selector.  If (a) holds for all of
> > +      the lienar series, the result will be the same as (2) above.
>
> my typo: linear
> >
> > -      res_nelts_per_pattern
> > -     = std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> > -                 std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
> > -                           sel.encoding ().nelts_per_pattern ()));
> > +      (b) can only hold if one of the input patterns has a
> > +      stepped encoding.  */
> >
> > +  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> > +    {
> > +      res_npatterns = sel.encoding ().npatterns ();
> > +      res_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> > +      if (res_nelts_per_pattern == 3
> > +       && VECTOR_CST_NELTS_PER_PATTERN (arg0) < 3
> > +       && VECTOR_CST_NELTS_PER_PATTERN (arg1) < 3)
> > +     res_nelts_per_pattern = 2;
> >        res_nelts = res_npatterns * res_nelts_per_pattern;
> >      }
> >    else if (TYPE_VECTOR_SUBPARTS (type).is_constant (&res_nelts))
> > @@ -17562,6 +17573,29 @@ test_nunits_min_2 (machine_mode vmode)
> >       tree expected_res[] = { ARG0(0), ARG1(0), ARG1(1) };
> >       validate_res (1, 3, res, expected_res);
> >        }
> > +
> > +      /* Case 8: Same as aarch64/sve/slp_3.c:
> > +      arg0, arg1 are dup vectors.
> > +      sel = { 0, len, 1, len+1, 2, len+2, ... } // (2, 3)
> > +      So res = { arg0[0], arg1[0], ... } // (2, 1)
> > +
> > +      In this case, since the input vectors are dup, only the first two
> > +      elements per pattern in sel are considered significant.  */
> > +      {
> > +     tree arg0 = build_vec_cst_rand (vmode, 1, 1);
> > +     tree arg1 = build_vec_cst_rand (vmode, 1, 1);
> > +     poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +     vec_perm_builder builder (len, 2, 3);
> > +     poly_uint64 mask_elems[] = { 0, len, 1, len + 1, 2, len + 2 };
> > +     builder_push_elems (builder, mask_elems);
> > +
> > +     vec_perm_indices sel (builder, 2, len);
> > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +     tree expected_res[] = { ARG0(0), ARG1(0) };
> > +     validate_res (2, 1, res, expected_res);
> > +      }
> >      }
> >  }
> >
> > @@ -17730,6 +17764,45 @@ test_nunits_min_4 (machine_mode vmode)
> >       ASSERT_TRUE (res == NULL_TREE);
> >       ASSERT_TRUE (!strcmp (reason, "step is not multiple of npatterns"));
> >        }
> > +
> > +      /* Case 8: PR111754: When the input vector is not a stepped sequence,
> > +      check that the result is not a stepped sequence either, even
> > +      if sel has a stepped sequence.  */
> > +      {
> > +     tree arg0 = build_vec_cst_rand (vmode, 1, 2);
> > +     tree arg1 = build_vec_cst_rand (vmode, 1, 2);
> > +     poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +     vec_perm_builder builder (len, 1, 3);
> > +     poly_uint64 mask_elems[] = { 0, 1, 2 };
> > +     builder_push_elems (builder, mask_elems);
> > +
> > +     vec_perm_indices sel (builder, 2, len);
> > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +     tree expected_res[] = { ARG0(0), ARG0(1) };
> > +     validate_res (sel.encoding ().npatterns (), 2, res, expected_res);
>
> The test is OK, but I think it's worth noting that the fold_vec_perm_cst
> arguments aren't canonical.  Since sel selects only from the first input,
> the canonical form would be:
>
>   tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg0, sel);
>
> So OK with a comment, but also OK with the line above instead (and no arg1).
>
> > +      }
> > +
> > +      /* Case 9: If sel doesn't contain a stepped sequence,
> > +      check that the result has the same encoding as sel, irrespective
> > +      of the shape of the input vectors.  */
> > +      {
> > +     tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
> > +     tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
> > +     poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +     vec_perm_builder builder (len, 1, 2);
> > +     poly_uint64 mask_elems[] = { 0, len };
> > +     builder_push_elems (builder, mask_elems);
> > +
> > +     vec_perm_indices sel (builder, 2, len);
> > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +     tree expected_res[] = { ARG0(0), ARG1(0) };
> > +     validate_res (sel.encoding ().npatterns (),
> > +                   sel.encoding ().nelts_per_pattern (), res, expected_res);
> > +      }
> >      }
> >  }
> >
> > diff --git a/gcc/testsuite/gcc.dg/vect/pr111754.c b/gcc/testsuite/gcc.dg/vect/pr111754.c
> > new file mode 100644
> > index 00000000000..7c1c16875c7
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/pr111754.c
> > @@ -0,0 +1,13 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -fdump-tree-optimized" } */
> > +
> > +typedef float __attribute__((__vector_size__ (16))) F;
> > +
> > +F foo (F a, F b)
> > +{
> > +  F v = (F) { 9 };
> > +  return __builtin_shufflevector (v, v, 1, 0, 1, 2);
> > +}
> > +
> > +/* { dg-final { scan-tree-dump-not "VEC_PERM_EXPR" "optimized" } } */
> > +/* { dg-final { scan-tree-dump "return \{ 0.0, 9.0e\\+0, 0.0, 0.0 \}" "optimized" } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c
> > index 82dd43a4d98..cb649bc1aa9 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c
> > @@ -33,21 +33,15 @@ TEST_ALL (VEC_PERM)
> >
> >  /* 1 for each 8-bit type.  */
> >  /* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 2 } } */
> > -/* 1 for each 16-bit type plus 1 for double.  */
> > -/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 4 } } */
> > +/* 1 for each 16-bit type.  */
> > +/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 3 } } */
> >  /* 1 for each 32-bit type.  */
> >  /* { dg-final { scan-assembler-times {\tld1rqw\tz[0-9]+\.s, } 3 } } */
> > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #41\n} 2 } } */
> > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #25\n} 2 } } */
> > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #31\n} 2 } } */
> > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #62\n} 2 } } */
>
> Let's replace the deleted lines with:
>
> /* { dg-final { scan-assembler-times {\tld1rqd\tz[0-9]+\.d, } 6 } } */
>
> > -/* 3 for double.  */
> > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, x[0-9]+\n} 3 } } */
> >  /* The 64-bit types need:
> >
> >        ZIP1 ZIP1 (2 ZIP2s optimized away)
>
> This line should be deleted, now that the ZIP1s are gone.
>
> >        ZIP1 ZIP2.  */
> > -/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 9 } } */
> > +/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 3 } } */
> >  /* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 3 } } */
> >
> >  /* The loop should be fully-masked.  The 64-bit types need two loads
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c
> > index b1fa5e3cf68..ce940a28597 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c
> > @@ -35,20 +35,10 @@ vec_slp_##TYPE (TYPE *restrict a, int n)                  \
> >
> >  TEST_ALL (VEC_PERM)
> >
> > -/* 1 for each 8-bit type, 4 for each 32-bit type and 4 for double.  */
> > -/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 18 } } */
> > +/* 1 for each 8-bit type.  */
> > +/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 2 } } */
> >  /* 1 for each 16-bit type.  */
> >  /* { dg-final { scan-assembler-times {\tld1rqh\tz[0-9]+\.h, } 3 } } */
> > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #99\n} 2 } } */
> > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #11\n} 2 } } */
> > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #17\n} 2 } } */
> > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #80\n} 2 } } */
> > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #63\n} 2 } } */
> > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #37\n} 2 } } */
> > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #24\n} 2 } } */
> > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #81\n} 2 } } */
> > -/* 4 for double.  */
> > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, x[0-9]+\n} 4 } } */
>
> Similarly here:
>
> /* { dg-final { scan-assembler-times {\tld1rqd\tz[0-9]+\.d, } 18 } } */
>
> >  /* The 32-bit types need:
> >
> >        ZIP1 ZIP1 (2 ZIP2s optimized away)
>
> This line should be deleted.
>
> > @@ -59,7 +49,7 @@ TEST_ALL (VEC_PERM)
> >        ZIP1 ZIP1 ZIP1 ZIP1 (4 ZIP2s optimized away)
>
> Same here.
>
> OK with those changes, and sorry for the slow review.
Hi Richard,
Thanks for the suggestions; I have made the changes in the attached patch.
Bootstrapped and tested with and without SVE on aarch64-linux-gnu, and on
x86_64-linux-gnu.
The tests pass, with the exception of the dropref.C failure above, which I
assume is spurious since it's not relevant to VEC_PERM_EXPR folding.
Is it OK to commit the patch to trunk?
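
For reference, the npatterns/nelts_per_pattern scheme under discussion can
be modelled with a small standalone sketch.  This is only my illustration
of the documented VECTOR_CST/vec_perm_indices encoding (patterns with fewer
than three encoded elements repeat their final element; three-element
patterns continue the linear series); it is not GCC code, and decode_elt is
a made-up name:

#include <cstdio>
#include <vector>

/* Decode element I of a VLA-style encoding: NPATTERNS interleaved
   patterns, each described by its first NELTS_PER_PATTERN elements.  */
static long
decode_elt (const std::vector<long> &enc, unsigned npatterns,
	    unsigned nelts_per_pattern, unsigned i)
{
  unsigned pattern = i % npatterns;
  unsigned index = i / npatterns;	/* Position within the pattern.  */
  if (index < nelts_per_pattern)
    return enc[index * npatterns + pattern];
  if (nelts_per_pattern < 3)
    /* Dup encodings repeat their final encoded element.  */
    return enc[(nelts_per_pattern - 1) * npatterns + pattern];
  /* Stepped encodings continue the linear series defined by the last
     two encoded elements.  */
  long a = enc[(nelts_per_pattern - 2) * npatterns + pattern];
  long b = enc[(nelts_per_pattern - 1) * npatterns + pattern];
  return b + (long) (index - (nelts_per_pattern - 1)) * (b - a);
}

int
main ()
{
  /* The slp_3.c selector { 0, len, 1, len + 1, 2, len + 2 } with
     len == 4, i.e. encoding (npatterns, nelts_per_pattern) = (2, 3).  */
  std::vector<long> sel = { 0, 4, 1, 5, 2, 6 };
  for (unsigned i = 0; i < 8; ++i)
    std::printf ("%ld ", decode_elt (sel, 2, 3, i));
  std::printf ("\n");	/* Prints: 0 4 1 5 2 6 3 7  */
}

Since both of sel's series here step through dup inputs, every element they
select has the same value as that pattern's second element, which is why
two encoded elements per pattern are enough for the folded result.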

Thanks,
Prathamesh
>
> Thanks,
> Richard
>
> >        ZIP1 ZIP2 ZIP1 ZIP2
> >        ZIP1 ZIP2 ZIP1 ZIP2.  */
> > -/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 33 } } */
> > +/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 15 } } */
> >  /* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 15 } } */
> >
> >  /* The loop should be fully-masked.  The 32-bit types need two loads

[-- Attachment #2: gnu-964-7.txt --]
[-- Type: text/plain, Size: 10237 bytes --]

PR111754: Rework encoding of result for VEC_PERM_EXPR with constant input vectors.

gcc/ChangeLog:
	PR middle-end/111754
	* fold-const.cc (fold_vec_perm_cst): Set result's encoding to sel's
	encoding, and set res_nelts_per_pattern to 2 if sel contains a stepped
	sequence but the input vectors do not.
	(test_nunits_min_2): New test Case 8.
	(test_nunits_min_4): New tests Case 8 and Case 9.

gcc/testsuite/ChangeLog:
	PR middle-end/111754
	* gcc.target/aarch64/sve/slp_3.c: Adjust code-gen.
	* gcc.target/aarch64/sve/slp_4.c: Likewise.
	* gcc.dg/vect/pr111754.c: New test.

Co-authored-by: Richard Sandiford <richard.sandiford@arm.com>
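
As a rough model of the encoding choice described above (a hand-written
sketch under the same conventions, not the GCC sources;
choose_result_encoding is a made-up name):

struct encoding { unsigned npatterns, nelts_per_pattern; };

/* The result takes the selector's encoding, except that a stepped
   selector applied to non-stepped inputs can only ever select
   duplicated elements, so two encoded elements per pattern suffice.  */
static encoding
choose_result_encoding (encoding sel, unsigned arg0_npp, unsigned arg1_npp)
{
  encoding res = sel;
  if (res.nelts_per_pattern == 3 && arg0_npp < 3 && arg1_npp < 3)
    res.nelts_per_pattern = 2;
  return res;
}

For the slp_3.c case, sel has encoding (2, 3) and both inputs have (1, 1),
so the result is built with encoding (2, 2); because the inputs are dups,
it canonicalises to (2, 1), as the new Case 8 selftest checks.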

diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 332bc8aead2..dff09b81f7b 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -10803,27 +10803,38 @@ fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
   unsigned res_npatterns, res_nelts_per_pattern;
   unsigned HOST_WIDE_INT res_nelts;
 
-  /* (1) If SEL is a suitable mask as determined by
-     valid_mask_for_fold_vec_perm_cst_p, then:
-     res_npatterns = max of npatterns between ARG0, ARG1, and SEL
-     res_nelts_per_pattern = max of nelts_per_pattern between
-			     ARG0, ARG1 and SEL.
-     (2) If SEL is not a suitable mask, and TYPE is VLS then:
-     res_npatterns = nelts in result vector.
-     res_nelts_per_pattern = 1.
-     This exception is made so that VLS ARG0, ARG1 and SEL work as before.  */
-  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
-    {
-      res_npatterns
-	= std::max (VECTOR_CST_NPATTERNS (arg0),
-		    std::max (VECTOR_CST_NPATTERNS (arg1),
-			      sel.encoding ().npatterns ()));
+  /* First try to implement the fold in a VLA-friendly way.
+
+     (1) If the selector is simply a duplication of N elements, the
+	 result is likewise a duplication of N elements.
+
+     (2) If the selector is N elements followed by a duplication
+	 of N elements, the result is too.
+
+     (3) If the selector is N elements followed by an interleaving
+	 of N linear series, the situation is more complex.
+
+	 valid_mask_for_fold_vec_perm_cst_p detects whether we
+	 can handle this case.  If we can, then each of the N linear
+	 series either (a) selects the same element each time or
+	 (b) selects a linear series from one of the input patterns.
 
-      res_nelts_per_pattern
-	= std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
-		    std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
-			      sel.encoding ().nelts_per_pattern ()));
+	 If (b) holds for one of the linear series, the result
+	 will contain a linear series, and so the result will have
+	 the same shape as the selector.  If (a) holds for all of
+	 the linear series, the result will be the same as (2) above.
 
+	 (b) can only hold if one of the input patterns has a
+	 stepped encoding.  */
+
+  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
+    {
+      res_npatterns = sel.encoding ().npatterns ();
+      res_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
+      if (res_nelts_per_pattern == 3
+	  && VECTOR_CST_NELTS_PER_PATTERN (arg0) < 3
+	  && VECTOR_CST_NELTS_PER_PATTERN (arg1) < 3)
+	res_nelts_per_pattern = 2;
       res_nelts = res_npatterns * res_nelts_per_pattern;
     }
   else if (TYPE_VECTOR_SUBPARTS (type).is_constant (&res_nelts))
@@ -17622,6 +17633,29 @@ test_nunits_min_2 (machine_mode vmode)
 	tree expected_res[] = { ARG0(0), ARG1(0), ARG1(1) };
 	validate_res (1, 3, res, expected_res);
       }
+
+      /* Case 8: Same as aarch64/sve/slp_3.c:
+	 arg0, arg1 are dup vectors.
+	 sel = { 0, len, 1, len+1, 2, len+2, ... } // (2, 3)
+	 So res = { arg0[0], arg1[0], ... } // (2, 1)
+
+	 In this case, since the input vectors are dup, only the first two
+	 elements per pattern in sel are considered significant.  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 1);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 1);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 2, 3);
+	poly_uint64 mask_elems[] = { 0, len, 1, len + 1, 2, len + 2 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+	tree expected_res[] = { ARG0(0), ARG1(0) };
+	validate_res (2, 1, res, expected_res);
+      }
     }
 }
 
@@ -17790,6 +17824,44 @@ test_nunits_min_4 (machine_mode vmode)
 	ASSERT_TRUE (res == NULL_TREE);
 	ASSERT_TRUE (!strcmp (reason, "step is not multiple of npatterns"));
       }
+
+      /* Case 8: PR111754: When the input vector is not a stepped sequence,
+	 check that the result is not a stepped sequence either, even
+	 if sel has a stepped sequence.  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 2);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 1, 3);
+	poly_uint64 mask_elems[] = { 0, 1, 2 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 1, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg0, sel);
+
+	tree expected_res[] = { ARG0(0), ARG0(1) };
+	validate_res (sel.encoding ().npatterns (), 2, res, expected_res);
+      }
+
+      /* Case 9: If sel doesn't contain a stepped sequence,
+	 check that the result has the same encoding as sel, irrespective
+	 of the shape of the input vectors.  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 1, 2);
+	poly_uint64 mask_elems[] = { 0, len };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+	tree expected_res[] = { ARG0(0), ARG1(0) };
+	validate_res (sel.encoding ().npatterns (),
+		      sel.encoding ().nelts_per_pattern (), res, expected_res);
+      }
     }
 }
 
diff --git a/gcc/testsuite/gcc.dg/vect/pr111754.c b/gcc/testsuite/gcc.dg/vect/pr111754.c
new file mode 100644
index 00000000000..7c1c16875c7
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr111754.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+typedef float __attribute__((__vector_size__ (16))) F;
+
+F foo (F a, F b)
+{
+  F v = (F) { 9 };
+  return __builtin_shufflevector (v, v, 1, 0, 1, 2);
+}
+
+/* { dg-final { scan-tree-dump-not "VEC_PERM_EXPR" "optimized" } } */
+/* { dg-final { scan-tree-dump "return \{ 0.0, 9.0e\\+0, 0.0, 0.0 \}" "optimized" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c
index 82dd43a4d98..775c1e1d530 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c
@@ -33,21 +33,14 @@ TEST_ALL (VEC_PERM)
 
 /* 1 for each 8-bit type.  */
 /* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 2 } } */
-/* 1 for each 16-bit type plus 1 for double.  */
-/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 4 } } */
+/* 1 for each 16-bit type.  */
+/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 3 } } */
 /* 1 for each 32-bit type.  */
 /* { dg-final { scan-assembler-times {\tld1rqw\tz[0-9]+\.s, } 3 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #41\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #25\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #31\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #62\n} 2 } } */
-/* 3 for double.  */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, x[0-9]+\n} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1rqd\tz[0-9]+\.d, } 6 } } */
 /* The 64-bit types need:
-
-      ZIP1 ZIP1 (2 ZIP2s optimized away)
       ZIP1 ZIP2.  */
-/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 9 } } */
+/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 3 } } */
 /* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 3 } } */
 
 /* The loop should be fully-masked.  The 64-bit types need two loads
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c
index b1fa5e3cf68..5a9fc8ff750 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c
@@ -35,31 +35,20 @@ vec_slp_##TYPE (TYPE *restrict a, int n)			\
 
 TEST_ALL (VEC_PERM)
 
-/* 1 for each 8-bit type, 4 for each 32-bit type and 4 for double.  */
-/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 18 } } */
+/* 1 for each 8-bit type.  */
+/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 2 } } */
 /* 1 for each 16-bit type.  */
 /* { dg-final { scan-assembler-times {\tld1rqh\tz[0-9]+\.h, } 3 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #99\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #11\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #17\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #80\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #63\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #37\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #24\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #81\n} 2 } } */
-/* 4 for double.  */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, x[0-9]+\n} 4 } } */
+/* { dg-final { scan-assembler-times {\tld1rqd\tz[0-9]+\.d, } 18 } } */
 /* The 32-bit types need:
 
-      ZIP1 ZIP1 (2 ZIP2s optimized away)
       ZIP1 ZIP2
 
    and the 64-bit types need:
 
-      ZIP1 ZIP1 ZIP1 ZIP1 (4 ZIP2s optimized away)
       ZIP1 ZIP2 ZIP1 ZIP2
       ZIP1 ZIP2 ZIP1 ZIP2.  */
-/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 33 } } */
+/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 15 } } */
 /* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 15 } } */
 
 /* The loop should be fully-masked.  The 32-bit types need two loads
