From mboxrd@z Thu Jan  1 00:00:00 1970
From: Richard Sandiford <richard.sandiford@arm.com>
To: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>
Mail-Followup-To: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>,gcc Patches <gcc-patches@gcc.gnu.org>,  Richard Biener <rguenther@suse.de>, richard.sandiford@arm.com
Cc: gcc Patches <gcc-patches@gcc.gnu.org>,  Richard Biener <rguenther@suse.de>
Subject: Re: PR111754
References: <CAAgBjMkP2ZTUq9_YN+4_kCzfPBDroFE-YUVSS4h9=NFWxhetwA@mail.gmail.com>
Date: Tue, 24 Oct 2023 22:28:48 +0100
In-Reply-To: <CAAgBjMkP2ZTUq9_YN+4_kCzfPBDroFE-YUVSS4h9=NFWxhetwA@mail.gmail.com>
	(Prathamesh Kulkarni's message of "Fri, 20 Oct 2023 22:55:44 +0530")
Message-ID: <mptcyx3n20f.fsf@arm.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
List-Id: <gcc-patches.gcc.gnu.org>

Hi,

Sorry for the slow review.  I clearly didn't think this through properly
when doing the review of the original patch, so I wanted to spend
some time working on the code to get a better understanding of
the problem.

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> Hi,
> For the following test-case:
>
> typedef float __attribute__((__vector_size__ (16))) F;
> F foo (F a, F b)
> {
>   F v = (F) { 9 };
>   return __builtin_shufflevector (v, v, 1, 0, 1, 2);
> }
>
> Compiling with -O2 results in following ICE:
> foo.c: In function ‘foo’:
> foo.c:6:10: internal compiler error: in decompose, at rtl.h:2314
>     6 |   return __builtin_shufflevector (v, v, 1, 0, 1, 2);
>       |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 0x7f3185 wi::int_traits<std::pair<rtx_def*, machine_mode>
>>::decompose(long*, unsigned int, std::pair<rtx_def*, machine_mode>
> const&)
>         ../../gcc/gcc/rtl.h:2314
> 0x7f3185 wide_int_ref_storage<false,
> false>::wide_int_ref_storage<std::pair<rtx_def*, machine_mode>
>>(std::pair<rtx_def*, machine_mode> const&)
>         ../../gcc/gcc/wide-int.h:1089
> 0x7f3185 generic_wide_int<wide_int_ref_storage<false, false>
>>::generic_wide_int<std::pair<rtx_def*, machine_mode>
>>(std::pair<rtx_def*, machine_mode> const&)
>         ../../gcc/gcc/wide-int.h:847
> 0x7f3185 poly_int<1u, generic_wide_int<wide_int_ref_storage<false,
> false> > >::poly_int<std::pair<rtx_def*, machine_mode>
>>(poly_int_full, std::pair<rtx_def*, machine_mode> const&)
>         ../../gcc/gcc/poly-int.h:467
> 0x7f3185 poly_int<1u, generic_wide_int<wide_int_ref_storage<false,
> false> > >::poly_int<std::pair<rtx_def*, machine_mode>
>>(std::pair<rtx_def*, machine_mode> const&)
>         ../../gcc/gcc/poly-int.h:453
> 0x7f3185 wi::to_poly_wide(rtx_def const*, machine_mode)
>         ../../gcc/gcc/rtl.h:2383
> 0x7f3185 rtx_vector_builder::step(rtx_def*, rtx_def*) const
>         ../../gcc/gcc/rtx-vector-builder.h:122
> 0xfd4e1b vector_builder<rtx_def*, machine_mode,
> rtx_vector_builder>::elt(unsigned int) const
>         ../../gcc/gcc/vector-builder.h:253
> 0xfd4d11 rtx_vector_builder::build()
>         ../../gcc/gcc/rtx-vector-builder.cc:73
> 0xc21d9c const_vector_from_tree
>         ../../gcc/gcc/expr.cc:13487
> 0xc21d9c expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> expand_modifier, rtx_def**, bool)
>         ../../gcc/gcc/expr.cc:11059
> 0xaee682 expand_expr(tree_node*, rtx_def*, machine_mode, expand_modifier)
>         ../../gcc/gcc/expr.h:310
> 0xaee682 expand_return
>         ../../gcc/gcc/cfgexpand.cc:3809
> 0xaee682 expand_gimple_stmt_1
>         ../../gcc/gcc/cfgexpand.cc:3918
> 0xaee682 expand_gimple_stmt
>         ../../gcc/gcc/cfgexpand.cc:4044
> 0xaf28f0 expand_gimple_basic_block
>         ../../gcc/gcc/cfgexpand.cc:6100
> 0xaf4996 execute
>         ../../gcc/gcc/cfgexpand.cc:6835
>
> IIUC, the issue is that fold_vec_perm returns a vector having float element
> type with res_nelts_per_pattern == 3, and later ICE's when it tries
> to derive element v[3], not present in the encoding, while trying to
> build rtx vector
> in rtx_vector_builder::build():
>  for (unsigned int i = 0; i < nelts; ++i)
>     RTVEC_ELT (v, i) = elt (i);
>
> The attached patch tries to fix this by returning false from
> valid_mask_for_fold_vec_perm_cst if sel has a stepped sequence and
> input vector has non-integral element type, so for VLA vectors, it
> will only build result with dup sequence (nelts_per_pattern < 3) for
> non-integral element type.
>
> For VLS vectors, this will still work for stepped sequence since it
> will then use the "VLS exception" in fold_vec_perm_cst, and set:
> res_npattern = res_nelts and
> res_nelts_per_pattern = 1
>
> and fold the above case to:
> F foo (F a, F b)
> {
>   <bb 2> [local count: 1073741824]:
>   return { 0.0, 9.0e+0, 0.0, 0.0 };
> }
>
> But I am not sure if this is entirely correct, since:
> tree res = out_elts.build ();
> will canonicalize the encoding and may result in a stepped sequence
> (vector_builder::finalize() may reduce npatterns at the cost of increasing
> nelts_per_pattern)  ?
>
> PS: This issue is now latent after PR111648 fix, since
> valid_mask_for_fold_vec_perm_cst with  sel = {1, 0, 1, ...} returns
> false because the corresponding pattern in arg0 is not a natural
> stepped sequence, and folds correctly using VLS exception. However, I
> guess the underlying issue of dealing with non-integral element types
> in fold_vec_perm_cst still remains ?
>
> The patch passes bootstrap+test with and without SVE on aarch64-linux-gnu,
> and on x86_64-linux-gnu.

I think the problem is instead in the way that we're calculating
res_npatterns and res_nelts_per_pattern.

If the selector is a duplication of { a1, ..., an }, then the
result will be a duplication of n elements, regardless of the shape
of the other arguments.

Similarly, if the selector is { a1, ..., an } followed by a
duplication of { b1, ..., bn }, the result will be n elements followed
by a duplication of n elements, regardless of the shape of the other
arguments.

So for these two cases, res_npatterns and res_nelts_per_pattern
can come directly from the selector's encoding.
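
To make those two cases concrete, here is a minimal toy model of the
encoding.  It is a sketch, not GCC code: Encoded, elt and permute_elt
are hypothetical names, elements are plain integers, and both inputs
are assumed to have the same fixed length NELTS.

```cpp
#include <vector>

// Toy stand-in for a pattern-encoded vector; not GCC's representation.
struct Encoded {
  unsigned npatterns;          // number of interleaved patterns
  unsigned nelts_per_pattern;  // 1 = dup, 2 = head then dup, 3 = stepped
  std::vector<long> elts;      // npatterns * nelts_per_pattern values

  // Element I of the conceptually unbounded vector.
  long elt(unsigned i) const {
    if (i < elts.size())
      return elts[i];
    unsigned pattern = i % npatterns;
    unsigned count = i / npatterns;  // position within that pattern
    long last = elts[(nelts_per_pattern - 1) * npatterns + pattern];
    if (nelts_per_pattern < 3)
      return last;
    // Stepped: continue the linear series set by the last two elements.
    long prev = elts[(nelts_per_pattern - 2) * npatterns + pattern];
    return last + static_cast<long>(count - 2) * (last - prev);
  }
};

// Element I of a permute of ARG0 and ARG1 (each NELTS long) by SEL.
long permute_elt(const Encoded &arg0, const Encoded &arg1,
                 const Encoded &sel, unsigned nelts, unsigned i) {
  unsigned long s = static_cast<unsigned long>(sel.elt(i));
  return s < nelts ? arg0.elt(s) : arg1.elt(s - nelts);
}
```

With a stepped arg0 of { 0, 1, 2, ... } and a selector that duplicates
{ 1, 0 }, every even result element is arg0[1] and every odd one is
arg0[0], so the result is a duplication of two elements even though
arg0 itself is stepped.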

If:

(1) the selector is an n-pattern stepped sequence
(2) the stepped part of each pattern selects from the same input pattern
(3) the stepped part of each pattern does not select the first element
    of the input pattern, or the full input pattern is stepped
    (your previous patch)

then the result is stepped only if one of the inputs is stepped.
This is because, if an input pattern has 1 or 2 elements, (3) means
that each element of the stepped sequence will select the same value,
as if the selector step had been 0.
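
As a rough sketch of the resulting shape rule (Shape and result_shape
are hypothetical names, and the code assumes the selector has already
been accepted under conditions (1) to (3) above):

```cpp
// Hypothetical summary of an encoding's shape; not a GCC type.
struct Shape {
  unsigned npatterns;
  unsigned nelts_per_pattern;
};

// Shape of the folded result, assuming SEL has already been accepted
// as a valid mask for ARG0 and ARG1.
Shape result_shape(Shape arg0, Shape arg1, Shape sel) {
  // Start from the selector's own shape.
  Shape res = sel;
  // A stepped selector over non-stepped inputs keeps picking the same
  // value, so the result degrades to "head followed by duplication".
  if (res.nelts_per_pattern == 3
      && arg0.nelts_per_pattern < 3
      && arg1.nelts_per_pattern < 3)
    res.nelts_per_pattern = 2;
  return res;
}
```

For example, a single-pattern selector { 0, 1, 2, ... } applied to an
input { 7, 9, 9, ... } yields { 7, 9, 9, ... }: the stepped part only
ever selects the duplicated 9, so two elements per pattern are enough.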

So I think the PR could be solved by something like the attached.
Do you agree?  If so, could you base the patch on this instead?

Only tested against the self-tests.

Thanks,
Richard

diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 40767736389..00fce4945a7 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -10743,27 +10743,37 @@ fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
   unsigned res_npatterns, res_nelts_per_pattern;
   unsigned HOST_WIDE_INT res_nelts;
 
-  /* (1) If SEL is a suitable mask as determined by
-     valid_mask_for_fold_vec_perm_cst_p, then:
-     res_npatterns = max of npatterns between ARG0, ARG1, and SEL
-     res_nelts_per_pattern = max of nelts_per_pattern between
-			     ARG0, ARG1 and SEL.
-     (2) If SEL is not a suitable mask, and TYPE is VLS then:
-     res_npatterns = nelts in result vector.
-     res_nelts_per_pattern = 1.
-     This exception is made so that VLS ARG0, ARG1 and SEL work as before.  */
-  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
-    {
-      res_npatterns
-	= std::max (VECTOR_CST_NPATTERNS (arg0),
-		    std::max (VECTOR_CST_NPATTERNS (arg1),
-			      sel.encoding ().npatterns ()));
+  /* First try to implement the fold in a VLA-friendly way.
+
+     (1) If the selector is simply a duplication of N elements, the
+	 result is likewise a duplication of N elements.
+
+     (2) If the selector is N elements followed by a duplication
+	 of N elements, the result is too.
 
-      res_nelts_per_pattern
-	= std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
-		    std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
-			      sel.encoding ().nelts_per_pattern ()));
+     (3) If the selector is N elements followed by an interleaving
+	 of N linear series, the situation is more complex.
 
+	 valid_mask_for_fold_vec_perm_cst_p detects whether we
+	 can handle this case.  If we can, then each of the N linear
+	 series either (a) selects the same element each time or
+	 (b) selects a linear series from one of the input patterns.
+
+	 If (b) holds for one of the linear series, the result
+	 will contain a linear series, and so the result will have
+	 the same shape as the selector.  If (a) holds for all of
+	 the linear series, the result will be the same as (2) above.
+
+	 (b) can only hold if one of the input patterns has a
+	 stepped encoding.  */
+  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
+    {
+      res_npatterns = sel.encoding ().npatterns ();
+      res_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
+      if (res_nelts_per_pattern == 3
+	  && VECTOR_CST_NELTS_PER_PATTERN (arg0) < 3
+	  && VECTOR_CST_NELTS_PER_PATTERN (arg1) < 3)
+	res_nelts_per_pattern = 2;
       res_nelts = res_npatterns * res_nelts_per_pattern;
     }
   else if (TYPE_VECTOR_SUBPARTS (type).is_constant (&res_nelts))
-- 
2.25.1