Date: Sat, 17 Sep 2022 07:04:00 +0200 (CEST)
From: Richard Biener
To: Richard Sandiford
cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] vect: Fix SLP layout handling of masked loads [PR106794]

On Fri, 16 Sep 2022, Richard Sandiford wrote:

> PR106794 shows that I'd forgotten about masked loads when
> doing the SLP layout changes. These loads can't currently
> be permuted independently of their mask input, so during
> construction they never get a load permutation.
>
> (If we did support permuting masked loads in future, the mask
> would need to be in the right order for the load, rather than in
> the order implied by the result of the permutation. Since masked
> loads can't be partly or fully scalarised in the way that normal
> permuted loads can be, there's probably no benefit to fusing the
> permutation and the load. Permutation after the fact is probably
> good enough.)
>
> Tested on aarch64-linux-gnu & x86_64-linux-gnu. OK to install?

OK.

Thanks,
Richard.

> Richard
>
>
> gcc/
> 	PR tree-optimization/106794
> 	PR tree-optimization/106914
> 	* tree-vect-slp.cc (vect_optimize_slp_pass::internal_node_cost):
> 	Only consider loads that already have a permutation.
> 	(vect_optimize_slp_pass::start_choosing_layouts): Assert that
> 	loads with permutations are leaf nodes. Prevent any kind of grouped
> 	access from changing layout if it doesn't have a load permutation.
>
> gcc/testsuite/
> 	* gcc.dg/vect/pr106914.c: New test.
> 	* g++.dg/vect/pr106794.cc: Likewise.
> ---
>  gcc/testsuite/g++.dg/vect/pr106794.cc | 40 +++++++++++++++++++++++++++
>  gcc/testsuite/gcc.dg/vect/pr106914.c  | 15 ++++++++++
>  gcc/tree-vect-slp.cc                  | 30 ++++++++++++++------
>  3 files changed, 77 insertions(+), 8 deletions(-)
>  create mode 100644 gcc/testsuite/g++.dg/vect/pr106794.cc
>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr106914.c
>
> diff --git a/gcc/testsuite/g++.dg/vect/pr106794.cc b/gcc/testsuite/g++.dg/vect/pr106794.cc
> new file mode 100644
> index 00000000000..f056563c4e1
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/vect/pr106794.cc
> @@ -0,0 +1,40 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-Ofast" } */
> +/* { dg-additional-options "-march=bdver2" { target x86_64-*-* i?86-*-* } } */
> +
> +template <typename T> struct Vector3 {
> +  Vector3();
> +  Vector3(T, T, T);
> +  T length() const;
> +  T x, y, z;
> +};
> +template <typename T>
> +Vector3<T>::Vector3(T _x, T _y, T _z) : x(_x), y(_y), z(_z) {}
> +Vector3<float> cross(Vector3<float> a, Vector3<float> b) {
> +  return Vector3<float>(a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z,
> +                        a.x * b.y - a.y * b.x);
> +}
> +template <typename T> T Vector3<T>::length() const { return z; }
> +int generateNormals_i;
> +float generateNormals_p2_0, generateNormals_p0_0;
> +struct SphereMesh {
> +  void generateNormals();
> +  float vertices;
> +};
> +void SphereMesh::generateNormals() {
> +  Vector3<float> *faceNormals = new Vector3<float>;
> +  for (int j; j; j++) {
> +    float *p0 = &vertices + 3, *p1 = &vertices + j * 3, *p2 = &vertices + 3,
> +          *p3 = &vertices + generateNormals_i + j * 3;
> +    Vector3<float> v0(p1[0] - generateNormals_p0_0, p1[1] - 1, p1[2] - 2),
> +        v1(0, 1, 2);
> +    if (v0.length())
> +      v1 = Vector3<float>(p3[0] - generateNormals_p2_0, p3[1] - p2[1],
> +                          p3[2] - p2[2]);
> +    else
> +      v1 = Vector3<float>(generateNormals_p0_0 - p3[0], p0[1] - p3[1],
> +                          p0[2] - p3[2]);
> +    Vector3<float> faceNormal = cross(v0, v1);
> +    faceNormals[j] = faceNormal;
> +  }
> +}
> diff --git a/gcc/testsuite/gcc.dg/vect/pr106914.c b/gcc/testsuite/gcc.dg/vect/pr106914.c
> new file mode 100644
> index 00000000000..9d9b3e30081
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/pr106914.c
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-fprofile-generate" } */
> +/* { dg-additional-options "-mavx512vl" { target x86_64-*-* i?86-*-* } } */
> +
> +int *mask_slp_int64_t_8_2_x, *mask_slp_int64_t_8_2_y, *mask_slp_int64_t_8_2_z;
> +
> +void
> +__attribute__mask_slp_int64_t_8_2() {
> +  for (int i; i; i += 8) {
> +    mask_slp_int64_t_8_2_x[i + 6] =
> +        mask_slp_int64_t_8_2_y[i + 6] ? mask_slp_int64_t_8_2_z[i] : 1;
> +    mask_slp_int64_t_8_2_x[i + 7] =
> +        mask_slp_int64_t_8_2_y[i + 7] ? mask_slp_int64_t_8_2_z[i + 7] : 2;
> +  }
> +}
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index ca3422c2a1e..229f2663ebc 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -4494,7 +4494,8 @@ vect_optimize_slp_pass::internal_node_cost (slp_tree node, int in_layout_i,
>    stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (node);
>    if (rep
>        && STMT_VINFO_DATA_REF (rep)
> -      && DR_IS_READ (STMT_VINFO_DATA_REF (rep)))
> +      && DR_IS_READ (STMT_VINFO_DATA_REF (rep))
> +      && SLP_TREE_LOAD_PERMUTATION (node).exists ())
>      {
>        auto_load_permutation_t tmp_perm;
>        tmp_perm.safe_splice (SLP_TREE_LOAD_PERMUTATION (node));
> @@ -4569,8 +4570,12 @@ vect_optimize_slp_pass::start_choosing_layouts ()
> 	  if (SLP_TREE_LOAD_PERMUTATION (node).exists ())
> 	    {
> 	      /* If splitting out a SLP_TREE_LANE_PERMUTATION can make the node
> -		 unpermuted, record a layout that reverses this permutation.  */
> -	      gcc_assert (partition.layout == 0);
> +		 unpermuted, record a layout that reverses this permutation.
> +
> +		 We would need more work to cope with loads that are internally
> +		 permuted and also have inputs (such as masks for
> +		 IFN_MASK_LOADs).  */
> +	      gcc_assert (partition.layout == 0 && !m_slpg->vertices[node_i].succ);
> 	      if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
> 		continue;
> 	      dr_stmt = DR_GROUP_FIRST_ELEMENT (dr_stmt);
> @@ -4684,12 +4689,21 @@ vect_optimize_slp_pass::start_choosing_layouts ()
> 	  vertex.weight = vect_slp_node_weight (node);
>
> 	  /* We do not handle stores with a permutation, so all
> -	     incoming permutations must have been materialized.  */
> +	     incoming permutations must have been materialized.
> +
> +	     We also don't handle masked grouped loads, which lack a
> +	     permutation vector.  In this case the memory locations
> +	     form an implicit second input to the loads, on top of the
> +	     explicit mask input, and the memory input's layout cannot
> +	     be changed.
> +
> +	     On the other hand, we do support permuting gather loads and
> +	     masked gather loads, where each scalar load is independent
> +	     of the others.  This can be useful if the address/index input
> +	     benefits from permutation.  */
> 	  if (STMT_VINFO_DATA_REF (rep)
> -	      && DR_IS_WRITE (STMT_VINFO_DATA_REF (rep)))
> -	    /* ??? We're forcing materialization in place
> -	       of the child here, we'd need special handling
> -	       in materialization to leave layout -1 here.  */
> +	      && STMT_VINFO_GROUPED_ACCESS (rep)
> +	      && !SLP_TREE_LOAD_PERMUTATION (node).exists ())
> 	    partition.layout = 0;
>
> 	  /* We cannot change the layout of an operation that is
> --
> 2.25.1
>
>
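For readers following along: the sketch below is a hedged, runnable illustration of the kind of loop the fix is about — it is not the committed testcase, and the function name, driver, and data are invented for this example. On a target with masked vector loads (e.g. -mavx512vl), the two conditional reads per 8-lane group can become IFN_MASK_LOADs, and such SLP load nodes are built without a load permutation, which is exactly the case the patch teaches the layout pass to leave alone.

```c
/* Hypothetical reduction modelled on gcc.dg/vect/pr106914.c: lanes 6
   and 7 of each 8-wide group conditionally read z[] under masks taken
   from y[].  Note the asymmetry between the lanes (z[i] vs. z[i + 7]),
   mirroring the testcase: the reads are not a plain contiguous pair,
   yet their layout must stay fixed because the mask input pins the
   lane order.  */
void
mask_slp (int *x, const int *y, const int *z, int n)
{
  for (int i = 0; i + 7 < n; i += 8)
    {
      /* Active lane reads z[]; inactive lane takes the constant.  */
      x[i + 6] = y[i + 6] ? z[i] : 1;
      x[i + 7] = y[i + 7] ? z[i + 7] : 2;
    }
}
```

Scalar execution is of course unchanged by vectorization; only the code generation strategy differs, which is why a compile-only dg test suffices in the patch itself.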