From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <richard.sandiford@arm.com>
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
	by sourceware.org (Postfix) with ESMTP id DAAC43858C5E
	for <gcc-patches@gcc.gnu.org>; Tue,  1 Nov 2022 14:25:23 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org DAAC43858C5E
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=arm.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=arm.com
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
	by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 892081FB;
	Tue,  1 Nov 2022 07:25:29 -0700 (PDT)
Received: from localhost (e121540-lin.manchester.arm.com [10.32.98.62])
	by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 896EE3F703;
	Tue,  1 Nov 2022 07:25:22 -0700 (PDT)
From: Richard Sandiford <richard.sandiford@arm.com>
To: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>
Mail-Followup-To: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>,Tamar Christina <tamar.christina@arm.com>,  nd@arm.com,  rguenther@suse.de, richard.sandiford@arm.com
Cc: Tamar Christina <tamar.christina@arm.com>,  nd@arm.com,  rguenther@suse.de
Subject: Re: [PATCH 3/8]middle-end: Support extractions of subvectors from arbitrary element position inside a vector
References: <Y1+4Nu1ryQIKoOQA@arm.com>
Date: Tue, 01 Nov 2022 14:25:21 +0000
In-Reply-To: <Y1+4Nu1ryQIKoOQA@arm.com> (Tamar Christina via Gcc-patches's
	message of "Mon, 31 Oct 2022 11:57:42 +0000")
Message-ID: <mptr0ym21m6.fsf@arm.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
X-Spam-Status: No, score=-42.8 required=5.0 tests=BAYES_00,GIT_PATCH_0,KAM_DMARC_NONE,KAM_DMARC_STATUS,KAM_LAZY_DOMAIN_SECURITY,KAM_SHORT,SPF_HELO_NONE,SPF_NONE,TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> Hi All,
>
> The current vector extract pattern can only extract from a vector when the
> position to extract is a multiple of the vector bitsize as a whole.
>
> That means extract something like a V2SI from a V4SI vector from position 32
> isn't possible as 32 is not a multiple of 64.  Ideally this optab should have
> worked on multiple of the element size, but too many targets rely on this
> semantic now.
>
> So instead add a new case which allows any extraction as long as the bit pos
> is a multiple of the element size.  We use a VEC_PERM to shuffle the elements
> into the bottom parts of the vector and then use a subreg to extract the values
> out.  This now allows various vector operations that before were being
> decomposed into very inefficient scalar operations.
>
> NOTE: I added 3 testcases, I only fixed the 3rd one.
>
> The 1st one missed because we don't optimize VEC_PERM expressions into
> bitfields.  The 2nd one is missed because extract_bit_field only works on
> vector modes.  In this case the intermediate extract is DImode.
>
> On targets where the scalar mode is tiable to vector modes the extract should
> work fine.
>
> However I ran out of time to fix the first two and so will do so in GCC 14.
> For now this catches the case that my pattern now introduces more easily.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
> and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 	* expmed.cc (extract_bit_field_1): Add support for vector element
> 	extracts.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.target/aarch64/ext_1.c: New.
>
> --- inline copy of patch -- 
> diff --git a/gcc/expmed.cc b/gcc/expmed.cc
> index bab020c07222afa38305ef8d7333f271b1965b78..ffdf65210d17580a216477cfe4ac1598941ac9e4 100644
> --- a/gcc/expmed.cc
> +++ b/gcc/expmed.cc
> @@ -1718,6 +1718,45 @@ extract_bit_field_1 (rtx str_rtx, poly_uint64 bitsize, poly_uint64 bitnum,
>  	      return target;
>  	    }
>  	}
> +      else if (!known_eq (bitnum, 0U)
> +	       && multiple_p (GET_MODE_UNIT_BITSIZE (tmode), bitnum, &pos))
> +	{
> +	  /* The encoding has a single stepped pattern.  */
> +	  poly_uint64 nunits = GET_MODE_NUNITS (new_mode);
> +	  int nelts = nunits.to_constant ();
> +	  vec_perm_builder sel (nunits, nelts, 1);
> +	  int delta = -pos.to_constant ();
> +	  for (int i = 0; i < nelts; ++i)
> +	    sel.quick_push ((i - delta) % nelts);
> +	  vec_perm_indices indices (sel, 1, nunits);

Thanks for doing this, looks good.  But I don't think the to_constant
calls are safe.  new_mode and pos could in principle be nonconstant.

To build a stepped pattern, we just need:

    vec_perm_builder sel (nunits, 1, 3);

and then push pos, pos + 1, and pos + 2 to it.  There's no need to
clamp the position to nelts, it happens automatically.

> +
> +	  if (can_vec_perm_const_p (new_mode, new_mode, indices, false))
> +	    {
> +	      class expand_operand ops[4];
> +	      machine_mode outermode = new_mode;
> +	      machine_mode innermode = tmode;
> +	      enum insn_code icode
> +		= direct_optab_handler (vec_perm_optab, outermode);
> +	      target = gen_reg_rtx (outermode);
> +	      if (icode != CODE_FOR_nothing)
> +		{
> +		  rtx sel = vec_perm_indices_to_rtx (outermode, indices);
> +		  create_output_operand (&ops[0], target, outermode);
> +		  ops[0].target = 1;
> +		  create_input_operand (&ops[1], op0, outermode);
> +		  create_input_operand (&ops[2], op0, outermode);
> +		  create_input_operand (&ops[3], sel, outermode);

I think this should be GET_MODE (sel).  Looks like the current
version would ICE for float vectors.  That said...

> +		  if (maybe_expand_insn (icode, 4, ops))
> +		    return simplify_gen_subreg (innermode, target, outermode, 0);
> +		}
> +	      else if (targetm.vectorize.vec_perm_const != NULL)
> +		{
> +		  if (targetm.vectorize.vec_perm_const (outermode, outermode,
> +							target, op0, op0, indices))
> +		    return simplify_gen_subreg (innermode, target, outermode, 0);
> +		}

...can we use expand_vec_perm_const here?  It will try the constant
expansion first, which is the preferred order.  It also has a few
variations up its sleeve.

Thanks,
Richard


> +	    }
> +	}
>      }
>  
>    /* See if we can get a better vector mode before extracting.  */
> diff --git a/gcc/testsuite/gcc.target/aarch64/ext_1.c b/gcc/testsuite/gcc.target/aarch64/ext_1.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..18a10a14f1161584267a8472e571b3bc2ddf887a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/ext_1.c
> @@ -0,0 +1,54 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +#include <string.h>
> +
> +typedef unsigned int v4si __attribute__((vector_size (16)));
> +typedef unsigned int v2si __attribute__((vector_size (8)));
> +
> +/*
> +** extract: { xfail *-*-* }
> +**	ext	v0.16b, v0.16b, v0.16b, #4
> +**	ret
> +*/
> +v2si extract (v4si x)
> +{
> +    v2si res = {x[1], x[2]};
> +    return res;
> +}
> +
> +/*
> +** extract1: { xfail *-*-* }
> +**	ext	v0.16b, v0.16b, v0.16b, #4
> +**	ret
> +*/
> +v2si extract1 (v4si x)
> +{
> +    v2si res;
> +    memcpy (&res, ((int*)&x)+1, sizeof(res));
> +    return res;
> +}
> +
> +typedef struct cast {
> +  int a;
> +  v2si b __attribute__((packed));
> +} cast_t;
> +
> +typedef union Data {
> +   v4si x;
> +   cast_t y;
> +} data;  
> +
> +/*
> +** extract2:
> +**	ext	v0.16b, v0.16b, v0.16b, #4
> +**	ret
> +*/
> +v2si extract2 (v4si x)
> +{
> +    data d;
> +    d.x = x;
> +    return d.y.b;
> +}
> +