From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=yLpU=AX=arm.com=richard.sandiford@sourceware.org>
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
	by sourceware.org (Postfix) with ESMTP id D8479385734E
	for <gcc-patches@gcc.gnu.org>; Tue,  2 May 2023 12:02:31 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org D8479385734E
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=arm.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=arm.com
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
	by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id A7964C14;
	Tue,  2 May 2023 05:03:15 -0700 (PDT)
Received: from localhost (e121540-lin.manchester.arm.com [10.32.110.72])
	by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 047813F64C;
	Tue,  2 May 2023 05:02:30 -0700 (PDT)
From: Richard Sandiford <richard.sandiford@arm.com>
To: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>
Mail-Followup-To: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>,gcc Patches <gcc-patches@gcc.gnu.org>, richard.sandiford@arm.com
Cc: gcc Patches <gcc-patches@gcc.gnu.org>
Subject: Re: [aarch64] Code-gen for vector initialization involving constants
References: <CAAgBjMnwGk4fOc3PTM_agTXXFvt=767a3-AWOfSr23Xja6K81w@mail.gmail.com>
	<mpt354o448m.fsf@arm.com>
	<CAAgBjMm2o_64mbmOsnzjZMrLNRozav95sCFR73cz+iA2SLURwg@mail.gmail.com>
	<mpt354fw0dc.fsf@arm.com>
	<CAAgBjM=ryaC9h-9N_WWmPjcqE_fEA7xXPK8wye8TvFXsnMTyHg@mail.gmail.com>
Date: Tue, 02 May 2023 13:02:29 +0100
In-Reply-To: <CAAgBjM=ryaC9h-9N_WWmPjcqE_fEA7xXPK8wye8TvFXsnMTyHg@mail.gmail.com>
	(Prathamesh Kulkarni's message of "Tue, 2 May 2023 15:52:29 +0530")
Message-ID: <mpto7n3ueju.fsf@arm.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
X-Spam-Status: No, score=-30.3 required=5.0 tests=BAYES_00,GIT_PATCH_0,KAM_DMARC_NONE,KAM_DMARC_STATUS,KAM_LAZY_DOMAIN_SECURITY,KAM_SHORT,SPF_HELO_NONE,SPF_NONE,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Tue, 2 May 2023 at 14:56, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>> > [aarch64] Improve code-gen for vector initialization with single constant element.
>> >
>> > gcc/ChangeLog:
>> >       * config/aarch64/aarc64.cc (aarch64_expand_vector_init): Tweak condition
>> >       if (n_var == n_elts && n_elts <= 16) to allow a single constant,
>> >       and if maxv == 1, use constant element for duplicating into register.
>> >
>> > gcc/testsuite/ChangeLog:
>> >       * gcc.target/aarch64/vec-init-single-const.c: New test.
>> >
>> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>> > index 2b0de7ca038..f46750133a6 100644
>> > --- a/gcc/config/aarch64/aarch64.cc
>> > +++ b/gcc/config/aarch64/aarch64.cc
>> > @@ -22167,7 +22167,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>> >       and matches[X][1] with the count of duplicate elements (if X is the
>> >       earliest element which has duplicates).  */
>> >
>> > -  if (n_var == n_elts && n_elts <= 16)
>> > +  if ((n_var >= n_elts - 1) && n_elts <= 16)
>> >      {
>> >        int matches[16][2] = {0};
>> >        for (int i = 0; i < n_elts; i++)
>> > @@ -22227,6 +22227,18 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>> >            vector register.  For big-endian we want that position to hold
>> >            the last element of VALS.  */
>> >         maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
>> > +
>> > +       /* If we have a single constant element, use that for duplicating
>> > +          instead.  */
>> > +       if (n_var == n_elts - 1)
>> > +         for (int i = 0; i < n_elts; i++)
>> > +           if (CONST_INT_P (XVECEXP (vals, 0, i))
>> > +               || CONST_DOUBLE_P (XVECEXP (vals, 0, i)))
>> > +             {
>> > +               maxelement = i;
>> > +               break;
>> > +             }
>> > +
>> >         rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
>> >         aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
>>
>> We don't want to force the constant into a register though.
> OK right, sorry.
> With the attached patch, for the following test-case:
> int64x2_t f_s64(int64_t x)
> {
>   return (int64x2_t) { x, 1 };
> }
>
> it loads constant from memory (same code-gen as without patch).
> f_s64:
>         adrp    x1, .LC0
>         ldr     q0, [x1, #:lo12:.LC0]
>         ins     v0.d[0], x0
>         ret
>
> Does the patch look OK ?
>
> Thanks,
> Prathamesh
> [...]
> [aarch64] Improve code-gen for vector initialization with single constant element.
>
> gcc/ChangeLog:
> 	* config/aarch64/aarc64.cc (aarch64_expand_vector_init): Tweak condition
> 	if (n_var == n_elts && n_elts <= 16) to allow a single constant,
> 	and if maxv == 1, use constant element for duplicating into register.
>
> gcc/testsuite/ChangeLog:
> 	* gcc.target/aarch64/vec-init-single-const.c: New test.
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 2b0de7ca038..97309ddec4f 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -22167,7 +22167,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>       and matches[X][1] with the count of duplicate elements (if X is the
>       earliest element which has duplicates).  */
>  
> -  if (n_var == n_elts && n_elts <= 16)
> +  if ((n_var >= n_elts - 1) && n_elts <= 16)

No need for the extra brackets.

>      {
>        int matches[16][2] = {0};
>        for (int i = 0; i < n_elts; i++)
> @@ -22227,8 +22227,26 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>  	     vector register.  For big-endian we want that position to hold
>  	     the last element of VALS.  */
>  	  maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
> -	  rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> -	  aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
> +
> +	  /* If we have a single constant element, use that for duplicating
> +	     instead.  */
> +	  if (n_var == n_elts - 1)
> +	    for (int i = 0; i < n_elts; i++)
> +	      if (CONST_INT_P (XVECEXP (vals, 0, i))
> +		  || CONST_DOUBLE_P (XVECEXP (vals, 0, i)))
> +		{
> +		  maxelement = i;
> +		  break;
> +		}
> +
> +	  rtx maxval = XVECEXP (vals, 0, maxelement);
> +	  if (!(CONST_INT_P (maxval) || CONST_DOUBLE_P (maxval)))
> +	    {
> +	      rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> +	      aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
> +	    }
> +	  else
> +	    aarch64_emit_move (target, gen_vec_duplicate (mode, maxval));
>  	}
>        else
>  	{

This seems a bit convoluted.  It might be easier to record whether
we see a CONST_INT_P or a CONST_DOUBLE_P during the previous loop,
and if so what the constant is.  Then handle that case first,
as a separate arm of the "if".

> diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c
> new file mode 100644
> index 00000000000..682fd43439a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c
> @@ -0,0 +1,66 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +#include <arm_neon.h>
> +
> +/*
> +** f_s8:
> +**	...
> +**	dup	v[0-9]+\.16b, w[0-9]+
> +**	movi	v[0-9]+\.8b, 0x1
> +**	ins	v[0-9]+\.b\[15\], v[0-9]+\.b\[0\]
> +**	...
> +**	ret

Like with the divide-and-conquer patch, there's nothing that requires
the first two instructions to be in that order.

What is the second ... hiding?  What sequences do we actually generate?

BTW, remember to say how patches were tested :-)

Thanks,
Richard

> +*/
> +
> +int8x16_t f_s8(int8_t x)
> +{
> +  return (int8x16_t) { x, x, x, x, x, x, x, x,
> +                       x, x, x, x, x, x, x, 1 };
> +}
> +
> +/*
> +** f_s16:
> +**	...
> +**	dup	v[0-9]+\.8h, w[0-9]+
> +**	movi	v[0-9]+\.4h, 0x1
> +**	ins	v[0-9]+\.h\[7\], v[0-9]+\.h\[0\]
> +**	...
> +**	ret
> +*/
> +
> +int16x8_t f_s16(int16_t x)
> +{
> +  return (int16x8_t) { x, x, x, x, x, x, x, 1 };
> +}
> +
> +/*
> +** f_s32:
> +**	...
> +**	movi	v[0-9]\.2s, 0x1
> +**	dup	v[0-9]\.4s, w[0-9]+
> +**	ins	v[0-9]+\.s\[3\], v[0-9]+\.s\[0\]
> +**	...
> +**	ret
> +*/
> +
> +int32x4_t f_s32(int32_t x)
> +{
> +  return (int32x4_t) { x, x, x, 1 };
> +}
> +
> +/*
> +** f_s64:
> +**	...
> +**	adrp	x[0-9]+, .LC[0-9]+
> +**	ldr	q[0-9]+, \[x[0-9]+, #:lo12:.LC[0-9]+\]
> +**	ins	v[0-9]+\.d\[0\], x[0-9]+
> +**	...
> +**	ret
> +*/
> +
> +int64x2_t f_s64(int64_t x)
> +{
> +  return (int64x2_t) { x, 1 };
> +}