From: Richard Sandiford
To: Tamar Christina
Cc: gcc-patches@gcc.gnu.org, nd, Richard Earnshaw, Marcus Shawcroft, Kyrylo Tkachov
Subject: Re: [PATCH 2/2]AArch64: Add better costing for vector constants and operations
Date: Tue, 31 Aug 2021 17:07:14 +0100
In-Reply-To: Tamar Christina's message of "Tue, 31 Aug 2021 15:47:19 +0000"
List-Id: Gcc-patches mailing list

Tamar Christina writes:
>> -----Original Message-----
>> From: Richard Sandiford
>> Sent: Tuesday, August 31, 2021 4:14 PM
>> To: Tamar Christina
>> Cc: gcc-patches@gcc.gnu.org; nd; Richard Earnshaw;
>> Marcus Shawcroft; Kyrylo Tkachov
>> Subject: Re: [PATCH 2/2]AArch64: Add better costing for vector constants
>> and operations
>>
>> Tamar Christina writes:
>> > @@ -13936,8 +13937,65 @@ cost_plus:
>> > 			          mode, MULT, 1, speed);
>> > 	    return true;
>> > 	  }
>> > +	break;
>> > +    case PARALLEL:
>> > +      /* Fall through */
>>
>> Which code paths lead to getting a PARALLEL here?
>
> Hi,
>
> Thanks for the review!
>
> I added it for completeness because CSE treats a PARALLEL and a CONST_VECTOR
> as equivalent when each entry in the PARALLEL defines a constant.

Could you test whether it ever triggers in practice though?
The code would be much simpler without it.

>> > +    case CONST_VECTOR:
>> > +      {
>> > +	rtx gen_insn = aarch64_simd_make_constant (x, true);
>> > +	/* Not a valid const vector.  */
>> > +	if (!gen_insn)
>> > +	  break;
>> >
>> > -      /* Fall through.  */
>> > +	switch (GET_CODE (gen_insn))
>> > +	  {
>> > +	  case CONST_VECTOR:
>> > +	    /* Load using MOVI/MVNI.  */
>> > +	    if (aarch64_simd_valid_immediate (x, NULL))
>> > +	      *cost += extra_cost->vect.movi;
>> > +	    else /* Load using constant pool.  */
>> > +	      *cost += extra_cost->ldst.load;
>> > +	    break;
>> > +	  /* Load using a DUP.  */
>> > +	  case VEC_DUPLICATE:
>> > +	    *cost += extra_cost->vect.dup;
>> > +	    break;
>>
>> Does this trigger in practice?  The new check==true path (rightly) stops
>> the duplicated element from being forced into a register, but then I would
>> have expected:
>>
>>   rtx
>>   gen_vec_duplicate (machine_mode mode, rtx x)
>>   {
>>     if (valid_for_const_vector_p (mode, x))
>>       return gen_const_vec_duplicate (mode, x);
>>     return gen_rtx_VEC_DUPLICATE (mode, x);
>>   }
>>
>> to generate the original CONST_VECTOR again.
>
> Yes, but CSE is trying to see whether using a DUP is cheaper than another
> instruction.
> Normal code won't hit this, but CSE is just costing all the different ways
> one can semantically construct a vector; which RTL actually comes out of it
> depends on how it's folded, as you say.

But what I mean is, you call:

  rtx gen_insn = aarch64_simd_make_constant (x, true);
  /* Not a valid const vector.  */
  if (!gen_insn)
    break;

where aarch64_simd_make_constant does:

  if (CONST_VECTOR_P (vals))
    const_vec = vals;
  else if (GET_CODE (vals) == PARALLEL)
    {
      /* A CONST_VECTOR must contain only CONST_INTs and
	 CONST_DOUBLEs, but CONSTANT_P allows more (e.g. SYMBOL_REF).
	 Only store valid constants in a CONST_VECTOR.  */
      int n_elts = XVECLEN (vals, 0);
      for (i = 0; i < n_elts; ++i)
	{
	  rtx x = XVECEXP (vals, 0, i);
	  if (CONST_INT_P (x) || CONST_DOUBLE_P (x))
	    n_const++;
	}
      if (n_const == n_elts)
	const_vec = gen_rtx_CONST_VECTOR (mode, XVEC (vals, 0));
    }
  else
    gcc_unreachable ();

  if (const_vec != NULL_RTX
      && aarch64_simd_valid_immediate (const_vec, NULL))
    /* Load using MOVI/MVNI.  */
    return const_vec;
  else if ((const_dup = aarch64_simd_dup_constant (vals, check)) != NULL_RTX)
    /* Loaded using DUP.  */
    return const_dup;

and aarch64_simd_dup_constant does:

  machine_mode mode = GET_MODE (vals);
  machine_mode inner_mode = GET_MODE_INNER (mode);
  rtx x;

  if (!const_vec_duplicate_p (vals, &x))
    return NULL_RTX;

  /* We can load this constant by using DUP and a constant in a
     single ARM register.  This will be cheaper than a vector
     load.  */
  if (!check)
    x = copy_to_mode_reg (inner_mode, x);
  return gen_vec_duplicate (mode, x);

For the "check" case, "x" will be a constant, and so gen_vec_duplicate
will call gen_const_vec_duplicate, which will return a CONST_VECTOR.
It didn't seem to be possible for gen_insn to be a VEC_DUPLICATE.
This would be much simpler if we could call aarch64_simd_valid_immediate
and aarch64_simd_dup_constant directly from the rtx cost code, hence the
question about whether the PARALLEL stuff was really needed in practice.

>> > +	  default:
>> > +	    *cost += extra_cost->ldst.load;
>> > +	    break;
>> > +	  }
>> > +	return true;
>> > +      }
>> > +    case VEC_CONCAT:
>> > +      /* depending on the operation, either DUP or INS.
>> > +	 For now, keep default costing.  */
>> > +      break;
>> > +    case VEC_DUPLICATE:
>> > +      *cost += extra_cost->vect.dup;
>> > +      return true;
>> > +    case VEC_SELECT:
>> > +      {
>> > +	/* cost subreg of 0 as free, otherwise as DUP.  */
>> > +	rtx op1 = XEXP (x, 1);
>> > +	int nelts;
>> > +	if ((op1 == const0_rtx && !BYTES_BIG_ENDIAN)
>> > +	    || (BYTES_BIG_ENDIAN
>> > +		&& GET_MODE_NUNITS (mode).is_constant (&nelts)
>> > +		&& INTVAL (op1) == nelts - 1))
>> > +	  ;
>> > +	else if (vec_series_lowpart_p (mode, GET_MODE (op1), op1))
>> > +	  ;
>> > +	else if (vec_series_highpart_p (mode, GET_MODE (op1), op1))
>> > +	  /* Selecting the high part is not technically free, but we lack
>> > +	     enough information to decide that here.  For instance, selecting
>> > +	     the high part of a vec_dup *is* free, as is feeding it into any
>> > +	     _high instruction, but we can't really tell either of those here.
>> > +	     That said, we have a better chance to optimize a dup vs multiple
>> > +	     constants.  */
>> > +	  ;
>>
>> Not sure about this.  We already try to detect the latter case (_high
>> instructions) via aarch64_strip_extend_vec_half.  We might be missing some
>> cases, but that still feels like the right way to go IMO.
>
> That's a different problem from what I understand.  What this is trying to
> say is that if you have a vector [x y a b] and you need the vector [x y],
> you can use the top part of the original vector for this.
>
> This is an approximation, because something that can be created with a movi
> is probably cheaper to keep distinct if it's not going to be paired with a
> _high operation (since you will have a dup then).
>
> The problem is that the front end has already split the two vectors into
> [x y a b] and [x y].  There's nothing else that tries to consolidate them
> back up if both survive.
>
> As a consequence of this, the testcase test0 is not handled optimally.  It
> would instead create 2 vectors, both movi 0x3, just one being 64 bits and
> one being 128 bits.
>
> So if the cost of selecting it is cheaper than the movi, CSE will not
> consolidate the vectors, and because movis are so cheap, the only cost that
> worked was 0.  But increasing the cost of movis requires the costs of
> everything else to be increased as well (including loads).
>
> I preferred to zero out the cost, because the worst that can happen is a
> dup instead of a movi, and at best a dup instead of a load from the
> constant pool (if the constant is complicated).

Hmm, will need to look at this more tomorrow.

>> Selecting the high part of a vec_dup should get folded into another
>> vec_dup.
>>
>> The lowpart bits look OK, but which paths call this function without first
>> simplifying the select to a subreg?  The subreg is now the canonical form
>> (thanks to r12-2288).
>
> The simplification will happen during folding in cse or in combine.  This
> costing happens before the folding, when CSE is trying to decide whether to
> undo the front end's lowering of constants.
>
> To do so, it models the constants and the semantic operation required to
> extract them.  E.g. to get 2 out of [0 2 4 5] it would need a VEC_SELECT
> with index 1.  And I don't treat the first element/bottom part specially
> here; costing-wise they would be the same.

But which code path creates the VEC_SELECT?  We don't need any context to
know that the VEC_SELECT is non-canonical.  It's obvious from the operands
of the VEC_SELECT in isolation.

I'd just rather tackle this at source than try to get the cost code to
handle non-canonical rtl.

Thanks,
Richard