From: Richard Sandiford <richard.sandiford@arm.com>
To: "Andre Vieira (lists)"
Cc: Richard Biener, "Andre Vieira (lists) via Gcc-patches"
Subject: Re: [AArch64] Enable generation of FRINTNZ instructions
Date: Tue, 15 Nov 2022 18:24:18 +0000
In-Reply-To: <1cdffce8-e0e5-3304-52b9-3b736a4d380d@arm.com> (Andre Vieira's
 message of "Mon, 7 Nov 2022 14:19:09 +0000")

"Andre Vieira (lists)" writes:
> On 07/11/2022 11:05, Richard Biener wrote:
>> On Fri, 4 Nov 2022, Andre Vieira (lists) wrote:
>>
>>> Sorry for the delay, I've just been reminded I still had this patch
>>> outstanding from last stage 1.  Hopefully, since it has been mostly
>>> reviewed, it could go in for this stage 1?
>>>
>>> I addressed the comments and gave the slp part of vectorizable_call
>>> some TLC to make it work.
>>>
>>> I also changed vect_get_slp_defs as I noticed that the call from
>>> vectorizable_call was creating an auto_vec with 'nargs' that might
>>> be less than the number of children in the slp_node
>> how so?  Please fix that in the caller.  It looks like it probably
>> should use vect_nargs instead?
> Well, that was my first intuition too, but when I looked at it
> further, the variant it's calling:
>
>   void vect_get_slp_defs (vec_info *, slp_tree slp_node,
>                           vec<vec<tree> > *vec_oprnds, unsigned n)
>
> is actually creating a vector of vectors of slp defs.  So for each
> child of slp_node it calls:
>
>   void vect_get_slp_defs (slp_tree slp_node, vec<tree> *vec_defs)
>
> which returns a vector of vectorized defs.  So vect_nargs would be
> the right size for the inner vec<tree> of vec_defs, but the outer vec
> should have the same number of elements as the original slp_node has
> children.
>
> However, at the call site (vectorizable_call), the operand we pass to
> vect_get_slp_defs, 'vec_defs', is initialized before the code path is
> specialized for slp_node.  I'll go and see if I can change the call
> site so it doesn't have to do that; given the continue at the end of
> the if (slp_node) BB, I don't think it needs to use vec_defs after
> it, but it may require some massaging to be able to define it
> separately for each code path.
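(For reference, a minimal untested sketch of that call-site change,
assuming the outer vector should simply be created with one slot per
SLP child; the variable names here are only illustrative:)

  /* Hypothetical fix in vectorizable_call: size the outer vector by
     the number of SLP children, so that the quick_push calls inside
     vect_get_slp_defs stay within the reservation.  */
  unsigned int nchildren = SLP_TREE_CHILDREN (slp_node).length ();
  vec<vec<tree> > vec_defs = vNULL;
  vec_defs.create (nchildren);
  vect_get_slp_defs (vinfo, slp_node, &vec_defs, nchildren);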
>
>>
>>> , so that quick_push might not be safe as is, so I added the
>>> reserve (n) to ensure it's safe to push.  I didn't actually come
>>> across any failure because of it though.  Happy to split this into
>>> a separate patch if needed.
>>>
>>> Bootstrapped and regression tested on aarch64-none-linux-gnu and
>>> x86_64-pc-linux-gnu.
>>>
>>> OK for trunk?
>> I'll leave final approval to Richard but
>>
>> -     This only needs 1 bit, but occupies the full 16 to ensure a nice
>> +     This only needs 1 bit, but occupies the full 15 to ensure a nice
>>        layout.  */
>>     unsigned int vectorizable : 16;
>>
>> you don't actually change the width of the bitfield.  I would find
>> it more natural to have
>>
>>   signed int type0 : 7;
>>   signed int type0_vtrans : 1;
>>   signed int type1 : 7;
>>   signed int type1_vtrans : 1;
>>
>> with typeN_vtrans specifying how the types transform when vectorized.
>> I would imagine another variant we could need is narrow/widen
>> according to either the result or the other argument type?  That
>> said, just your flag would then be
>>
>>   signed int type0 : 7;
>>   signed int pad : 1;
>>   signed int type1 : 7;
>>   signed int type1_vect_as_scalar : 1;
>>
>> ?
> That's a cool idea!  I'll leave it as a single bit for now like that;
> if we want to re-use it for multiple transformations we will
> obviously need to rename it & give it more bits.

I think we should steal bits from vectorizable rather than shrink
type0 and type1 though.  Then add a 14-bit padding field to show how
many bits are left.

> @@ -3340,9 +3364,20 @@ vectorizable_call (vec_info *vinfo,
>        rhs_type = unsigned_type_node;
>      }
>
> +  /* The argument that is not of the same type as the others.  */
>    int mask_opno = -1;
> +  int scalar_opno = -1;
>    if (internal_fn_p (cfn))
> -    mask_opno = internal_fn_mask_index (as_internal_fn (cfn));
> +    {
> +      internal_fn ifn = as_internal_fn (cfn);
> +      if (direct_internal_fn_p (ifn)
> +	  && direct_internal_fn (ifn).type1_is_scalar_p)
> +	scalar_opno = direct_internal_fn (ifn).type1;
> +      else
> +	/* For masked operations this represents the argument that
> +	   carries the mask.  */
> +	mask_opno = internal_fn_mask_index (as_internal_fn (cfn));

This doesn't seem logically like an else.  We should do both.
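Something like the following (an untested sketch, reusing the names
from the hunk above) is what I mean by doing both:

  int mask_opno = -1;
  int scalar_opno = -1;
  if (internal_fn_p (cfn))
    {
      internal_fn ifn = as_internal_fn (cfn);
      /* For masked operations this is the argument that carries the
	 mask; internal_fn_mask_index returns -1 if there is none.  */
      mask_opno = internal_fn_mask_index (ifn);
      /* Independently of the mask, record any argument that stays
	 scalar when the call is vectorized.  */
      if (direct_internal_fn_p (ifn)
	  && direct_internal_fn (ifn).type1_is_scalar_p)
	scalar_opno = direct_internal_fn (ifn).type1;
    }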
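And to illustrate the direct_internal_fn_info layout suggested
earlier, an untested sketch, keeping the flag name from your patch:

  struct direct_internal_fn_info
  {
    signed int type0 : 8;
    signed int type1 : 8;
    /* This only needs 1 bit; the explicit padding field shows how
       many bits are left while preserving the nice 32-bit layout.  */
    unsigned int vectorizable : 1;
    unsigned int type1_is_scalar_p : 1;
    unsigned int padding : 14;
  };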
LGTM otherwise for the bits outside match.pd.  If Richard's happy
with the match.pd bits then I think the patch is OK with those
changes and without the vect_get_slp_defs thing (as you mentioned
downthread).

Thanks,
Richard

>>
>>> gcc/ChangeLog:
>>>
>>> 	* config/aarch64/aarch64.md (ftrunc<mode><frintnz_mode>2): New
>>> 	pattern.
>>> 	* config/aarch64/iterators.md (FRINTNZ): New iterator.
>>> 	(frintnz_mode): New int attribute.
>>> 	(VSFDF): Make iterator conditional.
>>> 	* internal-fn.def (FTRUNC_INT): New IFN.
>>> 	* internal-fn.cc (ftrunc_int_direct): New define.
>>> 	(expand_ftrunc_int_optab_fn): New custom expander.
>>> 	(direct_ftrunc_int_optab_supported_p): New supported_p.
>>> 	* internal-fn.h (direct_internal_fn_info): Add new member
>>> 	type1_is_scalar_p.
>>> 	* match.pd: Add to the existing TRUNC pattern match.
>>> 	* optabs.def (ftrunc_int): New entry.
>>> 	* stor-layout.h (element_precision): Moved from here...
>>> 	* tree.h (element_precision): ... to here.
>>> 	(element_type): New declaration.
>>> 	* tree.cc (element_type): New function.
>>> 	(element_precision): Changed to use element_type.
>>> 	* tree-vect-stmts.cc (vectorizable_internal_function): Add
>>> 	support for IFNs with different input types.
>>> 	(vect_get_scalar_oprnds): New function.
>>> 	(vectorizable_call): Teach to handle IFN_FTRUNC_INT.
>>> 	* tree-vect-slp.cc (check_scalar_arg_ok): New function.
>>> 	(vect_slp_analyze_node_operations): Use check_scalar_arg_ok.
>>> 	(vect_get_slp_defs): Ensure vec_oprnds has enough slots to
>>> 	push.
>>> 	* doc/md.texi: New entry for ftrunc pattern name.
>>> 	* doc/sourcebuild.texi (aarch64_frintnzx_ok): New target.
>>>
>>> gcc/testsuite/ChangeLog:
>>>
>>> 	* gcc.target/aarch64/merge_trunc1.c: Adapted to skip if frintnz
>>> 	instructions are available.
>>> 	* lib/target-supports.exp: Added aarch64_frintnzx_ok target and
>>> 	aarch64_frintz options.
>>> 	* gcc.target/aarch64/frintnz.c: New test.
>>> 	* gcc.target/aarch64/frintnz_vec.c: New test.
>>> 	* gcc.target/aarch64/frintnz_slp.c: New test.