From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-return-180279-listarch-gcc=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 23989 invoked by alias); 27 Sep 2013 14:50:14 -0000
Mailing-List: contact gcc-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc/>
List-Post: <mailto:gcc@gcc.gnu.org>
List-Help: <http://gcc.gnu.org/ml/>
Sender: gcc-owner@gcc.gnu.org
Received: (qmail 23980 invoked by uid 89); 27 Sep 2013 14:50:13 -0000
Received: from service87.mimecast.com (HELO service87.mimecast.com) (91.220.42.44) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Fri, 27 Sep 2013 14:50:13 +0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-0.2 required=5.0 tests=AWL,BAYES_50,RCVD_IN_DNSWL_LOW,RP_MATCHES_RCVD,SPAM_SUBJECT,SPF_PASS autolearn=no version=3.3.2
X-HELO: service87.mimecast.com
Received: from cam-owa1.Emea.Arm.com (fw-tnat.cambridge.arm.com [217.140.96.21]) by service87.mimecast.com; Fri, 27 Sep 2013 15:50:10 +0100
Received: from e103625-lin.cambridge.arm.com ([10.1.255.212]) by cam-owa1.Emea.Arm.com with Microsoft SMTPSVC(6.0.3790.0);	 Fri, 27 Sep 2013 15:50:09 +0100
Date: Fri, 27 Sep 2013 14:50:00 -0000
From: Vidya Praveen <vidyapraveen@arm.com>
To: Richard Biener <rguenther@suse.de>
Cc: "gcc@gcc.gnu.org" <gcc@gcc.gnu.org>, "ook@ucw.cz" <ook@ucw.cz>
Subject: Re: [RFC] Vectorization of indexed elements
Message-ID: <20130927145008.GA861@e103625-lin.cambridge.arm.com>
References: <20130909172533.GA25330@e103625-lin.cambridge.arm.com> <alpine.DEB.2.10.1309091949090.3565@laptop-mg.saclay.inria.fr> <20130924150425.GE22907@e103625-lin.cambridge.arm.com> <alpine.LNX.2.00.1309251123490.29411@zhemvz.fhfr.qr>
MIME-Version: 1.0
In-Reply-To: <alpine.LNX.2.00.1309251123490.29411@zhemvz.fhfr.qr>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-MC-Unique: 113092715501001701
Content-Type: text/plain; charset=WINDOWS-1252
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
X-IsSubscribed: yes
X-SW-Source: 2013-09/txt/msg00238.txt.bz2

On Wed, Sep 25, 2013 at 10:24:56AM +0100, Richard Biener wrote:
> On Tue, 24 Sep 2013, Vidya Praveen wrote:
>
> > On Mon, Sep 09, 2013 at 07:02:52PM +0100, Marc Glisse wrote:
> > > On Mon, 9 Sep 2013, Vidya Praveen wrote:
> > >
> > > > Hello,
> > > >
> > > > This post details some thoughts on an enhancement to the vectorizer that
> > > > could take advantage of the SIMD instructions that allow an indexed element
> > > > as an operand, thus reducing the need for duplication and possibly improving
> > > > reuse of previously loaded data.
> > > >
> > > > Appreciate your opinion on this.
> > > >
> > > > ---
> > > >
> > > > A phrase like this:
> > > >
> > > > for(i=0;i<4;i++)
> > > >   a[i] = b[i] <op> c[2];
> > > >
> > > > is usually vectorized as:
> > > >
> > > >  va:V4SI = a[0:3]
> > > >  vb:V4SI = b[0:3]
> > > >  t = c[2]
> > > >  vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at vec_init
> > > >  ...
> > > >  va:V4SI = vb:V4SI <op> vc:V4SI
> > > >
> > > > But this could be simplified further if a target has instructions that support
> > > > an indexed element as a parameter. For example, an instruction like this:
> > > >
> > > >  mul v0.4s, v1.4s, v2.4s[2]
> > > >
> > > > can perform multiplication of each element of v1.4s with the third element of
> > > > v2.4s (specified as v2.4s[2]) and store the results in the corresponding
> > > > elements of v0.4s.
> > > >
> > > > For this to happen, the vectorizer needs to understand this idiom and treat the
> > > > operand c[2] specially (taking into consideration whether the machine
> > > > supports an indexed element as an operand for <op>, through a target hook or macro)
> > > > and consider this a vectorizable statement without having to duplicate the
> > > > elements explicitly.
> > > >
> > > > There are a few ways this could be represented in gimple:
> > > >
> > > >  ...
> > > >  va:V4SI = vb:V4SI <op> VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> > > >  ...
> > > >
> > > > or by allowing the vectorizer to treat an indexed element as a valid operand in a
> > > > vectorizable statement:
> > >
> > > Might as well allow any scalar then...
> >
> > Yes, I had given an example below.
> >
> > >
> > > >  ...
> > > >  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 2)
> > > >  ...
> > > >
> > > > For the sake of explanation, the above two representations assume that
> > > > c[0:3] is loaded in vc for some other use and reused here. But when c[2] is the
> > > > only use of 'c' then it may be safer to just load one element and use it like
> > > > this:
> > > >
> > > >  vc:V4SI[0] = c[2]
> > > >  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
> > > >
> > > > This could also mean that expressions involving a scalar could be treated
> > > > similarly. For example,
> > > >
> > > >  for(i=0;i<4;i++)
> > > >    a[i] = b[i] <op> c
> > > >
> > > > could be vectorized as:
> > > >
> > > >  vc:V4SI[0] = c
> > > >  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
> > > >
> > > > Such a change would also require new standard pattern names to be defined for
> > > > each <op>.
> > > >
> > > > Alternatively, having something like this:
> > > >
> > > >  ...
> > > >  vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> > > >  va:V4SI = vb:V4SI <op> vt:V4SI
> > > >  ...
> > > >
> > > > would remove the need to introduce several new standard pattern names, leaving
> > > > just one to represent vec_duplicate(vec_select()), but of course this will expect
> > > > the target to have combiner patterns.
> > >
> > > The cost estimation wouldn't be very good, but aren't combine patterns
> > > enough for the whole thing? Don't you model your mul instruction as:
> > >
> > > (mult:V4SI
> > >    (match_operand:V4SI)
> > >    (vec_duplicate:V4SI (vec_select:SI (match_operand:V4SI))))
> > >
> > > anyway? Seems that combine should be able to handle it. What currently
> > > happens that we fail to generate the right instruction?
> >
> > At vec_init, I can recognize an idiom in order to generate vec_duplicate but
> > I can't really insist on the single lane load.. something like:
> >
> > vc:V4SI[0] = c
> > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
> > va:V4SI = vb:V4SI <op> vt:V4SI
> >
> > Or is there any other way to do this?
>
> Can you elaborate on "I can't really insist on the single lane load"?
> What's the single lane load in your example?

Loading just one lane of the vector like this:

vc:V4SI[0] = c // from the above scalar example

or

vc:V4SI[0] = c[2]

is what I meant by single lane load. In this example:

t = c[2]
...
vb:v4si = b[0:3]
vc:v4si = { t, t, t, t }
va:v4si = vb:v4si <op> vc:v4si

If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, I cannot
insist that 't' be a vector and that t = c[2] become vect_t[0] = c[2] (which could
then be seen as vec_select:SI (vect_t 0)).
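
To make the two forms concrete, here is a minimal sketch in C using the
GCC/Clang vector extensions. It assumes <op> is multiplication; the function
names mul_dup and mul_lane are made up for illustration. It only mirrors the
shapes discussed above at source level and says nothing about what the
vectorizer or expand actually emit.

```c
#include <string.h>

/* GCC/Clang vector extension; stands in for the V4SI mode used above. */
typedef int v4si __attribute__((vector_size(16)));

/* Form 1: t = c[2]; vc = { t, t, t, t }; va = vb <op> vc */
void mul_dup(int a[4], const int b[4], const int c[4])
{
    v4si vb, va;
    int t = c[2];                  /* scalar load */
    v4si vc = { t, t, t, t };      /* uniform CONSTRUCTOR */
    memcpy(&vb, b, sizeof vb);     /* vb = b[0:3] */
    va = vb * vc;                  /* <op> assumed to be multiplication */
    memcpy(a, &va, sizeof va);     /* a[0:3] = va */
}

/* Form 2: vc[0] = c[2]; vt = dup (select (vc, 0)); va = vb <op> vt */
void mul_lane(int a[4], const int b[4], const int c[4])
{
    v4si vb, va;
    v4si vc = { 0, 0, 0, 0 };
    vc[0] = c[2];                  /* single lane load into lane 0 */
    v4si vt = { vc[0], vc[0], vc[0], vc[0] };  /* duplicate the selected lane */
    memcpy(&vb, b, sizeof vb);
    va = vb * vt;
    memcpy(a, &va, sizeof va);
}
```

Both functions compute the same result; the difference is only whether the
value to be duplicated starts out as a scalar or as lane 0 of a vector, which
is the distinction I am trying to express.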

> I'd expect the instruction
> pattern as quoted to just work (and I hope we expand an uniform
> constructor { a, a, a, a } properly using vec_duplicate).

As far as I can tell from the code, this is only done through vec_init. It is
not expanded as vec_duplicate from, for example, store_constructor() in expr.c.

VP

>
> Richard.
>
> > Cheers
> > VP
> >
> > >
> > > In gimple, we already have BIT_FIELD_REF for vec_select and CONSTRUCTOR
> > > for vec_duplicate, adding new nodes is always painful.
> > >
> > > > This enhancement could possibly help further optimizing larger scenarios such
> > > > as linear systems.
> > > >
> > > > Regards
> > > > VP
> > >
> > > --
> > > Marc Glisse
> > >
> >
> >
> >
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE / SUSE Labs
> SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
> GF: Jeff Hawn, Jennifer Guild, Felix Imend
>