From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-return-180290-listarch-gcc=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 28378 invoked by alias); 30 Sep 2013 13:01:52 -0000
Mailing-List: contact gcc-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc/>
List-Post: <mailto:gcc@gcc.gnu.org>
List-Help: <http://gcc.gnu.org/ml/>
Sender: gcc-owner@gcc.gnu.org
Received: (qmail 28368 invoked by uid 89); 30 Sep 2013 13:01:51 -0000
Received: from service87.mimecast.com (HELO service87.mimecast.com) (91.220.42.44) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Mon, 30 Sep 2013 13:01:51 +0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-3.2 required=5.0 tests=AWL,BAYES_00,RCVD_IN_DNSWL_LOW,RP_MATCHES_RCVD,SPF_PASS autolearn=ham version=3.3.2
X-HELO: service87.mimecast.com
Received: from cam-owa1.Emea.Arm.com (fw-tnat.cambridge.arm.com [217.140.96.21]) by service87.mimecast.com; Mon, 30 Sep 2013 14:01:48 +0100
Received: from e103625-lin.cambridge.arm.com ([10.1.255.212]) by cam-owa1.Emea.Arm.com with Microsoft SMTPSVC(6.0.3790.0);	 Mon, 30 Sep 2013 14:01:45 +0100
Date: Mon, 30 Sep 2013 13:01:00 -0000
From: Vidya Praveen <vidyapraveen@arm.com>
To: Richard Biener <rguenther@suse.de>
Cc: "gcc@gcc.gnu.org" <gcc@gcc.gnu.org>, "ook@ucw.cz" <ook@ucw.cz>
Subject: Re: [RFC] Vectorization of indexed elements
Message-ID: <20130930130145.GE3460@e103625-lin.cambridge.arm.com>
References: <20130909172533.GA25330@e103625-lin.cambridge.arm.com> <alpine.DEB.2.10.1309091949090.3565@laptop-mg.saclay.inria.fr> <alpine.LNX.2.00.1309101019060.3869@zhemvz.fhfr.qr> <20130924150343.GD22907@e103625-lin.cambridge.arm.com> <alpine.LNX.2.00.1309251116040.29411@zhemvz.fhfr.qr>
MIME-Version: 1.0
In-Reply-To: <alpine.LNX.2.00.1309251116040.29411@zhemvz.fhfr.qr>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-MC-Unique: 113093014014800201
Content-Type: text/plain; charset=WINDOWS-1252
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
X-IsSubscribed: yes
X-SW-Source: 2013-09/txt/msg00249.txt.bz2

On Wed, Sep 25, 2013 at 10:22:05AM +0100, Richard Biener wrote:
> On Tue, 24 Sep 2013, Vidya Praveen wrote:
>=20
> > On Tue, Sep 10, 2013 at 09:25:32AM +0100, Richard Biener wrote:
> > > On Mon, 9 Sep 2013, Marc Glisse wrote:
> > >=20
> > > > On Mon, 9 Sep 2013, Vidya Praveen wrote:
> > > >=20
> > > > > Hello,
> > > > >=20
> > > > > This post details some thoughts on an enhancement to the vectoriz=
er that
> > > > > could take advantage of the SIMD instructions that allows indexed=
 element
> > > > > as an operand thus reducing the need for duplication and possibly=
 improve
> > > > > reuse of previously loaded data.
> > > > >=20
> > > > > Appreciate your opinion on this.
> > > > >=20
> > > > > ---
> > > > >=20
> > > > > A phrase like this:
> > > > >=20
> > > > > for(i=3D0;i<4;i++)
> > > > >   a[i] =3D b[i] <op> c[2];
> > > > >=20
> > > > > is usually vectorized as:
> > > > >=20
> > > > >  va:V4SI =3D a[0:3]
> > > > >  vb:V4SI =3D b[0:3]
> > > > >  t =3D c[2]
> > > > >  vc:V4SI =3D { t, t, t, t } // typically expanded as vec_duplicat=
e at vec_init
> > > > >  ...
> > > > >  va:V4SI =3D vb:V4SI <op> vc:V4SI
> > > > >=20
> > > > > But this could be simplified further if a target has instructions=
 that
> > > > > support
> > > > > indexed element as a parameter. For example an instruction like t=
his:
> > > > >=20
> > > > >  mul v0.4s, v1.4s, v2.4s[2]
> > > > >=20
> > > > > can perform multiplication of each element of v2.4s with the thir=
d element
> > > > > of
> > > > > v2.4s (specified as v2.4s[2]) and store the results in the corres=
ponding
> > > > > elements of v0.4s.
> > > > >=20
> > > > > For this to happen, vectorizer needs to understand this idiom and=
 treat the
> > > > > operand c[2] specially (and by taking in to consideration if the =
machine
> > > > > supports indexed element as an operand for <op> through a target =
hook or
> > > > > macro)
> > > > > and consider this as vectorizable statement without having to dup=
licate the
> > > > > elements explicitly.
> > > > >=20
> > > > > There are fews ways this could be represented at gimple:
> > > > >=20
> > > > >  ...
> > > > >  va:V4SI =3D vb:V4SI <op> VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc=
:V4SI 2))
> > > > >  ...
> > > > >=20
> > > > > or by allowing a vectorizer treat an indexed element as a valid o=
perand in a
> > > > > vectorizable statement:
> > > >=20
> > > > Might as well allow any scalar then...
> > >=20
> > > I agree.  The VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) form
> > > would necessarily be two extra separate statements and thus subject
> > > to CSE obfuscating it enough for RTL expansion to no longer notice it.
> >=20
> > I also thought about having a specialized expression like
> >=20
> > VEC_INDEXED_<op>_EXPR < arg0, arg1, arg2, index>=20
> >=20
> > to mean:
> >=20
> > arg0 =3D arg1 <op> arg2[index]
> >=20
> > and handle it directly in the expander, like (for eg.) how VEC_LSHIFT_E=
XPR
> > is handled in expr.c. But I dropped this idea since we may need to intr=
oduce
> > many such nodes.
> >=20
> > >=20
> > > That said, allowing mixed scalar/vector ops isn't very nice and
> > > your scheme can be simplified by just using
> > >=20
> > >   vc:V4SI =3D VEC_DUPLICATE_EXPR <...>
> > >   va:V4SI =3D vb:V4SI <op> vc:V4SI
> > >=20
> > > where the expander only has to see that vc:V4SI is defined by
> > > a duplicate.
> >=20
> > I did try out something like this quickly before I posted this RFC, tho=
ugh
> > I called it VEC_DUP to mean a equivalent of vec_duplicate(vec_select())
> >=20
> > for:=20
> >=20
> >   for(i=3D0;i<8;i++)
> >     a[i] =3D b[2] * c[i];
> >=20
> > I could generate:
> >=20
> >   ...
> >   <bb 8>:
> >   _88 =3D prolog_loop_adjusted_niters.6_60 * 4;
> >   vectp_c.13_87 =3D c_10(D) + _88;
> >   vect_ldidx_.16_92 =3D MEM[(int *)b_8(D) + 8B];                     <<=
<<<<<<
> >   vect_idxed_.17_93 =3D (vect_ldidx_.16_92) <<< ??? >>> (0);         <<=
<<<<<<
> >   _96 =3D prolog_loop_adjusted_niters.6_60 * 4;
> >   vectp_a.19_95 =3D a_6(D) + _96;
> >   vect__12.14_115 =3D MEM[(int *)vectp_c.13_87];
> >   vect_patt_40.15_116 =3D vect__12.14_115 * vect_idxed_.17_93;       <<=
<<<<<<
> >   MEM[(int *)vectp_a.19_95] =3D vect_patt_40.15_116;                 <<=
<<<<<<
> >   vectp_c.12_118 =3D vectp_c.13_87 + 16;
> >   vectp_a.18_119 =3D vectp_a.19_95 + 16;
> >   ivtmp_120 =3D 1;
> >   if (ivtmp_120 < bnd.8_62)
> >     goto <bb 9>;
> >   else
> >     goto <bb 11>;
> >=20
> >   <bb 9>:
> >   # vectp_c.12_89 =3D PHI <vectp_c.12_118(8)>
> >   # vectp_a.18_97 =3D PHI <vectp_a.18_119(8)>
> >   # ivtmp_14 =3D PHI <ivtmp_120(8)>
> >   vect__12.14_91 =3D MEM[(int *)vectp_c.12_89];                      <<=
<<<<<<
> >   vect_patt_40.15_94 =3D vect__12.14_91 * vect_idxed_.17_93;         <<=
<<<<<<
> >   MEM[(int *)vectp_a.18_97] =3D vect_patt_40.15_94;
> >   ...
> >=20
> > It's a crude implementation so VEC_DUP is printed as:
> >=20
> >   (vect_ldidx_.16_92) <<< ??? >>> (0);
> >=20
> >=20
> > > > >  ...
> > > > >  va:V4SI =3D vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 2)
> > > > >  ...
> > > > >
> > > > > For the sake of explanation, the above two representations assume=
s that
> > > > > c[0:3] is loaded in vc for some other use and reused here. But wh=
en c[2] is
> > > > > the
> > > > > only use of 'c' then it may be safer to just load one element and=
 use it
> > > > > like
> > > > > this:
> > > > >
> > > > >  vc:V4SI[0] =3D c[2]
> > > > >  va:V4SI =3D vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
> > > > >=20
> > > > > This could also mean that expressions involving scalar could be t=
reated
> > > > > similarly. For example,
> > > > >=20
> > > > >  for(i=3D0;i<4;i++)
> > > > >    a[i] =3D b[i] <op> c
> > > > >=20
> > > > > could be vectorized as:
> > > > >=20
> > > > >  vc:V4SI[0] =3D c
> > > > >  va:V4SI =3D vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
> > > > >=20
> > > > > Such a change would also require new standard pattern names to be=
 defined
> > > > > for
> > > > > each <op>.
> > > > >=20
> > > > > Alternatively, having something like this:
> > > > >=20
> > > > >  ...
> > > > >  vt:V4SI =3D VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> > > > >  va:V4SI =3D vb:V4SI <op> vt:V4SI
> > > > >  ...
> > > > >=20
> > > > > would remove the need to introduce several new standard pattern n=
ames but
> > > > > have
> > > > > just one to represent vec_duplicate(vec_select()) but ofcourse th=
is will
> > > > > expect
> > > > > the target to have combiner patterns.
> > > >=20
> > > > The cost estimation wouldn't be very good, but aren't combine patte=
rns enough
> > > > for the whole thing? Don't you model your mul instruction as:
> > > >=20
> > > > (mult:V4SI
> > > >   (match_operand:V4SI)
> > > >   (vec_duplicate:V4SI (vec_select:SI (match_operand:V4SI))))
> > > >=20
> > > > anyway? Seems that combine should be able to handle it. What curren=
tly happens
> > > > that we fail to generate the right instruction?
> > > >=20
> > > > In gimple, we already have BIT_FIELD_REF for vec_select and CONSTRU=
CTOR for
> > > > vec_duplicate, adding new nodes is always painful.
> > >=20
> > > True, though CONSTRUCTOR isn't a good vec_duplicate primitive.  But y=
es,
> > > we have it that way at the moment and indeed adding new nodes is alwa=
ys
> > > painful.
> > >=20
> > > > > This enhancement could possibly help further optimizing larger sc=
enarios
> > > > > such
> > > > > as linear systems.
> > >=20
> > > Given that the vectorizer already handles all cases you quote but
> > > just the expansion doesn't use the targets special abilities - can't
> > > you just teach the expander to lookup the definition of the
> > > vectors and see if it is an uniform CONSTRUCTOR?
> > >=20
> > > Richard.
> > >=20
> >=20
> > I did think of handling this as a part of expanding CONSTRUCTOR but I t=
hought
> > it may not a good idea if we want to enhance this support in the future=
 to
> > handle larger cases like this one (hypothetical example):
> >=20
> >   for i =3D 0 to 3
> >     for j =3D 0 to 3
> >       a[j] +=3D b[j] * c[i]
> >=20
> > to
> >=20=20
> >   a[0:3] +=3D b[0:3] + c[0]=20
> >   a[0:3] +=3D b[0:3] + c[1]=20
> >   a[0:3] +=3D b[0:3] + c[2]=20
> >   a[0:3] +=3D b[0:3] + c[3]

Actually there was a typo in my previous example, it should've been:

   a[0:3] +=3D b[0:3] * c[0]
   a[0:3] +=3D b[0:3] * c[1]
   a[0:3] +=3D b[0:3] * c[2]
   a[0:3] +=3D b[0:3] * c[3]

> >=20
> > Secondly, I am not sure if introducing a single lane load at the time o=
f=20
> > expansion and removing or expecting the existing scalar load to be remo=
ved
> > later as it is unused, is a good idea.=20
>=20
> The above example requires unrolling and scalar cleanup anyway so you'll
> currently see sth like
>=20
> tem00_1 =3D c[0];
> tem0_2 =3D { tem00_1, tem00_1, tem00_1, tem00_1 };
> temb_3 =3D b[0:3];
> tems_4 =3D temb_3 + tem0_2;
> tema_5 =3D a[0:3];
> tems_6 =3D tema_5 + tems_4;
> a[0:3] =3D tems_6;
>=20
> which at expansion time can be short-cut easily without even going to
> expand the constructor.  That's the feature of TER which defers
> expansion of SSA name definition statements until they are first
> used.  In this case when you expand tems_4 you can pattern match
> the case you are interested in (not 100% sure which it is, I suppose
> it is vector register + vector_select < vector register >) and
> simply expand it optimally.  The "consumed" SSA names will have no
> further uses and will simply not be expanded at all.

Thanks. This sounds better. What I'm interested in is:

<vecreg> +=3D <vecreg> * vec_duplicate( vec_select( <vecreg> <index>))

Cheers
VP