From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 28378 invoked by alias); 30 Sep 2013 13:01:52 -0000 Mailing-List: contact gcc-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-owner@gcc.gnu.org Received: (qmail 28368 invoked by uid 89); 30 Sep 2013 13:01:51 -0000 Received: from service87.mimecast.com (HELO service87.mimecast.com) (91.220.42.44) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Mon, 30 Sep 2013 13:01:51 +0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-3.2 required=5.0 tests=AWL,BAYES_00,RCVD_IN_DNSWL_LOW,RP_MATCHES_RCVD,SPF_PASS autolearn=ham version=3.3.2 X-HELO: service87.mimecast.com Received: from cam-owa1.Emea.Arm.com (fw-tnat.cambridge.arm.com [217.140.96.21]) by service87.mimecast.com; Mon, 30 Sep 2013 14:01:48 +0100 Received: from e103625-lin.cambridge.arm.com ([10.1.255.212]) by cam-owa1.Emea.Arm.com with Microsoft SMTPSVC(6.0.3790.0); Mon, 30 Sep 2013 14:01:45 +0100 Date: Mon, 30 Sep 2013 13:01:00 -0000 From: Vidya Praveen To: Richard Biener Cc: "gcc@gcc.gnu.org" , "ook@ucw.cz" Subject: Re: [RFC] Vectorization of indexed elements Message-ID: <20130930130145.GE3460@e103625-lin.cambridge.arm.com> References: <20130909172533.GA25330@e103625-lin.cambridge.arm.com> <20130924150343.GD22907@e103625-lin.cambridge.arm.com> MIME-Version: 1.0 In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) X-MC-Unique: 113093014014800201 Content-Type: text/plain; charset=WINDOWS-1252 Content-Transfer-Encoding: quoted-printable Content-Disposition: inline X-IsSubscribed: yes X-SW-Source: 2013-09/txt/msg00249.txt.bz2 On Wed, Sep 25, 2013 at 10:22:05AM +0100, Richard Biener wrote: > On Tue, 24 Sep 2013, Vidya Praveen wrote: >=20 > > On Tue, Sep 10, 2013 at 09:25:32AM +0100, Richard Biener wrote: > > > On Mon, 9 Sep 2013, Marc Glisse wrote: > > >=20 > > > > On Mon, 9 Sep 2013, Vidya Praveen wrote: > > > >=20 > > > > > Hello, > > > > >=20 > > > > > This post details some thoughts on an enhancement to the vectoriz= er that > > > > > could take advantage of the SIMD instructions that allows indexed= element > > > > > as an operand thus reducing the need for duplication and possibly= improve > > > > > reuse of previously loaded data. > > > > >=20 > > > > > Appreciate your opinion on this. > > > > >=20 > > > > > --- > > > > >=20 > > > > > A phrase like this: > > > > >=20 > > > > > for(i=3D0;i<4;i++) > > > > > a[i] =3D b[i] c[2]; > > > > >=20 > > > > > is usually vectorized as: > > > > >=20 > > > > > va:V4SI =3D a[0:3] > > > > > vb:V4SI =3D b[0:3] > > > > > t =3D c[2] > > > > > vc:V4SI =3D { t, t, t, t } // typically expanded as vec_duplicat= e at vec_init > > > > > ... > > > > > va:V4SI =3D vb:V4SI vc:V4SI > > > > >=20 > > > > > But this could be simplified further if a target has instructions= that > > > > > support > > > > > indexed element as a parameter. For example an instruction like t= his: > > > > >=20 > > > > > mul v0.4s, v1.4s, v2.4s[2] > > > > >=20 > > > > > can perform multiplication of each element of v2.4s with the thir= d element > > > > > of > > > > > v2.4s (specified as v2.4s[2]) and store the results in the corres= ponding > > > > > elements of v0.4s. > > > > >=20 > > > > > For this to happen, vectorizer needs to understand this idiom and= treat the > > > > > operand c[2] specially (and by taking in to consideration if the = machine > > > > > supports indexed element as an operand for through a target = hook or > > > > > macro) > > > > > and consider this as vectorizable statement without having to dup= licate the > > > > > elements explicitly. > > > > >=20 > > > > > There are fews ways this could be represented at gimple: > > > > >=20 > > > > > ... > > > > > va:V4SI =3D vb:V4SI VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc= :V4SI 2)) > > > > > ... > > > > >=20 > > > > > or by allowing a vectorizer treat an indexed element as a valid o= perand in a > > > > > vectorizable statement: > > > >=20 > > > > Might as well allow any scalar then... > > >=20 > > > I agree. The VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) form > > > would necessarily be two extra separate statements and thus subject > > > to CSE obfuscating it enough for RTL expansion to no longer notice it. > >=20 > > I also thought about having a specialized expression like > >=20 > > VEC_INDEXED__EXPR < arg0, arg1, arg2, index>=20 > >=20 > > to mean: > >=20 > > arg0 =3D arg1 arg2[index] > >=20 > > and handle it directly in the expander, like (for eg.) how VEC_LSHIFT_E= XPR > > is handled in expr.c. But I dropped this idea since we may need to intr= oduce > > many such nodes. > >=20 > > >=20 > > > That said, allowing mixed scalar/vector ops isn't very nice and > > > your scheme can be simplified by just using > > >=20 > > > vc:V4SI =3D VEC_DUPLICATE_EXPR <...> > > > va:V4SI =3D vb:V4SI vc:V4SI > > >=20 > > > where the expander only has to see that vc:V4SI is defined by > > > a duplicate. > >=20 > > I did try out something like this quickly before I posted this RFC, tho= ugh > > I called it VEC_DUP to mean a equivalent of vec_duplicate(vec_select()) > >=20 > > for:=20 > >=20 > > for(i=3D0;i<8;i++) > > a[i] =3D b[2] * c[i]; > >=20 > > I could generate: > >=20 > > ... > > : > > _88 =3D prolog_loop_adjusted_niters.6_60 * 4; > > vectp_c.13_87 =3D c_10(D) + _88; > > vect_ldidx_.16_92 =3D MEM[(int *)b_8(D) + 8B]; <<= <<<<<< > > vect_idxed_.17_93 =3D (vect_ldidx_.16_92) <<< ??? >>> (0); <<= <<<<<< > > _96 =3D prolog_loop_adjusted_niters.6_60 * 4; > > vectp_a.19_95 =3D a_6(D) + _96; > > vect__12.14_115 =3D MEM[(int *)vectp_c.13_87]; > > vect_patt_40.15_116 =3D vect__12.14_115 * vect_idxed_.17_93; <<= <<<<<< > > MEM[(int *)vectp_a.19_95] =3D vect_patt_40.15_116; <<= <<<<<< > > vectp_c.12_118 =3D vectp_c.13_87 + 16; > > vectp_a.18_119 =3D vectp_a.19_95 + 16; > > ivtmp_120 =3D 1; > > if (ivtmp_120 < bnd.8_62) > > goto ; > > else > > goto ; > >=20 > > : > > # vectp_c.12_89 =3D PHI > > # vectp_a.18_97 =3D PHI > > # ivtmp_14 =3D PHI > > vect__12.14_91 =3D MEM[(int *)vectp_c.12_89]; <<= <<<<<< > > vect_patt_40.15_94 =3D vect__12.14_91 * vect_idxed_.17_93; <<= <<<<<< > > MEM[(int *)vectp_a.18_97] =3D vect_patt_40.15_94; > > ... > >=20 > > It's a crude implementation so VEC_DUP is printed as: > >=20 > > (vect_ldidx_.16_92) <<< ??? >>> (0); > >=20 > >=20 > > > > > ... > > > > > va:V4SI =3D vb:V4SI VEC_SELECT_EXPR (vc:V4SI 2) > > > > > ... > > > > > > > > > > For the sake of explanation, the above two representations assume= s that > > > > > c[0:3] is loaded in vc for some other use and reused here. But wh= en c[2] is > > > > > the > > > > > only use of 'c' then it may be safer to just load one element and= use it > > > > > like > > > > > this: > > > > > > > > > > vc:V4SI[0] =3D c[2] > > > > > va:V4SI =3D vb:V4SI VEC_SELECT_EXPR (vc:V4SI 0) > > > > >=20 > > > > > This could also mean that expressions involving scalar could be t= reated > > > > > similarly. For example, > > > > >=20 > > > > > for(i=3D0;i<4;i++) > > > > > a[i] =3D b[i] c > > > > >=20 > > > > > could be vectorized as: > > > > >=20 > > > > > vc:V4SI[0] =3D c > > > > > va:V4SI =3D vb:V4SI VEC_SELECT_EXPR (vc:V4SI 0) > > > > >=20 > > > > > Such a change would also require new standard pattern names to be= defined > > > > > for > > > > > each . > > > > >=20 > > > > > Alternatively, having something like this: > > > > >=20 > > > > > ... > > > > > vt:V4SI =3D VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) > > > > > va:V4SI =3D vb:V4SI vt:V4SI > > > > > ... > > > > >=20 > > > > > would remove the need to introduce several new standard pattern n= ames but > > > > > have > > > > > just one to represent vec_duplicate(vec_select()) but ofcourse th= is will > > > > > expect > > > > > the target to have combiner patterns. > > > >=20 > > > > The cost estimation wouldn't be very good, but aren't combine patte= rns enough > > > > for the whole thing? Don't you model your mul instruction as: > > > >=20 > > > > (mult:V4SI > > > > (match_operand:V4SI) > > > > (vec_duplicate:V4SI (vec_select:SI (match_operand:V4SI)))) > > > >=20 > > > > anyway? Seems that combine should be able to handle it. What curren= tly happens > > > > that we fail to generate the right instruction? > > > >=20 > > > > In gimple, we already have BIT_FIELD_REF for vec_select and CONSTRU= CTOR for > > > > vec_duplicate, adding new nodes is always painful. > > >=20 > > > True, though CONSTRUCTOR isn't a good vec_duplicate primitive. But y= es, > > > we have it that way at the moment and indeed adding new nodes is alwa= ys > > > painful. > > >=20 > > > > > This enhancement could possibly help further optimizing larger sc= enarios > > > > > such > > > > > as linear systems. > > >=20 > > > Given that the vectorizer already handles all cases you quote but > > > just the expansion doesn't use the targets special abilities - can't > > > you just teach the expander to lookup the definition of the > > > vectors and see if it is an uniform CONSTRUCTOR? > > >=20 > > > Richard. > > >=20 > >=20 > > I did think of handling this as a part of expanding CONSTRUCTOR but I t= hought > > it may not a good idea if we want to enhance this support in the future= to > > handle larger cases like this one (hypothetical example): > >=20 > > for i =3D 0 to 3 > > for j =3D 0 to 3 > > a[j] +=3D b[j] * c[i] > >=20 > > to > >=20=20 > > a[0:3] +=3D b[0:3] + c[0]=20 > > a[0:3] +=3D b[0:3] + c[1]=20 > > a[0:3] +=3D b[0:3] + c[2]=20 > > a[0:3] +=3D b[0:3] + c[3] Actually there was a typo in my previous example, it should've been: a[0:3] +=3D b[0:3] * c[0] a[0:3] +=3D b[0:3] * c[1] a[0:3] +=3D b[0:3] * c[2] a[0:3] +=3D b[0:3] * c[3] > >=20 > > Secondly, I am not sure if introducing a single lane load at the time o= f=20 > > expansion and removing or expecting the existing scalar load to be remo= ved > > later as it is unused, is a good idea.=20 >=20 > The above example requires unrolling and scalar cleanup anyway so you'll > currently see sth like >=20 > tem00_1 =3D c[0]; > tem0_2 =3D { tem00_1, tem00_1, tem00_1, tem00_1 }; > temb_3 =3D b[0:3]; > tems_4 =3D temb_3 + tem0_2; > tema_5 =3D a[0:3]; > tems_6 =3D tema_5 + tems_4; > a[0:3] =3D tems_6; >=20 > which at expansion time can be short-cut easily without even going to > expand the constructor. That's the feature of TER which defers > expansion of SSA name definition statements until they are first > used. In this case when you expand tems_4 you can pattern match > the case you are interested in (not 100% sure which it is, I suppose > it is vector register + vector_select < vector register >) and > simply expand it optimally. The "consumed" SSA names will have no > further uses and will simply not be expanded at all. Thanks. This sounds better. What I'm interested in is: +=3D * vec_duplicate( vec_select( )) Cheers VP