From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 27 Sep 2013 14:50:00 -0000
From: Vidya Praveen
To: Richard Biener
Cc: "gcc@gcc.gnu.org", "ook@ucw.cz"
Subject: Re: [RFC] Vectorization of indexed elements
Message-ID: <20130927145008.GA861@e103625-lin.cambridge.arm.com>
References: <20130909172533.GA25330@e103625-lin.cambridge.arm.com> <20130924150425.GE22907@e103625-lin.cambridge.arm.com>

On Wed, Sep 25, 2013 at 10:24:56AM +0100, Richard Biener wrote:
> On Tue, 24 Sep 2013, Vidya Praveen wrote:
>
> > On Mon, Sep 09, 2013 at 07:02:52PM +0100, Marc Glisse wrote:
> > > On Mon, 9 Sep 2013, Vidya Praveen wrote:
> > >
> > > > Hello,
> > > >
> > > > This post details some thoughts on
> > > > an enhancement to the vectorizer that
> > > > could take advantage of SIMD instructions that allow an indexed element
> > > > as an operand, thus reducing the need for duplication and possibly improving
> > > > reuse of previously loaded data.
> > > >
> > > > Appreciate your opinion on this.
> > > >
> > > > ---
> > > >
> > > > A phrase like this:
> > > >
> > > >   for(i=0;i<4;i++)
> > > >     a[i] = b[i] <op> c[2];
> > > >
> > > > is usually vectorized as:
> > > >
> > > >   va:V4SI = a[0:3]
> > > >   vb:V4SI = b[0:3]
> > > >   t = c[2]
> > > >   vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at vec_init
> > > >   ...
> > > >   va:V4SI = vb:V4SI <op> vc:V4SI
> > > >
> > > > But this could be simplified further if a target has instructions that support
> > > > an indexed element as a parameter. For example an instruction like this:
> > > >
> > > >   mul v0.4s, v1.4s, v2.4s[2]
> > > >
> > > > can perform multiplication of each element of v1.4s with the third element of
> > > > v2.4s (specified as v2.4s[2]) and store the results in the corresponding
> > > > elements of v0.4s.
> > > >
> > > > For this to happen, the vectorizer needs to understand this idiom and treat the
> > > > operand c[2] specially (taking into consideration whether the machine
> > > > supports an indexed element as an operand for <op>, through a target hook or
> > > > macro) and consider this a vectorizable statement without having to duplicate
> > > > the elements explicitly.
> > > >
> > > > There are a few ways this could be represented in gimple:
> > > >
> > > >   ...
> > > >   va:V4SI = vb:V4SI <op> VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> > > >   ...
> > > >
> > > > or by allowing the vectorizer to treat an indexed element as a valid operand
> > > > in a vectorizable statement:
> > >
> > > Might as well allow any scalar then...
> >
> > Yes, I had given an example below.
> >
> > >
> > > >   ...
> > > >   va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 2)
> > > >   ...
> > > >
> > > > For the sake of explanation, the above two representations assume that
> > > > c[0:3] is loaded in vc for some other use and reused here. But when c[2] is the
> > > > only use of 'c' then it may be safer to just load one element and use it like
> > > > this:
> > > >
> > > >   vc:V4SI[0] = c[2]
> > > >   va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
> > > >
> > > > This could also mean that expressions involving a scalar could be treated
> > > > similarly. For example,
> > > >
> > > >   for(i=0;i<4;i++)
> > > >     a[i] = b[i] <op> c
> > > >
> > > > could be vectorized as:
> > > >
> > > >   vc:V4SI[0] = c
> > > >   va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
> > > >
> > > > Such a change would also require new standard pattern names to be defined for
> > > > each <op>.
> > > >
> > > > Alternatively, having something like this:
> > > >
> > > >   ...
> > > >   vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> > > >   va:V4SI = vb:V4SI <op> vt:V4SI
> > > >   ...
> > > >
> > > > would remove the need to introduce several new standard pattern names and need
> > > > just one to represent vec_duplicate(vec_select()), but of course this will
> > > > expect the target to have combiner patterns.
> > >
> > > The cost estimation wouldn't be very good, but aren't combine patterns
> > > enough for the whole thing? Don't you model your mul instruction as:
> > >
> > > (mult:V4SI
> > >   (match_operand:V4SI)
> > >   (vec_duplicate:V4SI (vec_select:SI (match_operand:V4SI))))
> > >
> > > anyway? Seems that combine should be able to handle it. What currently
> > > happens that we fail to generate the right instruction?
> >
> > At vec_init, I can recognize an idiom in order to generate vec_duplicate but
> > I can't really insist on the single lane load.. something like:
> >
> >   vc:V4SI[0] = c
> >   vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
> >   va:V4SI = vb:V4SI <op> vt:V4SI
> >
> > Or is there any other way to do this?
>
> Can you elaborate on "I can't really insist on the single lane load"?
> What's the single lane load in your example?

Loading just one lane of the vector like this:

  vc:V4SI[0] = c // from the above scalar example

or

  vc:V4SI[0] = c[2]

is what I meant by single lane load. In this example:

  t = c[2]
  ...
  vb:v4si = b[0:3]
  vc:v4si = { t, t, t, t }
  va:v4si = vb:v4si <op> vc:v4si

If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, I cannot
insist that 't' be a vector and that t = c[2] be vect_t[0] = c[2] (which could
be seen as vec_select:SI (vect_t 0)).

> I'd expect the instruction
> pattern as quoted to just work (and I hope we expand an uniform
> constructor { a, a, a, a } properly using vec_duplicate).

As far as I can tell from the code, this is only done through vec_init. It is
not expanded as vec_duplicate from, for example, store_constructor() in expr.c.

VP

>
> Richard.
>
> > Cheers
> > VP
> >
> > >
> > > In gimple, we already have BIT_FIELD_REF for vec_select and CONSTRUCTOR
> > > for vec_duplicate, adding new nodes is always painful.
> > >
> > > > This enhancement could possibly help further optimizing larger scenarios such
> > > > as linear systems.
> > > >
> > > > Regards
> > > > VP
> > >
> > > --
> > > Marc Glisse
> > >
> >
> >
>
> --
> Richard Biener
> SUSE / SUSE Labs
> SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
> GF: Jeff Hawn, Jennifer Guild, Felix Imend
>
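
For reference, the idiom under discussion can be written out in plain C. The
sketch below is only an illustration, not part of any proposed patch: it takes
multiplication as the operation (as in the mul example), assumes the GCC/Clang
vector_size extension, and uses hypothetical function names. It shows the
scalar loop next to the "select lane 2, then build a uniform constructor"
form, and makes no claim about which instruction a given compiler emits.

```c
#include <assert.h>

/* GCC/Clang vector extension: four 32-bit ints in one 128-bit vector. */
typedef int v4si __attribute__((vector_size(16)));

/* Scalar reference: a[i] = b[i] * c[2] for i = 0..3. */
void mul_by_lane_scalar(int a[4], const int b[4], const int c[4])
{
    for (int i = 0; i < 4; i++)
        a[i] = b[i] * c[2];
}

/* Vector form: pick lane 2 of vc, broadcast it, multiply element-wise. */
v4si mul_by_lane_vector(v4si vb, v4si vc)
{
    int t = vc[2];             /* corresponds to vec_select of lane 2 */
    v4si vt = { t, t, t, t };  /* uniform constructor, i.e. a duplicate */
    return vb * vt;
}

/* Returns 1 when both forms produce the same four results. */
int forms_agree(void)
{
    int b[4] = { 1, 2, 3, 4 };
    int c[4] = { 10, 20, 30, 40 };
    int a[4];
    v4si vb = { 1, 2, 3, 4 };
    v4si vc = { 10, 20, 30, 40 };

    mul_by_lane_scalar(a, b, c);
    v4si va = mul_by_lane_vector(vb, vc);

    for (int i = 0; i < 4; i++)
        if (a[i] != va[i])
            return 0;
    return 1;
}

/* Optional self-check; aborts if the two forms disagree. */
void self_check(void)
{
    assert(forms_agree());
}
```

Calling self_check() from a test driver is enough to exercise both forms.
Whether the select-plus-duplicate pair is folded into a single by-element
multiply depends on the target's patterns, which is the question raised in
the thread.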