From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 17960 invoked by alias); 4 Dec 2013 16:10:14 -0000 Mailing-List: contact gcc-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-owner@gcc.gnu.org Received: (qmail 17946 invoked by uid 89); 4 Dec 2013 16:10:13 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-0.7 required=5.0 tests=AWL,BAYES_50,RDNS_NONE,SPF_PASS autolearn=no version=3.3.2 X-HELO: service87.mimecast.com Received: from Unknown (HELO service87.mimecast.com) (91.220.42.44) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Wed, 04 Dec 2013 16:10:12 +0000 Received: from cam-owa2.Emea.Arm.com (fw-tnat.cambridge.arm.com [217.140.96.21]) by service87.mimecast.com; Wed, 04 Dec 2013 16:10:03 +0000 Received: from e103625-lin.cambridge.arm.com ([10.1.255.212]) by cam-owa2.Emea.Arm.com with Microsoft SMTPSVC(6.0.3790.3959); Wed, 4 Dec 2013 16:10:00 +0000 Date: Wed, 04 Dec 2013 16:10:00 -0000 From: Vidya Praveen To: Richard Biener Cc: "gcc@gcc.gnu.org" , "ook@ucw.cz" , "marc.glisse@inria.fr" Subject: Re: [RFC] Vectorization of indexed elements Message-ID: <20131204161000.GB26784@e103625-lin.cambridge.arm.com> References: <20130924150425.GE22907@e103625-lin.cambridge.arm.com> <20130927145008.GA861@e103625-lin.cambridge.arm.com> <20130927151945.GB861@e103625-lin.cambridge.arm.com> <20130930125454.GD3460@e103625-lin.cambridge.arm.com> <20130930140001.GF3460@e103625-lin.cambridge.arm.com> <20131011145408.GB23850@e103625-lin.cambridge.arm.com> MIME-Version: 1.0 In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) X-MC-Unique: 113120416100302401 Content-Type: text/plain; charset=WINDOWS-1252 Content-Transfer-Encoding: quoted-printable Content-Disposition: inline X-IsSubscribed: yes X-SW-Source: 2013-12/txt/msg00031.txt.bz2 Hi Richi, Apologies for the late response. I was on vacation. On Mon, Oct 14, 2013 at 09:04:58AM +0100, Richard Biener wrote: > On Fri, 11 Oct 2013, Vidya Praveen wrote: >=20 > > On Tue, Oct 01, 2013 at 09:26:25AM +0100, Richard Biener wrote: > > > On Mon, 30 Sep 2013, Vidya Praveen wrote: > > >=20 > > > > On Mon, Sep 30, 2013 at 02:19:32PM +0100, Richard Biener wrote: > > > > > On Mon, 30 Sep 2013, Vidya Praveen wrote: > > > > >=20 > > > > > > On Fri, Sep 27, 2013 at 04:19:45PM +0100, Vidya Praveen wrote: > > > > > > > On Fri, Sep 27, 2013 at 03:50:08PM +0100, Vidya Praveen wrote: > > > > > > > [...] > > > > > > > > > > I can't really insist on the single lane load.. somethi= ng like: > > > > > > > > > >=20 > > > > > > > > > > vc:V4SI[0] =3D c > > > > > > > > > > vt:V4SI =3D vec_duplicate:V4SI (vec_select:SI vc:V4SI 0) > > > > > > > > > > va:V4SI =3D vb:V4SI vt:V4SI > > > > > > > > > >=20 > > > > > > > > > > Or is there any other way to do this? > > > > > > > > >=20 > > > > > > > > > Can you elaborate on "I can't really insist on the single= lane load"? > > > > > > > > > What's the single lane load in your example?=20 > > > > > > > >=20 > > > > > > > > Loading just one lane of the vector like this: > > > > > > > >=20 > > > > > > > > vc:V4SI[0] =3D c // from the above scalar example > > > > > > > >=20 > > > > > > > > or=20 > > > > > > > >=20 > > > > > > > > vc:V4SI[0] =3D c[2]=20 > > > > > > > >=20 > > > > > > > > is what I meant by single lane load. In this example: > > > > > > > >=20 > > > > > > > > t =3D c[2]=20 > > > > > > > > ... > > > > > > > > vb:v4si =3D b[0:3]=20 > > > > > > > > vc:v4si =3D { t, t, t, t } > > > > > > > > va:v4si =3D vb:v4si vc:v4si=20 > > > > > > > >=20 > > > > > > > > If we are expanding the CONSTRUCTOR as vec_duplicate at vec= _init, I cannot > > > > > > > > insist 't' to be vector and t =3D c[2] to be vect_t[0] =3D = c[2] (which could be=20 > > > > > > > > seen as vec_select:SI (vect_t 0) ).=20 > > > > > > > >=20 > > > > > > > > > I'd expect the instruction > > > > > > > > > pattern as quoted to just work (and I hope we expand an u= niform > > > > > > > > > constructor { a, a, a, a } properly using vec_duplicate). > > > > > > > >=20 > > > > > > > > As much as I went through the code, this is only done using= vect_init. It is > > > > > > > > not expanded as vec_duplicate from, for example, store_cons= tructor() of expr.c > > > > > > >=20 > > > > > > > Do you see any issues if we expand such constructor as vec_du= plicate directly=20 > > > > > > > instead of going through vect_init way?=20 > > > > > >=20 > > > > > > Sorry, that was a bad question. > > > > > >=20 > > > > > > But here's what I would like to propose as a first step. Please= tell me if this > > > > > > is acceptable or if it makes sense: > > > > > >=20 > > > > > > - Introduce standard pattern names=20 > > > > > >=20 > > > > > > "vmulim4" - vector muliply with second operand as indexed opera= nd > > > > > >=20 > > > > > > Example: > > > > > >=20 > > > > > > (define_insn "vmuliv4si4" > > > > > > [set (match_operand:V4SI 0 "register_operand") > > > > > > (mul:V4SI (match_operand:V4SI 1 "register_operand") > > > > > > (vec_duplicate:V4SI > > > > > > (vec_select:SI > > > > > > (match_operand:V4SI 2 "register_operand") > > > > > > (match_operand:V4SI 3 "immediate_operand)= ))))] > > > > > > ... > > > > > > ) > > > > >=20 > > > > > We could factor this with providing a standard pattern name for > > > > >=20 > > > > > (define_insn "vdupi" > > > > > [set (match_operand: 0 "register_operand") > > > > > (vec_duplicate: > > > > > (vec_select: > > > > > (match_operand: 1 "register_operand") > > > > > (match_operand:SI 2 "immediate_operand))))] > > > >=20 > > > > This is good. I did think about this but then I thought of avoiding= the need > > > > for combiner patterns :-)=20 > > > >=20 > > > > But do you find the lane specific mov pattern I proposed, acceptabl= e?=20 > > >=20 > > > The specific mul pattern? As said, consider factoring to vdupi to > > > avoid an explosion in required special optabs. > > >=20 > > > > > (you use V4SI for the immediate?=20=20 > > > >=20 > > > > Sorry typo again!! It should've been SI. > > > >=20 > > > > > Ideally vdupi has another custom > > > > > mode for the vector index). > > > > >=20 > > > > > Note that this factored pattern is already available as vec_perm_= const! > > > > > It is simply (vec_perm_const:V4SI ). > > > > >=20 > > > > > Which means that on the GIMPLE level we should try to combine > > > > >=20 > > > > > el_4 =3D BIT_FIELD_REF ; > > > > > v_5 =3D { el_4, el_4, ... }; > > > >=20 > > > > I don't think we reach this state at all for the scenarios in discu= ssion. > > > > what we generally have is: > > > >=20 > > > > el_4 =3D MEM_REF < array + index*size > > > > > v_5 =3D { el_4, ... } > > > >=20 > > > > Or am I missing something? > > >=20 > > > Well, but in that case I doubt it is profitable (or even valid!) to > > > turn this into a vector lane load from the array. If it is profitable > > > to perform a vector read (because we're going to use the other elemen= ts > > > of the vector as well) then the vectorizer should produce a vector > > > load and materialize the uniform vector from one of its elements. > > >=20 > > > Maybe at this point you should show us a compilable C testcase > > > with a loop that should be vectorized using your instructions in > > > the end? > >=20 > > Here's a compilable example: > >=20 > > void=20 > > foo (int *__restrict__ a, > > int *__restrict__ b, > > int *__restrict__ c) > > { > > int i; > >=20 > > for (i =3D 0; i < 8; i++) > > a[i] =3D b[i] * c[2]; > > } > >=20 > > This is vectorized by duplicating c[2] now. But I'm trying to take adva= ntage > > of target instructions that can take a vector register as second argume= nt but > > use only one element (by using the same value for all the lanes) of the= =20 > > vector register. > >=20 > > Eg. mul , , [index] > > mla , , [index] // multiply and add > >=20 > > But for a loop like the one in the C example given, I will have to load= the > > c[2] in one element of the vector register (leaving the remaining unuse= d) > > rather. This is why I was proposing to load just one element in a vecto= r=20 > > register (what I meant as "lane specific load"). The benefit of doing t= his is > > that we avoid explicit duplication, however such a simplification can o= nly > > be done where such support is available - the reason why I was thinking= in > > terms of optional standard pattern name. Another benefit is we will als= o be > > able to support scalars in the expression like in the following example: > >=20 > > void > > foo (int *__restrict__ a, > > int *__restrict__ b, > > int c) > > { > > int i; > >=20 > > for (i =3D 0; i < 8; i++) > > a[i] =3D b[i] * c; > > } >=20 > Both cases can be handled by patterns that match >=20 > (mul:VXSI (reg:VXSI > (vec_duplicate:VXSI reg:SI))) How do I arrive at this pattern in the first place? Assuming vec_init with uniform values are expanded as vec_duplicate, it will still be two expressi= ons. That is, (set reg:VXSI (vec_duplicate:VXSI (reg:SI))) (set reg:VXSI (mul:VXSI (reg:VXSI) (reg:VXSI))) > You'd then "consume" the vec_duplicate and implement it as > load scalar into element zero of the vector and use index mult > with index zero. If I understand this correctly, you are suggesting to leave the scalar load from memory as it is but treat the=20 (mul:VXSI (reg:VXSI (vec_duplicate:VXSI reg:SI))) as=20 load reg:VXSI[0], reg:SI mul reg:VXSI, reg:VXSI, re:VXSI[0] // by reusing the destination register p= erhaps either by generating instructions directly or by using define_split. Am I r= ight? If I'm right, then my concern is that it may be possible to simplify this f= urther by loading directly to a indexed vector register from memory but it's too l= ate at this point for such simplification to be possible. Please let me know what am I not understanding. > This handling can be either implemented at combine level (this > is a two-instruction combine only!) or via TER. > > > Another example which can take advantage of the target feature: > >=20 > > void=20 > > foo (int *__restrict__ a, > > int *__restrict__ b, > > int *__restrict__ c) > > { > > int i,j; > >=20 > > for (i =3D 0; i < 8; i++) > > for (j =3D 0; j < 8; j++) > > a[j] +=3D b[j] * c[i]; > > } > >=20 > > This scenario we discussed this earlier (you suggested handling this at= TER). >=20 > Yeah, though for optimal code you'd want to load c[0:8] once with a > vector load. Eventually the vectorizer outer loop vectorization support > can be massaged to handle this optimally. I agree.=20 Thanks VP.