From: Vidya Praveen
To: Jakub Jelinek
Cc: Richard Biener, "gcc@gcc.gnu.org", "ook@ucw.cz", "marc.glisse@inria.fr"
Date: Wed, 04 Dec 2013 17:07:00 -0000
Subject: Re: [RFC] Vectorization of indexed elements
Message-ID: <20131204170721.GC26784@e103625-lin.cambridge.arm.com>
In-Reply-To: <20131011150524.GX30970@tucnak.zalov.cz>

Hi Jakub,

Apologies for the late response.

On Fri, Oct 11, 2013 at 04:05:24PM +0100, Jakub Jelinek wrote:
> On Fri, Oct 11, 2013 at 03:54:08PM +0100, Vidya Praveen wrote:
> > Here's a compilable example:
> >
> > void
> > foo (int *__restrict__ a,
> >      int *__restrict__ b,
> >      int *__restrict__ c)
> > {
> >   int i;
> >
> >   for (i = 0; i < 8; i++)
> >     a[i] = b[i] * c[2];
> > }
> >
> > This is vectorized by duplicating c[2] now. But I'm trying to take advantage
> > of target instructions that can take a vector register as second argument but
> > use only one element (by using the same value for all the lanes) of the
> > vector register.
> >
> > Eg. mul , , [index]
> >     mla , , [index] // multiply and add
> >
> > But for a loop like the one in the C example given, I will have to load the
> > c[2] in one element of the vector register (leaving the remaining unused)
> > rather. This is why I was proposing to load just one element in a vector
> > register (what I meant as "lane specific load"). The benefit of doing this is
> > that we avoid explicit duplication, however such a simplification can only
> > be done where such support is available - the reason why I was thinking in
> > terms of optional standard pattern name. Another benefit is we will also be
> > able to support scalars in the expression like in the following example:
> >
> > void
> > foo (int *__restrict__ a,
> >      int *__restrict__ b,
> >      int c)
> > {
> >   int i;
> >
> >   for (i = 0; i < 8; i++)
> >     a[i] = b[i] * c;
> > }
>
> So just during combine let the broadcast operation be combined with the
> arithmetics?

Yes. I can do that. But I always want it to be possible to recognize and load
directly to the indexed vector register from memory.

> Intel AVX512 ISA has similar feature, not sure what exactly
> they are doing for this.

Thanks. I'll try to go through the code to understand.

> That said, the broadcast is likely going to be
> hoisted before the loop, and in that case is it really cheaper to have
> it unbroadcasted in a vector register rather than to broadcast it before the
> loop and just use there?

Could you explain what you mean by unbroadcasted? The constructor needs to be
expanded in one way or another, doesn't it? I thought expanding to vec_duplicate
when the values are uniform is the most efficient option when vec_duplicate is
supported by the target. If you meant that each element of the vector is loaded
separately, I am wondering how I could combine such an operation with the
arithmetic operation.

Thanks
VP.
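
For reference, the two alternatives discussed above can be modelled in plain C
with the GCC vector extension. This is only a rough sketch under my own
assumptions, not what the vectorizer emits: the names v4si, mul_dup, mul_lane,
foo_dup and foo_lane are made up for illustration, and mul_lane is a scalar
model of a by-element multiply rather than a real target instruction.

```c
#include <assert.h>
#include <string.h>

/* Four-lane int vector (GCC/Clang vector extension). Hypothetical name. */
typedef int v4si __attribute__ ((vector_size (16)));

/* Alternative 1: duplicate the scalar into every lane, then multiply. */
static v4si
mul_dup (v4si b, int s)
{
  v4si dup = { s, s, s, s };
  return b * dup;
}

/* Alternative 2: scalar model of a by-element multiply. Every lane of B
   is multiplied by the single lane C[LANE]; no duplicated copy of the
   scalar is ever built. */
static v4si
mul_lane (v4si b, v4si c, int lane)
{
  v4si r = { 0, 0, 0, 0 };
  assert (lane >= 0 && lane < 4);
  for (int i = 0; i < 4; i++)
    r[i] = b[i] * c[lane];
  return r;
}

/* a[i] = b[i] * c[2] using explicit duplication of c[2]. */
void
foo_dup (int *__restrict__ a, const int *__restrict__ b,
         const int *__restrict__ c)
{
  for (int i = 0; i < 8; i += 4)
    {
      v4si vb, va;
      memcpy (&vb, b + i, sizeof vb);
      va = mul_dup (vb, c[2]);
      memcpy (a + i, &va, sizeof va);
    }
}

/* Same loop, but c[2] is placed in lane 2 only (the "lane specific load");
   the other lanes are never read and are zeroed just to keep the model
   well defined. */
void
foo_lane (int *__restrict__ a, const int *__restrict__ b,
          const int *__restrict__ c)
{
  v4si vc = { 0, 0, 0, 0 };
  vc[2] = c[2];
  for (int i = 0; i < 8; i += 4)
    {
      v4si vb, va;
      memcpy (&vb, b + i, sizeof vb);
      va = mul_lane (vb, vc, 2);
      memcpy (a + i, &va, sizeof va);
    }
}
```

Both functions compute the same result as the scalar loop; the only difference
is whether the multiplier is held as a duplicated vector or as a single lane
that the multiply selects by index.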