From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-return-181274-listarch-gcc=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 21074 invoked by alias); 4 Dec 2013 17:07:37 -0000
Mailing-List: contact gcc-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc/>
List-Post: <mailto:gcc@gcc.gnu.org>
List-Help: <http://gcc.gnu.org/ml/>
Sender: gcc-owner@gcc.gnu.org
Received: (qmail 21059 invoked by uid 89); 4 Dec 2013 17:07:36 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-0.6 required=5.0 tests=AWL,BAYES_50,RDNS_NONE,SPF_PASS autolearn=no version=3.3.2
X-HELO: service87.mimecast.com
Received: from Unknown (HELO service87.mimecast.com) (91.220.42.44) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Wed, 04 Dec 2013 17:07:33 +0000
Received: from cam-owa2.Emea.Arm.com (fw-tnat.cambridge.arm.com [217.140.96.21]) by service87.mimecast.com; Wed, 04 Dec 2013 17:07:24 +0000
Received: from e103625-lin.cambridge.arm.com ([10.1.255.212]) by cam-owa2.Emea.Arm.com with Microsoft SMTPSVC(6.0.3790.3959);	 Wed, 4 Dec 2013 17:07:21 +0000
Date: Wed, 04 Dec 2013 17:07:00 -0000
From: Vidya Praveen <vidyapraveen@arm.com>
To: Jakub Jelinek <jakub@redhat.com>
Cc: Richard Biener <rguenther@suse.de>, "gcc@gcc.gnu.org" <gcc@gcc.gnu.org>,	"ook@ucw.cz" <ook@ucw.cz>,	"marc.glisse@inria.fr" <marc.glisse@inria.fr>
Subject: Re: [RFC] Vectorization of indexed elements
Message-ID: <20131204170721.GC26784@e103625-lin.cambridge.arm.com>
References: <20130924150425.GE22907@e103625-lin.cambridge.arm.com> <alpine.LNX.2.00.1309251123490.29411@zhemvz.fhfr.qr> <20130927145008.GA861@e103625-lin.cambridge.arm.com> <20130927151945.GB861@e103625-lin.cambridge.arm.com> <20130930125454.GD3460@e103625-lin.cambridge.arm.com> <alpine.LNX.2.00.1309301504120.5759@zhemvz.fhfr.qr> <20130930140001.GF3460@e103625-lin.cambridge.arm.com> <alpine.LNX.2.00.1310011022420.5759@zhemvz.fhfr.qr> <20131011145408.GB23850@e103625-lin.cambridge.arm.com> <20131011150524.GX30970@tucnak.zalov.cz>
MIME-Version: 1.0
In-Reply-To: <20131011150524.GX30970@tucnak.zalov.cz>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-MC-Unique: 113120417072400401
Content-Type: text/plain; charset=WINDOWS-1252
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
X-IsSubscribed: yes
X-SW-Source: 2013-12/txt/msg00033.txt.bz2

Hi Jakub,

Apologies for the late response.

On Fri, Oct 11, 2013 at 04:05:24PM +0100, Jakub Jelinek wrote:
> On Fri, Oct 11, 2013 at 03:54:08PM +0100, Vidya Praveen wrote:
> > Here's a compilable example:
> >
> > void
> > foo (int *__restrict__ a,
> >      int *__restrict__ b,
> >      int *__restrict__ c)
> > {
> >   int i;
> >
> >   for (i = 0; i < 8; i++)
> >     a[i] = b[i] * c[2];
> > }
> >
> > This is vectorized by duplicating c[2] now. But I'm trying to take advantage
> > of target instructions that can take a vector register as second argument but
> > use only one element (by using the same value for all the lanes) of the
> > vector register.
> >
> > Eg. mul <vec-reg>, <vec-reg>, <vec-reg>[index]
> >     mla <vec-reg>, <vec-reg>, <vec-reg>[index] // multiply and add
> >
> > But for a loop like the one in the C example given, I will have to load the
> > c[2] in one element of the vector register (leaving the remaining unused)
> > rather. This is why I was proposing to load just one element in a vector
> > register (what I meant as "lane specific load"). The benefit of doing this is
> > that we avoid explicit duplication, however such a simplification can only
> > be done where such support is available - the reason why I was thinking in
> > terms of optional standard pattern name. Another benefit is we will also be
> > able to support scalars in the expression like in the following example:
> >
> > void
> > foo (int *__restrict__ a,
> >      int *__restrict__ b,
> >      int c)
> > {
> >   int i;
> >
> >   for (i = 0; i < 8; i++)
> >     a[i] = b[i] * c;
> > }
>
> So just during combine let the broadcast operation be combined with the
> arithmetics?

Yes, I can do that. But I would still like it to be possible to recognize
this case and load directly from memory into the indexed element of the
vector register.
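
As a rough illustration of the difference, here is a minimal scalar model in
plain C. It is only a sketch of the semantics, not GCC code and not real
vector instructions. The type v4si, the 4-lane width, and the helpers
load_lane, dup_scalar, mul and mul_by_lane are all hypothetical names made up
for this example. foo_dup models what the vectorizer does today (duplicate
c[2] into every lane), and foo_lane models the by-element form, where c[2] is
loaded into a single lane and the multiply picks that lane.

```c
#include <assert.h>
#include <string.h>

#define LANES 4   /* assumed vector width for this sketch */

/* Hypothetical model of a 4 x int vector register. */
typedef struct { int lane[LANES]; } v4si;

/* "Lane specific load": write *p into one lane, leave the others untouched. */
static v4si load_lane(v4si v, const int *p, int index)
{
    assert(index >= 0 && index < LANES);
    v.lane[index] = *p;
    return v;
}

/* Explicit duplication: every lane gets the same scalar. */
static v4si dup_scalar(int x)
{
    v4si r;
    for (int i = 0; i < LANES; i++)
        r.lane[i] = x;
    return r;
}

/* Plain element-wise multiply. */
static v4si mul(v4si a, v4si b)
{
    v4si r;
    for (int i = 0; i < LANES; i++)
        r.lane[i] = a.lane[i] * b.lane[i];
    return r;
}

/* Models "mul <vec-reg>, <vec-reg>, <vec-reg>[index]":
   every lane of a is multiplied by the single element b[index]. */
static v4si mul_by_lane(v4si a, v4si b, int index)
{
    assert(index >= 0 && index < LANES);
    v4si r;
    for (int i = 0; i < LANES; i++)
        r.lane[i] = a.lane[i] * b.lane[index];
    return r;
}

/* Current scheme: broadcast c[2] once, then a normal vector multiply. */
void foo_dup(int *restrict a, const int *restrict b, const int *restrict c)
{
    v4si vc = dup_scalar(c[2]);
    for (int i = 0; i < 8; i += LANES) {
        v4si vb;
        memcpy(vb.lane, b + i, sizeof vb.lane);
        v4si r = mul(vb, vc);
        memcpy(a + i, r.lane, sizeof r.lane);
    }
}

/* Proposed scheme: load c[2] into lane 0 only, multiply by that lane. */
void foo_lane(int *restrict a, const int *restrict b, const int *restrict c)
{
    v4si vc = {{0}};
    vc = load_lane(vc, &c[2], 0);   /* lanes 1..3 are never read */
    for (int i = 0; i < 8; i += LANES) {
        v4si vb;
        memcpy(vb.lane, b + i, sizeof vb.lane);
        v4si r = mul_by_lane(vb, vc, 0);
        memcpy(a + i, r.lane, sizeof r.lane);
    }
}
```

Both functions compute the same a[] for the same inputs; the only difference
in the model is that foo_lane never needs the explicit duplication.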


> Intel AVX512 ISA has similar feature, not sure what exactly
> they are doing for this.

Thanks. I'll try to go through the code to understand.

> That said, the broadcast is likely going to be
> hoisted before the loop, and in that case is it really cheaper to have
> it unbroadcasted in a vector register rather than to broadcast it before the
> loop and just use there?

Could you explain what you mean by unbroadcast? The constructor needs to be
expanded one way or another, doesn't it? I thought expanding to vec_duplicate
when the values are uniform is the most efficient option when the target
supports vec_duplicate. If you meant that each element of the vector is
loaded separately, I am wondering how I could combine such an operation with
the arithmetic operation.
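
To show the two expansions I am comparing, here is a small plain C model.
Again this is only a sketch under my own assumptions: v4si, set_lane,
expand_dup, expand_by_element and is_uniform are hypothetical names, the
vector has 4 lanes, and the operation count is just a count of modelled
steps, not a cost measurement on any target.

```c
#include <assert.h>

#define LANES 4   /* assumed vector width for this sketch */

/* Hypothetical model of a 4 x int vector register. */
typedef struct { int lane[LANES]; } v4si;

/* Models inserting one scalar into one lane of a vector. */
v4si set_lane(v4si v, int index, int x)
{
    assert(index >= 0 && index < LANES);
    v.lane[index] = x;
    return v;
}

/* Models a single vec_duplicate-style operation: one step, all lanes = x. */
v4si expand_dup(int x)
{
    v4si r;
    for (int i = 0; i < LANES; i++)
        r.lane[i] = x;
    return r;
}

/* Models expanding the constructor one element at a time.
   *n_ops receives the number of lane insertions performed. */
v4si expand_by_element(const int *elts, int *n_ops)
{
    v4si r = {{0}};
    *n_ops = 0;
    for (int i = 0; i < LANES; i++) {
        r = set_lane(r, i, elts[i]);
        (*n_ops)++;
    }
    return r;
}

/* Returns 1 when all elements are equal, i.e. when the single
   duplicate above is a valid expansion of the constructor. */
int is_uniform(const int *elts)
{
    for (int i = 1; i < LANES; i++)
        if (elts[i] != elts[0])
            return 0;
    return 1;
}
```

For a uniform constructor both expansions give the same vector, but the
element-wise one takes LANES modelled insertions where the duplicate takes
one step, which is why I assumed vec_duplicate is the better expansion.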

Thanks
VP.