Date: Fri, 11 Oct 2013 14:54:00 -0000
From: Vidya Praveen
To: Richard Biener
Cc: "gcc@gcc.gnu.org", "ook@ucw.cz", marc.glisse@inria.fr
Subject: Re: [RFC] Vectorization of indexed elements
Message-ID: <20131011145408.GB23850@e103625-lin.cambridge.arm.com>
References: <20130909172533.GA25330@e103625-lin.cambridge.arm.com>
 <20130924150425.GE22907@e103625-lin.cambridge.arm.com>
 <20130927145008.GA861@e103625-lin.cambridge.arm.com>
 <20130927151945.GB861@e103625-lin.cambridge.arm.com>
 <20130930125454.GD3460@e103625-lin.cambridge.arm.com>
 <20130930140001.GF3460@e103625-lin.cambridge.arm.com>

On Tue, Oct 01, 2013 at 09:26:25AM +0100, Richard Biener wrote:
> On Mon, 30 Sep 2013,
Vidya Praveen wrote:
> 
> > On Mon, Sep 30, 2013 at 02:19:32PM +0100, Richard Biener wrote:
> > > On Mon, 30 Sep 2013, Vidya Praveen wrote:
> > > 
> > > > On Fri, Sep 27, 2013 at 04:19:45PM +0100, Vidya Praveen wrote:
> > > > > On Fri, Sep 27, 2013 at 03:50:08PM +0100, Vidya Praveen wrote:
> > > > > [...]
> > > > > > > > I can't really insist on the single lane load.. something like:
> > > > > > > > 
> > > > > > > > vc:V4SI[0] = c
> > > > > > > > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
> > > > > > > > va:V4SI = vb:V4SI <op> vt:V4SI
> > > > > > > > 
> > > > > > > > Or is there any other way to do this?
> > > > > > > 
> > > > > > > Can you elaborate on "I can't really insist on the single lane load"?
> > > > > > > What's the single lane load in your example?
> > > > > > 
> > > > > > Loading just one lane of the vector like this:
> > > > > > 
> > > > > > vc:V4SI[0] = c // from the above scalar example
> > > > > > 
> > > > > > or
> > > > > > 
> > > > > > vc:V4SI[0] = c[2]
> > > > > > 
> > > > > > is what I meant by single lane load. In this example:
> > > > > > 
> > > > > > t = c[2]
> > > > > > ...
> > > > > > vb:v4si = b[0:3]
> > > > > > vc:v4si = { t, t, t, t }
> > > > > > va:v4si = vb:v4si <op> vc:v4si
> > > > > > 
> > > > > > If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, I cannot
> > > > > > insist on 't' being a vector and on t = c[2] being vect_t[0] = c[2] (which
> > > > > > could be seen as vec_select:SI (vect_t 0) ).
> > > > > > 
> > > > > > > I'd expect the instruction
> > > > > > > pattern as quoted to just work (and I hope we expand a uniform
> > > > > > > constructor { a, a, a, a } properly using vec_duplicate).
> > > > > > 
> > > > > > As much as I went through the code, this is only done using vec_init.
> > > > > > It is not expanded as vec_duplicate from, for example, store_constructor() in expr.c.
> > > > > 
> > > > > Do you see any issues if we expand such a constructor as vec_duplicate directly
> > > > > instead of going through the vec_init way?
> > > > 
> > > > Sorry, that was a bad question.
> > > > 
> > > > But here's what I would like to propose as a first step. Please tell me if this
> > > > is acceptable or if it makes sense:
> > > > 
> > > > - Introduce standard pattern names
> > > > 
> > > >   "vmulim4" - vector multiply with the second operand as an indexed operand
> > > > 
> > > >   Example:
> > > > 
> > > >   (define_insn "vmuliv4si4"
> > > >     [(set (match_operand:V4SI 0 "register_operand")
> > > >           (mul:V4SI (match_operand:V4SI 1 "register_operand")
> > > >                     (vec_duplicate:V4SI
> > > >                       (vec_select:SI
> > > >                         (match_operand:V4SI 2 "register_operand")
> > > >                         (match_operand:V4SI 3 "immediate_operand")))))]
> > > >   ...
> > > >   )
> > > 
> > > We could factor this with providing a standard pattern name for
> > > 
> > > (define_insn "vdupi"
> > >   [(set (match_operand: 0 "register_operand")
> > >         (vec_duplicate:
> > >           (vec_select:
> > >             (match_operand: 1 "register_operand")
> > >             (match_operand:SI 2 "immediate_operand"))))]
> > 
> > This is good. I did think about this but then I thought of avoiding the need
> > for combiner patterns :-)
> > 
> > But do you find the lane specific mov pattern I proposed acceptable?
> 
> The specific mul pattern?  As said, consider factoring to vdupi to
> avoid an explosion in required special optabs.
> 
> > > (you use V4SI for the immediate?
> > 
> > Sorry, typo again!! It should've been SI.
> > 
> > > Ideally vdupi has another custom
> > > mode for the vector index).
> > > 
> > > Note that this factored pattern is already available as vec_perm_const!
> > > It is simply (vec_perm_const:V4SI <...>).
> > > 
> > > Which means that on the GIMPLE level we should try to combine
> > > 
> > > el_4 = BIT_FIELD_REF <...>;
> > > v_5 = { el_4, el_4, ... };
> > 
> > I don't think we reach this state at all for the scenarios in discussion.
> > What we generally have is:
> > 
> > el_4 = MEM_REF < array + index*size >
> > v_5 = { el_4, ... }
> > 
> > Or am I missing something?
> 
> Well, but in that case I doubt it is profitable (or even valid!) to
> turn this into a vector lane load from the array.  If it is profitable
> to perform a vector read (because we're going to use the other elements
> of the vector as well) then the vectorizer should produce a vector
> load and materialize the uniform vector from one of its elements.
> 
> Maybe at this point you should show us a compilable C testcase
> with a loop that should be vectorized using your instructions in
> the end?

Here's a compilable example:

void
foo (int *__restrict__ a, int *__restrict__ b, int *__restrict__ c)
{
  int i;

  for (i = 0; i < 8; i++)
    a[i] = b[i] * c[2];
}

This is vectorized by duplicating c[2] now. But I'm trying to take advantage
of target instructions that can take a vector register as the second argument
but use only one element of it (by using the same value for all the lanes).

Eg. mul <Vd>.<T>, <Vn>.<T>, <Vm>.<Ts>[index]
    mla <Vd>.<T>, <Vn>.<T>, <Vm>.<Ts>[index]  // multiply and add

But for a loop like the one in the C example given, I will have to load
c[2] into one element of the vector register (leaving the remaining lanes
unused) instead. This is why I was proposing to load just one element into a
vector register (what I meant by "lane specific load"). The benefit of doing
this is that we avoid explicit duplication; however, such a simplification
can only be done where such support is available - the reason why I was
thinking in terms of an optional standard pattern name.
Another benefit is that we will also be able to support scalars in the
expression, like in the following example:

void
foo (int *__restrict__ a, int *__restrict__ b, int c)
{
  int i;

  for (i = 0; i < 8; i++)
    a[i] = b[i] * c;
}

Another example which can take advantage of the target feature:

void
foo (int *__restrict__ a, int *__restrict__ b, int *__restrict__ c)
{
  int i, j;

  for (i = 0; i < 8; i++)
    for (j = 0; j < 8; j++)
      a[j] += b[j] * c[i];
}

We discussed this scenario earlier (you suggested handling it at TER).

Cheers
VP