[RFC] Vectorization of indexed elements

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* [RFC] Vectorization of indexed elements
@ 2013-09-09 17:25 Vidya Praveen
  2013-09-09 18:02 ` Marc Glisse
  0 siblings, 1 reply; 20+ messages in thread
From: Vidya Praveen @ 2013-09-09 17:25 UTC (permalink / raw)
  To: gcc; +Cc: rguenther, ook

Hello,

This post details some thoughts on an enhancement to the vectorizer that 
could take advantage of the SIMD instructions that allows indexed element
as an operand thus reducing the need for duplication and possibly improve
reuse of previously loaded data.

Appreciate your opinion on this. 

--- 

A phrase like this:

 for(i=0;i<4;i++)
   a[i] = b[i] <op> c[2];

is usually vectorized as:

  va:V4SI = a[0:3]
  vb:V4SI = b[0:3]
  t = c[2]
  vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at vec_init
  ...
  va:V4SI = vb:V4SI <op> vc:V4SI

But this could be simplified further if a target has instructions that support
indexed element as a parameter. For example an instruction like this:

  mul v0.4s, v1.4s, v2.4s[2]

can perform multiplication of each element of v2.4s with the third element of
v2.4s (specified as v2.4s[2]) and store the results in the corresponding 
elements of v0.4s. 

For this to happen, vectorizer needs to understand this idiom and treat the 
operand c[2] specially (and by taking in to consideration if the machine 
supports indexed element as an operand for <op> through a target hook or macro)
and consider this as vectorizable statement without having to duplicate the 
elements explicitly. 

There are fews ways this could be represented at gimple:

  ...
  va:V4SI = vb:V4SI <op> VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
  ...

or by allowing a vectorizer treat an indexed element as a valid operand in a 
vectorizable statement:

  ...
  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 2)
  ...

For the sake of explanation, the above two representations assumes that 
c[0:3] is loaded in vc for some other use and reused here. But when c[2] is the
only use of 'c' then it may be safer to just load one element and use it like
this:

  vc:V4SI[0] = c[2]
  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)

This could also mean that expressions involving scalar could be treated 
similarly. For example,

  for(i=0;i<4;i++)
    a[i] = b[i] <op> c

could be vectorized as:

  vc:V4SI[0] = c
  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)

Such a change would also require new standard pattern names to be defined for
each <op>.

Alternatively, having something like this:

  ...
  vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
  va:V4SI = vb:V4SI <op> vt:V4SI
  ...

would remove the need to introduce several new standard pattern names but have
just one to represent vec_duplicate(vec_select()) but ofcourse this will expect
the target to have combiner patterns.

This enhancement could possibly help further optimizing larger scenarios such 
as linear systems.

Regards
VP

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] Vectorization of indexed elements
  2013-09-09 17:25 [RFC] Vectorization of indexed elements Vidya Praveen
@ 2013-09-09 18:02 ` Marc Glisse
  2013-09-10  8:25   ` Richard Biener
  2013-09-24 15:04   ` Vidya Praveen
  0 siblings, 2 replies; 20+ messages in thread
From: Marc Glisse @ 2013-09-09 18:02 UTC (permalink / raw)
  To: Vidya Praveen; +Cc: gcc, rguenther, ook

On Mon, 9 Sep 2013, Vidya Praveen wrote:

> Hello,
>
> This post details some thoughts on an enhancement to the vectorizer that
> could take advantage of the SIMD instructions that allows indexed element
> as an operand thus reducing the need for duplication and possibly improve
> reuse of previously loaded data.
>
> Appreciate your opinion on this.
>
> ---
>
> A phrase like this:
>
> for(i=0;i<4;i++)
>   a[i] = b[i] <op> c[2];
>
> is usually vectorized as:
>
>  va:V4SI = a[0:3]
>  vb:V4SI = b[0:3]
>  t = c[2]
>  vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at vec_init
>  ...
>  va:V4SI = vb:V4SI <op> vc:V4SI
>
> But this could be simplified further if a target has instructions that support
> indexed element as a parameter. For example an instruction like this:
>
>  mul v0.4s, v1.4s, v2.4s[2]
>
> can perform multiplication of each element of v2.4s with the third element of
> v2.4s (specified as v2.4s[2]) and store the results in the corresponding
> elements of v0.4s.
>
> For this to happen, vectorizer needs to understand this idiom and treat the
> operand c[2] specially (and by taking in to consideration if the machine
> supports indexed element as an operand for <op> through a target hook or macro)
> and consider this as vectorizable statement without having to duplicate the
> elements explicitly.
>
> There are fews ways this could be represented at gimple:
>
>  ...
>  va:V4SI = vb:V4SI <op> VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
>  ...
>
> or by allowing a vectorizer treat an indexed element as a valid operand in a
> vectorizable statement:

Might as well allow any scalar then...

>  ...
>  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 2)
>  ...
>
> For the sake of explanation, the above two representations assumes that
> c[0:3] is loaded in vc for some other use and reused here. But when c[2] is the
> only use of 'c' then it may be safer to just load one element and use it like
> this:
>
>  vc:V4SI[0] = c[2]
>  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
>
> This could also mean that expressions involving scalar could be treated
> similarly. For example,
>
>  for(i=0;i<4;i++)
>    a[i] = b[i] <op> c
>
> could be vectorized as:
>
>  vc:V4SI[0] = c
>  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
>
> Such a change would also require new standard pattern names to be defined for
> each <op>.
>
> Alternatively, having something like this:
>
>  ...
>  vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
>  va:V4SI = vb:V4SI <op> vt:V4SI
>  ...
>
> would remove the need to introduce several new standard pattern names but have
> just one to represent vec_duplicate(vec_select()) but ofcourse this will expect
> the target to have combiner patterns.

The cost estimation wouldn't be very good, but aren't combine patterns 
enough for the whole thing? Don't you model your mul instruction as:

(mult:V4SI
   (match_operand:V4SI)
   (vec_duplicate:V4SI (vec_select:SI (match_operand:V4SI))))

anyway? Seems that combine should be able to handle it. What currently 
happens that we fail to generate the right instruction?

In gimple, we already have BIT_FIELD_REF for vec_select and CONSTRUCTOR 
for vec_duplicate, adding new nodes is always painful.

> This enhancement could possibly help further optimizing larger scenarios such
> as linear systems.
>
> Regards
> VP

-- 
Marc Glisse

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] Vectorization of indexed elements
  2013-09-09 18:02 ` Marc Glisse
@ 2013-09-10  8:25   ` Richard Biener
  2013-09-24 15:03     ` Vidya Praveen
  2013-09-24 15:04   ` Vidya Praveen
  1 sibling, 1 reply; 20+ messages in thread
From: Richard Biener @ 2013-09-10  8:25 UTC (permalink / raw)
  To: gcc; +Cc: Vidya Praveen, ook

On Mon, 9 Sep 2013, Marc Glisse wrote:

> On Mon, 9 Sep 2013, Vidya Praveen wrote:
> 
> > Hello,
> > 
> > This post details some thoughts on an enhancement to the vectorizer that
> > could take advantage of the SIMD instructions that allows indexed element
> > as an operand thus reducing the need for duplication and possibly improve
> > reuse of previously loaded data.
> > 
> > Appreciate your opinion on this.
> > 
> > ---
> > 
> > A phrase like this:
> > 
> > for(i=0;i<4;i++)
> >   a[i] = b[i] <op> c[2];
> > 
> > is usually vectorized as:
> > 
> >  va:V4SI = a[0:3]
> >  vb:V4SI = b[0:3]
> >  t = c[2]
> >  vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at vec_init
> >  ...
> >  va:V4SI = vb:V4SI <op> vc:V4SI
> > 
> > But this could be simplified further if a target has instructions that
> > support
> > indexed element as a parameter. For example an instruction like this:
> > 
> >  mul v0.4s, v1.4s, v2.4s[2]
> > 
> > can perform multiplication of each element of v2.4s with the third element
> > of
> > v2.4s (specified as v2.4s[2]) and store the results in the corresponding
> > elements of v0.4s.
> > 
> > For this to happen, vectorizer needs to understand this idiom and treat the
> > operand c[2] specially (and by taking in to consideration if the machine
> > supports indexed element as an operand for <op> through a target hook or
> > macro)
> > and consider this as vectorizable statement without having to duplicate the
> > elements explicitly.
> > 
> > There are fews ways this could be represented at gimple:
> > 
> >  ...
> >  va:V4SI = vb:V4SI <op> VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> >  ...
> > 
> > or by allowing a vectorizer treat an indexed element as a valid operand in a
> > vectorizable statement:
> 
> Might as well allow any scalar then...

I agree.  The VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) form
would necessarily be two extra separate statements and thus subject
to CSE obfuscating it enough for RTL expansion to no longer notice it.

That said, allowing mixed scalar/vector ops isn't very nice and
your scheme can be simplified by just using

  vc:V4SI = VEC_DUPLICATE_EXPR <...>
  va:V4SI = vb:V4SI <op> vc:V4SI

where the expander only has to see that vc:V4SI is defined by
a duplicate.

> >  ...
> >  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 2)
> >  ...
> > 
> > For the sake of explanation, the above two representations assumes that
> > c[0:3] is loaded in vc for some other use and reused here. But when c[2] is
> > the
> > only use of 'c' then it may be safer to just load one element and use it
> > like
> > this:
> > 
> >  vc:V4SI[0] = c[2]
> >  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
> > 
> > This could also mean that expressions involving scalar could be treated
> > similarly. For example,
> > 
> >  for(i=0;i<4;i++)
> >    a[i] = b[i] <op> c
> > 
> > could be vectorized as:
> > 
> >  vc:V4SI[0] = c
> >  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
> > 
> > Such a change would also require new standard pattern names to be defined
> > for
> > each <op>.
> > 
> > Alternatively, having something like this:
> > 
> >  ...
> >  vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> >  va:V4SI = vb:V4SI <op> vt:V4SI
> >  ...
> > 
> > would remove the need to introduce several new standard pattern names but
> > have
> > just one to represent vec_duplicate(vec_select()) but ofcourse this will
> > expect
> > the target to have combiner patterns.
> 
> The cost estimation wouldn't be very good, but aren't combine patterns enough
> for the whole thing? Don't you model your mul instruction as:
> 
> (mult:V4SI
>   (match_operand:V4SI)
>   (vec_duplicate:V4SI (vec_select:SI (match_operand:V4SI))))
> 
> anyway? Seems that combine should be able to handle it. What currently happens
> that we fail to generate the right instruction?
> 
> In gimple, we already have BIT_FIELD_REF for vec_select and CONSTRUCTOR for
> vec_duplicate, adding new nodes is always painful.

True, though CONSTRUCTOR isn't a good vec_duplicate primitive.  But yes,
we have it that way at the moment and indeed adding new nodes is always
painful.

> > This enhancement could possibly help further optimizing larger scenarios
> > such
> > as linear systems.

Given that the vectorizer already handles all cases you quote but
just the expansion doesn't use the targets special abilities - can't
you just teach the expander to lookup the definition of the
vectors and see if it is an uniform CONSTRUCTOR?

Richard.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] Vectorization of indexed elements
  2013-09-10  8:25   ` Richard Biener
@ 2013-09-24 15:03     ` Vidya Praveen
  2013-09-25  9:22       ` Richard Biener
  0 siblings, 1 reply; 20+ messages in thread
From: Vidya Praveen @ 2013-09-24 15:03 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc, ook

On Tue, Sep 10, 2013 at 09:25:32AM +0100, Richard Biener wrote:
> On Mon, 9 Sep 2013, Marc Glisse wrote:
> 
> > On Mon, 9 Sep 2013, Vidya Praveen wrote:
> > 
> > > Hello,
> > > 
> > > This post details some thoughts on an enhancement to the vectorizer that
> > > could take advantage of the SIMD instructions that allows indexed element
> > > as an operand thus reducing the need for duplication and possibly improve
> > > reuse of previously loaded data.
> > > 
> > > Appreciate your opinion on this.
> > > 
> > > ---
> > > 
> > > A phrase like this:
> > > 
> > > for(i=0;i<4;i++)
> > >   a[i] = b[i] <op> c[2];
> > > 
> > > is usually vectorized as:
> > > 
> > >  va:V4SI = a[0:3]
> > >  vb:V4SI = b[0:3]
> > >  t = c[2]
> > >  vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at vec_init
> > >  ...
> > >  va:V4SI = vb:V4SI <op> vc:V4SI
> > > 
> > > But this could be simplified further if a target has instructions that
> > > support
> > > indexed element as a parameter. For example an instruction like this:
> > > 
> > >  mul v0.4s, v1.4s, v2.4s[2]
> > > 
> > > can perform multiplication of each element of v2.4s with the third element
> > > of
> > > v2.4s (specified as v2.4s[2]) and store the results in the corresponding
> > > elements of v0.4s.
> > > 
> > > For this to happen, vectorizer needs to understand this idiom and treat the
> > > operand c[2] specially (and by taking in to consideration if the machine
> > > supports indexed element as an operand for <op> through a target hook or
> > > macro)
> > > and consider this as vectorizable statement without having to duplicate the
> > > elements explicitly.
> > > 
> > > There are fews ways this could be represented at gimple:
> > > 
> > >  ...
> > >  va:V4SI = vb:V4SI <op> VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> > >  ...
> > > 
> > > or by allowing a vectorizer treat an indexed element as a valid operand in a
> > > vectorizable statement:
> > 
> > Might as well allow any scalar then...
> 
> I agree.  The VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) form
> would necessarily be two extra separate statements and thus subject
> to CSE obfuscating it enough for RTL expansion to no longer notice it.

I also thought about having a specialized expression like

VEC_INDEXED_<op>_EXPR < arg0, arg1, arg2, index> 

to mean:

arg0 = arg1 <op> arg2[index]

and handle it directly in the expander, like (for eg.) how VEC_LSHIFT_EXPR
is handled in expr.c. But I dropped this idea since we may need to introduce
many such nodes.

> 
> That said, allowing mixed scalar/vector ops isn't very nice and
> your scheme can be simplified by just using
> 
>   vc:V4SI = VEC_DUPLICATE_EXPR <...>
>   va:V4SI = vb:V4SI <op> vc:V4SI
> 
> where the expander only has to see that vc:V4SI is defined by
> a duplicate.

I did try out something like this quickly before I posted this RFC, though
I called it VEC_DUP to mean a equivalent of vec_duplicate(vec_select())

for: 

  for(i=0;i<8;i++)
    a[i] = b[2] * c[i];

I could generate:

  ...
  <bb 8>:
  _88 = prolog_loop_adjusted_niters.6_60 * 4;
  vectp_c.13_87 = c_10(D) + _88;
  vect_ldidx_.16_92 = MEM[(int *)b_8(D) + 8B];                     <<<<<<<<
  vect_idxed_.17_93 = (vect_ldidx_.16_92) <<< ??? >>> (0);         <<<<<<<<
  _96 = prolog_loop_adjusted_niters.6_60 * 4;
  vectp_a.19_95 = a_6(D) + _96;
  vect__12.14_115 = MEM[(int *)vectp_c.13_87];
  vect_patt_40.15_116 = vect__12.14_115 * vect_idxed_.17_93;       <<<<<<<<
  MEM[(int *)vectp_a.19_95] = vect_patt_40.15_116;                 <<<<<<<<
  vectp_c.12_118 = vectp_c.13_87 + 16;
  vectp_a.18_119 = vectp_a.19_95 + 16;
  ivtmp_120 = 1;
  if (ivtmp_120 < bnd.8_62)
    goto <bb 9>;
  else
    goto <bb 11>;

  <bb 9>:
  # vectp_c.12_89 = PHI <vectp_c.12_118(8)>
  # vectp_a.18_97 = PHI <vectp_a.18_119(8)>
  # ivtmp_14 = PHI <ivtmp_120(8)>
  vect__12.14_91 = MEM[(int *)vectp_c.12_89];                      <<<<<<<<
  vect_patt_40.15_94 = vect__12.14_91 * vect_idxed_.17_93;         <<<<<<<<
  MEM[(int *)vectp_a.18_97] = vect_patt_40.15_94;
  ...

It's a crude implementation so VEC_DUP is printed as:

  (vect_ldidx_.16_92) <<< ??? >>> (0);


> > >  ...
> > >  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 2)
> > >  ...
> > >
> > > For the sake of explanation, the above two representations assumes that
> > > c[0:3] is loaded in vc for some other use and reused here. But when c[2] is
> > > the
> > > only use of 'c' then it may be safer to just load one element and use it
> > > like
> > > this:
> > >
> > >  vc:V4SI[0] = c[2]
> > >  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
> > > 
> > > This could also mean that expressions involving scalar could be treated
> > > similarly. For example,
> > > 
> > >  for(i=0;i<4;i++)
> > >    a[i] = b[i] <op> c
> > > 
> > > could be vectorized as:
> > > 
> > >  vc:V4SI[0] = c
> > >  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
> > > 
> > > Such a change would also require new standard pattern names to be defined
> > > for
> > > each <op>.
> > > 
> > > Alternatively, having something like this:
> > > 
> > >  ...
> > >  vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> > >  va:V4SI = vb:V4SI <op> vt:V4SI
> > >  ...
> > > 
> > > would remove the need to introduce several new standard pattern names but
> > > have
> > > just one to represent vec_duplicate(vec_select()) but ofcourse this will
> > > expect
> > > the target to have combiner patterns.
> > 
> > The cost estimation wouldn't be very good, but aren't combine patterns enough
> > for the whole thing? Don't you model your mul instruction as:
> > 
> > (mult:V4SI
> >   (match_operand:V4SI)
> >   (vec_duplicate:V4SI (vec_select:SI (match_operand:V4SI))))
> > 
> > anyway? Seems that combine should be able to handle it. What currently happens
> > that we fail to generate the right instruction?
> > 
> > In gimple, we already have BIT_FIELD_REF for vec_select and CONSTRUCTOR for
> > vec_duplicate, adding new nodes is always painful.
> 
> True, though CONSTRUCTOR isn't a good vec_duplicate primitive.  But yes,
> we have it that way at the moment and indeed adding new nodes is always
> painful.
> 
> > > This enhancement could possibly help further optimizing larger scenarios
> > > such
> > > as linear systems.
> 
> Given that the vectorizer already handles all cases you quote but
> just the expansion doesn't use the targets special abilities - can't
> you just teach the expander to lookup the definition of the
> vectors and see if it is an uniform CONSTRUCTOR?
> 
> Richard.
> 

I did think of handling this as a part of expanding CONSTRUCTOR but I thought
it may not a good idea if we want to enhance this support in the future to
handle larger cases like this one (hypothetical example):

  for i = 0 to 3
    for j = 0 to 3
      a[j] += b[j] * c[i]

to
 
  a[0:3] += b[0:3] + c[0] 
  a[0:3] += b[0:3] + c[1] 
  a[0:3] += b[0:3] + c[2] 
  a[0:3] += b[0:3] + c[3]

Secondly, I am not sure if introducing a single lane load at the time of 
expansion and removing or expecting the existing scalar load to be removed
later as it is unused, is a good idea. 

Please advise. 

Cheers
VP


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] Vectorization of indexed elements
  2013-09-09 18:02 ` Marc Glisse
  2013-09-10  8:25   ` Richard Biener
@ 2013-09-24 15:04   ` Vidya Praveen
  2013-09-25  9:25     ` Richard Biener
  1 sibling, 1 reply; 20+ messages in thread
From: Vidya Praveen @ 2013-09-24 15:04 UTC (permalink / raw)
  To: gcc; +Cc: rguenther, ook

On Mon, Sep 09, 2013 at 07:02:52PM +0100, Marc Glisse wrote:
> On Mon, 9 Sep 2013, Vidya Praveen wrote:
> 
> > Hello,
> >
> > This post details some thoughts on an enhancement to the vectorizer that
> > could take advantage of the SIMD instructions that allows indexed element
> > as an operand thus reducing the need for duplication and possibly improve
> > reuse of previously loaded data.
> >
> > Appreciate your opinion on this.
> >
> > ---
> >
> > A phrase like this:
> >
> > for(i=0;i<4;i++)
> >   a[i] = b[i] <op> c[2];
> >
> > is usually vectorized as:
> >
> >  va:V4SI = a[0:3]
> >  vb:V4SI = b[0:3]
> >  t = c[2]
> >  vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at vec_init
> >  ...
> >  va:V4SI = vb:V4SI <op> vc:V4SI
> >
> > But this could be simplified further if a target has instructions that support
> > indexed element as a parameter. For example an instruction like this:
> >
> >  mul v0.4s, v1.4s, v2.4s[2]
> >
> > can perform multiplication of each element of v2.4s with the third element of
> > v2.4s (specified as v2.4s[2]) and store the results in the corresponding
> > elements of v0.4s.
> >
> > For this to happen, vectorizer needs to understand this idiom and treat the
> > operand c[2] specially (and by taking in to consideration if the machine
> > supports indexed element as an operand for <op> through a target hook or macro)
> > and consider this as vectorizable statement without having to duplicate the
> > elements explicitly.
> >
> > There are fews ways this could be represented at gimple:
> >
> >  ...
> >  va:V4SI = vb:V4SI <op> VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> >  ...
> >
> > or by allowing a vectorizer treat an indexed element as a valid operand in a
> > vectorizable statement:
> 
> Might as well allow any scalar then...

Yes, I had given an example below.

> 
> >  ...
> >  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 2)
> >  ...
> >
> > For the sake of explanation, the above two representations assumes that
> > c[0:3] is loaded in vc for some other use and reused here. But when c[2] is the
> > only use of 'c' then it may be safer to just load one element and use it like
> > this:
> >
> >  vc:V4SI[0] = c[2]
> >  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
> >
> > This could also mean that expressions involving scalar could be treated
> > similarly. For example,
> >
> >  for(i=0;i<4;i++)
> >    a[i] = b[i] <op> c
> >
> > could be vectorized as:
> >
> >  vc:V4SI[0] = c
> >  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
> >
> > Such a change would also require new standard pattern names to be defined for
> > each <op>.
> >
> > Alternatively, having something like this:
> >
> >  ...
> >  vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> >  va:V4SI = vb:V4SI <op> vt:V4SI
> >  ...
> >
> > would remove the need to introduce several new standard pattern names but have
> > just one to represent vec_duplicate(vec_select()) but ofcourse this will expect
> > the target to have combiner patterns.
> 
> The cost estimation wouldn't be very good, but aren't combine patterns 
> enough for the whole thing? Don't you model your mul instruction as:
> 
> (mult:V4SI
>    (match_operand:V4SI)
>    (vec_duplicate:V4SI (vec_select:SI (match_operand:V4SI))))
> 
> anyway? Seems that combine should be able to handle it. What currently 
> happens that we fail to generate the right instruction?

At vec_init, I can recognize an idiom in order to generate vec_duplicate but
I can't really insist on the single lane load.. something like:

vc:V4SI[0] = c
vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
va:V4SI = vb:V4SI <op> vt:V4SI

Or is there any other way to do this?

Cheers
VP

> 
> In gimple, we already have BIT_FIELD_REF for vec_select and CONSTRUCTOR 
> for vec_duplicate, adding new nodes is always painful.
> 
> > This enhancement could possibly help further optimizing larger scenarios such
> > as linear systems.
> >
> > Regards
> > VP
> 
> -- 
> Marc Glisse
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] Vectorization of indexed elements
  2013-09-24 15:03     ` Vidya Praveen
@ 2013-09-25  9:22       ` Richard Biener
  2013-09-30 13:01         ` Vidya Praveen
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Biener @ 2013-09-25  9:22 UTC (permalink / raw)
  To: Vidya Praveen; +Cc: gcc, ook

On Tue, 24 Sep 2013, Vidya Praveen wrote:

> On Tue, Sep 10, 2013 at 09:25:32AM +0100, Richard Biener wrote:
> > On Mon, 9 Sep 2013, Marc Glisse wrote:
> > 
> > > On Mon, 9 Sep 2013, Vidya Praveen wrote:
> > > 
> > > > Hello,
> > > > 
> > > > This post details some thoughts on an enhancement to the vectorizer that
> > > > could take advantage of the SIMD instructions that allows indexed element
> > > > as an operand thus reducing the need for duplication and possibly improve
> > > > reuse of previously loaded data.
> > > > 
> > > > Appreciate your opinion on this.
> > > > 
> > > > ---
> > > > 
> > > > A phrase like this:
> > > > 
> > > > for(i=0;i<4;i++)
> > > >   a[i] = b[i] <op> c[2];
> > > > 
> > > > is usually vectorized as:
> > > > 
> > > >  va:V4SI = a[0:3]
> > > >  vb:V4SI = b[0:3]
> > > >  t = c[2]
> > > >  vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at vec_init
> > > >  ...
> > > >  va:V4SI = vb:V4SI <op> vc:V4SI
> > > > 
> > > > But this could be simplified further if a target has instructions that
> > > > support
> > > > indexed element as a parameter. For example an instruction like this:
> > > > 
> > > >  mul v0.4s, v1.4s, v2.4s[2]
> > > > 
> > > > can perform multiplication of each element of v2.4s with the third element
> > > > of
> > > > v2.4s (specified as v2.4s[2]) and store the results in the corresponding
> > > > elements of v0.4s.
> > > > 
> > > > For this to happen, vectorizer needs to understand this idiom and treat the
> > > > operand c[2] specially (and by taking in to consideration if the machine
> > > > supports indexed element as an operand for <op> through a target hook or
> > > > macro)
> > > > and consider this as vectorizable statement without having to duplicate the
> > > > elements explicitly.
> > > > 
> > > > There are fews ways this could be represented at gimple:
> > > > 
> > > >  ...
> > > >  va:V4SI = vb:V4SI <op> VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> > > >  ...
> > > > 
> > > > or by allowing a vectorizer treat an indexed element as a valid operand in a
> > > > vectorizable statement:
> > > 
> > > Might as well allow any scalar then...
> > 
> > I agree.  The VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) form
> > would necessarily be two extra separate statements and thus subject
> > to CSE obfuscating it enough for RTL expansion to no longer notice it.
> 
> I also thought about having a specialized expression like
> 
> VEC_INDEXED_<op>_EXPR < arg0, arg1, arg2, index> 
> 
> to mean:
> 
> arg0 = arg1 <op> arg2[index]
> 
> and handle it directly in the expander, like (for eg.) how VEC_LSHIFT_EXPR
> is handled in expr.c. But I dropped this idea since we may need to introduce
> many such nodes.
> 
> > 
> > That said, allowing mixed scalar/vector ops isn't very nice and
> > your scheme can be simplified by just using
> > 
> >   vc:V4SI = VEC_DUPLICATE_EXPR <...>
> >   va:V4SI = vb:V4SI <op> vc:V4SI
> > 
> > where the expander only has to see that vc:V4SI is defined by
> > a duplicate.
> 
> I did try out something like this quickly before I posted this RFC, though
> I called it VEC_DUP to mean a equivalent of vec_duplicate(vec_select())
> 
> for: 
> 
>   for(i=0;i<8;i++)
>     a[i] = b[2] * c[i];
> 
> I could generate:
> 
>   ...
>   <bb 8>:
>   _88 = prolog_loop_adjusted_niters.6_60 * 4;
>   vectp_c.13_87 = c_10(D) + _88;
>   vect_ldidx_.16_92 = MEM[(int *)b_8(D) + 8B];                     <<<<<<<<
>   vect_idxed_.17_93 = (vect_ldidx_.16_92) <<< ??? >>> (0);         <<<<<<<<
>   _96 = prolog_loop_adjusted_niters.6_60 * 4;
>   vectp_a.19_95 = a_6(D) + _96;
>   vect__12.14_115 = MEM[(int *)vectp_c.13_87];
>   vect_patt_40.15_116 = vect__12.14_115 * vect_idxed_.17_93;       <<<<<<<<
>   MEM[(int *)vectp_a.19_95] = vect_patt_40.15_116;                 <<<<<<<<
>   vectp_c.12_118 = vectp_c.13_87 + 16;
>   vectp_a.18_119 = vectp_a.19_95 + 16;
>   ivtmp_120 = 1;
>   if (ivtmp_120 < bnd.8_62)
>     goto <bb 9>;
>   else
>     goto <bb 11>;
> 
>   <bb 9>:
>   # vectp_c.12_89 = PHI <vectp_c.12_118(8)>
>   # vectp_a.18_97 = PHI <vectp_a.18_119(8)>
>   # ivtmp_14 = PHI <ivtmp_120(8)>
>   vect__12.14_91 = MEM[(int *)vectp_c.12_89];                      <<<<<<<<
>   vect_patt_40.15_94 = vect__12.14_91 * vect_idxed_.17_93;         <<<<<<<<
>   MEM[(int *)vectp_a.18_97] = vect_patt_40.15_94;
>   ...
> 
> It's a crude implementation so VEC_DUP is printed as:
> 
>   (vect_ldidx_.16_92) <<< ??? >>> (0);
> 
> 
> > > >  ...
> > > >  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 2)
> > > >  ...
> > > >
> > > > For the sake of explanation, the above two representations assumes that
> > > > c[0:3] is loaded in vc for some other use and reused here. But when c[2] is
> > > > the
> > > > only use of 'c' then it may be safer to just load one element and use it
> > > > like
> > > > this:
> > > >
> > > >  vc:V4SI[0] = c[2]
> > > >  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
> > > > 
> > > > This could also mean that expressions involving scalar could be treated
> > > > similarly. For example,
> > > > 
> > > >  for(i=0;i<4;i++)
> > > >    a[i] = b[i] <op> c
> > > > 
> > > > could be vectorized as:
> > > > 
> > > >  vc:V4SI[0] = c
> > > >  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
> > > > 
> > > > Such a change would also require new standard pattern names to be defined
> > > > for
> > > > each <op>.
> > > > 
> > > > Alternatively, having something like this:
> > > > 
> > > >  ...
> > > >  vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> > > >  va:V4SI = vb:V4SI <op> vt:V4SI
> > > >  ...
> > > > 
> > > > would remove the need to introduce several new standard pattern names but
> > > > have
> > > > just one to represent vec_duplicate(vec_select()) but ofcourse this will
> > > > expect
> > > > the target to have combiner patterns.
> > > 
> > > The cost estimation wouldn't be very good, but aren't combine patterns enough
> > > for the whole thing? Don't you model your mul instruction as:
> > > 
> > > (mult:V4SI
> > >   (match_operand:V4SI)
> > >   (vec_duplicate:V4SI (vec_select:SI (match_operand:V4SI))))
> > > 
> > > anyway? Seems that combine should be able to handle it. What currently happens
> > > that we fail to generate the right instruction?
> > > 
> > > In gimple, we already have BIT_FIELD_REF for vec_select and CONSTRUCTOR for
> > > vec_duplicate, adding new nodes is always painful.
> > 
> > True, though CONSTRUCTOR isn't a good vec_duplicate primitive.  But yes,
> > we have it that way at the moment and indeed adding new nodes is always
> > painful.
> > 
> > > > This enhancement could possibly help further optimizing larger scenarios
> > > > such
> > > > as linear systems.
> > 
> > Given that the vectorizer already handles all cases you quote but
> > just the expansion doesn't use the targets special abilities - can't
> > you just teach the expander to lookup the definition of the
> > vectors and see if it is an uniform CONSTRUCTOR?
> > 
> > Richard.
> > 
> 
> I did think of handling this as a part of expanding CONSTRUCTOR but I thought
> it may not a good idea if we want to enhance this support in the future to
> handle larger cases like this one (hypothetical example):
> 
>   for i = 0 to 3
>     for j = 0 to 3
>       a[j] += b[j] * c[i]
> 
> to
>  
>   a[0:3] += b[0:3] + c[0] 
>   a[0:3] += b[0:3] + c[1] 
>   a[0:3] += b[0:3] + c[2] 
>   a[0:3] += b[0:3] + c[3]
> 
> Secondly, I am not sure if introducing a single lane load at the time of 
> expansion and removing or expecting the existing scalar load to be removed
> later as it is unused, is a good idea. 

The above example requires unrolling and scalar cleanup anyway so you'll
currently see sth like

tem00_1 = c[0];
tem0_2 = { tem00_1, tem00_1, tem00_1, tem00_1 };
temb_3 = b[0:3];
tems_4 = temb_3 + tem0_2;
tema_5 = a[0:3];
tems_6 = tema_5 + tems_4;
a[0:3] = tems_6;

which at expansion time can be short-cut easily without even going to
expand the constructor.  That's the feature of TER which defers
expansion of SSA name definition statements until they are first
used.  In this case when you expand tems_4 you can pattern match
the case you are interested in (not 100% sure which it is, I suppose
it is vector register + vector_select < vector register >) and
simply expand it optimally.  The "consumed" SSA names will have no
further uses and will simply not be expanded at all.

Richard.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] Vectorization of indexed elements
  2013-09-24 15:04   ` Vidya Praveen
@ 2013-09-25  9:25     ` Richard Biener
  2013-09-27 14:50       ` Vidya Praveen
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Biener @ 2013-09-25  9:25 UTC (permalink / raw)
  To: Vidya Praveen; +Cc: gcc, ook

On Tue, 24 Sep 2013, Vidya Praveen wrote:

> On Mon, Sep 09, 2013 at 07:02:52PM +0100, Marc Glisse wrote:
> > On Mon, 9 Sep 2013, Vidya Praveen wrote:
> > 
> > > Hello,
> > >
> > > This post details some thoughts on an enhancement to the vectorizer that
> > > could take advantage of the SIMD instructions that allows indexed element
> > > as an operand thus reducing the need for duplication and possibly improve
> > > reuse of previously loaded data.
> > >
> > > Appreciate your opinion on this.
> > >
> > > ---
> > >
> > > A phrase like this:
> > >
> > > for(i=0;i<4;i++)
> > >   a[i] = b[i] <op> c[2];
> > >
> > > is usually vectorized as:
> > >
> > >  va:V4SI = a[0:3]
> > >  vb:V4SI = b[0:3]
> > >  t = c[2]
> > >  vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at vec_init
> > >  ...
> > >  va:V4SI = vb:V4SI <op> vc:V4SI
> > >
> > > But this could be simplified further if a target has instructions that support
> > > indexed element as a parameter. For example an instruction like this:
> > >
> > >  mul v0.4s, v1.4s, v2.4s[2]
> > >
> > > can perform multiplication of each element of v2.4s with the third element of
> > > v2.4s (specified as v2.4s[2]) and store the results in the corresponding
> > > elements of v0.4s.
> > >
> > > For this to happen, vectorizer needs to understand this idiom and treat the
> > > operand c[2] specially (and by taking in to consideration if the machine
> > > supports indexed element as an operand for <op> through a target hook or macro)
> > > and consider this as vectorizable statement without having to duplicate the
> > > elements explicitly.
> > >
> > > There are fews ways this could be represented at gimple:
> > >
> > >  ...
> > >  va:V4SI = vb:V4SI <op> VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> > >  ...
> > >
> > > or by allowing a vectorizer treat an indexed element as a valid operand in a
> > > vectorizable statement:
> > 
> > Might as well allow any scalar then...
> 
> Yes, I had given an example below.
> 
> > 
> > >  ...
> > >  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 2)
> > >  ...
> > >
> > > For the sake of explanation, the above two representations assumes that
> > > c[0:3] is loaded in vc for some other use and reused here. But when c[2] is the
> > > only use of 'c' then it may be safer to just load one element and use it like
> > > this:
> > >
> > >  vc:V4SI[0] = c[2]
> > >  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
> > >
> > > This could also mean that expressions involving scalar could be treated
> > > similarly. For example,
> > >
> > >  for(i=0;i<4;i++)
> > >    a[i] = b[i] <op> c
> > >
> > > could be vectorized as:
> > >
> > >  vc:V4SI[0] = c
> > >  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
> > >
> > > Such a change would also require new standard pattern names to be defined for
> > > each <op>.
> > >
> > > Alternatively, having something like this:
> > >
> > >  ...
> > >  vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> > >  va:V4SI = vb:V4SI <op> vt:V4SI
> > >  ...
> > >
> > > would remove the need to introduce several new standard pattern names but have
> > > just one to represent vec_duplicate(vec_select()) but ofcourse this will expect
> > > the target to have combiner patterns.
> > 
> > The cost estimation wouldn't be very good, but aren't combine patterns 
> > enough for the whole thing? Don't you model your mul instruction as:
> > 
> > (mult:V4SI
> >    (match_operand:V4SI)
> >    (vec_duplicate:V4SI (vec_select:SI (match_operand:V4SI))))
> > 
> > anyway? Seems that combine should be able to handle it. What currently 
> > happens that we fail to generate the right instruction?
> 
> At vec_init, I can recognize an idiom in order to generate vec_duplicate but
> I can't really insist on the single lane load.. something like:
> 
> vc:V4SI[0] = c
> vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
> va:V4SI = vb:V4SI <op> vt:V4SI
> 
> Or is there any other way to do this?

Can you elaborate on "I can't really insist on the single lane load"?
What's the single lane load in your example?  I'd expect the instruction
pattern as quoted to just work (and I hope we expand an uniform
constructor { a, a, a, a } properly using vec_duplicate).

Richard.

> Cheers
> VP
> 
> > 
> > In gimple, we already have BIT_FIELD_REF for vec_select and CONSTRUCTOR 
> > for vec_duplicate, adding new nodes is always painful.
> > 
> > > This enhancement could possibly help further optimizing larger scenarios such
> > > as linear systems.
> > >
> > > Regards
> > > VP
> > 
> > -- 
> > Marc Glisse
> >
> 
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE / SUSE Labs
SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
GF: Jeff Hawn, Jennifer Guild, Felix Imend

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] Vectorization of indexed elements
  2013-09-25  9:25     ` Richard Biener
@ 2013-09-27 14:50       ` Vidya Praveen
  2013-09-27 15:19         ` Vidya Praveen
  0 siblings, 1 reply; 20+ messages in thread
From: Vidya Praveen @ 2013-09-27 14:50 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc, ook

On Wed, Sep 25, 2013 at 10:24:56AM +0100, Richard Biener wrote:
> On Tue, 24 Sep 2013, Vidya Praveen wrote:
> 
> > On Mon, Sep 09, 2013 at 07:02:52PM +0100, Marc Glisse wrote:
> > > On Mon, 9 Sep 2013, Vidya Praveen wrote:
> > > 
> > > > Hello,
> > > >
> > > > This post details some thoughts on an enhancement to the vectorizer that
> > > > could take advantage of the SIMD instructions that allows indexed element
> > > > as an operand thus reducing the need for duplication and possibly improve
> > > > reuse of previously loaded data.
> > > >
> > > > Appreciate your opinion on this.
> > > >
> > > > ---
> > > >
> > > > A phrase like this:
> > > >
> > > > for(i=0;i<4;i++)
> > > >   a[i] = b[i] <op> c[2];
> > > >
> > > > is usually vectorized as:
> > > >
> > > >  va:V4SI = a[0:3]
> > > >  vb:V4SI = b[0:3]
> > > >  t = c[2]
> > > >  vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at vec_init
> > > >  ...
> > > >  va:V4SI = vb:V4SI <op> vc:V4SI
> > > >
> > > > But this could be simplified further if a target has instructions that support
> > > > indexed element as a parameter. For example an instruction like this:
> > > >
> > > >  mul v0.4s, v1.4s, v2.4s[2]
> > > >
> > > > can perform multiplication of each element of v2.4s with the third element of
> > > > v2.4s (specified as v2.4s[2]) and store the results in the corresponding
> > > > elements of v0.4s.
> > > >
> > > > For this to happen, vectorizer needs to understand this idiom and treat the
> > > > operand c[2] specially (and by taking in to consideration if the machine
> > > > supports indexed element as an operand for <op> through a target hook or macro)
> > > > and consider this as vectorizable statement without having to duplicate the
> > > > elements explicitly.
> > > >
> > > > There are fews ways this could be represented at gimple:
> > > >
> > > >  ...
> > > >  va:V4SI = vb:V4SI <op> VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> > > >  ...
> > > >
> > > > or by allowing a vectorizer treat an indexed element as a valid operand in a
> > > > vectorizable statement:
> > > 
> > > Might as well allow any scalar then...
> > 
> > Yes, I had given an example below.
> > 
> > > 
> > > >  ...
> > > >  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 2)
> > > >  ...
> > > >
> > > > For the sake of explanation, the above two representations assumes that
> > > > c[0:3] is loaded in vc for some other use and reused here. But when c[2] is the
> > > > only use of 'c' then it may be safer to just load one element and use it like
> > > > this:
> > > >
> > > >  vc:V4SI[0] = c[2]
> > > >  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
> > > >
> > > > This could also mean that expressions involving scalar could be treated
> > > > similarly. For example,
> > > >
> > > >  for(i=0;i<4;i++)
> > > >    a[i] = b[i] <op> c
> > > >
> > > > could be vectorized as:
> > > >
> > > >  vc:V4SI[0] = c
> > > >  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
> > > >
> > > > Such a change would also require new standard pattern names to be defined for
> > > > each <op>.
> > > >
> > > > Alternatively, having something like this:
> > > >
> > > >  ...
> > > >  vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> > > >  va:V4SI = vb:V4SI <op> vt:V4SI
> > > >  ...
> > > >
> > > > would remove the need to introduce several new standard pattern names but have
> > > > just one to represent vec_duplicate(vec_select()) but ofcourse this will expect
> > > > the target to have combiner patterns.
> > > 
> > > The cost estimation wouldn't be very good, but aren't combine patterns 
> > > enough for the whole thing? Don't you model your mul instruction as:
> > > 
> > > (mult:V4SI
> > >    (match_operand:V4SI)
> > >    (vec_duplicate:V4SI (vec_select:SI (match_operand:V4SI))))
> > > 
> > > anyway? Seems that combine should be able to handle it. What currently 
> > > happens that we fail to generate the right instruction?
> > 
> > At vec_init, I can recognize an idiom in order to generate vec_duplicate but
> > I can't really insist on the single lane load.. something like:
> > 
> > vc:V4SI[0] = c
> > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
> > va:V4SI = vb:V4SI <op> vt:V4SI
> > 
> > Or is there any other way to do this?
> 
> Can you elaborate on "I can't really insist on the single lane load"?
> What's the single lane load in your example? 

Loading just one lane of the vector like this:

vc:V4SI[0] = c // from the above scalar example

or 

vc:V4SI[0] = c[2] 

is what I meant by single lane load. In this example:

t = c[2] 
...
vb:v4si = b[0:3] 
vc:v4si = { t, t, t, t }
va:v4si = vb:v4si <op> vc:v4si 

If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, I cannot
insist 't' to be vector and t = c[2] to be vect_t[0] = c[2] (which could be 
seen as vec_select:SI (vect_t 0) ). 

> I'd expect the instruction
> pattern as quoted to just work (and I hope we expand an uniform
> constructor { a, a, a, a } properly using vec_duplicate).

As much as I went through the code, this is only done using vect_init. It is
not expanded as vec_duplicate from, for example, store_constructor() of expr.c

VP

> 
> Richard.
> 
> > Cheers
> > VP
> > 
> > > 
> > > In gimple, we already have BIT_FIELD_REF for vec_select and CONSTRUCTOR 
> > > for vec_duplicate, adding new nodes is always painful.
> > > 
> > > > This enhancement could possibly help further optimizing larger scenarios such
> > > > as linear systems.
> > > >
> > > > Regards
> > > > VP
> > > 
> > > -- 
> > > Marc Glisse
> > >
> > 
> > 
> > 
> 
> -- 
> Richard Biener <rguenther@suse.de>
> SUSE / SUSE Labs
> SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
> GF: Jeff Hawn, Jennifer Guild, Felix Imend
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] Vectorization of indexed elements
  2013-09-27 14:50       ` Vidya Praveen
@ 2013-09-27 15:19         ` Vidya Praveen
  2013-09-30 12:55           ` Vidya Praveen
  0 siblings, 1 reply; 20+ messages in thread
From: Vidya Praveen @ 2013-09-27 15:19 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc, ook

On Fri, Sep 27, 2013 at 03:50:08PM +0100, Vidya Praveen wrote:
[...]
> > > I can't really insist on the single lane load.. something like:
> > > 
> > > vc:V4SI[0] = c
> > > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
> > > va:V4SI = vb:V4SI <op> vt:V4SI
> > > 
> > > Or is there any other way to do this?
> > 
> > Can you elaborate on "I can't really insist on the single lane load"?
> > What's the single lane load in your example? 
> 
> Loading just one lane of the vector like this:
> 
> vc:V4SI[0] = c // from the above scalar example
> 
> or 
> 
> vc:V4SI[0] = c[2] 
> 
> is what I meant by single lane load. In this example:
> 
> t = c[2] 
> ...
> vb:v4si = b[0:3] 
> vc:v4si = { t, t, t, t }
> va:v4si = vb:v4si <op> vc:v4si 
> 
> If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, I cannot
> insist 't' to be vector and t = c[2] to be vect_t[0] = c[2] (which could be 
> seen as vec_select:SI (vect_t 0) ). 
> 
> > I'd expect the instruction
> > pattern as quoted to just work (and I hope we expand an uniform
> > constructor { a, a, a, a } properly using vec_duplicate).
> 
> As much as I went through the code, this is only done using vect_init. It is
> not expanded as vec_duplicate from, for example, store_constructor() of expr.c

Do you see any issues if we expand such constructor as vec_duplicate directly 
instead of going through vect_init way? 

VP


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] Vectorization of indexed elements
  2013-09-27 15:19         ` Vidya Praveen
@ 2013-09-30 12:55           ` Vidya Praveen
  2013-09-30 13:19             ` Richard Biener
  0 siblings, 1 reply; 20+ messages in thread
From: Vidya Praveen @ 2013-09-30 12:55 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc, ook

On Fri, Sep 27, 2013 at 04:19:45PM +0100, Vidya Praveen wrote:
> On Fri, Sep 27, 2013 at 03:50:08PM +0100, Vidya Praveen wrote:
> [...]
> > > > I can't really insist on the single lane load.. something like:
> > > > 
> > > > vc:V4SI[0] = c
> > > > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
> > > > va:V4SI = vb:V4SI <op> vt:V4SI
> > > > 
> > > > Or is there any other way to do this?
> > > 
> > > Can you elaborate on "I can't really insist on the single lane load"?
> > > What's the single lane load in your example? 
> > 
> > Loading just one lane of the vector like this:
> > 
> > vc:V4SI[0] = c // from the above scalar example
> > 
> > or 
> > 
> > vc:V4SI[0] = c[2] 
> > 
> > is what I meant by single lane load. In this example:
> > 
> > t = c[2] 
> > ...
> > vb:v4si = b[0:3] 
> > vc:v4si = { t, t, t, t }
> > va:v4si = vb:v4si <op> vc:v4si 
> > 
> > If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, I cannot
> > insist 't' to be vector and t = c[2] to be vect_t[0] = c[2] (which could be 
> > seen as vec_select:SI (vect_t 0) ). 
> > 
> > > I'd expect the instruction
> > > pattern as quoted to just work (and I hope we expand an uniform
> > > constructor { a, a, a, a } properly using vec_duplicate).
> > 
> > As much as I went through the code, this is only done using vect_init. It is
> > not expanded as vec_duplicate from, for example, store_constructor() of expr.c
> 
> Do you see any issues if we expand such constructor as vec_duplicate directly 
> instead of going through vect_init way? 

Sorry, that was a bad question.

But here's what I would like to propose as a first step. Please tell me if this
is acceptable or if it makes sense:

- Introduce standard pattern names 

"vmulim4" - vector muliply with second operand as indexed operand

Example:

(define_insn "vmuliv4si4"
   [set (match_operand:V4SI 0 "register_operand")
        (mul:V4SI (match_operand:V4SI 1 "register_operand")
                  (vec_duplicate:V4SI
                    (vec_select:SI
                      (match_operand:V4SI 2 "register_operand")
                      (match_operand:V4SI 3 "immediate_operand)))))]
 ...
)

"vlmovmn3" - move where one of the operands is specific lane of a vector and 
             other is a scalar. 

Example:

(define_insn "vlmovv4sisi3"
  [set (vec_select:SI (match_operand:V4SI 0 "register_operand")
                      (match_operand:SI 1 "immediate_operand"))
       (match_operand:SI 2 "memory_operand")]
  ...
)

- Identify the following idiom and expand through the above standard patterns:

  t = c[m] 
  vc[0:n] = { t, t, t, t}
  a[0:n] = b[0:n] * vc[0:n] 

as 

 (insn (set (vec_select:SI (reg:V4SI 0) 0) (mem:SI ... )))
 (insn (set (reg:V4SI 1)
            (mult:V4SI (reg:V4SI 2)
                       (vec_duplicate:V4SI (vec_select:SI (reg:V4SI 0) 0)))))

If this path is acceptable, then I can extend this to support 

"vmaddim4" - multiply and add (with indexed element as multiplier)
"vmsubim4" - multiply and subtract (with indexed element as multiplier)

Please let me know your thoughts.

Cheers
VP


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] Vectorization of indexed elements
  2013-09-25  9:22       ` Richard Biener
@ 2013-09-30 13:01         ` Vidya Praveen
  0 siblings, 0 replies; 20+ messages in thread
From: Vidya Praveen @ 2013-09-30 13:01 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc, ook

On Wed, Sep 25, 2013 at 10:22:05AM +0100, Richard Biener wrote:
> On Tue, 24 Sep 2013, Vidya Praveen wrote:
> 
> > On Tue, Sep 10, 2013 at 09:25:32AM +0100, Richard Biener wrote:
> > > On Mon, 9 Sep 2013, Marc Glisse wrote:
> > > 
> > > > On Mon, 9 Sep 2013, Vidya Praveen wrote:
> > > > 
> > > > > Hello,
> > > > > 
> > > > > This post details some thoughts on an enhancement to the vectorizer that
> > > > > could take advantage of the SIMD instructions that allows indexed element
> > > > > as an operand thus reducing the need for duplication and possibly improve
> > > > > reuse of previously loaded data.
> > > > > 
> > > > > Appreciate your opinion on this.
> > > > > 
> > > > > ---
> > > > > 
> > > > > A phrase like this:
> > > > > 
> > > > > for(i=0;i<4;i++)
> > > > >   a[i] = b[i] <op> c[2];
> > > > > 
> > > > > is usually vectorized as:
> > > > > 
> > > > >  va:V4SI = a[0:3]
> > > > >  vb:V4SI = b[0:3]
> > > > >  t = c[2]
> > > > >  vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at vec_init
> > > > >  ...
> > > > >  va:V4SI = vb:V4SI <op> vc:V4SI
> > > > > 
> > > > > But this could be simplified further if a target has instructions that
> > > > > support
> > > > > indexed element as a parameter. For example an instruction like this:
> > > > > 
> > > > >  mul v0.4s, v1.4s, v2.4s[2]
> > > > > 
> > > > > can perform multiplication of each element of v2.4s with the third element
> > > > > of
> > > > > v2.4s (specified as v2.4s[2]) and store the results in the corresponding
> > > > > elements of v0.4s.
> > > > > 
> > > > > For this to happen, vectorizer needs to understand this idiom and treat the
> > > > > operand c[2] specially (and by taking in to consideration if the machine
> > > > > supports indexed element as an operand for <op> through a target hook or
> > > > > macro)
> > > > > and consider this as vectorizable statement without having to duplicate the
> > > > > elements explicitly.
> > > > > 
> > > > > There are fews ways this could be represented at gimple:
> > > > > 
> > > > >  ...
> > > > >  va:V4SI = vb:V4SI <op> VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> > > > >  ...
> > > > > 
> > > > > or by allowing a vectorizer treat an indexed element as a valid operand in a
> > > > > vectorizable statement:
> > > > 
> > > > Might as well allow any scalar then...
> > > 
> > > I agree.  The VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) form
> > > would necessarily be two extra separate statements and thus subject
> > > to CSE obfuscating it enough for RTL expansion to no longer notice it.
> > 
> > I also thought about having a specialized expression like
> > 
> > VEC_INDEXED_<op>_EXPR < arg0, arg1, arg2, index> 
> > 
> > to mean:
> > 
> > arg0 = arg1 <op> arg2[index]
> > 
> > and handle it directly in the expander, like (for eg.) how VEC_LSHIFT_EXPR
> > is handled in expr.c. But I dropped this idea since we may need to introduce
> > many such nodes.
> > 
> > > 
> > > That said, allowing mixed scalar/vector ops isn't very nice and
> > > your scheme can be simplified by just using
> > > 
> > >   vc:V4SI = VEC_DUPLICATE_EXPR <...>
> > >   va:V4SI = vb:V4SI <op> vc:V4SI
> > > 
> > > where the expander only has to see that vc:V4SI is defined by
> > > a duplicate.
> > 
> > I did try out something like this quickly before I posted this RFC, though
> > I called it VEC_DUP to mean a equivalent of vec_duplicate(vec_select())
> > 
> > for: 
> > 
> >   for(i=0;i<8;i++)
> >     a[i] = b[2] * c[i];
> > 
> > I could generate:
> > 
> >   ...
> >   <bb 8>:
> >   _88 = prolog_loop_adjusted_niters.6_60 * 4;
> >   vectp_c.13_87 = c_10(D) + _88;
> >   vect_ldidx_.16_92 = MEM[(int *)b_8(D) + 8B];                     <<<<<<<<
> >   vect_idxed_.17_93 = (vect_ldidx_.16_92) <<< ??? >>> (0);         <<<<<<<<
> >   _96 = prolog_loop_adjusted_niters.6_60 * 4;
> >   vectp_a.19_95 = a_6(D) + _96;
> >   vect__12.14_115 = MEM[(int *)vectp_c.13_87];
> >   vect_patt_40.15_116 = vect__12.14_115 * vect_idxed_.17_93;       <<<<<<<<
> >   MEM[(int *)vectp_a.19_95] = vect_patt_40.15_116;                 <<<<<<<<
> >   vectp_c.12_118 = vectp_c.13_87 + 16;
> >   vectp_a.18_119 = vectp_a.19_95 + 16;
> >   ivtmp_120 = 1;
> >   if (ivtmp_120 < bnd.8_62)
> >     goto <bb 9>;
> >   else
> >     goto <bb 11>;
> > 
> >   <bb 9>:
> >   # vectp_c.12_89 = PHI <vectp_c.12_118(8)>
> >   # vectp_a.18_97 = PHI <vectp_a.18_119(8)>
> >   # ivtmp_14 = PHI <ivtmp_120(8)>
> >   vect__12.14_91 = MEM[(int *)vectp_c.12_89];                      <<<<<<<<
> >   vect_patt_40.15_94 = vect__12.14_91 * vect_idxed_.17_93;         <<<<<<<<
> >   MEM[(int *)vectp_a.18_97] = vect_patt_40.15_94;
> >   ...
> > 
> > It's a crude implementation so VEC_DUP is printed as:
> > 
> >   (vect_ldidx_.16_92) <<< ??? >>> (0);
> > 
> > 
> > > > >  ...
> > > > >  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 2)
> > > > >  ...
> > > > >
> > > > > For the sake of explanation, the above two representations assumes that
> > > > > c[0:3] is loaded in vc for some other use and reused here. But when c[2] is
> > > > > the
> > > > > only use of 'c' then it may be safer to just load one element and use it
> > > > > like
> > > > > this:
> > > > >
> > > > >  vc:V4SI[0] = c[2]
> > > > >  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
> > > > > 
> > > > > This could also mean that expressions involving scalar could be treated
> > > > > similarly. For example,
> > > > > 
> > > > >  for(i=0;i<4;i++)
> > > > >    a[i] = b[i] <op> c
> > > > > 
> > > > > could be vectorized as:
> > > > > 
> > > > >  vc:V4SI[0] = c
> > > > >  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
> > > > > 
> > > > > Such a change would also require new standard pattern names to be defined
> > > > > for
> > > > > each <op>.
> > > > > 
> > > > > Alternatively, having something like this:
> > > > > 
> > > > >  ...
> > > > >  vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> > > > >  va:V4SI = vb:V4SI <op> vt:V4SI
> > > > >  ...
> > > > > 
> > > > > would remove the need to introduce several new standard pattern names but
> > > > > have
> > > > > just one to represent vec_duplicate(vec_select()) but ofcourse this will
> > > > > expect
> > > > > the target to have combiner patterns.
> > > > 
> > > > The cost estimation wouldn't be very good, but aren't combine patterns enough
> > > > for the whole thing? Don't you model your mul instruction as:
> > > > 
> > > > (mult:V4SI
> > > >   (match_operand:V4SI)
> > > >   (vec_duplicate:V4SI (vec_select:SI (match_operand:V4SI))))
> > > > 
> > > > anyway? Seems that combine should be able to handle it. What currently happens
> > > > that we fail to generate the right instruction?
> > > > 
> > > > In gimple, we already have BIT_FIELD_REF for vec_select and CONSTRUCTOR for
> > > > vec_duplicate, adding new nodes is always painful.
> > > 
> > > True, though CONSTRUCTOR isn't a good vec_duplicate primitive.  But yes,
> > > we have it that way at the moment and indeed adding new nodes is always
> > > painful.
> > > 
> > > > > This enhancement could possibly help further optimizing larger scenarios
> > > > > such
> > > > > as linear systems.
> > > 
> > > Given that the vectorizer already handles all cases you quote but
> > > just the expansion doesn't use the targets special abilities - can't
> > > you just teach the expander to lookup the definition of the
> > > vectors and see if it is an uniform CONSTRUCTOR?
> > > 
> > > Richard.
> > > 
> > 
> > I did think of handling this as a part of expanding CONSTRUCTOR but I thought
> > it may not a good idea if we want to enhance this support in the future to
> > handle larger cases like this one (hypothetical example):
> > 
> >   for i = 0 to 3
> >     for j = 0 to 3
> >       a[j] += b[j] * c[i]
> > 
> > to
> >  
> >   a[0:3] += b[0:3] + c[0] 
> >   a[0:3] += b[0:3] + c[1] 
> >   a[0:3] += b[0:3] + c[2] 
> >   a[0:3] += b[0:3] + c[3]

Actually there was a typo in my previous example, it should've been:

   a[0:3] += b[0:3] * c[0]
   a[0:3] += b[0:3] * c[1]
   a[0:3] += b[0:3] * c[2]
   a[0:3] += b[0:3] * c[3]

> > 
> > Secondly, I am not sure if introducing a single lane load at the time of 
> > expansion and removing or expecting the existing scalar load to be removed
> > later as it is unused, is a good idea. 
> 
> The above example requires unrolling and scalar cleanup anyway so you'll
> currently see sth like
> 
> tem00_1 = c[0];
> tem0_2 = { tem00_1, tem00_1, tem00_1, tem00_1 };
> temb_3 = b[0:3];
> tems_4 = temb_3 + tem0_2;
> tema_5 = a[0:3];
> tems_6 = tema_5 + tems_4;
> a[0:3] = tems_6;
> 
> which at expansion time can be short-cut easily without even going to
> expand the constructor.  That's the feature of TER which defers
> expansion of SSA name definition statements until they are first
> used.  In this case when you expand tems_4 you can pattern match
> the case you are interested in (not 100% sure which it is, I suppose
> it is vector register + vector_select < vector register >) and
> simply expand it optimally.  The "consumed" SSA names will have no
> further uses and will simply not be expanded at all.

Thanks. This sounds better. What I'm interested in is:

<vecreg> += <vecreg> * vec_duplicate( vec_select( <vecreg> <index>))

Cheers
VP



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] Vectorization of indexed elements
  2013-09-30 12:55           ` Vidya Praveen
@ 2013-09-30 13:19             ` Richard Biener
  2013-09-30 14:00               ` Vidya Praveen
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Biener @ 2013-09-30 13:19 UTC (permalink / raw)
  To: Vidya Praveen; +Cc: gcc, ook

On Mon, 30 Sep 2013, Vidya Praveen wrote:

> On Fri, Sep 27, 2013 at 04:19:45PM +0100, Vidya Praveen wrote:
> > On Fri, Sep 27, 2013 at 03:50:08PM +0100, Vidya Praveen wrote:
> > [...]
> > > > > I can't really insist on the single lane load.. something like:
> > > > > 
> > > > > vc:V4SI[0] = c
> > > > > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
> > > > > va:V4SI = vb:V4SI <op> vt:V4SI
> > > > > 
> > > > > Or is there any other way to do this?
> > > > 
> > > > Can you elaborate on "I can't really insist on the single lane load"?
> > > > What's the single lane load in your example? 
> > > 
> > > Loading just one lane of the vector like this:
> > > 
> > > vc:V4SI[0] = c // from the above scalar example
> > > 
> > > or 
> > > 
> > > vc:V4SI[0] = c[2] 
> > > 
> > > is what I meant by single lane load. In this example:
> > > 
> > > t = c[2] 
> > > ...
> > > vb:v4si = b[0:3] 
> > > vc:v4si = { t, t, t, t }
> > > va:v4si = vb:v4si <op> vc:v4si 
> > > 
> > > If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, I cannot
> > > insist 't' to be vector and t = c[2] to be vect_t[0] = c[2] (which could be 
> > > seen as vec_select:SI (vect_t 0) ). 
> > > 
> > > > I'd expect the instruction
> > > > pattern as quoted to just work (and I hope we expand an uniform
> > > > constructor { a, a, a, a } properly using vec_duplicate).
> > > 
> > > As much as I went through the code, this is only done using vect_init. It is
> > > not expanded as vec_duplicate from, for example, store_constructor() of expr.c
> > 
> > Do you see any issues if we expand such constructor as vec_duplicate directly 
> > instead of going through vect_init way? 
> 
> Sorry, that was a bad question.
> 
> But here's what I would like to propose as a first step. Please tell me if this
> is acceptable or if it makes sense:
> 
> - Introduce standard pattern names 
> 
> "vmulim4" - vector muliply with second operand as indexed operand
> 
> Example:
> 
> (define_insn "vmuliv4si4"
>    [set (match_operand:V4SI 0 "register_operand")
>         (mul:V4SI (match_operand:V4SI 1 "register_operand")
>                   (vec_duplicate:V4SI
>                     (vec_select:SI
>                       (match_operand:V4SI 2 "register_operand")
>                       (match_operand:V4SI 3 "immediate_operand)))))]
>  ...
> )

We could factor this with providing a standard pattern name for

(define_insn "vdupi<mode>"
  [set (match_operand:<mode> 0 "register_operand")
       (vec_duplicate:<mode>
          (vec_select:<scalarmode>
             (match_operand:<mode> 1 "register_operand")
             (match_operand:SI 2 "immediate_operand))))]

(you use V4SI for the immediate?  Ideally vdupi has another custom
mode for the vector index).

Note that this factored pattern is already available as vec_perm_const!
It is simply (vec_perm_const:V4SI <source> <source> <immediate-selector>).

Which means that on the GIMPLE level we should try to combine

el_4 = BIT_FIELD_REF <v_3, ...>;
v_5 = { el_4, el_4, ... };

into

v_5 = VEC_PERM_EXPR <v_3, v_3, ...>;

which it should already do with simplify_permutation.

But I'm not sure what you are after at then end ;)

Richard.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] Vectorization of indexed elements
  2013-09-30 13:19             ` Richard Biener
@ 2013-09-30 14:00               ` Vidya Praveen
  2013-10-01  8:26                 ` Richard Biener
  0 siblings, 1 reply; 20+ messages in thread
From: Vidya Praveen @ 2013-09-30 14:00 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc, ook

On Mon, Sep 30, 2013 at 02:19:32PM +0100, Richard Biener wrote:
> On Mon, 30 Sep 2013, Vidya Praveen wrote:
> 
> > On Fri, Sep 27, 2013 at 04:19:45PM +0100, Vidya Praveen wrote:
> > > On Fri, Sep 27, 2013 at 03:50:08PM +0100, Vidya Praveen wrote:
> > > [...]
> > > > > > I can't really insist on the single lane load.. something like:
> > > > > > 
> > > > > > vc:V4SI[0] = c
> > > > > > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
> > > > > > va:V4SI = vb:V4SI <op> vt:V4SI
> > > > > > 
> > > > > > Or is there any other way to do this?
> > > > > 
> > > > > Can you elaborate on "I can't really insist on the single lane load"?
> > > > > What's the single lane load in your example? 
> > > > 
> > > > Loading just one lane of the vector like this:
> > > > 
> > > > vc:V4SI[0] = c // from the above scalar example
> > > > 
> > > > or 
> > > > 
> > > > vc:V4SI[0] = c[2] 
> > > > 
> > > > is what I meant by single lane load. In this example:
> > > > 
> > > > t = c[2] 
> > > > ...
> > > > vb:v4si = b[0:3] 
> > > > vc:v4si = { t, t, t, t }
> > > > va:v4si = vb:v4si <op> vc:v4si 
> > > > 
> > > > If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, I cannot
> > > > insist 't' to be vector and t = c[2] to be vect_t[0] = c[2] (which could be 
> > > > seen as vec_select:SI (vect_t 0) ). 
> > > > 
> > > > > I'd expect the instruction
> > > > > pattern as quoted to just work (and I hope we expand an uniform
> > > > > constructor { a, a, a, a } properly using vec_duplicate).
> > > > 
> > > > As much as I went through the code, this is only done using vect_init. It is
> > > > not expanded as vec_duplicate from, for example, store_constructor() of expr.c
> > > 
> > > Do you see any issues if we expand such constructor as vec_duplicate directly 
> > > instead of going through vect_init way? 
> > 
> > Sorry, that was a bad question.
> > 
> > But here's what I would like to propose as a first step. Please tell me if this
> > is acceptable or if it makes sense:
> > 
> > - Introduce standard pattern names 
> > 
> > "vmulim4" - vector muliply with second operand as indexed operand
> > 
> > Example:
> > 
> > (define_insn "vmuliv4si4"
> >    [set (match_operand:V4SI 0 "register_operand")
> >         (mul:V4SI (match_operand:V4SI 1 "register_operand")
> >                   (vec_duplicate:V4SI
> >                     (vec_select:SI
> >                       (match_operand:V4SI 2 "register_operand")
> >                       (match_operand:V4SI 3 "immediate_operand)))))]
> >  ...
> > )
> 
> We could factor this with providing a standard pattern name for
> 
> (define_insn "vdupi<mode>"
>   [set (match_operand:<mode> 0 "register_operand")
>        (vec_duplicate:<mode>
>           (vec_select:<scalarmode>
>              (match_operand:<mode> 1 "register_operand")
>              (match_operand:SI 2 "immediate_operand))))]

This is good. I did think about this but then I thought of avoiding the need
for combiner patterns :-) 

But do you find the lane specific mov pattern I proposed, acceptable? 

> (you use V4SI for the immediate?  

Sorry typo again!! It should've been SI.

> Ideally vdupi has another custom
> mode for the vector index).
> 
> Note that this factored pattern is already available as vec_perm_const!
> It is simply (vec_perm_const:V4SI <source> <source> <immediate-selector>).
> 
> Which means that on the GIMPLE level we should try to combine
> 
> el_4 = BIT_FIELD_REF <v_3, ...>;
> v_5 = { el_4, el_4, ... };

I don't think we reach this state at all for the scenarios in discussion.
what we generally have is:

 el_4 = MEM_REF < array + index*size >
 v_5 = { el_4, ... }

Or am I missing something?

> 
> into
> 
> v_5 = VEC_PERM_EXPR <v_3, v_3, ...>;
> 
> which it should already do with simplify_permutation.
> 
> But I'm not sure what you are after at then end ;)
> 
> Richard.
>
 
Regards
VP

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] Vectorization of indexed elements
  2013-09-30 14:00               ` Vidya Praveen
@ 2013-10-01  8:26                 ` Richard Biener
  2013-10-11 14:54                   ` Vidya Praveen
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Biener @ 2013-10-01  8:26 UTC (permalink / raw)
  To: Vidya Praveen; +Cc: gcc, ook

On Mon, 30 Sep 2013, Vidya Praveen wrote:

> On Mon, Sep 30, 2013 at 02:19:32PM +0100, Richard Biener wrote:
> > On Mon, 30 Sep 2013, Vidya Praveen wrote:
> > 
> > > On Fri, Sep 27, 2013 at 04:19:45PM +0100, Vidya Praveen wrote:
> > > > On Fri, Sep 27, 2013 at 03:50:08PM +0100, Vidya Praveen wrote:
> > > > [...]
> > > > > > > I can't really insist on the single lane load.. something like:
> > > > > > > 
> > > > > > > vc:V4SI[0] = c
> > > > > > > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
> > > > > > > va:V4SI = vb:V4SI <op> vt:V4SI
> > > > > > > 
> > > > > > > Or is there any other way to do this?
> > > > > > 
> > > > > > Can you elaborate on "I can't really insist on the single lane load"?
> > > > > > What's the single lane load in your example? 
> > > > > 
> > > > > Loading just one lane of the vector like this:
> > > > > 
> > > > > vc:V4SI[0] = c // from the above scalar example
> > > > > 
> > > > > or 
> > > > > 
> > > > > vc:V4SI[0] = c[2] 
> > > > > 
> > > > > is what I meant by single lane load. In this example:
> > > > > 
> > > > > t = c[2] 
> > > > > ...
> > > > > vb:v4si = b[0:3] 
> > > > > vc:v4si = { t, t, t, t }
> > > > > va:v4si = vb:v4si <op> vc:v4si 
> > > > > 
> > > > > If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, I cannot
> > > > > insist 't' to be vector and t = c[2] to be vect_t[0] = c[2] (which could be 
> > > > > seen as vec_select:SI (vect_t 0) ). 
> > > > > 
> > > > > > I'd expect the instruction
> > > > > > pattern as quoted to just work (and I hope we expand an uniform
> > > > > > constructor { a, a, a, a } properly using vec_duplicate).
> > > > > 
> > > > > As much as I went through the code, this is only done using vect_init. It is
> > > > > not expanded as vec_duplicate from, for example, store_constructor() of expr.c
> > > > 
> > > > Do you see any issues if we expand such constructor as vec_duplicate directly 
> > > > instead of going through vect_init way? 
> > > 
> > > Sorry, that was a bad question.
> > > 
> > > But here's what I would like to propose as a first step. Please tell me if this
> > > is acceptable or if it makes sense:
> > > 
> > > - Introduce standard pattern names 
> > > 
> > > "vmulim4" - vector muliply with second operand as indexed operand
> > > 
> > > Example:
> > > 
> > > (define_insn "vmuliv4si4"
> > >    [set (match_operand:V4SI 0 "register_operand")
> > >         (mul:V4SI (match_operand:V4SI 1 "register_operand")
> > >                   (vec_duplicate:V4SI
> > >                     (vec_select:SI
> > >                       (match_operand:V4SI 2 "register_operand")
> > >                       (match_operand:V4SI 3 "immediate_operand)))))]
> > >  ...
> > > )
> > 
> > We could factor this with providing a standard pattern name for
> > 
> > (define_insn "vdupi<mode>"
> >   [set (match_operand:<mode> 0 "register_operand")
> >        (vec_duplicate:<mode>
> >           (vec_select:<scalarmode>
> >              (match_operand:<mode> 1 "register_operand")
> >              (match_operand:SI 2 "immediate_operand))))]
> 
> This is good. I did think about this but then I thought of avoiding the need
> for combiner patterns :-) 
> 
> But do you find the lane specific mov pattern I proposed, acceptable? 

The specific mul pattern?  As said, consider factoring to vdupi to
avoid an explosion in required special optabs.

> > (you use V4SI for the immediate?  
> 
> Sorry typo again!! It should've been SI.
> 
> > Ideally vdupi has another custom
> > mode for the vector index).
> > 
> > Note that this factored pattern is already available as vec_perm_const!
> > It is simply (vec_perm_const:V4SI <source> <source> <immediate-selector>).
> > 
> > Which means that on the GIMPLE level we should try to combine
> > 
> > el_4 = BIT_FIELD_REF <v_3, ...>;
> > v_5 = { el_4, el_4, ... };
> 
> I don't think we reach this state at all for the scenarios in discussion.
> what we generally have is:
> 
>  el_4 = MEM_REF < array + index*size >
>  v_5 = { el_4, ... }
> 
> Or am I missing something?

Well, but in that case I doubt it is profitable (or even valid!) to
turn this into a vector lane load from the array.  If it is profitable
to perform a vector read (because we're going to use the other elements
of the vector as well) then the vectorizer should produce a vector
load and materialize the uniform vector from one of its elements.

Maybe at this point you should show us a compilable C testcase
with a loop that should be vectorized using your instructions in
the end?

Thanks,
Richard.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] Vectorization of indexed elements
  2013-10-01  8:26                 ` Richard Biener
@ 2013-10-11 14:54                   ` Vidya Praveen
  2013-10-11 15:05                     ` Jakub Jelinek
  2013-10-14  8:05                     ` Richard Biener
  0 siblings, 2 replies; 20+ messages in thread
From: Vidya Praveen @ 2013-10-11 14:54 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc, ook, marc.glisse

On Tue, Oct 01, 2013 at 09:26:25AM +0100, Richard Biener wrote:
> On Mon, 30 Sep 2013, Vidya Praveen wrote:
> 
> > On Mon, Sep 30, 2013 at 02:19:32PM +0100, Richard Biener wrote:
> > > On Mon, 30 Sep 2013, Vidya Praveen wrote:
> > > 
> > > > On Fri, Sep 27, 2013 at 04:19:45PM +0100, Vidya Praveen wrote:
> > > > > On Fri, Sep 27, 2013 at 03:50:08PM +0100, Vidya Praveen wrote:
> > > > > [...]
> > > > > > > > I can't really insist on the single lane load.. something like:
> > > > > > > > 
> > > > > > > > vc:V4SI[0] = c
> > > > > > > > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
> > > > > > > > va:V4SI = vb:V4SI <op> vt:V4SI
> > > > > > > > 
> > > > > > > > Or is there any other way to do this?
> > > > > > > 
> > > > > > > Can you elaborate on "I can't really insist on the single lane load"?
> > > > > > > What's the single lane load in your example? 
> > > > > > 
> > > > > > Loading just one lane of the vector like this:
> > > > > > 
> > > > > > vc:V4SI[0] = c // from the above scalar example
> > > > > > 
> > > > > > or 
> > > > > > 
> > > > > > vc:V4SI[0] = c[2] 
> > > > > > 
> > > > > > is what I meant by single lane load. In this example:
> > > > > > 
> > > > > > t = c[2] 
> > > > > > ...
> > > > > > vb:v4si = b[0:3] 
> > > > > > vc:v4si = { t, t, t, t }
> > > > > > va:v4si = vb:v4si <op> vc:v4si 
> > > > > > 
> > > > > > If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, I cannot
> > > > > > insist 't' to be vector and t = c[2] to be vect_t[0] = c[2] (which could be 
> > > > > > seen as vec_select:SI (vect_t 0) ). 
> > > > > > 
> > > > > > > I'd expect the instruction
> > > > > > > pattern as quoted to just work (and I hope we expand an uniform
> > > > > > > constructor { a, a, a, a } properly using vec_duplicate).
> > > > > > 
> > > > > > As much as I went through the code, this is only done using vect_init. It is
> > > > > > not expanded as vec_duplicate from, for example, store_constructor() of expr.c
> > > > > 
> > > > > Do you see any issues if we expand such constructor as vec_duplicate directly 
> > > > > instead of going through vect_init way? 
> > > > 
> > > > Sorry, that was a bad question.
> > > > 
> > > > But here's what I would like to propose as a first step. Please tell me if this
> > > > is acceptable or if it makes sense:
> > > > 
> > > > - Introduce standard pattern names 
> > > > 
> > > > "vmulim4" - vector muliply with second operand as indexed operand
> > > > 
> > > > Example:
> > > > 
> > > > (define_insn "vmuliv4si4"
> > > >    [set (match_operand:V4SI 0 "register_operand")
> > > >         (mul:V4SI (match_operand:V4SI 1 "register_operand")
> > > >                   (vec_duplicate:V4SI
> > > >                     (vec_select:SI
> > > >                       (match_operand:V4SI 2 "register_operand")
> > > >                       (match_operand:V4SI 3 "immediate_operand)))))]
> > > >  ...
> > > > )
> > > 
> > > We could factor this with providing a standard pattern name for
> > > 
> > > (define_insn "vdupi<mode>"
> > >   [set (match_operand:<mode> 0 "register_operand")
> > >        (vec_duplicate:<mode>
> > >           (vec_select:<scalarmode>
> > >              (match_operand:<mode> 1 "register_operand")
> > >              (match_operand:SI 2 "immediate_operand))))]
> > 
> > This is good. I did think about this but then I thought of avoiding the need
> > for combiner patterns :-) 
> > 
> > But do you find the lane specific mov pattern I proposed, acceptable? 
> 
> The specific mul pattern?  As said, consider factoring to vdupi to
> avoid an explosion in required special optabs.
> 
> > > (you use V4SI for the immediate?  
> > 
> > Sorry typo again!! It should've been SI.
> > 
> > > Ideally vdupi has another custom
> > > mode for the vector index).
> > > 
> > > Note that this factored pattern is already available as vec_perm_const!
> > > It is simply (vec_perm_const:V4SI <source> <source> <immediate-selector>).
> > > 
> > > Which means that on the GIMPLE level we should try to combine
> > > 
> > > el_4 = BIT_FIELD_REF <v_3, ...>;
> > > v_5 = { el_4, el_4, ... };
> > 
> > I don't think we reach this state at all for the scenarios in discussion.
> > what we generally have is:
> > 
> >  el_4 = MEM_REF < array + index*size >
> >  v_5 = { el_4, ... }
> > 
> > Or am I missing something?
> 
> Well, but in that case I doubt it is profitable (or even valid!) to
> turn this into a vector lane load from the array.  If it is profitable
> to perform a vector read (because we're going to use the other elements
> of the vector as well) then the vectorizer should produce a vector
> load and materialize the uniform vector from one of its elements.
> 
> Maybe at this point you should show us a compilable C testcase
> with a loop that should be vectorized using your instructions in
> the end?

Here's a compilable example:

void 
foo (int *__restrict__ a,
     int *__restrict__ b,
     int *__restrict__ c)
{
  int i;

  for (i = 0; i < 8; i++)
    a[i] = b[i] * c[2];
}

This is vectorized by duplicating c[2] now. But I'm trying to take advantage
of target instructions that can take a vector register as second argument but
use only one element (by using the same value for all the lanes) of the 
vector register.

Eg. mul <vec-reg>, <vec-reg>, <vec-reg>[index]
    mla <vec-reg>, <vec-reg>, <vec-reg>[index] // multiply and add

But for a loop like the one in the C example given, I will have to load the
c[2] in one element of the vector register (leaving the remaining unused)
rather. This is why I was proposing to load just one element in a vector 
register (what I meant as "lane specific load"). The benefit of doing this is
that we avoid explicit duplication, however such a simplification can only
be done where such support is available - the reason why I was thinking in
terms of optional standard pattern name. Another benefit is we will also be
able to support scalars in the expression like in the following example:

void
foo (int *__restrict__ a,
     int *__restrict__ b,
     int c)
{
  int i;

  for (i = 0; i < 8; i++)
    a[i] = b[i] * c;
}


Another example which can take advantage of the target feature:

void 
foo (int *__restrict__ a,
     int *__restrict__ b,
     int *__restrict__ c)
{
  int i,j;

  for (i = 0; i < 8; i++)
    for (j = 0; j < 8; j++)
      a[j] += b[j] * c[i];
}

This scenario we discussed this earlier (you suggested handling this at TER).

Cheers
VP


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] Vectorization of indexed elements
  2013-10-11 14:54                   ` Vidya Praveen
@ 2013-10-11 15:05                     ` Jakub Jelinek
  2013-12-04 17:07                       ` Vidya Praveen
  2013-10-14  8:05                     ` Richard Biener
  1 sibling, 1 reply; 20+ messages in thread
From: Jakub Jelinek @ 2013-10-11 15:05 UTC (permalink / raw)
  To: Vidya Praveen; +Cc: Richard Biener, gcc, ook, marc.glisse

On Fri, Oct 11, 2013 at 03:54:08PM +0100, Vidya Praveen wrote:
> Here's a compilable example:
> 
> void 
> foo (int *__restrict__ a,
>      int *__restrict__ b,
>      int *__restrict__ c)
> {
>   int i;
> 
>   for (i = 0; i < 8; i++)
>     a[i] = b[i] * c[2];
> }
> 
> This is vectorized by duplicating c[2] now. But I'm trying to take advantage
> of target instructions that can take a vector register as second argument but
> use only one element (by using the same value for all the lanes) of the 
> vector register.
> 
> Eg. mul <vec-reg>, <vec-reg>, <vec-reg>[index]
>     mla <vec-reg>, <vec-reg>, <vec-reg>[index] // multiply and add
> 
> But for a loop like the one in the C example given, I will have to load the
> c[2] in one element of the vector register (leaving the remaining unused)
> rather. This is why I was proposing to load just one element in a vector 
> register (what I meant as "lane specific load"). The benefit of doing this is
> that we avoid explicit duplication, however such a simplification can only
> be done where such support is available - the reason why I was thinking in
> terms of optional standard pattern name. Another benefit is we will also be
> able to support scalars in the expression like in the following example:
> 
> void
> foo (int *__restrict__ a,
>      int *__restrict__ b,
>      int c)
> {
>   int i;
> 
>   for (i = 0; i < 8; i++)
>     a[i] = b[i] * c;
> }

So just during combine let the broadcast operation be combined with the
arithmetics?  Intel AVX512 ISA has similar feature, not sure what exactly
they are doing for this.  That said, the broadcast is likely going to be
hoisted before the loop, and in that case is it really cheaper to have
it unbroadcasted in a vector register rather than to broadcast it before the
loop and just use there?

	Jakub

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] Vectorization of indexed elements
  2013-10-11 14:54                   ` Vidya Praveen
  2013-10-11 15:05                     ` Jakub Jelinek
@ 2013-10-14  8:05                     ` Richard Biener
  2013-12-04 16:10                       ` Vidya Praveen
  1 sibling, 1 reply; 20+ messages in thread
From: Richard Biener @ 2013-10-14  8:05 UTC (permalink / raw)
  To: Vidya Praveen; +Cc: gcc, ook, marc.glisse

On Fri, 11 Oct 2013, Vidya Praveen wrote:

> On Tue, Oct 01, 2013 at 09:26:25AM +0100, Richard Biener wrote:
> > On Mon, 30 Sep 2013, Vidya Praveen wrote:
> > 
> > > On Mon, Sep 30, 2013 at 02:19:32PM +0100, Richard Biener wrote:
> > > > On Mon, 30 Sep 2013, Vidya Praveen wrote:
> > > > 
> > > > > On Fri, Sep 27, 2013 at 04:19:45PM +0100, Vidya Praveen wrote:
> > > > > > On Fri, Sep 27, 2013 at 03:50:08PM +0100, Vidya Praveen wrote:
> > > > > > [...]
> > > > > > > > > I can't really insist on the single lane load.. something like:
> > > > > > > > > 
> > > > > > > > > vc:V4SI[0] = c
> > > > > > > > > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
> > > > > > > > > va:V4SI = vb:V4SI <op> vt:V4SI
> > > > > > > > > 
> > > > > > > > > Or is there any other way to do this?
> > > > > > > > 
> > > > > > > > Can you elaborate on "I can't really insist on the single lane load"?
> > > > > > > > What's the single lane load in your example? 
> > > > > > > 
> > > > > > > Loading just one lane of the vector like this:
> > > > > > > 
> > > > > > > vc:V4SI[0] = c // from the above scalar example
> > > > > > > 
> > > > > > > or 
> > > > > > > 
> > > > > > > vc:V4SI[0] = c[2] 
> > > > > > > 
> > > > > > > is what I meant by single lane load. In this example:
> > > > > > > 
> > > > > > > t = c[2] 
> > > > > > > ...
> > > > > > > vb:v4si = b[0:3] 
> > > > > > > vc:v4si = { t, t, t, t }
> > > > > > > va:v4si = vb:v4si <op> vc:v4si 
> > > > > > > 
> > > > > > > If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, I cannot
> > > > > > > insist 't' to be vector and t = c[2] to be vect_t[0] = c[2] (which could be 
> > > > > > > seen as vec_select:SI (vect_t 0) ). 
> > > > > > > 
> > > > > > > > I'd expect the instruction
> > > > > > > > pattern as quoted to just work (and I hope we expand an uniform
> > > > > > > > constructor { a, a, a, a } properly using vec_duplicate).
> > > > > > > 
> > > > > > > As much as I went through the code, this is only done using vect_init. It is
> > > > > > > not expanded as vec_duplicate from, for example, store_constructor() of expr.c
> > > > > > 
> > > > > > Do you see any issues if we expand such constructor as vec_duplicate directly 
> > > > > > instead of going through vect_init way? 
> > > > > 
> > > > > Sorry, that was a bad question.
> > > > > 
> > > > > But here's what I would like to propose as a first step. Please tell me if this
> > > > > is acceptable or if it makes sense:
> > > > > 
> > > > > - Introduce standard pattern names 
> > > > > 
> > > > > "vmulim4" - vector muliply with second operand as indexed operand
> > > > > 
> > > > > Example:
> > > > > 
> > > > > (define_insn "vmuliv4si4"
> > > > >    [set (match_operand:V4SI 0 "register_operand")
> > > > >         (mul:V4SI (match_operand:V4SI 1 "register_operand")
> > > > >                   (vec_duplicate:V4SI
> > > > >                     (vec_select:SI
> > > > >                       (match_operand:V4SI 2 "register_operand")
> > > > >                       (match_operand:V4SI 3 "immediate_operand)))))]
> > > > >  ...
> > > > > )
> > > > 
> > > > We could factor this with providing a standard pattern name for
> > > > 
> > > > (define_insn "vdupi<mode>"
> > > >   [set (match_operand:<mode> 0 "register_operand")
> > > >        (vec_duplicate:<mode>
> > > >           (vec_select:<scalarmode>
> > > >              (match_operand:<mode> 1 "register_operand")
> > > >              (match_operand:SI 2 "immediate_operand))))]
> > > 
> > > This is good. I did think about this but then I thought of avoiding the need
> > > for combiner patterns :-) 
> > > 
> > > But do you find the lane specific mov pattern I proposed, acceptable? 
> > 
> > The specific mul pattern?  As said, consider factoring to vdupi to
> > avoid an explosion in required special optabs.
> > 
> > > > (you use V4SI for the immediate?  
> > > 
> > > Sorry typo again!! It should've been SI.
> > > 
> > > > Ideally vdupi has another custom
> > > > mode for the vector index).
> > > > 
> > > > Note that this factored pattern is already available as vec_perm_const!
> > > > It is simply (vec_perm_const:V4SI <source> <source> <immediate-selector>).
> > > > 
> > > > Which means that on the GIMPLE level we should try to combine
> > > > 
> > > > el_4 = BIT_FIELD_REF <v_3, ...>;
> > > > v_5 = { el_4, el_4, ... };
> > > 
> > > I don't think we reach this state at all for the scenarios in discussion.
> > > what we generally have is:
> > > 
> > >  el_4 = MEM_REF < array + index*size >
> > >  v_5 = { el_4, ... }
> > > 
> > > Or am I missing something?
> > 
> > Well, but in that case I doubt it is profitable (or even valid!) to
> > turn this into a vector lane load from the array.  If it is profitable
> > to perform a vector read (because we're going to use the other elements
> > of the vector as well) then the vectorizer should produce a vector
> > load and materialize the uniform vector from one of its elements.
> > 
> > Maybe at this point you should show us a compilable C testcase
> > with a loop that should be vectorized using your instructions in
> > the end?
> 
> Here's a compilable example:
> 
> void 
> foo (int *__restrict__ a,
>      int *__restrict__ b,
>      int *__restrict__ c)
> {
>   int i;
> 
>   for (i = 0; i < 8; i++)
>     a[i] = b[i] * c[2];
> }
> 
> This is vectorized by duplicating c[2] now. But I'm trying to take advantage
> of target instructions that can take a vector register as second argument but
> use only one element (by using the same value for all the lanes) of the 
> vector register.
> 
> Eg. mul <vec-reg>, <vec-reg>, <vec-reg>[index]
>     mla <vec-reg>, <vec-reg>, <vec-reg>[index] // multiply and add
> 
> But for a loop like the one in the C example given, I will have to load the
> c[2] in one element of the vector register (leaving the remaining unused)
> rather. This is why I was proposing to load just one element in a vector 
> register (what I meant as "lane specific load"). The benefit of doing this is
> that we avoid explicit duplication, however such a simplification can only
> be done where such support is available - the reason why I was thinking in
> terms of optional standard pattern name. Another benefit is we will also be
> able to support scalars in the expression like in the following example:
> 
> void
> foo (int *__restrict__ a,
>      int *__restrict__ b,
>      int c)
> {
>   int i;
> 
>   for (i = 0; i < 8; i++)
>     a[i] = b[i] * c;
> }

Both cases can be handled by patterns that match

  (mul:VXSI (reg:VXSI
             (vec_duplicate:VXSI reg:SI)))

You'd then "consume" the vec_duplicate and implement it as
load scalar into element zero of the vector and use index mult
with index zero.

This handling can be either implemented at combine level (this
is a two-instruction combine only!) or via TER.

> Another example which can take advantage of the target feature:
> 
> void 
> foo (int *__restrict__ a,
>      int *__restrict__ b,
>      int *__restrict__ c)
> {
>   int i,j;
> 
>   for (i = 0; i < 8; i++)
>     for (j = 0; j < 8; j++)
>       a[j] += b[j] * c[i];
> }
> 
> This scenario we discussed this earlier (you suggested handling this at TER).

Yeah, though for optimal code you'd want to load c[0:8] once with a
vector load.  Eventually the vectorizer outer loop vectorization support
can be massaged to handle this optimally.

Richard.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] Vectorization of indexed elements
  2013-10-14  8:05                     ` Richard Biener
@ 2013-12-04 16:10                       ` Vidya Praveen
  2013-12-06 11:48                         ` Richard Biener
  0 siblings, 1 reply; 20+ messages in thread
From: Vidya Praveen @ 2013-12-04 16:10 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc, ook, marc.glisse

Hi Richi,

Apologies for the late response. I was on vacation.

On Mon, Oct 14, 2013 at 09:04:58AM +0100, Richard Biener wrote:
> On Fri, 11 Oct 2013, Vidya Praveen wrote:
> 
> > On Tue, Oct 01, 2013 at 09:26:25AM +0100, Richard Biener wrote:
> > > On Mon, 30 Sep 2013, Vidya Praveen wrote:
> > > 
> > > > On Mon, Sep 30, 2013 at 02:19:32PM +0100, Richard Biener wrote:
> > > > > On Mon, 30 Sep 2013, Vidya Praveen wrote:
> > > > > 
> > > > > > On Fri, Sep 27, 2013 at 04:19:45PM +0100, Vidya Praveen wrote:
> > > > > > > On Fri, Sep 27, 2013 at 03:50:08PM +0100, Vidya Praveen wrote:
> > > > > > > [...]
> > > > > > > > > > I can't really insist on the single lane load.. something like:
> > > > > > > > > > 
> > > > > > > > > > vc:V4SI[0] = c
> > > > > > > > > > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
> > > > > > > > > > va:V4SI = vb:V4SI <op> vt:V4SI
> > > > > > > > > > 
> > > > > > > > > > Or is there any other way to do this?
> > > > > > > > > 
> > > > > > > > > Can you elaborate on "I can't really insist on the single lane load"?
> > > > > > > > > What's the single lane load in your example? 
> > > > > > > > 
> > > > > > > > Loading just one lane of the vector like this:
> > > > > > > > 
> > > > > > > > vc:V4SI[0] = c // from the above scalar example
> > > > > > > > 
> > > > > > > > or 
> > > > > > > > 
> > > > > > > > vc:V4SI[0] = c[2] 
> > > > > > > > 
> > > > > > > > is what I meant by single lane load. In this example:
> > > > > > > > 
> > > > > > > > t = c[2] 
> > > > > > > > ...
> > > > > > > > vb:v4si = b[0:3] 
> > > > > > > > vc:v4si = { t, t, t, t }
> > > > > > > > va:v4si = vb:v4si <op> vc:v4si 
> > > > > > > > 
> > > > > > > > If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, I cannot
> > > > > > > > insist 't' to be vector and t = c[2] to be vect_t[0] = c[2] (which could be 
> > > > > > > > seen as vec_select:SI (vect_t 0) ). 
> > > > > > > > 
> > > > > > > > > I'd expect the instruction
> > > > > > > > > pattern as quoted to just work (and I hope we expand an uniform
> > > > > > > > > constructor { a, a, a, a } properly using vec_duplicate).
> > > > > > > > 
> > > > > > > > As much as I went through the code, this is only done using vect_init. It is
> > > > > > > > not expanded as vec_duplicate from, for example, store_constructor() of expr.c
> > > > > > > 
> > > > > > > Do you see any issues if we expand such constructor as vec_duplicate directly 
> > > > > > > instead of going through vect_init way? 
> > > > > > 
> > > > > > Sorry, that was a bad question.
> > > > > > 
> > > > > > But here's what I would like to propose as a first step. Please tell me if this
> > > > > > is acceptable or if it makes sense:
> > > > > > 
> > > > > > - Introduce standard pattern names 
> > > > > > 
> > > > > > "vmulim4" - vector muliply with second operand as indexed operand
> > > > > > 
> > > > > > Example:
> > > > > > 
> > > > > > (define_insn "vmuliv4si4"
> > > > > >    [set (match_operand:V4SI 0 "register_operand")
> > > > > >         (mul:V4SI (match_operand:V4SI 1 "register_operand")
> > > > > >                   (vec_duplicate:V4SI
> > > > > >                     (vec_select:SI
> > > > > >                       (match_operand:V4SI 2 "register_operand")
> > > > > >                       (match_operand:V4SI 3 "immediate_operand)))))]
> > > > > >  ...
> > > > > > )
> > > > > 
> > > > > We could factor this with providing a standard pattern name for
> > > > > 
> > > > > (define_insn "vdupi<mode>"
> > > > >   [set (match_operand:<mode> 0 "register_operand")
> > > > >        (vec_duplicate:<mode>
> > > > >           (vec_select:<scalarmode>
> > > > >              (match_operand:<mode> 1 "register_operand")
> > > > >              (match_operand:SI 2 "immediate_operand))))]
> > > > 
> > > > This is good. I did think about this but then I thought of avoiding the need
> > > > for combiner patterns :-) 
> > > > 
> > > > But do you find the lane specific mov pattern I proposed, acceptable? 
> > > 
> > > The specific mul pattern?  As said, consider factoring to vdupi to
> > > avoid an explosion in required special optabs.
> > > 
> > > > > (you use V4SI for the immediate?  
> > > > 
> > > > Sorry typo again!! It should've been SI.
> > > > 
> > > > > Ideally vdupi has another custom
> > > > > mode for the vector index).
> > > > > 
> > > > > Note that this factored pattern is already available as vec_perm_const!
> > > > > It is simply (vec_perm_const:V4SI <source> <source> <immediate-selector>).
> > > > > 
> > > > > Which means that on the GIMPLE level we should try to combine
> > > > > 
> > > > > el_4 = BIT_FIELD_REF <v_3, ...>;
> > > > > v_5 = { el_4, el_4, ... };
> > > > 
> > > > I don't think we reach this state at all for the scenarios in discussion.
> > > > what we generally have is:
> > > > 
> > > >  el_4 = MEM_REF < array + index*size >
> > > >  v_5 = { el_4, ... }
> > > > 
> > > > Or am I missing something?
> > > 
> > > Well, but in that case I doubt it is profitable (or even valid!) to
> > > turn this into a vector lane load from the array.  If it is profitable
> > > to perform a vector read (because we're going to use the other elements
> > > of the vector as well) then the vectorizer should produce a vector
> > > load and materialize the uniform vector from one of its elements.
> > > 
> > > Maybe at this point you should show us a compilable C testcase
> > > with a loop that should be vectorized using your instructions in
> > > the end?
> > 
> > Here's a compilable example:
> > 
> > void 
> > foo (int *__restrict__ a,
> >      int *__restrict__ b,
> >      int *__restrict__ c)
> > {
> >   int i;
> > 
> >   for (i = 0; i < 8; i++)
> >     a[i] = b[i] * c[2];
> > }
> > 
> > This is vectorized by duplicating c[2] now. But I'm trying to take advantage
> > of target instructions that can take a vector register as second argument but
> > use only one element (by using the same value for all the lanes) of the 
> > vector register.
> > 
> > Eg. mul <vec-reg>, <vec-reg>, <vec-reg>[index]
> >     mla <vec-reg>, <vec-reg>, <vec-reg>[index] // multiply and add
> > 
> > But for a loop like the one in the C example given, I will have to load the
> > c[2] in one element of the vector register (leaving the remaining unused)
> > rather. This is why I was proposing to load just one element in a vector 
> > register (what I meant as "lane specific load"). The benefit of doing this is
> > that we avoid explicit duplication, however such a simplification can only
> > be done where such support is available - the reason why I was thinking in
> > terms of optional standard pattern name. Another benefit is we will also be
> > able to support scalars in the expression like in the following example:
> > 
> > void
> > foo (int *__restrict__ a,
> >      int *__restrict__ b,
> >      int c)
> > {
> >   int i;
> > 
> >   for (i = 0; i < 8; i++)
> >     a[i] = b[i] * c;
> > }
> 
> Both cases can be handled by patterns that match
> 
>   (mul:VXSI (reg:VXSI
>              (vec_duplicate:VXSI reg:SI)))

How do I arrive at this pattern in the first place? Assuming vec_init with
uniform values are expanded as vec_duplicate, it will still be two expressions.

That is,

(set reg:VXSI (vec_duplicate:VXSI (reg:SI)))
(set reg:VXSI (mul:VXSI (reg:VXSI) (reg:VXSI)))

> You'd then "consume" the vec_duplicate and implement it as
> load scalar into element zero of the vector and use index mult
> with index zero.

If I understand this correctly, you are suggesting to leave the scalar
load from memory as it is but treat the 

(mul:VXSI (reg:VXSI (vec_duplicate:VXSI reg:SI)))

as 

load reg:VXSI[0], reg:SI
mul reg:VXSI, reg:VXSI, re:VXSI[0] // by reusing the destination register perhaps

either by generating instructions directly or by using define_split. Am I right?

If I'm right, then my concern is that it may be possible to simplify this further
by loading directly to a indexed vector register from memory but it's too late at
this point for such simplification to be possible.

Please let me know what am I not understanding.

> This handling can be either implemented at combine level (this
> is a two-instruction combine only!) or via TER.
>
> > Another example which can take advantage of the target feature:
> > 
> > void 
> > foo (int *__restrict__ a,
> >      int *__restrict__ b,
> >      int *__restrict__ c)
> > {
> >   int i,j;
> > 
> >   for (i = 0; i < 8; i++)
> >     for (j = 0; j < 8; j++)
> >       a[j] += b[j] * c[i];
> > }
> > 
> > This scenario we discussed this earlier (you suggested handling this at TER).
> 
> Yeah, though for optimal code you'd want to load c[0:8] once with a
> vector load.  Eventually the vectorizer outer loop vectorization support
> can be massaged to handle this optimally.

I agree. 

Thanks
VP.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] Vectorization of indexed elements
  2013-10-11 15:05                     ` Jakub Jelinek
@ 2013-12-04 17:07                       ` Vidya Praveen
  0 siblings, 0 replies; 20+ messages in thread
From: Vidya Praveen @ 2013-12-04 17:07 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Richard Biener, gcc, ook, marc.glisse

Hi Jakub,

Apologies for the late response.

On Fri, Oct 11, 2013 at 04:05:24PM +0100, Jakub Jelinek wrote:
> On Fri, Oct 11, 2013 at 03:54:08PM +0100, Vidya Praveen wrote:
> > Here's a compilable example:
> > 
> > void 
> > foo (int *__restrict__ a,
> >      int *__restrict__ b,
> >      int *__restrict__ c)
> > {
> >   int i;
> > 
> >   for (i = 0; i < 8; i++)
> >     a[i] = b[i] * c[2];
> > }
> > 
> > This is vectorized by duplicating c[2] now. But I'm trying to take advantage
> > of target instructions that can take a vector register as second argument but
> > use only one element (by using the same value for all the lanes) of the 
> > vector register.
> > 
> > Eg. mul <vec-reg>, <vec-reg>, <vec-reg>[index]
> >     mla <vec-reg>, <vec-reg>, <vec-reg>[index] // multiply and add
> > 
> > But for a loop like the one in the C example given, I will have to load the
> > c[2] in one element of the vector register (leaving the remaining unused)
> > rather. This is why I was proposing to load just one element in a vector 
> > register (what I meant as "lane specific load"). The benefit of doing this is
> > that we avoid explicit duplication, however such a simplification can only
> > be done where such support is available - the reason why I was thinking in
> > terms of optional standard pattern name. Another benefit is we will also be
> > able to support scalars in the expression like in the following example:
> > 
> > void
> > foo (int *__restrict__ a,
> >      int *__restrict__ b,
> >      int c)
> > {
> >   int i;
> > 
> >   for (i = 0; i < 8; i++)
> >     a[i] = b[i] * c;
> > }
> 
> So just during combine let the broadcast operation be combined with the
> arithmetics?  

Yes. I can do that. But I always want it to be possible to recognize and load
directly to the indexed vector register from memory.


> Intel AVX512 ISA has similar feature, not sure what exactly
> they are doing for this. 

Thanks. I'll try to go through the code to understand.

> That said, the broadcast is likely going to be
> hoisted before the loop, and in that case is it really cheaper to have
> it unbroadcasted in a vector register rather than to broadcast it before the
> loop and just use there?

Could you explain what do you mean by unbroadcast? The constructor needs to be
expanded in one way or another, isn't it? I thought expanding to vec_duplicate
when the values are uniform is the most efficient when vec_duplicate could be
supported by the target. If you had meant that each element of vector is loaded
separately, I am thinking how can I combine such an operation with the arithmetic
operation.

Thanks
VP.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] Vectorization of indexed elements
  2013-12-04 16:10                       ` Vidya Praveen
@ 2013-12-06 11:48                         ` Richard Biener
  0 siblings, 0 replies; 20+ messages in thread
From: Richard Biener @ 2013-12-06 11:48 UTC (permalink / raw)
  To: Vidya Praveen; +Cc: gcc, ook, marc.glisse

On Wed, 4 Dec 2013, Vidya Praveen wrote:

> Hi Richi,
> 
> Apologies for the late response. I was on vacation.
> 
> On Mon, Oct 14, 2013 at 09:04:58AM +0100, Richard Biener wrote:
> > > void
> > > foo (int *__restrict__ a,
> > >      int *__restrict__ b,
> > >      int c)
> > > {
> > >   int i;
> > > 
> > >   for (i = 0; i < 8; i++)
> > >     a[i] = b[i] * c;
> > > }
> > 
> > Both cases can be handled by patterns that match
> > 
> >   (mul:VXSI (reg:VXSI
> >              (vec_duplicate:VXSI reg:SI)))
> 
> How do I arrive at this pattern in the first place? Assuming vec_init with
> uniform values are expanded as vec_duplicate, it will still be two expressions.
> 
> That is,
> 
> (set reg:VXSI (vec_duplicate:VXSI (reg:SI)))
> (set reg:VXSI (mul:VXSI (reg:VXSI) (reg:VXSI)))

Yes, but then combine comes along and creates

 (set reg:VXSI (mul:VXSI (reg:VXSI (vec_duplicate:VXSI (reg:SI)))))

which matches one of your define_insn[_and_split]s.

> > You'd then "consume" the vec_duplicate and implement it as
> > load scalar into element zero of the vector and use index mult
> > with index zero.
> 
> If I understand this correctly, you are suggesting to leave the scalar
> load from memory as it is but treat the 
> 
> (mul:VXSI (reg:VXSI (vec_duplicate:VXSI reg:SI)))
> 
> as 
> 
> load reg:VXSI[0], reg:SI
> mul reg:VXSI, reg:VXSI, re:VXSI[0] // by reusing the destination register perhaps
> 
> either by generating instructions directly or by using define_split. Am I right?

Possibly.  Or allow memory as operand 2 for your pattern (so, not
reg:SI but mem:SI).  Combine should be happy with that, too.
 
> If I'm right, then my concern is that it may be possible to simplify this further
> by loading directly to a indexed vector register from memory but it's too late at
> this point for such simplification to be possible.
> 
> Please let me know what am I not understanding.

Not sure.  Did you try it?

Richard.

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2013-12-06 11:48 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-09-09 17:25 [RFC] Vectorization of indexed elements Vidya Praveen
2013-09-09 18:02 ` Marc Glisse
2013-09-10  8:25   ` Richard Biener
2013-09-24 15:03     ` Vidya Praveen
2013-09-25  9:22       ` Richard Biener
2013-09-30 13:01         ` Vidya Praveen
2013-09-24 15:04   ` Vidya Praveen
2013-09-25  9:25     ` Richard Biener
2013-09-27 14:50       ` Vidya Praveen
2013-09-27 15:19         ` Vidya Praveen
2013-09-30 12:55           ` Vidya Praveen
2013-09-30 13:19             ` Richard Biener
2013-09-30 14:00               ` Vidya Praveen
2013-10-01  8:26                 ` Richard Biener
2013-10-11 14:54                   ` Vidya Praveen
2013-10-11 15:05                     ` Jakub Jelinek
2013-12-04 17:07                       ` Vidya Praveen
2013-10-14  8:05                     ` Richard Biener
2013-12-04 16:10                       ` Vidya Praveen
2013-12-06 11:48                         ` Richard Biener

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).