targetm.vectorize.builtin_vec

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* targetm.vectorize.builtin_vec_perm
@ 2009-11-17  1:40 Richard Henderson
  2009-11-17  2:39 ` targetm.vectorize.builtin_vec_perm Joern Rennecke
  2009-11-17  8:59 ` targetm.vectorize.builtin_vec_perm Ira Rosen
  0 siblings, 2 replies; 6+ messages in thread
From: Richard Henderson @ 2009-11-17  1:40 UTC (permalink / raw)
  To: irar; +Cc: gcc

What is this hook supposed to do?  There is no description of its arguments.

What is the theory of operation of permute within the vectorizer?  Do 
you actually need variable permute, or would constants be ok?

I'm contemplating adding a tree- and gimple-level VEC_PERMUTE_EXPR of 
the form:

   VEC_PERMUTE_EXPR (vlow, vhigh, vperm)

which would be exactly equal to

   (vec_select
     (vec_concat vlow vhigh)
     vperm)

at the rtl level.  I.e. vperm is an integral vector of the same number 
of elements as vlow.

Truly variable permutation is something that's only supported by ppc and 
spu.  Intel AVX has a limited variable permutation -- 64-bit or 32-bit 
elements can be rearranged but only within a 128-bit subvector.
So if you're working with 128-bit vectors, it's fully variable, but if 
you're working with 256-bit vectors, it's like doing 2 128-bit permute 
operations in parallel.  Intel before AVX has no variable permute.

HOWEVER!  Most of the useful permutations that I can think of for the 
optimizers to generate are actually constant.  And these can be 
implemented everywhere (with varying degrees of efficiency).

Anyway, I'm thinking that it might be better to add such a general 
operation instead of continuing to add things like

	VEC_EXTRACT_EVEN_EXPR,
	VEC_EXTRACT_ODD_EXPR,
	VEC_INTERLEAVE_HIGH_EXPR,
	VEC_INTERLEAVE_LOW_EXPR,

and other obvious patterns like broadcast, duplicate even to odd, 
duplicate odd to even, etc.

I can imagine having some sort of target hook that computed a cost 
metric for a given constant permutation pattern.  For instance, I'd 
imagine that the interleave patterns are half as expensive as a full 
permute for altivec, due to not having to load a mask.  This hook would 
be fairly complicated for x86, given all of the permuting insns that 
were incrementally added in various ISA revisions, but such is life.

In any case, would a VEC_PERMUTE_EXPR, as described above, work for the 
uses of builtin_vec_perm within the vectorizer at present?

r~

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: targetm.vectorize.builtin_vec_perm
  2009-11-17  1:40 targetm.vectorize.builtin_vec_perm Richard Henderson
@ 2009-11-17  2:39 ` Joern Rennecke
  2009-11-17  9:10   ` targetm.vectorize.builtin_vec_perm Ira Rosen
  2009-11-17 19:21   ` targetm.vectorize.builtin_vec_perm Richard Henderson
  2009-11-17  8:59 ` targetm.vectorize.builtin_vec_perm Ira Rosen
  1 sibling, 2 replies; 6+ messages in thread
From: Joern Rennecke @ 2009-11-17  2:39 UTC (permalink / raw)
  To: gcc

Quoting Richard Henderson <rth@redhat.com>:

> Truly variable permutation is something that's only supported by ppc
> and spu.

SH64 also has variable permutation.  16 bit elements within its 64 bit
vector size can be permuted.

> HOWEVER!  Most of the useful permutations that I can think of for the
> optimizers to generate are actually constant.  And these can be
> implemented everywhere (with varying degrees of efficiency).

Variable permutations could be very useful for doing vector operations on
unaligned inputs.  To some degrees shifts can be used, but if they only have
C semantics you'll get corner cases with word-sized shifts when the  
input is actually aligned.
>
> Anyway, I'm thinking that it might be better to add such a general
> operation instead of continuing to add things like
>
> 	VEC_EXTRACT_EVEN_EXPR,
> 	VEC_EXTRACT_ODD_EXPR,
> 	VEC_INTERLEAVE_HIGH_EXPR,
> 	VEC_INTERLEAVE_LOW_EXPR,
>
> and other obvious patterns like broadcast, duplicate even to odd,
> duplicate odd to even, etc.

ARC mxp has a vector exchange operation that would likely not fit into
whatever scheme you are thinking of ...  It is very useful for matrix
transposition, except that without a target hook to transpose a matrix,
it is not likely to be generated.

> I can imagine having some sort of target hook that computed a cost
> metric for a given constant permutation pattern.  For instance, I'd
> imagine that the interleave patterns are half as expensive as a full
> permute for altivec, due to not having to load a mask.  This hook would
> be fairly complicated for x86, given all of the permuting insns that
> were incrementally added in various ISA revisions, but such is life.

There should be some way to account for the difference between the cost
in straight-line code, where a mask load is a hard cost, a large loop,
where the load can be hoisted at the cost of some target-dependent
register pressure (e.g. being able to use inverted masks might save half
of the cost), and a tight loop, where the constant load can be easily
amortized over the entire loop.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: targetm.vectorize.builtin_vec_perm
  2009-11-17  2:39 ` targetm.vectorize.builtin_vec_perm Joern Rennecke
@ 2009-11-17  9:10   ` Ira Rosen
  2009-11-17 19:21   ` targetm.vectorize.builtin_vec_perm Richard Henderson
  1 sibling, 0 replies; 6+ messages in thread
From: Ira Rosen @ 2009-11-17  9:10 UTC (permalink / raw)
  To: Joern Rennecke; +Cc: gcc


> > I can imagine having some sort of target hook that computed a cost
> > metric for a given constant permutation pattern.  For instance, I'd
> > imagine that the interleave patterns are half as expensive as a full
> > permute for altivec, due to not having to load a mask.  This hook would
> > be fairly complicated for x86, given all of the permuting insns that
> > were incrementally added in various ISA revisions, but such is life.
>
> There should be some way to account for the difference between the cost
> in straight-line code, where a mask load is a hard cost, a large loop,
> where the load can be hoisted at the cost of some target-dependent
> register pressure (e.g. being able to use inverted masks might save half
> of the cost), and a tight loop, where the constant load can be easily
> amortized over the entire loop.

Vectorizer cost model already does that. AFAIU, vectorizer cost model will
call the cost model hook to get a cost of a permute, and then incorporate
that cost into the general loop/basic block vectorization cost.

Ira


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: targetm.vectorize.builtin_vec_perm
  2009-11-17  2:39 ` targetm.vectorize.builtin_vec_perm Joern Rennecke
  2009-11-17  9:10   ` targetm.vectorize.builtin_vec_perm Ira Rosen
@ 2009-11-17 19:21   ` Richard Henderson
  1 sibling, 0 replies; 6+ messages in thread
From: Richard Henderson @ 2009-11-17 19:21 UTC (permalink / raw)
  To: gcc

On 11/16/2009 06:37 PM, Joern Rennecke wrote:
> Variable permutations could be very useful for doing vector operations on
> unaligned inputs. To some degrees shifts can be used, but if they only have
> C semantics you'll get corner cases with word-sized shifts when the
> input is actually aligned.

I'm assuming at this point that generic code will (continue to) use the
MISALIGNED_INDIRECT_REF operation to load-and-realign such inputs.  The 
ppc port does in fact use its variable permutation insn to implement 
this instruction.  This special case is fairly important for targets 
that don't implement the fully general variable permute.

(Curiously, the spu port fails to implement movmisalign<mode>.)

r~

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: targetm.vectorize.builtin_vec_perm
  2009-11-17  1:40 targetm.vectorize.builtin_vec_perm Richard Henderson
  2009-11-17  2:39 ` targetm.vectorize.builtin_vec_perm Joern Rennecke
@ 2009-11-17  8:59 ` Ira Rosen
  2009-11-17 12:51   ` targetm.vectorize.builtin_vec_perm Dorit Nuzman
  1 sibling, 1 reply; 6+ messages in thread
From: Ira Rosen @ 2009-11-17  8:59 UTC (permalink / raw)
  To: Richard Henderson; +Cc: gcc



Richard Henderson <rth@redhat.com> wrote on 17/11/2009 03:39:42:

> Richard Henderson <rth@redhat.com>
> 17/11/2009 03:39
>
> To
>
> Ira Rosen/Haifa/IBM@IBMIL
>
> cc
>
> gcc@gcc.gnu.org
>
> Subject
>
> targetm.vectorize.builtin_vec_perm
>
> What is this hook supposed to do?  There is no description of its
arguments.
>
> What is the theory of operation of permute within the vectorizer?  Do
> you actually need variable permute, or would constants be ok?

It is currently used for a specific load permutation of RGB to YUV
conversion (http://gcc.gnu.org/ml/gcc-patches/2008-07/msg00445.html). The
arguments are vector type and mask type (the last one is returned by the
hook).

The permute is constant, it depends on the number of loads (group size) and
their type. However, there are cases, that we may want to support in the
future, that require variable permute - indirect accesses, for example.

>
> I'm contemplating adding a tree- and gimple-level VEC_PERMUTE_EXPR of
> the form:
>
>    VEC_PERMUTE_EXPR (vlow, vhigh, vperm)
>
> which would be exactly equal to
>
>    (vec_select
>      (vec_concat vlow vhigh)
>      vperm)
>
> at the rtl level.  I.e. vperm is an integral vector of the same number
> of elements as vlow.
>
> Truly variable permutation is something that's only supported by ppc and
> spu.

Also Altivec and SPU support byte permutation (and not only element
permutation), however, the vectorizer does not make use of this at present.

> Intel AVX has a limited variable permutation -- 64-bit or 32-bit
> elements can be rearranged but only within a 128-bit subvector.
> So if you're working with 128-bit vectors, it's fully variable, but if
> you're working with 256-bit vectors, it's like doing 2 128-bit permute
> operations in parallel.  Intel before AVX has no variable permute.
>
> HOWEVER!  Most of the useful permutations that I can think of for the
> optimizers to generate are actually constant.  And these can be
> implemented everywhere (with varying degrees of efficiency).
>
> Anyway, I'm thinking that it might be better to add such a general
> operation instead of continuing to add things like
>
>    VEC_EXTRACT_EVEN_EXPR,
>    VEC_EXTRACT_ODD_EXPR,
>    VEC_INTERLEAVE_HIGH_EXPR,
>    VEC_INTERLEAVE_LOW_EXPR,
>
> and other obvious patterns like broadcast, duplicate even to odd,
> duplicate odd to even, etc.

If the back end will be able to identify specific masks, e.g., {0,2,4,6} as
extract even operation, then we can certainly remove those codes.

>
> I can imagine having some sort of target hook that computed a cost
> metric for a given constant permutation pattern.  For instance, I'd
> imagine that the interleave patterns are half as expensive as a full
> permute for altivec, due to not having to load a mask.  This hook would
> be fairly complicated for x86, given all of the permuting insns that
> were incrementally added in various ISA revisions, but such is life.
>
> In any case, would a VEC_PERMUTE_EXPR, as described above, work for the
> uses of builtin_vec_perm within the vectorizer at present?

Yes.

Ira

>
>
> r~

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: targetm.vectorize.builtin_vec_perm
  2009-11-17  8:59 ` targetm.vectorize.builtin_vec_perm Ira Rosen
@ 2009-11-17 12:51   ` Dorit Nuzman
  0 siblings, 0 replies; 6+ messages in thread
From: Dorit Nuzman @ 2009-11-17 12:51 UTC (permalink / raw)
  To: Ira Rosen; +Cc: gcc, Richard Henderson

...
>
> >
> > I'm contemplating adding a tree- and gimple-level VEC_PERMUTE_EXPR of
> > the form:
> >
> >    VEC_PERMUTE_EXPR (vlow, vhigh, vperm)
> >
> > which would be exactly equal to
> >
> >    (vec_select
> >      (vec_concat vlow vhigh)
> >      vperm)
> >
> > at the rtl level.  I.e. vperm is an integral vector of the same number
> > of elements as vlow.
> >
> > Truly variable permutation is something that's only supported by ppc
and
> > spu.
>
> Also Altivec and SPU support byte permutation (and not only element
> permutation), however, the vectorizer does not make use of this at
present.
>

Yes. I was trying to think if it would be useful to express
byte-permutations instead of element-permutations, but the only two useful
cases that came to mind are things we have covered by other, probably more
appropriate, idioms.

[One is realignment (for which we use the builtin_mask_for_load +
REALIGN_LOAD). The other is the VEC_PACK_TRUNC idiom (where the number of
elements in 'vperm' would be twice the number of elements as 'vlow'), but
other VEC_PACK variants are a little more than just a special case of
permute.]

So (unless we want VEC_PERMUTE to cover these cases, which I think we
don't), an element-wise permutations should suffice, so sounds like a good
suggestion to me.

> > Intel AVX has a limited variable permutation -- 64-bit or 32-bit
> > elements can be rearranged but only within a 128-bit subvector.
> > So if you're working with 128-bit vectors, it's fully variable, but if
> > you're working with 256-bit vectors, it's like doing 2 128-bit permute
> > operations in parallel.  Intel before AVX has no variable permute.
> >
> > HOWEVER!  Most of the useful permutations that I can think of for the
> > optimizers to generate are actually constant.  And these can be
> > implemented everywhere (with varying degrees of efficiency).
> >

That's true for the moment, but there are cases where a variable permute
would be useful for vectorization. E.g. where vectors are used as a lookup
table. One example I know of is for finding delimiters (e.g. for XML
processing) - a lookup table of 256 bits holds one bit per ASCII character
to indicates if a character is a delimiter or not, and the scalar code
looks something like this:
table[256]={1,0,0,....};
for (i...)
   if (table[data[i]] == 1)
     {found delimiter}
...and this is vectorized with 2 vector registers that hold the lookup
table and a shift on the input data vector to create the permutation mask
to access the table. I think there should be other examples for lookup
tables like that used for vectorization. I also saw variable permutes used
for sorting (
http://www.dia.eui.upm.es/asignatu/pro_par/articulos/AASort.pdf).

Indeed there are some serious challenges to overcome in order to do all
that automatically in the compiler... but some pattern-matching based
vectorization approach could conceptually do this.

Also, if one day someone was to introduce platform-independent vector
intrinsics, then such a generic permute would allow programmers to take
advantage of it, even for the cases that would be otherwise too complicated
for the compiler to auto-vectorize.

So I think it would be nice to allow the more general form, but since it
will probably take a while before we actually make use of it, it's probably
not critical for the short term...

> > Anyway, I'm thinking that it might be better to add such a general
> > operation instead of continuing to add things like
> >
> >    VEC_EXTRACT_EVEN_EXPR,
> >    VEC_EXTRACT_ODD_EXPR,
> >    VEC_INTERLEAVE_HIGH_EXPR,
> >    VEC_INTERLEAVE_LOW_EXPR,
> >
> > and other obvious patterns like broadcast, duplicate even to odd,
> > duplicate odd to even, etc.
>

agreed

> If the back end will be able to identify specific masks, e.g., {0,2,4,6}
as
> extract even operation, then we can certainly remove those codes.
>

agreed

dorit

> >
> > I can imagine having some sort of target hook that computed a cost
> > metric for a given constant permutation pattern.  For instance, I'd
> > imagine that the interleave patterns are half as expensive as a full
> > permute for altivec, due to not having to load a mask.  This hook would
> > be fairly complicated for x86, given all of the permuting insns that
> > were incrementally added in various ISA revisions, but such is life.
> >
> > In any case, would a VEC_PERMUTE_EXPR, as described above, work for the
> > uses of builtin_vec_perm within the vectorizer at present?
>
> Yes.
>
> Ira
>
> >
> >
> > r~
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2009-11-17 19:21 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-11-17  1:40 targetm.vectorize.builtin_vec_perm Richard Henderson
2009-11-17  2:39 ` targetm.vectorize.builtin_vec_perm Joern Rennecke
2009-11-17  9:10   ` targetm.vectorize.builtin_vec_perm Ira Rosen
2009-11-17 19:21   ` targetm.vectorize.builtin_vec_perm Richard Henderson
2009-11-17  8:59 ` targetm.vectorize.builtin_vec_perm Ira Rosen
2009-11-17 12:51   ` targetm.vectorize.builtin_vec_perm Dorit Nuzman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).