* targetm.vectorize.builtin_vec_perm @ 2009-11-17 1:40 Richard Henderson 2009-11-17 2:39 ` targetm.vectorize.builtin_vec_perm Joern Rennecke 2009-11-17 8:59 ` targetm.vectorize.builtin_vec_perm Ira Rosen 0 siblings, 2 replies; 6+ messages in thread From: Richard Henderson @ 2009-11-17 1:40 UTC (permalink / raw) To: irar; +Cc: gcc What is this hook supposed to do? There is no description of its arguments. What is the theory of operation of permute within the vectorizer? Do you actually need variable permute, or would constants be ok? I'm contemplating adding a tree- and gimple-level VEC_PERMUTE_EXPR of the form: VEC_PERMUTE_EXPR (vlow, vhigh, vperm) which would be exactly equal to (vec_select (vec_concat vlow vhigh) vperm) at the rtl level. I.e. vperm is an integral vector of the same number of elements as vlow. Truly variable permutation is something that's only supported by ppc and spu. Intel AVX has a limited variable permutation -- 64-bit or 32-bit elements can be rearranged but only within a 128-bit subvector. So if you're working with 128-bit vectors, it's fully variable, but if you're working with 256-bit vectors, it's like doing 2 128-bit permute operations in parallel. Intel before AVX has no variable permute. HOWEVER! Most of the useful permutations that I can think of for the optimizers to generate are actually constant. And these can be implemented everywhere (with varying degrees of efficiency). Anyway, I'm thinking that it might be better to add such a general operation instead of continuing to add things like VEC_EXTRACT_EVEN_EXPR, VEC_EXTRACT_ODD_EXPR, VEC_INTERLEAVE_HIGH_EXPR, VEC_INTERLEAVE_LOW_EXPR, and other obvious patterns like broadcast, duplicate even to odd, duplicate odd to even, etc. I can imagine having some sort of target hook that computed a cost metric for a given constant permutation pattern. For instance, I'd imagine that the interleave patterns are half as expensive as a full permute for altivec, due to not having to load a mask. This hook would be fairly complicated for x86, given all of the permuting insns that were incrementally added in various ISA revisions, but such is life. In any case, would a VEC_PERMUTE_EXPR, as described above, work for the uses of builtin_vec_perm within the vectorizer at present? r~ ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: targetm.vectorize.builtin_vec_perm 2009-11-17 1:40 targetm.vectorize.builtin_vec_perm Richard Henderson @ 2009-11-17 2:39 ` Joern Rennecke 2009-11-17 9:10 ` targetm.vectorize.builtin_vec_perm Ira Rosen 2009-11-17 19:21 ` targetm.vectorize.builtin_vec_perm Richard Henderson 2009-11-17 8:59 ` targetm.vectorize.builtin_vec_perm Ira Rosen 1 sibling, 2 replies; 6+ messages in thread From: Joern Rennecke @ 2009-11-17 2:39 UTC (permalink / raw) To: gcc Quoting Richard Henderson <rth@redhat.com>: > Truly variable permutation is something that's only supported by ppc > and spu. SH64 also has variable permutation. 16 bit elements within its 64 bit vector size can be permuted. > HOWEVER! Most of the useful permutations that I can think of for the > optimizers to generate are actually constant. And these can be > implemented everywhere (with varying degrees of efficiency). Variable permutations could be very useful for doing vector operations on unaligned inputs. To some degrees shifts can be used, but if they only have C semantics you'll get corner cases with word-sized shifts when the input is actually aligned. > > Anyway, I'm thinking that it might be better to add such a general > operation instead of continuing to add things like > > VEC_EXTRACT_EVEN_EXPR, > VEC_EXTRACT_ODD_EXPR, > VEC_INTERLEAVE_HIGH_EXPR, > VEC_INTERLEAVE_LOW_EXPR, > > and other obvious patterns like broadcast, duplicate even to odd, > duplicate odd to even, etc. ARC mxp has a vector exchange operation that would likely not fit into whatever scheme you are thinking of ... It is very useful for matrix transposition, except that without a target hook to transpose a matrix, it is not likely to be generated. > I can imagine having some sort of target hook that computed a cost > metric for a given constant permutation pattern. For instance, I'd > imagine that the interleave patterns are half as expensive as a full > permute for altivec, due to not having to load a mask. This hook would > be fairly complicated for x86, given all of the permuting insns that > were incrementally added in various ISA revisions, but such is life. There should be some way to account for the difference between the cost in straight-line code, where a mask load is a hard cost, a large loop, where the load can be hoisted at the cost of some target-dependent register pressure (e.g. being able to use inverted masks might save half of the cost), and a tight loop, where the constant load can be easily amortized over the entire loop. ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: targetm.vectorize.builtin_vec_perm 2009-11-17 2:39 ` targetm.vectorize.builtin_vec_perm Joern Rennecke @ 2009-11-17 9:10 ` Ira Rosen 2009-11-17 19:21 ` targetm.vectorize.builtin_vec_perm Richard Henderson 1 sibling, 0 replies; 6+ messages in thread From: Ira Rosen @ 2009-11-17 9:10 UTC (permalink / raw) To: Joern Rennecke; +Cc: gcc > > I can imagine having some sort of target hook that computed a cost > > metric for a given constant permutation pattern. For instance, I'd > > imagine that the interleave patterns are half as expensive as a full > > permute for altivec, due to not having to load a mask. This hook would > > be fairly complicated for x86, given all of the permuting insns that > > were incrementally added in various ISA revisions, but such is life. > > There should be some way to account for the difference between the cost > in straight-line code, where a mask load is a hard cost, a large loop, > where the load can be hoisted at the cost of some target-dependent > register pressure (e.g. being able to use inverted masks might save half > of the cost), and a tight loop, where the constant load can be easily > amortized over the entire loop. Vectorizer cost model already does that. AFAIU, vectorizer cost model will call the cost model hook to get a cost of a permute, and then incorporate that cost into the general loop/basic block vectorization cost. Ira ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: targetm.vectorize.builtin_vec_perm 2009-11-17 2:39 ` targetm.vectorize.builtin_vec_perm Joern Rennecke 2009-11-17 9:10 ` targetm.vectorize.builtin_vec_perm Ira Rosen @ 2009-11-17 19:21 ` Richard Henderson 1 sibling, 0 replies; 6+ messages in thread From: Richard Henderson @ 2009-11-17 19:21 UTC (permalink / raw) To: gcc On 11/16/2009 06:37 PM, Joern Rennecke wrote: > Variable permutations could be very useful for doing vector operations on > unaligned inputs. To some degrees shifts can be used, but if they only have > C semantics you'll get corner cases with word-sized shifts when the > input is actually aligned. I'm assuming at this point that generic code will (continue to) use the MISALIGNED_INDIRECT_REF operation to load-and-realign such inputs. The ppc port does in fact use its variable permutation insn to implement this instruction. This special case is fairly important for targets that don't implement the fully general variable permute. (Curiously, the spu port fails to implement movmisalign<mode>.) r~ ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: targetm.vectorize.builtin_vec_perm 2009-11-17 1:40 targetm.vectorize.builtin_vec_perm Richard Henderson 2009-11-17 2:39 ` targetm.vectorize.builtin_vec_perm Joern Rennecke @ 2009-11-17 8:59 ` Ira Rosen 2009-11-17 12:51 ` targetm.vectorize.builtin_vec_perm Dorit Nuzman 1 sibling, 1 reply; 6+ messages in thread From: Ira Rosen @ 2009-11-17 8:59 UTC (permalink / raw) To: Richard Henderson; +Cc: gcc Richard Henderson <rth@redhat.com> wrote on 17/11/2009 03:39:42: > Richard Henderson <rth@redhat.com> > 17/11/2009 03:39 > > To > > Ira Rosen/Haifa/IBM@IBMIL > > cc > > gcc@gcc.gnu.org > > Subject > > targetm.vectorize.builtin_vec_perm > > What is this hook supposed to do? There is no description of its arguments. > > What is the theory of operation of permute within the vectorizer? Do > you actually need variable permute, or would constants be ok? It is currently used for a specific load permutation of RGB to YUV conversion (http://gcc.gnu.org/ml/gcc-patches/2008-07/msg00445.html). The arguments are vector type and mask type (the last one is returned by the hook). The permute is constant, it depends on the number of loads (group size) and their type. However, there are cases, that we may want to support in the future, that require variable permute - indirect accesses, for example. > > I'm contemplating adding a tree- and gimple-level VEC_PERMUTE_EXPR of > the form: > > VEC_PERMUTE_EXPR (vlow, vhigh, vperm) > > which would be exactly equal to > > (vec_select > (vec_concat vlow vhigh) > vperm) > > at the rtl level. I.e. vperm is an integral vector of the same number > of elements as vlow. > > Truly variable permutation is something that's only supported by ppc and > spu. Also Altivec and SPU support byte permutation (and not only element permutation), however, the vectorizer does not make use of this at present. > Intel AVX has a limited variable permutation -- 64-bit or 32-bit > elements can be rearranged but only within a 128-bit subvector. > So if you're working with 128-bit vectors, it's fully variable, but if > you're working with 256-bit vectors, it's like doing 2 128-bit permute > operations in parallel. Intel before AVX has no variable permute. > > HOWEVER! Most of the useful permutations that I can think of for the > optimizers to generate are actually constant. And these can be > implemented everywhere (with varying degrees of efficiency). > > Anyway, I'm thinking that it might be better to add such a general > operation instead of continuing to add things like > > VEC_EXTRACT_EVEN_EXPR, > VEC_EXTRACT_ODD_EXPR, > VEC_INTERLEAVE_HIGH_EXPR, > VEC_INTERLEAVE_LOW_EXPR, > > and other obvious patterns like broadcast, duplicate even to odd, > duplicate odd to even, etc. If the back end will be able to identify specific masks, e.g., {0,2,4,6} as extract even operation, then we can certainly remove those codes. > > I can imagine having some sort of target hook that computed a cost > metric for a given constant permutation pattern. For instance, I'd > imagine that the interleave patterns are half as expensive as a full > permute for altivec, due to not having to load a mask. This hook would > be fairly complicated for x86, given all of the permuting insns that > were incrementally added in various ISA revisions, but such is life. > > In any case, would a VEC_PERMUTE_EXPR, as described above, work for the > uses of builtin_vec_perm within the vectorizer at present? Yes. Ira > > > r~ ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: targetm.vectorize.builtin_vec_perm 2009-11-17 8:59 ` targetm.vectorize.builtin_vec_perm Ira Rosen @ 2009-11-17 12:51 ` Dorit Nuzman 0 siblings, 0 replies; 6+ messages in thread From: Dorit Nuzman @ 2009-11-17 12:51 UTC (permalink / raw) To: Ira Rosen; +Cc: gcc, Richard Henderson ... > > > > > I'm contemplating adding a tree- and gimple-level VEC_PERMUTE_EXPR of > > the form: > > > > VEC_PERMUTE_EXPR (vlow, vhigh, vperm) > > > > which would be exactly equal to > > > > (vec_select > > (vec_concat vlow vhigh) > > vperm) > > > > at the rtl level. I.e. vperm is an integral vector of the same number > > of elements as vlow. > > > > Truly variable permutation is something that's only supported by ppc and > > spu. > > Also Altivec and SPU support byte permutation (and not only element > permutation), however, the vectorizer does not make use of this at present. > Yes. I was trying to think if it would be useful to express byte-permutations instead of element-permutations, but the only two useful cases that came to mind are things we have covered by other, probably more appropriate, idioms. [One is realignment (for which we use the builtin_mask_for_load + REALIGN_LOAD). The other is the VEC_PACK_TRUNC idiom (where the number of elements in 'vperm' would be twice the number of elements as 'vlow'), but other VEC_PACK variants are a little more than just a special case of permute.] So (unless we want VEC_PERMUTE to cover these cases, which I think we don't), an element-wise permutations should suffice, so sounds like a good suggestion to me. > > Intel AVX has a limited variable permutation -- 64-bit or 32-bit > > elements can be rearranged but only within a 128-bit subvector. > > So if you're working with 128-bit vectors, it's fully variable, but if > > you're working with 256-bit vectors, it's like doing 2 128-bit permute > > operations in parallel. Intel before AVX has no variable permute. > > > > HOWEVER! Most of the useful permutations that I can think of for the > > optimizers to generate are actually constant. And these can be > > implemented everywhere (with varying degrees of efficiency). > > That's true for the moment, but there are cases where a variable permute would be useful for vectorization. E.g. where vectors are used as a lookup table. One example I know of is for finding delimiters (e.g. for XML processing) - a lookup table of 256 bits holds one bit per ASCII character to indicates if a character is a delimiter or not, and the scalar code looks something like this: table[256]={1,0,0,....}; for (i...) if (table[data[i]] == 1) {found delimiter} ...and this is vectorized with 2 vector registers that hold the lookup table and a shift on the input data vector to create the permutation mask to access the table. I think there should be other examples for lookup tables like that used for vectorization. I also saw variable permutes used for sorting ( http://www.dia.eui.upm.es/asignatu/pro_par/articulos/AASort.pdf). Indeed there are some serious challenges to overcome in order to do all that automatically in the compiler... but some pattern-matching based vectorization approach could conceptually do this. Also, if one day someone was to introduce platform-independent vector intrinsics, then such a generic permute would allow programmers to take advantage of it, even for the cases that would be otherwise too complicated for the compiler to auto-vectorize. So I think it would be nice to allow the more general form, but since it will probably take a while before we actually make use of it, it's probably not critical for the short term... > > Anyway, I'm thinking that it might be better to add such a general > > operation instead of continuing to add things like > > > > VEC_EXTRACT_EVEN_EXPR, > > VEC_EXTRACT_ODD_EXPR, > > VEC_INTERLEAVE_HIGH_EXPR, > > VEC_INTERLEAVE_LOW_EXPR, > > > > and other obvious patterns like broadcast, duplicate even to odd, > > duplicate odd to even, etc. > agreed > If the back end will be able to identify specific masks, e.g., {0,2,4,6} as > extract even operation, then we can certainly remove those codes. > agreed dorit > > > > I can imagine having some sort of target hook that computed a cost > > metric for a given constant permutation pattern. For instance, I'd > > imagine that the interleave patterns are half as expensive as a full > > permute for altivec, due to not having to load a mask. This hook would > > be fairly complicated for x86, given all of the permuting insns that > > were incrementally added in various ISA revisions, but such is life. > > > > In any case, would a VEC_PERMUTE_EXPR, as described above, work for the > > uses of builtin_vec_perm within the vectorizer at present? > > Yes. > > Ira > > > > > > > r~ > ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2009-11-17 19:21 UTC | newest] Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2009-11-17 1:40 targetm.vectorize.builtin_vec_perm Richard Henderson 2009-11-17 2:39 ` targetm.vectorize.builtin_vec_perm Joern Rennecke 2009-11-17 9:10 ` targetm.vectorize.builtin_vec_perm Ira Rosen 2009-11-17 19:21 ` targetm.vectorize.builtin_vec_perm Richard Henderson 2009-11-17 8:59 ` targetm.vectorize.builtin_vec_perm Ira Rosen 2009-11-17 12:51 ` targetm.vectorize.builtin_vec_perm Dorit Nuzman
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).