public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
@ 2012-03-17  1:06 marc.glisse at normalesup dot org
  2012-03-17  4:58 ` [Bug target/52607] " marc.glisse at normalesup dot org
                   ` (29 more replies)
  0 siblings, 30 replies; 31+ messages in thread
From: marc.glisse at normalesup dot org @ 2012-03-17  1:06 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607

             Bug #: 52607
           Summary: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
    Classification: Unclassified
           Product: gcc
           Version: 4.7.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: target
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: marc.glisse@normalesup.org


Hello,

this is really just a follow-up to PR52568. The permutations {0,2,1,3} and
{1,3,0,2} can be realized with a very similar technique.

Starting from 0123:
vpermilpd+vperm2f128->3210
vblendpd(0123,3210)->0213

or:
vpermilpd->1032
vperm2f128->2301
vblendpd(1032,2301)->1302

I am not sure if there is a nice way to generalize this or if the function
expand_vec_perm_vperm2f128_vblend should be cloned a few times and slightly
modified.

(these permutations are less important to me than 1230 was)


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug target/52607] v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
  2012-03-17  1:06 [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2} marc.glisse at normalesup dot org
@ 2012-03-17  4:58 ` marc.glisse at normalesup dot org
  2012-03-17 19:55 ` marc.glisse at normalesup dot org
                   ` (28 subsequent siblings)
  29 siblings, 0 replies; 31+ messages in thread
From: marc.glisse at normalesup dot org @ 2012-03-17  4:58 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607

--- Comment #1 from Marc Glisse <marc.glisse at normalesup dot org> 2012-03-17 01:05:57 UTC ---
Note that {1,2,0,3} seems harder, I need one extra vpermilpd. Actually, it
looks like every v4df shuffle can be realized as a vblendpd of a vpermilpd and
a vpermilpd+vperm2f128. For v8sf, it also seems true but may require the
version of vpermilps that takes its controls from a register/memory.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug target/52607] v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
  2012-03-17  1:06 [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2} marc.glisse at normalesup dot org
  2012-03-17  4:58 ` [Bug target/52607] " marc.glisse at normalesup dot org
@ 2012-03-17 19:55 ` marc.glisse at normalesup dot org
  2012-03-17 20:11 ` marc.glisse at normalesup dot org
                   ` (27 subsequent siblings)
  29 siblings, 0 replies; 31+ messages in thread
From: marc.glisse at normalesup dot org @ 2012-03-17 19:55 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607

--- Comment #2 from Marc Glisse <marc.glisse at normalesup dot org> 2012-03-17 19:20:36 UTC ---
Created attachment 26908
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=26908
copy-paste patch for 0213 and 1302

This seems to handle 0213 and 1302 (I only vaguely looked at the generated
code, can't do proper testing).

It is really a copy-paste of the function that handles 1230. I didn't try to
understand everything, so there may be things that made sense in the original
function but don't anymore here. It should be possible to merge the 2 new
functions, but merging them with the previous one looks harder.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug target/52607] v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
  2012-03-17  1:06 [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2} marc.glisse at normalesup dot org
  2012-03-17  4:58 ` [Bug target/52607] " marc.glisse at normalesup dot org
  2012-03-17 19:55 ` marc.glisse at normalesup dot org
@ 2012-03-17 20:11 ` marc.glisse at normalesup dot org
  2012-03-17 22:17 ` marc.glisse at normalesup dot org
                   ` (26 subsequent siblings)
  29 siblings, 0 replies; 31+ messages in thread
From: marc.glisse at normalesup dot org @ 2012-03-17 20:11 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607

--- Comment #3 from Marc Glisse <marc.glisse at normalesup dot org> 2012-03-17 19:55:18 UTC ---
Uh. I feel silly, but it looks like vshufpd could replace vpermilpd+vblendpd in
many cases, including the original 1230 from PR52568...


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug target/52607] v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
  2012-03-17  1:06 [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2} marc.glisse at normalesup dot org
                   ` (2 preceding siblings ...)
  2012-03-17 20:11 ` marc.glisse at normalesup dot org
@ 2012-03-17 22:17 ` marc.glisse at normalesup dot org
  2012-03-18 14:51 ` marc.glisse at normalesup dot org
                   ` (25 subsequent siblings)
  29 siblings, 0 replies; 31+ messages in thread
From: marc.glisse at normalesup dot org @ 2012-03-17 22:17 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607

Marc Glisse <marc.glisse at normalesup dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #26908|0                           |1
        is obsolete|                            |

--- Comment #4 from Marc Glisse <marc.glisse at normalesup dot org> 2012-03-17 22:03:08 UTC ---
Created attachment 26909
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=26909
patch

Here is a try. Again, I just looked at the generated code on a couple examples,
which isn't very reliable...

expand_vec_perm_vperm2f128_vblend0 is already covered by
expand_vec_perm_vperm2f128_vblend1, but it is confusing to have a 3-instruction
function generate only 2.

I didn't do generic permutations with 4 instructions.

There is probably more that can be done with vshufp[sd].


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug target/52607] v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
  2012-03-17  1:06 [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2} marc.glisse at normalesup dot org
                   ` (3 preceding siblings ...)
  2012-03-17 22:17 ` marc.glisse at normalesup dot org
@ 2012-03-18 14:51 ` marc.glisse at normalesup dot org
  2012-03-18 20:13 ` marc.glisse at normalesup dot org
                   ` (24 subsequent siblings)
  29 siblings, 0 replies; 31+ messages in thread
From: marc.glisse at normalesup dot org @ 2012-03-18 14:51 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607

Marc Glisse <marc.glisse at normalesup dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #26909|0                           |1
        is obsolete|                            |

--- Comment #5 from Marc Glisse <marc.glisse at normalesup dot org> 2012-03-18 12:53:13 UTC ---
Created attachment 26911
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=26911
patch

With this one, all V4DF shuffles on one vector are done in at most 3
instructions (and are correct). Doing V8SF at the same time was getting
confusing so I dropped it for the last 2 functions, which end up looking almost
like: "if the pattern is 0112 do this, if it is 0130 do that, etc".

I didn't check if all the functions are still used by at least one pattern...

Note: my access to an avx machine is not sufficient to submit a patch, so feel
free to take pieces of this and modify/test/submit them (I have a copyright
assignment).


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug target/52607] v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
  2012-03-17  1:06 [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2} marc.glisse at normalesup dot org
                   ` (4 preceding siblings ...)
  2012-03-18 14:51 ` marc.glisse at normalesup dot org
@ 2012-03-18 20:13 ` marc.glisse at normalesup dot org
  2012-03-19 11:16 ` jakub at gcc dot gnu.org
                   ` (23 subsequent siblings)
  29 siblings, 0 replies; 31+ messages in thread
From: marc.glisse at normalesup dot org @ 2012-03-18 20:13 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607

--- Comment #6 from Marc Glisse <marc.glisse at normalesup dot org> 2012-03-18 18:58:44 UTC ---
Created attachment 26912
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=26912
generic shuffle of a single v8sf

An additional function (I should find better names...) to handle generic
shuffles of a single v8sf in 4 instructions. Only tested on {6,2,3,3,5,2,3,7}.

By the way, expand_vec_perm_vperm2f128_vblend2 does vpermilpd+vperm2f128 in
this order, but it would be better to do it in the reverse order (adapting the
mask), because it is common to need several __builtin_shuffle(x,*) and the
vperm2f128 can then be shared.

I also noticed while experimenting that -mavx2 generates vpermd instead of
vpermps (the vpermq->vpermpd change didn't affect that).


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug target/52607] v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
  2012-03-17  1:06 [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2} marc.glisse at normalesup dot org
                   ` (5 preceding siblings ...)
  2012-03-18 20:13 ` marc.glisse at normalesup dot org
@ 2012-03-19 11:16 ` jakub at gcc dot gnu.org
  2012-03-19 17:19 ` jakub at gcc dot gnu.org
                   ` (22 subsequent siblings)
  29 siblings, 0 replies; 31+ messages in thread
From: jakub at gcc dot gnu.org @ 2012-03-19 11:16 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607

--- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> 2012-03-19 11:13:52 UTC ---
Created attachment 26915
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=26915
gcc48-pr52607.patch

AVX2 changes.
This improves:
1) for {x, x, x, x} V4DFmode permutations we now don't generate
vpermpd+vperm2f128, and just emit vpermpd (for all other V4DFmode permutations
we were already emitting just vpermpd)
2) for most of V8SFmode permutations we now emit code like
vmovdqa .LC0(%rip), %ymm1; vpermps %ymm0, %ymm1, %ymm0 instead of
vperm2f128 $0, %ymm0, %ymm0, %ymm0; vpermilps .LC0(%rip), %ymm0, %ymm0
3) broadcast permutations improvements for V8SFmode (using vbroadcastss) as
well
as some integer modes


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug target/52607] v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
  2012-03-17  1:06 [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2} marc.glisse at normalesup dot org
                   ` (6 preceding siblings ...)
  2012-03-19 11:16 ` jakub at gcc dot gnu.org
@ 2012-03-19 17:19 ` jakub at gcc dot gnu.org
  2012-03-19 19:00 ` marc.glisse at normalesup dot org
                   ` (21 subsequent siblings)
  29 siblings, 0 replies; 31+ messages in thread
From: jakub at gcc dot gnu.org @ 2012-03-19 17:19 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org,
                   |                            |rth at gcc dot gnu.org

--- Comment #8 from Jakub Jelinek <jakub at gcc dot gnu.org> 2012-03-19 17:13:53 UTC ---
I'm not very keen on having too many different routines, the more generic they
are, the better.  So IMHO e.g. the two insn sequence, vperm2[if]128 + some one
insn shuffle could look like:

/* A subroutine of ix86_expand_vec_perm_builtin_1.  Try to expand
   a vector permutation using two instructions, vperm2f128 resp.
   vperm2i128 followed by any single in-lane permutation.  */

static bool
expand_vec_perm_vperm2f128 (struct expand_vec_perm_d *d)
{
  struct expand_vec_perm_d dfirst, dsecond;
  unsigned i, j, nelt = d->nelt, nelt2 = nelt / 2, perm;
  bool ok;

  if (!TARGET_AVX
      || GET_MODE_SIZE (d->vmode) != 32
      || (d->vmode != V8SFmode && d->vmode != V4DFmode && !TARGET_AVX2))
    return false;

  dsecond = *d;
  if (d->op0 == d->op1)
    dsecond.op1 = gen_reg_rtx (d->vmode);
  dsecond.testing_p = true;

  /* ((perm << 2)|perm) & 0x33 is the vperm2[fi]128
     immediate.  For perm < 16 the second permutation uses
     d->op0 as first operand, for perm >= 16 it uses d->op1
     as first operand.  The second operand is the result of
     vperm2[fi]128.  */
  for (perm = 0; perm < 32; perm++)
    {
      /* Ignore permutations which do not move anything cross-lane.  */
      if (perm < 16)
        {
          if ((perm & 0xc) == (1 << 2)) continue;
          if ((perm & 3) == 0) continue;
          if ((perm & 0xf) == ((3 << 2) | 2)) continue;
        }
      else
        {
          if ((perm & 0xc) == (3 << 2)) continue;
          if ((perm & 3) == 2) continue;
          if ((perm & 0xf) == (1 << 2)) continue;
        }

      for (i = 0; i < nelt; i++)
        {
          j = d->perm[i] / nelt2;
          if (j == ((perm >> (2 * (i >= nelt2))) & 3))
            dsecond.perm[i] = nelt + (i & nelt2) + (d->perm[i] & (nelt2 - 1));
          else if (j == (unsigned) (i >= nelt2) + 2 * (perm >= 16))
            dsecond.perm[i] = d->perm[i] & (nelt - 1);
          else
            break;
        }

      if (i == nelt)
        {
          start_sequence ();
          ok = expand_vec_perm_1 (&dsecond);
          end_sequence ();
        }
      else
        ok = false;

      if (ok)
        {
          if (d->testing_p)
            return true;

          dsecond.testing_p = false;
          dfirst = *d;
          if (d->op0 == d->op1)
            dfirst.target = dsecond.op1;
          else
            dfirst.target = gen_reg_rtx (d->vmode);
          for (i = 0; i < nelt; i++)
            dfirst.perm[i] = (i & (nelt2 - 1))
                             + ((perm >> (2 * (i >= nelt2))) & 3) * nelt2;

          ok = expand_vec_perm_1 (&dfirst);
          gcc_assert (ok);

          dsecond.op1 = dfirst.target;
          if (perm >= 16)
            dsecond.op0 = dfirst.op1;

          ok = expand_vec_perm_1 (&dsecond);
          gcc_assert (ok);

          return true;
        }

      /* For d->op0 == d->op1 the only useful vperm2f128 permutation
         is 0x10.  */
      if (d->op0 == d->op1)
        return false;
    }

  return false;
}

This will handle e.g. vperm2f128 + {vshufpd,vblendpd,vunpcklpd,vunpckhpd} etc.
But with the current expand_vselect implementation it might be too costly, at
least memory-wise.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug target/52607] v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
  2012-03-17  1:06 [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2} marc.glisse at normalesup dot org
                   ` (7 preceding siblings ...)
  2012-03-19 17:19 ` jakub at gcc dot gnu.org
@ 2012-03-19 19:00 ` marc.glisse at normalesup dot org
  2012-03-19 19:29 ` jakub at gcc dot gnu.org
                   ` (20 subsequent siblings)
  29 siblings, 0 replies; 31+ messages in thread
From: marc.glisse at normalesup dot org @ 2012-03-19 19:00 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607

--- Comment #9 from Marc Glisse <marc.glisse at normalesup dot org> 2012-03-19 18:29:50 UTC ---
(In reply to comment #8)
> I'm not very keen on having too many different routines, the more generic they
> are, the better.

Agreed, that was one of my concerns from the first message in this bug, but to
experiment it was easier to have separate functions.

> So IMHO e.g. the two insn sequence, vperm2[if]128 + some one
> insn shuffle could look like:
> 
> /* A subroutine of ix86_expand_vec_perm_builtin_1.  Try to expand
>    a vector permutation using two instructions, vperm2f128 resp.
>    vperm2i128 followed by any single in-lane permutation.  */

I haven't yet looked at it closely enough to understand what it does (those
functions are surprisingly confusing when you don't write them yourself), but
that looks interesting.

My first idea in order to make things more generic was to tentatively turn
__builtin_shuffle(x,m) into __builtin_shuffle(x,vperm2f128(x,x,33),mm) where mm
avoids any cross-lane. The 2-vector no-cross-lane shuffle should take at most 3
instructions in v4df or v8sf (I haven't checked if it works now) and that's
where most of the work would happen (instead of having many routines for
single-vector shuffles that almost all start with vperm2f128). Then you would
probably want to check how many instructions it used, since it could be more or
less than one of the few instruction sequences that don't start with
vperm2f128.

>From a quick look, it looks like you may be doing something even more
generic...

> This will handle e.g. vperm2f128 + {vshufpd,vblendpd,vunpcklpd,vunpckhpd} etc.

Cool!


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug target/52607] v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
  2012-03-17  1:06 [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2} marc.glisse at normalesup dot org
                   ` (8 preceding siblings ...)
  2012-03-19 19:00 ` marc.glisse at normalesup dot org
@ 2012-03-19 19:29 ` jakub at gcc dot gnu.org
  2012-03-19 20:09 ` rth at gcc dot gnu.org
                   ` (19 subsequent siblings)
  29 siblings, 0 replies; 31+ messages in thread
From: jakub at gcc dot gnu.org @ 2012-03-19 19:29 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607

--- Comment #10 from Jakub Jelinek <jakub at gcc dot gnu.org> 2012-03-19 19:18:33 UTC ---
Created attachment 26925
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=26925
gcc48-vselect.patch

Patch to decrease memory cost of expand_vselect{,_vconcat}, by keeping around
a skeleton of the vselect/vconcat insn.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug target/52607] v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
  2012-03-17  1:06 [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2} marc.glisse at normalesup dot org
                   ` (9 preceding siblings ...)
  2012-03-19 19:29 ` jakub at gcc dot gnu.org
@ 2012-03-19 20:09 ` rth at gcc dot gnu.org
  2012-03-20  9:26 ` jakub at gcc dot gnu.org
                   ` (18 subsequent siblings)
  29 siblings, 0 replies; 31+ messages in thread
From: rth at gcc dot gnu.org @ 2012-03-19 20:09 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607

--- Comment #11 from Richard Henderson <rth at gcc dot gnu.org> 2012-03-19 19:36:03 UTC ---
The vselect patch looks pretty good.

+  if (icode >= 0 && !testing_p)
+    x = copy_rtx (PATTERN (vselect_insn));
...
+  if (!testing_p)
+    emit_insn (x);

could be merged for clarity.

The patch in comment #8 looks plausible.  It could
stand to have some comments added though.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug target/52607] v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
  2012-03-17  1:06 [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2} marc.glisse at normalesup dot org
                   ` (10 preceding siblings ...)
  2012-03-19 20:09 ` rth at gcc dot gnu.org
@ 2012-03-20  9:26 ` jakub at gcc dot gnu.org
  2012-03-20 16:27 ` jakub at gcc dot gnu.org
                   ` (17 subsequent siblings)
  29 siblings, 0 replies; 31+ messages in thread
From: jakub at gcc dot gnu.org @ 2012-03-20  9:26 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2012-03-20
     Ever Confirmed|0                           |1

--- Comment #12 from Jakub Jelinek <jakub at gcc dot gnu.org> 2012-03-20 09:17:41 UTC ---
Testing the 3 patches now (AVX2 improvements, expand_vselect and #c8 with
further comments).  For 3/4 insn sequences, I agree with the proposal to
attempt to handle d->op0 == d->op1 cross-lane shuffles as two operand in-lane
shuffles
after vperm2f128 swapping the lanes.  Two insn expanders could be groupped into
expand_vec_perm_2 and three insn expanders into expand_vec_perm_3.
We need to write some further 2 and 3 insn in-lane expanders though, as shown
by:
typedef double V4DF __attribute__((vector_size (4 * sizeof (double))));
typedef long V4DI __attribute__((vector_size (4 * sizeof (long))));

#define A(a, b, c, d) \
__attribute__((noinline, noclone)) V4DF \
f##a##b##c##d (V4DF x, V4DF y) \
{\
  V4DI m = { a, b, c, d }; \
  return __builtin_shuffle (x, y, m); \
}
#define B(b, c, d) A(0, b, c, d) A(1, b, c, d) A(4, b, c, d) A(5, b, c, d)
#define C(c, d) B(0, c, d) B(1, c, d) B(4, c, d) B(5, c, d)
#define D(d) C(2, d) C(3, d) C(6, d) C(7, d)
#define E D(2) D(3) D(6) D(7)
E

int
main ()
{
  V4DF x = { 0.5, 1.5, 2.5, 3.5 }, y = { 4.5, 5.5, 6.5, 7.5 }, z;
#undef A
#define A(a, b, c, d) \
  z = f##a##b##c##d (x, y); \
  if (z[0] != a + .5 || z[1] != b + .5 || z[2] != c + .5 || z[3] != d + .5) \
    __builtin_abort ();
  E
  return 0;
}


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug target/52607] v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
  2012-03-17  1:06 [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2} marc.glisse at normalesup dot org
                   ` (11 preceding siblings ...)
  2012-03-20  9:26 ` jakub at gcc dot gnu.org
@ 2012-03-20 16:27 ` jakub at gcc dot gnu.org
  2012-03-20 16:56 ` jakub at gcc dot gnu.org
                   ` (16 subsequent siblings)
  29 siblings, 0 replies; 31+ messages in thread
From: jakub at gcc dot gnu.org @ 2012-03-20 16:27 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607

--- Comment #13 from Jakub Jelinek <jakub at gcc dot gnu.org> 2012-03-20 16:26:08 UTC ---
Author: jakub
Date: Tue Mar 20 16:25:54 2012
New Revision: 185577

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=185577
Log:
    PR target/52607
    * config/i386/i386.md ("isa" attribute): Add avx2 and noavx2.
    ("enabled" attribute): Handle avx2 and noavx2 isas.
    * config/i386/sse.md (avx2_vec_dupv8sf_1, avx2_pbroadcast<mode>_1):
    New insns.
    (vec_dup<mode>): Add avx2 =x,x alternative.
    (vec_dup<mode> splitter): Don't split if TARGET_AVX2.
    (*avx_vperm_broadcast_<mode>): Don't split V4DFmode if TARGET_AVX2.
    For TARGET_AVX2, V8SFmode and elt == 0 split into vbroadcastss.
    * config/i386/i386.c (expand_vec_perm_pshufb): Emit also vpermps
    for V8SFmode.
    (expand_vec_perm_1): For broadcasts, use avx2_pbroadcast<mode>_1
    if possible, handle also V8SFmode.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/i386.md
    trunk/gcc/config/i386/sse.md


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug target/52607] v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
  2012-03-17  1:06 [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2} marc.glisse at normalesup dot org
                   ` (12 preceding siblings ...)
  2012-03-20 16:27 ` jakub at gcc dot gnu.org
@ 2012-03-20 16:56 ` jakub at gcc dot gnu.org
  2012-03-20 19:05 ` marc.glisse at normalesup dot org
                   ` (15 subsequent siblings)
  29 siblings, 0 replies; 31+ messages in thread
From: jakub at gcc dot gnu.org @ 2012-03-20 16:56 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607

--- Comment #14 from Jakub Jelinek <jakub at gcc dot gnu.org> 2012-03-20 16:51:50 UTC ---
Author: jakub
Date: Tue Mar 20 16:51:41 2012
New Revision: 185579

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=185579
Log:
    PR target/52607
    * config/i386/i386.c (expand_vec_perm_vperm2f128): New function.
    (ix86_expand_vec_perm_const_1): Call it.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug target/52607] v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
  2012-03-17  1:06 [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2} marc.glisse at normalesup dot org
                   ` (13 preceding siblings ...)
  2012-03-20 16:56 ` jakub at gcc dot gnu.org
@ 2012-03-20 19:05 ` marc.glisse at normalesup dot org
  2012-03-20 19:09 ` marc.glisse at normalesup dot org
                   ` (14 subsequent siblings)
  29 siblings, 0 replies; 31+ messages in thread
From: marc.glisse at normalesup dot org @ 2012-03-20 19:05 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607

--- Comment #15 from Marc Glisse <marc.glisse at normalesup dot org> 2012-03-20 19:00:32 UTC ---
If I am not mistaken, the V8SF shuffle 22022246 is doable by a vperm2f128 that
takes 01234567 to 01230123, followed by a vshufps (mask 138 maybe). Was your
patch supposed to handle it?


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug target/52607] v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
  2012-03-17  1:06 [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2} marc.glisse at normalesup dot org
                   ` (14 preceding siblings ...)
  2012-03-20 19:05 ` marc.glisse at normalesup dot org
@ 2012-03-20 19:09 ` marc.glisse at normalesup dot org
  2012-03-20 22:03 ` marc.glisse at normalesup dot org
                   ` (13 subsequent siblings)
  29 siblings, 0 replies; 31+ messages in thread
From: marc.glisse at normalesup dot org @ 2012-03-20 19:09 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607

--- Comment #16 from Marc Glisse <marc.glisse at normalesup dot org> 2012-03-20 19:05:22 UTC ---
(In reply to comment #15)
> If I am not mistaken, the V8SF shuffle 22022246 is doable by a vperm2f128 that
> takes 01234567 to 01230123, followed by a vshufps (mask 138 maybe). Was your
> patch supposed to handle it?

Uh, no it isn't supposed to handle it (there would be redundancy and it
wouldn't know where to take elements from), sorry, forget that comment.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug target/52607] v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
  2012-03-17  1:06 [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2} marc.glisse at normalesup dot org
                   ` (15 preceding siblings ...)
  2012-03-20 19:09 ` marc.glisse at normalesup dot org
@ 2012-03-20 22:03 ` marc.glisse at normalesup dot org
  2012-03-25 14:09 ` marc.glisse at normalesup dot org
                   ` (12 subsequent siblings)
  29 siblings, 0 replies; 31+ messages in thread
From: marc.glisse at normalesup dot org @ 2012-03-20 22:03 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607

Marc Glisse <marc.glisse at normalesup dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #26912|0                           |1
        is obsolete|                            |

--- Comment #17 from Marc Glisse <marc.glisse at normalesup dot org> 2012-03-20 21:50:40 UTC ---
Created attachment 26938
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=26938
intra-lane shuffle in 3 insn

This (mostly untested) patch is a reformulation of the generic v8sf single
vector shuffle in 4 insn as a generic intra-lane 2 vector shuffle in at most 3
insn. Reformulating __builtin_shuffle(x,m) as
__builtin_shuffle(x,vperm2f128(x,1),mm) would then guarantee a maximum size of
4.

Note that the strategy of doing a 2-vector shuffle by shuffling (not restricted
to one vpermilp*) each vector and blending the results gives a maximum of 9
insn, whereas the current code often generates twice that number.


By the way, I have trouble understanding this comment:
      /* For d->op0 == d->op1 the only useful vperm2f128 permutation
         is 0x10.  */
Is it really 0x10, or is there a stray 0 at the end and it is really just 1?


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug target/52607] v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
  2012-03-17  1:06 [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2} marc.glisse at normalesup dot org
                   ` (16 preceding siblings ...)
  2012-03-20 22:03 ` marc.glisse at normalesup dot org
@ 2012-03-25 14:09 ` marc.glisse at normalesup dot org
  2012-03-27 16:59 ` jakub at gcc dot gnu.org
                   ` (11 subsequent siblings)
  29 siblings, 0 replies; 31+ messages in thread
From: marc.glisse at normalesup dot org @ 2012-03-25 14:09 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607

Marc Glisse <marc.glisse at normalesup dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #26938|0                           |1
        is obsolete|                            |

--- Comment #18 from Marc Glisse <marc.glisse at normalesup dot org> 2012-03-25 13:52:09 UTC ---
Created attachment 26979
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=26979
default case

An updated version of this simple, generic-case shuffle (do note that I didn't
run the generated code, just checked that it compiled and the instructions
generated looked roughly ok). With the patch, we have (concerning v4df and
v8sf):

- no single-vector shuffle takes more than 4 insn,
- no 2-vector shuffle takes more than 9 insn (or 3 (+ 2 movs for constants...)
with AVX2).

I think the current code already guarantees than anything that can be done in a
single instruction is.

Some possible goals (making everything optimal may be a bit hard) would be:

- everything that can be done in 2 insn is,
- no single-vector v4df takes more than 3 insn,
- one or two extra optimizations, if they are generic enough.

I do wonder occasionally about allowing wild indexes (jokers, places where you
can put anything) in shuffles, whether it is exposed to users or just an
internal tool.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug target/52607] v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
  2012-03-17  1:06 [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2} marc.glisse at normalesup dot org
                   ` (17 preceding siblings ...)
  2012-03-25 14:09 ` marc.glisse at normalesup dot org
@ 2012-03-27 16:59 ` jakub at gcc dot gnu.org
  2012-03-27 17:25 ` jakub at gcc dot gnu.org
                   ` (10 subsequent siblings)
  29 siblings, 0 replies; 31+ messages in thread
From: jakub at gcc dot gnu.org @ 2012-03-27 16:59 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |UNCONFIRMED
     Ever Confirmed|1                           |0

--- Comment #19 from Jakub Jelinek <jakub at gcc dot gnu.org> 2012-03-27 16:53:24 UTC ---
(In reply to comment #17)
> By the way, I have trouble understanding this comment:
>       /* For d->op0 == d->op1 the only useful vperm2f128 permutation
>          is 0x10.  */
> Is it really 0x10, or is there a stray 0 at the end and it is really just 1?

Yeah, you are right, perm 1 (which actually is emitted as $33 immediate anyway,
as vperm2f128 insn is used with the same register operand twice.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug target/52607] v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
  2012-03-17  1:06 [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2} marc.glisse at normalesup dot org
                   ` (18 preceding siblings ...)
  2012-03-27 16:59 ` jakub at gcc dot gnu.org
@ 2012-03-27 17:25 ` jakub at gcc dot gnu.org
  2012-03-27 18:23 ` marc.glisse at normalesup dot org
                   ` (9 subsequent siblings)
  29 siblings, 0 replies; 31+ messages in thread
From: jakub at gcc dot gnu.org @ 2012-03-27 17:25 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607

--- Comment #20 from Jakub Jelinek <jakub at gcc dot gnu.org> 2012-03-27 17:10:00 UTC ---
Thanks for working on this.

I don't like much the calls to ix86_expand_vec_perm_const_1, if you are looking
for exactly two insn permutations, then really the two insn permutation
functions should be groupped together into expand_vec_perm_2 and you should
call that instead, or if it is 1 or 2, then expand_vec_perm_1 ||
expand_vec_perm_2.

expand_vec_perm_vperm2f128_merge has probably swapped the meaning of dfirst and
dsecond permutations when it first performs the dsecond permutation.

Lastly for each routine it is desirable to think whether it might be useful for
other vector modes (likely 32-byte only) for TARGET_AVX2.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug target/52607] v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
  2012-03-17  1:06 [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2} marc.glisse at normalesup dot org
                   ` (19 preceding siblings ...)
  2012-03-27 17:25 ` jakub at gcc dot gnu.org
@ 2012-03-27 18:23 ` marc.glisse at normalesup dot org
  2012-03-27 21:21 ` marc.glisse at normalesup dot org
                   ` (8 subsequent siblings)
  29 siblings, 0 replies; 31+ messages in thread
From: marc.glisse at normalesup dot org @ 2012-03-27 18:23 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607

--- Comment #21 from Marc Glisse <marc.glisse at normalesup dot org> 2012-03-27 18:21:39 UTC ---
(In reply to comment #20)
> I don't like much the calls to ix86_expand_vec_perm_const_1, if you are looking
> for exactly two insn permutations,

Actually, it isn't just 2 insn. The call in expand_vec_perm_vperm2f128_merge
can take 3, and the calls in expand_vec_perm_perm_blend(...,true) up to 4 (this
is how I get a maximum of 9 insn, 1+2*4). But some more splits of
ix86_expand_vec_perm_const_1 to avoid recursive calls should be doable, if you
don't like the recursion.

> then really the two insn permutation
> functions should be groupped together into expand_vec_perm_2 and you should
> call that instead, or if it is 1 or 2, then expand_vec_perm_1 ||
> expand_vec_perm_2.

Yes, this grouping by size makes sense, whether it ends up being used or not.
Although there are expanders in the "3" category that occasionally get lucky
and generate only 2 :-)

> expand_vec_perm_vperm2f128_merge has probably swapped the meaning of dfirst and
> dsecond permutations when it first performs the dsecond permutation.

If you are just talking of the naming of the variables, yes, I completely agree
they should be swapped (or given more explicit names, like swap_lanes and
dintra).


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug target/52607] v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
  2012-03-17  1:06 [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2} marc.glisse at normalesup dot org
                   ` (20 preceding siblings ...)
  2012-03-27 18:23 ` marc.glisse at normalesup dot org
@ 2012-03-27 21:21 ` marc.glisse at normalesup dot org
  2012-03-29 13:41 ` rth at gcc dot gnu.org
                   ` (7 subsequent siblings)
  29 siblings, 0 replies; 31+ messages in thread
From: marc.glisse at normalesup dot org @ 2012-03-27 21:21 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607

--- Comment #22 from Marc Glisse <marc.glisse at normalesup dot org> 2012-03-27 20:57:16 UTC ---
(In reply to comment #20)
> Lastly for each routine it is desirable to think whether it might be useful for
> other vector modes (likely 32-byte only) for TARGET_AVX2.

I am not very familiar with the integer versions, so I tried:
#include <x86intrin.h>
__v32qi f(__v32qi x){
  __v32qi
m={0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31};
  return __builtin_shuffle(x,m);
}

$ gcc r.c -S -O1 -mavx && cat r.s
r.c: In function 'f':
r.c:2:9: error: invalid position or size operand to BIT_FIELD_REF
BIT_FIELD_REF <x_2(D), 8, -128>
r.c:2:9: note: in statement
D.5992_24 = BIT_FIELD_REF <x_2(D), 8, -128>;

r.c:2:9: error: invalid position or size operand to BIT_FIELD_REF
BIT_FIELD_REF <x_2(D), 8, -120>
r.c:2:9: note: in statement
D.5993_25 = BIT_FIELD_REF <x_2(D), 8, -120>;

[...]

r.c:2:9: internal compiler error: verify_gimple failed

(with -mavx2 it works fine)


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug target/52607] v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
  2012-03-17  1:06 [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2} marc.glisse at normalesup dot org
                   ` (21 preceding siblings ...)
  2012-03-27 21:21 ` marc.glisse at normalesup dot org
@ 2012-03-29 13:41 ` rth at gcc dot gnu.org
  2012-03-29 14:20 ` marc.glisse at normalesup dot org
                   ` (6 subsequent siblings)
  29 siblings, 0 replies; 31+ messages in thread
From: rth at gcc dot gnu.org @ 2012-03-29 13:41 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607

Richard Henderson <rth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|2012-03-20 00:00:00         |2012-03-29
     Ever Confirmed|0                           |1

--- Comment #23 from Richard Henderson <rth at gcc dot gnu.org> 2012-03-29 13:36:23 UTC ---
(In reply to comment #18)

+  if (!d->testing_p)
+    dsecond.target = gen_reg_rtx (dsecond.vmode);
+  dfirst.op1 = dsecond.target;

This bit has a problem with testing_p in that we'll have op0==op1
while testing and not when expanding.  Which means that testing_p
will be checking something else.

I've been meaning to convert i386 from op0==op1 to one_operand_p,
like I used in targets I converted later, like ia64.  I'll see about
making this change this afternoon, and then you can update your
patch to match.

+  ok = expand_vec_perm_1 (&dsecond);
+  ok &= ix86_expand_vec_perm_const_1 (&dfirst);
+
+  if (!ok)
+    return false;
+
+  return true;

Better with a short-circuit to avoid extra work:

  return (expand_vec_perm_1 (&dsecond)
          && ix86_expand_vec_perm_const_1 (&dfirst));

Otherwise the patch looks pretty good.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug target/52607] v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
  2012-03-17  1:06 [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2} marc.glisse at normalesup dot org
                   ` (22 preceding siblings ...)
  2012-03-29 13:41 ` rth at gcc dot gnu.org
@ 2012-03-29 14:20 ` marc.glisse at normalesup dot org
  2012-03-31  9:49 ` marc.glisse at normalesup dot org
                   ` (5 subsequent siblings)
  29 siblings, 0 replies; 31+ messages in thread
From: marc.glisse at normalesup dot org @ 2012-03-29 14:20 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607

--- Comment #24 from Marc Glisse <marc.glisse at normalesup dot org> 2012-03-29 14:19:11 UTC ---
(In reply to comment #23)
> (In reply to comment #18)
> 
> +  if (!d->testing_p)
> +    dsecond.target = gen_reg_rtx (dsecond.vmode);
> +  dfirst.op1 = dsecond.target;
> 
> This bit has a problem with testing_p in that we'll have op0==op1
> while testing and not when expanding.  Which means that testing_p
> will be checking something else.

Unless d->target==d->op0 (is that the case? I was kind of assuming it wasn't),
it looks ok, but I agree that it should be improved. From other code, it looks
like using gen_reg_rtx in testing is fine and avoiding it is just an
optimization.

On the other hand, if I remember correctly, the function could just return true
early when testing (like the other function does) and assert during expansion,
since it is not supposed to fail (except for the initial mode/target check),
that would document the intent better.

> I've been meaning to convert i386 from op0==op1 to one_operand_p,
> like I used in targets I converted later, like ia64.  I'll see about
> making this change this afternoon, and then you can update your
> patch to match.

ok (no promise timewise).

> +  ok = expand_vec_perm_1 (&dsecond);
> +  ok &= ix86_expand_vec_perm_const_1 (&dfirst);
> +
> +  if (!ok)
> +    return false;
> +
> +  return true;
> 
> Better with a short-circuit to avoid extra work:
> 
>   return (expand_vec_perm_1 (&dsecond)
>           && ix86_expand_vec_perm_const_1 (&dfirst));

Indeed!

Thanks for the comments.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug target/52607] v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
  2012-03-17  1:06 [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2} marc.glisse at normalesup dot org
                   ` (23 preceding siblings ...)
  2012-03-29 14:20 ` marc.glisse at normalesup dot org
@ 2012-03-31  9:49 ` marc.glisse at normalesup dot org
  2012-03-31 14:12 ` marc.glisse at normalesup dot org
                   ` (4 subsequent siblings)
  29 siblings, 0 replies; 31+ messages in thread
From: marc.glisse at normalesup dot org @ 2012-03-31  9:49 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607

--- Comment #25 from Marc Glisse <marc.glisse at normalesup dot org> 2012-03-31 09:37:51 UTC ---
The test for AVX2 in expand_vec_perm_interleave2 might be too strict. For the
V4DF shuffle 4,0,2,6, removing that check lets the compiler generate a nice
vunpcklpd+vpermilpd (as opposed to 3 insn with my patch and 5+ without). The
expansion of dfinal is already protected (so the function returns false for
4,2,0,6), I haven't checked whether something else (dremap?) needs protecting,
but it doesn't look like it.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug target/52607] v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
  2012-03-17  1:06 [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2} marc.glisse at normalesup dot org
                   ` (24 preceding siblings ...)
  2012-03-31  9:49 ` marc.glisse at normalesup dot org
@ 2012-03-31 14:12 ` marc.glisse at normalesup dot org
  2012-04-09 16:51 ` marc.glisse at normalesup dot org
                   ` (3 subsequent siblings)
  29 siblings, 0 replies; 31+ messages in thread
From: marc.glisse at normalesup dot org @ 2012-03-31 14:12 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607

Marc Glisse <marc.glisse at normalesup dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #26979|0                           |1
        is obsolete|                            |

--- Comment #26 from Marc Glisse <marc.glisse at normalesup dot org> 2012-03-31 14:02:54 UTC ---
Created attachment 27052
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=27052
default case

Updated with your comments, still can't properly test.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug target/52607] v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
  2012-03-17  1:06 [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2} marc.glisse at normalesup dot org
                   ` (25 preceding siblings ...)
  2012-03-31 14:12 ` marc.glisse at normalesup dot org
@ 2012-04-09 16:51 ` marc.glisse at normalesup dot org
  2012-04-11 16:49 ` marc.glisse at normalesup dot org
                   ` (2 subsequent siblings)
  29 siblings, 0 replies; 31+ messages in thread
From: marc.glisse at normalesup dot org @ 2012-04-09 16:51 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607

--- Comment #27 from Marc Glisse <marc.glisse at normalesup dot org> 2012-04-09 16:50:47 UTC ---
Notes to self (or other):
- Intel's SDE makes it possible to test without appropriate hardware;
- for V4DF shuffles, there seems to be a very simple generic solution that
performs two vperm2f128 and then one vshufpd.

permutation (a,b,c,d), input (x,y):
t1 = vperm2f128(x,y,(a/2)+16*(c/2));
t2 = vperm2f128(x,y,(b/2)+16*(d/2));
return vshufpd(t1,t2,(a%2)+2*(b%2)+4*(c%2)+8*(d%2));

(when t1 or t2 is equal to x or y, it generates only 2 insn in cases that the
current code doesn't detect, like {3,1,2,2})


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug target/52607] v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
  2012-03-17  1:06 [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2} marc.glisse at normalesup dot org
                   ` (26 preceding siblings ...)
  2012-04-09 16:51 ` marc.glisse at normalesup dot org
@ 2012-04-11 16:49 ` marc.glisse at normalesup dot org
  2012-04-11 20:35 ` marc.glisse at normalesup dot org
  2012-05-14 21:29 ` glisse at gcc dot gnu.org
  29 siblings, 0 replies; 31+ messages in thread
From: marc.glisse at normalesup dot org @ 2012-04-11 16:49 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607

--- Comment #28 from Marc Glisse <marc.glisse at normalesup dot org> 2012-04-11 16:48:47 UTC ---
A difficulty I hadn't foreseen is that the code that canonicalizes permutations
(and in particular checks if one of the operands is unused) is in
ix86_expand_vec_perm_const. So if I ask expand_vec_perm_1 to generate the
2-operand 0,1,2,3 permutation, it will happily generate vperm2f128 with
immediate 16 without noticing that it is the identity on the first operand. I
should probably move that code into its own function so I can call it before
expand_vec_perm_1.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug target/52607] v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
  2012-03-17  1:06 [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2} marc.glisse at normalesup dot org
                   ` (27 preceding siblings ...)
  2012-04-11 16:49 ` marc.glisse at normalesup dot org
@ 2012-04-11 20:35 ` marc.glisse at normalesup dot org
  2012-05-14 21:29 ` glisse at gcc dot gnu.org
  29 siblings, 0 replies; 31+ messages in thread
From: marc.glisse at normalesup dot org @ 2012-04-11 20:35 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607

--- Comment #29 from Marc Glisse <marc.glisse at normalesup dot org> 2012-04-11 20:35:00 UTC ---
Created attachment 27136
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=27136
V4DF generic shuffle

A patch (independent from the others) implementing what is explained in the
last 2 comments. It is simple and works really well, all V4DF shuffles (even
with 2 vectors) take only 3 insn (and often just 2). It only requires AVX, but
also improves a lot on the current AVX2 code which casts to vectors of integers
and uses up to 9 insn (although my "default case" patch also goes down to 3
insn on AVX2).

The drawback is that it is limited to V4DF. vshufps is a different enough beast
from vshufpd that it would require a different code, which wouldn't even apply
that often. For V8SF, my "default case" patch seems more interesting. Integer
vectors have different instructions again...

By the way, I tested all V4DF permutations (there are only 2^12 of them) in the
simulator. I also have a file (400K) with the code for each permutation, that
looks like the following:
0,0,0,0
    vpermilpd    $0, %ymm0, %ymm0
    vperm2f128    $0, %ymm0, %ymm0, %ymm0
[...]
1,7,6,3
        vperm2f128      $48, %ymm1, %ymm0, %ymm2
        vperm2f128      $19, %ymm1, %ymm0, %ymm0
        vshufpd $11, %ymm0, %ymm2, %ymm0
1,7,6,4
        vperm2f128      $48, %ymm1, %ymm0, %ymm0
        vperm2f128      $33, %ymm1, %ymm1, %ymm1
        vshufpd $3, %ymm1, %ymm0, %ymm0
[...]
If anyone wants to take a look, tell me and I'll attach it.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug target/52607] v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2}
  2012-03-17  1:06 [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2} marc.glisse at normalesup dot org
                   ` (28 preceding siblings ...)
  2012-04-11 20:35 ` marc.glisse at normalesup dot org
@ 2012-05-14 21:29 ` glisse at gcc dot gnu.org
  29 siblings, 0 replies; 31+ messages in thread
From: glisse at gcc dot gnu.org @ 2012-05-14 21:29 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607

--- Comment #30 from Marc Glisse <glisse at gcc dot gnu.org> 2012-05-14 20:19:39 UTC ---
Author: glisse
Date: Mon May 14 20:19:30 2012
New Revision: 187479

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=187479
Log:
2012-05-14  Marc Glisse  <marc.glisse@inria.fr>

    PR target/52607
    * config/i386/i386.c (ix86_expand_vec_perm_const): Move code to ...
    (canonicalize_perm): ... new function.
    (expand_vec_perm_2vperm2f128_vshuf): New function.
    (ix86_expand_vec_perm_const_1): Call it.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c


^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2012-05-14 20:20 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-03-17  1:06 [Bug target/52607] New: v4df __builtin_shuffle with {0,2,1,3} or {1,3,0,2} marc.glisse at normalesup dot org
2012-03-17  4:58 ` [Bug target/52607] " marc.glisse at normalesup dot org
2012-03-17 19:55 ` marc.glisse at normalesup dot org
2012-03-17 20:11 ` marc.glisse at normalesup dot org
2012-03-17 22:17 ` marc.glisse at normalesup dot org
2012-03-18 14:51 ` marc.glisse at normalesup dot org
2012-03-18 20:13 ` marc.glisse at normalesup dot org
2012-03-19 11:16 ` jakub at gcc dot gnu.org
2012-03-19 17:19 ` jakub at gcc dot gnu.org
2012-03-19 19:00 ` marc.glisse at normalesup dot org
2012-03-19 19:29 ` jakub at gcc dot gnu.org
2012-03-19 20:09 ` rth at gcc dot gnu.org
2012-03-20  9:26 ` jakub at gcc dot gnu.org
2012-03-20 16:27 ` jakub at gcc dot gnu.org
2012-03-20 16:56 ` jakub at gcc dot gnu.org
2012-03-20 19:05 ` marc.glisse at normalesup dot org
2012-03-20 19:09 ` marc.glisse at normalesup dot org
2012-03-20 22:03 ` marc.glisse at normalesup dot org
2012-03-25 14:09 ` marc.glisse at normalesup dot org
2012-03-27 16:59 ` jakub at gcc dot gnu.org
2012-03-27 17:25 ` jakub at gcc dot gnu.org
2012-03-27 18:23 ` marc.glisse at normalesup dot org
2012-03-27 21:21 ` marc.glisse at normalesup dot org
2012-03-29 13:41 ` rth at gcc dot gnu.org
2012-03-29 14:20 ` marc.glisse at normalesup dot org
2012-03-31  9:49 ` marc.glisse at normalesup dot org
2012-03-31 14:12 ` marc.glisse at normalesup dot org
2012-04-09 16:51 ` marc.glisse at normalesup dot org
2012-04-11 16:49 ` marc.glisse at normalesup dot org
2012-04-11 20:35 ` marc.glisse at normalesup dot org
2012-05-14 21:29 ` glisse at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).