public inbox for gcc-patches@gcc.gnu.org
* [PATCH 2/2, x86] Add palignr support for AVX2.
@ 2014-04-29 13:51 Evgeny Stupachenko
  2014-04-29 15:03 ` H.J. Lu
  2014-04-29 15:55 ` Richard Henderson
  0 siblings, 2 replies; 15+ messages in thread
From: Evgeny Stupachenko @ 2014-04-29 13:51 UTC (permalink / raw)
  To: GCC Patches, Richard Biener, Uros Bizjak; +Cc: H.J. Lu, Richard Henderson

Hi,

The patch adds use of the palignr instruction when we have a one-operand
permutation like:
{5 6 7 0 1 2 3 4}.

We treat this as {5 6 7 8 9 a b c} on 2 operands, and therefore use
palignr with a shift of 5 elements.
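As a scalar sketch of the idea (model code with made-up names, not part of the patch): palignr takes 16 bytes starting at a byte offset from the concatenation of its two inputs, so with both inputs equal to the same register the result is exactly such a rotation.

```c
#include <string.h>

/* Scalar model of 128-bit PALIGNR: take 16 bytes starting at byte
   offset SHIFT from the 32-byte concatenation {OP1:OP0}, OP0 low.  */
static void
palignr16 (unsigned char *dst, const unsigned char *op1,
           const unsigned char *op0, unsigned shift)
{
  unsigned char concat[32];
  memcpy (concat, op0, 16);
  memcpy (concat + 16, op1, 16);
  memcpy (dst, concat + shift, 16);
}

/* With OP1 == OP0 == V and 2-byte elements, a shift of 5 elements
   (10 bytes) turns V into the one-operand permutation
   {5 6 7 0 1 2 3 4} of V's elements.  */
```

The patch derives that shift from d->perm[0], the first index of the rotation.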

Bootstrap and make check passed.

Is it ok?

Evgeny

2014-04-29  Evgeny Stupachenko  <evstupac@gmail.com>

        * config/i386/i386.c (expand_vec_perm_palignr_one_operand): New.
        Enables PALIGNR on one operand permutation.
        * config/i386/i386.c (expand_vec_perm_1): Try PALIGNR on one operand.


diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 002d295..8950cf7 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -42807,6 +42807,97 @@ expand_vec_perm_pshufb (struct expand_vec_perm_d *d)
   return true;
 }

+/* A subroutine of ix86_expand_vec_perm_1.  Try to use just the palignr
+   instruction for a one operand permutation.  This is better than pshufb
+   as it does not require passing a big constant and is faster on some
+   x86 architectures.  */
+
+static bool
+expand_vec_perm_palignr_one_operand (struct expand_vec_perm_d *d)
+{
+  unsigned i, nelt = d->nelt;
+  unsigned min;
+  unsigned in_order_length, in_order_length_max;
+  rtx shift, shift1, target, tmp;
+
+  /* PALIGNR of 2 128-bit registers takes only 1 instruction.
+     Requires SSSE3.  */
+  if (GET_MODE_SIZE (d->vmode) == 16)
+    {
+      if (!TARGET_SSSE3)
+       return false;
+    }
+  /* PALIGNR of 2 256-bit registers on AVX2 costs only 2 instructions:
+     PERM and PALIGNR.  It is more profitable than 2 PSHUFB and PERM.  */
+  else if (GET_MODE_SIZE (d->vmode) == 32)
+    {
+      if (!TARGET_AVX2)
+       return false;
+    }
+  else
+    return false;
+
+  if (d->one_operand_p != true)
+    return false;
+
+  /* For an in order permutation with one operand like: {5 6 7 0 1 2 3 4}
+     PALIGNR is better than PSHUFB.  Check for an order in permutation.  */
+  in_order_length = 0;
+  in_order_length_max = 0;
+  if (d->one_operand_p == true)
+    for (i = 0; i < 2 * nelt; ++i)
+      {
+       if ((d->perm[(i + 1) & (nelt - 1)] -
+            d->perm[i & (nelt - 1)]) != 1)
+         {
+           if (in_order_length > in_order_length_max)
+               in_order_length_max = in_order_length;
+             in_order_length = 0;
+         }
+       else
+         in_order_length++;
+      }
+
+  /* If not an ordered permutation then try something else.  */
+  if (in_order_length_max != nelt - 1)
+    return false;
+
+  min = d->perm[0];
+
+  shift = GEN_INT (min * GET_MODE_BITSIZE (GET_MODE_INNER (d->vmode)));
+  shift1 = GEN_INT ((min - nelt / 2) *
+          GET_MODE_BITSIZE (GET_MODE_INNER (d->vmode)));
+
+  if (GET_MODE_SIZE (d->vmode) != 32)
+    {
+      target = gen_reg_rtx (TImode);
+      emit_insn (gen_ssse3_palignrti (target, gen_lowpart (TImode, d->op1),
+                                     gen_lowpart (TImode, d->op0), shift));
+    }
+  else
+    {
+      target = gen_reg_rtx (V2TImode);
+      tmp = gen_reg_rtx (V4DImode);
+      emit_insn (gen_avx2_permv2ti (tmp,
+                                   gen_lowpart (V4DImode, d->op0),
+                                   gen_lowpart (V4DImode, d->op1),
+                                   GEN_INT (33)));
+      if (min < nelt / 2)
+        emit_insn (gen_avx2_palignrv2ti (target,
+                                        gen_lowpart (V2TImode, tmp),
+                                        gen_lowpart (V2TImode, d->op0),
+                                        shift));
+      else
+       emit_insn (gen_avx2_palignrv2ti (target,
+                                        gen_lowpart (V2TImode, d->op1),
+                                        gen_lowpart (V2TImode, tmp),
+                                        shift1));
+    }
+  emit_move_insn (d->target, gen_lowpart (d->vmode, target));
+
+  return true;
+}
+
 static bool expand_vec_perm_vpshufb2_vpermq (struct expand_vec_perm_d *d);

 /* A subroutine of ix86_expand_vec_perm_builtin_1.  Try to instantiate D
@@ -42943,6 +43034,10 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d)
   if (expand_vec_perm_vpermil (d))
     return true;

+  /* Try palignr on one operand.  */
+  if (expand_vec_perm_palignr_one_operand (d))
+    return true;
+
   /* Try the SSSE3 pshufb or XOP vpperm or AVX2 vperm2i128,
      vpshufb, vpermd, vpermps or vpermq variable permutation.  */
   if (expand_vec_perm_pshufb (d))

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2, x86] Add palignr support for AVX2.
  2014-04-29 13:51 [PATCH 2/2, x86] Add palignr support for AVX2 Evgeny Stupachenko
@ 2014-04-29 15:03 ` H.J. Lu
  2014-04-29 16:54   ` Evgeny Stupachenko
  2014-04-29 17:27   ` Jakub Jelinek
  2014-04-29 15:55 ` Richard Henderson
  1 sibling, 2 replies; 15+ messages in thread
From: H.J. Lu @ 2014-04-29 15:03 UTC (permalink / raw)
  To: Evgeny Stupachenko
  Cc: GCC Patches, Richard Biener, Uros Bizjak, Richard Henderson

On Tue, Apr 29, 2014 at 6:50 AM, Evgeny Stupachenko <evstupac@gmail.com> wrote:
> Hi,
>
> The patch adds use of palignr instruction, when we have one operand
> permutation like:
> {5 6 7 0 1 2 3 4}:
>
> Treating this as {5 6 7 8 9 a b c} on 2 operands, and therefore palignr on 5.
>
> Bootstrap and make check passed.
>
> Is it ok?
>
> Evgeny
>
> 2014-04-29  Evgeny Stupachenko  <evstupac@gmail.com>
>
>         * config/i386/i386.c (expand_vec_perm_palignr_one_operand): New.
>         Enables PALIGNR on one operand permutation.
>         * config/i386/i386.c (expand_vec_perm_1): Try PALIGNR on one operand.
>
>

I think it is better to include some testcases in this
patch so that the backend change and its testcases
are self-contained.

Thanks.

-- 
H.J.


* Re: [PATCH 2/2, x86] Add palignr support for AVX2.
  2014-04-29 13:51 [PATCH 2/2, x86] Add palignr support for AVX2 Evgeny Stupachenko
  2014-04-29 15:03 ` H.J. Lu
@ 2014-04-29 15:55 ` Richard Henderson
  2014-04-29 17:23   ` Evgeny Stupachenko
  1 sibling, 1 reply; 15+ messages in thread
From: Richard Henderson @ 2014-04-29 15:55 UTC (permalink / raw)
  To: Evgeny Stupachenko, GCC Patches, Richard Biener, Uros Bizjak; +Cc: H.J. Lu

On 04/29/2014 06:50 AM, Evgeny Stupachenko wrote:
> +  if (d->one_operand_p != true)
> +    return false;

This looks odd.  Better as !d->one_operand_p.

> +
> +  /* For an in order permutation with one operand like: {5 6 7 0 1 2 3 4}
> +     PALIGNR is better than PSHUFB.  Check for an order in permutation.  */

FWIW, "in order permutation" sounds like a contradiction in terms.
Better to describe this as a rotate.

> +  in_order_length = 0;
> +  in_order_length_max = 0;
> +  if (d->one_operand_p == true)

You've just tested one_operand above.

> +    for (i = 0; i < 2 * nelt; ++i)

Which means that 2*nelt is doing twice as much work as needed.

> +      {
> +       if ((d->perm[(i + 1) & (nelt - 1)] -
> +            d->perm[i & (nelt - 1)]) != 1)

Surely we can avoid re-reading the comparison value...

  next = (d->perm[0] + 1) & (nelt - 1);
  for (i = 1; i < nelt; ++i)
    {
      if (d->perm[i] != next)
        return false;
      next = (next + 1) & (nelt - 1);
    }

> +         {
> +           if (in_order_length > in_order_length_max)
> +               in_order_length_max = in_order_length;
> +             in_order_length = 0;
> +         }
> +       else
> +         in_order_length++;
> +      }
> +
> +  /* If not an ordered permutation then try something else.  */
> +  if (in_order_length_max != nelt - 1)
> +    return false;

I don't understand what this length and max stuff is trying to accomplish.

> +
> +  min = d->perm[0];
> +
> +  shift = GEN_INT (min * GET_MODE_BITSIZE (GET_MODE_INNER (d->vmode)));
> +  shift1 = GEN_INT ((min - nelt / 2) *
> +          GET_MODE_BITSIZE (GET_MODE_INNER (d->vmode)));
> +
> +  if (GET_MODE_SIZE (d->vmode) != 32)

Positive tests are almost always clearer: == 16.


r~


* Re: [PATCH 2/2, x86] Add palignr support for AVX2.
  2014-04-29 15:03 ` H.J. Lu
@ 2014-04-29 16:54   ` Evgeny Stupachenko
  2014-04-29 17:27   ` Jakub Jelinek
  1 sibling, 0 replies; 15+ messages in thread
From: Evgeny Stupachenko @ 2014-04-29 16:54 UTC (permalink / raw)
  To: H.J. Lu; +Cc: GCC Patches, Richard Biener, Uros Bizjak, Richard Henderson

On Tue, Apr 29, 2014 at 6:58 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Tue, Apr 29, 2014 at 6:50 AM, Evgeny Stupachenko <evstupac@gmail.com> wrote:
>> Hi,
>>
>> The patch adds use of palignr instruction, when we have one operand
>> permutation like:
>> {5 6 7 0 1 2 3 4}:
>>
>> Treating this as {5 6 7 8 9 a b c} on 2 operands, and therefore palignr on 5.
>>
>> Bootstrap and make check passed.
>>
>> Is it ok?
>>
>> Evgeny
>>
>> 2014-04-29  Evgeny Stupachenko  <evstupac@gmail.com>
>>
>>         * config/i386/i386.c (expand_vec_perm_palignr_one_operand): New.
>>         Enables PALIGNR on one operand permutation.
>>         * config/i386/i386.c (expand_vec_perm_1): Try PALIGNR on one operand.
>>
>>
>
> I think it is better to include some testcases in this
> patch so that the backend change and its testcases
> are self-contained.
>
I have no idea how to generate a permutation for palignr right now, but
I have a patch fixing PR52252 that generates such permutations (the
first part of it is currently under review with corresponding tests;
the next part will generate palignr instructions on those tests).
The initial implementation of palignr support was committed without tests.
The current patches are an extension of that initial implementation.

> Thanks.
>
> --
> H.J.

Thanks,
Evgeny


* Re: [PATCH 2/2, x86] Add palignr support for AVX2.
  2014-04-29 15:55 ` Richard Henderson
@ 2014-04-29 17:23   ` Evgeny Stupachenko
  2014-04-29 17:41     ` Richard Henderson
  0 siblings, 1 reply; 15+ messages in thread
From: Evgeny Stupachenko @ 2014-04-29 17:23 UTC (permalink / raw)
  To: Richard Henderson; +Cc: GCC Patches, Richard Biener, Uros Bizjak, H.J. Lu

Thanks. The patch is fixed according to your comments:

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 002d295..aa6372a 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -42807,6 +42807,79 @@ expand_vec_perm_pshufb (struct expand_vec_perm_d *d)
   return true;
 }

+/* A subroutine of ix86_expand_vec_perm_1.  Try to use just the palignr
+   instruction for a one operand permutation.  This is better than pshufb
+   as it does not require passing a big constant and is faster on some
+   x86 architectures.  */
+
+static bool
+expand_vec_perm_palignr_one_operand (struct expand_vec_perm_d *d)
+{
+  unsigned i, nelt = d->nelt;
+  unsigned min;
+  rtx shift, shift1, target, tmp;
+
+  /* PALIGNR of 2 128-bit registers takes only 1 instruction.
+     Requires SSSE3.  */
+  if (GET_MODE_SIZE (d->vmode) == 16)
+    {
+      if (!TARGET_SSSE3)
+       return false;
+    }
+  /* PALIGNR of 2 256-bit registers on AVX2 costs only 2 instructions:
+     PERM and PALIGNR.  It is more profitable than 2 PSHUFB and PERM.  */
+  else if (GET_MODE_SIZE (d->vmode) == 32)
+    {
+      if (!TARGET_AVX2)
+       return false;
+    }
+  else
+    return false;
+
+  if (!d->one_operand_p)
+    return false;
+
+  /* For a rotation permutation with one operand like: {5 6 7 0 1 2 3 4}
+     PALIGNR is better than PSHUFB.  Check for a rotation in permutation.  */
+  for (i = 0; i < nelt; ++i)
+    if ((((d->perm[(i + 1) & (nelt - 1)] - d->perm[i])) & (nelt - 1)) != 1)
+      return false;
+
+  min = d->perm[0];
+  shift = GEN_INT (min * GET_MODE_BITSIZE (GET_MODE_INNER (d->vmode)));
+  shift1 = GEN_INT ((min - nelt / 2) *
+          GET_MODE_BITSIZE (GET_MODE_INNER (d->vmode)));
+
+  if (GET_MODE_SIZE (d->vmode) == 16)
+    {
+      target = gen_reg_rtx (TImode);
+      emit_insn (gen_ssse3_palignrti (target, gen_lowpart (TImode, d->op1),
+                                     gen_lowpart (TImode, d->op0), shift));
+    }
+  else
+    {
+      target = gen_reg_rtx (V2TImode);
+      tmp = gen_reg_rtx (V4DImode);
+      emit_insn (gen_avx2_permv2ti (tmp,
+                                   gen_lowpart (V4DImode, d->op0),
+                                   gen_lowpart (V4DImode, d->op1),
+                                   GEN_INT (33)));
+      if (min < nelt / 2)
+        emit_insn (gen_avx2_palignrv2ti (target,
+                                        gen_lowpart (V2TImode, tmp),
+                                        gen_lowpart (V2TImode, d->op0),
+                                        shift));
+      else
+       emit_insn (gen_avx2_palignrv2ti (target,
+                                        gen_lowpart (V2TImode, d->op1),
+                                        gen_lowpart (V2TImode, tmp),
+                                        shift1));
+    }
+  emit_move_insn (d->target, gen_lowpart (d->vmode, target));
+
+  return true;
+}
+
 static bool expand_vec_perm_vpshufb2_vpermq (struct expand_vec_perm_d *d);

 /* A subroutine of ix86_expand_vec_perm_builtin_1.  Try to instantiate D
@@ -42943,6 +43016,10 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d)
   if (expand_vec_perm_vpermil (d))
     return true;

+  /* Try palignr on one operand.  */
+  if (expand_vec_perm_palignr_one_operand (d))
+    return true;
+
   /* Try the SSSE3 pshufb or XOP vpperm or AVX2 vperm2i128,
      vpshufb, vpermd, vpermps or vpermq variable permutation.  */
   if (expand_vec_perm_pshufb (d))

On Tue, Apr 29, 2014 at 6:50 PM, Richard Henderson <rth@redhat.com> wrote:
> On 04/29/2014 06:50 AM, Evgeny Stupachenko wrote:
>> +  if (d->one_operand_p != true)
>> +    return false;
>
> This looks odd.  Better as !d->one_operand_p.
>
>> +
>> +  /* For an in order permutation with one operand like: {5 6 7 0 1 2 3 4}
>> +     PALIGNR is better than PSHUFB.  Check for an order in permutation.  */
>
> FWIW, "in order permutation" sounds like a contradiction in terms.
> Better to describe this as a rotate.
>
>> +  in_order_length = 0;
>> +  in_order_length_max = 0;
>> +  if (d->one_operand_p == true)
>
> You've just tested one_operand above.
>
>> +    for (i = 0; i < 2 * nelt; ++i)
>
> Which means that 2*nelt is doing twice as much work as needed.
>
>> +      {
>> +       if ((d->perm[(i + 1) & (nelt - 1)] -
>> +            d->perm[i & (nelt - 1)]) != 1)
>
> Surely we can avoid re-reading the comparison value...
>
>   next = (d->perm[0] + 1) & (nelt - 1);
>   for (i = 1; i < nelt; ++i)
>     {
>       if (d->perm[i] != next)
>         return false;
>       next = (next + 1) & (nelt - 1);
>     }
>
>> +         {
>> +           if (in_order_length > in_order_length_max)
>> +               in_order_length_max = in_order_length;
>> +             in_order_length = 0;
>> +         }
>> +       else
>> +         in_order_length++;
>> +      }
>> +
>> +  /* If not an ordered permutation then try something else.  */
>> +  if (in_order_length_max != nelt - 1)
>> +    return false;
>
> I don't understand what this length and max stuff is trying to accomplish.
>
>> +
>> +  min = d->perm[0];
>> +
>> +  shift = GEN_INT (min * GET_MODE_BITSIZE (GET_MODE_INNER (d->vmode)));
>> +  shift1 = GEN_INT ((min - nelt / 2) *
>> +          GET_MODE_BITSIZE (GET_MODE_INNER (d->vmode)));
>> +
>> +  if (GET_MODE_SIZE (d->vmode) != 32)
>
> Positive tests are almost always clearer: == 16.
>
>
> r~


* Re: [PATCH 2/2, x86] Add palignr support for AVX2.
  2014-04-29 15:03 ` H.J. Lu
  2014-04-29 16:54   ` Evgeny Stupachenko
@ 2014-04-29 17:27   ` Jakub Jelinek
  1 sibling, 0 replies; 15+ messages in thread
From: Jakub Jelinek @ 2014-04-29 17:27 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Evgeny Stupachenko, GCC Patches, Richard Biener, Uros Bizjak,
	Richard Henderson

On Tue, Apr 29, 2014 at 07:58:39AM -0700, H.J. Lu wrote:
> On Tue, Apr 29, 2014 at 6:50 AM, Evgeny Stupachenko <evstupac@gmail.com> wrote:
> > The patch adds use of palignr instruction, when we have one operand
> > permutation like:
> > {5 6 7 0 1 2 3 4}:
> >
> > Treating this as {5 6 7 8 9 a b c} on 2 operands, and therefore palignr on 5.
> >
> > Bootstrap and make check passed.
> >
> > Is it ok?
> >
> > Evgeny
> >
> > 2014-04-29  Evgeny Stupachenko  <evstupac@gmail.com>
> >
> >         * config/i386/i386.c (expand_vec_perm_palignr_one_operand): New.
> >         Enables PALIGNR on one operand permutation.
> >         * config/i386/i386.c (expand_vec_perm_1): Try PALIGNR on one operand.
> >
> >
> 
> I think it is better to include some testcases in this
> patch so that the backend change and its testcases
> are self-contained.

Note that most likely this is already covered by
gcc.dg/torture/vshuf*.c, especially with GCC_TEST_RUN_EXPENSIVE=1.
Though, apparently we need to add new tests for AVX-512, because
v64qi, v32hi, v16si and v8di aren't covered (the last 3 are trivial;
for v64qi supposedly we'd need to write a new vshuf-64.inc).

	Jakub


* Re: [PATCH 2/2, x86] Add palignr support for AVX2.
  2014-04-29 17:23   ` Evgeny Stupachenko
@ 2014-04-29 17:41     ` Richard Henderson
  2014-04-30 14:31       ` Evgeny Stupachenko
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Henderson @ 2014-04-29 17:41 UTC (permalink / raw)
  To: Evgeny Stupachenko; +Cc: GCC Patches, Richard Biener, Uros Bizjak, H.J. Lu

On 04/29/2014 10:13 AM, Evgeny Stupachenko wrote:
> +  /* For a rotation permutation with one operand like: {5 6 7 0 1 2 3 4}
> +     PALIGNR is better than PSHUFB.  Check for a rotation in permutation.  */
> +  for (i = 0; i < nelt; ++i)
> +    if ((((d->perm[(i + 1) & (nelt - 1)] - d->perm[i])) & (nelt - 1)) != 1)
> +      return false;
> +
> +  min = d->perm[0];

Why are you running this loop NELT times instead of NELT-1 like I suggested?
Why is that test expression so complicated?

Obviously d->perm[0] is the anchor and everything else can be computed relative
to that.  I'd expect no more than

  min = d->perm[0];
  for (i = 1; i < nelt; ++i)
    if (d->perm[i] != ((min + i) & (nelt - 1)))
      return false;
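As a standalone sketch (hypothetical harness names, not GCC code), that loop accepts exactly the rotations:

```c
#include <stdbool.h>

/* PERM (NELT entries, NELT a power of two) is a rotation iff
   perm[i] == (perm[0] + i) mod NELT for every i.  */
static bool
is_rotation (const unsigned *perm, unsigned nelt)
{
  unsigned min = perm[0];
  for (unsigned i = 1; i < nelt; ++i)
    if (perm[i] != ((min + i) & (nelt - 1)))
      return false;
  return true;
}
```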

Now that I think of it,

> +  /* PALIGNR of 2 128-bit registers takes only 1 instruction.
> +     Requires SSSE3.  */
> +  if (GET_MODE_SIZE (d->vmode) == 16)
> +    {
> +      if (!TARGET_SSSE3)
> +       return false;
> +    }
> +  /* PALIGNR of 2 256-bit registers on AVX2 costs only 2 instructions:
> +     PERM and PALIGNR.  It is more profitable than 2 PSHUFB and PERM.  */
> +  else if (GET_MODE_SIZE (d->vmode) == 32)
> +    {
> +      if (!TARGET_AVX2)
> +       return false;
> +    }
> +  else
> +    return false;
> +
> +  if (!d->one_operand_p)
> +    return false;

The comments are wrong.  Move the operand_p test to the top,
where it should be, and they're more obviously wrong:

  "must have one operand"
  "palignr of two operands..."

I think your comment

  /* For a rotation permutation with one operand like: {5 6 7 0 1 2 3 4}
     we want to use PALIGNR.  */

belongs up there instead of those two incorrect comments.



r~


* Re: [PATCH 2/2, x86] Add palignr support for AVX2.
  2014-04-29 17:41     ` Richard Henderson
@ 2014-04-30 14:31       ` Evgeny Stupachenko
  2014-05-05 16:54         ` Evgeny Stupachenko
  0 siblings, 1 reply; 15+ messages in thread
From: Evgeny Stupachenko @ 2014-04-30 14:31 UTC (permalink / raw)
  To: Richard Henderson; +Cc: GCC Patches, Richard Biener, Uros Bizjak, H.J. Lu

On Tue, Apr 29, 2014 at 9:39 PM, Richard Henderson <rth@redhat.com> wrote:
> On 04/29/2014 10:13 AM, Evgeny Stupachenko wrote:
>> +  /* For a rotation permutation with one operand like: {5 6 7 0 1 2 3 4}
>> +     PALIGNR is better than PSHUFB.  Check for a rotation in permutation.  */
>> +  for (i = 0; i < nelt; ++i)
>> +    if ((((d->perm[(i + 1) & (nelt - 1)] - d->perm[i])) & (nelt - 1)) != 1)
>> +      return false;
>> +
>> +  min = d->perm[0];
>
> Why are you running this loop NELT times instead of NELT-1 like I suggested?
> Why is that test expression so complicated?
>
> Obviously d->perm[0] is the anchor and everything else can be computed relative
> to that.  I'd expect no more than
>
>   min = d->perm[0];
>   for (i = 1; i < nelt; ++i)
>     if (d->perm[i] != ((min + i) & (nelt - 1)))
>       return false;

Agreed on this. The loop is less complicated.

>
> Now that I think of it,
>
>> +  /* PALIGNR of 2 128-bit registers takes only 1 instruction.
>> +     Requires SSSE3.  */
>> +  if (GET_MODE_SIZE (d->vmode) == 16)
>> +    {
>> +      if (!TARGET_SSSE3)
>> +       return false;
>> +    }
>> +  /* PALIGNR of 2 256-bit registers on AVX2 costs only 2 instructions:
>> +     PERM and PALIGNR.  It is more profitable than 2 PSHUFB and PERM.  */
>> +  else if (GET_MODE_SIZE (d->vmode) == 32)
>> +    {
>> +      if (!TARGET_AVX2)
>> +       return false;
>> +    }
>> +  else
>> +    return false;
>> +
>> +  if (!d->one_operand_p)
>> +    return false;
>
> The comments are wrong.  Move the operand_p test to the top,
> where it should be, and they're more obviously wrong:
>
>   "must have one operand"
>   "palignr of two operands..."
>
> I think your comment
>
>>   /* For a rotation permutation with one operand like: {5 6 7 0 1 2 3 4}
>      we want to use PALIGNR.  */
>
> belongs up there instead of those two incorrect comments.
>

What I mean in the comments:
  16-byte case:
    For a 1-operand permutation:
      A rotation will cost one "palignr", which is better than "pshufb"
      as it has a smaller opcode (6 bytes vs 9) and always costs 1 tick
      (pshufb takes 3-5 ticks on some x86 archs).
    For a 2-operand permutation:
      If "palignr" is applicable, it reduces the instructions from
      "2 pshufb and or" to "palignr and pshufb".  Profitable for the
      same reasons as above.
  32-byte case:
    For a 1-operand permutation:
      A rotation will cost only 2 instructions, "palignr and perm",
      which is better than "2 pshufb and perm".
    For a 2-operand permutation:
      If palignr is applicable, it reduces the instructions from
      "4 pshufb, 2 perm and or" to "palignr, 2 pshufb, perm and or",
      profitable for the same reasons as above.
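The 32-byte one-operand sequence described above can be modeled in scalar code (a sketch with made-up names; it assumes vperm2i128 with immediate 33 selects src1's high lane into the low half of the result and src2's low lane into the high half, and that 256-bit vpalignr operates within each 128-bit lane independently, as the patch relies on):

```c
#include <string.h>

/* Model of VPERM2I128 with immediate 33 (0x21): result low 128-bit
   lane = SRC1's high lane, result high lane = SRC2's low lane.
   With SRC1 == SRC2 == V this swaps V's two halves.  */
static void
vperm2i128_33 (unsigned char *dst, const unsigned char *src1,
               const unsigned char *src2)
{
  memcpy (dst, src1 + 16, 16);
  memcpy (dst + 16, src2, 16);
}

/* Model of AVX2 VPALIGNR: an independent 128-bit PALIGNR per lane,
   SRC2 lane low, SRC1 lane high.  */
static void
vpalignr256 (unsigned char *dst, const unsigned char *src1,
             const unsigned char *src2, unsigned shift)
{
  for (int lane = 0; lane < 2; ++lane)
    {
      unsigned char concat[32];
      memcpy (concat, src2 + 16 * lane, 16);
      memcpy (concat + 16, src1 + 16 * lane, 16);
      memcpy (dst + 16 * lane, concat + shift, 16);
    }
}

/* The whole one-operand rotation of a 32-byte vector by SHIFT bytes
   (SHIFT < 16) is then just these two instructions.  */
static void
rotate32 (unsigned char *dst, const unsigned char *v, unsigned shift)
{
  unsigned char tmp[32];
  vperm2i128_33 (tmp, v, v);
  vpalignr256 (dst, tmp, v, shift);
}
```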

So the final reasoning is the same for the 1- and 2-operand cases.
However, I agree to extend the comments as they are not clear.
Maybe we should unite the one- and two-operand cases into 1 function?
I can submit such a patch when patch 1/2 is committed.

Thanks,
Evgeny

>
>
> r~


* Re: [PATCH 2/2, x86] Add palignr support for AVX2.
  2014-04-30 14:31       ` Evgeny Stupachenko
@ 2014-05-05 16:54         ` Evgeny Stupachenko
  2014-05-19 10:43           ` Evgeny Stupachenko
  2014-05-19 16:21           ` Richard Henderson
  0 siblings, 2 replies; 15+ messages in thread
From: Evgeny Stupachenko @ 2014-05-05 16:54 UTC (permalink / raw)
  To: Richard Henderson; +Cc: GCC Patches, Richard Biener, Uros Bizjak, H.J. Lu

Assuming the first part of the patch is committed: is the following
patch ok? It passes bootstrap and make check.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 91f6f21..475448e 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -42808,6 +42808,7 @@ expand_vec_perm_pshufb (struct expand_vec_perm_d *d)
 }

 static bool expand_vec_perm_vpshufb2_vpermq (struct expand_vec_perm_d *d);
+static bool expand_vec_perm_palignr (struct expand_vec_perm_d *d);

 /* A subroutine of ix86_expand_vec_perm_builtin_1.  Try to instantiate D
    in a single instruction.  */
@@ -42943,6 +42944,10 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d)
   if (expand_vec_perm_vpermil (d))
     return true;

+  /* Try palignr on one operand.  */
+  if (d->one_operand_p && expand_vec_perm_palignr (d))
+    return true;
+
   /* Try the SSSE3 pshufb or XOP vpperm or AVX2 vperm2i128,
      vpshufb, vpermd, vpermps or vpermq variable permutation.  */
   if (expand_vec_perm_pshufb (d))
@@ -43040,22 +43045,36 @@ expand_vec_perm_palignr (struct expand_vec_perm_d *d)
   else
     return false;

-  min = 2 * nelt, max = 0;
-  for (i = 0; i < nelt; ++i)
+  /* For a rotation permutation with one operand like: {5 6 7 0 1 2 3 4}
+     PALIGNR is better than any other permutation expansion.
+     Check for a rotation in permutation.  */
+  if (d->one_operand_p)
     {
-      unsigned e = d->perm[i];
-      if (e < min)
-       min = e;
-      if (e > max)
-       max = e;
+      min = d->perm[0];
+      for (i = 1; i < nelt; ++i)
+       if (d->perm[i] != ((min + i) & (nelt - 1)))
+         return false;
     }
-  if (min == 0 || max - min >= nelt)
-    return false;
+  /* For a 2 operand permutation we check if elements fit within one vector.  */
+  else
+    {
+      min = 2 * nelt, max = 0;
+      for (i = 0; i < nelt; ++i)
+       {
+         unsigned e = d->perm[i];
+         if (e < min)
+           min = e;
+         if (e > max)
+           max = e;
+       }
+      if (min == 0 || max - min >= nelt)
+       return false;

-  /* Given that we have SSSE3, we know we'll be able to implement the
-     single operand permutation after the palignr with pshufb.  */
-  if (d->testing_p)
-    return true;
+      /* Given that we have SSSE3, we know we'll be able to implement the
+        single operand permutation after the palignr with pshufb.  */
+      if (d->testing_p)
+       return true;
+    }

   dcopy = *d;
   shift = GEN_INT (min * GET_MODE_BITSIZE (GET_MODE_INNER (d->vmode)));
@@ -43089,6 +43108,14 @@ expand_vec_perm_palignr (struct expand_vec_perm_d *d)
     }


+  /* For a rotation permutation with one operand like: {5 6 7 0 1 2 3 4}
+     the remaining single operand permutation is just a move.  */
+  if (d->one_operand_p)
+    {
+      emit_move_insn (d->target, gen_lowpart (d->vmode, target));
+      return true;
+    }
+
   dcopy.op0 = dcopy.op1 = gen_lowpart (d->vmode, target);
   dcopy.one_operand_p = true;
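For contrast, the two-operand applicability test kept by the hunk above (the min/max window) can be sketched the same way (hypothetical helper, not GCC code): palignr applies when every selected index falls within one nelt-wide window of the concatenated operands and the window does not start at element 0.

```c
#include <stdbool.h>

/* Return true if PERM selects only indices in a half-open window
   [MIN, MIN + NELT) of the 2*NELT concatenated elements, with
   MIN > 0 (MIN == 0 would need no shift at all).  On success,
   *SHIFT_ELTS is the element shift to pass to palignr.  */
static bool
palignr_window_ok (const unsigned *perm, unsigned nelt,
                   unsigned *shift_elts)
{
  unsigned min = 2 * nelt, max = 0;
  for (unsigned i = 0; i < nelt; ++i)
    {
      if (perm[i] < min)
        min = perm[i];
      if (perm[i] > max)
        max = perm[i];
    }
  if (min == 0 || max - min >= nelt)
    return false;
  *shift_elts = min;
  return true;
}
```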


On Wed, Apr 30, 2014 at 6:25 PM, Evgeny Stupachenko <evstupac@gmail.com> wrote:
> On Tue, Apr 29, 2014 at 9:39 PM, Richard Henderson <rth@redhat.com> wrote:
>> On 04/29/2014 10:13 AM, Evgeny Stupachenko wrote:
>> +  /* For a rotation permutation with one operand like: {5 6 7 0 1 2 3 4}
>>> +     PALIGNR is better than PSHUFB.  Check for a rotation in permutation.  */
>>> +  for (i = 0; i < nelt; ++i)
>>> +    if ((((d->perm[(i + 1) & (nelt - 1)] - d->perm[i])) & (nelt - 1)) != 1)
>>> +      return false;
>>> +
>>> +  min = d->perm[0];
>>
>> Why are you running this loop NELT times instead of NELT-1 like I suggested?
>> Why is that test expression so complicated?
>>
>> Obviously d->perm[0] is the anchor and everything else can be computed relative
>> to that.  I'd expect no more than
>>
>>   min = d->perm[0];
>>   for (i = 1; i < nelt; ++i)
>>     if (d->perm[i] != ((min + i) & (nelt - 1)))
>>       return false;
>
> Agree on this. The loop is less complicated.
>
>>
>> Now that I think of it,
>>
>>> +  /* PALIGNR of 2 128-bit registers takes only 1 instruction.
>>> +     Requires SSSE3.  */
>>> +  if (GET_MODE_SIZE (d->vmode) == 16)
>>> +    {
>>> +      if (!TARGET_SSSE3)
>>> +       return false;
>>> +    }
>>> +  /* PALIGNR of 2 256-bit registers on AVX2 costs only 2 instructions:
>>> +     PERM and PALIGNR.  It is more profitable than 2 PSHUFB and PERM.  */
>>> +  else if (GET_MODE_SIZE (d->vmode) == 32)
>>> +    {
>>> +      if (!TARGET_AVX2)
>>> +       return false;
>>> +    }
>>> +  else
>>> +    return false;
>>> +
>>> +  if (!d->one_operand_p)
>>> +    return false;
>>
>> The comments are wrong.  Move the operand_p test to the top,
>> where it should be, and they're more obviously wrong:
>>
>>   "must have one operand"
>>   "palignr of two operands..."
>>
>> I think your comment
>>
>>   /* For a rotation permutation with one operand like: {5 6 7 0 1 2 3 4}
>>      we want to use PALIGNR.  */
>>
>> belongs up there instead of those two incorrect comments.
>>
>
> What I mean in the comments
>   16 bytes case:
>     For 1 operand permutation
>       Rotation will cost "palignr" which is better than "pshufb" as
> has smaller opcode (6 vs 9) and always costs 1 tick (pshufb takes 3-5
> ticks on some x86 archs).
>     For 2 operands permutation
>       If "palignr" is applicable it reduces instructions from: "2
> pshufb and or" to "palignr and pshufb". Profitable for the same
> reasons as above.
>   32 bytes case:
>     For 1 operand permutation
>       Rotation will cost only 2 instruction "palignr and perm" which
> is better than "2 pshufb and perm".
>     For 2 operands permutation
>       If palignr is applicable it reduces instructions from: "4 pshufb
> 2 perm and or" to "palignr, 2 pshufb, perm and or" and profitable for
> the same reasons as above.
>
> So the final reason is the same for 1 and 2 operands case. However I
> agree to extend the comments as they are not clear.
> Maybe we should unite one and two operand case into 1 function? I can
> submit such patch when patch 1/2 is committed.
>
> Thanks,
> Evgeny
>
>>
>>
>> r~


* Re: [PATCH 2/2, x86] Add palignr support for AVX2.
  2014-05-05 16:54         ` Evgeny Stupachenko
@ 2014-05-19 10:43           ` Evgeny Stupachenko
  2014-05-19 16:21           ` Richard Henderson
  1 sibling, 0 replies; 15+ messages in thread
From: Evgeny Stupachenko @ 2014-05-19 10:43 UTC (permalink / raw)
  To: Richard Henderson; +Cc: GCC Patches, Richard Biener, Uros Bizjak, H.J. Lu

Ping.

On Mon, May 5, 2014 at 8:54 PM, Evgeny Stupachenko <evstupac@gmail.com> wrote:
> Assuming first part of the patch is committed. Is the following patch
> ok? It passes bootstrap and make check.
>
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 91f6f21..475448e 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -42808,6 +42808,7 @@ expand_vec_perm_pshufb (struct expand_vec_perm_d *d)
>  }
>
>  static bool expand_vec_perm_vpshufb2_vpermq (struct expand_vec_perm_d *d);
> +static bool expand_vec_perm_palignr (struct expand_vec_perm_d *d);
>
>  /* A subroutine of ix86_expand_vec_perm_builtin_1.  Try to instantiate D
>     in a single instruction.  */
> @@ -42943,6 +42944,10 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d)
>    if (expand_vec_perm_vpermil (d))
>      return true;
>
> +  /* Try palignr on one operand.  */
> +  if (d->one_operand_p && expand_vec_perm_palignr (d))
> +    return true;
> +
>    /* Try the SSSE3 pshufb or XOP vpperm or AVX2 vperm2i128,
>       vpshufb, vpermd, vpermps or vpermq variable permutation.  */
>    if (expand_vec_perm_pshufb (d))
> @@ -43040,22 +43045,36 @@ expand_vec_perm_palignr (struct expand_vec_perm_d *d)
>    else
>      return false;
>
> -  min = 2 * nelt, max = 0;
> -  for (i = 0; i < nelt; ++i)
> +  /* For a rotation permutation with one operand like: {5 6 7 0 1 2 3 4}
> +     PALIGNR is better than any other permutation expansion.
> +     Check for a rotation in permutation.  */
> +  if (d->one_operand_p)
>      {
> -      unsigned e = d->perm[i];
> -      if (e < min)
> -       min = e;
> -      if (e > max)
> -       max = e;
> +      min = d->perm[0];
> +      for (i = 1; i < nelt; ++i)
> +       if (d->perm[i] != ((min + i) & (nelt - 1)))
> +         return false;
>      }
> -  if (min == 0 || max - min >= nelt)
> -    return false;
> +  /* For a 2 operand permutation we check if elements fit within one vector.  */
> +  else
> +    {
> +      min = 2 * nelt, max = 0;
> +      for (i = 0; i < nelt; ++i)
> +       {
> +         unsigned e = d->perm[i];
> +         if (e < min)
> +           min = e;
> +         if (e > max)
> +           max = e;
> +       }
> +      if (min == 0 || max - min >= nelt)
> +       return false;
>
> -  /* Given that we have SSSE3, we know we'll be able to implement the
> -     single operand permutation after the palignr with pshufb.  */
> -  if (d->testing_p)
> -    return true;
> +      /* Given that we have SSSE3, we know we'll be able to implement the
> +        single operand permutation after the palignr with pshufb.  */
> +      if (d->testing_p)
> +       return true;
> +    }
>
>    dcopy = *d;
>    shift = GEN_INT (min * GET_MODE_BITSIZE (GET_MODE_INNER (d->vmode)));
> @@ -43089,6 +43108,14 @@ expand_vec_perm_palignr (struct expand_vec_perm_d *d)
>      }
>
>
> +  /* For a rotation permutation with one operand like: {5 6 7 0 1 2 3 4}
> +     the remaining single-operand permutation is just a move.  */
> +  if (d->one_operand_p)
> +    {
> +      emit_move_insn (d->target, gen_lowpart (d->vmode, target));
> +      return true;
> +    }
> +
>    dcopy.op0 = dcopy.op1 = gen_lowpart (d->vmode, target);
>    dcopy.one_operand_p = true;
>
>
> On Wed, Apr 30, 2014 at 6:25 PM, Evgeny Stupachenko <evstupac@gmail.com> wrote:
>> On Tue, Apr 29, 2014 at 9:39 PM, Richard Henderson <rth@redhat.com> wrote:
>>> On 04/29/2014 10:13 AM, Evgeny Stupachenko wrote:
>>>> +  /* For a rotation permutation with one operand like: {5 6 7 0 1 2 3 4}
>>>> +     PALIGNR is better than PSHUFB.  Check for a rotation in permutation.  */
>>>> +  for (i = 0; i < nelt; ++i)
>>>> +    if ((((d->perm[(i + 1) & (nelt - 1)] - d->perm[i])) & (nelt - 1)) != 1)
>>>> +      return false;
>>>> +
>>>> +  min = d->perm[0];
>>>
>>> Why are you running this loop NELT times instead of NELT-1 like I suggested?
>>> Why is that test expression so complicated?
>>>
>>> Obviously d->perm[0] is the anchor and everything else can be computed relative
>>> to that.  I'd expect no more than
>>>
>>>   min = d->perm[0];
>>>   for (i = 1; i < nelt; ++i)
>>>     if (d->perm[i] != ((min + i) & (nelt - 1)))
>>>       return false;
>>
>> Agree on this. The loop is less complicated.
>>
>>>
>>> Now that I think of it,
>>>
>>>> +  /* PALIGNR of 2 128-bit registers takes only 1 instruction.
>>>> +     Requires SSSE3.  */
>>>> +  if (GET_MODE_SIZE (d->vmode) == 16)
>>>> +    {
>>>> +      if (!TARGET_SSSE3)
>>>> +       return false;
>>>> +    }
>>>> +  /* PALIGNR of 2 256-bit registers on AVX2 costs only 2 instructions:
>>>> +     PERM and PALIGNR.  It is more profitable than 2 PSHUFB and PERM.  */
>>>> +  else if (GET_MODE_SIZE (d->vmode) == 32)
>>>> +    {
>>>> +      if (!TARGET_AVX2)
>>>> +       return false;
>>>> +    }
>>>> +  else
>>>> +    return false;
>>>> +
>>>> +  if (!d->one_operand_p)
>>>> +    return false;
>>>
>>> The comments are wrong.  Move the operand_p test to the top,
>>> where it should be, and they're more obviously wrong:
>>>
>>>   "must have one operand"
>>>   "palignr of two operands..."
>>>
>>> I think your comment
>>>
>>>   /* For a rotation permutation with one operand like: {5 6 7 0 1 2 3 4}
>>>      we want to use PALIGNR.  */
>>>
>>> belongs up there instead of those two incorrect comments.
>>>
>>
>> What I meant in the comments:
>>   16-byte case:
>>     One-operand permutation:
>>       a rotation costs one "palignr", which is better than "pshufb":
>> it has a smaller opcode (6 vs. 9 bytes) and always takes 1 tick
>> ("pshufb" takes 3-5 ticks on some x86 archs).
>>     Two-operand permutation:
>>       when "palignr" is applicable, it reduces "2 pshufb and an or"
>> to "palignr and pshufb", profitable for the same reasons as above.
>>   32-byte case:
>>     One-operand permutation:
>>       a rotation costs only 2 instructions, "palignr and perm",
>> which is better than "2 pshufb and a perm".
>>     Two-operand permutation:
>>       when "palignr" is applicable, it reduces "4 pshufb, 2 perm and
>> an or" to "palignr, 2 pshufb, a perm and an or", profitable for the
>> same reasons as above.
>>
>> So the underlying reason is the same for the one- and two-operand
>> cases. However, I agree the comments should be extended, as they are
>> not clear.
>> Maybe we should unite the one- and two-operand cases into one
>> function? I can submit such a patch once patch 1/2 is committed.
>>
>> Thanks,
>> Evgeny
>>
>>>
>>>
>>> r~

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2, x86] Add palignr support for AVX2.
  2014-05-05 16:54         ` Evgeny Stupachenko
  2014-05-19 10:43           ` Evgeny Stupachenko
@ 2014-05-19 16:21           ` Richard Henderson
  2014-06-04 17:06             ` Evgeny Stupachenko
  1 sibling, 1 reply; 15+ messages in thread
From: Richard Henderson @ 2014-05-19 16:21 UTC (permalink / raw)
  To: Evgeny Stupachenko; +Cc: GCC Patches, Richard Biener, Uros Bizjak, H.J. Lu

On 05/05/2014 09:54 AM, Evgeny Stupachenko wrote:
> @@ -42943,6 +42944,10 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d)
>    if (expand_vec_perm_vpermil (d))
>      return true;
> 
> +  /* Try palignr on one operand.  */
> +  if (d->one_operand_p && expand_vec_perm_palignr (d))
> +    return true;

No, because unless in_order and SSSE3, expand_vec_perm_palignr generates at
least 2 insns, and by contract expand_vec_perm_1 must generate only one.

I think what might help you out is to have the rotate permutation matched
directly, rather than have to have it converted to a shift.

Thus I think you'd do well to start this series with a patch that adds a
pattern of the form

(define_insn "*ssse3_palignr<mode>_perm"
  [(set (match_operand:V_128 0 "register_operand" "=x,x")
        (vec_select:V_128
           (match_operand:V_128 1 "register_operand" "0,x")
           (match_operand:V_128 2 "nonimmediate_operand" "xm,xm")
           (match_parallel 3 "palign_operand"
>              [(match_operand 4 "const_int_operand" "")])))]
  "TARGET_SSSE3"
{
>   enum machine_mode imode = GET_MODE_INNER (GET_MODE (operands[0]));
  operands[3] = GEN_INT (INTVAL (operands[4]) * GET_MODE_SIZE (imode));

  switch (which_alternative)
    {
    case 0:
      return "palignr\t{%3, %2, %0|%0, %2, %3}";
    case 1:
      return "vpalignr\t{%3, %2, %1, %0|%0, %1, %2, %3}";
    default:
      gcc_unreachable ();
    }
}
  [(set_attr "isa" "noavx,avx")
   (set_attr "type" "sseishft")
   (set_attr "atom_unit" "sishuf")
   (set_attr "prefix_data16" "1,*")
   (set_attr "prefix_extra" "1")
   (set_attr "length_immediate" "1")
   (set_attr "prefix" "orig,vex")])

where the palign_operand function verifies that the constants are all in order.
This is very similar to the way we define the broadcast type patterns.

You'll need a similar pattern with a different predicate for the avx2 palignr,
since it's not a simple increment, but also verifying the cross-lane constraint.

With that as patch 1/1, I believe that will significantly tidy up what else
you're attempting to change with this series.



r~

* Re: [PATCH 2/2, x86] Add palignr support for AVX2.
  2014-05-19 16:21           ` Richard Henderson
@ 2014-06-04 17:06             ` Evgeny Stupachenko
  2014-06-04 19:04               ` Richard Henderson
  0 siblings, 1 reply; 15+ messages in thread
From: Evgeny Stupachenko @ 2014-06-04 17:06 UTC (permalink / raw)
  To: Richard Henderson; +Cc: GCC Patches, Richard Biener, Uros Bizjak, H.J. Lu

Is it ok to use the following pattern?

The patch passed bootstrap and make check, but one test failed:
gcc/testsuite/gcc.target/i386/vect-rebuild.c
It failed on /* { dg-final { scan-assembler-times "\tv?permilpd\[ \t\]" 1 } } */
because that permutation is now emitted as palignr. However, palignr
and permilpd both cost 1 tick and take 6 bytes of opcode.
I vote for modifying the test to scan for palignr:
/* { dg-final { scan-assembler-times "\tv?palignr\[ \t\]" 1 } } */

2014-06-04  Evgeny Stupachenko  <evstupac@gmail.com>

         * config/i386/sse.md (*ssse3_palignr<mode>_perm): New.
         * config/i386/predicates.md (palignr_operand): New.
         Indicates whether a permutation is suitable for the palignr
         instruction.


diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
index 2ef1384..8266f3e 100644
--- a/gcc/config/i386/predicates.md
+++ b/gcc/config/i386/predicates.md
@@ -1417,6 +1417,22 @@
   return true;
 })

+;; Return true if OP is a parallel for a palignr permute.
+(define_predicate "palignr_operand"
+  (and (match_code "parallel")
+       (match_code "const_int" "a"))
+{
+  int elt = INTVAL (XVECEXP (op, 0, 0));
+  int i, nelt = XVECLEN (op, 0);
+
+  /* Check that the element order of the permutation is suitable for palignr.
+     For example, {5 6 7 0 1 2 3 4} is "palignr 5, xmm, xmm".  */
+  for (i = 1; i < nelt; ++i)
+    if (INTVAL (XVECEXP (op, 0, i)) != ((elt + i) % nelt))
+      return false;
+  return true;
+})
+
 ;; Return true if OP is a proper third operand to vpblendw256.
 (define_predicate "avx2_pblendw_operand"
   (match_code "const_int")
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index c91626b..5e8fd65 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -11454,6 +11454,36 @@
     }
 })

+(define_insn "*ssse3_palignr<mode>_perm"
+  [(set (match_operand:V_128 0 "register_operand" "=x,x")
+      (vec_select:V_128
+       (match_operand:V_128 1 "register_operand" "0,x")
+       (match_parallel 2 "palignr_operand"
+         [(match_operand 3 "const_int_operand" "n, n")])))]
+  "TARGET_SSSE3"
+{
+  enum machine_mode imode = GET_MODE_INNER (GET_MODE (operands[0]));
+  operands[2] = GEN_INT (INTVAL (operands[3]) * GET_MODE_SIZE (imode));
+
+  switch (which_alternative)
+    {
+    case 0:
+      return "palignr\t{%2, %1, %0|%0, %1, %2}";
+    case 1:
+      return "vpalignr\t{%2, %1, %1, %0|%0, %1, %1, %2}";
+    default:
+      gcc_unreachable ();
+    }
+}
+  [(set_attr "isa" "noavx,avx")
+   (set_attr "type" "sseishft")
+   (set_attr "atom_unit" "sishuf")
+   (set_attr "prefix_data16" "1,*")
+   (set_attr "prefix_extra" "1")
+   (set_attr "length_immediate" "1")
+   (set_attr "prefix" "orig,vex")])
+
+
 (define_insn "abs<mode>2"
   [(set (match_operand:MMXMODEI 0 "register_operand" "=y")
        (abs:MMXMODEI


* Re: [PATCH 2/2, x86] Add palignr support for AVX2.
  2014-06-04 17:06             ` Evgeny Stupachenko
@ 2014-06-04 19:04               ` Richard Henderson
  2014-06-04 21:24                 ` Evgeny Stupachenko
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Henderson @ 2014-06-04 19:04 UTC (permalink / raw)
  To: Evgeny Stupachenko; +Cc: GCC Patches, Richard Biener, Uros Bizjak, H.J. Lu

On 06/04/2014 10:06 AM, Evgeny Stupachenko wrote:
> Is it ok to use the following pattern?
> 
> patch passed bootstrap and make check, but one test failed:
> gcc/testsuite/gcc.target/i386/vect-rebuild.c
> It failed on /* { dg-final { scan-assembler-times "\tv?permilpd\[ \t\]" 1 } } */
> which is now palignr. However, both palignr and permilpd cost 1 tick
> and take 6 bytes of opcode.
> I vote for modifying the test to scan for palignr:
> /* { dg-final { scan-assembler-times "\tv?palignr\[ \t\]" 1 } } */
> 
> 2014-06-04  Evgeny Stupachenko  <evstupac@gmail.com>
> 
>          * config/i386/sse.md (*ssse3_palignr<mode>_perm): New.
>          * config/i386/predicates.md (palignr_operand): New.
>          Indicates if permutation is suitable for palignr instruction.

Surely permilpd avoids some sort of reformatting penalty when actually using
doubles.

If you move this pattern down below the other vec_select patterns, we'll prefer
the others for matching masks.


r~

* Re: [PATCH 2/2, x86] Add palignr support for AVX2.
  2014-06-04 19:04               ` Richard Henderson
@ 2014-06-04 21:24                 ` Evgeny Stupachenko
  2014-06-04 21:52                   ` Richard Henderson
  0 siblings, 1 reply; 15+ messages in thread
From: Evgeny Stupachenko @ 2014-06-04 21:24 UTC (permalink / raw)
  To: Richard Henderson; +Cc: GCC Patches, Richard Biener, Uros Bizjak, H.J. Lu

Thanks. Moving the pattern down helps. make check now passes for the
following patch:

diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
index 2ef1384..8266f3e 100644
--- a/gcc/config/i386/predicates.md
+++ b/gcc/config/i386/predicates.md
@@ -1417,6 +1417,22 @@
   return true;
 })

+;; Return true if OP is a parallel for a palignr permute.
+(define_predicate "palignr_operand"
+  (and (match_code "parallel")
+       (match_code "const_int" "a"))
+{
+  int elt = INTVAL (XVECEXP (op, 0, 0));
+  int i, nelt = XVECLEN (op, 0);
+
+  /* Check that the element order of the permutation is suitable for palignr.
+     For example, {5 6 7 0 1 2 3 4} is "palignr 5, xmm, xmm".  */
+  for (i = 1; i < nelt; ++i)
+    if (INTVAL (XVECEXP (op, 0, i)) != ((elt + i) % nelt))
+      return false;
+  return true;
+})
+
 ;; Return true if OP is a proper third operand to vpblendw256.
 (define_predicate "avx2_pblendw_operand"
   (match_code "const_int")
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index c91626b..d907353 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -14551,6 +14551,35 @@
    (set_attr "prefix" "vex")
    (set_attr "mode" "<sseinsnmode>")])

+(define_insn "*ssse3_palignr<mode>_perm"
+  [(set (match_operand:V_128 0 "register_operand" "=x,x")
+      (vec_select:V_128
+       (match_operand:V_128 1 "register_operand" "0,x")
+       (match_parallel 2 "palignr_operand"
+         [(match_operand 3 "const_int_operand" "n, n")])))]
+  "TARGET_SSSE3"
+{
+  enum machine_mode imode = GET_MODE_INNER (GET_MODE (operands[0]));
+  operands[2] = GEN_INT (INTVAL (operands[3]) * GET_MODE_SIZE (imode));
+
+  switch (which_alternative)
+    {
+    case 0:
+      return "palignr\t{%2, %1, %0|%0, %1, %2}";
+    case 1:
+      return "vpalignr\t{%2, %1, %1, %0|%0, %1, %1, %2}";
+    default:
+      gcc_unreachable ();
+    }
+}
+  [(set_attr "isa" "noavx,avx")
+   (set_attr "type" "sseishft")
+   (set_attr "atom_unit" "sishuf")
+   (set_attr "prefix_data16" "1,*")
+   (set_attr "prefix_extra" "1")
+   (set_attr "length_immediate" "1")
+   (set_attr "prefix" "orig,vex")])
+
 (define_expand "avx_vinsertf128<mode>"
   [(match_operand:V_256 0 "register_operand")
    (match_operand:V_256 1 "register_operand")


* Re: [PATCH 2/2, x86] Add palignr support for AVX2.
  2014-06-04 21:24                 ` Evgeny Stupachenko
@ 2014-06-04 21:52                   ` Richard Henderson
  0 siblings, 0 replies; 15+ messages in thread
From: Richard Henderson @ 2014-06-04 21:52 UTC (permalink / raw)
  To: Evgeny Stupachenko; +Cc: GCC Patches, Richard Biener, Uros Bizjak, H.J. Lu

On 06/04/2014 02:23 PM, Evgeny Stupachenko wrote:
> Thanks. Moving the pattern down helps. make check now passes for the
> following patch:

Excellent.  This version looks good.


r~

end of thread, other threads:[~2014-06-04 21:52 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-04-29 13:51 [PATCH 2/2, x86] Add palignr support for AVX2 Evgeny Stupachenko
2014-04-29 15:03 ` H.J. Lu
2014-04-29 16:54   ` Evgeny Stupachenko
2014-04-29 17:27   ` Jakub Jelinek
2014-04-29 15:55 ` Richard Henderson
2014-04-29 17:23   ` Evgeny Stupachenko
2014-04-29 17:41     ` Richard Henderson
2014-04-30 14:31       ` Evgeny Stupachenko
2014-05-05 16:54         ` Evgeny Stupachenko
2014-05-19 10:43           ` Evgeny Stupachenko
2014-05-19 16:21           ` Richard Henderson
2014-06-04 17:06             ` Evgeny Stupachenko
2014-06-04 19:04               ` Richard Henderson
2014-06-04 21:24                 ` Evgeny Stupachenko
2014-06-04 21:52                   ` Richard Henderson
