public inbox for gcc-patches@gcc.gnu.org
* [PATCH] Fix MMX/SSE/AVX* shifts by non-immediate scalar (PR target/80286)
@ 2017-04-03 20:34 Jakub Jelinek
  2017-04-04  6:40 ` Uros Bizjak
  0 siblings, 1 reply; 11+ messages in thread
From: Jakub Jelinek @ 2017-04-03 20:34 UTC (permalink / raw)
  To: Uros Bizjak, Kirill Yukhin; +Cc: gcc-patches

Hi!

This patch deals just with the correctness of vector shifts by a scalar
non-immediate count.  The manuals say the shift count is bits [0:63] of
the corresponding source operand (XMM reg or memory in some cases),
and if the count is bigger than the number of bits in the vector element
minus one, it is treated as a shift by the full element width.
We have been modelling it as an SImode shift count though, so the upper
32 bits may be random in some cases, which causes wrong code.
Fixed by using DImode, which matches what the insns do.
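
To illustrate those semantics with a small SSE2 intrinsics sketch (the
constants here are just made up for the example):

#include <emmintrin.h>
#include <stdio.h>

int
main (void)
{
  __m128i x = _mm_set1_epi16 (0x80);
  /* Clean 64-bit count of 3: each 0x80 element becomes 0x10.  */
  __m128i ok = _mm_srl_epi16 (x, _mm_set_epi64x (0, 3));
  /* Same low 32 bits, but garbage above bit 31: the insn reads the
     whole 64-bit count, sees a value > 15 and shifts everything out.  */
  __m128i bad = _mm_srl_epi16 (x, _mm_set_epi64x (0, 0x100000003ULL));
  printf ("%04x %04x\n", _mm_extract_epi16 (ok, 0),
          _mm_extract_epi16 (bad, 0));
  return 0;
}

This prints 0010 0000, which is exactly the kind of difference random
upper bits in the count register could cause before this patch.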

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

Any thoughts on what to do to generate reasonable code when the shift count
comes from memory (e.g. as an int variable) or is in the low bits of some XMM
register?
First of all, perhaps we could have some combiner (or peephole) pattern that
would transform a sign-extend from e.g. SI to DI on the shift count into a
zero-extend if there are no other uses of the extension result - if the shift
count is negative in SImode (or even QImode), then it is already a large
number and the upper 32 bits or more don't really change anything about that
(a tiny model of this is sketched below).
Then perhaps we could emit pmovzxdq for SSE4.1+ instead of going through
GPRs and back, or for SSE2 pxor on a scratch reg and punpck* to get it zero
extended.  I'm not sure whether we want to add a =v / vm alternative to
zero_extendsidi2*; it already has some x alternatives, but with ?s that
prevent the RA from using them.  So, thoughts on that?
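
To make the sign- vs. zero-extend point concrete, here is the tiny
scalar model mentioned above (a hypothetical helper, nothing from the
tree; it models one 16-bit element of a logical right shift):

/* What the hw does with the 64-bit count for 16-bit elements.  */
static unsigned short
psrlw_element (unsigned short x, unsigned long long count)
{
  return count > 15 ? 0 : x >> count;
}

For int s = -1, sign-extension gives a count of 0xffffffffffffffff and
zero-extension gives 0xffffffff; both are above 15, so the result is 0
either way.  Any 32-bit count on which the two extensions disagree has
bit 31 set and is therefore at least 0x80000000, which saturates in both
forms - that's why the sign-extend can be turned into a zero-extend.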

2017-04-03  Jakub Jelinek  <jakub@redhat.com>

	PR target/80286
	* config/i386/i386.c (ix86_expand_args_builtin): If op has scalar
	int mode, convert_modes it to mode as unsigned, otherwise use
	lowpart_subreg to mode rather than SImode.
	* config/i386/sse.md (<mask_codefor>ashr<mode>3<mask_name>,
	ashr<mode>3, ashr<mode>3<mask_name>, <shift_insn><mode>3<mask_name>):
	Use DImode instead of SImode for the shift count operand.
	* config/i386/mmx.md (mmx_ashr<mode>3, mmx_<shift_insn><mode>3):
	Likewise.
testsuite/
	* gcc.target/i386/avx-pr80286.c: New test.
	* gcc.dg/pr80286.c: New test.

--- gcc/config/i386/i386.c.jj	2017-04-03 10:40:22.000000000 +0200
+++ gcc/config/i386/i386.c	2017-04-03 18:31:39.482367634 +0200
@@ -35582,10 +35582,17 @@ ix86_expand_args_builtin (const struct b
 	{
 	  /* SIMD shift insns take either an 8-bit immediate or
 	     register as count.  But builtin functions take int as
-	     count.  If count doesn't match, we put it in register.  */
+	     count.  If count doesn't match, we put it in register.
+	     The instructions are using 64-bit count, if op is just
+	     32-bit, zero-extend it, as negative shift counts
+	     are undefined behavior and zero-extension is more
+	     efficient.  */
 	  if (!match)
 	    {
-	      op = lowpart_subreg (SImode, op, GET_MODE (op));
+	      if (SCALAR_INT_MODE_P (GET_MODE (op)))
+		op = convert_modes (mode, GET_MODE (op), op, 1);
+	      else
+		op = lowpart_subreg (mode, op, GET_MODE (op));
 	      if (!insn_p->operand[i + 1].predicate (op, mode))
 		op = copy_to_reg (op);
 	    }
--- gcc/config/i386/sse.md.jj	2017-04-03 13:43:50.179572564 +0200
+++ gcc/config/i386/sse.md	2017-04-03 18:01:19.713852914 +0200
@@ -10620,7 +10620,7 @@ (define_insn "<mask_codefor>ashr<mode>3<
   [(set (match_operand:VI24_AVX512BW_1 0 "register_operand" "=v,v")
 	(ashiftrt:VI24_AVX512BW_1
 	  (match_operand:VI24_AVX512BW_1 1 "nonimmediate_operand" "v,vm")
-	  (match_operand:SI 2 "nonmemory_operand" "v,N")))]
+	  (match_operand:DI 2 "nonmemory_operand" "v,N")))]
   "TARGET_AVX512VL"
   "vpsra<ssemodesuffix>\t{%2, %1, %0<mask_operand3>|%0<mask_operand3>, %1, %2}"
   [(set_attr "type" "sseishft")
@@ -10634,7 +10634,7 @@ (define_insn "ashr<mode>3"
   [(set (match_operand:VI24_AVX2 0 "register_operand" "=x,x")
 	(ashiftrt:VI24_AVX2
 	  (match_operand:VI24_AVX2 1 "register_operand" "0,x")
-	  (match_operand:SI 2 "nonmemory_operand" "xN,xN")))]
+	  (match_operand:DI 2 "nonmemory_operand" "xN,xN")))]
   "TARGET_SSE2"
   "@
    psra<ssemodesuffix>\t{%2, %0|%0, %2}
@@ -10667,7 +10667,7 @@ (define_insn "ashr<mode>3<mask_name>"
   [(set (match_operand:VI248_AVX512BW_AVX512VL 0 "register_operand" "=v,v")
 	(ashiftrt:VI248_AVX512BW_AVX512VL
 	  (match_operand:VI248_AVX512BW_AVX512VL 1 "nonimmediate_operand" "v,vm")
-	  (match_operand:SI 2 "nonmemory_operand" "v,N")))]
+	  (match_operand:DI 2 "nonmemory_operand" "v,N")))]
   "TARGET_AVX512F"
   "vpsra<ssemodesuffix>\t{%2, %1, %0<mask_operand3>|%0<mask_operand3>, %1, %2}"
   [(set_attr "type" "sseishft")
@@ -10681,7 +10681,7 @@ (define_insn "<shift_insn><mode>3<mask_n
   [(set (match_operand:VI2_AVX2_AVX512BW 0 "register_operand" "=x,v")
 	(any_lshift:VI2_AVX2_AVX512BW
 	  (match_operand:VI2_AVX2_AVX512BW 1 "register_operand" "0,v")
-	  (match_operand:SI 2 "nonmemory_operand" "xN,vN")))]
+	  (match_operand:DI 2 "nonmemory_operand" "xN,vN")))]
   "TARGET_SSE2 && <mask_mode512bit_condition> && <mask_avx512bw_condition>"
   "@
    p<vshift><ssemodesuffix>\t{%2, %0|%0, %2}
@@ -10700,7 +10700,7 @@ (define_insn "<shift_insn><mode>3<mask_n
   [(set (match_operand:VI48_AVX2 0 "register_operand" "=x,x,v")
 	(any_lshift:VI48_AVX2
 	  (match_operand:VI48_AVX2 1 "register_operand" "0,x,v")
-	  (match_operand:SI 2 "nonmemory_operand" "xN,xN,vN")))]
+	  (match_operand:DI 2 "nonmemory_operand" "xN,xN,vN")))]
   "TARGET_SSE2 && <mask_mode512bit_condition>"
   "@
    p<vshift><ssemodesuffix>\t{%2, %0|%0, %2}
@@ -10720,7 +10720,7 @@ (define_insn "<shift_insn><mode>3<mask_n
   [(set (match_operand:VI48_512 0 "register_operand" "=v,v")
 	(any_lshift:VI48_512
 	  (match_operand:VI48_512 1 "nonimmediate_operand" "v,m")
-	  (match_operand:SI 2 "nonmemory_operand" "vN,N")))]
+	  (match_operand:DI 2 "nonmemory_operand" "vN,N")))]
   "TARGET_AVX512F && <mask_mode512bit_condition>"
   "vp<vshift><ssemodesuffix>\t{%2, %1, %0<mask_operand3>|%0<mask_operand3>, %1, %2}"
   [(set_attr "isa" "avx512f")
--- gcc/config/i386/mmx.md.jj	2017-04-03 13:43:50.119573339 +0200
+++ gcc/config/i386/mmx.md	2017-04-03 18:01:19.708852979 +0200
@@ -930,7 +930,7 @@ (define_insn "mmx_ashr<mode>3"
   [(set (match_operand:MMXMODE24 0 "register_operand" "=y")
         (ashiftrt:MMXMODE24
 	  (match_operand:MMXMODE24 1 "register_operand" "0")
-	  (match_operand:SI 2 "nonmemory_operand" "yN")))]
+	  (match_operand:DI 2 "nonmemory_operand" "yN")))]
   "TARGET_MMX"
   "psra<mmxvecsize>\t{%2, %0|%0, %2}"
   [(set_attr "type" "mmxshft")
@@ -944,7 +944,7 @@ (define_insn "mmx_<shift_insn><mode>3"
   [(set (match_operand:MMXMODE248 0 "register_operand" "=y")
         (any_lshift:MMXMODE248
 	  (match_operand:MMXMODE248 1 "register_operand" "0")
-	  (match_operand:SI 2 "nonmemory_operand" "yN")))]
+	  (match_operand:DI 2 "nonmemory_operand" "yN")))]
   "TARGET_MMX"
   "p<vshift><mmxvecsize>\t{%2, %0|%0, %2}"
   [(set_attr "type" "mmxshft")
--- gcc/testsuite/gcc.target/i386/avx-pr80286.c.jj	2017-04-03 18:44:07.552698281 +0200
+++ gcc/testsuite/gcc.target/i386/avx-pr80286.c	2017-04-03 18:43:51.000000000 +0200
@@ -0,0 +1,26 @@
+/* PR target/80286 */
+/* { dg-do run { target avx } } */
+/* { dg-options "-O2 -mavx" } */
+
+#include "avx-check.h"
+#include <immintrin.h>
+
+__m256i m;
+
+__attribute__((noinline, noclone)) __m128i
+foo (__m128i x)
+{
+  int s = _mm_cvtsi128_si32 (_mm256_castsi256_si128 (m));
+  return _mm_srli_epi16 (x, s);
+}
+
+static void
+avx_test (void)
+{
+  __m128i a = (__m128i) (__v8hi) { 1 << 7, 2 << 8, 3 << 9, 4 << 10, 5 << 11, 6 << 12, 7 << 13, 8 << 12 };
+  m = (__m256i) (__v8si) { 7, 8, 9, 10, 11, 12, 13, 14 };
+  __m128i c = foo (a);
+  __m128i b = (__m128i) (__v8hi) { 1, 2 << 1, 3 << 2, 4 << 3, 5 << 4, 6 << 5, 7 << 6, 8 << 5 };
+  if (__builtin_memcmp (&c, &b, sizeof (__m128i)))
+    __builtin_abort ();
+}
--- gcc/testsuite/gcc.dg/pr80286.c.jj	2017-04-03 18:45:27.574663948 +0200
+++ gcc/testsuite/gcc.dg/pr80286.c	2017-04-03 18:45:18.386782707 +0200
@@ -0,0 +1,23 @@
+/* PR target/80286 */
+/* { dg-do run } */
+/* { dg-options "-O2 -Wno-psabi" } */
+
+typedef int V __attribute__((vector_size (4 * sizeof (int))));
+
+__attribute__((noinline, noclone)) V
+foo (V x, V y)
+{
+  return x << y[0];
+}
+
+int
+main ()
+{
+  V x = { 1, 2, 3, 4 };
+  V y = { 5, 6, 7, 8 };
+  V z = foo (x, y);
+  V e = { 1 << 5, 2 << 5, 3 << 5, 4 << 5 };
+  if (__builtin_memcmp (&z, &e, sizeof (V)))
+    __builtin_abort ();
+  return 0;
+}

	Jakub


* Re: [PATCH] Fix MMX/SSE/AVX* shifts by non-immediate scalar (PR target/80286)
  2017-04-03 20:34 [PATCH] Fix MMX/SSE/AVX* shifts by non-immediate scalar (PR target/80286) Jakub Jelinek
@ 2017-04-04  6:40 ` Uros Bizjak
  2017-04-04 12:01   ` Jakub Jelinek
  0 siblings, 1 reply; 11+ messages in thread
From: Uros Bizjak @ 2017-04-04  6:40 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Kirill Yukhin, gcc-patches

On Mon, Apr 3, 2017 at 10:34 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> Hi!
>
> This patch deals just with the correctness of vector shifts by a scalar
> non-immediate count.  The manuals say the shift count is bits [0:63] of
> the corresponding source operand (XMM reg or memory in some cases),
> and if the count is bigger than the number of bits in the vector element
> minus one, it is treated as a shift by the full element width.
> We have been modelling it as an SImode shift count though, so the upper
> 32 bits may be random in some cases, which causes wrong code.
> Fixed by using DImode, which matches what the insns do.

IIRC, SImode was chosen to simplify GPR->XMM register moves on 32-bit
targets.  It does look like this was the wrong choice from a correctness
point of view.

> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
>
> Any thoughts on what to do to generate reasonable code when the shift count
> comes from memory (e.g. as an int variable) or is in the low bits of some XMM
> register?

The problem with an int variable from memory is that the shifts access the
full 128 bits of their count operand, so this is effectively a no-go.  If
there is a 128-bit count value in memory, we can maybe define a shift
pattern with:

(subreg:DI (match_operand:V2DI 2 "general_operand" "xmN,vmN"))

?

> First of all, perhaps we could have some combiner (or peephole) pattern that
> would transform a sign-extend from e.g. SI to DI on the shift count into a
> zero-extend if there are no other uses of the extension result - if the shift
> count is negative in SImode (or even QImode), then it is already a large
> number and the upper 32 bits or more don't really change anything about that.

We can introduce shift patterns with embedded extensions, and split
them into zext + shift.  These new patterns can be easily macroized with
the any_extend code iterator and the SWI124 mode iterator, so we avoid
pattern explosion.

> Then perhaps we could emit pmovzxdq for SSE4.1+ instead of going through
> GPRs and back, or for SSE2 pxor on a scratch reg and punpck* to get it zero
> extended.  I'm not sure whether we want to add a =v / vm alternative to
> zero_extendsidi2*; it already has some x alternatives, but with ?s that
> prevent the RA from using them.  So, thoughts on that?

The ? is there to discourage the RA from allocating an xmm reg (all these
alternatives have * on the xmm reg), in effect instructing the RA to prefer
GPRs.  If the value is already in an xmm reg, then I expect the ?
alternative will be used.  So, yes, a v/v alternative as you proposed would
be a good addition to the zero_extendsidi alternatives.  Please note though
that pmovzxdq operates on a vector value, so memory operands should be
avoided.
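
For illustration, this is roughly what the new alternative would do,
written with the SSE4.1 intrinsic instead of RTL (the helper name is
invented):

#include <smmintrin.h>

/* pmovzxdq zero-extends the two low 32-bit lanes to two 64-bit lanes
   entirely within the SSE registers, so the low dword of v becomes a
   valid DImode shift count without a GPR round trip.  */
static __m128i
zext_low_dword (__m128i v)
{
  return _mm_cvtepu32_epi64 (v);
}

The register form reads a vector, and the memory form reads 64 bits
rather than 32, hence the caveat about memory operands.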

>
> 2017-04-03  Jakub Jelinek  <jakub@redhat.com>
>
>         PR target/80286
>         * config/i386/i386.c (ix86_expand_args_builtin): If op has scalar
>         int mode, convert_modes it to mode as unsigned, otherwise use
>         lowpart_subreg to mode rather than SImode.
>         * config/i386/sse.md (<mask_codefor>ashr<mode>3<mask_name>,
>         ashr<mode>3, ashr<mode>3<mask_name>, <shift_insn><mode>3<mask_name>):
>         Use DImode instead of SImode for the shift count operand.
>         * config/i386/mmx.md (mmx_ashr<mode>3, mmx_<shift_insn><mode>3):
>         Likewise.
> testsuite/
>         * gcc.target/i386/avx-pr80286.c: New test.
>         * gcc.dg/pr80286.c: New test.

OK for trunk and backports.

Thanks,
Uros.

> --- gcc/config/i386/i386.c.jj   2017-04-03 10:40:22.000000000 +0200
> +++ gcc/config/i386/i386.c      2017-04-03 18:31:39.482367634 +0200
> @@ -35582,10 +35582,17 @@ ix86_expand_args_builtin (const struct b
>         {
>           /* SIMD shift insns take either an 8-bit immediate or
>              register as count.  But builtin functions take int as
> -            count.  If count doesn't match, we put it in register.  */
> +            count.  If count doesn't match, we put it in register.
> +            The instructions are using 64-bit count, if op is just
> +            32-bit, zero-extend it, as negative shift counts
> +            are undefined behavior and zero-extension is more
> +            efficient.  */
>           if (!match)
>             {
> -             op = lowpart_subreg (SImode, op, GET_MODE (op));
> +             if (SCALAR_INT_MODE_P (GET_MODE (op)))
> +               op = convert_modes (mode, GET_MODE (op), op, 1);
> +             else
> +               op = lowpart_subreg (mode, op, GET_MODE (op));
>               if (!insn_p->operand[i + 1].predicate (op, mode))
>                 op = copy_to_reg (op);
>             }
> --- gcc/config/i386/sse.md.jj   2017-04-03 13:43:50.179572564 +0200
> +++ gcc/config/i386/sse.md      2017-04-03 18:01:19.713852914 +0200
> @@ -10620,7 +10620,7 @@ (define_insn "<mask_codefor>ashr<mode>3<
>    [(set (match_operand:VI24_AVX512BW_1 0 "register_operand" "=v,v")
>         (ashiftrt:VI24_AVX512BW_1
>           (match_operand:VI24_AVX512BW_1 1 "nonimmediate_operand" "v,vm")
> -         (match_operand:SI 2 "nonmemory_operand" "v,N")))]
> +         (match_operand:DI 2 "nonmemory_operand" "v,N")))]
>    "TARGET_AVX512VL"
>    "vpsra<ssemodesuffix>\t{%2, %1, %0<mask_operand3>|%0<mask_operand3>, %1, %2}"
>    [(set_attr "type" "sseishft")
> @@ -10634,7 +10634,7 @@ (define_insn "ashr<mode>3"
>    [(set (match_operand:VI24_AVX2 0 "register_operand" "=x,x")
>         (ashiftrt:VI24_AVX2
>           (match_operand:VI24_AVX2 1 "register_operand" "0,x")
> -         (match_operand:SI 2 "nonmemory_operand" "xN,xN")))]
> +         (match_operand:DI 2 "nonmemory_operand" "xN,xN")))]
>    "TARGET_SSE2"
>    "@
>     psra<ssemodesuffix>\t{%2, %0|%0, %2}
> @@ -10667,7 +10667,7 @@ (define_insn "ashr<mode>3<mask_name>"
>    [(set (match_operand:VI248_AVX512BW_AVX512VL 0 "register_operand" "=v,v")
>         (ashiftrt:VI248_AVX512BW_AVX512VL
>           (match_operand:VI248_AVX512BW_AVX512VL 1 "nonimmediate_operand" "v,vm")
> -         (match_operand:SI 2 "nonmemory_operand" "v,N")))]
> +         (match_operand:DI 2 "nonmemory_operand" "v,N")))]
>    "TARGET_AVX512F"
>    "vpsra<ssemodesuffix>\t{%2, %1, %0<mask_operand3>|%0<mask_operand3>, %1, %2}"
>    [(set_attr "type" "sseishft")
> @@ -10681,7 +10681,7 @@ (define_insn "<shift_insn><mode>3<mask_n
>    [(set (match_operand:VI2_AVX2_AVX512BW 0 "register_operand" "=x,v")
>         (any_lshift:VI2_AVX2_AVX512BW
>           (match_operand:VI2_AVX2_AVX512BW 1 "register_operand" "0,v")
> -         (match_operand:SI 2 "nonmemory_operand" "xN,vN")))]
> +         (match_operand:DI 2 "nonmemory_operand" "xN,vN")))]
>    "TARGET_SSE2 && <mask_mode512bit_condition> && <mask_avx512bw_condition>"
>    "@
>     p<vshift><ssemodesuffix>\t{%2, %0|%0, %2}
> @@ -10700,7 +10700,7 @@ (define_insn "<shift_insn><mode>3<mask_n
>    [(set (match_operand:VI48_AVX2 0 "register_operand" "=x,x,v")
>         (any_lshift:VI48_AVX2
>           (match_operand:VI48_AVX2 1 "register_operand" "0,x,v")
> -         (match_operand:SI 2 "nonmemory_operand" "xN,xN,vN")))]
> +         (match_operand:DI 2 "nonmemory_operand" "xN,xN,vN")))]
>    "TARGET_SSE2 && <mask_mode512bit_condition>"
>    "@
>     p<vshift><ssemodesuffix>\t{%2, %0|%0, %2}
> @@ -10720,7 +10720,7 @@ (define_insn "<shift_insn><mode>3<mask_n
>    [(set (match_operand:VI48_512 0 "register_operand" "=v,v")
>         (any_lshift:VI48_512
>           (match_operand:VI48_512 1 "nonimmediate_operand" "v,m")
> -         (match_operand:SI 2 "nonmemory_operand" "vN,N")))]
> +         (match_operand:DI 2 "nonmemory_operand" "vN,N")))]
>    "TARGET_AVX512F && <mask_mode512bit_condition>"
>    "vp<vshift><ssemodesuffix>\t{%2, %1, %0<mask_operand3>|%0<mask_operand3>, %1, %2}"
>    [(set_attr "isa" "avx512f")
> --- gcc/config/i386/mmx.md.jj   2017-04-03 13:43:50.119573339 +0200
> +++ gcc/config/i386/mmx.md      2017-04-03 18:01:19.708852979 +0200
> @@ -930,7 +930,7 @@ (define_insn "mmx_ashr<mode>3"
>    [(set (match_operand:MMXMODE24 0 "register_operand" "=y")
>          (ashiftrt:MMXMODE24
>           (match_operand:MMXMODE24 1 "register_operand" "0")
> -         (match_operand:SI 2 "nonmemory_operand" "yN")))]
> +         (match_operand:DI 2 "nonmemory_operand" "yN")))]
>    "TARGET_MMX"
>    "psra<mmxvecsize>\t{%2, %0|%0, %2}"
>    [(set_attr "type" "mmxshft")
> @@ -944,7 +944,7 @@ (define_insn "mmx_<shift_insn><mode>3"
>    [(set (match_operand:MMXMODE248 0 "register_operand" "=y")
>          (any_lshift:MMXMODE248
>           (match_operand:MMXMODE248 1 "register_operand" "0")
> -         (match_operand:SI 2 "nonmemory_operand" "yN")))]
> +         (match_operand:DI 2 "nonmemory_operand" "yN")))]
>    "TARGET_MMX"
>    "p<vshift><mmxvecsize>\t{%2, %0|%0, %2}"
>    [(set_attr "type" "mmxshft")
> --- gcc/testsuite/gcc.target/i386/avx-pr80286.c.jj      2017-04-03 18:44:07.552698281 +0200
> +++ gcc/testsuite/gcc.target/i386/avx-pr80286.c 2017-04-03 18:43:51.000000000 +0200
> @@ -0,0 +1,26 @@
> +/* PR target/80286 */
> +/* { dg-do run { target avx } } */
> +/* { dg-options "-O2 -mavx" } */
> +
> +#include "avx-check.h"
> +#include <immintrin.h>
> +
> +__m256i m;
> +
> +__attribute__((noinline, noclone)) __m128i
> +foo (__m128i x)
> +{
> +  int s = _mm_cvtsi128_si32 (_mm256_castsi256_si128 (m));
> +  return _mm_srli_epi16 (x, s);
> +}
> +
> +static void
> +avx_test (void)
> +{
> +  __m128i a = (__m128i) (__v8hi) { 1 << 7, 2 << 8, 3 << 9, 4 << 10, 5 << 11, 6 << 12, 7 << 13, 8 << 12 };
> +  m = (__m256i) (__v8si) { 7, 8, 9, 10, 11, 12, 13, 14 };
> +  __m128i c = foo (a);
> +  __m128i b = (__m128i) (__v8hi) { 1, 2 << 1, 3 << 2, 4 << 3, 5 << 4, 6 << 5, 7 << 6, 8 << 5 };
> +  if (__builtin_memcmp (&c, &b, sizeof (__m128i)))
> +    __builtin_abort ();
> +}
> --- gcc/testsuite/gcc.dg/pr80286.c.jj   2017-04-03 18:45:27.574663948 +0200
> +++ gcc/testsuite/gcc.dg/pr80286.c      2017-04-03 18:45:18.386782707 +0200
> @@ -0,0 +1,23 @@
> +/* PR target/80286 */
> +/* { dg-do run } */
> +/* { dg-options "-O2 -Wno-psabi" } */
> +
> +typedef int V __attribute__((vector_size (4 * sizeof (int))));
> +
> +__attribute__((noinline, noclone)) V
> +foo (V x, V y)
> +{
> +  return x << y[0];
> +}
> +
> +int
> +main ()
> +{
> +  V x = { 1, 2, 3, 4 };
> +  V y = { 5, 6, 7, 8 };
> +  V z = foo (x, y);
> +  V e = { 1 << 5, 2 << 5, 3 << 5, 4 << 5 };
> +  if (__builtin_memcmp (&z, &e, sizeof (V)))
> +    __builtin_abort ();
> +  return 0;
> +}
>
>         Jakub


* Re: [PATCH] Fix MMX/SSE/AVX* shifts by non-immediate scalar (PR target/80286)
  2017-04-04  6:40 ` Uros Bizjak
@ 2017-04-04 12:01   ` Jakub Jelinek
  2017-04-04 12:33     ` Uros Bizjak
  0 siblings, 1 reply; 11+ messages in thread
From: Jakub Jelinek @ 2017-04-04 12:01 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Kirill Yukhin, gcc-patches

On Tue, Apr 04, 2017 at 08:39:59AM +0200, Uros Bizjak wrote:
> > Any thoughts on what to do to generate reasonable code when the shift count
> > comes from memory (e.g. as an int variable) or is in the low bits of some XMM
> > register?
> 
> The problem with an int variable from memory is that the shifts access the
> full 128 bits of their count operand, so this is effectively a no-go.  If
> there is a 128-bit count value in memory, we can maybe define a shift
> pattern with:
> 
> (subreg:DI (match_operand:V2DI 2 "general_operand" "xmN,vmN"))
> 
> ?

Well, if the original memory is say int, then we can't just read it as V2DI
or V4SI.

> > First of all, perhaps we could have some combiner (or peephole) pattern that
> > would transform a sign-extend from e.g. SI to DI on the shift count into a
> > zero-extend if there are no other uses of the extension result - if the shift
> > count is negative in SImode (or even QImode), then it is already a large
> > number and the upper 32 bits or more don't really change anything about that.
> 
> We can introduce shift patterns with embedded extensions, and split
> them into zext + shift.  These new patterns can be easily macroized with
> the any_extend code iterator and the SWI124 mode iterator, so we avoid
> pattern explosion.

I assume we'd split those before reload, because we want to give reload a
chance to do the zero extension on GPRs if that is more beneficial, and it
might choose to store the value into memory and load into XMM from memory,
which is hard to do after reload.

> > Then perhaps we could emit pmovzxdq for SSE4.1+ instead of going through
> > GPRs and back, or for SSE2 pxor on a scratch reg and punpck* to get it zero
> > extended.  I'm not sure whether we want to add a =v / vm alternative to
> > zero_extendsidi2*; it already has some x alternatives, but with ?s that
> > prevent the RA from using them.  So, thoughts on that?
> 
> The ? is there to discourage the RA from allocating an xmm reg (all these
> alternatives have * on the xmm reg), in effect instructing the RA to prefer
> GPRs.  If the value is already in an xmm reg, then I expect the ?
> alternative will be used.  So, yes, a v/v alternative as you proposed would
> be a good addition to the zero_extendsidi alternatives.  Please note though
> that pmovzxdq operates on a vector value, so memory operands should be
> avoided.

With ? in front of it or without?  I admit I've only tried so far:
@@ -4049,24 +4049,29 @@ (define_expand "extendsidi2"
 })

 (define_insn "*extendsidi2_rex64"
-  [(set (match_operand:DI 0 "register_operand" "=*a,r")
-       (sign_extend:DI (match_operand:SI 1 "nonimmediate_operand" "*0,rm")))]
+  [(set (match_operand:DI 0 "register_operand" "=*a,r,v")
+       (sign_extend:DI (match_operand:SI 1 "nonimmediate_operand" "*0,rm,vm")))]
   "TARGET_64BIT"
   "@
    {cltq|cdqe}
-   movs{lq|x}\t{%1, %0|%0, %1}"
-  [(set_attr "type" "imovx")
-   (set_attr "mode" "DI")
-   (set_attr "prefix_0f" "0")
-   (set_attr "modrm" "0,1")])
+   movs{lq|x}\t{%1, %0|%0, %1}
+   %vpmovsxdq\t{%1, %0|%0, %1}"
+  [(set_attr "isa" "*,*,sse4")
+   (set_attr "type" "imovx,imovx,ssemov")
+   (set_attr "mode" "DI,DI,TI")
+   (set_attr "prefix_0f" "0,0,*")
+   (set_attr "prefix_extra" "*,*,1")
+   (set_attr "prefix" "orig,orig,maybe_evex")
+   (set_attr "modrm" "0,1,*")])


and with the ? in front of v it for some reason didn't trigger.
I'll try the zero_extendsidi2 now and see how it works.

> OK for trunk and backports.

Committed to trunk so far, backports in a week or so when I backport
dozens of other patches together with it.

	Jakub


* Re: [PATCH] Fix MMX/SSE/AVX* shifts by non-immediate scalar (PR target/80286)
  2017-04-04 12:01   ` Jakub Jelinek
@ 2017-04-04 12:33     ` Uros Bizjak
  2017-04-04 15:09       ` Jakub Jelinek
  0 siblings, 1 reply; 11+ messages in thread
From: Uros Bizjak @ 2017-04-04 12:33 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Kirill Yukhin, gcc-patches

On Tue, Apr 4, 2017 at 2:00 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Tue, Apr 04, 2017 at 08:39:59AM +0200, Uros Bizjak wrote:
>> > Any thoughts on what to do to generate reasonable code when the shift count
>> > comes from memory (e.g. as an int variable) or is in the low bits of some XMM
>> > register?
>>
>> The problem with an int variable from memory is that the shifts access the
>> full 128 bits of their count operand, so this is effectively a no-go.  If
>> there is a 128-bit count value in memory, we can maybe define a shift
>> pattern with:
>>
>> (subreg:DI (match_operand:V2DI 2 "general_operand" "xmN,vmN"))
>>
>> ?
>
> Well, if the original memory is say int, then we can't just read it as V2DI
> or V4SI.

Of course.  The above was for the case when we *want* to load from
memory.  The insn loads the full 128-bit value.

>> > First of all, perhaps we could have some combiner (or peephole) pattern that
>> > would transform a sign-extend from e.g. SI to DI on the shift count into a
>> > zero-extend if there are no other uses of the extension result - if the shift
>> > count is negative in SImode (or even QImode), then it is already a large
>> > number and the upper 32 bits or more don't really change anything about that.
>>
>> We can introduce shift patterns with embedded extensions, and split
>> them into zext + shift.  These new patterns can be easily macroized with
>> the any_extend code iterator and the SWI124 mode iterator, so we avoid
>> pattern explosion.
>
> I assume we'd split those before reload, because we want to give reload a
> chance to do the zero extension on GPRs if that is more beneficial, and it
> might choose to store the value into memory and load into XMM from memory,
> which is hard to do after reload.

Yes, split before reload, and hope that the alternatives' decorations play
well with the RA.

>> > Then perhaps we could emit pmovzxdq for SSE4.1+ instead of going through
>> > GPRs and back, or for SSE2 pxor on a scratch reg and punpck* to get it zero
>> > extended.  I'm not sure whether we want to add a =v / vm alternative to
>> > zero_extendsidi2*; it already has some x alternatives, but with ?s that
>> > prevent the RA from using them.  So, thoughts on that?
>>
>> The ? is there to discourage the RA from allocating an xmm reg (all these
>> alternatives have * on the xmm reg), in effect instructing the RA to prefer
>> GPRs.  If the value is already in an xmm reg, then I expect the ?
>> alternative will be used.  So, yes, a v/v alternative as you proposed would
>> be a good addition to the zero_extendsidi alternatives.  Please note though
>> that pmovzxdq operates on a vector value, so memory operands should be
>> avoided.
>
> With ? in front of it or without?  I admit I've only tried so far:

I'd leave ?* in this case.  In my experience, the RA allocates an
alternative with ?* only when really needed.

> @@ -4049,24 +4049,29 @@ (define_expand "extendsidi2"
>  })
>
>  (define_insn "*extendsidi2_rex64"
> -  [(set (match_operand:DI 0 "register_operand" "=*a,r")
> -       (sign_extend:DI (match_operand:SI 1 "nonimmediate_operand" "*0,rm")))]
> +  [(set (match_operand:DI 0 "register_operand" "=*a,r,v")
> +       (sign_extend:DI (match_operand:SI 1 "nonimmediate_operand" "*0,rm,vm")))]
>    "TARGET_64BIT"
>    "@
>     {cltq|cdqe}
> -   movs{lq|x}\t{%1, %0|%0, %1}"
> -  [(set_attr "type" "imovx")
> -   (set_attr "mode" "DI")
> -   (set_attr "prefix_0f" "0")
> -   (set_attr "modrm" "0,1")])
> +   movs{lq|x}\t{%1, %0|%0, %1}
> +   %vpmovsxdq\t{%1, %0|%0, %1}"
> +  [(set_attr "isa" "*,*,sse4")
> +   (set_attr "type" "imovx,imovx,ssemov")
> +   (set_attr "mode" "DI,DI,TI")
> +   (set_attr "prefix_0f" "0,0,*")
> +   (set_attr "prefix_extra" "*,*,1")
> +   (set_attr "prefix" "orig,orig,maybe_evex")
> +   (set_attr "modrm" "0,1,*")])
>
>
> and with the ? in front of v it for some reason didn't trigger.
> I'll try the zero_extendsidi2 now and see how it works.
>
>> OK for trunk and backports.
>
> Committed to trunk so far, backports in a week or so when I backport
> dozens of other patches together with it.
>
>         Jakub

Uros.


* Re: [PATCH] Fix MMX/SSE/AVX* shifts by non-immediate scalar (PR target/80286)
  2017-04-04 12:33     ` Uros Bizjak
@ 2017-04-04 15:09       ` Jakub Jelinek
  2017-04-06  7:34         ` Uros Bizjak
  0 siblings, 1 reply; 11+ messages in thread
From: Jakub Jelinek @ 2017-04-04 15:09 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Kirill Yukhin, gcc-patches

On Tue, Apr 04, 2017 at 02:33:24PM +0200, Uros Bizjak wrote:
> > I assume we'd split those before reload, because we want to give reload a
> > chance to do the zero extension on GPRs if that is more beneficial, and it
> > might choose to store the value into memory and load into XMM from memory,
> > which is hard to do after reload.
> 
> Yes, split before reload, and hope that the alternatives' decorations play
> well with the RA.

Haven't done these splitters yet, just playing now with:
typedef long long __m256i __attribute__ ((__vector_size__ (32), __may_alias__));
typedef int __v4si __attribute__ ((__vector_size__ (16)));
typedef short __v8hi __attribute__ ((__vector_size__ (16)));
typedef int __v8si __attribute__ ((__vector_size__ (32)));
typedef long long __m128i __attribute__ ((__vector_size__ (16), __may_alias__));
extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm256_castsi256_si128 (__m256i __A) { return (__m128i) __builtin_ia32_si_si256 ((__v8si)__A); }
extern __inline int __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_cvtsi128_si32 (__m128i __A) { return __builtin_ia32_vec_ext_v4si ((__v4si)__A, 0); }
extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_srli_epi16 (__m128i __A, int __B) { return (__m128i)__builtin_ia32_psrlwi128 ((__v8hi)__A, __B); }
__m256i m;
__m128i foo (__m128i minmax)
{
  int shift = _mm_cvtsi128_si32 (_mm256_castsi256_si128 (m));
  return _mm_srli_epi16 (minmax, shift);
}
to see what it emits (in that case we already have zero extension rather
than sign extension).
> > With ? in front of it or without?  I admit I've only tried so far:
> 
> I'd leave ?* in this case.  In my experience, the RA allocates an
> alternative with ?* only when really needed.

So far I have the following, which seems to work fine for the above testcase
and -O2 -m64 -mavx2, but doesn't work for -O2 -m32 -mavx2.
For 64-bit, the combiner matches the *vec_extractv4si_0_zext pattern, and as
that has neither ? nor * in the constraint, it is used.
For 32-bit there is no such pattern and we end up with just the
zero_extendsidi2 pattern, and apparently either the ? or the * prevents
IRA/LRA from using it.  If I remove both ?*, I get nice code even for 32-bit.

--- gcc/config/i386/sse.md.jj	2017-04-04 12:45:08.000000000 +0200
+++ gcc/config/i386/sse.md	2017-04-04 16:54:58.667382522 +0200
@@ -13517,16 +13517,17 @@ (define_insn "*vec_extract<ssevecmodelow
   [(set_attr "isa" "*,sse4,*,*")])
 
 (define_insn_and_split "*vec_extractv4si_0_zext"
-  [(set (match_operand:DI 0 "register_operand" "=r")
+  [(set (match_operand:DI 0 "register_operand" "=r,x,v")
 	(zero_extend:DI
 	  (vec_select:SI
-	    (match_operand:V4SI 1 "register_operand" "v")
+	    (match_operand:V4SI 1 "register_operand" "v,x,v")
 	    (parallel [(const_int 0)]))))]
   "TARGET_64BIT && TARGET_SSE2 && TARGET_INTER_UNIT_MOVES_FROM_VEC"
   "#"
   "&& reload_completed"
   [(set (match_dup 0) (zero_extend:DI (match_dup 1)))]
-  "operands[1] = gen_lowpart (SImode, operands[1]);")
+  "operands[1] = gen_lowpart (SImode, operands[1]);"
+  [(set_attr "isa" "*,sse4,avx512f")])
 
 (define_insn "*vec_extractv2di_0_sse"
   [(set (match_operand:DI 0 "nonimmediate_operand"     "=v,m")
--- gcc/config/i386/i386.md.jj	2017-04-03 13:43:50.000000000 +0200
+++ gcc/config/i386/i386.md	2017-04-04 16:54:09.786014373 +0200
@@ -3767,10 +3767,10 @@ (define_expand "zero_extendsidi2"
 
 (define_insn "*zero_extendsidi2"
   [(set (match_operand:DI 0 "nonimmediate_operand"
-			"=r,?r,?o,r   ,o,?*Ym,?!*y,?r ,?r,?*Yi,?*x,*r")
+			"=r,?r,?o,r   ,o,?*Ym,?!*y,?r ,?r,?*Yi,?*x,*r,?*x,?*v")
 	(zero_extend:DI
 	 (match_operand:SI 1 "x86_64_zext_operand"
-	        	"0 ,rm,r ,rmWz,0,r   ,m   ,*Yj,*x,r   ,m  ,*k")))]
+	        	"0 ,rm,r ,rmWz,0,r   ,m   ,*Yj,*x,r   ,m  ,*k,x  ,v")))]
   ""
 {
   switch (get_attr_type (insn))
@@ -3791,6 +3791,14 @@ (define_insn "*zero_extendsidi2"
       return "%vpextrd\t{$0, %1, %k0|%k0, %1, 0}";
 
     case TYPE_SSEMOV:
+      if (SSE_REG_P (operands[0]) && SSE_REG_P (operands[1]))
+	{
+	  if (EXT_REX_SSE_REG_P (operands[0])
+	      || EXT_REX_SSE_REG_P (operands[1]))
+	    return "vpmovzxdq\t{%t1, %g0|%g0, %t1}";
+	  else
+	    return "%vpmovzxdq\t{%1, %0|%0, %1}";
+	}
       if (GENERAL_REG_P (operands[0]))
 	return "%vmovd\t{%1, %k0|%k0, %1}";
 
@@ -3814,6 +3822,10 @@ (define_insn "*zero_extendsidi2"
 	      (const_string "sse2")
 	    (eq_attr "alternative" "11")
 	      (const_string "x64_avx512bw")
+	    (eq_attr "alternative" "12")
+	      (const_string "sse4")
+	    (eq_attr "alternative" "13")
+	      (const_string "avx512f")
 	   ]
 	   (const_string "*")))
    (set (attr "type")
@@ -3821,7 +3833,7 @@ (define_insn "*zero_extendsidi2"
 	      (const_string "multi")
 	    (eq_attr "alternative" "5,6")
 	      (const_string "mmxmov")
-	    (eq_attr "alternative" "7,9,10")
+	    (eq_attr "alternative" "7,9,10,12,13")
 	      (const_string "ssemov")
 	    (eq_attr "alternative" "8")
 	      (const_string "sselog1")
@@ -3830,7 +3842,7 @@ (define_insn "*zero_extendsidi2"
 	   ]
 	   (const_string "imovx")))
    (set (attr "prefix_extra")
-     (if_then_else (eq_attr "alternative" "8")
+     (if_then_else (eq_attr "alternative" "8,12,13")
        (const_string "1")
        (const_string "*")))
    (set (attr "length_immediate")
@@ -3848,8 +3860,10 @@ (define_insn "*zero_extendsidi2"
    (set (attr "mode")
      (cond [(eq_attr "alternative" "5,6")
 	      (const_string "DI")
-	    (eq_attr "alternative" "7,8,9")
+	    (eq_attr "alternative" "7,8,9,12")
 	      (const_string "TI")
+	    (eq_attr "alternative" "13")
+	      (const_string "OI")
 	   ]
 	   (const_string "SI")))])
 

	Jakub


* Re: [PATCH] Fix MMX/SSE/AVX* shifts by non-immediate scalar (PR target/80286)
  2017-04-04 15:09       ` Jakub Jelinek
@ 2017-04-06  7:34         ` Uros Bizjak
  2017-04-06  8:40           ` Uros Bizjak
  2017-04-06  8:40           ` Jakub Jelinek
  0 siblings, 2 replies; 11+ messages in thread
From: Uros Bizjak @ 2017-04-06  7:34 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Kirill Yukhin, gcc-patches

On Tue, Apr 4, 2017 at 5:09 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Tue, Apr 04, 2017 at 02:33:24PM +0200, Uros Bizjak wrote:
>> > I assume we'd split those before reload, because we want to give reload a
>> > chance to do the zero extension on GPRs if that is more beneficial, and it
>> > might choose to store the value into memory and load into XMM from memory,
>> > which is hard to do after reload.
>>
>> Yes, split before reload, and hope that the alternatives' decorations play
>> well with the RA.
>
> Haven't done these splitters yet, just playing now with:
> typedef long long __m256i __attribute__ ((__vector_size__ (32), __may_alias__));
> typedef int __v4si __attribute__ ((__vector_size__ (16)));
> typedef short __v8hi __attribute__ ((__vector_size__ (16)));
> typedef int __v8si __attribute__ ((__vector_size__ (32)));
> typedef long long __m128i __attribute__ ((__vector_size__ (16), __may_alias__));
> extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
> _mm256_castsi256_si128 (__m256i __A) { return (__m128i) __builtin_ia32_si_si256 ((__v8si)__A); }
> extern __inline int __attribute__((__gnu_inline__, __always_inline__, __artificial__))
> _mm_cvtsi128_si32 (__m128i __A) { return __builtin_ia32_vec_ext_v4si ((__v4si)__A, 0); }
> extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
> _mm_srli_epi16 (__m128i __A, int __B) { return (__m128i)__builtin_ia32_psrlwi128 ((__v8hi)__A, __B); }
> __m256i m;
> __m128i foo (__m128i minmax)
> {
>   int shift = _mm_cvtsi128_si32 (_mm256_castsi256_si128 (m));
>   return _mm_srli_epi16 (minmax, shift);
> }
> to see what it emits (in that case we already have zero extension rather
> than sign extension).
>> > With ? in front of it or without?  I admit I've only tried so far:
>>
>> I'd leave ?* in this case.  In my experience, the RA allocates an
>> alternative with ?* only when really needed.
>
> So far I have the following, which seems to work fine for the above testcase
> and -O2 -m64 -mavx2, but doesn't work for -O2 -m32 -mavx2.
> For 64-bit, the combiner matches the *vec_extractv4si_0_zext pattern, and as
> that has neither ? nor * in the constraint, it is used.
> For 32-bit there is no such pattern and we end up with just the
> zero_extendsidi2 pattern, and apparently either the ? or the * prevents
> IRA/LRA from using it.  If I remove both ?*, I get nice code even for 32-bit.

The newly introduced alternatives (x/x) and (v/v) are valid also for
32-bit targets, so we have to adjust the insn constraint of
*vec_extractv4si_0_zext and enable the alternatives accordingly.  After
the adjustment, the pattern will be split to a zero-extend.

With -m32, I get:

(insn 10 8 13 2 (set (reg:SI 98)
        (vec_select:SI (reg:V4SI 95)
            (parallel [
                    (const_int 0 [0])
                ]))) "pr80286.c":9 3663 {*vec_extractv4si_0}
     (expr_list:REG_DEAD (reg:V4SI 95)
        (nil)))
(insn 13 10 14 2 (set (reg:DI 101 [ _7 ])
        (zero_extend:DI (reg:SI 98))) "pr80286.c":11 131 {*zero_extendsidi2}
     (expr_list:REG_DEAD (reg:SI 98)
        (nil)))

and for SSE4+, combine can merge these two patterns into
*vec_extractv4si_0_zext, in the anticipation that pmovzx will be
generated.

Uros.

> --- gcc/config/i386/sse.md.jj   2017-04-04 12:45:08.000000000 +0200
> +++ gcc/config/i386/sse.md      2017-04-04 16:54:58.667382522 +0200
> @@ -13517,16 +13517,17 @@ (define_insn "*vec_extract<ssevecmodelow
>    [(set_attr "isa" "*,sse4,*,*")])
>
>  (define_insn_and_split "*vec_extractv4si_0_zext"
> -  [(set (match_operand:DI 0 "register_operand" "=r")
> +  [(set (match_operand:DI 0 "register_operand" "=r,x,v")
>         (zero_extend:DI
>           (vec_select:SI
> -           (match_operand:V4SI 1 "register_operand" "v")
> +           (match_operand:V4SI 1 "register_operand" "v,x,v")
>             (parallel [(const_int 0)]))))]
>    "TARGET_64BIT && TARGET_SSE2 && TARGET_INTER_UNIT_MOVES_FROM_VEC"
>    "#"
>    "&& reload_completed"
>    [(set (match_dup 0) (zero_extend:DI (match_dup 1)))]
> -  "operands[1] = gen_lowpart (SImode, operands[1]);")
> +  "operands[1] = gen_lowpart (SImode, operands[1]);"
> +  [(set_attr "isa" "*,sse4,avx512f")])
>
>  (define_insn "*vec_extractv2di_0_sse"
>    [(set (match_operand:DI 0 "nonimmediate_operand"     "=v,m")
> --- gcc/config/i386/i386.md.jj  2017-04-03 13:43:50.000000000 +0200
> +++ gcc/config/i386/i386.md     2017-04-04 16:54:09.786014373 +0200
> @@ -3767,10 +3767,10 @@ (define_expand "zero_extendsidi2"
>
>  (define_insn "*zero_extendsidi2"
>    [(set (match_operand:DI 0 "nonimmediate_operand"
> -                       "=r,?r,?o,r   ,o,?*Ym,?!*y,?r ,?r,?*Yi,?*x,*r")
> +                       "=r,?r,?o,r   ,o,?*Ym,?!*y,?r ,?r,?*Yi,?*x,*r,?*x,?*v")
>         (zero_extend:DI
>          (match_operand:SI 1 "x86_64_zext_operand"
> -                       "0 ,rm,r ,rmWz,0,r   ,m   ,*Yj,*x,r   ,m  ,*k")))]
> +                       "0 ,rm,r ,rmWz,0,r   ,m   ,*Yj,*x,r   ,m  ,*k,x  ,v")))]
>    ""
>  {
>    switch (get_attr_type (insn))
> @@ -3791,6 +3791,14 @@ (define_insn "*zero_extendsidi2"
>        return "%vpextrd\t{$0, %1, %k0|%k0, %1, 0}";
>
>      case TYPE_SSEMOV:
> +      if (SSE_REG_P (operands[0]) && SSE_REG_P (operands[1]))
> +       {
> +         if (EXT_REX_SSE_REG_P (operands[0])
> +             || EXT_REX_SSE_REG_P (operands[1]))
> +           return "vpmovzxdq\t{%t1, %g0|%g0, %t1}";
> +         else
> +           return "%vpmovzxdq\t{%1, %0|%0, %1}";
> +       }
>        if (GENERAL_REG_P (operands[0]))
>         return "%vmovd\t{%1, %k0|%k0, %1}";
>
> @@ -3814,6 +3822,10 @@ (define_insn "*zero_extendsidi2"
>               (const_string "sse2")
>             (eq_attr "alternative" "11")
>               (const_string "x64_avx512bw")
> +           (eq_attr "alternative" "12")
> +             (const_string "sse4")
> +           (eq_attr "alternative" "13")
> +             (const_string "avx512f")
>            ]
>            (const_string "*")))
>     (set (attr "type")
> @@ -3821,7 +3833,7 @@ (define_insn "*zero_extendsidi2"
>               (const_string "multi")
>             (eq_attr "alternative" "5,6")
>               (const_string "mmxmov")
> -           (eq_attr "alternative" "7,9,10")
> +           (eq_attr "alternative" "7,9,10,12,13")
>               (const_string "ssemov")
>             (eq_attr "alternative" "8")
>               (const_string "sselog1")
> @@ -3830,7 +3842,7 @@ (define_insn "*zero_extendsidi2"
>            ]
>            (const_string "imovx")))
>     (set (attr "prefix_extra")
> -     (if_then_else (eq_attr "alternative" "8")
> +     (if_then_else (eq_attr "alternative" "8,12,13")
>         (const_string "1")
>         (const_string "*")))
>     (set (attr "length_immediate")
> @@ -3848,8 +3860,10 @@ (define_insn "*zero_extendsidi2"
>     (set (attr "mode")
>       (cond [(eq_attr "alternative" "5,6")
>               (const_string "DI")
> -           (eq_attr "alternative" "7,8,9")
> +           (eq_attr "alternative" "7,8,9,12")
>               (const_string "TI")
> +           (eq_attr "alternative" "13")
> +             (const_string "OI")
>            ]
>            (const_string "SI")))])
>
>
>         Jakub


* Re: [PATCH] Fix MMX/SSE/AVX* shifts by non-immediate scalar (PR target/80286)
  2017-04-06  7:34         ` Uros Bizjak
  2017-04-06  8:40           ` Uros Bizjak
@ 2017-04-06  8:40           ` Jakub Jelinek
  2017-04-06  8:47             ` Uros Bizjak
  2017-04-06  8:48             ` Jakub Jelinek
  1 sibling, 2 replies; 11+ messages in thread
From: Jakub Jelinek @ 2017-04-06  8:40 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Kirill Yukhin, gcc-patches

On Thu, Apr 06, 2017 at 09:33:58AM +0200, Uros Bizjak wrote:
> The newly introduced alternatives (x/x) and (v/v) are valid also for
> 32-bit targets, so we have to adjust the insn constraint of
> *vec_extractv4si_0_zext and enable the alternatives accordingly.  After

That is true.  But if we provide just the x/x and v/v alternatives in
*vec_extractv4si_0_zext, then it will be forced to always do the zero
extraction on the SSE registers in 32-bit mode.  Is that what we want?

As for the define_insn_and_split that would transform sign extensions
used solely by the vector shifts by a scalar shift count, did you mean
something like the following (for every shift pattern)?

--- sse.md.jj1	2017-04-04 19:51:01.000000000 +0200
+++ sse.md	2017-04-06 10:26:26.877545109 +0200
@@ -10696,6 +10696,22 @@
    (set_attr "prefix" "orig,vex")
    (set_attr "mode" "<sseinsnmode>")])
 
+(define_insn_and_split "*<shift_insn><mode>3<mask_name>_1"
+  [(set (match_operand:VI2_AVX2_AVX512BW 0 "register_operand")
+	(any_lshift:VI2_AVX2_AVX512BW
+	  (match_operand:VI2_AVX2_AVX512BW 1 "register_operand")
+	  (sign_extend:DI (match_operand:SI 2 "nonmemory_operand"))))]
+  "TARGET_SSE2 && <mask_mode512bit_condition> && <mask_avx512bw_condition>
+   && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 3) (zero_extend:DI (match_dup 2)))
+   (set (match_dup 0) (any_lshift:VI2_AVX2_AVX512BW
+			(match_dup 1) (match_dup 3)))]
+{
+  operands[3] = gen_reg_rtx (DImode);
+})
+
 (define_insn "<shift_insn><mode>3<mask_name>"
   [(set (match_operand:VI48_AVX2 0 "register_operand" "=x,x,v")
 	(any_lshift:VI48_AVX2

The problem with that is that apparently our infrastructure doesn't support
define_subst for define_insn_and_split (and define_split), so either we'd
need to have a separate define_insn_and_split for masked and for non-masked,
or we'd need to extend the define_subst infrastructure for
define_insn_and_split somehow.  Looking, say, at
(define_subst "mask"
  [(set (match_operand:SUBST_V 0)
        (match_operand:SUBST_V 1))]
  "TARGET_AVX512F"
  [(set (match_dup 0)
        (vec_merge:SUBST_V
          (match_dup 1)
          (match_operand:SUBST_V 2 "vector_move_operand" "0C")
          (match_operand:<avx512fmaskmode> 3 "register_operand" "Yk")))])
that is a transformation we want to do on the define_insn part of
define_insn_and_split, but not exactly what we want to do on the split
part of the insn - there we want literally match_dup 0, match_dup 1,
and, instead of the two other match_operands, match_dup 2 and match_dup 3.

	Jakub


* Re: [PATCH] Fix MMX/SSE/AVX* shifts by non-immediate scalar (PR target/80286)
  2017-04-06  7:34         ` Uros Bizjak
@ 2017-04-06  8:40           ` Uros Bizjak
  2017-04-06  8:40           ` Jakub Jelinek
  1 sibling, 0 replies; 11+ messages in thread
From: Uros Bizjak @ 2017-04-06  8:40 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Kirill Yukhin, gcc-patches


On Thu, Apr 6, 2017 at 9:33 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Tue, Apr 4, 2017 at 5:09 PM, Jakub Jelinek <jakub@redhat.com> wrote:
>> On Tue, Apr 04, 2017 at 02:33:24PM +0200, Uros Bizjak wrote:
>>> > I assume we'd split those before reload, because we want to give reload a
>>> > chance to do the zero extension on GPRs if that is more beneficial, and it
>>> > might choose to store the value into memory and load into XMM from memory,
>>> > which is hard to do after reload.
>>>
>>> Yes, split before reload, and hope that the alternatives' decorations play
>>> well with the RA.
>>
>> Haven't done these splitters yet, just playing now with:
>> typedef long long __m256i __attribute__ ((__vector_size__ (32), __may_alias__));
>> typedef int __v4si __attribute__ ((__vector_size__ (16)));
>> typedef short __v8hi __attribute__ ((__vector_size__ (16)));
>> typedef int __v8si __attribute__ ((__vector_size__ (32)));
>> typedef long long __m128i __attribute__ ((__vector_size__ (16), __may_alias__));
>> extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
>> _mm256_castsi256_si128 (__m256i __A) { return (__m128i) __builtin_ia32_si_si256 ((__v8si)__A); }
>> extern __inline int __attribute__((__gnu_inline__, __always_inline__, __artificial__))
>> _mm_cvtsi128_si32 (__m128i __A) { return __builtin_ia32_vec_ext_v4si ((__v4si)__A, 0); }
>> extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
>> _mm_srli_epi16 (__m128i __A, int __B) { return (__m128i)__builtin_ia32_psrlwi128 ((__v8hi)__A, __B); }
>> __m256i m;
>> __m128i foo (__m128i minmax)
>> {
>>   int shift = _mm_cvtsi128_si32 (_mm256_castsi256_si128 (m));
>>   return _mm_srli_epi16 (minmax, shift);
>> }
>> to see what it emits (in that case we already have zero extension rather
>> than sign extension).
>>> > With ? in front of it or without?  I admit I've only tried so far:
>>>
>>> I'd leave ?* in this case.  In my experience, the RA allocates an
>>> alternative with ?* only when really needed.
>>
>> So far I have the following, which seems to work fine for the above testcase
>> and -O2 -m64 -mavx2, but doesn't work for -O2 -m32 -mavx2.
>> For 64-bit, the combiner matches the *vec_extractv4si_0_zext pattern, and as
>> that has neither ? nor * in the constraint, it is used.
>> For 32-bit there is no such pattern and we end up with just the
>> zero_extendsidi2 pattern, and apparently either the ? or the * prevents
>> IRA/LRA from using it.  If I remove both ?*, I get nice code even for 32-bit.
>
> The newly introduced alternatives (x/x) and (v/v) are valid also for
> 32-bit targets, so we have to adjust the insn constraint of
> *vec_extractv4si_0_zext and enable the alternatives accordingly.  After
> the adjustment, the pattern will be split to a zero-extend.

The attached patch fixes your testcase above for both 64-bit and
32-bit targets.  What do you think?

Uros.

[-- Attachment: p.diff.txt --]

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 6ed2390..d1c3c16 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -3767,10 +3767,10 @@
 
 (define_insn "*zero_extendsidi2"
   [(set (match_operand:DI 0 "nonimmediate_operand"
-			"=r,?r,?o,r   ,o,?*Ym,?!*y,?r ,?r,?*Yi,?*x,*r")
+		"=r,?r,?o,r   ,o,?*Ym,?!*y,?r ,?r,?*Yi,?*x,?*x,?*v,*r")
 	(zero_extend:DI
 	 (match_operand:SI 1 "x86_64_zext_operand"
-	        	"0 ,rm,r ,rmWz,0,r   ,m   ,*Yj,*x,r   ,m  ,*k")))]
+	        "0 ,rm,r ,rmWz,0,r   ,m   ,*Yj,*x,r   ,m  , *x, *v,*k")))]
   ""
 {
   switch (get_attr_type (insn))
@@ -3791,6 +3791,15 @@
       return "%vpextrd\t{$0, %1, %k0|%k0, %1, 0}";
 
     case TYPE_SSEMOV:
+      if (SSE_REG_P (operands[0]) && SSE_REG_P (operands[1]))
+	{
+	  if (EXT_REX_SSE_REG_P (operands[0])
+	      || EXT_REX_SSE_REG_P (operands[1]))
+	    return "vpmovzxdq\t{%t1, %g0|%g0, %t1}";
+	  else
+	    return "%vpmovzxdq\t{%1, %0|%0, %1}";
+	}
+
       if (GENERAL_REG_P (operands[0]))
 	return "%vmovd\t{%1, %k0|%k0, %1}";
 
@@ -3813,6 +3822,10 @@
 	    (eq_attr "alternative" "10")
 	      (const_string "sse2")
 	    (eq_attr "alternative" "11")
+	      (const_string "sse4")
+	    (eq_attr "alternative" "12")
+	      (const_string "avx512f")
+	    (eq_attr "alternative" "13")
 	      (const_string "x64_avx512bw")
 	   ]
 	   (const_string "*")))
@@ -3821,16 +3834,16 @@
 	      (const_string "multi")
 	    (eq_attr "alternative" "5,6")
 	      (const_string "mmxmov")
-	    (eq_attr "alternative" "7,9,10")
+	    (eq_attr "alternative" "7,9,10,11,12")
 	      (const_string "ssemov")
 	    (eq_attr "alternative" "8")
 	      (const_string "sselog1")
-	    (eq_attr "alternative" "11")
+	    (eq_attr "alternative" "13")
 	      (const_string "mskmov")
 	   ]
 	   (const_string "imovx")))
    (set (attr "prefix_extra")
-     (if_then_else (eq_attr "alternative" "8")
+     (if_then_else (eq_attr "alternative" "8,11,12")
        (const_string "1")
        (const_string "*")))
    (set (attr "length_immediate")
@@ -3848,7 +3861,7 @@
    (set (attr "mode")
      (cond [(eq_attr "alternative" "5,6")
 	      (const_string "DI")
-	    (eq_attr "alternative" "7,8,9")
+	    (eq_attr "alternative" "7,8,9,11,12")
 	      (const_string "TI")
 	   ]
 	   (const_string "SI")))])
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 15ced88..094404b 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -13516,18 +13516,6 @@
   "#"
   [(set_attr "isa" "*,sse4,*,*")])
 
-(define_insn_and_split "*vec_extractv4si_0_zext"
-  [(set (match_operand:DI 0 "register_operand" "=r")
-	(zero_extend:DI
-	  (vec_select:SI
-	    (match_operand:V4SI 1 "register_operand" "v")
-	    (parallel [(const_int 0)]))))]
-  "TARGET_64BIT && TARGET_SSE2 && TARGET_INTER_UNIT_MOVES_FROM_VEC"
-  "#"
-  "&& reload_completed"
-  [(set (match_dup 0) (zero_extend:DI (match_dup 1)))]
-  "operands[1] = gen_lowpart (SImode, operands[1]);")
-
 (define_insn "*vec_extractv2di_0_sse"
   [(set (match_operand:DI 0 "nonimmediate_operand"     "=v,m")
 	(vec_select:DI
@@ -13546,6 +13534,35 @@
   [(set (match_dup 0) (match_dup 1))]
   "operands[1] = gen_lowpart (<MODE>mode, operands[1]);")
 
+(define_insn "*vec_extractv4si_0_zext_sse4"
+  [(set (match_operand:DI 0 "register_operand" "=r,x,v")
+	(zero_extend:DI
+	  (vec_select:SI
+	    (match_operand:V4SI 1 "register_operand" "Yj,x,v")
+	    (parallel [(const_int 0)]))))]
+  "TARGET_SSE4_1"
+  "#"
+  [(set_attr "isa" "x64,*,avx512f")])
+
+(define_insn "*vec_extractv4si_0_zext"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+	(zero_extend:DI
+	  (vec_select:SI
+	    (match_operand:V4SI 1 "register_operand" "x")
+	    (parallel [(const_int 0)]))))]
+  "TARGET_64BIT && TARGET_SSE2 && TARGET_INTER_UNIT_MOVES_FROM_VEC"
+  "#")
+
+(define_split
+  [(set (match_operand:DI 0 "register_operand")
+	(zero_extend:DI
+	  (vec_select:SI
+	    (match_operand:V4SI 1 "register_operand")
+	    (parallel [(const_int 0)]))))]
+  "TARGET_SSE2 && reload_completed"
+  [(set (match_dup 0) (zero_extend:DI (match_dup 1)))]
+  "operands[1] = gen_lowpart (SImode, operands[1]);")
+
 (define_insn "*vec_extractv4si"
   [(set (match_operand:SI 0 "nonimmediate_operand" "=rm,rm,Yr,*x,x,Yv")
 	(vec_select:SI


* Re: [PATCH] Fix MMX/SSE/AVX* shifts by non-immediate scalar (PR target/80286)
  2017-04-06  8:40           ` Jakub Jelinek
@ 2017-04-06  8:47             ` Uros Bizjak
  2017-04-06  9:56               ` Jakub Jelinek
  2017-04-06  8:48             ` Jakub Jelinek
  1 sibling, 1 reply; 11+ messages in thread
From: Uros Bizjak @ 2017-04-06  8:47 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Kirill Yukhin, gcc-patches

On Thu, Apr 6, 2017 at 10:40 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Thu, Apr 06, 2017 at 09:33:58AM +0200, Uros Bizjak wrote:
>> The newly introduced alternatives (x/x) and (v/v) are valid also for
>> 32-bit targets, so we have to adjust the insn constraint of
>> *vec_extractv4si_0_zext and enable the alternatives accordingly.  After
>
> That is true.  But if we provide just the x/x and v/v alternatives in
> *vec_extractv4si_0_zext, then it will be forced to always do the zero
> extraction on the SSE registers in 32-bit mode.  Is that what we want?

Yes, for SSE4 targets.  We are sure that we have an SSE source register
here, and there is no direct zero-extension to a general reg in the
32-bit case.

> As for the define_insn_and_split that would transform sign extensions
> used solely by the vector shifts by a scalar shift count, did you mean
> something like the following (for every shift pattern)?
>
> --- sse.md.jj1  2017-04-04 19:51:01.000000000 +0200
> +++ sse.md      2017-04-06 10:26:26.877545109 +0200
> @@ -10696,6 +10696,22 @@
>     (set_attr "prefix" "orig,vex")
>     (set_attr "mode" "<sseinsnmode>")])
>
> +(define_insn_and_split "*<shift_insn><mode>3<mask_name>_1"
> +  [(set (match_operand:VI2_AVX2_AVX512BW 0 "register_operand")
> +       (any_lshift:VI2_AVX2_AVX512BW
> +         (match_operand:VI2_AVX2_AVX512BW 1 "register_operand")
> +         (sign_extend:DI (match_operand:SI 2 "nonmemory_operand"))))]
> +  "TARGET_SSE2 && <mask_mode512bit_condition> && <mask_avx512bw_condition>
> +   && can_create_pseudo_p ()"
> +  "#"
> +  "&& 1"
> +  [(set (match_dup 3) (zero_extend:DI (match_dup 2)))
> +   (set (match_dup 0) (any_lshift:VI2_AVX2_AVX512BW
> +                       (match_dup 1) (match_dup 3)))]
> +{
> +  operands[3] = gen_reg_rtx (DImode);
> +})
>
Yes, something like this.  You can use any_extend instead of
sign_extend, so the pattern will also remove a possible zero_extend of
the count operand.

>  (define_insn "<shift_insn><mode>3<mask_name>"
>    [(set (match_operand:VI48_AVX2 0 "register_operand" "=x,x,v")
>         (any_lshift:VI48_AVX2
>
> The problem with that is that apparently our infrastructure doesn't support
> define_subst for define_insn_and_split (and define_split), so either we'd
> need to have a separate define_insn_and_split for masked and for non-masked,
> or we'd need to extend the define_subst infrastructure for
> define_insn_and_split somehow.  Looking, say, at
> (define_subst "mask"
>   [(set (match_operand:SUBST_V 0)
>         (match_operand:SUBST_V 1))]
>   "TARGET_AVX512F"
>   [(set (match_dup 0)
>         (vec_merge:SUBST_V
>           (match_dup 1)
>           (match_operand:SUBST_V 2 "vector_move_operand" "0C")
>           (match_operand:<avx512fmaskmode> 3 "register_operand" "Yk")))])
> that is a transformation we want to do on the define_insn part of
> define_insn_and_split, but not exactly what we want to do on the split
> part of the insn - there we want literally match_dup 0, match_dup 1,
> and, instead of the two other match_operands, match_dup 2 and match_dup 3.

Hm, I'm not that versed in define_subst, but that looks like quite a
drawback of define_subst to me.

Uros.


* Re: [PATCH] Fix MMX/SSE/AVX* shifts by non-immediate scalar (PR target/80286)
  2017-04-06  8:40           ` Jakub Jelinek
  2017-04-06  8:47             ` Uros Bizjak
@ 2017-04-06  8:48             ` Jakub Jelinek
  1 sibling, 0 replies; 11+ messages in thread
From: Jakub Jelinek @ 2017-04-06  8:48 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Kirill Yukhin, gcc-patches

On Thu, Apr 06, 2017 at 10:40:07AM +0200, Jakub Jelinek wrote:
> On Thu, Apr 06, 2017 at 09:33:58AM +0200, Uros Bizjak wrote:
> > The newly introduced alternatives (x/x) and (v/v) are valid also for
> > 32-bit targets, so we have to adjust the insn constraint of
> > *vec_extractv4si_0_zext and enable the alternatives accordingly.  After
> 
> That is true.  But if we provide just the x/x and v/v alternatives in
> *vec_extractv4si_0_zext, then it will be forced to always do the zero
> extraction on the SSE registers in 32-bit mode.  Is that what we want?

Also, I think we can do the zero extension even without SSE4.1:
if we have a spare SSE register (or before reload), we can pxor
that scratch reg and punpck* with it; if we don't, we can
construct a V4SI constant in memory like { -1, 0, 0, 0 } and
AND with that.
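
In intrinsics form, roughly (just a sketch of the two ideas, the helper
names are invented):

#include <emmintrin.h>

/* pxor + punpckldq: interleaving with zeros turns the low dword of v
   into a zero-extended low qword.  */
static __m128i
zext_via_punpck (__m128i v)
{
  return _mm_unpacklo_epi32 (v, _mm_setzero_si128 ());
}

/* pand with a { -1, 0, 0, 0 } constant kept in memory.  */
static __m128i
zext_via_and (__m128i v)
{
  return _mm_and_si128 (v, _mm_set_epi32 (0, 0, 0, -1));
}

Either way the count ends up zero-extended in the low qword of an XMM
register without needing SSE4.1.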

	Jakub


* Re: [PATCH] Fix MMX/SSE/AVX* shifts by non-immediate scalar (PR target/80286)
  2017-04-06  8:47             ` Uros Bizjak
@ 2017-04-06  9:56               ` Jakub Jelinek
  0 siblings, 0 replies; 11+ messages in thread
From: Jakub Jelinek @ 2017-04-06  9:56 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Kirill Yukhin, gcc-patches

On Thu, Apr 06, 2017 at 10:47:03AM +0200, Uros Bizjak wrote:
> > +(define_insn_and_split "*<shift_insn><mode>3<mask_name>_1"
> > +  [(set (match_operand:VI2_AVX2_AVX512BW 0 "register_operand")
> > +       (any_lshift:VI2_AVX2_AVX512BW
> > +         (match_operand:VI2_AVX2_AVX512BW 1 "register_operand")
> > +         (sign_extend:DI (match_operand:SI 2 "nonmemory_operand"))))]
> > +  "TARGET_SSE2 && <mask_mode512bit_condition> && <mask_avx512bw_condition>
> > +   && can_create_pseudo_p ()"
> > +  "#"
> > +  "&& 1"
> > +  [(set (match_dup 3) (zero_extend:DI (match_dup 2)))
> > +   (set (match_dup 0) (any_lshift:VI2_AVX2_AVX512BW
> > +                       (match_dup 1) (match_dup 3)))]
> > +{
> > +  operands[3] = gen_reg_rtx (DImode);
> > +})
> >
> Yes, something like this.  You can use any_extend instead of
> sign_extend, so the pattern will also remove a possible zero_extend of
> the count operand.

The pattern splits it immediately (during split1) into a zext + shift,
so unless we let the pattern survive in this form after reload (but then
we'd need constraints, and it is unclear which ones), I don't see an
advantage in matching it for zext; it would be split to exactly what
there used to be before.

> >           (match_operand:<avx512fmaskmode> 3 "register_operand" "Yk")))])
> > that is a transformation we want to do on the define_insn part of
> > define_insn_and_split, but not exactly what we want to do on the split
> > part of the insn - there we want literally match_dup 0, match_dup 1,
> > and, instead of the two other match_operands, match_dup 2 and match_dup 3.
> 
> Hm, I'm not that versed in define_subst, but that looks like quite a
> drawback of define_subst to me.

Perhaps, but we'd need to define what it means to subst a
define_insn_and_split.

	Jakub


end of thread

Thread overview: 11+ messages
2017-04-03 20:34 [PATCH] Fix MMX/SSE/AVX* shifts by non-immediate scalar (PR target/80286) Jakub Jelinek
2017-04-04  6:40 ` Uros Bizjak
2017-04-04 12:01   ` Jakub Jelinek
2017-04-04 12:33     ` Uros Bizjak
2017-04-04 15:09       ` Jakub Jelinek
2017-04-06  7:34         ` Uros Bizjak
2017-04-06  8:40           ` Uros Bizjak
2017-04-06  8:40           ` Jakub Jelinek
2017-04-06  8:47             ` Uros Bizjak
2017-04-06  9:56               ` Jakub Jelinek
2017-04-06  8:48             ` Jakub Jelinek
