public inbox for gcc-patches@gcc.gnu.org
* [PATCH] [i386] Combine avx_vec_concatv16si and avx512f_zero_extendv16hiv16si2_1 to avx512f_zero_extendv16hiv16si2_2.
@ 2021-08-11  6:43 liuhongt
  2021-08-11  7:58 ` Jakub Jelinek
  0 siblings, 1 reply; 14+ messages in thread
From: liuhongt @ 2021-08-11  6:43 UTC (permalink / raw)
  To: gcc-patches; +Cc: crazylht, ubizjak, jakub

Hi:
  Add define_insn_and_split to combine avx_vec_concatv16si/2 and
avx512f_zero_extendv16hiv16si2_1, since the latter already zero-extends
the upper bits; do the same for other patterns related to
pmovzx{bw,wd,dq}.

It enables optimizations like:

-       vmovdqa %ymm0, %ymm0    # 7     [c=4 l=6]  avx_vec_concatv16si/2
        vpmovzxwd       %ymm0, %zmm0    # 22    [c=4 l=6]  avx512f_zero_extendv16hiv16si2
        ret             # 25    [c=0 l=1]  simple_return_internal
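
For reference, the sequence above comes from a zero-extending interleave
like foo_zxwd_512 in the new testcase below; interleaving with a zero
vector is exactly a per-element zero extension:

typedef short v16hi __attribute__((vector_size (32)));
typedef short v32hi __attribute__((vector_size (64)));

v32hi
foo_zxwd_512 (v16hi x)
{
  return __builtin_shufflevector (x, (v16hi) {},
				  0, 16, 1, 17, 2, 18, 3, 19,
				  4, 20, 5, 21, 6, 22, 7, 23,
				  8, 24, 9, 25, 10, 26, 11, 27,
				  12, 28, 13, 29, 14, 30, 15, 31);
}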

  Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
  Ok for trunk?

gcc/ChangeLog:

	PR target/101846
	* config/i386/sse.md (*avx2_zero_extendv16qiv16hi2_2): New
	post_reload define_insn_and_split.
	(*avx512bw_zero_extendv32qiv32hi2_2): Ditto.
	(*sse4_1_zero_extendv8qiv8hi2_4): Ditto.
	(*avx512f_zero_extendv16hiv16si2_2): Ditto.
	(*avx2_zero_extendv8hiv8si2_2): Ditto.
	(*sse4_1_zero_extendv4hiv4si2_4): Ditto.
	(*avx512f_zero_extendv8siv8di2_2): Ditto.
	(*avx2_zero_extendv4siv4di2_2): Ditto.
	(*sse4_1_zero_extendv2siv2di2_4): Ditto.

gcc/testsuite/ChangeLog:

	PR target/101846
	* gcc.target/i386/pr101846-1.c: New test.
---
 gcc/config/i386/sse.md                     | 220 +++++++++++++++++++++
 gcc/testsuite/gcc.target/i386/pr101846-1.c |  95 +++++++++
 2 files changed, 315 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101846-1.c

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index a46a2373547..6450c058458 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -673,8 +673,14 @@ (define_mode_iterator VI12_128 [V16QI V8HI])
 (define_mode_iterator VI14_128 [V16QI V4SI])
 (define_mode_iterator VI124_128 [V16QI V8HI V4SI])
 (define_mode_iterator VI24_128 [V8HI V4SI])
+(define_mode_iterator VI128_128 [V16QI V8HI V2DI])
 (define_mode_iterator VI248_128 [V8HI V4SI V2DI])
+(define_mode_iterator VI248_256 [V16HI V8SI V4DI])
+(define_mode_iterator VI248_512 [V32HI V16SI V8DI])
 (define_mode_iterator VI48_128 [V4SI V2DI])
+(define_mode_iterator VI148_512 [V64QI V16SI V8DI])
+(define_mode_iterator VI148_256 [V32QI V8SI V4DI])
+(define_mode_iterator VI148_128 [V16QI V4SI V2DI])
 
 ;; Various 256bit and 512 vector integer mode combinations
 (define_mode_iterator VI124_256 [V32QI V16HI V8SI])
@@ -18499,6 +18505,26 @@ (define_insn_and_split "*avx2_zero_extendv16qiv16hi2_1"
   operands[1] = lowpart_subreg (V16QImode, operands[1], V32QImode);
 })
 
+(define_insn_and_split "*avx2_zero_extendv16qiv16hi2_2"
+  [(set (match_operand:V32QI 0 "register_operand" "=v")
+	(vec_select:V32QI
+	  (vec_concat:V64QI
+	    (subreg:V32QI
+	      (vec_concat:VI248_256
+		(match_operand:<ssehalfvecmode> 1 "nonimmediate_operand" "vm")
+		(match_operand:<ssehalfvecmode> 2 "const0_operand" "C")) 0)
+	    (match_operand:V32QI 3 "const0_operand" "C"))
+	  (match_parallel 4 "pmovzx_parallel"
+	    [(match_operand 5 "const_int_operand" "n")])))]
+  "TARGET_AVX2"
+  "#"
+  "&& reload_completed"
+  [(set (match_dup 0) (zero_extend:V16HI (match_dup 1)))]
+{
+  operands[0] = lowpart_subreg (V16HImode, operands[0], V32QImode);
+  operands[1] = lowpart_subreg (V16QImode, operands[1], <ssehalfvecmode>mode);
+})
+
 (define_expand "<insn>v16qiv16hi2"
   [(set (match_operand:V16HI 0 "register_operand")
 	(any_extend:V16HI
@@ -18533,6 +18559,26 @@ (define_insn_and_split "*avx512bw_zero_extendv32qiv32hi2_1"
   operands[1] = lowpart_subreg (V32QImode, operands[1], V64QImode);
 })
 
+(define_insn_and_split "*avx512bw_zero_extendv32qiv32hi2_2"
+  [(set (match_operand:V64QI 0 "register_operand" "=v")
+	(vec_select:V64QI
+	  (vec_concat:V128QI
+	    (subreg:V64QI
+	      (vec_concat:VI248_512
+		(match_operand:<ssehalfvecmode> 1 "nonimmediate_operand" "vm")
+		(match_operand:<ssehalfvecmode> 2 "const0_operand" "C")) 0)
+	    (match_operand:V64QI 3 "const0_operand" "C"))
+	  (match_parallel 4 "pmovzx_parallel"
+	    [(match_operand 5 "const_int_operand" "n")])))]
+  "TARGET_AVX512BW"
+  "#"
+  "&& reload_completed"
+  [(set (match_dup 0) (zero_extend:V32HI (match_dup 1)))]
+{
+  operands[0] = lowpart_subreg (V32HImode, operands[0], V64QImode);
+  operands[1] = lowpart_subreg (V32QImode, operands[1], <ssehalfvecmode>mode);
+})
+
 (define_expand "<insn>v32qiv32hi2"
   [(set (match_operand:V32HI 0 "register_operand")
 	(any_extend:V32HI
@@ -18619,6 +18665,41 @@ (define_insn_and_split "*sse4_1_zero_extendv8qiv8hi2_3"
 }
   [(set_attr "isa" "noavx,noavx,avx")])
 
+(define_insn_and_split "*sse4_1_zero_extendv8qiv8hi2_4"
+  [(set (match_operand:V16QI 0 "register_operand" "=Yr,*x,Yw")
+	(vec_select:V16QI
+	  (vec_concat:V32QI
+	    (subreg:V16QI
+	      (vec_concat:VI248_128
+		(match_operand:<ssehalfvecmode> 1 "vector_operand" "YrBm,*xBm,Ywm")
+		(match_operand:<ssehalfvecmode> 2 "const0_operand" "C,C,C")) 0)
+	    (match_operand:V16QI 3 "const0_operand" "C,C,C"))
+	  (match_parallel 4 "pmovzx_parallel"
+	    [(match_operand 5 "const_int_operand" "n,n,n")])))]
+  "TARGET_SSE4_1"
+  "#"
+  "&& reload_completed"
+  [(set (match_dup 0)
+	(zero_extend:V8HI
+	  (vec_select:V8QI
+	    (match_dup 1)
+	    (parallel [(const_int 0) (const_int 1)
+		       (const_int 2) (const_int 3)
+		       (const_int 4) (const_int 5)
+		       (const_int 6) (const_int 7)]))))]
+{
+  operands[0] = lowpart_subreg (V8HImode, operands[0], V16QImode);
+  if (MEM_P (operands[1]))
+    {
+      operands[1] = lowpart_subreg (V8QImode, operands[1], <ssehalfvecmode>mode);
+      operands[1] = gen_rtx_ZERO_EXTEND (V8HImode, operands[1]);
+      emit_insn (gen_rtx_SET (operands[0], operands[1]));
+      DONE;
+    }
+  operands[1] = lowpart_subreg (V16QImode, operands[1], <ssehalfvecmode>mode);
+}
+  [(set_attr "isa" "noavx,noavx,avx")])
+
 (define_expand "<insn>v8qiv8hi2"
   [(set (match_operand:V8HI 0 "register_operand")
 	(any_extend:V8HI
@@ -18809,6 +18890,26 @@ (define_insn_and_split "avx512f_zero_extendv16hiv16si2_1"
   operands[1] = lowpart_subreg (V16HImode, operands[1], V32HImode);
 })
 
+(define_insn_and_split "*avx512f_zero_extendv16hiv16si2_2"
+  [(set (match_operand:V32HI 0 "register_operand" "=v")
+	(vec_select:V32HI
+	  (vec_concat:V64HI
+	    (subreg:V32HI
+	      (vec_concat:VI148_512
+	        (match_operand:<ssehalfvecmode> 1 "nonimmediate_operand" "vm")
+		(match_operand:<ssehalfvecmode> 2 "const0_operand" "C")) 0)
+	    (match_operand:V32HI 3 "const0_operand" "C"))
+	  (match_parallel 4 "pmovzx_parallel"
+	    [(match_operand 5 "const_int_operand" "n")])))]
+  "TARGET_AVX512F"
+  "#"
+  "&& reload_completed"
+  [(set (match_dup 0) (zero_extend:V16SI (match_dup 1)))]
+{
+  operands[0] = lowpart_subreg (V16SImode, operands[0], V32HImode);
+  operands[1] = lowpart_subreg (V16HImode, operands[1], <ssehalfvecmode>mode);
+})
+
 (define_insn "avx2_<code>v8hiv8si2<mask_name>"
   [(set (match_operand:V8SI 0 "register_operand" "=v")
 	(any_extend:V8SI
@@ -18843,6 +18944,27 @@ (define_insn_and_split "avx2_zero_extendv8hiv8si2_1"
   operands[1] = lowpart_subreg (V8HImode, operands[1], V16HImode);
 })
 
+(define_insn_and_split "*avx2_zero_extendv8hiv8si2_2"
+  [(set (match_operand:V16HI 0 "register_operand" "=v")
+	(vec_select:V16HI
+	  (vec_concat:V32HI
+	    (subreg:V16HI
+	      (vec_concat:VI148_256
+		(match_operand:<ssehalfvecmode> 1 "nonimmediate_operand" "vm")
+		(match_operand:<ssehalfvecmode> 2 "const0_operand" "C")) 0)
+	    (match_operand:V16HI 3 "const0_operand" "C"))
+	  (match_parallel 4 "pmovzx_parallel"
+	    [(match_operand 5 "const_int_operand" "n")])))]
+  "TARGET_AVX2"
+  "#"
+  "&& reload_completed"
+  [(set (match_dup 0) (zero_extend:V8SI (match_dup 1)))]
+{
+  operands[0] = lowpart_subreg (V8SImode, operands[0], V16HImode);
+  operands[1] = lowpart_subreg (V8HImode, operands[1], <ssehalfvecmode>mode);
+})
+
+
 (define_insn "sse4_1_<code>v4hiv4si2<mask_name>"
   [(set (match_operand:V4SI 0 "register_operand" "=Yr,*x,v")
 	(any_extend:V4SI
@@ -18932,6 +19054,39 @@ (define_insn_and_split "*sse4_1_zero_extendv4hiv4si2_3"
 }
   [(set_attr "isa" "noavx,noavx,avx")])
 
+(define_insn_and_split "*sse4_1_zero_extendv4hiv4si2_4"
+  [(set (match_operand:V8HI 0 "register_operand" "=Yr,*x,v")
+	(vec_select:V8HI
+	  (vec_concat:V16HI
+	    (subreg:V8HI
+	      (vec_concat:VI148_128
+		(match_operand:<ssehalfvecmode> 1 "vector_operand" "YrBm,*xBm,vm")
+		(match_operand:<ssehalfvecmode> 2 "const0_operand" "C,C,C")) 0)
+	    (match_operand:V8HI 3 "const0_operand" "C,C,C"))
+	  (match_parallel 4 "pmovzx_parallel"
+	    [(match_operand 5 "const_int_operand" "n,n,n")])))]
+  "TARGET_SSE4_1"
+  "#"
+  "&& reload_completed"
+  [(set (match_dup 0)
+	(zero_extend:V4SI
+	  (vec_select:V4HI
+	    (match_dup 1)
+	    (parallel [(const_int 0) (const_int 1)
+		       (const_int 2) (const_int 3)]))))]
+{
+  operands[0] = lowpart_subreg (V4SImode, operands[0], V8HImode);
+  if (MEM_P (operands[1]))
+    {
+      operands[1] = lowpart_subreg (V4HImode, operands[1], <ssehalfvecmode>mode);
+      operands[1] = gen_rtx_ZERO_EXTEND (V4SImode, operands[1]);
+      emit_insn (gen_rtx_SET (operands[0], operands[1]));
+      DONE;
+    }
+  operands[1] = lowpart_subreg (V8HImode, operands[1], <ssehalfvecmode>mode);
+}
+  [(set_attr "isa" "noavx,noavx,avx")])
+
 (define_insn "avx512f_<code>v8qiv8di2<mask_name>"
   [(set (match_operand:V8DI 0 "register_operand" "=v")
 	(any_extend:V8DI
@@ -19242,6 +19397,24 @@ (define_insn_and_split "*avx512f_zero_extendv8siv8di2_1"
   operands[1] = lowpart_subreg (V8SImode, operands[1], V16SImode);
 })
 
+(define_insn_and_split "*avx512f_zero_extendv8siv8di2_2"
+  [(set (match_operand:V16SI 0 "register_operand" "=v")
+	(vec_select:V16SI
+	  (vec_concat:V32SI
+	    (vec_concat:V16SI
+	      (match_operand:V8SI 1 "nonimmediate_operand" "vm")
+	      (match_operand:V8SI 2 "const0_operand" "C"))
+	    (match_operand:V16SI 3 "const0_operand" "C"))
+	  (match_parallel 4 "pmovzx_parallel"
+	    [(match_operand 5 "const_int_operand" "n")])))]
+  "TARGET_AVX512F"
+  "#"
+  "&& reload_completed"
+  [(set (match_dup 0) (zero_extend:V8DI (match_dup 1)))]
+{
+  operands[0] = lowpart_subreg (V8DImode, operands[0], V16SImode);
+})
+
 (define_expand "<insn>v8siv8di2"
   [(set (match_operand:V8DI 0 "register_operand" "=v")
 	(any_extend:V8DI
@@ -19276,6 +19449,24 @@ (define_insn_and_split "*avx2_zero_extendv4siv4di2_1"
   operands[1] = lowpart_subreg (V4SImode, operands[1], V8SImode);
 })
 
+(define_insn_and_split "*avx2_zero_extendv4siv4di2_2"
+  [(set (match_operand:V8SI 0 "register_operand" "=v")
+	(vec_select:V8SI
+	  (vec_concat:V16SI
+	    (vec_concat:V8SI
+	      (match_operand:V4SI 1 "nonimmediate_operand" "vm")
+	      (match_operand:V4SI 2 "const0_operand" "C"))
+	    (match_operand:V8SI 3 "const0_operand" "C"))
+	  (match_parallel 4 "pmovzx_parallel"
+	    [(match_operand 5 "const_int_operand" "n")])))]
+  "TARGET_AVX2"
+  "#"
+  "&& reload_completed"
+  [(set (match_dup 0) (zero_extend:V4DI (match_dup 1)))]
+{
+  operands[0] = lowpart_subreg (V4DImode, operands[0], V8SImode);
+})
+
 (define_expand "<insn>v4siv4di2"
   [(set (match_operand:V4DI 0 "register_operand")
 	(any_extend:V4DI
@@ -19352,6 +19543,35 @@ (define_insn_and_split "*sse4_1_zero_extendv2siv2di2_3"
 }
   [(set_attr "isa" "noavx,noavx,avx")])
 
+(define_insn_and_split "*sse4_1_zero_extendv2siv2di2_4"
+  [(set (match_operand:V4SI 0 "register_operand" "=Yr,*x,v")
+	(vec_select:V4SI
+	  (vec_concat:V8SI
+	    (vec_concat:V4SI
+	      (match_operand:V2SI 1 "vector_operand" "YrBm, *xBm, vm")
+	      (match_operand:V2SI 2 "const0_operand" "C,C,C"))
+	    (match_operand:V4SI 3 "const0_operand" "C,C,C"))
+	  (match_parallel 4 "pmovzx_parallel"
+	    [(match_operand 5 "const_int_operand" "n,n,n")])))]
+  "TARGET_SSE4_1"
+  "#"
+  "&& reload_completed"
+  [(set (match_dup 0)
+	(zero_extend:V2DI
+	  (vec_select:V2SI (match_dup 1)
+			   (parallel [(const_int 0) (const_int 1)]))))]
+{
+  operands[0] = lowpart_subreg (V2DImode, operands[0], V4SImode);
+  if (MEM_P (operands[1]))
+    {
+      operands[1] = gen_rtx_ZERO_EXTEND (V2DImode, operands[1]);
+      emit_insn (gen_rtx_SET (operands[0], operands[1]));
+      DONE;
+    }
+  operands[1] = lowpart_subreg (V4SImode, operands[1], V2SImode);
+}
+  [(set_attr "isa" "noavx,noavx,avx")])
+
 (define_expand "<insn>v2siv2di2"
   [(set (match_operand:V2DI 0 "register_operand")
 	(any_extend:V2DI
diff --git a/gcc/testsuite/gcc.target/i386/pr101846-1.c b/gcc/testsuite/gcc.target/i386/pr101846-1.c
new file mode 100644
index 00000000000..40d95bde6fd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101846-1.c
@@ -0,0 +1,95 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx512bw -mavx512vl -mavx512dq -O2" } */
+/* { dg-final { scan-assembler-not "vmov" } } */
+/* { dg-final { scan-assembler-times "vpmovzxbw" "3" } } */
+/* { dg-final { scan-assembler-times "vpmovzxwd" "3" } } */
+/* { dg-final { scan-assembler-times "vpmovzxdq" "3" } } */
+
+typedef short v4hi __attribute__((vector_size (8)));
+typedef short v8hi __attribute__((vector_size (16)));
+typedef short v16hi __attribute__((vector_size (32)));
+typedef short v32hi __attribute__((vector_size (64)));
+typedef char v8qi __attribute__((vector_size (8)));
+typedef char v16qi __attribute__((vector_size (16)));
+typedef char v32qi __attribute__((vector_size (32)));
+typedef char v64qi __attribute__((vector_size (64)));
+typedef int v2si __attribute__((vector_size (8)));
+typedef int v4si __attribute__((vector_size (16)));
+typedef int v8si __attribute__((vector_size (32)));
+typedef int v16si __attribute__((vector_size (64)));
+
+v32hi
+foo_zxwd_512 (v16hi x)
+{
+  return __builtin_shufflevector (x, (v16hi) {},
+				  0, 16, 1, 17, 2, 18, 3, 19,
+				  4, 20, 5, 21, 6, 22, 7, 23,
+				  8, 24, 9, 25, 10, 26, 11, 27,
+				  12, 28, 13, 29, 14, 30, 15, 31);
+}
+
+v16hi
+foo_zxwd_256 (v8hi x)
+{
+  return __builtin_shufflevector (x, (v8hi) {},
+				  0, 8, 1, 9, 2, 10, 3, 11,
+				  4, 12, 5, 13, 6, 14, 7, 15);
+}
+
+v8hi
+foo_zxwd_128 (v4hi x)
+{
+  return __builtin_shufflevector (x, (v4hi) {}, 0, 4, 1, 5, 2, 6, 3, 7);
+}
+
+v16si
+foo_zxdq_512 (v8si x)
+{
+  return __builtin_shufflevector (x, (v8si) {},
+				  0, 8, 1, 9, 2, 10, 3, 11,
+				  4, 12, 5, 13, 6, 14, 7, 15);
+}
+
+v8si
+foo_zxdq_256 (v4si x)
+{
+  return __builtin_shufflevector (x, (v4si) {}, 0, 4, 1, 5, 2, 6, 3, 7);
+}
+
+v4si
+foo_zxdq_128 (v2si x)
+{
+  return __builtin_shufflevector (x, (v2si) {}, 0, 2, 1, 3);
+}
+
+v64qi
+foo_zxbw_512 (v32qi x)
+{
+  return __builtin_shufflevector (x, (v32qi) {},
+				  0, 32, 1, 33, 2, 34, 3, 35,
+				  4, 36, 5, 37, 6, 38, 7, 39,
+				  8, 40, 9, 41, 10, 42, 11, 43,
+				  12, 44, 13, 45, 14, 46, 15, 47,
+				  16, 48, 17, 49, 18, 50, 19, 51,
+				  20, 52, 21, 53, 22, 54, 23, 55,
+				  24, 56, 25, 57, 26, 58, 27, 59,
+				  28, 60, 29, 61, 30, 62, 31, 63);
+}
+
+v32qi
+foo_zxbw_256 (v16qi x)
+{
+  return __builtin_shufflevector (x, (v16qi) {},
+				  0, 16, 1, 17, 2, 18, 3, 19,
+				  4, 20, 5, 21, 6, 22, 7, 23,
+				  8, 24, 9, 25, 10, 26, 11, 27,
+				  12, 28, 13, 29, 14, 30, 15, 31);
+}
+
+v16qi
+foo_zxbw_128 (v8qi x)
+{
+  return __builtin_shufflevector (x, (v8qi) {},
+				  0, 8, 1, 9, 2, 10, 3, 11,
+				  4, 12, 5, 13, 6, 14, 7, 15);
+}
-- 
2.27.0



* Re: [PATCH] [i386] Combine avx_vec_concatv16si and avx512f_zero_extendv16hiv16si2_1 to avx512f_zero_extendv16hiv16si2_2.
  2021-08-11  6:43 [PATCH] [i386] Combine avx_vec_concatv16si and avx512f_zero_extendv16hiv16si2_1 to avx512f_zero_extendv16hiv16si2_2 liuhongt
@ 2021-08-11  7:58 ` Jakub Jelinek
  2021-08-11  9:32   ` Hongtao Liu
  0 siblings, 1 reply; 14+ messages in thread
From: Jakub Jelinek @ 2021-08-11  7:58 UTC (permalink / raw)
  To: liuhongt; +Cc: gcc-patches, crazylht, ubizjak

On Wed, Aug 11, 2021 at 02:43:06PM +0800, liuhongt wrote:
>   Add define_insn_and_split to combine avx_vec_concatv16si/2 and
> avx512f_zero_extendv16hiv16si2_1, since the latter already zero-extends
> the upper bits; do the same for other patterns related to
> pmovzx{bw,wd,dq}.
> 
> It enables optimizations like:
> 
> -       vmovdqa %ymm0, %ymm0    # 7     [c=4 l=6]  avx_vec_concatv16si/2
>         vpmovzxwd       %ymm0, %zmm0    # 22    [c=4 l=6]  avx512f_zero_extendv16hiv16si2
>         ret             # 25    [c=0 l=1]  simple_return_internal
> 
>   Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
>   Ok for trunk?
> 
> gcc/ChangeLog:
> 
> 	PR target/101846
> 	* config/i386/sse.md (*avx2_zero_extendv16qiv16hi2_2): New
> 	post_reload define_insn_and_split.

The ChangeLog doesn't mention the newly added mode iterators, perhaps it
should.

> 	(*avx512bw_zero_extendv32qiv32hi2_2): Ditto.
> 	(*sse4_1_zero_extendv8qiv8hi2_4): Ditto.
> 	(*avx512f_zero_extendv16hiv16si2_2): Ditto.
> 	(*avx2_zero_extendv8hiv8si2_2): Ditto.
> 	(*sse4_1_zero_extendv4hiv4si2_4): Ditto.
> 	(*avx512f_zero_extendv8siv8di2_2): Ditto.
> 	(*avx2_zero_extendv4siv4di2_2): Ditto.
> 	(*sse4_1_zero_extendv2siv2di2_4): Ditto.
> 
> gcc/testsuite/ChangeLog:
> 
> 	PR target/101846
> 	* gcc.target/i386/pr101846-1.c: New test.
> ---
>  gcc/config/i386/sse.md                     | 220 +++++++++++++++++++++
>  gcc/testsuite/gcc.target/i386/pr101846-1.c |  95 +++++++++
>  2 files changed, 315 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101846-1.c
> 
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index a46a2373547..6450c058458 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -673,8 +673,14 @@ (define_mode_iterator VI12_128 [V16QI V8HI])
>  (define_mode_iterator VI14_128 [V16QI V4SI])
>  (define_mode_iterator VI124_128 [V16QI V8HI V4SI])
>  (define_mode_iterator VI24_128 [V8HI V4SI])
> +(define_mode_iterator VI128_128 [V16QI V8HI V2DI])

And this mode iterator isn't used anywhere in the patch it seems.

Otherwise LGTM, although it fixes just particular, though perhaps very
important, cases; detecting generally that some operations on a vector
aren't needed, because a following permutation that uses it never reads
those elements, is something that would need to be done on gimple.
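
As a hypothetical illustration of that gimple-level opportunity (types
and names made up for the example), the multiply of the high lanes below
is dead, because the following permutation only reads lanes 0-3:

typedef int v4si __attribute__((vector_size (16)));
typedef int v8si __attribute__((vector_size (32)));

v4si
dead_high_lanes (v8si x, v8si y)
{
  v8si t = x * y;	/* lanes 4-7 of t are never read */
  return __builtin_shufflevector (t, t, 0, 1, 2, 3);
}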

Would it be possible to handle also the similar pmovzx{bd,wq,bq} cases?
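
For the bd case, such a permutation would place three zero bytes after
each source byte; a sketch (hypothetical, not part of the patch or the
PR testcases):

typedef char v16qi __attribute__((vector_size (16)));

/* pmovzxbd on the low four bytes, written as a byte shuffle; indices
   16 and up select bytes of the zero vector.  */
v16qi
foo_zxbd_128 (v16qi x)
{
  return __builtin_shufflevector (x, (v16qi) {},
				  0, 16, 17, 18,
				  1, 19, 20, 21,
				  2, 22, 23, 24,
				  3, 25, 26, 27);
}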

	Jakub



* Re: [PATCH] [i386] Combine avx_vec_concatv16si and avx512f_zero_extendv16hiv16si2_1 to avx512f_zero_extendv16hiv16si2_2.
  2021-08-11  7:58 ` Jakub Jelinek
@ 2021-08-11  9:32   ` Hongtao Liu
  2021-08-12  5:43     ` [PATCH] [i386] Optimize vec_perm_expr to match vpmov{dw,qd,wb} liuhongt
  0 siblings, 1 reply; 14+ messages in thread
From: Hongtao Liu @ 2021-08-11  9:32 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: liuhongt, GCC Patches, Uros Bizjak

On Wed, Aug 11, 2021 at 3:58 PM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Wed, Aug 11, 2021 at 02:43:06PM +0800, liuhongt wrote:
> >   Add define_insn_and_split to combine avx_vec_concatv16si/2 and
> > avx512f_zero_extendv16hiv16si2_1, since the latter already zero-extends
> > the upper bits; do the same for other patterns related to
> > pmovzx{bw,wd,dq}.
> >
> > It enables optimizations like:
> >
> > -       vmovdqa %ymm0, %ymm0    # 7     [c=4 l=6]  avx_vec_concatv16si/2
> >         vpmovzxwd       %ymm0, %zmm0    # 22    [c=4 l=6]  avx512f_zero_extendv16hiv16si2
> >         ret             # 25    [c=0 l=1]  simple_return_internal
> >
> >   Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
> >   Ok for trunk?
> >
> > gcc/ChangeLog:
> >
> >       PR target/101846
> >       * config/i386/sse.md (*avx2_zero_extendv16qiv16hi2_2): New
> >       post_reload define_insn_and_split.
>
> The ChangeLog doesn't mention the newly added mode iterators, perhaps it
> should.
>
> >       (*avx512bw_zero_extendv32qiv32hi2_2): Ditto.
> >       (*sse4_1_zero_extendv8qiv8hi2_4): Ditto.
> >       (*avx512f_zero_extendv16hiv16si2_2): Ditto.
> >       (*avx2_zero_extendv8hiv8si2_2): Ditto.
> >       (*sse4_1_zero_extendv4hiv4si2_4): Ditto.
> >       (*avx512f_zero_extendv8siv8di2_2): Ditto.
> >       (*avx2_zero_extendv4siv4di2_2): Ditto.
> >       (*sse4_1_zero_extendv2siv2di2_4): Ditto.
> >
> > gcc/testsuite/ChangeLog:
> >
> >       PR target/101846
> >       * gcc.target/i386/pr101846-1.c: New test.
> > ---
> >  gcc/config/i386/sse.md                     | 220 +++++++++++++++++++++
> >  gcc/testsuite/gcc.target/i386/pr101846-1.c |  95 +++++++++
> >  2 files changed, 315 insertions(+)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101846-1.c
> >
> > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > index a46a2373547..6450c058458 100644
> > --- a/gcc/config/i386/sse.md
> > +++ b/gcc/config/i386/sse.md
> > @@ -673,8 +673,14 @@ (define_mode_iterator VI12_128 [V16QI V8HI])
> >  (define_mode_iterator VI14_128 [V16QI V4SI])
> >  (define_mode_iterator VI124_128 [V16QI V8HI V4SI])
> >  (define_mode_iterator VI24_128 [V8HI V4SI])
> > +(define_mode_iterator VI128_128 [V16QI V8HI V2DI])
>
> And this mode iterator isn't used anywhere in the patch it seems.
>
> Otherwise LGTM, although it fixes just particular, though perhaps very
> important, cases; detecting generally that some operations on a vector
> aren't needed, because a following permutation that uses it never reads
> those elements, is something that would need to be done on gimple.
>
> Would it be possible to handle also the similar pmovzx{bd,wq,bq} cases?
Yes, regarding testcase bar, vec_perm can be implemented as vpmovdw
and vinserti64x4, and the latter instruction will be optimized off
since the upper bits are never used.
I'm working on a patch.
>
>         Jakub
>


-- 
BR,
Hongtao


* [PATCH] [i386] Optimize vec_perm_expr to match vpmov{dw,qd,wb}.
  2021-08-11  9:32   ` Hongtao Liu
@ 2021-08-12  5:43     ` liuhongt
  2021-08-12  9:22       ` Jakub Jelinek
  0 siblings, 1 reply; 14+ messages in thread
From: liuhongt @ 2021-08-12  5:43 UTC (permalink / raw)
  To: gcc-patches; +Cc: crazylht, ubizjak, jakub

Hi:
  This is another patch to optimize vec_perm_expr to match vpmov{dw,qd,wb}
under AVX512.
  For scenarios (like pr101846-2.c) where the upper half is not used, this patch
generates better code with only one vpmov{wb,dw,qd} instruction.  For
scenarios (like pr101846-3.c) where the upper half is actually used, if the
source vector length is 256/512 bits, the patch can still generate better code,
but for 128 bits, the code generation is worse.

128 bits upper half not used.

-       vpshufb .LC2(%rip), %xmm0, %xmm0
+       vpmovdw %xmm0, %xmm0

128 bits upper half used.
-       vpshufb .LC2(%rip), %xmm0, %xmm0
+       vpmovdw %xmm0, %xmm1
+       vmovq   %xmm1, %rax
+       vpinsrq $0, %rax, %xmm0, %xmm0
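
For reference, these two cases correspond to foo_dw_128 in the new tests
pr101846-2.c and pr101846-3.c respectively (the _lo/_hi suffixes here are
only to distinguish the two in one snippet):

typedef short v4hi __attribute__((vector_size (8)));
typedef short v8hi __attribute__((vector_size (16)));

/* Upper half not used: a single vpmovdw suffices.  */
v4hi
foo_dw_128_lo (v8hi x)
{
  return __builtin_shufflevector (x, x, 0, 2, 4, 6);
}

/* Upper half used: the truncated low half must be reinserted.  */
v8hi
foo_dw_128_hi (v8hi x)
{
  return __builtin_shufflevector (x, x, 0, 2, 4, 6, 4, 5, 6, 7);
}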

  Maybe expand_vec_perm_trunc_vinsert should only deal with 256/512-bit
vectors, but since the likelihood of real uses of scenarios like foo_*_128
in pr101846-3.c is relatively low, I still keep this part of the code.

  Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
  Ok for trunk?

gcc/ChangeLog:

	PR target/101846
	* config/i386/i386-expand.c (expand_vec_perm_trunc_vinsert):
	New function.
	(ix86_vectorize_vec_perm_const): Call
	expand_vec_perm_trunc_vinsert.
	* config/i386/sse.md (vec_set_lo_v32hi): New define_insn.
	(vec_set_lo_v64qi): Ditto.
	(vec_set_lo_<mode><mask_name>): Extend to no-avx512dq.

gcc/testsuite/ChangeLog:

	PR target/101846
	* gcc.target/i386/pr101846-2.c: New test.
	* gcc.target/i386/pr101846-3.c: New test.
---
 gcc/config/i386/i386-expand.c              | 125 +++++++++++++++++++++
 gcc/config/i386/sse.md                     |  60 +++++++++-
 gcc/testsuite/gcc.target/i386/pr101846-2.c |  81 +++++++++++++
 gcc/testsuite/gcc.target/i386/pr101846-3.c |  95 ++++++++++++++++
 4 files changed, 359 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101846-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101846-3.c

diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
index bd21efa9530..519caac2e15 100644
--- a/gcc/config/i386/i386-expand.c
+++ b/gcc/config/i386/i386-expand.c
@@ -18317,6 +18317,126 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d)
   return false;
 }
 
+/* A subroutine of ix86_expand_vec_perm_const_1.  Try to implement D
+   in terms of a pair of vpmovdw + vinserti128 instructions.  */
+static bool
+expand_vec_perm_trunc_vinsert (struct expand_vec_perm_d *d)
+{
+  unsigned i, nelt = d->nelt, mask = d->nelt - 1;
+  unsigned half = nelt / 2;
+  machine_mode half_mode, trunc_mode;
+
+  /* vpmov{wb,dw,qd} only available under AVX512.  */
+  if (!d->one_operand_p || !TARGET_AVX512F
+      || (!TARGET_AVX512VL  && GET_MODE_SIZE (d->vmode) < 64)
+      || GET_MODE_SIZE (GET_MODE_INNER (d->vmode)) > 4)
+    return false;
+
+  /* TARGET_AVX512BW is needed for vpmovwb.  */
+  if (GET_MODE_INNER (d->vmode) == E_QImode && !TARGET_AVX512BW)
+    return false;
+
+  for (i = 0; i < nelt; i++)
+    {
+      unsigned idx = d->perm[i] & mask;
+      if (idx != i * 2 && i < half)
+	return false;
+      if (idx != i && i >= half)
+	return false;
+    }
+
+  rtx (*gen_trunc) (rtx, rtx) = NULL;
+  rtx (*gen_vec_set_lo) (rtx, rtx, rtx) = NULL;
+  switch (d->vmode)
+    {
+    case E_V16QImode:
+      gen_trunc = gen_truncv8hiv8qi2;
+      gen_vec_set_lo = gen_vec_setv2di;
+      half_mode = V8QImode;
+      trunc_mode = V8HImode;
+      break;
+    case E_V32QImode:
+      gen_trunc = gen_truncv16hiv16qi2;
+      gen_vec_set_lo = gen_vec_set_lo_v32qi;
+      half_mode = V16QImode;
+      trunc_mode = V16HImode;
+      break;
+    case E_V64QImode:
+      gen_trunc = gen_truncv32hiv32qi2;
+      gen_vec_set_lo = gen_vec_set_lo_v64qi;
+      half_mode = V32QImode;
+      trunc_mode = V32HImode;
+      break;
+    case E_V8HImode:
+      gen_trunc = gen_truncv4siv4hi2;
+      gen_vec_set_lo = gen_vec_setv2di;
+      half_mode = V4HImode;
+      trunc_mode = V4SImode;
+      break;
+    case E_V16HImode:
+      gen_trunc = gen_truncv8siv8hi2;
+      gen_vec_set_lo = gen_vec_set_lo_v16hi;
+      half_mode = V8HImode;
+      trunc_mode = V8SImode;
+      break;
+    case E_V32HImode:
+      gen_trunc = gen_truncv16siv16hi2;
+      gen_vec_set_lo = gen_vec_set_lo_v32hi;
+      half_mode = V16HImode;
+      trunc_mode = V16SImode;
+      break;
+    case E_V4SImode:
+      gen_trunc = gen_truncv2div2si2;
+      gen_vec_set_lo = gen_vec_setv2di;
+      half_mode = V2SImode;
+      trunc_mode = V2DImode;
+      break;
+    case E_V8SImode:
+      gen_trunc = gen_truncv4div4si2;
+      gen_vec_set_lo = gen_vec_set_lo_v8si;
+      half_mode = V4SImode;
+      trunc_mode = V4DImode;
+      break;
+    case E_V16SImode:
+      gen_trunc = gen_truncv8div8si2;
+      gen_vec_set_lo = gen_vec_set_lo_v16si;
+      half_mode = V8SImode;
+      trunc_mode = V8DImode;
+      break;
+
+    default:
+      break;
+    }
+
+  if (gen_trunc == NULL)
+    return false;
+
+  rtx op_half = gen_reg_rtx (half_mode);
+  rtx op_trunc = d->op0;
+  if (d->vmode != trunc_mode)
+    op_trunc = lowpart_subreg (trunc_mode, op_trunc, d->vmode);
+  emit_insn (gen_trunc (op_half, op_trunc));
+
+  if (gen_vec_set_lo == gen_vec_setv2di)
+    {
+      op_half = lowpart_subreg (DImode, op_half, half_mode);
+      rtx op_dest = lowpart_subreg (V2DImode, d->op0, d->vmode);
+
+      /* vec_set<mode> requires register_operand.  */
+      if (MEM_P (op_dest))
+	op_dest = force_reg (V2DImode, op_dest);
+      if (MEM_P (op_half))
+	op_half = force_reg (DImode, op_half);
+
+      emit_insn (gen_vec_set_lo (op_dest, op_half, GEN_INT(0)));
+      op_dest = lowpart_subreg (d->vmode, op_dest, V2DImode);
+      emit_move_insn (d->target, op_dest);
+    }
+  else
+    emit_insn (gen_vec_set_lo (d->target, d->op0, op_half));
+  return true;
+}
+
 /* A subroutine of ix86_expand_vec_perm_const_1.  Try to implement D
    in terms of a pair of pshuflw + pshufhw instructions.  */
 
@@ -21028,6 +21148,11 @@ ix86_vectorize_vec_perm_const (machine_mode vmode, rtx target, rtx op0,
   d.op0 = nop0;
   d.op1 = force_reg (vmode, d.op1);
 
+  /* Try to match vpmov{wb,dw,qd}, although vinserti128 will be generated,
+     it's very likely to be optimized off. So let's put the function here.  */
+  if (expand_vec_perm_trunc_vinsert (&d))
+    return true;
+
   if (ix86_expand_vec_perm_const_1 (&d))
     return true;
 
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index f631756c829..87e22332c83 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -15162,8 +15162,12 @@ (define_insn "vec_set_lo_<mode><mask_name>"
 		       (const_int 10) (const_int 11)
 		       (const_int 12) (const_int 13)
 		       (const_int 14) (const_int 15)]))))]
-  "TARGET_AVX512DQ"
-  "vinsert<shuffletype>32x8\t{$0x0, %2, %1, %0<mask_operand3>|%0<mask_operand3>, %1, %2, 0x0}"
+  "TARGET_AVX512F && <mask_avx512dq_condition>"
+{
+  if (TARGET_AVX512DQ)
+    return "vinsert<shuffletype>32x8\t{$0x0, %2, %1, %0<mask_operand3>|%0<mask_operand3>, %1, %2, 0x0}";
+  return "vinsert<shuffletype>64x4\t{$0x0, %2, %1, %0|%0, %1, %2, 0x0}";
+}
   [(set_attr "type" "sselog")
    (set_attr "length_immediate" "1")
    (set_attr "prefix" "evex")
@@ -22806,6 +22810,28 @@ (define_insn "vec_set_hi_v16hi"
    (set_attr "prefix" "vex,evex")
    (set_attr "mode" "OI")])
 
+(define_insn "vec_set_lo_v32hi"
+  [(set (match_operand:V32HI 0 "register_operand" "=v")
+	(vec_concat:V32HI
+	  (match_operand:V16HI 2 "nonimmediate_operand" "vm")
+	  (vec_select:V16HI
+	    (match_operand:V32HI 1 "register_operand" "v")
+	    (parallel [(const_int 16) (const_int 17)
+		       (const_int 18) (const_int 19)
+		       (const_int 20) (const_int 21)
+		       (const_int 22) (const_int 23)
+		       (const_int 24) (const_int 25)
+		       (const_int 26) (const_int 27)
+		       (const_int 28) (const_int 29)
+		       (const_int 30) (const_int 31)]))))]
+  "TARGET_AVX512F"
+  "vinserti64x4\t{$0x0, %2, %1, %0|%0, %1, %2, 0x0}"
+  [(set_attr "type" "sselog")
+   (set_attr "prefix_extra" "1")
+   (set_attr "length_immediate" "1")
+   (set_attr "prefix" "evex")
+   (set_attr "mode" "XI")])
+
 (define_insn "vec_set_lo_v32qi"
   [(set (match_operand:V32QI 0 "register_operand" "=x,v")
 	(vec_concat:V32QI
@@ -22854,6 +22880,36 @@ (define_insn "vec_set_hi_v32qi"
    (set_attr "prefix" "vex,evex")
    (set_attr "mode" "OI")])
 
+(define_insn "vec_set_lo_v64qi"
+  [(set (match_operand:V64QI 0 "register_operand" "=v")
+	(vec_concat:V64QI
+	  (match_operand:V32QI 2 "nonimmediate_operand" "vm")
+	  (vec_select:V32QI
+	    (match_operand:V64QI 1 "register_operand" "v")
+	    (parallel [(const_int 32) (const_int 33)
+		       (const_int 34) (const_int 35)
+		       (const_int 36) (const_int 37)
+		       (const_int 38) (const_int 39)
+		       (const_int 40) (const_int 41)
+		       (const_int 42) (const_int 43)
+		       (const_int 44) (const_int 45)
+		       (const_int 46) (const_int 47)
+		       (const_int 48) (const_int 49)
+		       (const_int 50) (const_int 51)
+		       (const_int 52) (const_int 53)
+		       (const_int 54) (const_int 55)
+		       (const_int 56) (const_int 57)
+		       (const_int 58) (const_int 59)
+		       (const_int 60) (const_int 61)
+		       (const_int 62) (const_int 63)]))))]
+  "TARGET_AVX512F"
+  "vinserti64x4\t{$0x0, %2, %1, %0|%0, %1, %2, 0x0}"
+  [(set_attr "type" "sselog")
+   (set_attr "prefix_extra" "1")
+   (set_attr "length_immediate" "1")
+   (set_attr "prefix" "evex")
+   (set_attr "mode" "XI")])
+
 (define_insn "<avx_avx2>_maskload<ssemodesuffix><avxsizesuffix>"
   [(set (match_operand:V48_AVX2 0 "register_operand" "=x")
 	(unspec:V48_AVX2
diff --git a/gcc/testsuite/gcc.target/i386/pr101846-2.c b/gcc/testsuite/gcc.target/i386/pr101846-2.c
new file mode 100644
index 00000000000..af4ae8ccdd6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101846-2.c
@@ -0,0 +1,81 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx512bw -mavx512vl -mavx512dq -O2" } */
+/* { dg-final { scan-assembler-times "vpmovwb" "3" } } */
+/* { dg-final { scan-assembler-times "vpmovdw" "3" } } */
+/* { dg-final { scan-assembler-times "vpmovqd" "3" } } */
+
+typedef short v4hi __attribute__((vector_size (8)));
+typedef short v8hi __attribute__((vector_size (16)));
+typedef short v16hi __attribute__((vector_size (32)));
+typedef short v32hi __attribute__((vector_size (64)));
+typedef char v8qi __attribute__((vector_size (8)));
+typedef char v16qi __attribute__((vector_size (16)));
+typedef char v32qi __attribute__((vector_size (32)));
+typedef char v64qi __attribute__((vector_size (64)));
+typedef int v2si __attribute__((vector_size (8)));
+typedef int v4si __attribute__((vector_size (16)));
+typedef int v8si __attribute__((vector_size (32)));
+typedef int v16si __attribute__((vector_size (64)));
+
+v16hi
+foo_dw_512 (v32hi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  16, 18, 20, 22, 24, 26, 28, 30);
+}
+
+v8hi
+foo_dw_256 (v16hi x)
+{
+  return __builtin_shufflevector (x, x, 0, 2, 4, 6, 8, 10, 12, 14);
+}
+
+v4hi
+foo_dw_128 (v8hi x)
+{
+  return __builtin_shufflevector (x, x, 0, 2, 4, 6);
+}
+
+v8si
+foo_qd_512 (v16si x)
+{
+  return __builtin_shufflevector (x, x, 0, 2, 4, 6, 8, 10, 12, 14);
+}
+
+v4si
+foo_qd_256 (v8si x)
+{
+  return __builtin_shufflevector (x, x, 0, 2, 4, 6);
+}
+
+v2si
+foo_qd_128 (v4si x)
+{
+  return __builtin_shufflevector (x, x, 0, 2);
+}
+
+v32qi
+foo_wb_512 (v64qi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  16, 18, 20, 22, 24, 26, 28, 30,
+				  32, 34, 36, 38, 40, 42, 44, 46,
+				  48, 50, 52, 54, 56, 58, 60, 62);
+}
+
+v16qi
+foo_wb_256 (v32qi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  16, 18, 20, 22, 24, 26, 28, 30);
+}
+
+v8qi
+foo_wb_128 (v16qi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14);
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr101846-3.c b/gcc/testsuite/gcc.target/i386/pr101846-3.c
new file mode 100644
index 00000000000..380b1220327
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101846-3.c
@@ -0,0 +1,95 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx512bw -mavx512vl -mavx512dq -O2" } */
+/* { dg-final { scan-assembler-times "vpmovwb" "3" } } */
+/* { dg-final { scan-assembler-times "vpmovdw" "3" } } */
+/* { dg-final { scan-assembler-times "vpmovqd" "3" } } */
+
+typedef short v4hi __attribute__((vector_size (8)));
+typedef short v8hi __attribute__((vector_size (16)));
+typedef short v16hi __attribute__((vector_size (32)));
+typedef short v32hi __attribute__((vector_size (64)));
+typedef char v8qi __attribute__((vector_size (8)));
+typedef char v16qi __attribute__((vector_size (16)));
+typedef char v32qi __attribute__((vector_size (32)));
+typedef char v64qi __attribute__((vector_size (64)));
+typedef int v2si __attribute__((vector_size (8)));
+typedef int v4si __attribute__((vector_size (16)));
+typedef int v8si __attribute__((vector_size (32)));
+typedef int v16si __attribute__((vector_size (64)));
+
+v32hi
+foo_dw_512 (v32hi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  16, 18, 20, 22, 24, 26, 28, 30,
+				  16, 17, 18, 19, 20, 21, 22, 23,
+				  24, 25, 26, 27, 28, 29, 30, 31);
+}
+
+v16hi
+foo_dw_256 (v16hi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  8, 9, 10, 11, 12, 13, 14, 15);
+}
+
+v8hi
+foo_dw_128 (v8hi x)
+{
+  return __builtin_shufflevector (x, x, 0, 2, 4, 6, 4, 5, 6, 7);
+}
+
+v16si
+foo_qd_512 (v16si x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  8, 9, 10, 11, 12, 13, 14, 15);
+}
+
+v8si
+foo_qd_256 (v8si x)
+{
+  return __builtin_shufflevector (x, x, 0, 2, 4, 6, 4, 5, 6, 7);
+}
+
+v4si
+foo_qd_128 (v4si x)
+{
+  return __builtin_shufflevector (x, x, 0, 2, 2, 3);
+}
+
+v64qi
+foo_wb_512 (v64qi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  16, 18, 20, 22, 24, 26, 28, 30,
+				  32, 34, 36, 38, 40, 42, 44, 46,
+				  48, 50, 52, 54, 56, 58, 60, 62,
+				  32, 33, 34, 35, 36, 37, 38, 39,
+				  40, 41, 42, 43, 44, 45, 46, 47,
+				  48, 49, 50, 51, 52, 53, 54, 55,
+				  56, 57, 58, 59, 60, 61, 62, 63);
+}
+
+v32qi
+foo_wb_256 (v32qi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  16, 18, 20, 22, 24, 26, 28, 30,
+				  16, 17, 18, 19, 20, 21, 22, 23,
+				  24, 25, 26, 27, 28, 29, 30, 31);
+}
+
+v16qi
+foo_wb_128 (v16qi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  8, 9, 10, 11, 12, 13, 14, 15);
+}
+
-- 
2.27.0



* Re: [PATCH] [i386] Optimize vec_perm_expr to match vpmov{dw,qd,wb}.
  2021-08-12  5:43     ` [PATCH] [i386] Optimize vec_perm_expr to match vpmov{dw,qd,wb} liuhongt
@ 2021-08-12  9:22       ` Jakub Jelinek
  2021-08-12  9:41         ` Jakub Jelinek
  2021-08-13  1:42         ` Hongtao Liu
  0 siblings, 2 replies; 14+ messages in thread
From: Jakub Jelinek @ 2021-08-12  9:22 UTC (permalink / raw)
  To: liuhongt; +Cc: gcc-patches, crazylht, ubizjak

On Thu, Aug 12, 2021 at 01:43:23PM +0800, liuhongt wrote:
> Hi:
>   This is another patch to optimize vec_perm_expr to match vpmov{dw,qd,wb}
> under AVX512.
>   For scenarios (like pr101846-2.c) where the upper half is not used, this patch
> generates better code with only one vpmov{wb,dw,qd} instruction.  For
> scenarios (like pr101846-3.c) where the upper half is actually used, if the
> source vector length is 256/512 bits, the patch can still generate better code,
> but for 128 bits, the code generation is worse.
> 
> 128 bits upper half not used.
> 
> -       vpshufb .LC2(%rip), %xmm0, %xmm0
> +       vpmovdw %xmm0, %xmm0
> 
> 128 bits upper half used.
> -       vpshufb .LC2(%rip), %xmm0, %xmm0
> +       vpmovdw %xmm0, %xmm1
> +       vmovq   %xmm1, %rax
> +       vpinsrq $0, %rax, %xmm0, %xmm0
> 
>   Maybe expand_vec_perm_trunc_vinsert should only deal with 256/512-bit
> vectors, but since the likelihood of real uses of scenarios like foo_*_128
> in pr101846-3.c is relatively low, I still keep this part of the code.

I actually am not sure if even
 foo_dw_512:
 .LFB0:
        .cfi_startproc
-       vmovdqa64       %zmm0, %zmm1
-       vmovdqa64       .LC0(%rip), %zmm0
-       vpermi2w        %zmm1, %zmm1, %zmm0
+       vpmovdw %zmm0, %ymm1
+       vinserti64x4    $0x0, %ymm1, %zmm0, %zmm0
        ret
is always a win; the permutations we should care most about are in loops
and the constant load as well as the first move in that case likely go
away and it is one permutation insn vs. two.
Different case is e.g.
-       vmovdqa64       .LC5(%rip), %zmm2
-       vmovdqa64       %zmm0, %zmm1
-       vmovdqa64       .LC0(%rip), %zmm0
-       vpermi2w        %zmm1, %zmm1, %zmm2
-       vpermi2w        %zmm1, %zmm1, %zmm0
-       vpshufb .LC6(%rip), %zmm0, %zmm0
-       vpshufb .LC7(%rip), %zmm2, %zmm1
-       vporq   %zmm1, %zmm0, %zmm0
+       vpmovwb %zmm0, %ymm1
+       vinserti64x4    $0x0, %ymm1, %zmm0, %zmm0
So, I wonder if your new routine shouldn't instead be done in
ix86_expand_vec_perm_const_1 after expand_vec_perm_1 among the other
2 insn cases, and handle the other vpmovdw etc. cases in combine
splitters (see that we only use the low half or quarter of the result
and transform whatever permutation we've used into what we want).

And perhaps make the routine eventually more general: don't handle
just the identity permutation in the upper half, but allow other
permutations there too (ones where that half can be represented by a
single insn permutation).
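
For instance, something like the following, where the low half is the
truncation and the high half is itself a single-insn permutation (the
exact shape is hypothetical, not taken from the testsuite):

typedef short v16hi __attribute__((vector_size (32)));

v16hi
trunc_lo_perm_hi (v16hi x)
{
  /* Low half: even elements (vpmovdw); high half: reversed elements,
     itself doable by one vpermw under AVX512BW+AVX512VL.  */
  return __builtin_shufflevector (x, x,
				  0, 2, 4, 6, 8, 10, 12, 14,
				  15, 14, 13, 12, 11, 10, 9, 8);
}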
> 
>   Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
>   Ok for trunk?
> 
> gcc/ChangeLog:
> 
> 	PR target/101846
> 	* config/i386/i386-expand.c (expand_vec_perm_trunc_vinsert):
> 	New function.
> 	(ix86_vectorize_vec_perm_const): Call
> 	expand_vec_perm_trunc_vinsert.
> 	* config/i386/sse.md (vec_set_lo_v32hi): New define_insn.
> 	(vec_set_lo_v64qi): Ditto.
> 	(vec_set_lo_<mode><mask_name>): Extend to no-avx512dq.
> 
> gcc/testsuite/ChangeLog:
> 
> 	PR target/101846
> 	* gcc.target/i386/pr101846-2.c: New test.
> 	* gcc.target/i386/pr101846-3.c: New test.
> ---
>  gcc/config/i386/i386-expand.c              | 125 +++++++++++++++++++++
>  gcc/config/i386/sse.md                     |  60 +++++++++-
>  gcc/testsuite/gcc.target/i386/pr101846-2.c |  81 +++++++++++++
>  gcc/testsuite/gcc.target/i386/pr101846-3.c |  95 ++++++++++++++++
>  4 files changed, 359 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101846-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101846-3.c
> 
> diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
> index bd21efa9530..519caac2e15 100644
> --- a/gcc/config/i386/i386-expand.c
> +++ b/gcc/config/i386/i386-expand.c
> @@ -18317,6 +18317,126 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d)
>    return false;
>  }
>  
> +/* A subroutine of ix86_expand_vec_perm_const_1.  Try to implement D
> +   in terms of a pair of vpmovdw + vinserti128 instructions.  */
> +static bool
> +expand_vec_perm_trunc_vinsert (struct expand_vec_perm_d *d)
> +{
> +  unsigned i, nelt = d->nelt, mask = d->nelt - 1;
> +  unsigned half = nelt / 2;
> +  machine_mode half_mode, trunc_mode;
> +
> +  /* vpmov{wb,dw,qd} only available under AVX512.  */
> +  if (!d->one_operand_p || !TARGET_AVX512F
> +      || (!TARGET_AVX512VL  && GET_MODE_SIZE (d->vmode) < 64)

Too many spaces.
> +      || GET_MODE_SIZE (GET_MODE_INNER (d->vmode)) > 4)
> +    return false;
> +
> +  /* TARGET_AVX512BW is needed for vpmovwb.  */
> +  if (GET_MODE_INNER (d->vmode) == E_QImode && !TARGET_AVX512BW)
> +    return false;
> +
> +  for (i = 0; i < nelt; i++)
> +    {
> +      unsigned idx = d->perm[i] & mask;
> +      if (idx != i * 2 && i < half)
> +	return false;
> +      if (idx != i && i >= half)
> +	return false;
> +    }
> +
> +  rtx (*gen_trunc) (rtx, rtx) = NULL;
> +  rtx (*gen_vec_set_lo) (rtx, rtx, rtx) = NULL;
> +  switch (d->vmode)
> +    {
> +    case E_V16QImode:
> +      gen_trunc = gen_truncv8hiv8qi2;
> +      gen_vec_set_lo = gen_vec_setv2di;
> +      half_mode = V8QImode;
> +      trunc_mode = V8HImode;
> +      break;
> +    case E_V32QImode:
> +      gen_trunc = gen_truncv16hiv16qi2;
> +      gen_vec_set_lo = gen_vec_set_lo_v32qi;
> +      half_mode = V16QImode;
> +      trunc_mode = V16HImode;
> +      break;
> +    case E_V64QImode:
> +      gen_trunc = gen_truncv32hiv32qi2;
> +      gen_vec_set_lo = gen_vec_set_lo_v64qi;
> +      half_mode = V32QImode;
> +      trunc_mode = V32HImode;
> +      break;
> +    case E_V8HImode:
> +      gen_trunc = gen_truncv4siv4hi2;
> +      gen_vec_set_lo = gen_vec_setv2di;
> +      half_mode = V4HImode;
> +      trunc_mode = V4SImode;
> +      break;
> +    case E_V16HImode:
> +      gen_trunc = gen_truncv8siv8hi2;
> +      gen_vec_set_lo = gen_vec_set_lo_v16hi;
> +      half_mode = V8HImode;
> +      trunc_mode = V8SImode;
> +      break;
> +    case E_V32HImode:
> +      gen_trunc = gen_truncv16siv16hi2;
> +      gen_vec_set_lo = gen_vec_set_lo_v32hi;
> +      half_mode = V16HImode;
> +      trunc_mode = V16SImode;
> +      break;
> +    case E_V4SImode:
> +      gen_trunc = gen_truncv2div2si2;
> +      gen_vec_set_lo = gen_vec_setv2di;
> +      half_mode = V2SImode;
> +      trunc_mode = V2DImode;
> +      break;
> +    case E_V8SImode:
> +      gen_trunc = gen_truncv4div4si2;
> +      gen_vec_set_lo = gen_vec_set_lo_v8si;
> +      half_mode = V4SImode;
> +      trunc_mode = V4DImode;
> +      break;
> +    case E_V16SImode:
> +      gen_trunc = gen_truncv8div8si2;
> +      gen_vec_set_lo = gen_vec_set_lo_v16si;
> +      half_mode = V8SImode;
> +      trunc_mode = V8DImode;
> +      break;
> +
> +    default:
> +      break;
> +    }
> +
> +  if (gen_trunc == NULL)
> +    return false;

Isn't it simpler to return false; in the default: case?
> @@ -21028,6 +21148,11 @@ ix86_vectorize_vec_perm_const (machine_mode vmode, rtx target, rtx op0,
>    d.op0 = nop0;
>    d.op1 = force_reg (vmode, d.op1);
>  
> +  /* Try to match vpmov{wb,dw,qd}, although vinserti128 will be generated,
> +     it's very likely to be optimized off. So let's put the function here.  */

Two spaces after full stop.

	Jakub



* Re: [PATCH] [i386] Optimize vec_perm_expr to match vpmov{dw,qd,wb}.
  2021-08-12  9:22       ` Jakub Jelinek
@ 2021-08-12  9:41         ` Jakub Jelinek
  2021-08-13  1:42         ` Hongtao Liu
  1 sibling, 0 replies; 14+ messages in thread
From: Jakub Jelinek @ 2021-08-12  9:41 UTC (permalink / raw)
  To: liuhongt, gcc-patches

On Thu, Aug 12, 2021 at 11:22:48AM +0200, Jakub Jelinek via Gcc-patches wrote:
> So, I wonder if your new routine shouldn't instead be done in
> ix86_expand_vec_perm_const_1 after expand_vec_perm_1 among the other
> 2 insn cases, and handle the other vpmovdw etc. cases in combine
> splitters (see that we only use the low half or quarter of the result
> and transform whatever permutation we've used into what we want).

E.g. in the first function, combine tries:
(set (reg:V16HI 85)
    (vec_select:V16HI (unspec:V32HI [
                (mem/u/c:V32HI (symbol_ref/u:DI ("*.LC0") [flags 0x2]) [0  S64 A512])
                (reg:V32HI 88) repeated x2
            ] UNSPEC_VPERMT2)
        (parallel [
                (const_int 0 [0])
                (const_int 1 [0x1])
                (const_int 2 [0x2])
                (const_int 3 [0x3])
                (const_int 4 [0x4])
                (const_int 5 [0x5])
                (const_int 6 [0x6])
                (const_int 7 [0x7])
                (const_int 8 [0x8])
                (const_int 9 [0x9])
                (const_int 10 [0xa])
                (const_int 11 [0xb])
                (const_int 12 [0xc])
                (const_int 13 [0xd])
                (const_int 14 [0xe])
                (const_int 15 [0xf])
            ])))
A combine splitter could run avoid_constant_pool_reference on the
first UNSPEC_VPERMT2 argument and check whether the permutation can be
optimized, ideally using some function call so that we wouldn't need too
many splitters.
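
A minimal sketch of the helper such splitters could share (the name and
interface are made up; it only checks the even-element, i.e. truncating,
selection for the low half of the result that is actually used):

/* Return true if the constant permutation MASK_OP, possibly living in
   the constant pool, selects the even elements of MODE on the used low
   half of the result, i.e. acts as a truncation.  */
static bool
vpermt2_perm_is_trunc_p (rtx mask_op, machine_mode mode)
{
  rtx mask = avoid_constant_pool_reference (mask_op);
  if (GET_CODE (mask) != CONST_VECTOR)
    return false;
  unsigned nelt = GET_MODE_NUNITS (mode);
  for (unsigned i = 0; i < nelt / 2; i++)
    if (INTVAL (CONST_VECTOR_ELT (mask, i)) != (HOST_WIDE_INT) (2 * i))
      return false;
  return true;
}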

	Jakub



* Re: [PATCH] [i386] Optimize vec_perm_expr to match vpmov{dw,qd,wb}.
  2021-08-12  9:22       ` Jakub Jelinek
  2021-08-12  9:41         ` Jakub Jelinek
@ 2021-08-13  1:42         ` Hongtao Liu
  2021-08-13  8:47           ` Jakub Jelinek
  1 sibling, 1 reply; 14+ messages in thread
From: Hongtao Liu @ 2021-08-13  1:42 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: liuhongt, GCC Patches, Uros Bizjak

On Thu, Aug 12, 2021 at 5:23 PM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Thu, Aug 12, 2021 at 01:43:23PM +0800, liuhongt wrote:
> > Hi:
> >   This is another patch to optimize vec_perm_expr to match vpmov{dw,qd,wb}
> > under AVX512.
> >   For scenarios (like pr101846-2.c) where the upper half is not used, this patch
> > generates better code with only one vpmov{wb,dw,qd} instruction.  For
> > scenarios (like pr101846-3.c) where the upper half is actually used, if the
> > source vector length is 256/512 bits, the patch can still generate better code,
> > but for 128 bits, the code generation is worse.
> >
> > 128 bits upper half not used.
> >
> > -       vpshufb .LC2(%rip), %xmm0, %xmm0
> > +       vpmovdw %xmm0, %xmm0
> >
> > 128 bits upper half used.
> > -       vpshufb .LC2(%rip), %xmm0, %xmm0
> > +       vpmovdw %xmm0, %xmm1
> > +       vmovq   %xmm1, %rax
> > +       vpinsrq $0, %rax, %xmm0, %xmm0
> >
> >   Maybe expand_vec_perm_trunc_vinsert should only deal with 256/512-bit
> > vectors, but since the likelihood of real uses of scenarios like foo_*_128
> > in pr101846-3.c is relatively low, I still keep this part of the code.
>
> I actually am not sure if even
>  foo_dw_512:
>  .LFB0:
>         .cfi_startproc
> -       vmovdqa64       %zmm0, %zmm1
> -       vmovdqa64       .LC0(%rip), %zmm0
> -       vpermi2w        %zmm1, %zmm1, %zmm0
> +       vpmovdw %zmm0, %ymm1
> +       vinserti64x4    $0x0, %ymm1, %zmm0, %zmm0
>         ret
> is always a win; the permutations we should care most about are in loops
Yes, and vpmov{qd,dw} and vpermi{w,d} are both available under
avx512f, so expand_vec_perm_trunc_vinsert will never match when
it's placed among the other 2 insn cases.
The only part we need to handle is vpmovwb, which is under avx512bw, but
vpermi2b requires avx512vbmi.
> and the constant load as well as the first move in that case likely go
> away and it is one permutation insn vs. two.
> Different case is e.g.
> -       vmovdqa64       .LC5(%rip), %zmm2
> -       vmovdqa64       %zmm0, %zmm1
> -       vmovdqa64       .LC0(%rip), %zmm0
> -       vpermi2w        %zmm1, %zmm1, %zmm2
> -       vpermi2w        %zmm1, %zmm1, %zmm0
> -       vpshufb .LC6(%rip), %zmm0, %zmm0
> -       vpshufb .LC7(%rip), %zmm2, %zmm1
> -       vporq   %zmm1, %zmm0, %zmm0
> +       vpmovwb %zmm0, %ymm1
> +       vinserti64x4    $0x0, %ymm1, %zmm0, %zmm0
> So, I wonder if your new routine shouldn't instead be done in
> ix86_expand_vec_perm_const_1 after expand_vec_perm_1 among the other
> 2 insn cases, and handle the other vpmovdw etc. cases in combine
> splitters (see that we only use the low half or quarter of the result
> and transform whatever permutation we've used into what we want).
>
Got it, I'll try that way.
> And perhaps make the routine eventually more general: don't handle
> just the identity permutation in the upper half, but allow other
> permutations there too (ones where that half can be represented by a
> single insn permutation).
> >
> >   Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
> >   Ok for trunk?
> >
> > gcc/ChangeLog:
> >
> >       PR target/101846
> >       * config/i386/i386-expand.c (expand_vec_perm_trunc_vinsert):
> >       New function.
> >       (ix86_vectorize_vec_perm_const): Call
> >       expand_vec_perm_trunc_vinsert.
> >       * config/i386/sse.md (vec_set_lo_v32hi): New define_insn.
> >       (vec_set_lo_v64qi): Ditto.
> >       (vec_set_lo_<mode><mask_name>): Extend to no-avx512dq.
> >
> > gcc/testsuite/ChangeLog:
> >
> >       PR target/101846
> >       * gcc.target/i386/pr101846-2.c: New test.
> >       * gcc.target/i386/pr101846-3.c: New test.
> > ---
> >  gcc/config/i386/i386-expand.c              | 125 +++++++++++++++++++++
> >  gcc/config/i386/sse.md                     |  60 +++++++++-
> >  gcc/testsuite/gcc.target/i386/pr101846-2.c |  81 +++++++++++++
> >  gcc/testsuite/gcc.target/i386/pr101846-3.c |  95 ++++++++++++++++
> >  4 files changed, 359 insertions(+), 2 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101846-2.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101846-3.c
> >
> > diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
> > index bd21efa9530..519caac2e15 100644
> > --- a/gcc/config/i386/i386-expand.c
> > +++ b/gcc/config/i386/i386-expand.c
> > @@ -18317,6 +18317,126 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d)
> >    return false;
> >  }
> >
> > +/* A subroutine of ix86_expand_vec_perm_const_1.  Try to implement D
> > +   in terms of a pair of vpmovdw + vinserti128 instructions.  */
> > +static bool
> > +expand_vec_perm_trunc_vinsert (struct expand_vec_perm_d *d)
> > +{
> > +  unsigned i, nelt = d->nelt, mask = d->nelt - 1;
> > +  unsigned half = nelt / 2;
> > +  machine_mode half_mode, trunc_mode;
> > +
> > +  /* vpmov{wb,dw,qd} only available under AVX512.  */
> > +  if (!d->one_operand_p || !TARGET_AVX512F
> > +      || (!TARGET_AVX512VL  && GET_MODE_SIZE (d->vmode) < 64)
>
> Too many spaces.
> > +      || GET_MODE_SIZE (GET_MODE_INNER (d->vmode)) > 4)
> > +    return false;
> > +
> > +  /* TARGET_AVX512BW is needed for vpmovwb.  */
> > +  if (GET_MODE_INNER (d->vmode) == E_QImode && !TARGET_AVX512BW)
> > +    return false;
> > +
> > +  for (i = 0; i < nelt; i++)
> > +    {
> > +      unsigned idx = d->perm[i] & mask;
> > +      if (idx != i * 2 && i < half)
> > +     return false;
> > +      if (idx != i && i >= half)
> > +     return false;
> > +    }
> > +
> > +  rtx (*gen_trunc) (rtx, rtx) = NULL;
> > +  rtx (*gen_vec_set_lo) (rtx, rtx, rtx) = NULL;
> > +  switch (d->vmode)
> > +    {
> > +    case E_V16QImode:
> > +      gen_trunc = gen_truncv8hiv8qi2;
> > +      gen_vec_set_lo = gen_vec_setv2di;
> > +      half_mode = V8QImode;
> > +      trunc_mode = V8HImode;
> > +      break;
> > +    case E_V32QImode:
> > +      gen_trunc = gen_truncv16hiv16qi2;
> > +      gen_vec_set_lo = gen_vec_set_lo_v32qi;
> > +      half_mode = V16QImode;
> > +      trunc_mode = V16HImode;
> > +      break;
> > +    case E_V64QImode:
> > +      gen_trunc = gen_truncv32hiv32qi2;
> > +      gen_vec_set_lo = gen_vec_set_lo_v64qi;
> > +      half_mode = V32QImode;
> > +      trunc_mode = V32HImode;
> > +      break;
> > +    case E_V8HImode:
> > +      gen_trunc = gen_truncv4siv4hi2;
> > +      gen_vec_set_lo = gen_vec_setv2di;
> > +      half_mode = V4HImode;
> > +      trunc_mode = V4SImode;
> > +      break;
> > +    case E_V16HImode:
> > +      gen_trunc = gen_truncv8siv8hi2;
> > +      gen_vec_set_lo = gen_vec_set_lo_v16hi;
> > +      half_mode = V8HImode;
> > +      trunc_mode = V8SImode;
> > +      break;
> > +    case E_V32HImode:
> > +      gen_trunc = gen_truncv16siv16hi2;
> > +      gen_vec_set_lo = gen_vec_set_lo_v32hi;
> > +      half_mode = V16HImode;
> > +      trunc_mode = V16SImode;
> > +      break;
> > +    case E_V4SImode:
> > +      gen_trunc = gen_truncv2div2si2;
> > +      gen_vec_set_lo = gen_vec_setv2di;
> > +      half_mode = V2SImode;
> > +      trunc_mode = V2DImode;
> > +      break;
> > +    case E_V8SImode:
> > +      gen_trunc = gen_truncv4div4si2;
> > +      gen_vec_set_lo = gen_vec_set_lo_v8si;
> > +      half_mode = V4SImode;
> > +      trunc_mode = V4DImode;
> > +      break;
> > +    case E_V16SImode:
> > +      gen_trunc = gen_truncv8div8si2;
> > +      gen_vec_set_lo = gen_vec_set_lo_v16si;
> > +      half_mode = V8SImode;
> > +      trunc_mode = V8DImode;
> > +      break;
> > +
> > +    default:
> > +      break;
> > +    }
> > +
> > +  if (gen_trunc == NULL)
> > +    return false;
>
> Isn't it simpler to return false; in the default: case?
> > @@ -21028,6 +21148,11 @@ ix86_vectorize_vec_perm_const (machine_mode vmode, rtx target, rtx op0,
> >    d.op0 = nop0;
> >    d.op1 = force_reg (vmode, d.op1);
> >
> > +  /* Try to match vpmov{wb,dw,qd}, although vinserti128 will be generated,
> > +     it's very likely to be optimized off. So let's put the function here.  */
>
> Two spaces after full stop.
>
>         Jakub
>


-- 
BR,
Hongtao


* Re: [PATCH] [i386] Optimize vec_perm_expr to match vpmov{dw,qd,wb}.
  2021-08-13  1:42         ` Hongtao Liu
@ 2021-08-13  8:47           ` Jakub Jelinek
  2021-08-13  9:03             ` Richard Sandiford
  2021-08-16  5:18             ` [PATCH] [i386] Optimize __builtin_shuffle_vector liuhongt
  0 siblings, 2 replies; 14+ messages in thread
From: Jakub Jelinek @ 2021-08-13  8:47 UTC (permalink / raw)
  To: Hongtao Liu, Richard Biener; +Cc: liuhongt, GCC Patches, Uros Bizjak

On Fri, Aug 13, 2021 at 09:42:00AM +0800, Hongtao Liu wrote:
> > So, I wonder if your new routine shouldn't instead be done in
> > ix86_expand_vec_perm_const_1 after expand_vec_perm_1 among the other
> > 2 insn cases, and handle the other vpmovdw etc. cases in combine
> > splitters (see that we only use the low half or quarter of the result
> > and transform whatever permutation we've used into what we want).
> >
> Got it, I'll try that way.

Note, IMHO the ultimate fix would be to add real support for the
__builtin_shufflevector -1 indices (meaning I don't care what will be in
that element, perhaps narrowed down to an implementation choice of
any element of the input vector(s) or 0).
As VEC_PERM_EXPR is currently used both for perms by a variable permutation
vector and for constant ones, I think we'd need to introduce VEC_PERM_CONST_EXPR,
which would be exactly like VEC_PERM_EXPR, except that the last operand
would be required to be a VECTOR_CST and that an all-ones element would mean
something different, the I-don't-care behavior.
The GIMPLE side would be fairly easy, except that there should be some
optimizations eventually: when only a certain subset of the elements of
a vector is used later, we can mark the other elements as don't care.
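
At the source level this is the clang-compatible -1 index that
__builtin_shufflevector already accepts; real support would mean
exploiting the freedom instead of picking some concrete element, e.g.:

typedef short v8hi __attribute__((vector_size (16)));

v8hi
even_lo_dont_care_hi (v8hi x)
{
  /* Lanes 4-7 of the result are explicitly "don't care".  */
  return __builtin_shufflevector (x, x, 0, 2, 4, 6, -1, -1, -1, -1);
}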

The hard part would be backend expansion, especially x86.
I guess we could easily canonicalize VEC_PERM_EXPR with constant
permutations into VEC_PERM_CONST_EXPR by replacing all ones elements
with elements modulo the number of elements (or twice that for 2 operand
perms), but then in all the routines that recognize something we'd
need to special case the unknown elements to match anything during testing
and for expansion replace it by something that would match.
That is again a lot of work, but not extremely hard.  The hardest would be
to deal with the expand_vec_perm_1 handling many cases by trying to recog
an instruction.  Either we'd need to represent the unknown case by a magic
CONST_INT_WILDCARD or CONST_INT_RANGE that recog with the help of the
patterns would replace by some CONST_INT that matches it, but note we have
all those const_0_to_N_operand and in conditions
INTVAL (operands[3]) & 1 == 0 and INTVAL (operands[3]) + 1 == INTVAL (operands[4])
etc., or we'd need to either manually or semi-automatically build some code that
would try to guess right values for unknown before trying to recog it.

	Jakub


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] [i386] Optimize vec_perm_expr to match vpmov{dw,qd,wb}.
  2021-08-13  8:47           ` Jakub Jelinek
@ 2021-08-13  9:03             ` Richard Sandiford
  2021-08-16  7:19               ` Richard Biener
  2021-08-16  5:18             ` [PATCH] [i386] Optimize __builtin_shuffle_vector liuhongt
  1 sibling, 1 reply; 14+ messages in thread
From: Richard Sandiford @ 2021-08-13  9:03 UTC (permalink / raw)
  To: Jakub Jelinek via Gcc-patches
  Cc: Hongtao Liu, Richard Biener, Jakub Jelinek, liuhongt

Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On Fri, Aug 13, 2021 at 09:42:00AM +0800, Hongtao Liu wrote:
>> > So, I wonder if your new routine shouldn't be instead done after
>> > in ix86_expand_vec_perm_const_1 after vec_perm_1 among other 2 insn cases
>> > and handle the other vpmovdw etc. cases in combine splitters (see that we
>> > only use low half or quarter of the result and transform whatever
>> > permutation we've used into what we want).
>> >
>> Got it, I'll try that way.
>
> Note, IMHO the ultimate fix would be to add real support for the
> __builtin_shufflevector -1 indices (meaning I don't care what will be in
> that element, perhaps narrowed down to an implementation choice of
> any element of the input vector(s) or 0).
> As VEC_PERM_EXPR is currently used for both perms by variable permutation
> vector and constant, I think we'd need to introduce VEC_PERM_CONST_EXPR,
> which would be exactly like VEC_PERM_EXPR, except that the last operand
> would be required to be a VECTOR_CST and that all ones element would mean
> something different, the I don't care behavior.
> The GIMPLE side would be fairly easy, except that there should be some
> optimizations eventually, like when only certain subset of elements of
> a vector are used later, we can mark the other elements as don't care.

Another alternative I'd wondered about was keeping a single tree code,
but adding an extra operand with a “care/don't care” mask.  I think
that would fit with variable-length vectors better.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH] [i386] Optimize __builtin_shuffle_vector.
  2021-08-13  8:47           ` Jakub Jelinek
  2021-08-13  9:03             ` Richard Sandiford
@ 2021-08-16  5:18             ` liuhongt
  2021-08-16  7:10               ` Jakub Jelinek
  1 sibling, 1 reply; 14+ messages in thread
From: liuhongt @ 2021-08-16  5:18 UTC (permalink / raw)
  To: gcc-patches; +Cc: jakub, crazylht

Hi:
  Here's the updated patch, which does 3 things:
1. Support vpermw/vpermb in ix86_expand_vec_one_operand_perm_avx512.
2. Support 256/128-bits vpermi2b in ix86_expand_vec_perm_vpermt2.
3. Add define_insn_and_split patterns to optimize specific vector permutations to vpmov{dw,wb,qd}.
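
As a concrete illustration of item 3 (a sketch mirroring the new
testcases below; assuming -O2 -mavx512vl):

typedef int v8si __attribute__ ((vector_size (32)));
typedef int v4si __attribute__ ((vector_size (16)));

/* Keeping the even elements of a V8SI is the same as viewing it as
   V4DI and truncating each element, so this shuffle can now be
   emitted as a single vpmovqd.  */
v4si
even_qd (v8si x)
{
  return __builtin_shufflevector (x, x, 0, 2, 4, 6);
}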

  Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
  Ok for trunk?

gcc/ChangeLog:

	PR target/101846
	* config/i386/i386-expand.c (ix86_expand_vec_perm_vpermt2):
	Support vpermi2b for V32QI/V16QImode.
	(ix86_extract_perm_from_pool_constant): New function.
	(ix86_expand_vec_one_operand_perm_avx512): Support
	vpermw/vpermb under TARGET_AVX512BW/TARGET_AVX512VBMI.
	(expand_vec_perm_1): Adjust comments for the above changes.
	* config/i386/i386-protos.h (ix86_extract_perm_from_pool_constant):
	New declaration.
	* config/i386/predicates.md (permvar_truncate_operand): New predicate.
	(pshufb_truncv4siv4hi_operand): Ditto.
	(pshufb_truncv8hiv8qi_operand): Ditto.
	* config/i386/sse.md (*avx512bw_permvar_truncv16siv16hi_1):
	New pre_reload define_insn_and_split.
	(*avx512f_permvar_truncv8siv8hi_1): Ditto.
	(*avx512f_vpermvar_truncv8div8si_1): Ditto.
	(*avx512f_permvar_truncv32hiv32qi_1): Ditto.
	(*avx512f_permvar_truncv16hiv16qi_1): Ditto.
	(*avx512f_permvar_truncv4div4si_1): Ditto.
	(*avx512f_pshufb_truncv8hiv8qi_1): Ditto.
	(*avx512f_pshufb_truncv4siv4hi_1): Ditto.
	(*avx512f_pshufd_truncv2div2si_1): Ditto.

gcc/testsuite/ChangeLog:

	PR target/101846
	* gcc.target/i386/pr101846-2.c: New test.
	* gcc.target/i386/pr101846-3.c: New test.
	* gcc.target/i386/pr101846-4.c: New test.
---
 gcc/config/i386/i386-expand.c              |  89 +++++++++-
 gcc/config/i386/i386-protos.h              |   1 +
 gcc/config/i386/predicates.md              |  90 ++++++++++
 gcc/config/i386/sse.md                     | 190 +++++++++++++++++++++
 gcc/testsuite/gcc.target/i386/pr101846-2.c |  81 +++++++++
 gcc/testsuite/gcc.target/i386/pr101846-3.c |  73 ++++++++
 gcc/testsuite/gcc.target/i386/pr101846-4.c |  40 +++++
 7 files changed, 559 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101846-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101846-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101846-4.c

diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
index a652b25f534..56319cb6f6a 100644
--- a/gcc/config/i386/i386-expand.c
+++ b/gcc/config/i386/i386-expand.c
@@ -4778,6 +4778,18 @@ ix86_expand_vec_perm_vpermt2 (rtx target, rtx mask, rtx op0, rtx op1,
 
   switch (mode)
     {
+    case E_V16QImode:
+      if (TARGET_AVX512VL && TARGET_AVX512VBMI)
+	gen = gen_avx512vl_vpermt2varv16qi3;
+      break;
+    case E_V32QImode:
+      if (TARGET_AVX512VL && TARGET_AVX512VBMI)
+	gen = gen_avx512vl_vpermt2varv32qi3;
+      break;
+    case E_V64QImode:
+      if (TARGET_AVX512VBMI)
+	gen = gen_avx512bw_vpermt2varv64qi3;
+      break;
     case E_V8HImode:
       if (TARGET_AVX512VL && TARGET_AVX512BW)
 	gen = gen_avx512vl_vpermt2varv8hi3;
@@ -4786,10 +4798,6 @@ ix86_expand_vec_perm_vpermt2 (rtx target, rtx mask, rtx op0, rtx op1,
       if (TARGET_AVX512VL && TARGET_AVX512BW)
 	gen = gen_avx512vl_vpermt2varv16hi3;
       break;
-    case E_V64QImode:
-      if (TARGET_AVX512VBMI)
-	gen = gen_avx512bw_vpermt2varv64qi3;
-      break;
     case E_V32HImode:
       if (TARGET_AVX512BW)
 	gen = gen_avx512bw_vpermt2varv32hi3;
@@ -5487,6 +5495,45 @@ ix86_expand_sse_unpack (rtx dest, rtx src, bool unsigned_p, bool high_p)
     }
 }
 
+/* Return true if MEM is a constant pool reference which contains a
+   const_vector of permutation indices; assign the indices to PERM.  */
+bool
+ix86_extract_perm_from_pool_constant (int* perm, rtx mem)
+{
+  machine_mode mode = GET_MODE (mem);
+  int nelt = GET_MODE_NUNITS (mode);
+
+  if (!INTEGRAL_MODE_P (mode))
+    return false;
+
+  /* Needs to be a constant pool reference.  */
+  if (!MEM_P (mem)
+      || !SYMBOL_REF_P (XEXP (mem, 0))
+      || !CONSTANT_POOL_ADDRESS_P (XEXP (mem, 0)))
+    return false;
+
+  rtx constant = get_pool_constant (XEXP (mem, 0));
+
+  if (GET_CODE (constant) != CONST_VECTOR)
+    return false;
+
+  /* There could be some rtx like
+     (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1")))
+     but with "*.LC1" referring to a V2DI constant vector.  */
+  if (GET_MODE (constant) != mode)
+    {
+      constant = simplify_subreg (mode, constant, GET_MODE (constant), 0);
+
+      if (constant == nullptr || GET_CODE (constant) != CONST_VECTOR)
+	return false;
+    }
+
+  for (int i = 0; i != nelt; i++)
+    perm[i] = UINTVAL (XVECEXP (constant, 0, i));
+
+  return true;
+}
+
 /* Split operands 0 and 1 into half-mode parts.  Similar to split_double_mode,
    but works for floating pointer parameters and nonoffsetable memories.
    For pushes, it returns just stack offsets; the values will be saved
@@ -18086,6 +18133,7 @@ ix86_expand_vec_one_operand_perm_avx512 (struct expand_vec_perm_d *d)
 {
   machine_mode mode = GET_MODE (d->op0);
   machine_mode maskmode = mode;
+  unsigned inner_size = GET_MODE_SIZE (GET_MODE_INNER (mode));
   rtx (*gen) (rtx, rtx, rtx) = NULL;
   rtx target, op0, mask;
   rtx vec[64];
@@ -18096,6 +18144,18 @@ ix86_expand_vec_one_operand_perm_avx512 (struct expand_vec_perm_d *d)
   if (!TARGET_AVX512F)
     return false;
 
+  /* Accept VNxHImode and VNxQImode now.  */
+  if (!TARGET_AVX512VL && GET_MODE_SIZE (mode) < 64)
+    return false;
+
+  /* vpermw.  */
+  if (!TARGET_AVX512BW && inner_size == 2)
+    return false;
+
+  /* vpermb.   */
+  if (!TARGET_AVX512VBMI && inner_size == 1)
+    return false;
+
   switch (mode)
     {
     case E_V16SImode:
@@ -18112,6 +18172,25 @@ ix86_expand_vec_one_operand_perm_avx512 (struct expand_vec_perm_d *d)
       gen = gen_avx512f_permvarv8df;
       maskmode = V8DImode;
       break;
+    case E_V32HImode:
+      gen = gen_avx512bw_permvarv32hi;
+      break;
+    case E_V16HImode:
+      gen = gen_avx512vl_permvarv16hi;
+      break;
+    case E_V8HImode:
+      gen = gen_avx512vl_permvarv8hi;
+      break;
+    case E_V64QImode:
+      gen = gen_avx512bw_permvarv64qi;
+      break;
+    case E_V32QImode:
+      gen = gen_avx512vl_permvarv32qi;
+      break;
+    case E_V16QImode:
+      gen = gen_avx512vl_permvarv16qi;
+      break;
+
     default:
       return false;
     }
@@ -18301,7 +18380,7 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d)
   if (expand_vec_perm_palignr (d, true))
     return true;
 
-  /* Try the AVX512F vperm{s,d} instructions.  */
+  /* Try the AVX512F vperm{w,b,s,d} and instructions  */
   if (ix86_expand_vec_one_operand_perm_avx512 (d))
     return true;
 
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index 07ac02aff69..2fd13074c81 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -260,6 +260,7 @@ extern void ix86_expand_sse2_mulvxdi3 (rtx, rtx, rtx);
 extern void ix86_expand_sse2_abs (rtx, rtx);
 extern bool ix86_expand_vector_init_duplicate (bool, machine_mode, rtx,
 					       rtx);
+extern bool ix86_extract_perm_from_pool_constant (int*, rtx);
 
 /* In i386-c.c  */
 extern void ix86_target_macros (void);
diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
index 129205ac3a7..650d6354de9 100644
--- a/gcc/config/i386/predicates.md
+++ b/gcc/config/i386/predicates.md
@@ -1713,6 +1713,96 @@ (define_predicate "addsub_vs_parallel"
   return true;
 })
 
+;; Return true if OP is a constant pool reference for vperm{w,d,b} which
+;; contains an index matching pmov{dw,wb,qd}.
+(define_predicate "permvar_truncate_operand"
+ (match_code "mem")
+{
+  int nelt = GET_MODE_NUNITS (mode);
+  int perm[128];
+  int id;
+
+  if (!INTEGRAL_MODE_P (mode) || !VECTOR_MODE_P (mode))
+    return false;
+
+  if (nelt < 2)
+    return false;
+
+  if (!ix86_extract_perm_from_pool_constant (&perm[0], op))
+    return false;
+
+  id = exact_log2 (nelt);
+
+  /* Check that the permutation is suitable for pmovz{bw,wd,dq}.
+     For example V16HImode to V8HImode
+     { 0 2 4 6 8 10 12 14 * * * * * * * * }.  */
+  for (int i = 0; i != nelt/2; i++)
+    if ((perm[i] & ((1 << id) - 1)) != i * 2)
+      return false;
+
+  return true;
+})
+
+;; Return true if OP is a constant pool reference for pshufb which
+;; contains an index matching pmovdw.
+(define_predicate "pshufb_truncv4siv4hi_operand"
+ (match_code "mem")
+{
+  int perm[128];
+
+  if (mode != E_V16QImode)
+    return false;
+
+  if (!ix86_extract_perm_from_pool_constant (&perm[0], op))
+    return false;
+
+  /* Check that the permutation is suitable for pmovdw.
+     For example V4SImode to V4HImode, as a V16QI byte shuffle
+     { 0 1 4 5 8 9 12 13 * * * * * * * * }.
+     index = i % 2 + (i / 2) * 4.  */
+  for (int i = 0; i != 8; i++)
+    {
+      /* if (SRC2[(i * 8)+7] = 1) then DEST[(i*8)+7..(i*8)+0] := 0;  */
+      if (perm[i] & 128)
+	return false;
+
+      if ((perm[i] & 15) != ((i & 1) + (i & 0xFE) * 2))
+	return false;
+    }
+
+  return true;
+})
+
+;; Return true if OP is a constant pool reference for pshufb which
+;; contains an index matching pmovwb.
+(define_predicate "pshufb_truncv8hiv8qi_operand"
+ (match_code "mem")
+{
+  int perm[128];
+
+  if (mode != E_V16QImode)
+    return false;
+
+  if (!ix86_extract_perm_from_pool_constant (&perm[0], op))
+    return false;
+
+  /* Check that the permutation is suitable for pmovwb.
+     For example V8HImode to V8QImode, as a V16QI byte shuffle
+     { 0 2 4 6 8 10 12 14 * * * * * * * * }.
+     index = i * 2.  */
+  for (int i = 0; i != 8; i++)
+    {
+      /* if (SRC2[(i * 8)+7] = 1) then DEST[(i*8)+7..(i*8)+0] := 0;  */
+      if (perm[i] & 128)
+	return false;
+
+      if ((perm[i] & 15) != i * 2)
+	return false;
+    }
+
+  return true;
+})
+
 ;; Return true if OP is a parallel for an pmovz{bw,wd,dq} vec_select,
 ;; where one of the two operands of the vec_concat is const0_operand.
 (define_predicate "pmovzx_parallel"
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 3a7bbaec7af..c9f21082beb 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -10978,6 +10978,64 @@ (define_insn "*avx512f_<code><pmov_src_lower><mode>2"
    (set_attr "prefix" "evex")
    (set_attr "mode" "<sseinsnmode>")])
 
+(define_insn_and_split "*avx512bw_permvar_truncv16siv16hi_1"
+  [(set (match_operand:V16HI 0 "nonimmediate_operand")
+	(vec_select:V16HI
+	  (unspec:V32HI
+	    [(match_operand:V32HI 1 "register_operand")
+	     (match_operand:V32HI 2 "permvar_truncate_operand")]
+	   UNSPEC_VPERMVAR)
+	  (parallel [(const_int 0) (const_int 1)
+		     (const_int 2) (const_int 3)
+		     (const_int 4) (const_int 5)
+		     (const_int 6) (const_int 7)
+		     (const_int 8) (const_int 9)
+		     (const_int 10) (const_int 11)
+		     (const_int 12) (const_int 13)
+		     (const_int 14) (const_int 15)])))]
+  "TARGET_AVX512BW && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 0)
+	(truncate:V16HI (match_dup 1)))]
+  "operands[1] = lowpart_subreg (V16SImode, operands[1], V32HImode);")
+
+(define_insn_and_split "*avx512f_permvar_truncv8siv8hi_1"
+  [(set (match_operand:V8HI 0 "nonimmediate_operand")
+	(vec_select:V8HI
+	  (unspec:V16HI
+	    [(match_operand:V16HI 1 "register_operand")
+	     (match_operand:V16HI 2 "permvar_truncate_operand")]
+	   UNSPEC_VPERMVAR)
+	  (parallel [(const_int 0) (const_int 1)
+		     (const_int 2) (const_int 3)
+		     (const_int 4) (const_int 5)
+		     (const_int 6) (const_int 7)])))]
+  "TARGET_AVX512VL && TARGET_AVX512BW && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 0)
+	(truncate:V8HI (match_dup 1)))]
+  "operands[1] = lowpart_subreg (V8SImode, operands[1], V16HImode);")
+
+(define_insn_and_split "*avx512f_vpermvar_truncv8div8si_1"
+  [(set (match_operand:V8SI 0 "nonimmediate_operand")
+	(vec_select:V8SI
+	  (unspec:V16SI
+	    [(match_operand:V16SI 1 "register_operand")
+	     (match_operand:V16SI 2 "permvar_truncate_operand")]
+	   UNSPEC_VPERMVAR)
+	  (parallel [(const_int 0) (const_int 1)
+		     (const_int 2) (const_int 3)
+		     (const_int 4) (const_int 5)
+		     (const_int 6) (const_int 7)])))]
+  "TARGET_AVX512F && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 0)
+	(truncate:V8SI (match_dup 1)))]
+  "operands[1] = lowpart_subreg (V8DImode, operands[1], V16SImode);")
+
 (define_insn "avx512f_<code><pmov_src_lower><mode>2_mask"
   [(set (match_operand:PMOV_DST_MODE_1 0 "nonimmediate_operand" "=v,m")
     (vec_merge:PMOV_DST_MODE_1
@@ -11018,6 +11076,36 @@ (define_insn "avx512bw_<code>v32hiv32qi2"
    (set_attr "prefix" "evex")
    (set_attr "mode" "XI")])
 
+(define_insn_and_split "*avx512f_permvar_truncv32hiv32qi_1"
+  [(set (match_operand:V32QI 0 "nonimmediate_operand")
+	(vec_select:V32QI
+	  (unspec:V64QI
+	    [(match_operand:V64QI 1 "register_operand")
+	     (match_operand:V64QI 2 "permvar_truncate_operand")]
+	   UNSPEC_VPERMVAR)
+	  (parallel [(const_int 0) (const_int 1)
+		     (const_int 2) (const_int 3)
+		     (const_int 4) (const_int 5)
+		     (const_int 6) (const_int 7)
+		     (const_int 8) (const_int 9)
+		     (const_int 10) (const_int 11)
+		     (const_int 12) (const_int 13)
+		     (const_int 14) (const_int 15)
+		     (const_int 16) (const_int 17)
+		     (const_int 18) (const_int 19)
+		     (const_int 20) (const_int 21)
+		     (const_int 22) (const_int 23)
+		     (const_int 24) (const_int 25)
+		     (const_int 26) (const_int 27)
+		     (const_int 28) (const_int 29)
+		     (const_int 30) (const_int 31)])))]
+  "TARGET_AVX512VBMI && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 0)
+	(truncate:V32QI (match_dup 1)))]
+  "operands[1] = lowpart_subreg (V32HImode, operands[1], V64QImode);")
+
 (define_insn "avx512bw_<code>v32hiv32qi2_mask"
   [(set (match_operand:V32QI 0 "nonimmediate_operand" "=v,m")
     (vec_merge:V32QI
@@ -11063,6 +11151,45 @@ (define_insn "*avx512vl_<code><ssedoublemodelower><mode>2"
    (set_attr "prefix" "evex")
    (set_attr "mode" "<sseinsnmode>")])
 
+(define_insn_and_split "*avx512f_permvar_truncv16hiv16qi_1"
+  [(set (match_operand:V16QI 0 "nonimmediate_operand")
+	(vec_select:V16QI
+	  (unspec:V32QI
+	    [(match_operand:V32QI 1 "register_operand")
+	     (match_operand:V32QI 2 "permvar_truncate_operand")]
+	   UNSPEC_VPERMVAR)
+	  (parallel [(const_int 0) (const_int 1)
+		     (const_int 2) (const_int 3)
+		     (const_int 4) (const_int 5)
+		     (const_int 6) (const_int 7)
+		     (const_int 8) (const_int 9)
+		     (const_int 10) (const_int 11)
+		     (const_int 12) (const_int 13)
+		     (const_int 14) (const_int 15)])))]
+  "TARGET_AVX512VL && TARGET_AVX512VBMI
+   && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 0)
+	(truncate:V16QI (match_dup 1)))]
+  "operands[1] = lowpart_subreg (V16HImode, operands[1], V32QImode);")
+
+(define_insn_and_split "*avx512f_permvar_truncv4div4si_1"
+  [(set (match_operand:V4SI 0 "nonimmediate_operand")
+	(vec_select:V4SI
+	  (unspec:V8SI
+	    [(match_operand:V8SI 1 "register_operand")
+	     (match_operand:V8SI 2 "permvar_truncate_operand")]
+	   UNSPEC_VPERMVAR)
+	  (parallel [(const_int 0) (const_int 1)
+		     (const_int 2) (const_int 3)])))]
+  "TARGET_AVX512VL && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 0)
+	(truncate:V4SI (match_dup 1)))]
+  "operands[1] = lowpart_subreg (V4DImode, operands[1], V8SImode);")
+
 (define_insn "<avx512>_<code><ssedoublemodelower><mode>2_mask"
   [(set (match_operand:PMOV_DST_MODE_2 0 "nonimmediate_operand" "=v,m")
     (vec_merge:PMOV_DST_MODE_2
@@ -11121,6 +11248,27 @@ (define_insn "avx512vl_<code><mode>v<ssescalarnum>qi2"
    (set_attr "prefix" "evex")
    (set_attr "mode" "TI")])
 
+(define_insn_and_split "*avx512f_pshufb_truncv8hiv8qi_1"
+  [(set (match_operand:DI 0 "register_operand")
+	(vec_select:DI
+	  (subreg:V2DI
+	    (unspec:V16QI
+	      [(match_operand:V16QI 1 "register_operand")
+	       (match_operand:V16QI 2 "pshufb_truncv8hiv8qi_operand")]
+	   UNSPEC_PSHUFB) 0)
+	  (parallel [(const_int 0)])))]
+  "TARGET_AVX512VL && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(const_int 0)]
+{
+  rtx op1 = gen_reg_rtx (V8QImode);
+  operands[1] = lowpart_subreg (V8HImode, operands[1], V16QImode);
+  emit_insn (gen_truncv8hiv8qi2 (op1, operands[1]));
+  emit_move_insn (operands[0], lowpart_subreg (DImode, op1, V8QImode));
+  DONE;
+})
+
 (define_insn "*avx512vl_<code>v2div2qi2_store_1"
   [(set (match_operand:V2QI 0 "memory_operand" "=m")
 	(any_truncate:V2QI
@@ -11476,6 +11624,27 @@ (define_insn "avx512vl_<code><mode>v<ssescalarnum>hi2"
    (set_attr "prefix" "evex")
    (set_attr "mode" "TI")])
 
+(define_insn_and_split "*avx512f_pshufb_truncv4siv4hi_1"
+  [(set (match_operand:DI 0 "register_operand")
+	(vec_select:DI
+	  (subreg:V2DI
+	    (unspec:V16QI
+	      [(match_operand:V16QI 1 "register_operand")
+	       (match_operand:V16QI 2 "pshufb_truncv4siv4hi_operand")]
+	   UNSPEC_PSHUFB) 0)
+	  (parallel [(const_int 0)])))]
+  "TARGET_AVX512VL && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(const_int 0)]
+{
+  rtx op1 = gen_reg_rtx (V4HImode);
+  operands[1] = lowpart_subreg (V4SImode, operands[1], V16QImode);
+  emit_insn (gen_truncv4siv4hi2 (op1, operands[1]));
+  emit_move_insn (operands[0], lowpart_subreg (DImode, op1, V4HImode));
+  DONE;
+})
+
 (define_insn "*avx512vl_<code><mode>v4hi2_store_1"
   [(set (match_operand:V4HI 0 "memory_operand" "=m")
 	(any_truncate:V4HI
@@ -11699,6 +11868,27 @@ (define_insn "avx512vl_<code>v2div2si2"
    (set_attr "prefix" "evex")
    (set_attr "mode" "TI")])
 
+(define_insn_and_split "*avx512f_pshufd_truncv2div2si_1"
+  [(set (match_operand:DI 0 "register_operand")
+	(vec_select:DI
+	  (subreg:V2DI
+	    (vec_select:V4SI
+	      (match_operand:V4SI 1 "register_operand")
+	      (parallel [(const_int 0) (const_int 2)
+			 (const_int 2) (const_int 3)])) 0)
+	  (parallel [(const_int 0)])))]
+  "TARGET_AVX512VL && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(const_int 0)]
+{
+  rtx op1 = gen_reg_rtx (V2SImode);
+  operands[1] = lowpart_subreg (V2DImode, operands[1], V4SImode);
+  emit_insn (gen_truncv2div2si2 (op1, operands[1]));
+  emit_move_insn (operands[0], lowpart_subreg (DImode, op1, V2SImode));
+  DONE;
+})
+
 (define_insn "*avx512vl_<code>v2div2si2_store_1"
   [(set (match_operand:V2SI 0 "memory_operand" "=m")
 	(any_truncate:V2SI
diff --git a/gcc/testsuite/gcc.target/i386/pr101846-2.c b/gcc/testsuite/gcc.target/i386/pr101846-2.c
new file mode 100644
index 00000000000..26c9ed511e5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101846-2.c
@@ -0,0 +1,81 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx512vl -mavx512vbmi -O2" } */
+/* { dg-final { scan-assembler-times "vpmovwb" "3" } } */
+/* { dg-final { scan-assembler-times "vpmovdw" "3" } } */
+/* { dg-final { scan-assembler-times "vpmovqd" "3" } } */
+
+typedef short v4hi __attribute__((vector_size (8)));
+typedef short v8hi __attribute__((vector_size (16)));
+typedef short v16hi __attribute__((vector_size (32)));
+typedef short v32hi __attribute__((vector_size (64)));
+typedef char v8qi __attribute__((vector_size (8)));
+typedef char v16qi __attribute__((vector_size (16)));
+typedef char v32qi __attribute__((vector_size (32)));
+typedef char v64qi __attribute__((vector_size (64)));
+typedef int v2si __attribute__((vector_size (8)));
+typedef int v4si __attribute__((vector_size (16)));
+typedef int v8si __attribute__((vector_size (32)));
+typedef int v16si __attribute__((vector_size (64)));
+
+v16hi
+foo_dw_512 (v32hi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  16, 18, 20, 22, 24, 26, 28, 30);
+}
+
+v8hi
+foo_dw_256 (v16hi x)
+{
+  return __builtin_shufflevector (x, x, 0, 2, 4, 6, 8, 10, 12, 14);
+}
+
+v4hi
+foo_dw_128 (v8hi x)
+{
+  return __builtin_shufflevector (x, x, 0, 2, 4, 6);
+}
+
+v8si
+foo_qd_512 (v16si x)
+{
+  return __builtin_shufflevector (x, x, 0, 2, 4, 6, 8, 10, 12, 14);
+}
+
+v4si
+foo_qd_256 (v8si x)
+{
+  return __builtin_shufflevector (x, x, 0, 2, 4, 6);
+}
+
+v2si
+foo_qd_128 (v4si x)
+{
+  return __builtin_shufflevector (x, x, 0, 2);
+}
+
+v32qi
+foo_wb_512 (v64qi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  16, 18, 20, 22, 24, 26, 28, 30,
+				  32, 34, 36, 38, 40, 42, 44, 46,
+				  48, 50, 52, 54, 56, 58, 60, 62);
+}
+
+v16qi
+foo_wb_256 (v32qi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  16, 18, 20, 22, 24, 26, 28, 30);
+}
+
+v8qi
+foo_wb_128 (v16qi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14);
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr101846-3.c b/gcc/testsuite/gcc.target/i386/pr101846-3.c
new file mode 100644
index 00000000000..f774018a382
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101846-3.c
@@ -0,0 +1,73 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx512vl -mavx512vbmi -O2" } */
+/* { dg-final { scan-assembler-times "vpermb" "2" } } */
+/* { dg-final { scan-assembler-times "vpermw" "2" } } */
+/* { dg-final { scan-assembler-times "vpermd" "2" } } */
+
+typedef short v4hi __attribute__((vector_size (8)));
+typedef short v8hi __attribute__((vector_size (16)));
+typedef short v16hi __attribute__((vector_size (32)));
+typedef short v32hi __attribute__((vector_size (64)));
+typedef char v8qi __attribute__((vector_size (8)));
+typedef char v16qi __attribute__((vector_size (16)));
+typedef char v32qi __attribute__((vector_size (32)));
+typedef char v64qi __attribute__((vector_size (64)));
+typedef int v2si __attribute__((vector_size (8)));
+typedef int v4si __attribute__((vector_size (16)));
+typedef int v8si __attribute__((vector_size (32)));
+typedef int v16si __attribute__((vector_size (64)));
+
+v32hi
+foow_512 (v32hi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  16, 18, 20, 22, 24, 26, 28, 30,
+				  16, 17, 18, 19, 20, 21, 22, 23,
+				  24, 25, 26, 27, 28, 29, 30, 31);
+}
+
+v16hi
+foow_256 (v16hi x)
+{
+  return __builtin_shufflevector (x, x, 0, 2, 4, 6, 8, 10, 12, 14,
+				  8, 9, 10, 11, 12, 13, 14, 15);
+}
+
+
+v16si
+food_512 (v16si x)
+{
+  return __builtin_shufflevector (x, x, 0, 2, 4, 6, 8, 10, 12, 14,
+				  8, 9, 10, 11, 12, 13, 14, 15);
+}
+
+v8si
+food_256 (v8si x)
+{
+  return __builtin_shufflevector (x, x, 0, 2, 4, 6, 4, 5, 6, 7);
+}
+
+v64qi
+foob_512 (v64qi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  16, 18, 20, 22, 24, 26, 28, 30,
+				  32, 34, 36, 38, 40, 42, 44, 46,
+				  48, 50, 52, 54, 56, 58, 60, 62,
+				  32, 33, 34, 35, 36, 37, 38, 39,
+				  40, 41, 42, 43, 44, 45, 46, 47,
+				  48, 49, 50, 51, 52, 53, 54, 55,
+				  56, 57, 58, 59, 60, 61, 62, 63);
+}
+
+v32qi
+foob_256 (v32qi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  16, 18, 20, 22, 24, 26, 28, 30,
+				  16, 17, 18, 19, 20, 21, 22, 23,
+				  24, 25, 26, 27, 28, 29, 30, 31);
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr101846-4.c b/gcc/testsuite/gcc.target/i386/pr101846-4.c
new file mode 100644
index 00000000000..2a6163c4d72
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101846-4.c
@@ -0,0 +1,40 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx512vl -mavx512vbmi -O2" } */
+/* { dg-final { scan-assembler-times "vpermi2b" "3" } } */
+
+typedef char v16qi __attribute__((vector_size (16)));
+typedef char v32qi __attribute__((vector_size (32)));
+typedef char v64qi __attribute__((vector_size (64)));
+
+
+v64qi
+foob_512 (v64qi x, v64qi y)
+{
+  return __builtin_shufflevector (x, y,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  16, 18, 20, 22, 24, 26, 28, 30,
+				  32, 34, 36, 38, 40, 42, 44, 46,
+				  48, 50, 52, 54, 56, 58, 60, 62,
+				  64, 65, 66, 67, 68, 69, 70, 71,
+				  72, 73, 74, 77, 79, 74, 72, 70,
+				  89, 88, 78, 86, 85, 75, 83, 82,
+				  112, 108, 101, 100, 86, 96, 97, 95);
+}
+
+v32qi
+foob_256 (v32qi x, v32qi y)
+{
+  return __builtin_shufflevector (x, y,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  16, 18, 20, 22, 24, 26, 28, 30,
+				  32, 34, 36, 38, 40, 42, 44, 46,
+				  48, 50, 52, 54, 56, 58, 60, 62);
+}
+
+v16qi
+foob_128 (v16qi x, v16qi y)
+{
+  return __builtin_shufflevector (x, y,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  16, 18, 20, 22, 24, 26, 28, 30);
+}
-- 
2.27.0


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] [i386] Optimize __builtin_shuffle_vector.
  2021-08-16  5:18             ` [PATCH] [i386] Optimize __builtin_shuffle_vector liuhongt
@ 2021-08-16  7:10               ` Jakub Jelinek
  2021-08-16  7:25                 ` Hongtao Liu
  0 siblings, 1 reply; 14+ messages in thread
From: Jakub Jelinek @ 2021-08-16  7:10 UTC (permalink / raw)
  To: liuhongt; +Cc: gcc-patches, jakub

On Mon, Aug 16, 2021 at 01:18:38PM +0800, liuhongt via Gcc-patches wrote:
> +  /* Accept VNxHImode and VNxQImode now.  */
> +  if (!TARGET_AVX512VL && GET_MODE_SIZE (mode) < 64)
> +    return false;
> +
> +  /* vpermw.  */
> +  if (!TARGET_AVX512BW && inner_size == 2)
> +    return false;
> +
> +  /* vpermb.   */

Too many spaces after dot.

> @@ -18301,7 +18380,7 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d)
>    if (expand_vec_perm_palignr (d, true))
>      return true;
>  
> -  /* Try the AVX512F vperm{s,d} instructions.  */
> +  /* Try the AVX512F vperm{w,b,s,d} and instructions  */

What is the " and" doing there?

> +  /* Check that the permutation is suitable for pmovz{bw,wd,dq}.
> +     For example V16HImode to V8HImode
> +     { 0 2 4 6 8 10 12 14 * * * * * * * * }.  */
> +  for (int i = 0; i != nelt/2; i++)

nelt / 2 please

Otherwise LGTM.

	Jakub


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] [i386] Optimize vec_perm_expr to match vpmov{dw,qd,wb}.
  2021-08-13  9:03             ` Richard Sandiford
@ 2021-08-16  7:19               ` Richard Biener
  0 siblings, 0 replies; 14+ messages in thread
From: Richard Biener @ 2021-08-16  7:19 UTC (permalink / raw)
  To: Richard Sandiford, Jakub Jelinek via Gcc-patches, Hongtao Liu,
	Richard Biener, Jakub Jelinek, liuhongt

On Fri, Aug 13, 2021 at 11:04 AM Richard Sandiford via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > On Fri, Aug 13, 2021 at 09:42:00AM +0800, Hongtao Liu wrote:
> >> > So, I wonder if your new routine shouldn't be instead done after
> >> > in ix86_expand_vec_perm_const_1 after vec_perm_1 among other 2 insn cases
> >> > and handle the other vpmovdw etc. cases in combine splitters (see that we
> >> > only use low half or quarter of the result and transform whatever
> >> > permutation we've used into what we want).
> >> >
> >> Got it, I'll try that way.
> >
> > Note, IMHO the ultimate fix would be to add real support for the
> > __builtin_shufflevector -1 indices (meaning I don't care what will be in
> > that element, perhaps narrowed down to an implementation choice of
> > any element of the input vector(s) or 0).
> > As VEC_PERM_EXPR is currently used for both perms by variable permutation
> > vector and constant, I think we'd need to introduce VEC_PERM_CONST_EXPR,
> > which would be exactly like VEC_PERM_EXPR, except that the last operand
> > would be required to be a VECTOR_CST and that all ones element would mean
> > something different, the I don't care behavior.
> > The GIMPLE side would be fairly easy, except that there should be some
> > optimizations eventually, like when only certain subset of elements of
> > a vector are used later, we can mark the other elements as don't care.
>
> Another alternative I'd wondered about was keeping a single tree code,
> but adding an extra operand with a “care/don't care” mask.  I think
> that would fit with variable-length vectors better.

That sounds reasonable.  Note I avoided "quaternary" ops for
BIT_INSERT_EXPR but I don't see a good way to avoid that for
such VEC_PERM_EXPR extension.

Richard.

> Thanks,
> Richard

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] [i386] Optimize __builtin_shuffle_vector.
  2021-08-16  7:10               ` Jakub Jelinek
@ 2021-08-16  7:25                 ` Hongtao Liu
  2021-08-17  1:45                   ` Hongtao Liu
  0 siblings, 1 reply; 14+ messages in thread
From: Hongtao Liu @ 2021-08-16  7:25 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: liuhongt, GCC Patches, jakub

On Mon, Aug 16, 2021 at 3:11 PM Jakub Jelinek via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Mon, Aug 16, 2021 at 01:18:38PM +0800, liuhongt via Gcc-patches wrote:
> > +  /* Accept VNxHImode and VNxQImode now.  */
> > +  if (!TARGET_AVX512VL && GET_MODE_SIZE (mode) < 64)
> > +    return false;
> > +
> > +  /* vpermw.  */
> > +  if (!TARGET_AVX512BW && inner_size == 2)
> > +    return false;
> > +
> > +  /* vpermb.   */
>
> Too many spaces after dot.
>
> > @@ -18301,7 +18380,7 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d)
> >    if (expand_vec_perm_palignr (d, true))
> >      return true;
> >
> > -  /* Try the AVX512F vperm{s,d} instructions.  */
> > +  /* Try the AVX512F vperm{w,b,s,d} and instructions  */
>
> What is the " and" doing there?
Typo.
>
> > +  /* Check that the permutation is suitable for pmovz{bw,wd,dq}.
> > +     For example V16HImode to V8HImode
> > +     { 0 2 4 6 8 10 12 14 * * * * * * * * }.  */
> > +  for (int i = 0; i != nelt/2; i++)
>
> nelt / 2 please
>
> Otherwise LGTM.
>
Thanks for the review.
>         Jakub
>


-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] [i386] Optimize __builtin_shuffle_vector.
  2021-08-16  7:25                 ` Hongtao Liu
@ 2021-08-17  1:45                   ` Hongtao Liu
  0 siblings, 0 replies; 14+ messages in thread
From: Hongtao Liu @ 2021-08-17  1:45 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: liuhongt, GCC Patches, jakub

On Mon, Aug 16, 2021 at 3:25 PM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Mon, Aug 16, 2021 at 3:11 PM Jakub Jelinek via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > On Mon, Aug 16, 2021 at 01:18:38PM +0800, liuhongt via Gcc-patches wrote:
> > > +  /* Accept VNxHImode and VNxQImode now.  */
> > > +  if (!TARGET_AVX512VL && GET_MODE_SIZE (mode) < 64)
> > > +    return false;
> > > +
> > > +  /* vpermw.  */
> > > +  if (!TARGET_AVX512BW && inner_size == 2)
> > > +    return false;
> > > +
> > > +  /* vpermb.   */
> >
> > Too many spaces after dot.
> >
> > > @@ -18301,7 +18380,7 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d)
> > >    if (expand_vec_perm_palignr (d, true))
> > >      return true;
> > >
> > > -  /* Try the AVX512F vperm{s,d} instructions.  */
> > > +  /* Try the AVX512F vperm{w,b,s,d} and instructions  */
> >
> > What is the " and" doing there?
> Typo.
> >
> > > +  /* Check that the permutation is suitable for pmovz{bw,wd,dq}.
> > > +     For example V16HImode to V8HImode
> > > +     { 0 2 4 6 8 10 12 14 * * * * * * * * }.  */
> > > +  for (int i = 0; i != nelt/2; i++)
> >
> > nelt / 2 please
> >
> > Otherwise LGTM.
> >
> Thanks for the review.
> >         Jakub
> >
>
>
> --
> BR,
> Hongtao

This patch caused FAIL: gcc.target/i386/pr82460-2.c scan-assembler-not
\\mvpermi2b\\M with -march=cascadelake

So adjust the testcase with the following patch.

    [i386] Adjust testcase.

    This testcase is used to detect reuse of the perm mask in the main
    loop.  In the epilogue, vpermi2b can still be used, so add the
    option --param=vect-epilogues-nomask=0.

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/pr82460-2.c: Adjust testcase by adding
            --param=vect-epilogues-nomask=0.

diff --git a/gcc/testsuite/gcc.target/i386/pr82460-2.c b/gcc/testsuite/gcc.target/i386/pr82460-2.c
index 4a45beed715..8cdfb54f56a 100644
--- a/gcc/testsuite/gcc.target/i386/pr82460-2.c
+++ b/gcc/testsuite/gcc.target/i386/pr82460-2.c
@@ -1,6 +1,6 @@
 /* PR target/82460 */
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-vectorize -mavx512vbmi -mprefer-vector-width=none" } */
+/* { dg-options "-O2 -ftree-vectorize -mavx512vbmi -mprefer-vector-width=none --param=vect-epilogues-nomask=0" } */
 /* We want to reuse the permutation mask in the loop, so use vpermt2b rather
    than vpermi2b.  */
 /* { dg-final { scan-assembler-not {\mvpermi2b\M} } } */


--
BR,
Hongtao

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2021-08-17  1:40 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-11  6:43 [PATCH] [i386] Combine avx_vec_concatv16si and avx512f_zero_extendv16hiv16si2_1 to avx512f_zero_extendv16hiv16si2_2 liuhongt
2021-08-11  7:58 ` Jakub Jelinek
2021-08-11  9:32   ` Hongtao Liu
2021-08-12  5:43     ` [PATCH] [i386] Optimize vec_perm_expr to match vpmov{dw,qd,wb} liuhongt
2021-08-12  9:22       ` Jakub Jelinek
2021-08-12  9:41         ` Jakub Jelinek
2021-08-13  1:42         ` Hongtao Liu
2021-08-13  8:47           ` Jakub Jelinek
2021-08-13  9:03             ` Richard Sandiford
2021-08-16  7:19               ` Richard Biener
2021-08-16  5:18             ` [PATCH] [i386] Optimize __builtin_shuffle_vector liuhongt
2021-08-16  7:10               ` Jakub Jelinek
2021-08-16  7:25                 ` Hongtao Liu
2021-08-17  1:45                   ` Hongtao Liu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).