[PATCH 0/5] x86: make better use of VPTERNLOG{D,Q}

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [PATCH 0/5] x86: make better use of VPTERNLOG{D,Q}
@ 2023-06-21  6:24 Jan Beulich
  2023-06-21  6:25 ` [PATCH 1/5] x86: use VPTERNLOG for further bitwise two-vector operations Jan Beulich
                   ` (4 more replies)
  0 siblings, 5 replies; 24+ messages in thread
From: Jan Beulich @ 2023-06-21  6:24 UTC (permalink / raw)
  To: gcc-patches; +Cc: Hongtao Liu, Kirill Yukhin

While there are some quite sophisticated 4-operand expanders,
2-operand binary logic which can't be expressed by just VPAND,
VPANDN, VPOR, or VPXOR doesn't utilize this insn to carry out
such operations in a single insn. Therefore the first two
patches address one of the sub-aspects of PR target/93768 (which
imo was closed prematurely), while the latter three ones extend
what was done for PR target/100711.

1: use VPTERNLOG for further bitwise two-vector operations
2: use VPTERNLOG also for certain andnot forms
3: allow memory operand for AVX2 splitter for PR target/100711
4: further PR target/100711-like splitting
5: yet more PR target/100711-like splitting

Jan

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/5] x86: use VPTERNLOG for further bitwise two-vector operations
  2023-06-21  6:24 [PATCH 0/5] x86: make better use of VPTERNLOG{D,Q} Jan Beulich
@ 2023-06-21  6:25 ` Jan Beulich
  2023-06-25  4:42   ` Hongtao Liu
  2023-06-21  6:27 ` [PATCH 2/5] x86: use VPTERNLOG also for certain andnot forms Jan Beulich
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2023-06-21  6:25 UTC (permalink / raw)
  To: gcc-patches; +Cc: Hongtao Liu, Kirill Yukhin

All combinations of and, ior, xor, and not involving two operands can be
expressed that way in a single insn.

gcc/

	PR target/93768
	* config/i386/i386.cc (ix86_rtx_costs): Further special-case
	bitwise vector operations.
	* config/i386/sse.md (*iornot<mode>3): New insn.
	(*xnor<mode>3): Likewise.
	(*<nlogic><mode>3): Likewise.
	(andor): New code iterator.
	(nlogic): New code attribute.
	(ternlog_nlogic): Likewise.

gcc/testsuite/

	PR target/93768
	gcc.target/i386/avx512-binop-not-1.h: New.
	gcc.target/i386/avx512-binop-not-2.h: New.
	gcc.target/i386/avx512f-orn-si-zmm-1.c: New test.
	gcc.target/i386/avx512f-orn-si-zmm-2.c: New test.
---
The use of VI matches that in e.g. one_cmpl<mode>2 /
<mask_codefor>one_cmpl<mode>2<mask_name> and *andnot<mode>3, despite
(here and there)
- V64QI and V32HI being needlessly excluded when AVX512BW isn't enabled,
- V<n>TI not being covered,
- vector modes more narrow than 16 bytes not being covered.

--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -21178,6 +21178,32 @@ ix86_rtx_costs (rtx x, machine_mode mode
       return false;
 
     case IOR:
+      if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT)
+	{
+	  /* (ior (not ...) ...) can be a single insn in AVX512.  */
+	  if (GET_CODE (XEXP (x, 0)) == NOT && TARGET_AVX512F
+	      && (GET_MODE_SIZE (mode) == 64
+		  || (TARGET_AVX512VL
+		      && (GET_MODE_SIZE (mode) == 32
+			  || GET_MODE_SIZE (mode) == 16))))
+	    {
+	      rtx right = GET_CODE (XEXP (x, 1)) != NOT
+			  ? XEXP (x, 1) : XEXP (XEXP (x, 1), 0);
+
+	      *total = ix86_vec_cost (mode, cost->sse_op)
+		       + rtx_cost (XEXP (XEXP (x, 0), 0), mode,
+				   outer_code, opno, speed)
+		       + rtx_cost (right, mode, outer_code, opno, speed);
+	      return true;
+	    }
+	  *total = ix86_vec_cost (mode, cost->sse_op);
+	}
+      else if (GET_MODE_SIZE (mode) > UNITS_PER_WORD)
+	*total = cost->add * 2;
+      else
+	*total = cost->add;
+      return false;
+
     case XOR:
       if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT)
 	*total = ix86_vec_cost (mode, cost->sse_op);
@@ -21198,11 +21224,20 @@ ix86_rtx_costs (rtx x, machine_mode mode
 	  /* pandn is a single instruction.  */
 	  if (GET_CODE (XEXP (x, 0)) == NOT)
 	    {
+	      rtx right = XEXP (x, 1);
+
+	      /* (and (not ...) (not ...)) can be a single insn in AVX512.  */
+	      if (GET_CODE (right) == NOT && TARGET_AVX512F
+		  && (GET_MODE_SIZE (mode) == 64
+		      || (TARGET_AVX512VL
+			  && (GET_MODE_SIZE (mode) == 32
+			      || GET_MODE_SIZE (mode) == 16))))
+		right = XEXP (right, 0);
+
 	      *total = ix86_vec_cost (mode, cost->sse_op)
 		       + rtx_cost (XEXP (XEXP (x, 0), 0), mode,
 				   outer_code, opno, speed)
-		       + rtx_cost (XEXP (x, 1), mode,
-				   outer_code, opno, speed);
+		       + rtx_cost (right, mode, outer_code, opno, speed);
 	      return true;
 	    }
 	  else if (GET_CODE (XEXP (x, 1)) == NOT)
@@ -21260,8 +21295,25 @@ ix86_rtx_costs (rtx x, machine_mode mode
 
     case NOT:
       if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT)
-	// vnot is pxor -1.
-	*total = ix86_vec_cost (mode, cost->sse_op) + 1;
+	{
+	  /* (not (xor ...)) can be a single insn in AVX512.  */
+	  if (GET_CODE (XEXP (x, 0)) == XOR && TARGET_AVX512F
+	      && (GET_MODE_SIZE (mode) == 64
+		  || (TARGET_AVX512VL
+		      && (GET_MODE_SIZE (mode) == 32
+			  || GET_MODE_SIZE (mode) == 16))))
+	    {
+	      *total = ix86_vec_cost (mode, cost->sse_op)
+		       + rtx_cost (XEXP (XEXP (x, 0), 0), mode,
+				   outer_code, opno, speed)
+		       + rtx_cost (XEXP (XEXP (x, 0), 1), mode,
+				   outer_code, opno, speed);
+	      return true;
+	    }
+
+	  // vnot is pxor -1.
+	  *total = ix86_vec_cost (mode, cost->sse_op) + 1;
+	}
       else if (GET_MODE_SIZE (mode) > UNITS_PER_WORD)
 	*total = cost->add * 2;
       else
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -17616,6 +17616,98 @@
   operands[2] = force_reg (V1TImode, CONSTM1_RTX (V1TImode));
 })
 
+(define_insn "*iornot<mode>3"
+  [(set (match_operand:VI 0 "register_operand" "=v,v,v,v")
+	(ior:VI
+	  (not:VI
+	    (match_operand:VI 1 "bcst_vector_operand" "v,Br,v,m"))
+	  (match_operand:VI 2 "bcst_vector_operand" "vBr,v,m,v")))]
+  "(<MODE_SIZE> == 64 || TARGET_AVX512VL
+    || (TARGET_AVX512F && !TARGET_PREFER_AVX256))
+   && (register_operand (operands[1], <MODE>mode)
+       || register_operand (operands[2], <MODE>mode))"
+{
+  if (!register_operand (operands[1], <MODE>mode))
+    {
+      if (TARGET_AVX512VL)
+	return "vpternlog<ternlogsuffix>\t{$0xdd, %1, %2, %0|%0, %2, %1, 0xdd}";
+      return "vpternlog<ternlogsuffix>\t{$0xdd, %g1, %g2, %g0|%g0, %g2, %g1, 0xdd}";
+    }
+  if (TARGET_AVX512VL)
+    return "vpternlog<ternlogsuffix>\t{$0xbb, %2, %1, %0|%0, %1, %2, 0xbb}";
+  return "vpternlog<ternlogsuffix>\t{$0xbb, %g2, %g1, %g0|%g0, %g1, %g2, 0xbb}";
+}
+  [(set_attr "type" "sselog")
+   (set_attr "length_immediate" "1")
+   (set_attr "prefix" "evex")
+   (set (attr "mode")
+        (if_then_else (match_test "TARGET_AVX512VL")
+		      (const_string "<sseinsnmode>")
+		      (const_string "XI")))
+   (set (attr "enabled")
+	(if_then_else (eq_attr "alternative" "2,3")
+		      (symbol_ref "<MODE_SIZE> == 64 || TARGET_AVX512VL")
+		      (const_string "*")))])
+
+(define_insn "*xnor<mode>3"
+  [(set (match_operand:VI 0 "register_operand" "=v,v")
+	(not:VI
+	  (xor:VI
+	    (match_operand:VI 1 "bcst_vector_operand" "%v,v")
+	    (match_operand:VI 2 "bcst_vector_operand" "vBr,m"))))]
+  "(<MODE_SIZE> == 64 || TARGET_AVX512VL
+    || (TARGET_AVX512F && !TARGET_PREFER_AVX256))
+   && (register_operand (operands[1], <MODE>mode)
+       || register_operand (operands[2], <MODE>mode))"
+{
+  if (TARGET_AVX512VL)
+    return "vpternlog<ternlogsuffix>\t{$0x99, %2, %1, %0|%0, %1, %2, 0x99}";
+  else
+    return "vpternlog<ternlogsuffix>\t{$0x99, %g2, %g1, %g0|%g0, %g1, %g2, 0x99}";
+}
+  [(set_attr "type" "sselog")
+   (set_attr "length_immediate" "1")
+   (set_attr "prefix" "evex")
+   (set (attr "mode")
+        (if_then_else (match_test "TARGET_AVX512VL")
+		      (const_string "<sseinsnmode>")
+		      (const_string "XI")))
+   (set (attr "enabled")
+	(if_then_else (eq_attr "alternative" "1")
+		      (symbol_ref "<MODE_SIZE> == 64 || TARGET_AVX512VL")
+		      (const_string "*")))])
+
+(define_code_iterator andor [and ior])
+(define_code_attr nlogic [(and "nor") (ior "nand")])
+(define_code_attr ternlog_nlogic [(and "0x11") (ior "0x77")])
+
+(define_insn "*<nlogic><mode>3"
+  [(set (match_operand:VI 0 "register_operand" "=v,v")
+	(andor:VI
+	  (not:VI (match_operand:VI 1 "bcst_vector_operand" "%v,v"))
+	  (not:VI (match_operand:VI 2 "bcst_vector_operand" "vBr,m"))))]
+  "(<MODE_SIZE> == 64 || TARGET_AVX512VL
+    || (TARGET_AVX512F && !TARGET_PREFER_AVX256))
+   && (register_operand (operands[1], <MODE>mode)
+       || register_operand (operands[2], <MODE>mode))"
+{
+  if (TARGET_AVX512VL)
+    return "vpternlog<ternlogsuffix>\t{$<ternlog_nlogic>, %2, %1, %0|%0, %1, %2, <ternlog_nlogic>}";
+  else
+    return "vpternlog<ternlogsuffix>\t{$<ternlog_nlogic>, %g2, %g1, %g0|%g0, %g1, %g2, <ternlog_nlogic>}";
+}
+  [(set_attr "type" "sselog")
+   (set_attr "length_immediate" "1")
+   (set_attr "prefix" "evex")
+   (set (attr "mode")
+        (if_then_else (match_test "TARGET_AVX512VL")
+		      (const_string "<sseinsnmode>")
+		      (const_string "XI")))
+   (set (attr "enabled")
+	(if_then_else (eq_attr "alternative" "1")
+		      (symbol_ref "<MODE_SIZE> == 64 || TARGET_AVX512VL")
+		      (const_string "*")))])
+
 (define_mode_iterator AVX512ZEXTMASK
   [(DI "TARGET_AVX512BW") (SI "TARGET_AVX512BW") HI])
 
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512-binop-not-1.h
@@ -0,0 +1,13 @@
+#include <immintrin.h>
+
+#define PASTER2(x,y)		x##y
+#define PASTER3(x,y,z)		_mm##x##_##y##_##z
+#define OP(vec, op, suffix)	PASTER3 (vec, op, suffix)
+#define DUP(vec, suffix, val)	PASTER3 (vec, set1, suffix) (val)
+
+type
+foo (type x, SCALAR *f)
+{
+  return OP (vec, op, suffix) (x, OP (vec, xor, suffix) (DUP (vec, suffix, *f),
+							 DUP (vec, suffix, ~0)));
+}
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512-binop-not-2.h
@@ -0,0 +1,13 @@
+#include <immintrin.h>
+
+#define PASTER2(x,y)		x##y
+#define PASTER3(x,y,z)		_mm##x##_##y##_##z
+#define OP(vec, op, suffix)	PASTER3 (vec, op, suffix)
+#define DUP(vec, suffix, val)	PASTER3 (vec, set1, suffix) (val)
+
+type
+foo (type x, SCALAR *f)
+{
+  return OP (vec, op, suffix) (OP (vec, xor, suffix) (x, DUP (vec, suffix, ~0)),
+			       DUP (vec, suffix, *f));
+}
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-orn-si-zmm-1.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx512f -mno-avx512vl -mprefer-vector-width=512 -O2" } */
+/* { dg-final { scan-assembler-times "vpternlogd\[ \\t\]+\\\$0xdd, \\(%(?:eax|rdi|edi)\\)\\\{1to\[1-8\]+\\\}, %zmm\[0-9\]+, %zmm0" 1 } } */
+/* { dg-final { scan-assembler-not "vpbroadcast" } } */
+
+#define type __m512i
+#define vec 512
+#define op or
+#define suffix epi32
+#define SCALAR int
+
+#include "avx512-binop-not-1.h"
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-orn-si-zmm-2.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx512f -mno-avx512vl -mprefer-vector-width=512 -O2" } */
+/* { dg-final { scan-assembler-times "vpternlogd\[ \\t\]+\\\$0xbb, \\(%(?:eax|rdi|edi)\\)\\\{1to\[1-8\]+\\\}, %zmm\[0-9\]+, %zmm0" 1 } } */
+/* { dg-final { scan-assembler-not "vpbroadcast" } } */
+
+#define type __m512i
+#define vec 512
+#define op or
+#define suffix epi32
+#define SCALAR int
+
+#include "avx512-binop-not-2.h"


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 2/5] x86: use VPTERNLOG also for certain andnot forms
  2023-06-21  6:24 [PATCH 0/5] x86: make better use of VPTERNLOG{D,Q} Jan Beulich
  2023-06-21  6:25 ` [PATCH 1/5] x86: use VPTERNLOG for further bitwise two-vector operations Jan Beulich
@ 2023-06-21  6:27 ` Jan Beulich
  2023-06-25  4:58   ` Hongtao Liu
  2023-06-21  6:27 ` [PATCH 3/5] x86: allow memory operand for AVX2 splitter for PR target/100711 Jan Beulich
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2023-06-21  6:27 UTC (permalink / raw)
  To: gcc-patches; +Cc: Hongtao Liu, Kirill Yukhin

When it's the memory operand which is to be inverted, using VPANDN*
requires a further load instruction. The same can be achieved by a
single VPTERNLOG*. Add two new alternatives (for plain memory and
embedded broadcast), adjusting the predicate for the first operand
accordingly.

Two pre-existing testcases actually end up being affected (improved) by
the change, which is reflected in updated expectations there.

gcc/

	PR target/93768
	* config/i386/sse.md (*andnot<mode>3): Add new alternatives
	for memory form operand 1.

gcc/testsuite/

	PR target/93768
	* gcc.target/i386/avx512f-andn-di-zmm-2.c: New test.
	* gcc.target/i386/avx512f-andn-si-zmm-2.c: Adjust expecations
	towards generated code.
	* gcc.target/i386/pr100711-3.c: Adjust expectations for 32-bit
	code.

--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -17210,11 +17210,13 @@
   "TARGET_AVX512F")
 
 (define_insn "*andnot<mode>3"
-  [(set (match_operand:VI 0 "register_operand" "=x,x,v")
+  [(set (match_operand:VI 0 "register_operand" "=x,x,v,v,v")
 	(and:VI
-	  (not:VI (match_operand:VI 1 "vector_operand" "0,x,v"))
-	  (match_operand:VI 2 "bcst_vector_operand" "xBm,xm,vmBr")))]
-  "TARGET_SSE"
+	  (not:VI (match_operand:VI 1 "bcst_vector_operand" "0,x,v,m,Br"))
+	  (match_operand:VI 2 "bcst_vector_operand" "xBm,xm,vmBr,v,v")))]
+  "TARGET_SSE
+   && (register_operand (operands[1], <MODE>mode)
+       || register_operand (operands[2], <MODE>mode))"
 {
   char buf[64];
   const char *ops;
@@ -17281,6 +17283,15 @@
     case 2:
       ops = "v%s%s\t{%%2, %%1, %%0|%%0, %%1, %%2}";
       break;
+    case 3:
+    case 4:
+      tmp = "pternlog";
+      ssesuffix = "<ternlogsuffix>";
+      if (which_alternative != 4 || TARGET_AVX512VL)
+	ops = "v%s%s\t{$0x44, %%1, %%2, %%0|%%0, %%2, %%1, $0x44}";
+      else
+	ops = "v%s%s\t{$0x44, %%g1, %%g2, %%g0|%%g0, %%g2, %%g1, $0x44}";
+      break;
     default:
       gcc_unreachable ();
     }
@@ -17289,7 +17300,7 @@
   output_asm_insn (buf, operands);
   return "";
 }
-  [(set_attr "isa" "noavx,avx,avx")
+  [(set_attr "isa" "noavx,avx,avx,*,*")
    (set_attr "type" "sselog")
    (set (attr "prefix_data16")
      (if_then_else
@@ -17297,9 +17308,12 @@
 	    (eq_attr "mode" "TI"))
        (const_string "1")
        (const_string "*")))
-   (set_attr "prefix" "orig,vex,evex")
+   (set_attr "prefix" "orig,vex,evex,evex,evex")
    (set (attr "mode")
-	(cond [(match_test "TARGET_AVX2")
+	(cond [(and (eq_attr "alternative" "3,4")
+		    (match_test "<MODE_SIZE> < 64 && !TARGET_AVX512VL"))
+		 (const_string "XI")
+	       (match_test "TARGET_AVX2")
 		 (const_string "<sseinsnmode>")
 	       (match_test "TARGET_AVX")
 		 (if_then_else
@@ -17310,7 +17324,15 @@
 		    (match_test "optimize_function_for_size_p (cfun)"))
 		 (const_string "V4SF")
 	      ]
-	      (const_string "<sseinsnmode>")))])
+	      (const_string "<sseinsnmode>")))
+   (set (attr "enabled")
+	(cond [(eq_attr "alternative" "3")
+		 (symbol_ref "<MODE_SIZE> == 64 || TARGET_AVX512VL")
+	       (eq_attr "alternative" "4")
+		 (symbol_ref "<MODE_SIZE> == 64 || TARGET_AVX512VL
+			      || (TARGET_AVX512F && !TARGET_PREFER_AVX256)")
+	      ]
+	      (const_string "*")))])
 
 ;; PR target/100711: Split notl; vpbroadcastd; vpand as vpbroadcastd; vpandn
 (define_split
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-andn-di-zmm-2.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx512f -mno-avx512vl -mprefer-vector-width=512 -O2" } */
+/* { dg-final { scan-assembler-times "vpternlogq\[ \\t\]+\\\$0x44, \\(%(?:eax|rdi|edi)\\)\\\{1to\[1-8\]+\\\}, %zmm\[0-9\]+, %zmm0" 1 } } */
+/* { dg-final { scan-assembler-not "vpbroadcast" } } */
+
+#define type __m512i
+#define vec 512
+#define op andnot
+#define suffix epi64
+#define SCALAR long long
+
+#include "avx512-binop-2.h"
--- a/gcc/testsuite/gcc.target/i386/avx512f-andn-si-zmm-2.c
+++ b/gcc/testsuite/gcc.target/i386/avx512f-andn-si-zmm-2.c
@@ -1,7 +1,7 @@
 /* { dg-do compile } */
 /* { dg-options "-mavx512f -O2" } */
-/* { dg-final { scan-assembler-times "vpbroadcastd\[^\n\]*%zmm\[0-9\]+" 1 } } */
-/* { dg-final { scan-assembler-times "vpandnd\[^\n\]*%zmm\[0-9\]+" 1 } } */
+/* { dg-final { scan-assembler-times "vpternlogd\[ \\t\]+\\\$0x44, \\(%(?:eax|rdi|edi)\\)\\\{1to\[1-8\]+\\\}, %zmm\[0-9\]+, %zmm0" 1 } } */
+/* { dg-final { scan-assembler-not "vpbroadcast" } } */
 
 #define type __m512i
 #define vec 512
--- a/gcc/testsuite/gcc.target/i386/pr100711-3.c
+++ b/gcc/testsuite/gcc.target/i386/pr100711-3.c
@@ -37,4 +37,6 @@ v8di foo_v8di (long long a, v8di b)
     return (__extension__ (v8di) {~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a}) & b;
 }
 
-/* { dg-final { scan-assembler-times "vpandn" 4 } } */
+/* { dg-final { scan-assembler-times "vpandn" 4 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "vpandn" 2 { target { ia32 } } } } */
+/* { dg-final { scan-assembler-times "vpternlog\[dq\]\[ \\t\]+\\\$0x44" 2 { target { ia32 } } } } */


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 3/5] x86: allow memory operand for AVX2 splitter for PR target/100711
  2023-06-21  6:24 [PATCH 0/5] x86: make better use of VPTERNLOG{D,Q} Jan Beulich
  2023-06-21  6:25 ` [PATCH 1/5] x86: use VPTERNLOG for further bitwise two-vector operations Jan Beulich
  2023-06-21  6:27 ` [PATCH 2/5] x86: use VPTERNLOG also for certain andnot forms Jan Beulich
@ 2023-06-21  6:27 ` Jan Beulich
  2023-06-25  4:58   ` Hongtao Liu
  2023-06-21  6:27 ` [PATCH 4/5] x86: further PR target/100711-like splitting Jan Beulich
  2023-06-21  6:28 ` [PATCH 5/5] x86: yet more " Jan Beulich
  4 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2023-06-21  6:27 UTC (permalink / raw)
  To: gcc-patches; +Cc: Hongtao Liu, Kirill Yukhin

The intended broadcast (with AVX512) can very well be done right from
memory.

gcc/

	* config/i386/sse.md: Permit non-immediate operand 1 in AVX2
	form of splitter for PR target/100711.

--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -17356,7 +17356,7 @@
 	(and:VI_AVX2
 	  (vec_duplicate:VI_AVX2
 	    (not:<ssescalarmode>
-	      (match_operand:<ssescalarmode> 1 "register_operand")))
+	      (match_operand:<ssescalarmode> 1 "nonimmediate_operand")))
 	  (match_operand:VI_AVX2 2 "vector_operand")))]
   "TARGET_AVX2"
   [(set (match_dup 3)


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 4/5] x86: further PR target/100711-like splitting
  2023-06-21  6:24 [PATCH 0/5] x86: make better use of VPTERNLOG{D,Q} Jan Beulich
                   ` (2 preceding siblings ...)
  2023-06-21  6:27 ` [PATCH 3/5] x86: allow memory operand for AVX2 splitter for PR target/100711 Jan Beulich
@ 2023-06-21  6:27 ` Jan Beulich
  2023-06-25  5:06   ` Hongtao Liu
  2023-06-21  6:28 ` [PATCH 5/5] x86: yet more " Jan Beulich
  4 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2023-06-21  6:27 UTC (permalink / raw)
  To: gcc-patches; +Cc: Hongtao Liu, Kirill Yukhin

With respective two-operand bitwise operations now expressable by a
single VPTERNLOG, add splitters to also deal with ior and xor
counterparts of the original and-only case. Note that the splitters need
to be separate, as the placement of "not" differs in the final insns
(*iornot<mode>3, *xnor<mode>3) which are intended to pick up one half of
the result.

gcc/

	* config/i386/sse.md: New splitters to simplify
	not;vec_duplicate;{ior,xor} as vec_duplicate;{iornot,xnor}.

gcc/testsuite/

	* gcc.target/i386/pr100711-4.c: New test.
	* gcc.target/i386/pr100711-5.c: New test.

--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -17366,6 +17366,36 @@
 			(match_dup 2)))]
   "operands[3] = gen_reg_rtx (<MODE>mode);")
 
+(define_split
+  [(set (match_operand:VI 0 "register_operand")
+	(ior:VI
+	  (vec_duplicate:VI
+	    (not:<ssescalarmode>
+	      (match_operand:<ssescalarmode> 1 "nonimmediate_operand")))
+	  (match_operand:VI 2 "vector_operand")))]
+  "<MODE_SIZE> == 64 || TARGET_AVX512VL
+   || (TARGET_AVX512F && !TARGET_PREFER_AVX256)"
+  [(set (match_dup 3)
+	(vec_duplicate:VI (match_dup 1)))
+   (set (match_dup 0)
+	(ior:VI (not:VI (match_dup 3)) (match_dup 2)))]
+  "operands[3] = gen_reg_rtx (<MODE>mode);")
+
+(define_split
+  [(set (match_operand:VI 0 "register_operand")
+	(xor:VI
+	  (vec_duplicate:VI
+	    (not:<ssescalarmode>
+	      (match_operand:<ssescalarmode> 1 "nonimmediate_operand")))
+	  (match_operand:VI 2 "vector_operand")))]
+  "<MODE_SIZE> == 64 || TARGET_AVX512VL
+   || (TARGET_AVX512F && !TARGET_PREFER_AVX256)"
+  [(set (match_dup 3)
+	(vec_duplicate:VI (match_dup 1)))
+   (set (match_dup 0)
+	(not:VI (xor:VI (match_dup 3) (match_dup 2))))]
+  "operands[3] = gen_reg_rtx (<MODE>mode);")
+
 (define_insn "*andnot<mode>3_mask"
   [(set (match_operand:VI48_AVX512VL 0 "register_operand" "=v")
 	(vec_merge:VI48_AVX512VL
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr100711-4.c
@@ -0,0 +1,42 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx512bw -mno-avx512vl -mprefer-vector-width=512 -O2" } */
+
+typedef char v64qi __attribute__ ((vector_size (64)));
+typedef short v32hi __attribute__ ((vector_size (64)));
+typedef int v16si __attribute__ ((vector_size (64)));
+typedef long long v8di __attribute__((vector_size (64)));
+
+v64qi foo_v64qi (char a, v64qi b)
+{
+    return (__extension__ (v64qi) {~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
+                                   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
+                                   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
+                                   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
+				   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
+				   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
+				   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
+				   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a}) | b;
+}
+
+v32hi foo_v32hi (short a, v32hi b)
+{
+    return (__extension__ (v32hi) {~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
+                                   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
+                                   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
+				   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a}) | b;
+}
+
+v16si foo_v16si (int a, v16si b)
+{
+    return (__extension__ (v16si) {~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
+				   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a}) | b;
+}
+
+v8di foo_v8di (long long a, v8di b)
+{
+    return (__extension__ (v8di) {~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a}) | b;
+}
+
+/* { dg-final { scan-assembler-times "vpternlog\[dq\]\[ \\t\]+\\\$0xbb" 4 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "vpternlog\[dq\]\[ \\t\]+\\\$0xbb" 2 { target { ia32 } } } } */
+/* { dg-final { scan-assembler-times "vpternlog\[dq\]\[ \\t\]+\\\$0xdd" 2 { target { ia32 } } } } */
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr100711-5.c
@@ -0,0 +1,40 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx512bw -mno-avx512vl -mprefer-vector-width=512 -O2" } */
+
+typedef char v64qi __attribute__ ((vector_size (64)));
+typedef short v32hi __attribute__ ((vector_size (64)));
+typedef int v16si __attribute__ ((vector_size (64)));
+typedef long long v8di __attribute__((vector_size (64)));
+
+v64qi foo_v64qi (char a, v64qi b)
+{
+    return (__extension__ (v64qi) {~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
+                                   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
+                                   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
+                                   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
+				   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
+				   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
+				   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
+				   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a}) ^ b;
+}
+
+v32hi foo_v32hi (short a, v32hi b)
+{
+    return (__extension__ (v32hi) {~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
+                                   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
+                                   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
+				   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a}) ^ b;
+}
+
+v16si foo_v16si (int a, v16si b)
+{
+    return (__extension__ (v16si) {~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
+				   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a}) ^ b;
+}
+
+v8di foo_v8di (long long a, v8di b)
+{
+    return (__extension__ (v8di) {~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a}) ^ b;
+}
+
+/* { dg-final { scan-assembler-times "vpternlog\[dq\]\[ \\t\]+\\\$0x99" 4 } } */


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 5/5] x86: yet more PR target/100711-like splitting
  2023-06-21  6:24 [PATCH 0/5] x86: make better use of VPTERNLOG{D,Q} Jan Beulich
                   ` (3 preceding siblings ...)
  2023-06-21  6:27 ` [PATCH 4/5] x86: further PR target/100711-like splitting Jan Beulich
@ 2023-06-21  6:28 ` Jan Beulich
  2023-06-25  5:12   ` Hongtao Liu
  4 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2023-06-21  6:28 UTC (permalink / raw)
  To: gcc-patches; +Cc: Hongtao Liu, Kirill Yukhin

Following two-operand bitwise operations, add another splitter to also
deal with not followed by broadcast all on its own, which can be
expressed as simple embedded broadcast instead once a broadcast operand
is actually permitted in the respective insn. While there also permit
a broadcast operand in the corresponding expander.

gcc/

	* config/i386/sse.md: New splitters to simplify
	not;vec_duplicate as a singular vpternlog.
	(one_cmpl<mode>2): Allow broadcast for operand 1.
	(<mask_codefor>one_cmpl<mode>2<mask_name>): Likewise.

gcc/testsuite/

	* gcc.target/i386/pr100711-6.c: New test.
---
For the purpose here (and elsewhere) bcst_vector_operand() (really:
bcst_mem_operand()) isn't permissive enough: We'd want it to allow
128-bit and 256-bit types as well irrespective of AVX512VL being
enabled. This would likely require a new predicate
(bcst_intvec_operand()?) and a new constraint (BR? Bi?). (Yet for name
selection it will want considering that this is applicable to certain
non-calculational FP operations as well.)

--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -17156,7 +17156,7 @@
 
 (define_expand "one_cmpl<mode>2"
   [(set (match_operand:VI 0 "register_operand")
-	(xor:VI (match_operand:VI 1 "vector_operand")
+	(xor:VI (match_operand:VI 1 "bcst_vector_operand")
 		(match_dup 2)))]
   "TARGET_SSE"
 {
@@ -17168,7 +17168,7 @@
 
 (define_insn "<mask_codefor>one_cmpl<mode>2<mask_name>"
   [(set (match_operand:VI 0 "register_operand" "=v,v")
-	(xor:VI (match_operand:VI 1 "nonimmediate_operand" "v,m")
+	(xor:VI (match_operand:VI 1 "bcst_vector_operand" "vBr,m")
 		(match_operand:VI 2 "vector_all_ones_operand" "BC,BC")))]
   "TARGET_AVX512F
    && (!<mask_applied>
@@ -17191,6 +17191,19 @@
 		      (symbol_ref "<MODE_SIZE> == 64 || TARGET_AVX512VL")
 		      (const_int 1)))])
 
+(define_split
+  [(set (match_operand:VI48_AVX512F 0 "register_operand")
+	(vec_duplicate:VI48_AVX512F
+	  (not:<ssescalarmode>
+	    (match_operand:<ssescalarmode> 1 "nonimmediate_operand"))))]
+  "<MODE_SIZE> == 64 || TARGET_AVX512VL
+   || (TARGET_AVX512F && !TARGET_PREFER_AVX256)"
+  [(set (match_dup 0)
+	(xor:VI48_AVX512F
+	  (vec_duplicate:VI48_AVX512F (match_dup 1))
+	  (match_dup 2)))]
+  "operands[2] = CONSTM1_RTX (<MODE>mode);")
+
 (define_expand "<sse2_avx2>_andnot<mode>3"
   [(set (match_operand:VI_AVX2 0 "register_operand")
 	(and:VI_AVX2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr100711-6.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx512f -mno-avx512vl -mprefer-vector-width=512 -O2" } */
+
+typedef int v16si __attribute__ ((vector_size (64)));
+typedef long long v8di __attribute__((vector_size (64)));
+
+v16si foo_v16si (const int *a)
+{
+    return (__extension__ (v16si) {~*a, ~*a, ~*a, ~*a, ~*a, ~*a, ~*a, ~*a,
+				   ~*a, ~*a, ~*a, ~*a, ~*a, ~*a, ~*a, ~*a});
+}
+
+v8di foo_v8di (const long long *a)
+{
+    return (__extension__ (v8di) {~*a, ~*a, ~*a, ~*a, ~*a, ~*a, ~*a, ~*a});
+}
+
+/* { dg-final { scan-assembler-times "vpternlog\[dq\]\[ \\t\]+\\\$0x55, \\(%(?:eax|rdi|edi)\\)\\\{1to\[1-8\]+\\\}" 2 } } */


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] x86: use VPTERNLOG for further bitwise two-vector operations
  2023-06-21  6:25 ` [PATCH 1/5] x86: use VPTERNLOG for further bitwise two-vector operations Jan Beulich
@ 2023-06-25  4:42   ` Hongtao Liu
  2023-06-25  5:52     ` Jan Beulich
  0 siblings, 1 reply; 24+ messages in thread
From: Hongtao Liu @ 2023-06-25  4:42 UTC (permalink / raw)
  To: Jan Beulich; +Cc: gcc-patches, Hongtao Liu, Kirill Yukhin

On Wed, Jun 21, 2023 at 2:26 PM Jan Beulich via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> All combinations of and, ior, xor, and not involving two operands can be
> expressed that way in a single insn.
>
> gcc/
>
>         PR target/93768
>         * config/i386/i386.cc (ix86_rtx_costs): Further special-case
>         bitwise vector operations.
>         * config/i386/sse.md (*iornot<mode>3): New insn.
>         (*xnor<mode>3): Likewise.
>         (*<nlogic><mode>3): Likewise.
>         (andor): New code iterator.
>         (nlogic): New code attribute.
>         (ternlog_nlogic): Likewise.
>
> gcc/testsuite/
>
>         PR target/93768
>         gcc.target/i386/avx512-binop-not-1.h: New.
>         gcc.target/i386/avx512-binop-not-2.h: New.
>         gcc.target/i386/avx512f-orn-si-zmm-1.c: New test.
>         gcc.target/i386/avx512f-orn-si-zmm-2.c: New test.
> ---
> The use of VI matches that in e.g. one_cmpl<mode>2 /
> <mask_codefor>one_cmpl<mode>2<mask_name> and *andnot<mode>3, despite
> (here and there)
> - V64QI and V32HI being needlessly excluded when AVX512BW isn't enabled,
> - V<n>TI not being covered,
> - vector modes more narrow than 16 bytes not being covered.
>
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -21178,6 +21178,32 @@ ix86_rtx_costs (rtx x, machine_mode mode
>        return false;
>
>      case IOR:
> +      if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT)
> +       {
> +         /* (ior (not ...) ...) can be a single insn in AVX512.  */
> +         if (GET_CODE (XEXP (x, 0)) == NOT && TARGET_AVX512F
> +             && (GET_MODE_SIZE (mode) == 64
> +                 || (TARGET_AVX512VL
> +                     && (GET_MODE_SIZE (mode) == 32
> +                         || GET_MODE_SIZE (mode) == 16))))
> +           {
> +             rtx right = GET_CODE (XEXP (x, 1)) != NOT
> +                         ? XEXP (x, 1) : XEXP (XEXP (x, 1), 0);
> +
> +             *total = ix86_vec_cost (mode, cost->sse_op)
> +                      + rtx_cost (XEXP (XEXP (x, 0), 0), mode,
> +                                  outer_code, opno, speed)
> +                      + rtx_cost (right, mode, outer_code, opno, speed);
> +             return true;
> +           }
> +         *total = ix86_vec_cost (mode, cost->sse_op);
> +       }
> +      else if (GET_MODE_SIZE (mode) > UNITS_PER_WORD)
> +       *total = cost->add * 2;
> +      else
> +       *total = cost->add;
> +      return false;
> +
>      case XOR:
>        if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT)
>         *total = ix86_vec_cost (mode, cost->sse_op);
> @@ -21198,11 +21224,20 @@ ix86_rtx_costs (rtx x, machine_mode mode
>           /* pandn is a single instruction.  */
>           if (GET_CODE (XEXP (x, 0)) == NOT)
>             {
> +             rtx right = XEXP (x, 1);
> +
> +             /* (and (not ...) (not ...)) can be a single insn in AVX512.  */
> +             if (GET_CODE (right) == NOT && TARGET_AVX512F
> +                 && (GET_MODE_SIZE (mode) == 64
> +                     || (TARGET_AVX512VL
> +                         && (GET_MODE_SIZE (mode) == 32
> +                             || GET_MODE_SIZE (mode) == 16))))
> +               right = XEXP (right, 0);
> +
>               *total = ix86_vec_cost (mode, cost->sse_op)
>                        + rtx_cost (XEXP (XEXP (x, 0), 0), mode,
>                                    outer_code, opno, speed)
> -                      + rtx_cost (XEXP (x, 1), mode,
> -                                  outer_code, opno, speed);
> +                      + rtx_cost (right, mode, outer_code, opno, speed);
>               return true;
>             }
>           else if (GET_CODE (XEXP (x, 1)) == NOT)
> @@ -21260,8 +21295,25 @@ ix86_rtx_costs (rtx x, machine_mode mode
>
>      case NOT:
>        if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT)
> -       // vnot is pxor -1.
> -       *total = ix86_vec_cost (mode, cost->sse_op) + 1;
> +       {
> +         /* (not (xor ...)) can be a single insn in AVX512.  */
> +         if (GET_CODE (XEXP (x, 0)) == XOR && TARGET_AVX512F
> +             && (GET_MODE_SIZE (mode) == 64
> +                 || (TARGET_AVX512VL
> +                     && (GET_MODE_SIZE (mode) == 32
> +                         || GET_MODE_SIZE (mode) == 16))))
> +           {
> +             *total = ix86_vec_cost (mode, cost->sse_op)
> +                      + rtx_cost (XEXP (XEXP (x, 0), 0), mode,
> +                                  outer_code, opno, speed)
> +                      + rtx_cost (XEXP (XEXP (x, 0), 1), mode,
> +                                  outer_code, opno, speed);
> +             return true;
> +           }
> +
> +         // vnot is pxor -1.
> +         *total = ix86_vec_cost (mode, cost->sse_op) + 1;
> +       }
>        else if (GET_MODE_SIZE (mode) > UNITS_PER_WORD)
>         *total = cost->add * 2;
>        else
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -17616,6 +17616,98 @@
>    operands[2] = force_reg (V1TImode, CONSTM1_RTX (V1TImode));
>  })
>
> +(define_insn "*iornot<mode>3"
> +  [(set (match_operand:VI 0 "register_operand" "=v,v,v,v")
> +       (ior:VI
> +         (not:VI
> +           (match_operand:VI 1 "bcst_vector_operand" "v,Br,v,m"))
> +         (match_operand:VI 2 "bcst_vector_operand" "vBr,v,m,v")))]
> +  "(<MODE_SIZE> == 64 || TARGET_AVX512VL
> +    || (TARGET_AVX512F && !TARGET_PREFER_AVX256))
> +   && (register_operand (operands[1], <MODE>mode)
> +       || register_operand (operands[2], <MODE>mode))"
> +{
> +  if (!register_operand (operands[1], <MODE>mode))
> +    {
> +      if (TARGET_AVX512VL)
> +       return "vpternlog<ternlogsuffix>\t{$0xdd, %1, %2, %0|%0, %2, %1, 0xdd}";
> +      return "vpternlog<ternlogsuffix>\t{$0xdd, %g1, %g2, %g0|%g0, %g2, %g1, 0xdd}";
> +    }
> +  if (TARGET_AVX512VL)
> +    return "vpternlog<ternlogsuffix>\t{$0xbb, %2, %1, %0|%0, %1, %2, 0xbb}";
> +  return "vpternlog<ternlogsuffix>\t{$0xbb, %g2, %g1, %g0|%g0, %g1, %g2, 0xbb}";
> +}
> +  [(set_attr "type" "sselog")
> +   (set_attr "length_immediate" "1")
> +   (set_attr "prefix" "evex")
> +   (set (attr "mode")
> +        (if_then_else (match_test "TARGET_AVX512VL")
> +                     (const_string "<sseinsnmode>")
> +                     (const_string "XI")))
> +   (set (attr "enabled")
> +       (if_then_else (eq_attr "alternative" "2,3")
> +                     (symbol_ref "<MODE_SIZE> == 64 || TARGET_AVX512VL")
> +                     (const_string "*")))])
> +
> +(define_insn "*xnor<mode>3"
> +  [(set (match_operand:VI 0 "register_operand" "=v,v")
> +       (not:VI
> +         (xor:VI
> +           (match_operand:VI 1 "bcst_vector_operand" "%v,v")
> +           (match_operand:VI 2 "bcst_vector_operand" "vBr,m"))))]
> +  "(<MODE_SIZE> == 64 || TARGET_AVX512VL
> +    || (TARGET_AVX512F && !TARGET_PREFER_AVX256))
> +   && (register_operand (operands[1], <MODE>mode)
> +       || register_operand (operands[2], <MODE>mode))"
> +{
> +  if (TARGET_AVX512VL)
> +    return "vpternlog<ternlogsuffix>\t{$0x99, %2, %1, %0|%0, %1, %2, 0x99}";
> +  else
> +    return "vpternlog<ternlogsuffix>\t{$0x99, %g2, %g1, %g0|%g0, %g1, %g2, 0x99}";
> +}
> +  [(set_attr "type" "sselog")
> +   (set_attr "length_immediate" "1")
> +   (set_attr "prefix" "evex")
> +   (set (attr "mode")
> +        (if_then_else (match_test "TARGET_AVX512VL")
> +                     (const_string "<sseinsnmode>")
> +                     (const_string "XI")))
> +   (set (attr "enabled")
> +       (if_then_else (eq_attr "alternative" "1")
> +                     (symbol_ref "<MODE_SIZE> == 64 || TARGET_AVX512VL")
> +                     (const_string "*")))])
> +
> +(define_code_iterator andor [and ior])
> +(define_code_attr nlogic [(and "nor") (ior "nand")])
> +(define_code_attr ternlog_nlogic [(and "0x11") (ior "0x77")])
> +
> +(define_insn "*<nlogic><mode>3"
> +  [(set (match_operand:VI 0 "register_operand" "=v,v")
> +       (andor:VI
> +         (not:VI (match_operand:VI 1 "bcst_vector_operand" "%v,v"))
> +         (not:VI (match_operand:VI 2 "bcst_vector_operand" "vBr,m"))))]
I'm thinking of doing it in simplify_rtx or gimple match.pd to transform
(and (not op1))  (not op2)) -> (not: (ior: op1 op2))
(ior (not op1) (not op2)) -> (not : (and op1 op2))

Even w/o avx512f, the transformation should also benefit since it
takes less logic operations 3 -> 2.(or 2 -> 2 for pandn).

The other 2 patterns: *xnor<mode>3 and iornot<mode>3  LGTM.

> +  "(<MODE_SIZE> == 64 || TARGET_AVX512VL
> +    || (TARGET_AVX512F && !TARGET_PREFER_AVX256))
> +   && (register_operand (operands[1], <MODE>mode)
> +       || register_operand (operands[2], <MODE>mode))"
> +{
> +  if (TARGET_AVX512VL)
> +    return "vpternlog<ternlogsuffix>\t{$<ternlog_nlogic>, %2, %1, %0|%0, %1, %2, <ternlog_nlogic>}";
> +  else
> +    return "vpternlog<ternlogsuffix>\t{$<ternlog_nlogic>, %g2, %g1, %g0|%g0, %g1, %g2, <ternlog_nlogic>}";
> +}
> +  [(set_attr "type" "sselog")
> +   (set_attr "length_immediate" "1")
> +   (set_attr "prefix" "evex")
> +   (set (attr "mode")
> +        (if_then_else (match_test "TARGET_AVX512VL")
> +                     (const_string "<sseinsnmode>")
> +                     (const_string "XI")))
> +   (set (attr "enabled")
> +       (if_then_else (eq_attr "alternative" "1")
> +                     (symbol_ref "<MODE_SIZE> == 64 || TARGET_AVX512VL")
> +                     (const_string "*")))])
> +
>  (define_mode_iterator AVX512ZEXTMASK
>    [(DI "TARGET_AVX512BW") (SI "TARGET_AVX512BW") HI])
>
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx512-binop-not-1.h
> @@ -0,0 +1,13 @@
> +#include <immintrin.h>
> +
> +#define PASTER2(x,y)           x##y
> +#define PASTER3(x,y,z)         _mm##x##_##y##_##z
> +#define OP(vec, op, suffix)    PASTER3 (vec, op, suffix)
> +#define DUP(vec, suffix, val)  PASTER3 (vec, set1, suffix) (val)
> +
> +type
> +foo (type x, SCALAR *f)
> +{
> +  return OP (vec, op, suffix) (x, OP (vec, xor, suffix) (DUP (vec, suffix, *f),
> +                                                        DUP (vec, suffix, ~0)));
> +}
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx512-binop-not-2.h
> @@ -0,0 +1,13 @@
> +#include <immintrin.h>
> +
> +#define PASTER2(x,y)           x##y
> +#define PASTER3(x,y,z)         _mm##x##_##y##_##z
> +#define OP(vec, op, suffix)    PASTER3 (vec, op, suffix)
> +#define DUP(vec, suffix, val)  PASTER3 (vec, set1, suffix) (val)
> +
> +type
> +foo (type x, SCALAR *f)
> +{
> +  return OP (vec, op, suffix) (OP (vec, xor, suffix) (x, DUP (vec, suffix, ~0)),
> +                              DUP (vec, suffix, *f));
> +}
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx512f-orn-si-zmm-1.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mavx512f -mno-avx512vl -mprefer-vector-width=512 -O2" } */
> +/* { dg-final { scan-assembler-times "vpternlogd\[ \\t\]+\\\$0xdd, \\(%(?:eax|rdi|edi)\\)\\\{1to\[1-8\]+\\\}, %zmm\[0-9\]+, %zmm0" 1 } } */
> +/* { dg-final { scan-assembler-not "vpbroadcast" } } */
> +
> +#define type __m512i
> +#define vec 512
> +#define op or
> +#define suffix epi32
> +#define SCALAR int
> +
> +#include "avx512-binop-not-1.h"
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx512f-orn-si-zmm-2.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mavx512f -mno-avx512vl -mprefer-vector-width=512 -O2" } */
> +/* { dg-final { scan-assembler-times "vpternlogd\[ \\t\]+\\\$0xbb, \\(%(?:eax|rdi|edi)\\)\\\{1to\[1-8\]+\\\}, %zmm\[0-9\]+, %zmm0" 1 } } */
> +/* { dg-final { scan-assembler-not "vpbroadcast" } } */
> +
> +#define type __m512i
> +#define vec 512
> +#define op or
> +#define suffix epi32
> +#define SCALAR int
> +
> +#include "avx512-binop-not-2.h"
>


-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/5] x86: use VPTERNLOG also for certain andnot forms
  2023-06-21  6:27 ` [PATCH 2/5] x86: use VPTERNLOG also for certain andnot forms Jan Beulich
@ 2023-06-25  4:58   ` Hongtao Liu
  0 siblings, 0 replies; 24+ messages in thread
From: Hongtao Liu @ 2023-06-25  4:58 UTC (permalink / raw)
  To: Jan Beulich; +Cc: gcc-patches, Hongtao Liu, Kirill Yukhin

On Wed, Jun 21, 2023 at 2:27 PM Jan Beulich via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> When it's the memory operand which is to be inverted, using VPANDN*
> requires a further load instruction. The same can be achieved by a
> single VPTERNLOG*. Add two new alternatives (for plain memory and
> embedded broadcast), adjusting the predicate for the first operand
> accordingly.
>
> Two pre-existing testcases actually end up being affected (improved) by
> the change, which is reflected in updated expectations there.
LGTM.
>
> gcc/
>
>         PR target/93768
>         * config/i386/sse.md (*andnot<mode>3): Add new alternatives
>         for memory form operand 1.
>
> gcc/testsuite/
>
>         PR target/93768
>         * gcc.target/i386/avx512f-andn-di-zmm-2.c: New test.
>         * gcc.target/i386/avx512f-andn-si-zmm-2.c: Adjust expecations
>         towards generated code.
>         * gcc.target/i386/pr100711-3.c: Adjust expectations for 32-bit
>         code.
>
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -17210,11 +17210,13 @@
>    "TARGET_AVX512F")
>
>  (define_insn "*andnot<mode>3"
> -  [(set (match_operand:VI 0 "register_operand" "=x,x,v")
> +  [(set (match_operand:VI 0 "register_operand" "=x,x,v,v,v")
>         (and:VI
> -         (not:VI (match_operand:VI 1 "vector_operand" "0,x,v"))
> -         (match_operand:VI 2 "bcst_vector_operand" "xBm,xm,vmBr")))]
> -  "TARGET_SSE"
> +         (not:VI (match_operand:VI 1 "bcst_vector_operand" "0,x,v,m,Br"))
> +         (match_operand:VI 2 "bcst_vector_operand" "xBm,xm,vmBr,v,v")))]
> +  "TARGET_SSE
> +   && (register_operand (operands[1], <MODE>mode)
> +       || register_operand (operands[2], <MODE>mode))"
>  {
>    char buf[64];
>    const char *ops;
> @@ -17281,6 +17283,15 @@
>      case 2:
>        ops = "v%s%s\t{%%2, %%1, %%0|%%0, %%1, %%2}";
>        break;
> +    case 3:
> +    case 4:
> +      tmp = "pternlog";
> +      ssesuffix = "<ternlogsuffix>";
> +      if (which_alternative != 4 || TARGET_AVX512VL)
> +       ops = "v%s%s\t{$0x44, %%1, %%2, %%0|%%0, %%2, %%1, $0x44}";
> +      else
> +       ops = "v%s%s\t{$0x44, %%g1, %%g2, %%g0|%%g0, %%g2, %%g1, $0x44}";
> +      break;
>      default:
>        gcc_unreachable ();
>      }
> @@ -17289,7 +17300,7 @@
>    output_asm_insn (buf, operands);
>    return "";
>  }
> -  [(set_attr "isa" "noavx,avx,avx")
> +  [(set_attr "isa" "noavx,avx,avx,*,*")
>     (set_attr "type" "sselog")
>     (set (attr "prefix_data16")
>       (if_then_else
> @@ -17297,9 +17308,12 @@
>             (eq_attr "mode" "TI"))
>         (const_string "1")
>         (const_string "*")))
> -   (set_attr "prefix" "orig,vex,evex")
> +   (set_attr "prefix" "orig,vex,evex,evex,evex")
>     (set (attr "mode")
> -       (cond [(match_test "TARGET_AVX2")
> +       (cond [(and (eq_attr "alternative" "3,4")
> +                   (match_test "<MODE_SIZE> < 64 && !TARGET_AVX512VL"))
> +                (const_string "XI")
> +              (match_test "TARGET_AVX2")
>                  (const_string "<sseinsnmode>")
>                (match_test "TARGET_AVX")
>                  (if_then_else
> @@ -17310,7 +17324,15 @@
>                     (match_test "optimize_function_for_size_p (cfun)"))
>                  (const_string "V4SF")
>               ]
> -             (const_string "<sseinsnmode>")))])
> +             (const_string "<sseinsnmode>")))
> +   (set (attr "enabled")
> +       (cond [(eq_attr "alternative" "3")
> +                (symbol_ref "<MODE_SIZE> == 64 || TARGET_AVX512VL")
> +              (eq_attr "alternative" "4")
> +                (symbol_ref "<MODE_SIZE> == 64 || TARGET_AVX512VL
> +                             || (TARGET_AVX512F && !TARGET_PREFER_AVX256)")
> +             ]
> +             (const_string "*")))])
>
>  ;; PR target/100711: Split notl; vpbroadcastd; vpand as vpbroadcastd; vpandn
>  (define_split
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx512f-andn-di-zmm-2.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mavx512f -mno-avx512vl -mprefer-vector-width=512 -O2" } */
> +/* { dg-final { scan-assembler-times "vpternlogq\[ \\t\]+\\\$0x44, \\(%(?:eax|rdi|edi)\\)\\\{1to\[1-8\]+\\\}, %zmm\[0-9\]+, %zmm0" 1 } } */
> +/* { dg-final { scan-assembler-not "vpbroadcast" } } */
> +
> +#define type __m512i
> +#define vec 512
> +#define op andnot
> +#define suffix epi64
> +#define SCALAR long long
> +
> +#include "avx512-binop-2.h"
> --- a/gcc/testsuite/gcc.target/i386/avx512f-andn-si-zmm-2.c
> +++ b/gcc/testsuite/gcc.target/i386/avx512f-andn-si-zmm-2.c
> @@ -1,7 +1,7 @@
>  /* { dg-do compile } */
>  /* { dg-options "-mavx512f -O2" } */
> -/* { dg-final { scan-assembler-times "vpbroadcastd\[^\n\]*%zmm\[0-9\]+" 1 } } */
> -/* { dg-final { scan-assembler-times "vpandnd\[^\n\]*%zmm\[0-9\]+" 1 } } */
> +/* { dg-final { scan-assembler-times "vpternlogd\[ \\t\]+\\\$0x44, \\(%(?:eax|rdi|edi)\\)\\\{1to\[1-8\]+\\\}, %zmm\[0-9\]+, %zmm0" 1 } } */
> +/* { dg-final { scan-assembler-not "vpbroadcast" } } */
>
>  #define type __m512i
>  #define vec 512
> --- a/gcc/testsuite/gcc.target/i386/pr100711-3.c
> +++ b/gcc/testsuite/gcc.target/i386/pr100711-3.c
> @@ -37,4 +37,6 @@ v8di foo_v8di (long long a, v8di b)
>      return (__extension__ (v8di) {~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a}) & b;
>  }
>
> -/* { dg-final { scan-assembler-times "vpandn" 4 } } */
> +/* { dg-final { scan-assembler-times "vpandn" 4 { target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-times "vpandn" 2 { target { ia32 } } } } */
> +/* { dg-final { scan-assembler-times "vpternlog\[dq\]\[ \\t\]+\\\$0x44" 2 { target { ia32 } } } } */
>


-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/5] x86: allow memory operand for AVX2 splitter for PR target/100711
  2023-06-21  6:27 ` [PATCH 3/5] x86: allow memory operand for AVX2 splitter for PR target/100711 Jan Beulich
@ 2023-06-25  4:58   ` Hongtao Liu
  0 siblings, 0 replies; 24+ messages in thread
From: Hongtao Liu @ 2023-06-25  4:58 UTC (permalink / raw)
  To: Jan Beulich; +Cc: gcc-patches, Hongtao Liu, Kirill Yukhin

On Wed, Jun 21, 2023 at 2:28 PM Jan Beulich via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> The intended broadcast (with AVX512) can very well be done right from
> memory.
Ok.
>
> gcc/
>
>         * config/i386/sse.md: Permit non-immediate operand 1 in AVX2
>         form of splitter for PR target/100711.
>
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -17356,7 +17356,7 @@
>         (and:VI_AVX2
>           (vec_duplicate:VI_AVX2
>             (not:<ssescalarmode>
> -             (match_operand:<ssescalarmode> 1 "register_operand")))
> +             (match_operand:<ssescalarmode> 1 "nonimmediate_operand")))
>           (match_operand:VI_AVX2 2 "vector_operand")))]
>    "TARGET_AVX2"
>    [(set (match_dup 3)
>


-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/5] x86: further PR target/100711-like splitting
  2023-06-21  6:27 ` [PATCH 4/5] x86: further PR target/100711-like splitting Jan Beulich
@ 2023-06-25  5:06   ` Hongtao Liu
  2023-06-25  6:16     ` Jan Beulich
  0 siblings, 1 reply; 24+ messages in thread
From: Hongtao Liu @ 2023-06-25  5:06 UTC (permalink / raw)
  To: Jan Beulich; +Cc: gcc-patches, Hongtao Liu, Kirill Yukhin

On Wed, Jun 21, 2023 at 2:28 PM Jan Beulich via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> With respective two-operand bitwise operations now expressable by a
> single VPTERNLOG, add splitters to also deal with ior and xor
> counterparts of the original and-only case. Note that the splitters need
> to be separate, as the placement of "not" differs in the final insns
> (*iornot<mode>3, *xnor<mode>3) which are intended to pick up one half of
> the result.
>
> gcc/
>
>         * config/i386/sse.md: New splitters to simplify
>         not;vec_duplicate;{ior,xor} as vec_duplicate;{iornot,xnor}.
>
> gcc/testsuite/
>
>         * gcc.target/i386/pr100711-4.c: New test.
>         * gcc.target/i386/pr100711-5.c: New test.
>
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -17366,6 +17366,36 @@
>                         (match_dup 2)))]
>    "operands[3] = gen_reg_rtx (<MODE>mode);")
>
> +(define_split
> +  [(set (match_operand:VI 0 "register_operand")
> +       (ior:VI
> +         (vec_duplicate:VI
> +           (not:<ssescalarmode>
> +             (match_operand:<ssescalarmode> 1 "nonimmediate_operand")))
> +         (match_operand:VI 2 "vector_operand")))]
> +  "<MODE_SIZE> == 64 || TARGET_AVX512VL
> +   || (TARGET_AVX512F && !TARGET_PREFER_AVX256)"
> +  [(set (match_dup 3)
> +       (vec_duplicate:VI (match_dup 1)))
> +   (set (match_dup 0)
> +       (ior:VI (not:VI (match_dup 3)) (match_dup 2)))]
> +  "operands[3] = gen_reg_rtx (<MODE>mode);")
> +
> +(define_split
> +  [(set (match_operand:VI 0 "register_operand")
> +       (xor:VI
> +         (vec_duplicate:VI
> +           (not:<ssescalarmode>
> +             (match_operand:<ssescalarmode> 1 "nonimmediate_operand")))
> +         (match_operand:VI 2 "vector_operand")))]
> +  "<MODE_SIZE> == 64 || TARGET_AVX512VL
> +   || (TARGET_AVX512F && !TARGET_PREFER_AVX256)"
> +  [(set (match_dup 3)
> +       (vec_duplicate:VI (match_dup 1)))
> +   (set (match_dup 0)
> +       (not:VI (xor:VI (match_dup 3) (match_dup 2))))]
> +  "operands[3] = gen_reg_rtx (<MODE>mode);")
> +
Can we merge this splitter(xor:not) into ior:not one with a code
iterator for xor,ior, They look the same except for the xor/ior.
No need to merge it into and:not case which have different guard conditions.
Others LGTM.
>  (define_insn "*andnot<mode>3_mask"
>    [(set (match_operand:VI48_AVX512VL 0 "register_operand" "=v")
>         (vec_merge:VI48_AVX512VL
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr100711-4.c
> @@ -0,0 +1,42 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mavx512bw -mno-avx512vl -mprefer-vector-width=512 -O2" } */
> +
> +typedef char v64qi __attribute__ ((vector_size (64)));
> +typedef short v32hi __attribute__ ((vector_size (64)));
> +typedef int v16si __attribute__ ((vector_size (64)));
> +typedef long long v8di __attribute__((vector_size (64)));
> +
> +v64qi foo_v64qi (char a, v64qi b)
> +{
> +    return (__extension__ (v64qi) {~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
> +                                   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
> +                                   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
> +                                   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
> +                                  ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
> +                                  ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
> +                                  ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
> +                                  ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a}) | b;
> +}
> +
> +v32hi foo_v32hi (short a, v32hi b)
> +{
> +    return (__extension__ (v32hi) {~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
> +                                   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
> +                                   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
> +                                  ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a}) | b;
> +}
> +
> +v16si foo_v16si (int a, v16si b)
> +{
> +    return (__extension__ (v16si) {~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
> +                                  ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a}) | b;
> +}
> +
> +v8di foo_v8di (long long a, v8di b)
> +{
> +    return (__extension__ (v8di) {~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a}) | b;
> +}
> +
> +/* { dg-final { scan-assembler-times "vpternlog\[dq\]\[ \\t\]+\\\$0xbb" 4 { target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-times "vpternlog\[dq\]\[ \\t\]+\\\$0xbb" 2 { target { ia32 } } } } */
> +/* { dg-final { scan-assembler-times "vpternlog\[dq\]\[ \\t\]+\\\$0xdd" 2 { target { ia32 } } } } */
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr100711-5.c
> @@ -0,0 +1,40 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mavx512bw -mno-avx512vl -mprefer-vector-width=512 -O2" } */
> +
> +typedef char v64qi __attribute__ ((vector_size (64)));
> +typedef short v32hi __attribute__ ((vector_size (64)));
> +typedef int v16si __attribute__ ((vector_size (64)));
> +typedef long long v8di __attribute__((vector_size (64)));
> +
> +v64qi foo_v64qi (char a, v64qi b)
> +{
> +    return (__extension__ (v64qi) {~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
> +                                   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
> +                                   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
> +                                   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
> +                                  ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
> +                                  ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
> +                                  ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
> +                                  ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a}) ^ b;
> +}
> +
> +v32hi foo_v32hi (short a, v32hi b)
> +{
> +    return (__extension__ (v32hi) {~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
> +                                   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
> +                                   ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
> +                                  ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a}) ^ b;
> +}
> +
> +v16si foo_v16si (int a, v16si b)
> +{
> +    return (__extension__ (v16si) {~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a,
> +                                  ~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a}) ^ b;
> +}
> +
> +v8di foo_v8di (long long a, v8di b)
> +{
> +    return (__extension__ (v8di) {~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a}) ^ b;
> +}
> +
> +/* { dg-final { scan-assembler-times "vpternlog\[dq\]\[ \\t\]+\\\$0x99" 4 } } */
>


-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/5] x86: yet more PR target/100711-like splitting
  2023-06-21  6:28 ` [PATCH 5/5] x86: yet more " Jan Beulich
@ 2023-06-25  5:12   ` Hongtao Liu
  2023-06-25  6:25     ` Jan Beulich
  0 siblings, 1 reply; 24+ messages in thread
From: Hongtao Liu @ 2023-06-25  5:12 UTC (permalink / raw)
  To: Jan Beulich; +Cc: gcc-patches, Hongtao Liu, Kirill Yukhin

On Wed, Jun 21, 2023 at 2:29 PM Jan Beulich via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Following two-operand bitwise operations, add another splitter to also
> deal with not followed by broadcast all on its own, which can be
> expressed as simple embedded broadcast instead once a broadcast operand
> is actually permitted in the respective insn. While there also permit
> a broadcast operand in the corresponding expander.
The patch LGTM.
>
> gcc/
>
>         * config/i386/sse.md: New splitters to simplify
>         not;vec_duplicate as a singular vpternlog.
>         (one_cmpl<mode>2): Allow broadcast for operand 1.
>         (<mask_codefor>one_cmpl<mode>2<mask_name>): Likewise.
>
> gcc/testsuite/
>
>         * gcc.target/i386/pr100711-6.c: New test.
> ---
> For the purpose here (and elsewhere) bcst_vector_operand() (really:
> bcst_mem_operand()) isn't permissive enough: We'd want it to allow
> 128-bit and 256-bit types as well irrespective of AVX512VL being
> enabled. This would likely require a new predicate
> (bcst_intvec_operand()?) and a new constraint (BR? Bi?). (Yet for name
> selection it will want considering that this is applicable to certain
> non-calculational FP operations as well.)
I think so.
>
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -17156,7 +17156,7 @@
>
>  (define_expand "one_cmpl<mode>2"
>    [(set (match_operand:VI 0 "register_operand")
> -       (xor:VI (match_operand:VI 1 "vector_operand")
> +       (xor:VI (match_operand:VI 1 "bcst_vector_operand")
>                 (match_dup 2)))]
>    "TARGET_SSE"
>  {
> @@ -17168,7 +17168,7 @@
>
>  (define_insn "<mask_codefor>one_cmpl<mode>2<mask_name>"
>    [(set (match_operand:VI 0 "register_operand" "=v,v")
> -       (xor:VI (match_operand:VI 1 "nonimmediate_operand" "v,m")
> +       (xor:VI (match_operand:VI 1 "bcst_vector_operand" "vBr,m")
>                 (match_operand:VI 2 "vector_all_ones_operand" "BC,BC")))]
>    "TARGET_AVX512F
>     && (!<mask_applied>
> @@ -17191,6 +17191,19 @@
>                       (symbol_ref "<MODE_SIZE> == 64 || TARGET_AVX512VL")
>                       (const_int 1)))])
>
> +(define_split
> +  [(set (match_operand:VI48_AVX512F 0 "register_operand")
> +       (vec_duplicate:VI48_AVX512F
> +         (not:<ssescalarmode>
> +           (match_operand:<ssescalarmode> 1 "nonimmediate_operand"))))]
> +  "<MODE_SIZE> == 64 || TARGET_AVX512VL
> +   || (TARGET_AVX512F && !TARGET_PREFER_AVX256)"
> +  [(set (match_dup 0)
> +       (xor:VI48_AVX512F
> +         (vec_duplicate:VI48_AVX512F (match_dup 1))
> +         (match_dup 2)))]
> +  "operands[2] = CONSTM1_RTX (<MODE>mode);")
> +
>  (define_expand "<sse2_avx2>_andnot<mode>3"
>    [(set (match_operand:VI_AVX2 0 "register_operand")
>         (and:VI_AVX2
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr100711-6.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mavx512f -mno-avx512vl -mprefer-vector-width=512 -O2" } */
> +
> +typedef int v16si __attribute__ ((vector_size (64)));
> +typedef long long v8di __attribute__((vector_size (64)));
> +
> +v16si foo_v16si (const int *a)
> +{
> +    return (__extension__ (v16si) {~*a, ~*a, ~*a, ~*a, ~*a, ~*a, ~*a, ~*a,
> +                                  ~*a, ~*a, ~*a, ~*a, ~*a, ~*a, ~*a, ~*a});
> +}
> +
> +v8di foo_v8di (const long long *a)
> +{
> +    return (__extension__ (v8di) {~*a, ~*a, ~*a, ~*a, ~*a, ~*a, ~*a, ~*a});
> +}
> +
> +/* { dg-final { scan-assembler-times "vpternlog\[dq\]\[ \\t\]+\\\$0x55, \\(%(?:eax|rdi|edi)\\)\\\{1to\[1-8\]+\\\}" 2 } } */
>


-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] x86: use VPTERNLOG for further bitwise two-vector operations
  2023-06-25  4:42   ` Hongtao Liu
@ 2023-06-25  5:52     ` Jan Beulich
  2023-06-25  7:13       ` Hongtao Liu
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2023-06-25  5:52 UTC (permalink / raw)
  To: Hongtao Liu; +Cc: gcc-patches, Hongtao Liu, Kirill Yukhin

On 25.06.2023 06:42, Hongtao Liu wrote:
> On Wed, Jun 21, 2023 at 2:26 PM Jan Beulich via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>>
>> +(define_code_iterator andor [and ior])
>> +(define_code_attr nlogic [(and "nor") (ior "nand")])
>> +(define_code_attr ternlog_nlogic [(and "0x11") (ior "0x77")])
>> +
>> +(define_insn "*<nlogic><mode>3"
>> +  [(set (match_operand:VI 0 "register_operand" "=v,v")
>> +       (andor:VI
>> +         (not:VI (match_operand:VI 1 "bcst_vector_operand" "%v,v"))
>> +         (not:VI (match_operand:VI 2 "bcst_vector_operand" "vBr,m"))))]
> I'm thinking of doing it in simplify_rtx or gimple match.pd to transform
> (and (not op1))  (not op2)) -> (not: (ior: op1 op2))

This wouldn't be a win (not + andn) -> (or + not), but what's
more important is ...

> (ior (not op1) (not op2)) -> (not : (and op1 op2))
> 
> Even w/o avx512f, the transformation should also benefit since it
> takes less logic operations 3 -> 2.(or 2 -> 2 for pandn).

... that these transformations (from the, as per the doc,
canonical representation of nand and nor) are already occurring
in common code, _if_ no suitable insn can be found. That was at
least the conclusion I drew from looking around a lot, supported
by the code that's generated prior to this change.

Jan

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/5] x86: further PR target/100711-like splitting
  2023-06-25  5:06   ` Hongtao Liu
@ 2023-06-25  6:16     ` Jan Beulich
  2023-06-25  6:27       ` Hongtao Liu
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2023-06-25  6:16 UTC (permalink / raw)
  To: Hongtao Liu; +Cc: gcc-patches, Hongtao Liu, Kirill Yukhin

On 25.06.2023 07:06, Hongtao Liu wrote:
> On Wed, Jun 21, 2023 at 2:28 PM Jan Beulich via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>>
>> With respective two-operand bitwise operations now expressable by a
>> single VPTERNLOG, add splitters to also deal with ior and xor
>> counterparts of the original and-only case. Note that the splitters need
>> to be separate, as the placement of "not" differs in the final insns
>> (*iornot<mode>3, *xnor<mode>3) which are intended to pick up one half of
>> the result.
>>
>> gcc/
>>
>>         * config/i386/sse.md: New splitters to simplify
>>         not;vec_duplicate;{ior,xor} as vec_duplicate;{iornot,xnor}.
>>
>> gcc/testsuite/
>>
>>         * gcc.target/i386/pr100711-4.c: New test.
>>         * gcc.target/i386/pr100711-5.c: New test.
>>
>> --- a/gcc/config/i386/sse.md
>> +++ b/gcc/config/i386/sse.md
>> @@ -17366,6 +17366,36 @@
>>                         (match_dup 2)))]
>>    "operands[3] = gen_reg_rtx (<MODE>mode);")
>>
>> +(define_split
>> +  [(set (match_operand:VI 0 "register_operand")
>> +       (ior:VI
>> +         (vec_duplicate:VI
>> +           (not:<ssescalarmode>
>> +             (match_operand:<ssescalarmode> 1 "nonimmediate_operand")))
>> +         (match_operand:VI 2 "vector_operand")))]
>> +  "<MODE_SIZE> == 64 || TARGET_AVX512VL
>> +   || (TARGET_AVX512F && !TARGET_PREFER_AVX256)"
>> +  [(set (match_dup 3)
>> +       (vec_duplicate:VI (match_dup 1)))
>> +   (set (match_dup 0)
>> +       (ior:VI (not:VI (match_dup 3)) (match_dup 2)))]
>> +  "operands[3] = gen_reg_rtx (<MODE>mode);")
>> +
>> +(define_split
>> +  [(set (match_operand:VI 0 "register_operand")
>> +       (xor:VI
>> +         (vec_duplicate:VI
>> +           (not:<ssescalarmode>
>> +             (match_operand:<ssescalarmode> 1 "nonimmediate_operand")))
>> +         (match_operand:VI 2 "vector_operand")))]
>> +  "<MODE_SIZE> == 64 || TARGET_AVX512VL
>> +   || (TARGET_AVX512F && !TARGET_PREFER_AVX256)"
>> +  [(set (match_dup 3)
>> +       (vec_duplicate:VI (match_dup 1)))
>> +   (set (match_dup 0)
>> +       (not:VI (xor:VI (match_dup 3) (match_dup 2))))]
>> +  "operands[3] = gen_reg_rtx (<MODE>mode);")
>> +
> Can we merge this splitter(xor:not) into ior:not one with a code
> iterator for xor,ior, They look the same except for the xor/ior.

They're only almost the same: Note (ior (not )) vs (not (xor )) as
the result of the splitting. The difference is necessary to fit
with what patch 1 introduces (which in turn is the way it is to
fit with what generic code transforms things to up front). (I had
it the way you suggest initially, until I figured why one of the
two would end up never being used.)

Jan

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/5] x86: yet more PR target/100711-like splitting
  2023-06-25  5:12   ` Hongtao Liu
@ 2023-06-25  6:25     ` Jan Beulich
  2023-06-25  6:35       ` Hongtao Liu
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2023-06-25  6:25 UTC (permalink / raw)
  To: Hongtao Liu; +Cc: gcc-patches, Hongtao Liu, Kirill Yukhin

On 25.06.2023 07:12, Hongtao Liu wrote:
> On Wed, Jun 21, 2023 at 2:29 PM Jan Beulich via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>>
>> ---
>> For the purpose here (and elsewhere) bcst_vector_operand() (really:
>> bcst_mem_operand()) isn't permissive enough: We'd want it to allow
>> 128-bit and 256-bit types as well irrespective of AVX512VL being
>> enabled. This would likely require a new predicate
>> (bcst_intvec_operand()?) and a new constraint (BR? Bi?). (Yet for name
>> selection it will want considering that this is applicable to certain
>> non-calculational FP operations as well.)
> I think so.

Any preference towards predicate and constraint naming?

Plus I think there's a more general question behind this: A new
predicate / constraint pair is likely just one way of dealing
with the issue. Another would appear to be to remove the
restriction of 128- and 256-byte types when AVX512VL is not
enabled, but AVX512F is. While that would require touching a
lot of insn constraints, it looks as if lifting that restriction
would "merely" require much wider use of Yv where v is used
right now. But of course I may well be unaware of (some of) the
reasons why that restriction was put in place in the first place
(it can't really be the lack of suitable move insns, as those
can be synthesized by using e.g. vextract{32,64}x4).

Jan

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/5] x86: further PR target/100711-like splitting
  2023-06-25  6:16     ` Jan Beulich
@ 2023-06-25  6:27       ` Hongtao Liu
  0 siblings, 0 replies; 24+ messages in thread
From: Hongtao Liu @ 2023-06-25  6:27 UTC (permalink / raw)
  To: Jan Beulich; +Cc: gcc-patches, Hongtao Liu, Kirill Yukhin

On Sun, Jun 25, 2023 at 2:16 PM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 25.06.2023 07:06, Hongtao Liu wrote:
> > On Wed, Jun 21, 2023 at 2:28 PM Jan Beulich via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> >>
> >> With respective two-operand bitwise operations now expressable by a
> >> single VPTERNLOG, add splitters to also deal with ior and xor
> >> counterparts of the original and-only case. Note that the splitters need
> >> to be separate, as the placement of "not" differs in the final insns
> >> (*iornot<mode>3, *xnor<mode>3) which are intended to pick up one half of
> >> the result.
> >>
> >> gcc/
> >>
> >>         * config/i386/sse.md: New splitters to simplify
> >>         not;vec_duplicate;{ior,xor} as vec_duplicate;{iornot,xnor}.
> >>
> >> gcc/testsuite/
> >>
> >>         * gcc.target/i386/pr100711-4.c: New test.
> >>         * gcc.target/i386/pr100711-5.c: New test.
> >>
> >> --- a/gcc/config/i386/sse.md
> >> +++ b/gcc/config/i386/sse.md
> >> @@ -17366,6 +17366,36 @@
> >>                         (match_dup 2)))]
> >>    "operands[3] = gen_reg_rtx (<MODE>mode);")
> >>
> >> +(define_split
> >> +  [(set (match_operand:VI 0 "register_operand")
> >> +       (ior:VI
> >> +         (vec_duplicate:VI
> >> +           (not:<ssescalarmode>
> >> +             (match_operand:<ssescalarmode> 1 "nonimmediate_operand")))
> >> +         (match_operand:VI 2 "vector_operand")))]
> >> +  "<MODE_SIZE> == 64 || TARGET_AVX512VL
> >> +   || (TARGET_AVX512F && !TARGET_PREFER_AVX256)"
> >> +  [(set (match_dup 3)
> >> +       (vec_duplicate:VI (match_dup 1)))
> >> +   (set (match_dup 0)
> >> +       (ior:VI (not:VI (match_dup 3)) (match_dup 2)))]
> >> +  "operands[3] = gen_reg_rtx (<MODE>mode);")
> >> +
> >> +(define_split
> >> +  [(set (match_operand:VI 0 "register_operand")
> >> +       (xor:VI
> >> +         (vec_duplicate:VI
> >> +           (not:<ssescalarmode>
> >> +             (match_operand:<ssescalarmode> 1 "nonimmediate_operand")))
> >> +         (match_operand:VI 2 "vector_operand")))]
> >> +  "<MODE_SIZE> == 64 || TARGET_AVX512VL
> >> +   || (TARGET_AVX512F && !TARGET_PREFER_AVX256)"
> >> +  [(set (match_dup 3)
> >> +       (vec_duplicate:VI (match_dup 1)))
> >> +   (set (match_dup 0)
> >> +       (not:VI (xor:VI (match_dup 3) (match_dup 2))))]
> >> +  "operands[3] = gen_reg_rtx (<MODE>mode);")
> >> +
> > Can we merge this splitter(xor:not) into ior:not one with a code
> > iterator for xor,ior, They look the same except for the xor/ior.
>
> They're only almost the same: Note (ior (not )) vs (not (xor )) as
> the result of the splitting. The difference is necessary to fit
> with what patch 1 introduces (which in turn is the way it is to
> fit with what generic code transforms things to up front). (I had
> it the way you suggest initially, until I figured why one of the
> two would end up never being used.)
>
3597      /* Convert (XOR (NOT x) (NOT y)) to (XOR x y).
3598         Also convert (XOR (NOT x) y) to (NOT (XOR x y)), similarly for
3599         (NOT y).  */
3600      {
3601        int num_negated = 0;
3602
3603        if (GET_CODE (op0) == NOT)
3604          num_negated++, op0 = XEXP (op0, 0);
3605        if (GET_CODE (op1) == NOT)
3606          num_negated++, op1 = XEXP (op1, 0);

It looks simplify_rtx plays the trick.

And it's documented.
8602@cindex @code{xor}, canonicalization of
 8603@item
 8604The only possible RTL expressions involving both bitwise exclusive-or
 8605and bitwise negation are @code{(xor:@var{m} @var{x} @var{y})}
 8606and @code{(not:@var{m} (xor:@var{m} @var{x} @var{y}))}.

Then the original patch LGTM.

> Jan



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/5] x86: yet more PR target/100711-like splitting
  2023-06-25  6:25     ` Jan Beulich
@ 2023-06-25  6:35       ` Hongtao Liu
  2023-06-25  6:41         ` Hongtao Liu
  0 siblings, 1 reply; 24+ messages in thread
From: Hongtao Liu @ 2023-06-25  6:35 UTC (permalink / raw)
  To: Jan Beulich; +Cc: gcc-patches, Hongtao Liu, Kirill Yukhin

On Sun, Jun 25, 2023 at 2:25 PM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 25.06.2023 07:12, Hongtao Liu wrote:
> > On Wed, Jun 21, 2023 at 2:29 PM Jan Beulich via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> >>
> >> ---
> >> For the purpose here (and elsewhere) bcst_vector_operand() (really:
> >> bcst_mem_operand()) isn't permissive enough: We'd want it to allow
> >> 128-bit and 256-bit types as well irrespective of AVX512VL being
> >> enabled. This would likely require a new predicate
> >> (bcst_intvec_operand()?) and a new constraint (BR? Bi?). (Yet for name
> >> selection it will want considering that this is applicable to certain
> >> non-calculational FP operations as well.)
> > I think so.
>
> Any preference towards predicate and constraint naming?
something like bcst_mem_operand_$suffiix, $suffix indicates the
pattern may use zmm instruction for 128/256-bit operand.
maybe just bcst_mem_operand_zmm?

>
> Plus I think there's a more general question behind this: A new
> predicate / constraint pair is likely just one way of dealing
> with the issue. Another would appear to be to remove the
> restriction of 128- and 256-byte types when AVX512VL is not
> enabled, but AVX512F is. While that would require touching a
> lot of insn constraints, it looks as if lifting that restriction
> would "merely" require much wider use of Yv where v is used
> right now. But of course I may well be unaware of (some of) the
> reasons why that restriction was put in place in the first place
> (it can't really be the lack of suitable move insns, as those
> can be synthesized by using e.g. vextract{32,64}x4).
Also be careful of SIMD Floating-Point Exception if we use the zmm
version for those arithmetic instructions, the upper bits need to be
explicitly cleared for 128/256-bit operand.
For pternlog or other logic instructions, it's ok since there's no
SIMD Floating-Point Exception for such instructions.

>
> Jan



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/5] x86: yet more PR target/100711-like splitting
  2023-06-25  6:35       ` Hongtao Liu
@ 2023-06-25  6:41         ` Hongtao Liu
  2023-11-06 11:10           ` Jan Beulich
  0 siblings, 1 reply; 24+ messages in thread
From: Hongtao Liu @ 2023-06-25  6:41 UTC (permalink / raw)
  To: Jan Beulich; +Cc: gcc-patches, Hongtao Liu, Kirill Yukhin

On Sun, Jun 25, 2023 at 2:35 PM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Sun, Jun 25, 2023 at 2:25 PM Jan Beulich <jbeulich@suse.com> wrote:
> >
> > On 25.06.2023 07:12, Hongtao Liu wrote:
> > > On Wed, Jun 21, 2023 at 2:29 PM Jan Beulich via Gcc-patches
> > > <gcc-patches@gcc.gnu.org> wrote:
> > >>
> > >> ---
> > >> For the purpose here (and elsewhere) bcst_vector_operand() (really:
> > >> bcst_mem_operand()) isn't permissive enough: We'd want it to allow
> > >> 128-bit and 256-bit types as well irrespective of AVX512VL being
> > >> enabled. This would likely require a new predicate
> > >> (bcst_intvec_operand()?) and a new constraint (BR? Bi?). (Yet for name
> > >> selection it will want considering that this is applicable to certain
> > >> non-calculational FP operations as well.)
> > > I think so.
> >
> > Any preference towards predicate and constraint naming?
> something like bcst_mem_operand_$suffiix, $suffix indicates the
> pattern may use zmm instruction for 128/256-bit operand.
> maybe just bcst_mem_operand_zmm?
For constraint, maybe we can reuse Br, relax Br to match bcst_mem_operand_zmm.
For those original patterns with bcst_mem_operand, it should be ok
since it's already guarded by the predicate, the constraint must be
valid.
>
> >
> > Plus I think there's a more general question behind this: A new
> > predicate / constraint pair is likely just one way of dealing
> > with the issue. Another would appear to be to remove the
> > restriction of 128- and 256-byte types when AVX512VL is not
> > enabled, but AVX512F is. While that would require touching a
> > lot of insn constraints, it looks as if lifting that restriction
> > would "merely" require much wider use of Yv where v is used
> > right now. But of course I may well be unaware of (some of) the
> > reasons why that restriction was put in place in the first place
> > (it can't really be the lack of suitable move insns, as those
> > can be synthesized by using e.g. vextract{32,64}x4).
> Also be careful of SIMD Floating-Point Exception if we use the zmm
> version for those arithmetic instructions, the upper bits need to be
> explicitly cleared for 128/256-bit operand.
> For pternlog or other logic instructions, it's ok since there's no
> SIMD Floating-Point Exception for such instructions.
>
> >
> > Jan
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] x86: use VPTERNLOG for further bitwise two-vector operations
  2023-06-25  5:52     ` Jan Beulich
@ 2023-06-25  7:13       ` Hongtao Liu
  2023-06-25  7:23         ` Hongtao Liu
  0 siblings, 1 reply; 24+ messages in thread
From: Hongtao Liu @ 2023-06-25  7:13 UTC (permalink / raw)
  To: Jan Beulich; +Cc: gcc-patches, Hongtao Liu, Kirill Yukhin

On Sun, Jun 25, 2023 at 1:52 PM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 25.06.2023 06:42, Hongtao Liu wrote:
> > On Wed, Jun 21, 2023 at 2:26 PM Jan Beulich via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> >>
> >> +(define_code_iterator andor [and ior])
> >> +(define_code_attr nlogic [(and "nor") (ior "nand")])
> >> +(define_code_attr ternlog_nlogic [(and "0x11") (ior "0x77")])
> >> +
> >> +(define_insn "*<nlogic><mode>3"
> >> +  [(set (match_operand:VI 0 "register_operand" "=v,v")
> >> +       (andor:VI
> >> +         (not:VI (match_operand:VI 1 "bcst_vector_operand" "%v,v"))
> >> +         (not:VI (match_operand:VI 2 "bcst_vector_operand" "vBr,m"))))]
> > I'm thinking of doing it in simplify_rtx or gimple match.pd to transform
> > (and (not op1))  (not op2)) -> (not: (ior: op1 op2))
>
> This wouldn't be a win (not + andn) -> (or + not), but what's
> more important is ...
>
> > (ior (not op1) (not op2)) -> (not : (and op1 op2))
> >
> > Even w/o avx512f, the transformation should also benefit since it
> > takes less logic operations 3 -> 2.(or 2 -> 2 for pandn).
>
> ... that these transformations (from the, as per the doc,
> canonical representation of nand and nor) are already occurring
I see, there're already such simplifications in the gimple phase, so
the question: is there any need for and/ior:not not pattern?
Can you provide a testcase to demonstrate that and/ior: not not
pattern is needed?
> in common code, _if_ no suitable insn can be found. That was at
> least the conclusion I drew from looking around a lot, supported
> by the code that's generated prior to this change.
>
> Jan



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] x86: use VPTERNLOG for further bitwise two-vector operations
  2023-06-25  7:13       ` Hongtao Liu
@ 2023-06-25  7:23         ` Hongtao Liu
  2023-06-25  7:30           ` Hongtao Liu
  0 siblings, 1 reply; 24+ messages in thread
From: Hongtao Liu @ 2023-06-25  7:23 UTC (permalink / raw)
  To: Jan Beulich; +Cc: gcc-patches, Hongtao Liu, Kirill Yukhin

On Sun, Jun 25, 2023 at 3:13 PM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Sun, Jun 25, 2023 at 1:52 PM Jan Beulich <jbeulich@suse.com> wrote:
> >
> > On 25.06.2023 06:42, Hongtao Liu wrote:
> > > On Wed, Jun 21, 2023 at 2:26 PM Jan Beulich via Gcc-patches
> > > <gcc-patches@gcc.gnu.org> wrote:
> > >>
> > >> +(define_code_iterator andor [and ior])
> > >> +(define_code_attr nlogic [(and "nor") (ior "nand")])
> > >> +(define_code_attr ternlog_nlogic [(and "0x11") (ior "0x77")])
> > >> +
> > >> +(define_insn "*<nlogic><mode>3"
> > >> +  [(set (match_operand:VI 0 "register_operand" "=v,v")
> > >> +       (andor:VI
> > >> +         (not:VI (match_operand:VI 1 "bcst_vector_operand" "%v,v"))
> > >> +         (not:VI (match_operand:VI 2 "bcst_vector_operand" "vBr,m"))))]
> > > I'm thinking of doing it in simplify_rtx or gimple match.pd to transform
> > > (and (not op1))  (not op2)) -> (not: (ior: op1 op2))
> >
> > This wouldn't be a win (not + andn) -> (or + not), but what's
> > more important is ...
> >
> > > (ior (not op1) (not op2)) -> (not : (and op1 op2))
> > >
> > > Even w/o avx512f, the transformation should also benefit since it
> > > takes less logic operations 3 -> 2.(or 2 -> 2 for pandn).
> >
> > ... that these transformations (from the, as per the doc,
> > canonical representation of nand and nor) are already occurring
> I see, there're already such simplifications in the gimple phase, so
> the question: is there any need for and/ior:not not pattern?
> Can you provide a testcase to demonstrate that and/ior: not not
> pattern is needed?

typedef int v4si __attribute__((vector_size(16)));
v4si
foo1 (v4si a, v4si b)
{
    return ~a & ~b;
}

I only gimple have optimized it to

  <bb 2> [local count: 1073741824]:
  # DEBUG BEGIN_STMT
  _1 = a_2(D) | b_3(D);
  _4 = ~_1;
  return _4;


But rtl still try to match

(set (reg:V4SI 86)
    (and:V4SI (not:V4SI (reg:V4SI 88))
        (not:V4SI (reg:V4SI 89))))

Hmm.
> > in common code, _if_ no suitable insn can be found. That was at
> > least the conclusion I drew from looking around a lot, supported
> > by the code that's generated prior to this change.
> >
> > Jan
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] x86: use VPTERNLOG for further bitwise two-vector operations
  2023-06-25  7:23         ` Hongtao Liu
@ 2023-06-25  7:30           ` Hongtao Liu
  2023-06-25 13:35             ` Jan Beulich
  0 siblings, 1 reply; 24+ messages in thread
From: Hongtao Liu @ 2023-06-25  7:30 UTC (permalink / raw)
  To: Jan Beulich; +Cc: gcc-patches, Hongtao Liu, Kirill Yukhin

On Sun, Jun 25, 2023 at 3:23 PM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Sun, Jun 25, 2023 at 3:13 PM Hongtao Liu <crazylht@gmail.com> wrote:
> >
> > On Sun, Jun 25, 2023 at 1:52 PM Jan Beulich <jbeulich@suse.com> wrote:
> > >
> > > On 25.06.2023 06:42, Hongtao Liu wrote:
> > > > On Wed, Jun 21, 2023 at 2:26 PM Jan Beulich via Gcc-patches
> > > > <gcc-patches@gcc.gnu.org> wrote:
> > > >>
> > > >> +(define_code_iterator andor [and ior])
> > > >> +(define_code_attr nlogic [(and "nor") (ior "nand")])
> > > >> +(define_code_attr ternlog_nlogic [(and "0x11") (ior "0x77")])
> > > >> +
> > > >> +(define_insn "*<nlogic><mode>3"
> > > >> +  [(set (match_operand:VI 0 "register_operand" "=v,v")
> > > >> +       (andor:VI
> > > >> +         (not:VI (match_operand:VI 1 "bcst_vector_operand" "%v,v"))
> > > >> +         (not:VI (match_operand:VI 2 "bcst_vector_operand" "vBr,m"))))]
> > > > I'm thinking of doing it in simplify_rtx or gimple match.pd to transform
> > > > (and (not op1))  (not op2)) -> (not: (ior: op1 op2))
> > >
> > > This wouldn't be a win (not + andn) -> (or + not), but what's
> > > more important is ...
> > >
> > > > (ior (not op1) (not op2)) -> (not : (and op1 op2))
> > > >
> > > > Even w/o avx512f, the transformation should also benefit since it
> > > > takes less logic operations 3 -> 2.(or 2 -> 2 for pandn).
> > >
> > > ... that these transformations (from the, as per the doc,
> > > canonical representation of nand and nor) are already occurring
> > I see, there're already such simplifications in the gimple phase, so
> > the question: is there any need for and/ior:not not pattern?
> > Can you provide a testcase to demonstrate that and/ior: not not
> > pattern is needed?
>
> typedef int v4si __attribute__((vector_size(16)));
> v4si
> foo1 (v4si a, v4si b)
> {
>     return ~a & ~b;
> }
>
> I only gimple have optimized it to
>
>   <bb 2> [local count: 1073741824]:
>   # DEBUG BEGIN_STMT
>   _1 = a_2(D) | b_3(D);
>   _4 = ~_1;
>   return _4;
>
>
> But rtl still try to match
>
> (set (reg:V4SI 86)
>     (and:V4SI (not:V4SI (reg:V4SI 88))
>         (not:V4SI (reg:V4SI 89))))
>
> Hmm.
In rtl, we're using xor -1 for not, so it's

(insn 8 7 9 2 (set (reg:V4SI 87)
        (ior:V4SI (reg:V4SI 88)
            (reg:V4SI 89))) "/app/example.cpp":6:15 6830 {*iorv4si3}
     (expr_list:REG_DEAD (reg:V4SI 89)
        (expr_list:REG_DEAD (reg:V4SI 88)
            (nil))))
(insn 9 8 14 2 (set (reg:V4SI 86)
        (xor:V4SI (reg:V4SI 87)
            (const_vector:V4SI [
                    (const_int -1 [0xffffffffffffffff]) repeated x4
                ]))) "/app/example.cpp":6:18 6792 {*one_cmplv4si2}

Then simplified to
> (set (reg:V4SI 86)
>     (and:V4SI (not:V4SI (reg:V4SI 88))
>         (not:V4SI (reg:V4SI 89))))
>

by

3565    case XOR:
3566      if (trueop1 == CONST0_RTX (mode))
3567        return op0;
3568      if (INTEGRAL_MODE_P (mode) && trueop1 == CONSTM1_RTX (mode))
3569        return simplify_gen_unary (NOT, mode, op0, mode);

and

1018      /* Apply De Morgan's laws to reduce number of patterns for machines
1019         with negating logical insns (and-not, nand, etc.).  If result has
1020         only one NOT, put it first, since that is how the patterns are
1021         coded.  */
1022      if (GET_CODE (op) == IOR || GET_CODE (op) == AND)
1023        {
1024          rtx in1 = XEXP (op, 0), in2 = XEXP (op, 1);
1025          machine_mode op_mode;
1026
1027          op_mode = GET_MODE (in1);
1028          in1 = simplify_gen_unary (NOT, op_mode, in1, op_mode);
1029
1030          op_mode = GET_MODE (in2);
1031          if (op_mode == VOIDmode)
1032            op_mode = mode;
1033          in2 = simplify_gen_unary (NOT, op_mode, in2, op_mode);
1034
1035          if (GET_CODE (in2) == NOT && GET_CODE (in1) != NOT)
1036            std::swap (in1, in2);
1037
1038          return gen_rtx_fmt_ee (GET_CODE (op) == IOR ? AND : IOR,
1039                                 mode, in1, in2);
1040        }


Ok, got it, and/ior:not not pattern LGTM then.

> > > in common code, _if_ no suitable insn can be found. That was at
> > > least the conclusion I drew from looking around a lot, supported
> > > by the code that's generated prior to this change.
> > >
> > > Jan
> >
> >
> >
> > --
> > BR,
> > Hongtao
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] x86: use VPTERNLOG for further bitwise two-vector operations
  2023-06-25  7:30           ` Hongtao Liu
@ 2023-06-25 13:35             ` Jan Beulich
  2023-06-26  0:42               ` Hongtao Liu
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2023-06-25 13:35 UTC (permalink / raw)
  To: Hongtao Liu; +Cc: gcc-patches, Hongtao Liu, Kirill Yukhin

On 25.06.2023 09:30, Hongtao Liu wrote:
> On Sun, Jun 25, 2023 at 3:23 PM Hongtao Liu <crazylht@gmail.com> wrote:
>>
>> On Sun, Jun 25, 2023 at 3:13 PM Hongtao Liu <crazylht@gmail.com> wrote:
>>>
>>> On Sun, Jun 25, 2023 at 1:52 PM Jan Beulich <jbeulich@suse.com> wrote:
>>>>
>>>> On 25.06.2023 06:42, Hongtao Liu wrote:
>>>>> On Wed, Jun 21, 2023 at 2:26 PM Jan Beulich via Gcc-patches
>>>>> <gcc-patches@gcc.gnu.org> wrote:
>>>>>>
>>>>>> +(define_code_iterator andor [and ior])
>>>>>> +(define_code_attr nlogic [(and "nor") (ior "nand")])
>>>>>> +(define_code_attr ternlog_nlogic [(and "0x11") (ior "0x77")])
>>>>>> +
>>>>>> +(define_insn "*<nlogic><mode>3"
>>>>>> +  [(set (match_operand:VI 0 "register_operand" "=v,v")
>>>>>> +       (andor:VI
>>>>>> +         (not:VI (match_operand:VI 1 "bcst_vector_operand" "%v,v"))
>>>>>> +         (not:VI (match_operand:VI 2 "bcst_vector_operand" "vBr,m"))))]
>>>>> I'm thinking of doing it in simplify_rtx or gimple match.pd to transform
>>>>> (and (not op1))  (not op2)) -> (not: (ior: op1 op2))
>>>>
>>>> This wouldn't be a win (not + andn) -> (or + not), but what's
>>>> more important is ...
>>>>
>>>>> (ior (not op1) (not op2)) -> (not : (and op1 op2))
>>>>>
>>>>> Even w/o avx512f, the transformation should also benefit since it
>>>>> takes less logic operations 3 -> 2.(or 2 -> 2 for pandn).
>>>>
>>>> ... that these transformations (from the, as per the doc,
>>>> canonical representation of nand and nor) are already occurring
>>> I see, there're already such simplifications in the gimple phase, so
>>> the question: is there any need for and/ior:not not pattern?
>>> Can you provide a testcase to demonstrate that and/ior: not not
>>> pattern is needed?
>>
>> typedef int v4si __attribute__((vector_size(16)));
>> v4si
>> foo1 (v4si a, v4si b)
>> {
>>     return ~a & ~b;
>> }
>>
>> I only gimple have optimized it to
>>
>>   <bb 2> [local count: 1073741824]:
>>   # DEBUG BEGIN_STMT
>>   _1 = a_2(D) | b_3(D);
>>   _4 = ~_1;
>>   return _4;
>>
>>
>> But rtl still try to match
>>
>> (set (reg:V4SI 86)
>>     (and:V4SI (not:V4SI (reg:V4SI 88))
>>         (not:V4SI (reg:V4SI 89))))
>>
>> Hmm.
> In rtl, we're using xor -1 for not, so it's
> 
> (insn 8 7 9 2 (set (reg:V4SI 87)
>         (ior:V4SI (reg:V4SI 88)
>             (reg:V4SI 89))) "/app/example.cpp":6:15 6830 {*iorv4si3}
>      (expr_list:REG_DEAD (reg:V4SI 89)
>         (expr_list:REG_DEAD (reg:V4SI 88)
>             (nil))))
> (insn 9 8 14 2 (set (reg:V4SI 86)
>         (xor:V4SI (reg:V4SI 87)
>             (const_vector:V4SI [
>                     (const_int -1 [0xffffffffffffffff]) repeated x4
>                 ]))) "/app/example.cpp":6:18 6792 {*one_cmplv4si2}
> 
> Then simplified to
>> (set (reg:V4SI 86)
>>     (and:V4SI (not:V4SI (reg:V4SI 88))
>>         (not:V4SI (reg:V4SI 89))))
>>
> 
> by
> 
> 3565    case XOR:
> 3566      if (trueop1 == CONST0_RTX (mode))
> 3567        return op0;
> 3568      if (INTEGRAL_MODE_P (mode) && trueop1 == CONSTM1_RTX (mode))
> 3569        return simplify_gen_unary (NOT, mode, op0, mode);
> 
> and
> 
> 1018      /* Apply De Morgan's laws to reduce number of patterns for machines
> 1019         with negating logical insns (and-not, nand, etc.).  If result has
> 1020         only one NOT, put it first, since that is how the patterns are
> 1021         coded.  */
> 1022      if (GET_CODE (op) == IOR || GET_CODE (op) == AND)
> 1023        {
> 1024          rtx in1 = XEXP (op, 0), in2 = XEXP (op, 1);
> 1025          machine_mode op_mode;
> 1026
> 1027          op_mode = GET_MODE (in1);
> 1028          in1 = simplify_gen_unary (NOT, op_mode, in1, op_mode);
> 1029
> 1030          op_mode = GET_MODE (in2);
> 1031          if (op_mode == VOIDmode)
> 1032            op_mode = mode;
> 1033          in2 = simplify_gen_unary (NOT, op_mode, in2, op_mode);
> 1034
> 1035          if (GET_CODE (in2) == NOT && GET_CODE (in1) != NOT)
> 1036            std::swap (in1, in2);
> 1037
> 1038          return gen_rtx_fmt_ee (GET_CODE (op) == IOR ? AND : IOR,
> 1039                                 mode, in1, in2);
> 1040        }
> 
> 
> Ok, got it, and/ior:not not pattern LGTM then.

Just to avoid misunderstandings - together with your initial
reply that's then an "okay" to the patch as a whole, right?

Thanks, Jan

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] x86: use VPTERNLOG for further bitwise two-vector operations
  2023-06-25 13:35             ` Jan Beulich
@ 2023-06-26  0:42               ` Hongtao Liu
  0 siblings, 0 replies; 24+ messages in thread
From: Hongtao Liu @ 2023-06-26  0:42 UTC (permalink / raw)
  To: Jan Beulich; +Cc: gcc-patches, Hongtao Liu, Kirill Yukhin

On Sun, Jun 25, 2023 at 9:35 PM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 25.06.2023 09:30, Hongtao Liu wrote:
> > On Sun, Jun 25, 2023 at 3:23 PM Hongtao Liu <crazylht@gmail.com> wrote:
> >>
> >> On Sun, Jun 25, 2023 at 3:13 PM Hongtao Liu <crazylht@gmail.com> wrote:
> >>>
> >>> On Sun, Jun 25, 2023 at 1:52 PM Jan Beulich <jbeulich@suse.com> wrote:
> >>>>
> >>>> On 25.06.2023 06:42, Hongtao Liu wrote:
> >>>>> On Wed, Jun 21, 2023 at 2:26 PM Jan Beulich via Gcc-patches
> >>>>> <gcc-patches@gcc.gnu.org> wrote:
> >>>>>>
> >>>>>> +(define_code_iterator andor [and ior])
> >>>>>> +(define_code_attr nlogic [(and "nor") (ior "nand")])
> >>>>>> +(define_code_attr ternlog_nlogic [(and "0x11") (ior "0x77")])
> >>>>>> +
> >>>>>> +(define_insn "*<nlogic><mode>3"
> >>>>>> +  [(set (match_operand:VI 0 "register_operand" "=v,v")
> >>>>>> +       (andor:VI
> >>>>>> +         (not:VI (match_operand:VI 1 "bcst_vector_operand" "%v,v"))
> >>>>>> +         (not:VI (match_operand:VI 2 "bcst_vector_operand" "vBr,m"))))]
> >>>>> I'm thinking of doing it in simplify_rtx or gimple match.pd to transform
> >>>>> (and (not op1))  (not op2)) -> (not: (ior: op1 op2))
> >>>>
> >>>> This wouldn't be a win (not + andn) -> (or + not), but what's
> >>>> more important is ...
> >>>>
> >>>>> (ior (not op1) (not op2)) -> (not : (and op1 op2))
> >>>>>
> >>>>> Even w/o avx512f, the transformation should also benefit since it
> >>>>> takes less logic operations 3 -> 2.(or 2 -> 2 for pandn).
> >>>>
> >>>> ... that these transformations (from the, as per the doc,
> >>>> canonical representation of nand and nor) are already occurring
> >>> I see, there're already such simplifications in the gimple phase, so
> >>> the question: is there any need for and/ior:not not pattern?
> >>> Can you provide a testcase to demonstrate that and/ior: not not
> >>> pattern is needed?
> >>
> >> typedef int v4si __attribute__((vector_size(16)));
> >> v4si
> >> foo1 (v4si a, v4si b)
> >> {
> >>     return ~a & ~b;
> >> }
> >>
> >> I only gimple have optimized it to
> >>
> >>   <bb 2> [local count: 1073741824]:
> >>   # DEBUG BEGIN_STMT
> >>   _1 = a_2(D) | b_3(D);
> >>   _4 = ~_1;
> >>   return _4;
> >>
> >>
> >> But rtl still try to match
> >>
> >> (set (reg:V4SI 86)
> >>     (and:V4SI (not:V4SI (reg:V4SI 88))
> >>         (not:V4SI (reg:V4SI 89))))
> >>
> >> Hmm.
> > In rtl, we're using xor -1 for not, so it's
> >
> > (insn 8 7 9 2 (set (reg:V4SI 87)
> >         (ior:V4SI (reg:V4SI 88)
> >             (reg:V4SI 89))) "/app/example.cpp":6:15 6830 {*iorv4si3}
> >      (expr_list:REG_DEAD (reg:V4SI 89)
> >         (expr_list:REG_DEAD (reg:V4SI 88)
> >             (nil))))
> > (insn 9 8 14 2 (set (reg:V4SI 86)
> >         (xor:V4SI (reg:V4SI 87)
> >             (const_vector:V4SI [
> >                     (const_int -1 [0xffffffffffffffff]) repeated x4
> >                 ]))) "/app/example.cpp":6:18 6792 {*one_cmplv4si2}
> >
> > Then simplified to
> >> (set (reg:V4SI 86)
> >>     (and:V4SI (not:V4SI (reg:V4SI 88))
> >>         (not:V4SI (reg:V4SI 89))))
> >>
> >
> > by
> >
> > 3565    case XOR:
> > 3566      if (trueop1 == CONST0_RTX (mode))
> > 3567        return op0;
> > 3568      if (INTEGRAL_MODE_P (mode) && trueop1 == CONSTM1_RTX (mode))
> > 3569        return simplify_gen_unary (NOT, mode, op0, mode);
> >
> > and
> >
> > 1018      /* Apply De Morgan's laws to reduce number of patterns for machines
> > 1019         with negating logical insns (and-not, nand, etc.).  If result has
> > 1020         only one NOT, put it first, since that is how the patterns are
> > 1021         coded.  */
> > 1022      if (GET_CODE (op) == IOR || GET_CODE (op) == AND)
> > 1023        {
> > 1024          rtx in1 = XEXP (op, 0), in2 = XEXP (op, 1);
> > 1025          machine_mode op_mode;
> > 1026
> > 1027          op_mode = GET_MODE (in1);
> > 1028          in1 = simplify_gen_unary (NOT, op_mode, in1, op_mode);
> > 1029
> > 1030          op_mode = GET_MODE (in2);
> > 1031          if (op_mode == VOIDmode)
> > 1032            op_mode = mode;
> > 1033          in2 = simplify_gen_unary (NOT, op_mode, in2, op_mode);
> > 1034
> > 1035          if (GET_CODE (in2) == NOT && GET_CODE (in1) != NOT)
> > 1036            std::swap (in1, in2);
> > 1037
> > 1038          return gen_rtx_fmt_ee (GET_CODE (op) == IOR ? AND : IOR,
> > 1039                                 mode, in1, in2);
> > 1040        }
> >
> >
> > Ok, got it, and/ior:not not pattern LGTM then.
>
> Just to avoid misunderstandings - together with your initial
> reply that's then an "okay" to the patch as a whole, right?
Yes.
>
> Thanks, Jan



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/5] x86: yet more PR target/100711-like splitting
  2023-06-25  6:41         ` Hongtao Liu
@ 2023-11-06 11:10           ` Jan Beulich
  2023-11-06 13:48             ` Hongtao Liu
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2023-11-06 11:10 UTC (permalink / raw)
  To: Hongtao Liu; +Cc: gcc-patches, Hongtao Liu, Kirill Yukhin

On 25.06.2023 08:41, Hongtao Liu wrote:
> On Sun, Jun 25, 2023 at 2:35 PM Hongtao Liu <crazylht@gmail.com> wrote:
>>
>> On Sun, Jun 25, 2023 at 2:25 PM Jan Beulich <jbeulich@suse.com> wrote:
>>>
>>> On 25.06.2023 07:12, Hongtao Liu wrote:
>>>> On Wed, Jun 21, 2023 at 2:29 PM Jan Beulich via Gcc-patches
>>>> <gcc-patches@gcc.gnu.org> wrote:
>>>>>
>>>>> ---
>>>>> For the purpose here (and elsewhere) bcst_vector_operand() (really:
>>>>> bcst_mem_operand()) isn't permissive enough: We'd want it to allow
>>>>> 128-bit and 256-bit types as well irrespective of AVX512VL being
>>>>> enabled. This would likely require a new predicate
>>>>> (bcst_intvec_operand()?) and a new constraint (BR? Bi?). (Yet for name
>>>>> selection it will want considering that this is applicable to certain
>>>>> non-calculational FP operations as well.)
>>>> I think so.
>>>
>>> Any preference towards predicate and constraint naming?
>> something like bcst_mem_operand_$suffiix, $suffix indicates the
>> pattern may use zmm instruction for 128/256-bit operand.
>> maybe just bcst_mem_operand_zmm?
> For constraint, maybe we can reuse Br, relax Br to match bcst_mem_operand_zmm.
> For those original patterns with bcst_mem_operand, it should be ok
> since it's already guarded by the predicate, the constraint must be
> valid.

Hmm, I wanted to get back to this, but then I started wondering about this
reply of yours vs your request to not go farther with the use of "oversized"
insns (i.e. acting in 512-bit registers in lieu of AVX512VL being enabled,
when no FP exceptions can be raised on the otherwise unused elements). Since
iirc the latter came later, am I right in assuming we then also shouldn't go
the route outlined above?

Jan

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/5] x86: yet more PR target/100711-like splitting
  2023-11-06 11:10           ` Jan Beulich
@ 2023-11-06 13:48             ` Hongtao Liu
  0 siblings, 0 replies; 24+ messages in thread
From: Hongtao Liu @ 2023-11-06 13:48 UTC (permalink / raw)
  To: Jan Beulich; +Cc: gcc-patches, Hongtao Liu, Kirill Yukhin

On Mon, Nov 6, 2023 at 7:10 PM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 25.06.2023 08:41, Hongtao Liu wrote:
> > On Sun, Jun 25, 2023 at 2:35 PM Hongtao Liu <crazylht@gmail.com> wrote:
> >>
> >> On Sun, Jun 25, 2023 at 2:25 PM Jan Beulich <jbeulich@suse.com> wrote:
> >>>
> >>> On 25.06.2023 07:12, Hongtao Liu wrote:
> >>>> On Wed, Jun 21, 2023 at 2:29 PM Jan Beulich via Gcc-patches
> >>>> <gcc-patches@gcc.gnu.org> wrote:
> >>>>>
> >>>>> ---
> >>>>> For the purpose here (and elsewhere) bcst_vector_operand() (really:
> >>>>> bcst_mem_operand()) isn't permissive enough: We'd want it to allow
> >>>>> 128-bit and 256-bit types as well irrespective of AVX512VL being
> >>>>> enabled. This would likely require a new predicate
> >>>>> (bcst_intvec_operand()?) and a new constraint (BR? Bi?). (Yet for name
> >>>>> selection it will want considering that this is applicable to certain
> >>>>> non-calculational FP operations as well.)
> >>>> I think so.
> >>>
> >>> Any preference towards predicate and constraint naming?
> >> something like bcst_mem_operand_$suffiix, $suffix indicates the
> >> pattern may use zmm instruction for 128/256-bit operand.
> >> maybe just bcst_mem_operand_zmm?
> > For constraint, maybe we can reuse Br, relax Br to match bcst_mem_operand_zmm.
> > For those original patterns with bcst_mem_operand, it should be ok
> > since it's already guarded by the predicate, the constraint must be
> > valid.
>
> Hmm, I wanted to get back to this, but then I started wondering about this
> reply of yours vs your request to not go farther with the use of "oversized"
> insns (i.e. acting in 512-bit registers in lieu of AVX512VL being enabled,
> when no FP exceptions can be raised on the otherwise unused elements). Since
> iirc the latter came later, am I right in assuming we then also shouldn't go
> the route outlined above?
No, we shouldn't.
This reply is just an answer on how to do it technically, but we don't
really want to do it (considering that all AVX512 processors after SKX
will all support AVX512VL)
>
> Jan



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2023-11-06 13:48 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-21  6:24 [PATCH 0/5] x86: make better use of VPTERNLOG{D,Q} Jan Beulich
2023-06-21  6:25 ` [PATCH 1/5] x86: use VPTERNLOG for further bitwise two-vector operations Jan Beulich
2023-06-25  4:42   ` Hongtao Liu
2023-06-25  5:52     ` Jan Beulich
2023-06-25  7:13       ` Hongtao Liu
2023-06-25  7:23         ` Hongtao Liu
2023-06-25  7:30           ` Hongtao Liu
2023-06-25 13:35             ` Jan Beulich
2023-06-26  0:42               ` Hongtao Liu
2023-06-21  6:27 ` [PATCH 2/5] x86: use VPTERNLOG also for certain andnot forms Jan Beulich
2023-06-25  4:58   ` Hongtao Liu
2023-06-21  6:27 ` [PATCH 3/5] x86: allow memory operand for AVX2 splitter for PR target/100711 Jan Beulich
2023-06-25  4:58   ` Hongtao Liu
2023-06-21  6:27 ` [PATCH 4/5] x86: further PR target/100711-like splitting Jan Beulich
2023-06-25  5:06   ` Hongtao Liu
2023-06-25  6:16     ` Jan Beulich
2023-06-25  6:27       ` Hongtao Liu
2023-06-21  6:28 ` [PATCH 5/5] x86: yet more " Jan Beulich
2023-06-25  5:12   ` Hongtao Liu
2023-06-25  6:25     ` Jan Beulich
2023-06-25  6:35       ` Hongtao Liu
2023-06-25  6:41         ` Hongtao Liu
2023-11-06 11:10           ` Jan Beulich
2023-11-06 13:48             ` Hongtao Liu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).