[PATCH] Use vpmin to optimize some vector comparisons (PR target/88547)

public inbox for gcc-patches@gcc.gnu.org

* [PATCH] Use vpmin to optimize some vector comparisons (PR target/88547)
@ 2018-12-21  9:15 Jakub Jelinek
  2018-12-21 10:41 ` Uros Bizjak
  0 siblings, 1 reply; 2+ messages in thread
From: Jakub Jelinek @ 2018-12-21  9:15 UTC
  To: Uros Bizjak, Richard Henderson; +Cc: gcc-patches

Hi!

The following patch attempts to improve code generated for some
integral vector comparisons and VEC_COND_EXPRs with integral comparisons.
The only available integral vector comparison instructions are GT and
EQ; the rest is handled either by negating the result (for vcond, by swapping
op_true/op_false) or by swapping the comparison operands.  For unsigned
comparisons we use various tricks (either subtracting *_MAX from both operands
for V*SI/V*DImode, or using saturating subtractions for V*QI/V*HImode).

If op_true is -1 and op_false is 0, then, at least when not using AVX512 mask
register comparisons, we have the right result right after the comparison.
So for signed x > y we can just emit
	vpcmpgtd	%ymm1, %ymm0, %ymm0
but if op_true is 0 and op_false is -1, i.e. x <= y, we generate
	vpcmpgtd	%ymm1, %ymm0, %ymm0
	vpcmpeqd	%ymm1, %ymm1, %ymm1
	vpandn	%ymm1, %ymm0, %ymm0
The following patch attempts to detect these cases where we would have
op_true 0 and op_false -1.  Rather than generating a single vpcmpgtd
for the comparison followed by 2 more instructions for the conditional move,
we generate two instructions for the comparison, min (x, y) == x, and don't
need anything for the rest, so:
	vpminud	%ymm1, %ymm0, %ymm1
	vpcmpeqd	%ymm0, %ymm1, %ymm0
In most cases this is done only when
1) mask registers aren't involved
2) op_true == 0 and op_false == -1 (with the *negate considered, as the
   transformation inverts *negate)
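
As a rough scalar model of the two sequences above (a sketch only; the
function names are made up for illustration and nothing here is GCC code),
the following C compares "GT, then invert the mask" against "min, then EQ"
lane by lane for signed 32-bit elements:

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8

/* Baseline shape: compare with GT, then invert the all-ones/zero mask.
   This models vpcmpgtd followed by the vpcmpeqd + vpandn negation.  */
static void le_via_gt_not(const int32_t *x, const int32_t *y, int32_t *out)
{
  for (int i = 0; i < LANES; i++)
    {
      int32_t gt = x[i] > y[i] ? -1 : 0;
      out[i] = ~gt;
    }
}

/* Alternative shape: signed minimum followed by EQ against x.
   This models a signed min instruction followed by vpcmpeqd.  */
static void le_via_min_eq(const int32_t *x, const int32_t *y, int32_t *out)
{
  for (int i = 0; i < LANES; i++)
    {
      int32_t m = x[i] < y[i] ? x[i] : y[i];
      out[i] = m == x[i] ? -1 : 0;
    }
}

/* Asserts that both shapes agree with x <= y on a few boundary values
   and returns the number of lanes checked.  */
static int check_le_forms(void)
{
  const int32_t x[LANES] = { 0, 1, -1, INT32_MIN, INT32_MAX, 5, -7, 42 };
  const int32_t y[LANES] = { 0, -1, 1, INT32_MAX, INT32_MIN, 5, -8, 43 };
  int32_t a[LANES], b[LANES];
  int checked = 0;

  le_via_gt_not(x, y, a);
  le_via_min_eq(x, y, b);
  for (int i = 0; i < LANES; i++)
    {
      assert(a[i] == (x[i] <= y[i] ? -1 : 0));
      assert(a[i] == b[i]);
      checked++;
    }
  return checked;
}
```

Calling check_le_forms () from a test driver exercises both shapes; the
point is only that min (x, y) == x yields the x <= y mask directly, with no
separate inversion step.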

There is one case where this is useful even when those conditions don't
hold: for V*SI/V*DImode unsigned comparisons we generate those
x -= INT_MAX; y -= INT_MAX subtractions before the comparison, so using
vpminu[dq] + vpcmpeq[dq] is shorter regardless of what follows (with the
exception of op_true being -1 and op_false being 0, when both sequences are
the same).
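
To illustrate the unsigned case, here is a scalar C sketch (hypothetical
helper names, not GCC code).  It assumes the bias applied to both operands
is the sign bit, 2^31 for 32-bit elements, which is one way to read the
"subtract *_MAX" trick; the exact constant is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Baseline: unsigned x > y done as a signed GT after subtracting 2^31
   from both operands (two subtractions plus one compare).  The
   subtraction is done in 64 bits so the result always fits int32_t.  */
static int32_t gtu_via_bias(uint32_t x, uint32_t y)
{
  int32_t sx = (int32_t) ((int64_t) x - INT64_C(2147483648));
  int32_t sy = (int32_t) ((int64_t) y - INT64_C(2147483648));
  return sx > sy ? -1 : 0;
}

/* Alternative: unsigned x <= y as umin (x, y) == x (one min plus one
   compare).  The caller gets the inverse of GTU, which is why the
   transformation has to flip *negate.  */
static int32_t leu_via_min(uint32_t x, uint32_t y)
{
  uint32_t m = x < y ? x : y;
  return m == x ? -1 : 0;
}

/* Asserts both forms against the plain C comparison on all pairs of a
   few boundary values and returns the number of pairs checked.  */
static int check_unsigned_forms(void)
{
  static const uint32_t v[] =
    { 0u, 1u, 0x7fffffffu, 0x80000000u, 0x80000001u, 0xffffffffu };
  const size_t n = sizeof v / sizeof v[0];
  int checked = 0;

  for (size_t i = 0; i < n; i++)
    for (size_t j = 0; j < n; j++)
      {
	int32_t gtu = gtu_via_bias(v[i], v[j]);
	int32_t leu = leu_via_min(v[i], v[j]);
	assert(gtu == (v[i] > v[j] ? -1 : 0));
	assert(leu == ~gtu);
	checked++;
      }
  return checked;
}
```

The min-based form needs no bias at all, which is where the saving over
the two subtractions comes from.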

Of course, we can do this optimization only if the corresponding vpmin*
instructions are available, which varies a lot depending on mode (sometimes
SSE2, SSE4.1, AVX2, AVX512F {,+VL}, AVX512BW {,+VL}).
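
The per-mode requirements can be summarized as a small lookup, sketched
here in C.  The helper min_isa is hypothetical and only mirrors the switch
in the i386.c hunk of this patch; the "AVX512F" answer for 64-byte vectors
is an inference (the patch has no explicit target check for those modes),
and modes the switch does not handle return NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of GCC: returns the ISA extension the
   patch requires before using a min instruction for a vector of
   VEC_BYTES bytes with ELT_BYTES-byte elements, or NULL if the patch
   does not apply the transformation for that mode.  */
static const char *min_isa(int vec_bytes, int elt_bytes, int is_unsigned)
{
  switch (vec_bytes)
    {
    case 16:
      switch (elt_bytes)
	{
	case 1: return is_unsigned ? "SSE2" : "SSE4.1";   /* V16QI */
	case 2: return is_unsigned ? "SSE4.1" : "SSE2";   /* V8HI */
	case 4: return "SSE4.1";                          /* V4SI */
	case 8: return "AVX512VL";                        /* V2DI */
	default: return NULL;
	}
    case 32:
      switch (elt_bytes)
	{
	case 1:                                           /* V32QI */
	case 2:                                           /* V16HI */
	case 4: return "AVX2";                            /* V8SI */
	case 8: return "AVX512VL";                        /* V4DI */
	default: return NULL;
	}
    case 64:
      /* Only V16SI and V8DI appear in the switch; assumed to need
	 nothing beyond the AVX512F that 64-byte modes already imply.  */
      return (elt_bytes == 4 || elt_bytes == 8) ? "AVX512F" : NULL;
    default:
      return NULL;
    }
}
```

For example, min_isa (16, 1, 1) gives "SSE2" while min_isa (16, 1, 0) gives
"SSE4.1", matching the asymmetry between unsigned and signed byte minimum
in the E_V16QImode case.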

Fixed thusly, bootstrapped/regtested on x86_64-linux and i686-linux, ok for
trunk?

The reason for the avx512f_cond_move.c testcase adjustment is that it hits
the above second case, where vpminud + vpcmpeqd is shorter, but that means
we transform cmp1 ? 2 : 0, which can use {z} masking, into cmp2 ? 0 : 2,
which can't (still, the whole thing is shorter).  On the other hand, if the
original testcase was cmp1 ? 0 : 2 then we wouldn't use {z} masking before
and with cmp2 ? 2 : 0 we would with the patch.  By changing the comparison
to be signed, we keep testing what I believe the testcase meant to test:
verifying that the combiner can produce {z} masking.

2018-12-21  Jakub Jelinek  <jakub@redhat.com>

	PR target/88547
	* config/i386/i386.c (ix86_expand_int_sse_cmp): Optimize
	x > y ? 0 : -1 into min (x, y) == x ? -1 : 0.

	* gcc.target/i386/pr88547-1.c: Expect only 2 knotb and 2 knotw
	insns instead of 4, check for vpminud, vpminuq and no vpsubd or
	vpsubq.
	* gcc.target/i386/sse2-pr88547-1.c: New test.
	* gcc.target/i386/sse2-pr88547-2.c: New test.
	* gcc.target/i386/sse4_1-pr88547-1.c: New test.
	* gcc.target/i386/sse4_1-pr88547-2.c: New test.
	* gcc.target/i386/avx2-pr88547-1.c: New test.
	* gcc.target/i386/avx2-pr88547-2.c: New test.
	* gcc.target/i386/avx512f-pr88547-2.c: New test.
	* gcc.target/i386/avx512vl-pr88547-1.c: New test.
	* gcc.target/i386/avx512vl-pr88547-2.c: New test.
	* gcc.target/i386/avx512vl-pr88547-3.c: New test.
	* gcc.target/i386/avx512f_cond_move.c (y): Change from unsigned int
	array to int array.

--- gcc/config/i386/i386.c.jj	2018-12-20 18:28:51.118253338 +0100
+++ gcc/config/i386/i386.c	2018-12-21 02:17:23.049042774 +0100
@@ -24126,6 +24126,104 @@ ix86_expand_int_sse_cmp (rtx dest, enum
 	    }
 	}
 
+      rtx optrue = op_true ? op_true : CONSTM1_RTX (data_mode);
+      rtx opfalse = op_false ? op_false : CONST0_RTX (data_mode);
+      if (*negate)
+	std::swap (optrue, opfalse);
+
+      /* Transform x > y ? 0 : -1 (i.e. x <= y ? -1 : 0 or x <= y) when
+	 not using integer masks into min (x, y) == x ? -1 : 0 (i.e.
+	 min (x, y) == x).  While we add one instruction (the minimum),
+	 we remove the need for two instructions in the negation, as the
+	 result is done this way.
+	 When using masks, do it for SI/DImode element types, as it is shorter
+	 than the two subtractions.  */
+      if ((code != EQ
+	   && GET_MODE_SIZE (mode) != 64
+	   && vector_all_ones_operand (opfalse, data_mode)
+	   && optrue == CONST0_RTX (data_mode))
+	  || (code == GTU
+	      && GET_MODE_SIZE (GET_MODE_INNER (mode)) >= 4
+	      /* Don't do it if not using integer masks and we'd end up with
+		 the right values in the registers though.  */
+	      && (GET_MODE_SIZE (mode) == 64
+		  || !vector_all_ones_operand (optrue, data_mode)
+		  || opfalse != CONST0_RTX (data_mode))))
+	{
+	  rtx (*gen) (rtx, rtx, rtx) = NULL;
+
+	  switch (mode)
+	    {
+	    case E_V16SImode:
+	      gen = (code == GTU) ? gen_uminv16si3 : gen_sminv16si3;
+	      break;
+	    case E_V8DImode:
+	      gen = (code == GTU) ? gen_uminv8di3 : gen_sminv8di3;
+	      cop0 = force_reg (mode, cop0);
+	      cop1 = force_reg (mode, cop1);
+	      break;
+	    case E_V32QImode:
+	      if (TARGET_AVX2)
+		gen = (code == GTU) ? gen_uminv32qi3 : gen_sminv32qi3;
+	      break;
+	    case E_V16HImode:
+	      if (TARGET_AVX2)
+		gen = (code == GTU) ? gen_uminv16hi3 : gen_sminv16hi3;
+	      break;
+	    case E_V8SImode:
+	      if (TARGET_AVX2)
+		gen = (code == GTU) ? gen_uminv8si3 : gen_sminv8si3;
+	      break;
+	    case E_V4DImode:
+	      if (TARGET_AVX512VL)
+		{
+		  gen = (code == GTU) ? gen_uminv4di3 : gen_sminv4di3;
+		  cop0 = force_reg (mode, cop0);
+		  cop1 = force_reg (mode, cop1);
+		}
+	      break;
+	    case E_V16QImode:
+	      if (code == GTU && TARGET_SSE2)
+		gen = gen_uminv16qi3;
+	      else if (code == GT && TARGET_SSE4_1)
+		gen = gen_sminv16qi3;
+	      break;
+	    case E_V8HImode:
+	      if (code == GTU && TARGET_SSE4_1)
+		gen = gen_uminv8hi3;
+	      else if (code == GT && TARGET_SSE2)
+		gen = gen_sminv8hi3;
+	      break;
+	    case E_V4SImode:
+	      if (TARGET_SSE4_1)
+		gen = (code == GTU) ? gen_uminv4si3 : gen_sminv4si3;
+	      break;
+	    case E_V2DImode:
+	      if (TARGET_AVX512VL)
+		{
+		  gen = (code == GTU) ? gen_uminv2di3 : gen_sminv2di3;
+		  cop0 = force_reg (mode, cop0);
+		  cop1 = force_reg (mode, cop1);
+		}
+	      break;
+	    default:
+	      break;
+	    }
+
+	  if (gen)
+	    {
+	      rtx tem = gen_reg_rtx (mode);
+	      if (!vector_operand (cop0, mode))
+		cop0 = force_reg (mode, cop0);
+	      if (!vector_operand (cop1, mode))
+		cop1 = force_reg (mode, cop1);
+	      *negate = !*negate;
+	      emit_insn (gen (tem, cop0, cop1));
+	      cop1 = tem;
+	      code = EQ;
+	    }
+	}
+
       /* Unsigned parallel compare is not supported by the hardware.
 	 Play some tricks to turn this into a signed comparison
 	 against 0.  */
--- gcc/testsuite/gcc.target/i386/pr88547-1.c.jj	2018-12-20 08:54:55.988079446 +0100
+++ gcc/testsuite/gcc.target/i386/pr88547-1.c	2018-12-20 16:36:53.938389427 +0100
@@ -6,10 +6,14 @@
 /* { dg-final { scan-assembler-times "vpmovm2w\[\t  ]" 4 } } */
 /* { dg-final { scan-assembler-times "vpmovm2d\[\t  ]" 4 } } */
 /* { dg-final { scan-assembler-times "vpmovm2q\[\t  ]" 4 } } */
-/* { dg-final { scan-assembler-times "knotb\[\t  ]" 4 } } */
-/* { dg-final { scan-assembler-times "knotw\[\t  ]" 4 } } */
+/* { dg-final { scan-assembler-times "knotb\[\t  ]" 2 } } */
+/* { dg-final { scan-assembler-times "knotw\[\t  ]" 2 } } */
 /* { dg-final { scan-assembler-times "knotd\[\t  ]" 2 } } */
 /* { dg-final { scan-assembler-times "knotq\[\t  ]" 2 } } */
+/* { dg-final { scan-assembler-times "vpminud\[\t  ]" 2 } } */
+/* { dg-final { scan-assembler-times "vpminuq\[\t  ]" 2 } } */
+/* { dg-final { scan-assembler-not "vpsubd\[\t  ]" } } */
+/* { dg-final { scan-assembler-not "vpsubq\[\t  ]" } } */
 
 typedef signed char v64qi __attribute__((vector_size(64)));
 typedef unsigned char v64uqi __attribute__((vector_size(64)));
--- gcc/testsuite/gcc.target/i386/sse2-pr88547-1.c.jj	2018-12-20 16:16:30.398321030 +0100
+++ gcc/testsuite/gcc.target/i386/sse2-pr88547-1.c	2018-12-20 16:18:49.617050611 +0100
@@ -0,0 +1,115 @@
+/* PR target/88547 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse2 -mno-sse3" } */
+/* { dg-final { scan-assembler-not "pmingtw\[\t ]" } } */
+/* { dg-final { scan-assembler-times "pminub\[\t ]" 2 } } */
+/* { dg-final { scan-assembler-times "pminsw\[\t ]" 2 } } */
+/* { dg-final { scan-assembler-not "pminsb\[\t ]" } } */
+/* { dg-final { scan-assembler-not "pminuw\[\t ]" } } */
+/* { dg-final { scan-assembler-not "pminud\[\t ]" } } */
+/* { dg-final { scan-assembler-not "pminuq\[\t ]" } } */
+
+typedef signed char v16qi __attribute__((vector_size(16)));
+typedef unsigned char v16uqi __attribute__((vector_size(16)));
+typedef short v8hi __attribute__((vector_size(16)));
+typedef unsigned short v8uhi __attribute__((vector_size(16)));
+typedef int v4si __attribute__((vector_size(16)));
+typedef unsigned v4usi __attribute__((vector_size(16)));
+typedef long long v2di __attribute__((vector_size(16)));
+typedef unsigned long long v2udi __attribute__((vector_size(16)));
+
+v16qi
+f1 (v16qi x, v16qi y)
+{
+  return x <= y;
+}
+
+v16uqi
+f2 (v16uqi x, v16uqi y)
+{
+  return x <= y;
+}
+
+v16qi
+f3 (v16qi x, v16qi y)
+{
+  return x >= y;
+}
+
+v16uqi
+f4 (v16uqi x, v16uqi y)
+{
+  return x >= y;
+}
+
+v8hi
+f5 (v8hi x, v8hi y)
+{
+  return x <= y;
+}
+
+v8uhi
+f6 (v8uhi x, v8uhi y)
+{
+  return x <= y;
+}
+
+v8hi
+f7 (v8hi x, v8hi y)
+{
+  return x >= y;
+}
+
+v8uhi
+f8 (v8uhi x, v8uhi y)
+{
+  return x >= y;
+}
+
+v4si
+f9 (v4si x, v4si y)
+{
+  return x <= y;
+}
+
+v4usi
+f10 (v4usi x, v4usi y)
+{
+  return x <= y;
+}
+
+v4si
+f11 (v4si x, v4si y)
+{
+  return x >= y;
+}
+
+v4usi
+f12 (v4usi x, v4usi y)
+{
+  return x >= y;
+}
+
+v2di
+f13 (v2di x, v2di y)
+{
+  return x <= y;
+}
+
+v2udi
+f14 (v2udi x, v2udi y)
+{
+  return x <= y;
+}
+
+v2di
+f15 (v2di x, v2di y)
+{
+  return x >= y;
+}
+
+v2udi
+f16 (v2udi x, v2udi y)
+{
+  return x >= y;
+}
--- gcc/testsuite/gcc.target/i386/sse2-pr88547-2.c.jj	2018-12-20 17:02:32.146234428 +0100
+++ gcc/testsuite/gcc.target/i386/sse2-pr88547-2.c	2018-12-20 17:30:22.410972349 +0100
@@ -0,0 +1,90 @@
+/* { dg-do run } */
+/* { dg-require-effective-target sse2 } */
+/* { dg-options "-O2 -msse2" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse2-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse2_test
+#endif
+
+#include CHECK_H
+
+#include "sse2-pr88547-1.c"
+
+#define NUM 256
+
+#define TEST_SIGNED(vtype, type, N, fn, op) \
+do								\
+  {								\
+    union { vtype x[NUM / N]; type i[NUM]; } dst, src1, src2;	\
+    int i, sign = 1;						\
+    type res;							\
+    for (i = 0; i < NUM; i++)					\
+      {								\
+	src1.i[i] = i * i * sign;				\
+	src2.i[i] = (i + 20) * sign;				\
+	sign = -sign;						\
+      }								\
+    for (i = 0; i < NUM; i += N)				\
+      dst.x[i / N] = fn (src1.x[i / N], src2.x[i / N]);		\
+								\
+    for (i = 0; i < NUM; i++)					\
+      {								\
+	res = src1.i[i] op src2.i[i] ? -1 : 0;			\
+	if (res != dst.i[i])					\
+	  abort ();						\
+      }								\
+  }								\
+while (0)
+
+#define TEST_UNSIGNED(vtype, type, N, fn, op) \
+do								\
+  {								\
+    union { vtype x[NUM / N]; type i[NUM]; } dst, src1, src2;	\
+    int i;							\
+    type res;							\
+								\
+    for (i = 0; i < NUM; i++)					\
+      {								\
+	src1.i[i] = i * i;					\
+	src2.i[i] = i + 20;					\
+	if ((i % 4))						\
+	  src2.i[i] |= (1ULL << (sizeof (type)			\
+				 * __CHAR_BIT__ - 1));		\
+      }								\
+								\
+    for (i = 0; i < NUM; i += N)				\
+      dst.x[i / N] = fn (src1.x[i / N], src2.x[i / N]);		\
+								\
+    for (i = 0; i < NUM; i++)					\
+      {								\
+	res = src1.i[i] op src2.i[i] ? -1 : 0;			\
+	if (res != dst.i[i])					\
+	  abort ();						\
+      }								\
+  }								\
+while (0)
+
+static void
+TEST (void)
+{
+  TEST_SIGNED (v16qi, signed char, 16, f1, <=);
+  TEST_UNSIGNED (v16uqi, unsigned char, 16, f2, <=);
+  TEST_SIGNED (v16qi, signed char, 16, f3, >=);
+  TEST_UNSIGNED (v16uqi, unsigned char, 16, f4, >=);
+  TEST_SIGNED (v8hi, short int, 8, f5, <=);
+  TEST_UNSIGNED (v8uhi, unsigned short int, 8, f6, <=);
+  TEST_SIGNED (v8hi, short int, 8, f7, >=);
+  TEST_UNSIGNED (v8uhi, unsigned short int, 8, f8, >=);
+  TEST_SIGNED (v4si, int, 4, f9, <=);
+  TEST_UNSIGNED (v4usi, unsigned int, 4, f10, <=);
+  TEST_SIGNED (v4si, int, 4, f11, >=);
+  TEST_UNSIGNED (v4usi, unsigned int, 4, f12, >=);
+  TEST_SIGNED (v2di, long long int, 2, f13, <=);
+  TEST_UNSIGNED (v2udi, unsigned long long int, 2, f14, <=);
+  TEST_SIGNED (v2di, long long int, 2, f15, >=);
+  TEST_UNSIGNED (v2udi, unsigned long long int, 2, f16, >=);
+}
--- gcc/testsuite/gcc.target/i386/sse4_1-pr88547-1.c.jj	2018-12-20 16:19:38.042260879 +0100
+++ gcc/testsuite/gcc.target/i386/sse4_1-pr88547-1.c	2018-12-20 16:21:24.211529447 +0100
@@ -0,0 +1,12 @@
+/* PR target/88547 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse4.1 -mno-sse4.2" } */
+/* { dg-final { scan-assembler-not "pmingt\[bwd]\[\t ]" } } */
+/* { dg-final { scan-assembler-times "pminub\[\t ]" 2 } } */
+/* { dg-final { scan-assembler-times "pminsb\[\t ]" 2 } } */
+/* { dg-final { scan-assembler-times "pminuw\[\t ]" 2 } } */
+/* { dg-final { scan-assembler-times "pminsw\[\t ]" 2 } } */
+/* { dg-final { scan-assembler-times "pminud\[\t ]" 2 } } */
+/* { dg-final { scan-assembler-times "pminsd\[\t ]" 2 } } */
+
+#include "sse2-pr88547-1.c"
--- gcc/testsuite/gcc.target/i386/sse4_1-pr88547-2.c.jj	2018-12-20 17:30:56.908409598 +0100
+++ gcc/testsuite/gcc.target/i386/sse4_1-pr88547-2.c	2018-12-20 17:36:20.876117930 +0100
@@ -0,0 +1,8 @@
+/* { dg-do run } */
+/* { dg-require-effective-target sse4 } */
+/* { dg-options "-O2 -msse4.1" } */
+
+#define CHECK_H "sse4_1-check.h"
+#define TEST sse4_1_test
+
+#include "sse2-pr88547-2.c"
--- gcc/testsuite/gcc.target/i386/avx2-pr88547-1.c.jj	2018-12-20 16:28:22.700714103 +0100
+++ gcc/testsuite/gcc.target/i386/avx2-pr88547-1.c	2018-12-20 16:31:52.935290777 +0100
@@ -0,0 +1,115 @@
+/* PR target/88547 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx2 -mno-xop -mno-avx512f" } */
+/* { dg-final { scan-assembler-not "vpmingt\[bwd]\[\t ]" } } */
+/* { dg-final { scan-assembler-times "vpminub\[\t ]" 2 } } */
+/* { dg-final { scan-assembler-times "vpminsb\[\t ]" 2 } } */
+/* { dg-final { scan-assembler-times "vpminuw\[\t ]" 2 } } */
+/* { dg-final { scan-assembler-times "vpminsw\[\t ]" 2 } } */
+/* { dg-final { scan-assembler-times "vpminud\[\t ]" 2 } } */
+/* { dg-final { scan-assembler-times "vpminsd\[\t ]" 2 } } */
+
+typedef signed char v32qi __attribute__((vector_size(32)));
+typedef unsigned char v32uqi __attribute__((vector_size(32)));
+typedef short v16hi __attribute__((vector_size(32)));
+typedef unsigned short v16uhi __attribute__((vector_size(32)));
+typedef int v8si __attribute__((vector_size(32)));
+typedef unsigned v8usi __attribute__((vector_size(32)));
+typedef long long v4di __attribute__((vector_size(32)));
+typedef unsigned long long v4udi __attribute__((vector_size(32)));
+
+__attribute__((noipa)) v32qi
+f1 (v32qi x, v32qi y)
+{
+  return x <= y;
+}
+
+__attribute__((noipa)) v32uqi
+f2 (v32uqi x, v32uqi y)
+{
+  return x <= y;
+}
+
+__attribute__((noipa)) v32qi
+f3 (v32qi x, v32qi y)
+{
+  return x >= y;
+}
+
+__attribute__((noipa)) v32uqi
+f4 (v32uqi x, v32uqi y)
+{
+  return x >= y;
+}
+
+__attribute__((noipa)) v16hi
+f5 (v16hi x, v16hi y)
+{
+  return x <= y;
+}
+
+__attribute__((noipa)) v16uhi
+f6 (v16uhi x, v16uhi y)
+{
+  return x <= y;
+}
+
+__attribute__((noipa)) v16hi
+f7 (v16hi x, v16hi y)
+{
+  return x >= y;
+}
+
+__attribute__((noipa)) v16uhi
+f8 (v16uhi x, v16uhi y)
+{
+  return x >= y;
+}
+
+__attribute__((noipa)) v8si
+f9 (v8si x, v8si y)
+{
+  return x <= y;
+}
+
+__attribute__((noipa)) v8usi
+f10 (v8usi x, v8usi y)
+{
+  return x <= y;
+}
+
+__attribute__((noipa)) v8si
+f11 (v8si x, v8si y)
+{
+  return x >= y;
+}
+
+__attribute__((noipa)) v8usi
+f12 (v8usi x, v8usi y)
+{
+  return x >= y;
+}
+
+__attribute__((noipa)) v4di
+f13 (v4di x, v4di y)
+{
+  return x <= y;
+}
+
+__attribute__((noipa)) v4udi
+f14 (v4udi x, v4udi y)
+{
+  return x <= y;
+}
+
+__attribute__((noipa)) v4di
+f15 (v4di x, v4di y)
+{
+  return x >= y;
+}
+
+__attribute__((noipa)) v4udi
+f16 (v4udi x, v4udi y)
+{
+  return x >= y;
+}
--- gcc/testsuite/gcc.target/i386/avx2-pr88547-2.c.jj	2018-12-20 17:33:50.336576866 +0100
+++ gcc/testsuite/gcc.target/i386/avx2-pr88547-2.c	2018-12-20 17:44:34.021062830 +0100
@@ -0,0 +1,90 @@
+/* { dg-do run } */
+/* { dg-require-effective-target avx2 } */
+/* { dg-options "-O2 -mavx2" } */
+
+#ifndef CHECK
+#define CHECK "avx2-check.h"
+#endif
+
+#ifndef TEST
+#define TEST avx2_test
+#endif
+
+#include CHECK
+
+#include "avx2-pr88547-1.c"
+
+#define NUM 256
+
+#define TEST_SIGNED(vtype, type, N, fn, op) \
+do								\
+  {								\
+    union { vtype x[NUM / N]; type i[NUM]; } dst, src1, src2;	\
+    int i, sign = 1;						\
+    type res;							\
+    for (i = 0; i < NUM; i++)					\
+      {								\
+	src1.i[i] = i * i * sign;				\
+	src2.i[i] = (i + 20) * sign;				\
+	sign = -sign;						\
+      }								\
+    for (i = 0; i < NUM; i += N)				\
+      dst.x[i / N] = fn (src1.x[i / N], src2.x[i / N]);		\
+								\
+    for (i = 0; i < NUM; i++)					\
+      {								\
+	res = src1.i[i] op src2.i[i] ? -1 : 0;			\
+	if (res != dst.i[i])					\
+	  abort ();						\
+      }								\
+  }								\
+while (0)
+
+#define TEST_UNSIGNED(vtype, type, N, fn, op) \
+do								\
+  {								\
+    union { vtype x[NUM / N]; type i[NUM]; } dst, src1, src2;	\
+    int i;							\
+    type res;							\
+								\
+    for (i = 0; i < NUM; i++)					\
+      {								\
+	src1.i[i] = i * i;					\
+	src2.i[i] = i + 20;					\
+	if ((i % 4))						\
+	  src2.i[i] |= (1ULL << (sizeof (type)			\
+				 * __CHAR_BIT__ - 1));		\
+      }								\
+								\
+    for (i = 0; i < NUM; i += N)				\
+      dst.x[i / N] = fn (src1.x[i / N], src2.x[i / N]);		\
+								\
+    for (i = 0; i < NUM; i++)					\
+      {								\
+	res = src1.i[i] op src2.i[i] ? -1 : 0;			\
+	if (res != dst.i[i])					\
+	  abort ();						\
+      }								\
+  }								\
+while (0)
+
+static void
+TEST (void)
+{
+  TEST_SIGNED (v32qi, signed char, 32, f1, <=);
+  TEST_UNSIGNED (v32uqi, unsigned char, 32, f2, <=);
+  TEST_SIGNED (v32qi, signed char, 32, f3, >=);
+  TEST_UNSIGNED (v32uqi, unsigned char, 32, f4, >=);
+  TEST_SIGNED (v16hi, short int, 16, f5, <=);
+  TEST_UNSIGNED (v16uhi, unsigned short int, 16, f6, <=);
+  TEST_SIGNED (v16hi, short int, 16, f7, >=);
+  TEST_UNSIGNED (v16uhi, unsigned short int, 16, f8, >=);
+  TEST_SIGNED (v8si, int, 8, f9, <=);
+  TEST_UNSIGNED (v8usi, unsigned int, 8, f10, <=);
+  TEST_SIGNED (v8si, int, 8, f11, >=);
+  TEST_UNSIGNED (v8usi, unsigned int, 8, f12, >=);
+  TEST_SIGNED (v4di, long long int, 4, f13, <=);
+  TEST_UNSIGNED (v4udi, unsigned long long int, 4, f14, <=);
+  TEST_SIGNED (v4di, long long int, 4, f15, >=);
+  TEST_UNSIGNED (v4udi, unsigned long long int, 4, f16, >=);
+}
--- gcc/testsuite/gcc.target/i386/avx512f-pr88547-2.c.jj	2018-12-20 17:37:15.420227002 +0100
+++ gcc/testsuite/gcc.target/i386/avx512f-pr88547-2.c	2018-12-20 17:41:14.473322272 +0100
@@ -0,0 +1,82 @@
+/* { dg-do run } */
+/* { dg-require-effective-target avx512f } */
+/* { dg-options "-O2 -mavx512f" } */
+
+#include "avx512-check.h"
+
+#include "pr88547-1.c"
+
+#define NUM 512
+
+#define TEST_SIGNED(vtype, type, N, fn, op) \
+do								\
+  {								\
+    union { vtype x[NUM / N]; type i[NUM]; } dst, src1, src2;	\
+    int i, sign = 1;						\
+    type res;							\
+    for (i = 0; i < NUM; i++)					\
+      {								\
+	src1.i[i] = i * i * sign;				\
+	src2.i[i] = (i + 20) * sign;				\
+	sign = -sign;						\
+      }								\
+    for (i = 0; i < NUM; i += N)				\
+      dst.x[i / N] = fn (src1.x[i / N], src2.x[i / N]);		\
+								\
+    for (i = 0; i < NUM; i++)					\
+      {								\
+	res = src1.i[i] op src2.i[i] ? -1 : 0;			\
+	if (res != dst.i[i])					\
+	  abort ();						\
+      }								\
+  }								\
+while (0)
+
+#define TEST_UNSIGNED(vtype, type, N, fn, op) \
+do								\
+  {								\
+    union { vtype x[NUM / N]; type i[NUM]; } dst, src1, src2;	\
+    int i;							\
+    type res;							\
+								\
+    for (i = 0; i < NUM; i++)					\
+      {								\
+	src1.i[i] = i * i;					\
+	src2.i[i] = i + 20;					\
+	if ((i % 4))						\
+	  src2.i[i] |= (1ULL << (sizeof (type)			\
+				 * __CHAR_BIT__ - 1));		\
+      }								\
+								\
+    for (i = 0; i < NUM; i += N)				\
+      dst.x[i / N] = fn (src1.x[i / N], src2.x[i / N]);		\
+								\
+    for (i = 0; i < NUM; i++)					\
+      {								\
+	res = src1.i[i] op src2.i[i] ? -1 : 0;			\
+	if (res != dst.i[i])					\
+	  abort ();						\
+      }								\
+  }								\
+while (0)
+
+static void
+test_512 (void)
+{
+  TEST_SIGNED (v64qi, signed char, 64, f1, <=);
+  TEST_UNSIGNED (v64uqi, unsigned char, 64, f2, <=);
+  TEST_SIGNED (v64qi, signed char, 64, f3, >=);
+  TEST_UNSIGNED (v64uqi, unsigned char, 64, f4, >=);
+  TEST_SIGNED (v32hi, short int, 32, f5, <=);
+  TEST_UNSIGNED (v32uhi, unsigned short int, 32, f6, <=);
+  TEST_SIGNED (v32hi, short int, 32, f7, >=);
+  TEST_UNSIGNED (v32uhi, unsigned short int, 32, f8, >=);
+  TEST_SIGNED (v16si, int, 16, f9, <=);
+  TEST_UNSIGNED (v16usi, unsigned int, 16, f10, <=);
+  TEST_SIGNED (v16si, int, 16, f11, >=);
+  TEST_UNSIGNED (v16usi, unsigned int, 16, f12, >=);
+  TEST_SIGNED (v8di, long long int, 8, f13, <=);
+  TEST_UNSIGNED (v8udi, unsigned long long int, 8, f14, <=);
+  TEST_SIGNED (v8di, long long int, 8, f15, >=);
+  TEST_UNSIGNED (v8udi, unsigned long long int, 8, f16, >=);
+}
--- gcc/testsuite/gcc.target/i386/avx512vl-pr88547-1.c.jj	2018-12-20 16:34:12.260022099 +0100
+++ gcc/testsuite/gcc.target/i386/avx512vl-pr88547-1.c	2018-12-20 17:45:12.783429682 +0100
@@ -0,0 +1,14 @@
+/* PR target/88547 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-xop -mavx512vl -mno-avx512bw -mno-avx512dq" } */
+/* { dg-final { scan-assembler-not "vpmingt\[bwdq]\[\t ]" } } */
+/* { dg-final { scan-assembler-times "vpminub\[\t ]" 2 } } */
+/* { dg-final { scan-assembler-times "vpminsb\[\t ]" 2 } } */
+/* { dg-final { scan-assembler-times "vpminuw\[\t ]" 2 } } */
+/* { dg-final { scan-assembler-times "vpminsw\[\t ]" 2 } } */
+/* { dg-final { scan-assembler-times "vpminud\[\t ]" 2 } } */
+/* { dg-final { scan-assembler-times "vpminsd\[\t ]" 2 } } */
+/* { dg-final { scan-assembler-times "vpminuq\[\t ]" 2 } } */
+/* { dg-final { scan-assembler-times "vpminsq\[\t ]" 2 } } */
+
+#include "avx2-pr88547-1.c"
--- gcc/testsuite/gcc.target/i386/avx512vl-pr88547-2.c.jj	2018-12-20 17:42:21.697224235 +0100
+++ gcc/testsuite/gcc.target/i386/avx512vl-pr88547-2.c	2018-12-20 17:47:49.632867686 +0100
@@ -0,0 +1,22 @@
+/* { dg-do run } */
+/* { dg-require-effective-target avx512vl } */
+/* { dg-require-effective-target avx512bw } */
+/* { dg-require-effective-target avx512dq } */
+/* { dg-options "-O2 -mavx512vl -mavx512bw -mavx512dq" } */
+
+#define AVX512VL
+#define AVX512BW
+#define AVX512DQ
+
+#include "avx512f-pr88547-2.c"
+
+static void
+test_256 (void)
+{
+  test_512 ();
+}
+
+static void
+test_128 (void)
+{
+}
--- gcc/testsuite/gcc.target/i386/avx512vl-pr88547-3.c.jj	2018-12-20 17:45:19.358322291 +0100
+++ gcc/testsuite/gcc.target/i386/avx512vl-pr88547-3.c	2018-12-20 17:47:27.823223928 +0100
@@ -0,0 +1,24 @@
+/* { dg-do run } */
+/* { dg-require-effective-target avx512vl } */
+/* { dg-require-effective-target avx512bw } */
+/* { dg-require-effective-target avx512dq } */
+/* { dg-options "-O2 -mavx512vl -mavx512bw -mavx512dq" } */
+
+#define AVX512VL
+#define AVX512BW
+#define AVX512DQ
+#define CHECK "avx512-check.h"
+#define TEST test_512
+
+#include "avx2-pr88547-2.c"
+
+static void
+test_256 (void)
+{
+  return test_512 ();
+}
+
+static void
+test_128 (void)
+{
+}
--- gcc/testsuite/gcc.target/i386/avx512f_cond_move.c.jj	2016-05-22 12:20:04.109035607 +0200
+++ gcc/testsuite/gcc.target/i386/avx512f_cond_move.c	2018-12-21 09:44:03.451007492 +0100
@@ -3,7 +3,7 @@
 /* { dg-final { scan-assembler-times "(?:vpblendmd|vmovdqa32)\[ \\t\]+\[^\{\n\]*%zmm\[0-9\]+\{%k\[1-7\]\}\{z\}(?:\n|\[ \\t\]+#)" 8 } } */
 
 unsigned int x[128];
-unsigned int y[128];
+int y[128];
 
 void
 foo () 

	Jakub

* Re: [PATCH] Use vpmin to optimize some vector comparisons (PR target/88547)
  2018-12-21  9:15 [PATCH] Use vpmin to optimize some vector comparisons (PR target/88547) Jakub Jelinek
@ 2018-12-21 10:41 ` Uros Bizjak
  0 siblings, 0 replies; 2+ messages in thread
From: Uros Bizjak @ 2018-12-21 10:41 UTC
  To: Jakub Jelinek; +Cc: Richard Henderson, gcc-patches

On Fri, Dec 21, 2018 at 10:13 AM Jakub Jelinek <jakub@redhat.com> wrote:
>
> Hi!
>
> The following patch attempts to improve code generated for some
> integral vector comparisons and VEC_COND_EXPRs with integral comparisons.
> The only available integral vector comparison instructions are GT and
> EQ, the rest is handled either by negating the result (for vcond swapping
> op_true/op_false), or swapping comparison operands and for unsigned
> comparisons we use various tricks (either subtract *_MAX from both operands
> for V*SI/V*DImode, or use saturating subtractions for V*QI/V*HImode).
>
> If op_true is -1 and op_false is 0, at least when not using AVX512 mask
> reg comparisons we have the right result right after the comparison.
> So for signed x > y we can just
>         vpcmpgtd        %ymm1, %ymm0, %ymm0
> but if op_true is 0 and op_false is -1, i.e. x <= y, we generate
>         vpcmpgtd        %ymm1, %ymm0, %ymm0
>         vpcmpeqd        %ymm1, %ymm1, %ymm1
>         vpandn  %ymm1, %ymm0, %ymm0
> The following patch attempts to detect these cases where we would have
> op_true 0 and op_false -1, and rather than generating a single vpcmpgtd
> for the comparison and wait for 2 more instructions for the conditional move
> we generate two instructions for the comparison - min (x, y) == x and don't
> need anything for the rest, so:
>         vpminud %ymm1, %ymm0, %ymm1
>         vpcmpeqd        %ymm0, %ymm1, %ymm0
> For most cases it is done only in these cases where
> 1) mask registers aren't involved
> 2) op_true == 0 and op_false == -1 (with the *negate considered, as the
>    transformation inverts *negate)
>
> There is one case where this is useful to do even in other cases, for
> V*SI/V*DImode and unsigned comparisons we generate those
> x -= INT_MAX; y -= INT_MAX subtractions before comparison, so using
> vpminu[dq] + vpcmpeq[dq] is shorter regardless of what follows (with the
> exception when op_true is -1 and op_false is 0, when both sequences are the
> same).
>
> Of course, we can do this optimization only if the corresponding vpmin*
> instructions are available, which varies a lot depending on mode (sometimes
> SSE2, SSE4.1, AVX2, AVX512F {,+VL}, AVX512BW {,+VL}).
>
> Fixed thusly, bootstrapped/regtested on x86_64-linux and i686-linux, ok for
> trunk?
>
> The reason for the avx512f_cond_move.c testcase adjustment is that it hits
> the above second case, where vpminud + vpcmpeqd is shorter, but that means
> we transform that cmp1 ? 2 : 0 which can use a {z} masking into cmp2 ? 0 : 2
> which can't (still, the whole thing is shorter).  On the other side, if the
> original testcase was cmp1 ? 0 : 2 then we wouldn't use {z} masking before
> and with cmp2 ? 2 : 0 we would with the patch.  By changing the comparison
> to be signed, we change what I believe the testcase meant to test - verify
> that combiner can produce {z} masking.
>
> 2018-12-21  Jakub Jelinek  <jakub@redhat.com>
>
>         PR target/88547
>         * config/i386/i386.c (ix86_expand_int_sse_cmp): Optimize
>         x > y ? 0 : -1 into min (x, y) == x ? -1 : 0.
>
>         * gcc.target/i386/pr88547-1.c: Expect only 2 knotb and 2 knotw
>         insns instead of 4, check for vpminud, vpminuq and no vpsubd or
>         vpsubq.
>         * gcc.target/i386/sse2-pr88547-1.c: New test.
>         * gcc.target/i386/sse2-pr88547-2.c: New test.
>         * gcc.target/i386/sse4_1-pr88547-1.c: New test.
>         * gcc.target/i386/sse4_1-pr88547-2.c: New test.
>         * gcc.target/i386/avx2-pr88547-1.c: New test.
>         * gcc.target/i386/avx2-pr88547-2.c: New test.
>         * gcc.target/i386/avx512f-pr88547-2.c: New test.
>         * gcc.target/i386/avx512vl-pr88547-1.c: New test.
>         * gcc.target/i386/avx512vl-pr88547-2.c: New test.
>         * gcc.target/i386/avx512vl-pr88547-3.c: New test.
>         * gcc.target/i386/avx512f_cond_move.c (y): Change from unsigned int
>         array to int array.

Nice patch, LGTM.

Thanks,
Uros.

> --- gcc/config/i386/i386.c.jj   2018-12-20 18:28:51.118253338 +0100
> +++ gcc/config/i386/i386.c      2018-12-21 02:17:23.049042774 +0100
> @@ -24126,6 +24126,104 @@ ix86_expand_int_sse_cmp (rtx dest, enum
>             }
>         }
>
> +      rtx optrue = op_true ? op_true : CONSTM1_RTX (data_mode);
> +      rtx opfalse = op_false ? op_false : CONST0_RTX (data_mode);
> +      if (*negate)
> +       std::swap (optrue, opfalse);
> +
> +      /* Transform x > y ? 0 : -1 (i.e. x <= y ? -1 : 0 or x <= y) when
> +        not using integer masks into min (x, y) == x ? -1 : 0 (i.e.
> +        min (x, y) == x).  While we add one instruction (the minimum),
> +        we remove the need for two instructions in the negation, as the
> +        result is done this way.
> +        When using masks, do it for SI/DImode element types, as it is shorter
> +        than the two subtractions.  */
> +      if ((code != EQ
> +          && GET_MODE_SIZE (mode) != 64
> +          && vector_all_ones_operand (opfalse, data_mode)
> +          && optrue == CONST0_RTX (data_mode))
> +         || (code == GTU
> +             && GET_MODE_SIZE (GET_MODE_INNER (mode)) >= 4
> +             /* Don't do it if not using integer masks and we'd end up with
> +                the right values in the registers though.  */
> +             && (GET_MODE_SIZE (mode) == 64
> +                 || !vector_all_ones_operand (optrue, data_mode)
> +                 || opfalse != CONST0_RTX (data_mode))))
> +       {
> +         rtx (*gen) (rtx, rtx, rtx) = NULL;
> +
> +         switch (mode)
> +           {
> +           case E_V16SImode:
> +             gen = (code == GTU) ? gen_uminv16si3 : gen_sminv16si3;
> +             break;
> +           case E_V8DImode:
> +             gen = (code == GTU) ? gen_uminv8di3 : gen_sminv8di3;
> +             cop0 = force_reg (mode, cop0);
> +             cop1 = force_reg (mode, cop1);
> +             break;
> +           case E_V32QImode:
> +             if (TARGET_AVX2)
> +               gen = (code == GTU) ? gen_uminv32qi3 : gen_sminv32qi3;
> +             break;
> +           case E_V16HImode:
> +             if (TARGET_AVX2)
> +               gen = (code == GTU) ? gen_uminv16hi3 : gen_sminv16hi3;
> +             break;
> +           case E_V8SImode:
> +             if (TARGET_AVX2)
> +               gen = (code == GTU) ? gen_uminv8si3 : gen_sminv8si3;
> +             break;
> +           case E_V4DImode:
> +             if (TARGET_AVX512VL)
> +               {
> +                 gen = (code == GTU) ? gen_uminv4di3 : gen_sminv4di3;
> +                 cop0 = force_reg (mode, cop0);
> +                 cop1 = force_reg (mode, cop1);
> +               }
> +             break;
> +           case E_V16QImode:
> +             if (code == GTU && TARGET_SSE2)
> +               gen = gen_uminv16qi3;
> +             else if (code == GT && TARGET_SSE4_1)
> +               gen = gen_sminv16qi3;
> +             break;
> +           case E_V8HImode:
> +             if (code == GTU && TARGET_SSE4_1)
> +               gen = gen_uminv8hi3;
> +             else if (code == GT && TARGET_SSE2)
> +               gen = gen_sminv8hi3;
> +             break;
> +           case E_V4SImode:
> +             if (TARGET_SSE4_1)
> +               gen = (code == GTU) ? gen_uminv4si3 : gen_sminv4si3;
> +             break;
> +           case E_V2DImode:
> +             if (TARGET_AVX512VL)
> +               {
> +                 gen = (code == GTU) ? gen_uminv2di3 : gen_sminv2di3;
> +                 cop0 = force_reg (mode, cop0);
> +                 cop1 = force_reg (mode, cop1);
> +               }
> +             break;
> +           default:
> +             break;
> +           }
> +
> +         if (gen)
> +           {
> +             rtx tem = gen_reg_rtx (mode);
> +             if (!vector_operand (cop0, mode))
> +               cop0 = force_reg (mode, cop0);
> +             if (!vector_operand (cop1, mode))
> +               cop1 = force_reg (mode, cop1);
> +             *negate = !*negate;
> +             emit_insn (gen (tem, cop0, cop1));
> +             cop1 = tem;
> +             code = EQ;
> +           }
> +       }
> +
>        /* Unsigned parallel compare is not supported by the hardware.
>          Play some tricks to turn this into a signed comparison
>          against 0.  */
> --- gcc/testsuite/gcc.target/i386/pr88547-1.c.jj        2018-12-20 08:54:55.988079446 +0100
> +++ gcc/testsuite/gcc.target/i386/pr88547-1.c   2018-12-20 16:36:53.938389427 +0100
> @@ -6,10 +6,14 @@
>  /* { dg-final { scan-assembler-times "vpmovm2w\[\t  ]" 4 } } */
>  /* { dg-final { scan-assembler-times "vpmovm2d\[\t  ]" 4 } } */
>  /* { dg-final { scan-assembler-times "vpmovm2q\[\t  ]" 4 } } */
> -/* { dg-final { scan-assembler-times "knotb\[\t  ]" 4 } } */
> -/* { dg-final { scan-assembler-times "knotw\[\t  ]" 4 } } */
> +/* { dg-final { scan-assembler-times "knotb\[\t  ]" 2 } } */
> +/* { dg-final { scan-assembler-times "knotw\[\t  ]" 2 } } */
>  /* { dg-final { scan-assembler-times "knotd\[\t  ]" 2 } } */
>  /* { dg-final { scan-assembler-times "knotq\[\t  ]" 2 } } */
> +/* { dg-final { scan-assembler-times "vpminud\[\t  ]" 2 } } */
> +/* { dg-final { scan-assembler-times "vpminuq\[\t  ]" 2 } } */
> +/* { dg-final { scan-assembler-not "vpsubd\[\t  ]" } } */
> +/* { dg-final { scan-assembler-not "vpsubq\[\t  ]" } } */
>
>  typedef signed char v64qi __attribute__((vector_size(64)));
>  typedef unsigned char v64uqi __attribute__((vector_size(64)));
> --- gcc/testsuite/gcc.target/i386/sse2-pr88547-1.c.jj   2018-12-20 16:16:30.398321030 +0100
> +++ gcc/testsuite/gcc.target/i386/sse2-pr88547-1.c      2018-12-20 16:18:49.617050611 +0100
> @@ -0,0 +1,115 @@
> +/* PR target/88547 */
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -msse2 -mno-sse3" } */
> +/* { dg-final { scan-assembler-not "pmingtw\[\t ]" } } */
> +/* { dg-final { scan-assembler-times "pminub\[\t ]" 2 } } */
> +/* { dg-final { scan-assembler-times "pminsw\[\t ]" 2 } } */
> +/* { dg-final { scan-assembler-not "pminsb\[\t ]" } } */
> +/* { dg-final { scan-assembler-not "pminuw\[\t ]" } } */
> +/* { dg-final { scan-assembler-not "pminud\[\t ]" } } */
> +/* { dg-final { scan-assembler-not "pminuq\[\t ]" } } */
> +
> +typedef signed char v16qi __attribute__((vector_size(16)));
> +typedef unsigned char v16uqi __attribute__((vector_size(16)));
> +typedef short v8hi __attribute__((vector_size(16)));
> +typedef unsigned short v8uhi __attribute__((vector_size(16)));
> +typedef int v4si __attribute__((vector_size(16)));
> +typedef unsigned v4usi __attribute__((vector_size(16)));
> +typedef long long v2di __attribute__((vector_size(16)));
> +typedef unsigned long long v2udi __attribute__((vector_size(16)));
> +
> +v16qi
> +f1 (v16qi x, v16qi y)
> +{
> +  return x <= y;
> +}
> +
> +v16uqi
> +f2 (v16uqi x, v16uqi y)
> +{
> +  return x <= y;
> +}
> +
> +v16qi
> +f3 (v16qi x, v16qi y)
> +{
> +  return x >= y;
> +}
> +
> +v16uqi
> +f4 (v16uqi x, v16uqi y)
> +{
> +  return x >= y;
> +}
> +
> +v8hi
> +f5 (v8hi x, v8hi y)
> +{
> +  return x <= y;
> +}
> +
> +v8uhi
> +f6 (v8uhi x, v8uhi y)
> +{
> +  return x <= y;
> +}
> +
> +v8hi
> +f7 (v8hi x, v8hi y)
> +{
> +  return x >= y;
> +}
> +
> +v8uhi
> +f8 (v8uhi x, v8uhi y)
> +{
> +  return x >= y;
> +}
> +
> +v4si
> +f9 (v4si x, v4si y)
> +{
> +  return x <= y;
> +}
> +
> +v4usi
> +f10 (v4usi x, v4usi y)
> +{
> +  return x <= y;
> +}
> +
> +v4si
> +f11 (v4si x, v4si y)
> +{
> +  return x >= y;
> +}
> +
> +v4usi
> +f12 (v4usi x, v4usi y)
> +{
> +  return x >= y;
> +}
> +
> +v2di
> +f13 (v2di x, v2di y)
> +{
> +  return x <= y;
> +}
> +
> +v2udi
> +f14 (v2udi x, v2udi y)
> +{
> +  return x <= y;
> +}
> +
> +v2di
> +f15 (v2di x, v2di y)
> +{
> +  return x >= y;
> +}
> +
> +v2udi
> +f16 (v2udi x, v2udi y)
> +{
> +  return x >= y;
> +}
> --- gcc/testsuite/gcc.target/i386/sse2-pr88547-2.c.jj   2018-12-20 17:02:32.146234428 +0100
> +++ gcc/testsuite/gcc.target/i386/sse2-pr88547-2.c      2018-12-20 17:30:22.410972349 +0100
> @@ -0,0 +1,90 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target sse2 } */
> +/* { dg-options "-O2 -msse2" } */
> +
> +#ifndef CHECK_H
> +#define CHECK_H "sse2-check.h"
> +#endif
> +
> +#ifndef TEST
> +#define TEST sse2_test
> +#endif
> +
> +#include CHECK_H
> +
> +#include "sse2-pr88547-1.c"
> +
> +#define NUM 256
> +
> +#define TEST_SIGNED(vtype, type, N, fn, op) \
> +do                                                             \
> +  {                                                            \
> +    union { vtype x[NUM / N]; type i[NUM]; } dst, src1, src2;  \
> +    int i, sign = 1;                                           \
> +    type res;                                                  \
> +    for (i = 0; i < NUM; i++)                                  \
> +      {                                                                \
> +       src1.i[i] = i * i * sign;                               \
> +       src2.i[i] = (i + 20) * sign;                            \
> +       sign = -sign;                                           \
> +      }                                                                \
> +    for (i = 0; i < NUM; i += N)                               \
> +      dst.x[i / N] = fn (src1.x[i / N], src2.x[i / N]);                \
> +                                                               \
> +    for (i = 0; i < NUM; i++)                                  \
> +      {                                                                \
> +       res = src1.i[i] op src2.i[i] ? -1 : 0;                  \
> +       if (res != dst.i[i])                                    \
> +         abort ();                                             \
> +      }                                                                \
> +  }                                                            \
> +while (0)
> +
> +#define TEST_UNSIGNED(vtype, type, N, fn, op) \
> +do                                                             \
> +  {                                                            \
> +    union { vtype x[NUM / N]; type i[NUM]; } dst, src1, src2;  \
> +    int i;                                                     \
> +    type res;                                                  \
> +                                                               \
> +    for (i = 0; i < NUM; i++)                                  \
> +      {                                                                \
> +       src1.i[i] = i * i;                                      \
> +       src2.i[i] = i + 20;                                     \
> +       if ((i % 4))                                            \
> +         src2.i[i] |= (1ULL << (sizeof (type)                  \
> +                                * __CHAR_BIT__ - 1));          \
> +      }                                                                \
> +                                                               \
> +    for (i = 0; i < NUM; i += N)                               \
> +      dst.x[i / N] = fn (src1.x[i / N], src2.x[i / N]);                \
> +                                                               \
> +    for (i = 0; i < NUM; i++)                                  \
> +      {                                                                \
> +       res = src1.i[i] op src2.i[i] ? -1 : 0;                  \
> +       if (res != dst.i[i])                                    \
> +         abort ();                                             \
> +      }                                                                \
> +  }                                                            \
> +while (0)
> +
> +static void
> +TEST (void)
> +{
> +  TEST_SIGNED (v16qi, signed char, 16, f1, <=);
> +  TEST_UNSIGNED (v16uqi, unsigned char, 16, f2, <=);
> +  TEST_SIGNED (v16qi, signed char, 16, f3, >=);
> +  TEST_UNSIGNED (v16uqi, unsigned char, 16, f4, >=);
> +  TEST_SIGNED (v8hi, short int, 8, f5, <=);
> +  TEST_UNSIGNED (v8uhi, unsigned short int, 8, f6, <=);
> +  TEST_SIGNED (v8hi, short int, 8, f7, >=);
> +  TEST_UNSIGNED (v8uhi, unsigned short int, 8, f8, >=);
> +  TEST_SIGNED (v4si, int, 4, f9, <=);
> +  TEST_UNSIGNED (v4usi, unsigned int, 4, f10, <=);
> +  TEST_SIGNED (v4si, int, 4, f11, >=);
> +  TEST_UNSIGNED (v4usi, unsigned int, 4, f12, >=);
> +  TEST_SIGNED (v2di, long long int, 2, f13, <=);
> +  TEST_UNSIGNED (v2udi, unsigned long long int, 2, f14, <=);
> +  TEST_SIGNED (v2di, long long int, 2, f15, >=);
> +  TEST_UNSIGNED (v2udi, unsigned long long int, 2, f16, >=);
> +}
> --- gcc/testsuite/gcc.target/i386/sse4_1-pr88547-1.c.jj 2018-12-20 16:19:38.042260879 +0100
> +++ gcc/testsuite/gcc.target/i386/sse4_1-pr88547-1.c    2018-12-20 16:21:24.211529447 +0100
> @@ -0,0 +1,12 @@
> +/* PR target/88547 */
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -msse4.1 -mno-sse4.2" } */
> +/* { dg-final { scan-assembler-not "pmingt\[bwd]\[\t ]" } } */
> +/* { dg-final { scan-assembler-times "pminub\[\t ]" 2 } } */
> +/* { dg-final { scan-assembler-times "pminsb\[\t ]" 2 } } */
> +/* { dg-final { scan-assembler-times "pminuw\[\t ]" 2 } } */
> +/* { dg-final { scan-assembler-times "pminsw\[\t ]" 2 } } */
> +/* { dg-final { scan-assembler-times "pminud\[\t ]" 2 } } */
> +/* { dg-final { scan-assembler-times "pminsd\[\t ]" 2 } } */
> +
> +#include "sse2-pr88547-1.c"
> --- gcc/testsuite/gcc.target/i386/sse4_1-pr88547-2.c.jj 2018-12-20 17:30:56.908409598 +0100
> +++ gcc/testsuite/gcc.target/i386/sse4_1-pr88547-2.c    2018-12-20 17:36:20.876117930 +0100
> @@ -0,0 +1,8 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target sse4 } */
> +/* { dg-options "-O2 -msse4.1" } */
> +
> +#define CHECK_H "sse4_1-check.h"
> +#define TEST sse4_1_test
> +
> +#include "sse2-pr88547-2.c"
> --- gcc/testsuite/gcc.target/i386/avx2-pr88547-1.c.jj   2018-12-20 16:28:22.700714103 +0100
> +++ gcc/testsuite/gcc.target/i386/avx2-pr88547-1.c      2018-12-20 16:31:52.935290777 +0100
> @@ -0,0 +1,115 @@
> +/* PR target/88547 */
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mavx2 -mno-xop -mno-avx512f" } */
> +/* { dg-final { scan-assembler-not "vpmingt\[bwd]\[\t ]" } } */
> +/* { dg-final { scan-assembler-times "vpminub\[\t ]" 2 } } */
> +/* { dg-final { scan-assembler-times "vpminsb\[\t ]" 2 } } */
> +/* { dg-final { scan-assembler-times "vpminuw\[\t ]" 2 } } */
> +/* { dg-final { scan-assembler-times "vpminsw\[\t ]" 2 } } */
> +/* { dg-final { scan-assembler-times "vpminud\[\t ]" 2 } } */
> +/* { dg-final { scan-assembler-times "vpminsd\[\t ]" 2 } } */
> +
> +typedef signed char v32qi __attribute__((vector_size(32)));
> +typedef unsigned char v32uqi __attribute__((vector_size(32)));
> +typedef short v16hi __attribute__((vector_size(32)));
> +typedef unsigned short v16uhi __attribute__((vector_size(32)));
> +typedef int v8si __attribute__((vector_size(32)));
> +typedef unsigned v8usi __attribute__((vector_size(32)));
> +typedef long long v4di __attribute__((vector_size(32)));
> +typedef unsigned long long v4udi __attribute__((vector_size(32)));
> +
> +__attribute__((noipa)) v32qi
> +f1 (v32qi x, v32qi y)
> +{
> +  return x <= y;
> +}
> +
> +__attribute__((noipa)) v32uqi
> +f2 (v32uqi x, v32uqi y)
> +{
> +  return x <= y;
> +}
> +
> +__attribute__((noipa)) v32qi
> +f3 (v32qi x, v32qi y)
> +{
> +  return x >= y;
> +}
> +
> +__attribute__((noipa)) v32uqi
> +f4 (v32uqi x, v32uqi y)
> +{
> +  return x >= y;
> +}
> +
> +__attribute__((noipa)) v16hi
> +f5 (v16hi x, v16hi y)
> +{
> +  return x <= y;
> +}
> +
> +__attribute__((noipa)) v16uhi
> +f6 (v16uhi x, v16uhi y)
> +{
> +  return x <= y;
> +}
> +
> +__attribute__((noipa)) v16hi
> +f7 (v16hi x, v16hi y)
> +{
> +  return x >= y;
> +}
> +
> +__attribute__((noipa)) v16uhi
> +f8 (v16uhi x, v16uhi y)
> +{
> +  return x >= y;
> +}
> +
> +__attribute__((noipa)) v8si
> +f9 (v8si x, v8si y)
> +{
> +  return x <= y;
> +}
> +
> +__attribute__((noipa)) v8usi
> +f10 (v8usi x, v8usi y)
> +{
> +  return x <= y;
> +}
> +
> +__attribute__((noipa)) v8si
> +f11 (v8si x, v8si y)
> +{
> +  return x >= y;
> +}
> +
> +__attribute__((noipa)) v8usi
> +f12 (v8usi x, v8usi y)
> +{
> +  return x >= y;
> +}
> +
> +__attribute__((noipa)) v4di
> +f13 (v4di x, v4di y)
> +{
> +  return x <= y;
> +}
> +
> +__attribute__((noipa)) v4udi
> +f14 (v4udi x, v4udi y)
> +{
> +  return x <= y;
> +}
> +
> +__attribute__((noipa)) v4di
> +f15 (v4di x, v4di y)
> +{
> +  return x >= y;
> +}
> +
> +__attribute__((noipa)) v4udi
> +f16 (v4udi x, v4udi y)
> +{
> +  return x >= y;
> +}
> --- gcc/testsuite/gcc.target/i386/avx2-pr88547-2.c.jj   2018-12-20 17:33:50.336576866 +0100
> +++ gcc/testsuite/gcc.target/i386/avx2-pr88547-2.c      2018-12-20 17:44:34.021062830 +0100
> @@ -0,0 +1,90 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target avx2 } */
> +/* { dg-options "-O2 -mavx2" } */
> +
> +#ifndef CHECK
> +#define CHECK "avx2-check.h"
> +#endif
> +
> +#ifndef TEST
> +#define TEST avx2_test
> +#endif
> +
> +#include CHECK
> +
> +#include "avx2-pr88547-1.c"
> +
> +#define NUM 256
> +
> +#define TEST_SIGNED(vtype, type, N, fn, op) \
> +do                                                             \
> +  {                                                            \
> +    union { vtype x[NUM / N]; type i[NUM]; } dst, src1, src2;  \
> +    int i, sign = 1;                                           \
> +    type res;                                                  \
> +    for (i = 0; i < NUM; i++)                                  \
> +      {                                                                \
> +       src1.i[i] = i * i * sign;                               \
> +       src2.i[i] = (i + 20) * sign;                            \
> +       sign = -sign;                                           \
> +      }                                                                \
> +    for (i = 0; i < NUM; i += N)                               \
> +      dst.x[i / N] = fn (src1.x[i / N], src2.x[i / N]);                \
> +                                                               \
> +    for (i = 0; i < NUM; i++)                                  \
> +      {                                                                \
> +       res = src1.i[i] op src2.i[i] ? -1 : 0;                  \
> +       if (res != dst.i[i])                                    \
> +         abort ();                                             \
> +      }                                                                \
> +  }                                                            \
> +while (0)
> +
> +#define TEST_UNSIGNED(vtype, type, N, fn, op) \
> +do                                                             \
> +  {                                                            \
> +    union { vtype x[NUM / N]; type i[NUM]; } dst, src1, src2;  \
> +    int i;                                                     \
> +    type res;                                                  \
> +                                                               \
> +    for (i = 0; i < NUM; i++)                                  \
> +      {                                                                \
> +       src1.i[i] = i * i;                                      \
> +       src2.i[i] = i + 20;                                     \
> +       if ((i % 4))                                            \
> +         src2.i[i] |= (1ULL << (sizeof (type)                  \
> +                                * __CHAR_BIT__ - 1));          \
> +      }                                                                \
> +                                                               \
> +    for (i = 0; i < NUM; i += N)                               \
> +      dst.x[i / N] = fn (src1.x[i / N], src2.x[i / N]);                \
> +                                                               \
> +    for (i = 0; i < NUM; i++)                                  \
> +      {                                                                \
> +       res = src1.i[i] op src2.i[i] ? -1 : 0;                  \
> +       if (res != dst.i[i])                                    \
> +         abort ();                                             \
> +      }                                                                \
> +  }                                                            \
> +while (0)
> +
> +static void
> +TEST (void)
> +{
> +  TEST_SIGNED (v32qi, signed char, 32, f1, <=);
> +  TEST_UNSIGNED (v32uqi, unsigned char, 32, f2, <=);
> +  TEST_SIGNED (v32qi, signed char, 32, f3, >=);
> +  TEST_UNSIGNED (v32uqi, unsigned char, 32, f4, >=);
> +  TEST_SIGNED (v16hi, short int, 16, f5, <=);
> +  TEST_UNSIGNED (v16uhi, unsigned short int, 16, f6, <=);
> +  TEST_SIGNED (v16hi, short int, 16, f7, >=);
> +  TEST_UNSIGNED (v16uhi, unsigned short int, 16, f8, >=);
> +  TEST_SIGNED (v8si, int, 8, f9, <=);
> +  TEST_UNSIGNED (v8usi, unsigned int, 8, f10, <=);
> +  TEST_SIGNED (v8si, int, 8, f11, >=);
> +  TEST_UNSIGNED (v8usi, unsigned int, 8, f12, >=);
> +  TEST_SIGNED (v4di, long long int, 4, f13, <=);
> +  TEST_UNSIGNED (v4udi, unsigned long long int, 4, f14, <=);
> +  TEST_SIGNED (v4di, long long int, 4, f15, >=);
> +  TEST_UNSIGNED (v4udi, unsigned long long int, 4, f16, >=);
> +}
> --- gcc/testsuite/gcc.target/i386/avx512f-pr88547-2.c.jj        2018-12-20 17:37:15.420227002 +0100
> +++ gcc/testsuite/gcc.target/i386/avx512f-pr88547-2.c   2018-12-20 17:41:14.473322272 +0100
> @@ -0,0 +1,82 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target avx512f } */
> +/* { dg-options "-O2 -mavx512f" } */
> +
> +#include "avx512-check.h"
> +
> +#include "pr88547-1.c"
> +
> +#define NUM 512
> +
> +#define TEST_SIGNED(vtype, type, N, fn, op) \
> +do                                                             \
> +  {                                                            \
> +    union { vtype x[NUM / N]; type i[NUM]; } dst, src1, src2;  \
> +    int i, sign = 1;                                           \
> +    type res;                                                  \
> +    for (i = 0; i < NUM; i++)                                  \
> +      {                                                                \
> +       src1.i[i] = i * i * sign;                               \
> +       src2.i[i] = (i + 20) * sign;                            \
> +       sign = -sign;                                           \
> +      }                                                                \
> +    for (i = 0; i < NUM; i += N)                               \
> +      dst.x[i / N] = fn (src1.x[i / N], src2.x[i / N]);                \
> +                                                               \
> +    for (i = 0; i < NUM; i++)                                  \
> +      {                                                                \
> +       res = src1.i[i] op src2.i[i] ? -1 : 0;                  \
> +       if (res != dst.i[i])                                    \
> +         abort ();                                             \
> +      }                                                                \
> +  }                                                            \
> +while (0)
> +
> +#define TEST_UNSIGNED(vtype, type, N, fn, op) \
> +do                                                             \
> +  {                                                            \
> +    union { vtype x[NUM / N]; type i[NUM]; } dst, src1, src2;  \
> +    int i;                                                     \
> +    type res;                                                  \
> +                                                               \
> +    for (i = 0; i < NUM; i++)                                  \
> +      {                                                                \
> +       src1.i[i] = i * i;                                      \
> +       src2.i[i] = i + 20;                                     \
> +       if ((i % 4))                                            \
> +         src2.i[i] |= (1ULL << (sizeof (type)                  \
> +                                * __CHAR_BIT__ - 1));          \
> +      }                                                                \
> +                                                               \
> +    for (i = 0; i < NUM; i += N)                               \
> +      dst.x[i / N] = fn (src1.x[i / N], src2.x[i / N]);                \
> +                                                               \
> +    for (i = 0; i < NUM; i++)                                  \
> +      {                                                                \
> +       res = src1.i[i] op src2.i[i] ? -1 : 0;                  \
> +       if (res != dst.i[i])                                    \
> +         abort ();                                             \
> +      }                                                                \
> +  }                                                            \
> +while (0)
> +
> +static void
> +test_512 (void)
> +{
> +  TEST_SIGNED (v64qi, signed char, 64, f1, <=);
> +  TEST_UNSIGNED (v64uqi, unsigned char, 64, f2, <=);
> +  TEST_SIGNED (v64qi, signed char, 64, f3, >=);
> +  TEST_UNSIGNED (v64uqi, unsigned char, 64, f4, >=);
> +  TEST_SIGNED (v32hi, short int, 32, f5, <=);
> +  TEST_UNSIGNED (v32uhi, unsigned short int, 32, f6, <=);
> +  TEST_SIGNED (v32hi, short int, 32, f7, >=);
> +  TEST_UNSIGNED (v32uhi, unsigned short int, 32, f8, >=);
> +  TEST_SIGNED (v16si, int, 16, f9, <=);
> +  TEST_UNSIGNED (v16usi, unsigned int, 16, f10, <=);
> +  TEST_SIGNED (v16si, int, 16, f11, >=);
> +  TEST_UNSIGNED (v16usi, unsigned int, 16, f12, >=);
> +  TEST_SIGNED (v8di, long long int, 8, f13, <=);
> +  TEST_UNSIGNED (v8udi, unsigned long long int, 8, f14, <=);
> +  TEST_SIGNED (v8di, long long int, 8, f15, >=);
> +  TEST_UNSIGNED (v8udi, unsigned long long int, 8, f16, >=);
> +}
> --- gcc/testsuite/gcc.target/i386/avx512vl-pr88547-1.c.jj       2018-12-20 16:34:12.260022099 +0100
> +++ gcc/testsuite/gcc.target/i386/avx512vl-pr88547-1.c  2018-12-20 17:45:12.783429682 +0100
> @@ -0,0 +1,14 @@
> +/* PR target/88547 */
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mno-xop -mavx512vl -mno-avx512bw -mno-avx512dq" } */
> +/* { dg-final { scan-assembler-not "vpmingt\[bwdq]\[\t ]" } } */
> +/* { dg-final { scan-assembler-times "vpminub\[\t ]" 2 } } */
> +/* { dg-final { scan-assembler-times "vpminsb\[\t ]" 2 } } */
> +/* { dg-final { scan-assembler-times "vpminuw\[\t ]" 2 } } */
> +/* { dg-final { scan-assembler-times "vpminsw\[\t ]" 2 } } */
> +/* { dg-final { scan-assembler-times "vpminud\[\t ]" 2 } } */
> +/* { dg-final { scan-assembler-times "vpminsd\[\t ]" 2 } } */
> +/* { dg-final { scan-assembler-times "vpminuq\[\t ]" 2 } } */
> +/* { dg-final { scan-assembler-times "vpminsq\[\t ]" 2 } } */
> +
> +#include "avx2-pr88547-1.c"
> --- gcc/testsuite/gcc.target/i386/avx512vl-pr88547-2.c.jj       2018-12-20 17:42:21.697224235 +0100
> +++ gcc/testsuite/gcc.target/i386/avx512vl-pr88547-2.c  2018-12-20 17:47:49.632867686 +0100
> @@ -0,0 +1,22 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target avx512vl } */
> +/* { dg-require-effective-target avx512bw } */
> +/* { dg-require-effective-target avx512dq } */
> +/* { dg-options "-O2 -mavx512vl -mavx512bw -mavx512dq" } */
> +
> +#define AVX512VL
> +#define AVX512BW
> +#define AVX512DQ
> +
> +#include "avx512f-pr88547-2.c"
> +
> +static void
> +test_256 (void)
> +{
> +  test_512 ();
> +}
> +
> +static void
> +test_128 (void)
> +{
> +}
> --- gcc/testsuite/gcc.target/i386/avx512vl-pr88547-3.c.jj       2018-12-20 17:45:19.358322291 +0100
> +++ gcc/testsuite/gcc.target/i386/avx512vl-pr88547-3.c  2018-12-20 17:47:27.823223928 +0100
> @@ -0,0 +1,24 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target avx512vl } */
> +/* { dg-require-effective-target avx512bw } */
> +/* { dg-require-effective-target avx512dq } */
> +/* { dg-options "-O2 -mavx512vl -mavx512bw -mavx512dq" } */
> +
> +#define AVX512VL
> +#define AVX512BW
> +#define AVX512DQ
> +#define CHECK "avx512-check.h"
> +#define TEST test_512
> +
> +#include "avx2-pr88547-2.c"
> +
> +static void
> +test_256 (void)
> +{
> +  return test_512 ();
> +}
> +
> +static void
> +test_128 (void)
> +{
> +}
> --- gcc/testsuite/gcc.target/i386/avx512f_cond_move.c.jj        2016-05-22 12:20:04.109035607 +0200
> +++ gcc/testsuite/gcc.target/i386/avx512f_cond_move.c   2018-12-21 09:44:03.451007492 +0100
> @@ -3,7 +3,7 @@
>  /* { dg-final { scan-assembler-times "(?:vpblendmd|vmovdqa32)\[ \\t\]+\[^\{\n\]*%zmm\[0-9\]+\{%k\[1-7\]\}\{z\}(?:\n|\[ \\t\]+#)" 8 } } */
>
>  unsigned int x[128];
> -unsigned int y[128];
> +int y[128];
>
>  void
>  foo ()
>
>         Jakub
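
For readers following the thread, here is a minimal scalar sketch of the identity the patch relies on: a lane of x <= y is all-ones exactly when min (x, y) == x, so a min followed by an equality compare gives the mask directly, with no negation afterwards. The helper names (le_mask_u32, le_mask_s32, check_le_masks) are hypothetical and not part of the patch. Each function only models one 32-bit lane of the vpminud/vpminsd + vpcmpeqd sequence, not the RTL expansion itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; not taken from the patch.  */

/* Unsigned lane: all-ones when x <= y, computed as min (x, y) == x.  */
static uint32_t
le_mask_u32 (uint32_t x, uint32_t y)
{
  uint32_t m = x < y ? x : y;      /* models one lane of vpminud */
  return m == x ? UINT32_MAX : 0;  /* models one lane of vpcmpeqd */
}

/* Signed lane: same idea, modelling vpminsd followed by vpcmpeqd.  */
static int32_t
le_mask_s32 (int32_t x, int32_t y)
{
  int32_t m = x < y ? x : y;
  return m == x ? -1 : 0;
}

/* Cross-check both helpers against the plain <= comparison on a few
   edge values.  Returns 1 if every pair agrees, 0 otherwise.  */
static int
check_le_masks (void)
{
  static const uint32_t u[] = { 0, 1, 20, 0x7fffffffu, 0x80000000u,
				0xffffffffu };
  static const int32_t s[] = { INT32_MIN, -20, -1, 0, 1, INT32_MAX };
  int i, j;

  for (i = 0; i < 6; i++)
    for (j = 0; j < 6; j++)
      {
	uint32_t eu = u[i] <= u[j] ? UINT32_MAX : 0;
	int32_t es = s[i] <= s[j] ? -1 : 0;
	if (le_mask_u32 (u[i], u[j]) != eu
	    || le_mask_s32 (s[i], s[j]) != es)
	  return 0;
      }
  return 1;
}
```

The x >= y case is the same with the operands swapped, i.e. min (x, y) == y. The unsigned edge values include entries with the top bit set, which is where a signed-only compare would give the wrong answer.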
