public inbox for gcc-patches@gcc.gnu.org
* [PATCH v3 0/6] rs6000: Support more SSE4 intrinsics
@ 2021-08-23 19:03 Paul A. Clarke
  2021-08-23 19:03 ` [PATCH v3 1/6] rs6000: Support SSE4.1 "round" intrinsics Paul A. Clarke
                   ` (7 more replies)
  0 siblings, 8 replies; 47+ messages in thread
From: Paul A. Clarke @ 2021-08-23 19:03 UTC (permalink / raw)
  To: gcc-patches; +Cc: segher, wschmidt

v3: Add "nmmintrin.h". _mm_cmpgt_epi64 is part of SSE4.2,
and users will expect to be able to include "nmmintrin.h",
even though "nmmintrin.h" just includes "smmintrin.h",
where all of the SSE4.2 implementations actually appear.
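For illustration, a forwarding header of this kind is only a few lines; this is a hypothetical sketch (the guard name is illustrative, and the real nmmintrin.h in the patch also carries the usual license header):

```c
/* Hypothetical sketch of a forwarding header like nmmintrin.h.
   It defines nothing itself; all SSE4.2 intrinsics live in
   smmintrin.h, which is pulled in here.  */
#ifndef NMMINTRIN_H_
#define NMMINTRIN_H_
#include <smmintrin.h>
#endif
```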

Only patch 5/6 changed from v2.

Tested ppc64le (POWER9) and ppc64/32 (POWER7).

OK for trunk?

Paul A. Clarke (6):
  rs6000: Support SSE4.1 "round" intrinsics
  rs6000: Support SSE4.1 "min" and "max" intrinsics
  rs6000: Simplify some SSE4.1 "test" intrinsics
  rs6000: Support SSE4.1 "cvt" intrinsics
  rs6000: Support more SSE4 "cmp", "mul", "pack" intrinsics
  rs6000: Guard some x86 intrinsics implementations

 gcc/config/rs6000/emmintrin.h                 |  12 +-
 gcc/config/rs6000/nmmintrin.h                 |  40 ++
 gcc/config/rs6000/pmmintrin.h                 |   4 +
 gcc/config/rs6000/smmintrin.h                 | 427 ++++++++++++++++--
 gcc/config/rs6000/tmmintrin.h                 |  12 +
 gcc/testsuite/gcc.target/powerpc/pr78102.c    |  23 +
 .../gcc.target/powerpc/sse4_1-packusdw.c      |  73 +++
 .../gcc.target/powerpc/sse4_1-pcmpeqq.c       |  46 ++
 .../gcc.target/powerpc/sse4_1-pmaxsb.c        |  46 ++
 .../gcc.target/powerpc/sse4_1-pmaxsd.c        |  46 ++
 .../gcc.target/powerpc/sse4_1-pmaxud.c        |  47 ++
 .../gcc.target/powerpc/sse4_1-pmaxuw.c        |  47 ++
 .../gcc.target/powerpc/sse4_1-pminsb.c        |  46 ++
 .../gcc.target/powerpc/sse4_1-pminsd.c        |  46 ++
 .../gcc.target/powerpc/sse4_1-pminud.c        |  47 ++
 .../gcc.target/powerpc/sse4_1-pminuw.c        |  47 ++
 .../gcc.target/powerpc/sse4_1-pmovsxbd.c      |  42 ++
 .../gcc.target/powerpc/sse4_1-pmovsxbq.c      |  42 ++
 .../gcc.target/powerpc/sse4_1-pmovsxbw.c      |  42 ++
 .../gcc.target/powerpc/sse4_1-pmovsxdq.c      |  42 ++
 .../gcc.target/powerpc/sse4_1-pmovsxwd.c      |  42 ++
 .../gcc.target/powerpc/sse4_1-pmovsxwq.c      |  42 ++
 .../gcc.target/powerpc/sse4_1-pmovzxbd.c      |  43 ++
 .../gcc.target/powerpc/sse4_1-pmovzxbq.c      |  43 ++
 .../gcc.target/powerpc/sse4_1-pmovzxbw.c      |  43 ++
 .../gcc.target/powerpc/sse4_1-pmovzxdq.c      |  43 ++
 .../gcc.target/powerpc/sse4_1-pmovzxwd.c      |  43 ++
 .../gcc.target/powerpc/sse4_1-pmovzxwq.c      |  43 ++
 .../gcc.target/powerpc/sse4_1-pmuldq.c        |  51 +++
 .../gcc.target/powerpc/sse4_1-pmulld.c        |  46 ++
 .../gcc.target/powerpc/sse4_1-round3.h        |  81 ++++
 .../gcc.target/powerpc/sse4_1-roundpd.c       | 143 ++++++
 .../gcc.target/powerpc/sse4_1-roundps.c       |  98 ++++
 .../gcc.target/powerpc/sse4_1-roundsd.c       | 256 +++++++++++
 .../gcc.target/powerpc/sse4_1-roundss.c       | 208 +++++++++
 .../gcc.target/powerpc/sse4_2-check.h         |  18 +
 .../gcc.target/powerpc/sse4_2-pcmpgtq.c       |  46 ++
 37 files changed, 2407 insertions(+), 59 deletions(-)
 create mode 100644 gcc/config/rs6000/nmmintrin.h
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr78102.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-packusdw.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pcmpeqq.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxsb.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxsd.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxud.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxuw.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pminsb.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pminsd.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pminud.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pminuw.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbd.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbq.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbw.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxdq.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxwd.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxwq.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbd.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbq.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbw.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxdq.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxwd.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxwq.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmuldq.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmulld.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-round3.h
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-roundpd.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-roundps.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-roundsd.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-roundss.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_2-check.h
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_2-pcmpgtq.c

-- 
2.27.0



* [PATCH v3 1/6] rs6000: Support SSE4.1 "round" intrinsics
  2021-08-23 19:03 [PATCH v3 0/6] rs6000: Support more SSE4 intrinsics Paul A. Clarke
@ 2021-08-23 19:03 ` Paul A. Clarke
  2021-08-27 13:44   ` Bill Schmidt
  2021-10-07 23:39   ` Segher Boessenkool
  2021-08-23 19:03 ` [PATCH v3 2/6] rs6000: Support SSE4.1 "min" and "max" intrinsics Paul A. Clarke
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 47+ messages in thread
From: Paul A. Clarke @ 2021-08-23 19:03 UTC (permalink / raw)
  To: gcc-patches; +Cc: segher, wschmidt

Suppress exceptions (when specified) by saving, manipulating, and
restoring the FPSCR.  Similarly, save, set, and restore the floating-point
rounding mode when required.

No attempt is made to optimize writing the FPSCR (by checking if the new
value would be the same), other than using lighter weight instructions
when possible.

The scalar versions naively use the parallel versions to compute the
single scalar result and then construct the remainder of the result.

Of minor note, the values of _MM_FROUND_TO_NEG_INF and _MM_FROUND_TO_ZERO
are swapped from the corresponding values on x86 so as to match the
corresponding rounding mode values in the Power ISA.

Move implementations of _mm_ceil* and _mm_floor* into _mm_round*, and
convert _mm_ceil* and _mm_floor* into macros. This matches the current
analogous implementations in config/i386/smmintrin.h.

Function signatures match the analogous functions in config/i386/smmintrin.h.

Add tests for _mm_round_pd, _mm_round_ps, _mm_round_sd, _mm_round_ss,
modeled after the very similar "floor" and "ceil" tests.

Include basic tests, plus tests at the boundaries of floating-point
representation, both positive and negative.  Test all of the parameterized
rounding modes as well as the C99 rounding modes, and the interactions
between the two.

Exceptions are not explicitly tested.

2021-08-20  Paul A. Clarke  <pc@us.ibm.com>

gcc
	* config/rs6000/smmintrin.h (_mm_round_pd, _mm_round_ps,
	_mm_round_sd, _mm_round_ss, _MM_FROUND_TO_NEAREST_INT,
	_MM_FROUND_TO_ZERO, _MM_FROUND_TO_POS_INF, _MM_FROUND_TO_NEG_INF,
	_MM_FROUND_CUR_DIRECTION, _MM_FROUND_RAISE_EXC, _MM_FROUND_NO_EXC,
	_MM_FROUND_NINT, _MM_FROUND_FLOOR, _MM_FROUND_CEIL, _MM_FROUND_TRUNC,
	_MM_FROUND_RINT, _MM_FROUND_NEARBYINT): New.
	* config/rs6000/smmintrin.h (_mm_ceil_pd, _mm_ceil_ps, _mm_ceil_sd,
	_mm_ceil_ss, _mm_floor_pd, _mm_floor_ps, _mm_floor_sd, _mm_floor_ss):
	Convert from function to macro.

gcc/testsuite
	* gcc.target/powerpc/sse4_1-round3.h: New.
	* gcc.target/powerpc/sse4_1-roundpd.c: New.
	* gcc.target/powerpc/sse4_1-roundps.c: New.
	* gcc.target/powerpc/sse4_1-roundsd.c: New.
	* gcc.target/powerpc/sse4_1-roundss.c: New.
---
v3: No change.
v2:
- Replaced clever (and broken) exception masking with more straightforward
  implementation, per v1 review and closer inspection. mtfsf was only
  writing the final nybble (1) instead of the final two nybbles (2), so
  not all of the exception-enable bits were cleared.
- Renamed some variables from cryptic "tmp" and "save" to
  "fpscr_save" and "enables_save".
- Retained use of __builtin_mffsl, since that is supported pre-POWER8
  (with an alternate instruction sequence).
- Added "extern" to functions to maintain compatible decorations with
  like implementations in gcc/config/i386.
- Added some additional text to the commit message about some of the
  (unpleasant?) implementations and decorations coming from
  like implementations in gcc/config/i386, per v1 review.
- Removed "-Wno-psabi" from tests as unnecessary, per v1 review.
- Fixed indentation and other minor formatting changes, per v1 review.
- Noted testing in patch series cover letter.

 gcc/config/rs6000/smmintrin.h                 | 240 +++++++++++-----
 .../gcc.target/powerpc/sse4_1-round3.h        |  81 ++++++
 .../gcc.target/powerpc/sse4_1-roundpd.c       | 143 ++++++++++
 .../gcc.target/powerpc/sse4_1-roundps.c       |  98 +++++++
 .../gcc.target/powerpc/sse4_1-roundsd.c       | 256 ++++++++++++++++++
 .../gcc.target/powerpc/sse4_1-roundss.c       | 208 ++++++++++++++
 6 files changed, 962 insertions(+), 64 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-round3.h
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-roundpd.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-roundps.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-roundsd.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-roundss.c

diff --git a/gcc/config/rs6000/smmintrin.h b/gcc/config/rs6000/smmintrin.h
index 3767a67eada7..a6b88d313ad0 100644
--- a/gcc/config/rs6000/smmintrin.h
+++ b/gcc/config/rs6000/smmintrin.h
@@ -42,6 +42,182 @@
 #include <altivec.h>
 #include <tmmintrin.h>
 
+/* Rounding mode macros. */
+#define _MM_FROUND_TO_NEAREST_INT       0x00
+#define _MM_FROUND_TO_ZERO              0x01
+#define _MM_FROUND_TO_POS_INF           0x02
+#define _MM_FROUND_TO_NEG_INF           0x03
+#define _MM_FROUND_CUR_DIRECTION        0x04
+
+#define _MM_FROUND_NINT		\
+  (_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_RAISE_EXC)
+#define _MM_FROUND_FLOOR	\
+  (_MM_FROUND_TO_NEG_INF | _MM_FROUND_RAISE_EXC)
+#define _MM_FROUND_CEIL		\
+  (_MM_FROUND_TO_POS_INF | _MM_FROUND_RAISE_EXC)
+#define _MM_FROUND_TRUNC	\
+  (_MM_FROUND_TO_ZERO | _MM_FROUND_RAISE_EXC)
+#define _MM_FROUND_RINT		\
+  (_MM_FROUND_CUR_DIRECTION | _MM_FROUND_RAISE_EXC)
+#define _MM_FROUND_NEARBYINT	\
+  (_MM_FROUND_CUR_DIRECTION | _MM_FROUND_NO_EXC)
+
+#define _MM_FROUND_RAISE_EXC            0x00
+#define _MM_FROUND_NO_EXC               0x08
+
+extern __inline __m128d
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_round_pd (__m128d __A, int __rounding)
+{
+  __v2df __r;
+  union {
+    double __fr;
+    long long __fpscr;
+  } __enables_save, __fpscr_save;
+
+  if (__rounding & _MM_FROUND_NO_EXC)
+    {
+      /* Save enabled exceptions, disable all exceptions,
+	 and preserve the rounding mode.  */
+#ifdef _ARCH_PWR9
+      __asm__ __volatile__ ("mffsce %0" : "=f" (__fpscr_save.__fr));
+      __enables_save.__fpscr = __fpscr_save.__fpscr & 0xf8;
+#else
+      __fpscr_save.__fr = __builtin_mffs ();
+      __enables_save.__fpscr = __fpscr_save.__fpscr & 0xf8;
+      __fpscr_save.__fpscr &= ~0xf8;
+      __builtin_mtfsf (0b00000011, __fpscr_save.__fr);
+#endif
+    }
+
+  switch (__rounding)
+    {
+      case _MM_FROUND_TO_NEAREST_INT:
+	__fpscr_save.__fr = __builtin_mffsl ();
+	__attribute__ ((fallthrough));
+      case _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC:
+	__builtin_set_fpscr_rn (0b00);
+	__r = vec_rint ((__v2df) __A);
+	__builtin_set_fpscr_rn (__fpscr_save.__fpscr);
+	break;
+      case _MM_FROUND_TO_NEG_INF:
+      case _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC:
+	__r = vec_floor ((__v2df) __A);
+	break;
+      case _MM_FROUND_TO_POS_INF:
+      case _MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC:
+	__r = vec_ceil ((__v2df) __A);
+	break;
+      case _MM_FROUND_TO_ZERO:
+      case _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC:
+	__r = vec_trunc ((__v2df) __A);
+	break;
+      case _MM_FROUND_CUR_DIRECTION:
+	__r = vec_rint ((__v2df) __A);
+	break;
+    }
+  if (__rounding & _MM_FROUND_NO_EXC)
+    {
+      /* Restore enabled exceptions.  */
+      __fpscr_save.__fr = __builtin_mffsl ();
+      __fpscr_save.__fpscr |= __enables_save.__fpscr;
+      __builtin_mtfsf (0b00000011, __fpscr_save.__fr);
+    }
+  return (__m128d) __r;
+}
+
+extern __inline __m128d
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_round_sd (__m128d __A, __m128d __B, int __rounding)
+{
+  __B = _mm_round_pd (__B, __rounding);
+  __v2df __r = { ((__v2df)__B)[0], ((__v2df) __A)[1] };
+  return (__m128d) __r;
+}
+
+extern __inline __m128
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_round_ps (__m128 __A, int __rounding)
+{
+  __v4sf __r;
+  union {
+    double __fr;
+    long long __fpscr;
+  } __enables_save, __fpscr_save;
+
+  if (__rounding & _MM_FROUND_NO_EXC)
+    {
+      /* Save enabled exceptions, disable all exceptions,
+	 and preserve the rounding mode.  */
+#ifdef _ARCH_PWR9
+      __asm__ __volatile__ ("mffsce %0" : "=f" (__fpscr_save.__fr));
+      __enables_save.__fpscr = __fpscr_save.__fpscr & 0xf8;
+#else
+      __fpscr_save.__fr = __builtin_mffs ();
+      __enables_save.__fpscr = __fpscr_save.__fpscr & 0xf8;
+      __fpscr_save.__fpscr &= ~0xf8;
+      __builtin_mtfsf (0b00000011, __fpscr_save.__fr);
+#endif
+    }
+
+  switch (__rounding)
+    {
+      case _MM_FROUND_TO_NEAREST_INT:
+	__fpscr_save.__fr = __builtin_mffsl ();
+	__attribute__ ((fallthrough));
+      case _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC:
+	__builtin_set_fpscr_rn (0b00);
+	__r = vec_rint ((__v4sf) __A);
+	__builtin_set_fpscr_rn (__fpscr_save.__fpscr);
+	break;
+      case _MM_FROUND_TO_NEG_INF:
+      case _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC:
+	__r = vec_floor ((__v4sf) __A);
+	break;
+      case _MM_FROUND_TO_POS_INF:
+      case _MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC:
+	__r = vec_ceil ((__v4sf) __A);
+	break;
+      case _MM_FROUND_TO_ZERO:
+      case _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC:
+	__r = vec_trunc ((__v4sf) __A);
+	break;
+      case _MM_FROUND_CUR_DIRECTION:
+	__r = vec_rint ((__v4sf) __A);
+	break;
+    }
+  if (__rounding & _MM_FROUND_NO_EXC)
+    {
+      /* Restore enabled exceptions.  */
+      __fpscr_save.__fr = __builtin_mffsl ();
+      __fpscr_save.__fpscr |= __enables_save.__fpscr;
+      __builtin_mtfsf (0b00000011, __fpscr_save.__fr);
+    }
+  return (__m128) __r;
+}
+
+extern __inline __m128
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_round_ss (__m128 __A, __m128 __B, int __rounding)
+{
+  __B = _mm_round_ps (__B, __rounding);
+  __v4sf __r = (__v4sf) __A;
+  __r[0] = ((__v4sf)__B)[0];
+  return (__m128) __r;
+}
+
+#define _mm_ceil_pd(V)	   _mm_round_pd ((V), _MM_FROUND_CEIL)
+#define _mm_ceil_sd(D, V)  _mm_round_sd ((D), (V), _MM_FROUND_CEIL)
+
+#define _mm_floor_pd(V)	   _mm_round_pd((V), _MM_FROUND_FLOOR)
+#define _mm_floor_sd(D, V) _mm_round_sd ((D), (V), _MM_FROUND_FLOOR)
+
+#define _mm_ceil_ps(V)	   _mm_round_ps ((V), _MM_FROUND_CEIL)
+#define _mm_ceil_ss(D, V)  _mm_round_ss ((D), (V), _MM_FROUND_CEIL)
+
+#define _mm_floor_ps(V)	   _mm_round_ps ((V), _MM_FROUND_FLOOR)
+#define _mm_floor_ss(D, V) _mm_round_ss ((D), (V), _MM_FROUND_FLOOR)
+
 extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_insert_epi8 (__m128i const __A, int const __D, int const __N)
 {
@@ -232,70 +408,6 @@ _mm_test_mix_ones_zeros (__m128i __A, __m128i __mask)
   return any_ones * any_zeros;
 }
 
-__inline __m128d
-__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
-_mm_ceil_pd (__m128d __A)
-{
-  return (__m128d) vec_ceil ((__v2df) __A);
-}
-
-__inline __m128d
-__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
-_mm_ceil_sd (__m128d __A, __m128d __B)
-{
-  __v2df __r = vec_ceil ((__v2df) __B);
-  __r[1] = ((__v2df) __A)[1];
-  return (__m128d) __r;
-}
-
-__inline __m128d
-__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
-_mm_floor_pd (__m128d __A)
-{
-  return (__m128d) vec_floor ((__v2df) __A);
-}
-
-__inline __m128d
-__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
-_mm_floor_sd (__m128d __A, __m128d __B)
-{
-  __v2df __r = vec_floor ((__v2df) __B);
-  __r[1] = ((__v2df) __A)[1];
-  return (__m128d) __r;
-}
-
-__inline __m128
-__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
-_mm_ceil_ps (__m128 __A)
-{
-  return (__m128) vec_ceil ((__v4sf) __A);
-}
-
-__inline __m128
-__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
-_mm_ceil_ss (__m128 __A, __m128 __B)
-{
-  __v4sf __r = (__v4sf) __A;
-  __r[0] = __builtin_ceil (((__v4sf) __B)[0]);
-  return __r;
-}
-
-__inline __m128
-__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
-_mm_floor_ps (__m128 __A)
-{
-  return (__m128) vec_floor ((__v4sf) __A);
-}
-
-__inline __m128
-__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
-_mm_floor_ss (__m128 __A, __m128 __B)
-{
-  __v4sf __r = (__v4sf) __A;
-  __r[0] = __builtin_floor (((__v4sf) __B)[0]);
-  return __r;
-}
-
 /* Return horizontal packed word minimum and its index in bits [15:0]
    and bits [18:16] respectively.  */
 __inline __m128i
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-round3.h b/gcc/testsuite/gcc.target/powerpc/sse4_1-round3.h
new file mode 100644
index 000000000000..de6cbf7be438
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-round3.h
@@ -0,0 +1,81 @@
+#include <smmintrin.h>
+#include <fenv.h>
+#include "sse4_1-check.h"
+
+#define DIM(a) (sizeof (a) / sizeof (a)[0])
+
+static int roundings[] =
+  {
+    _MM_FROUND_TO_NEAREST_INT,
+    _MM_FROUND_TO_NEG_INF,
+    _MM_FROUND_TO_POS_INF,
+    _MM_FROUND_TO_ZERO,
+    _MM_FROUND_CUR_DIRECTION
+  };
+
+static int modes[] =
+  {
+    FE_TONEAREST,
+    FE_UPWARD,
+    FE_DOWNWARD,
+    FE_TOWARDZERO
+  };
+
+static void
+TEST (void)
+{
+  int i, j, ri, mi, round_save;
+
+  round_save = fegetround ();
+  for (mi = 0; mi < DIM (modes); mi++) {
+    fesetround (modes[mi]);
+    for (i = 0; i < DIM (data); i++) {
+      for (ri = 0; ri < DIM (roundings); ri++) {
+	union value guess;
+	union value *current_answers = answers[ri];
+	switch ( roundings[ri] ) {
+	  case _MM_FROUND_TO_NEAREST_INT:
+	    guess.x = ROUND_INTRIN (data[i].value1.x, data[i].value2.x,
+				    _MM_FROUND_TO_NEAREST_INT);
+	    break;
+	  case _MM_FROUND_TO_NEG_INF:
+	    guess.x = ROUND_INTRIN (data[i].value1.x, data[i].value2.x,
+				    _MM_FROUND_TO_NEG_INF);
+	    break;
+	  case _MM_FROUND_TO_POS_INF:
+	    guess.x = ROUND_INTRIN (data[i].value1.x, data[i].value2.x,
+				    _MM_FROUND_TO_POS_INF);
+	    break;
+	  case _MM_FROUND_TO_ZERO:
+	    guess.x = ROUND_INTRIN (data[i].value1.x, data[i].value2.x,
+				    _MM_FROUND_TO_ZERO);
+	    break;
+	  case _MM_FROUND_CUR_DIRECTION:
+	    guess.x = ROUND_INTRIN (data[i].value1.x, data[i].value2.x,
+				    _MM_FROUND_CUR_DIRECTION);
+	    switch ( modes[mi] ) {
+	      case FE_TONEAREST:
+		current_answers = answers_NEAREST_INT;
+		break;
+	      case FE_UPWARD:
+		current_answers = answers_POS_INF;
+		break;
+	      case FE_DOWNWARD:
+		current_answers = answers_NEG_INF;
+		break;
+	      case FE_TOWARDZERO:
+		current_answers = answers_ZERO;
+		break;
+	    }
+	    break;
+	  default:
+	    abort ();
+	}
+	for (j = 0; j < DIM (guess.f); j++)
+	  if (guess.f[j] != current_answers[i].f[j])
+	    abort ();
+      }
+    }
+  }
+  fesetround (round_save);
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-roundpd.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-roundpd.c
new file mode 100644
index 000000000000..0528c395f233
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-roundpd.c
@@ -0,0 +1,143 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-options "-O2 -mvsx" } */
+
+#define NO_WARN_X86_INTRINSICS 1
+#include <smmintrin.h>
+
+#define VEC_T __m128d
+#define FP_T double
+
+#define ROUND_INTRIN(x, ignored, mode) _mm_round_pd (x, mode)
+
+#include "sse4_1-round-data.h"
+
+struct data2 data[] = {
+  { .value1 = { .f = {  0.00,  0.25 } } },
+  { .value1 = { .f = {  0.50,  0.75 } } },
+
+  { .value1 = { .f = {  0x1.ffffffffffffcp+50,  0x1.ffffffffffffdp+50 } } },
+  { .value1 = { .f = {  0x1.ffffffffffffep+50,  0x1.fffffffffffffp+50 } } },
+  { .value1 = { .f = {  0x1.0000000000000p+51,  0x1.0000000000001p+51 } } },
+  { .value1 = { .f = {  0x1.0000000000002p+51,  0x1.0000000000003p+51 } } },
+
+  { .value1 = { .f = {  0x1.ffffffffffffep+51,  0x1.fffffffffffffp+51 } } },
+  { .value1 = { .f = {  0x1.0000000000000p+52,  0x1.0000000000001p+52 } } },
+
+  { .value1 = { .f = { -0x1.0000000000001p+52, -0x1.0000000000000p+52 } } },
+  { .value1 = { .f = { -0x1.fffffffffffffp+51, -0x1.ffffffffffffep+51 } } },
+
+  { .value1 = { .f = { -0x1.0000000000004p+51, -0x1.0000000000002p+51 } } },
+  { .value1 = { .f = { -0x1.0000000000001p+51, -0x1.0000000000000p+51 } } },
+  { .value1 = { .f = { -0x1.ffffffffffffcp+50, -0x1.ffffffffffffep+50 } } },
+  { .value1 = { .f = { -0x1.ffffffffffffdp+50, -0x1.ffffffffffffcp+50 } } },
+
+  { .value1 = { .f = { -1.00, -0.75 } } },
+  { .value1 = { .f = { -0.50, -0.25 } } }
+};
+
+union value answers_NEAREST_INT[] = {
+  { .f = {  0.00,  0.00 } },
+  { .f = {  0.00,  1.00 } },
+
+  { .f = {  0x1.ffffffffffffcp+50,  0x1.ffffffffffffcp+50 } },
+  { .f = {  0x1.0000000000000p+51,  0x1.0000000000000p+51 } },
+  { .f = {  0x1.0000000000000p+51,  0x1.0000000000000p+51 } },
+  { .f = {  0x1.0000000000002p+51,  0x1.0000000000004p+51 } },
+
+  { .f = {  0x1.ffffffffffffep+51,  0x1.0000000000000p+52 } },
+  { .f = {  0x1.0000000000000p+52,  0x1.0000000000001p+52 } },
+
+  { .f = { -0x1.0000000000001p+52, -0x1.0000000000000p+52 } },
+  { .f = { -0x1.0000000000000p+52, -0x1.ffffffffffffep+51 } },
+
+  { .f = { -0x1.0000000000004p+51, -0x1.0000000000002p+51 } },
+  { .f = { -0x1.0000000000000p+51, -0x1.0000000000000p+51 } },
+  { .f = { -0x1.ffffffffffffcp+50, -0x1.0000000000000p+51 } },
+  { .f = { -0x1.ffffffffffffcp+50, -0x1.ffffffffffffcp+50 } },
+
+  { .f = { -1.00, -1.00 } },
+  { .f = {  0.00,  0.00 } }
+};
+
+union value answers_NEG_INF[] = {
+  { .f = {  0.00,  0.00 } },
+  { .f = {  0.00,  0.00 } },
+
+  { .f = {  0x1.ffffffffffffcp+50,  0x1.ffffffffffffcp+50 } },
+  { .f = {  0x1.ffffffffffffcp+50,  0x1.ffffffffffffcp+50 } },
+  { .f = {  0x1.0000000000000p+51,  0x1.0000000000000p+51 } },
+  { .f = {  0x1.0000000000002p+51,  0x1.0000000000002p+51 } },
+
+  { .f = {  0x1.ffffffffffffep+51,  0x1.ffffffffffffep+51 } },
+  { .f = {  0x1.0000000000000p+52,  0x1.0000000000001p+52 } },
+
+  { .f = { -0x1.0000000000001p+52, -0x1.0000000000000p+52 } },
+  { .f = { -0x1.0000000000000p+52, -0x1.ffffffffffffep+51 } },
+
+  { .f = { -0x1.0000000000004p+51, -0x1.0000000000002p+51 } },
+  { .f = { -0x1.0000000000002p+51, -0x1.0000000000000p+51 } },
+  { .f = { -0x1.ffffffffffffcp+50, -0x1.0000000000000p+51 } },
+  { .f = { -0x1.0000000000000p+51, -0x1.ffffffffffffcp+50 } },
+
+  { .f = { -1.00, -1.00 } },
+  { .f = { -1.00, -1.00 } }
+};
+
+union value answers_POS_INF[] = {
+  { .f = {  0.00,  1.00 } },
+  { .f = {  1.00,  1.00 } },
+
+  { .f = {  0x1.ffffffffffffcp+50,  0x1.0000000000000p+51 } },
+  { .f = {  0x1.0000000000000p+51,  0x1.0000000000000p+51 } },
+  { .f = {  0x1.0000000000000p+51,  0x1.0000000000002p+51 } },
+  { .f = {  0x1.0000000000002p+51,  0x1.0000000000004p+51 } },
+
+  { .f = {  0x1.ffffffffffffep+51,  0x1.0000000000000p+52 } },
+  { .f = {  0x1.0000000000000p+52,  0x1.0000000000001p+52 } },
+
+  { .f = { -0x1.0000000000001p+52, -0x1.0000000000000p+52 } },
+  { .f = { -0x1.ffffffffffffep+51, -0x1.ffffffffffffep+51 } },
+
+  { .f = { -0x1.0000000000004p+51, -0x1.0000000000002p+51 } },
+  { .f = { -0x1.0000000000000p+51, -0x1.0000000000000p+51 } },
+  { .f = { -0x1.ffffffffffffcp+50, -0x1.ffffffffffffcp+50 } },
+  { .f = { -0x1.ffffffffffffcp+50, -0x1.ffffffffffffcp+50 } },
+
+  { .f = { -1.00,  0.00 } },
+  { .f = {  0.00,  0.00 } }
+};
+
+union value answers_ZERO[] = {
+  { .f = {  0.00,  0.00 } },
+  { .f = {  0.00,  0.00 } },
+
+  { .f = {  0x1.ffffffffffffcp+50,  0x1.ffffffffffffcp+50 } },
+  { .f = {  0x1.ffffffffffffcp+50,  0x1.ffffffffffffcp+50 } },
+  { .f = {  0x1.0000000000000p+51,  0x1.0000000000000p+51 } },
+  { .f = {  0x1.0000000000002p+51,  0x1.0000000000002p+51 } },
+
+  { .f = {  0x1.ffffffffffffep+51,  0x1.ffffffffffffep+51 } },
+  { .f = {  0x1.0000000000000p+52,  0x1.0000000000001p+52 } },
+
+  { .f = { -0x1.0000000000001p+52, -0x1.0000000000000p+52 } },
+  { .f = { -0x1.ffffffffffffep+51, -0x1.ffffffffffffep+51 } },
+
+  { .f = { -0x1.0000000000004p+51, -0x1.0000000000002p+51 } },
+  { .f = { -0x1.0000000000000p+51, -0x1.0000000000000p+51 } },
+  { .f = { -0x1.ffffffffffffcp+50, -0x1.ffffffffffffcp+50 } },
+  { .f = { -0x1.ffffffffffffcp+50, -0x1.ffffffffffffcp+50 } },
+
+  { .f = { -1.00,  0.00 } },
+  { .f = {  0.00,  0.00 } }
+};
+
+union value *answers[] = {
+  answers_NEAREST_INT,
+  answers_NEG_INF,
+  answers_POS_INF,
+  answers_ZERO,
+  0 /* CUR_DIRECTION answers depend on current rounding mode.  */
+};
+
+#include "sse4_1-round3.h"
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-roundps.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-roundps.c
new file mode 100644
index 000000000000..6b5362e07590
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-roundps.c
@@ -0,0 +1,98 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-options "-O2 -mvsx" } */
+
+#define NO_WARN_X86_INTRINSICS 1
+#include <smmintrin.h>
+
+#define VEC_T __m128
+#define FP_T float
+
+#define ROUND_INTRIN(x, ignored, mode) _mm_round_ps (x, mode)
+
+#include "sse4_1-round-data.h"
+
+struct data2 data[] = {
+  { .value1 = { .f = {  0.00,  0.25,  0.50,  0.75 } } },
+
+  { .value1 = { .f = {  0x1.fffff8p+21,  0x1.fffffap+21,
+			0x1.fffffcp+21,  0x1.fffffep+21 } } },
+  { .value1 = { .f = {  0x1.fffffap+22,  0x1.fffffcp+22,
+			0x1.fffffep+22,  0x1.fffffep+23 } } },
+  { .value1 = { .f = { -0x1.fffffep+23, -0x1.fffffep+22,
+		       -0x1.fffffcp+22, -0x1.fffffap+22 } } },
+  { .value1 = { .f = { -0x1.fffffep+21, -0x1.fffffcp+21,
+		       -0x1.fffffap+21, -0x1.fffff8p+21 } } },
+
+  { .value1 = { .f = { -1.00, -0.75, -0.50, -0.25 } } }
+};
+
+union value answers_NEAREST_INT[] = {
+  { .f = {  0.00,  0.00,  0.00,  1.00 } },
+
+  { .f = {  0x1.fffff8p+21,  0x1.fffff8p+21,
+            0x1.000000p+22,  0x1.000000p+22 } },
+  { .f = {  0x1.fffff8p+22,  0x1.fffffcp+22,
+            0x1.000000p+23,  0x1.fffffep+23 } },
+  { .f = { -0x1.fffffep+23, -0x1.000000p+23,
+           -0x1.fffffcp+22, -0x1.fffff8p+22 } },
+  { .f = { -0x1.000000p+22, -0x1.000000p+22,
+           -0x1.fffff8p+21, -0x1.fffff8p+21 } },
+
+  { .f = { -1.00, -1.00,  0.00,  0.00 } }
+};
+
+union value answers_NEG_INF[] = {
+  { .f = {  0.00,  0.00,  0.00,  0.00 } },
+
+  { .f = {  0x1.fffff8p+21,  0x1.fffff8p+21,
+            0x1.fffff8p+21,  0x1.fffff8p+21 } },
+  { .f = {  0x1.fffff8p+22,  0x1.fffffcp+22,
+            0x1.fffffcp+22,  0x1.fffffep+23 } },
+  { .f = { -0x1.fffffep+23, -0x1.000000p+23,
+           -0x1.fffffcp+22, -0x1.fffffcp+22 } },
+  { .f = { -0x1.000000p+22, -0x1.000000p+22,
+           -0x1.000000p+22, -0x1.fffff8p+21 } },
+
+  { .f = { -1.00, -1.00, -1.00, -1.00 } }
+};
+
+union value answers_POS_INF[] = {
+  { .f = {  0.00,  1.00,  1.00,  1.00 } },
+
+  { .f = {  0x1.fffff8p+21,  0x1.000000p+22,
+            0x1.000000p+22,  0x1.000000p+22 } },
+  { .f = {  0x1.fffffcp+22,  0x1.fffffcp+22,
+            0x1.000000p+23,  0x1.fffffep+23 } },
+  { .f = { -0x1.fffffep+23, -0x1.fffffcp+22,
+           -0x1.fffffcp+22, -0x1.fffff8p+22 } },
+  { .f = { -0x1.fffff8p+21, -0x1.fffff8p+21,
+           -0x1.fffff8p+21, -0x1.fffff8p+21 } },
+
+  { .f = { -1.00,  0.00,  0.00,  0.00 } }
+};
+
+union value answers_ZERO[] = {
+  { .f = {  0.00,  0.00,  0.00,  0.00 } },
+
+  { .f = {  0x1.fffff8p+21,  0x1.fffff8p+21,
+            0x1.fffff8p+21,  0x1.fffff8p+21 } },
+  { .f = {  0x1.fffff8p+22,  0x1.fffffcp+22,
+            0x1.fffffcp+22,  0x1.fffffep+23 } },
+  { .f = { -0x1.fffffep+23, -0x1.fffffcp+22,
+           -0x1.fffffcp+22, -0x1.fffff8p+22 } },
+  { .f = { -0x1.fffff8p+21, -0x1.fffff8p+21,
+           -0x1.fffff8p+21, -0x1.fffff8p+21 } },
+
+  { .f = { -1.00,  0.00,  0.00,  0.00 } }
+};
+
+union value *answers[] = {
+  answers_NEAREST_INT,
+  answers_NEG_INF,
+  answers_POS_INF,
+  answers_ZERO,
+  0 /* CUR_DIRECTION answers depend on current rounding mode.  */
+};
+
+#include "sse4_1-round3.h"
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-roundsd.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-roundsd.c
new file mode 100644
index 000000000000..2b0bad6469df
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-roundsd.c
@@ -0,0 +1,256 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-options "-O2 -mvsx" } */
+
+#include <stdio.h>
+#define NO_WARN_X86_INTRINSICS 1
+#include <smmintrin.h>
+
+#define VEC_T __m128d
+#define FP_T double
+
+#define ROUND_INTRIN(x, y, mode) _mm_round_sd (x, y, mode)
+
+#include "sse4_1-round-data.h"
+
+static struct data2 data[] = {
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = {  0.00, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = {  0.25, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = {  0.50, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = {  0.75, IGNORED } } },
+
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = {  0x1.ffffffffffffcp+50, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = {  0x1.ffffffffffffdp+50, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = {  0x1.ffffffffffffep+50, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = {  0x1.fffffffffffffp+50, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = {  0x1.0000000000000p+51, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = {  0x1.0000000000001p+51, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = {  0x1.0000000000002p+51, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = {  0x1.0000000000003p+51, IGNORED } } },
+
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = {  0x1.ffffffffffffep+51, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = {  0x1.fffffffffffffp+51, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = {  0x1.0000000000000p+52, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = {  0x1.0000000000001p+52, IGNORED } } },
+
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = { -0x1.0000000000001p+52, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = { -0x1.0000000000000p+52, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = { -0x1.fffffffffffffp+51, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = { -0x1.ffffffffffffep+51, IGNORED } } },
+
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = { -0x1.0000000000004p+51, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = { -0x1.0000000000002p+51, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = { -0x1.0000000000001p+51, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = { -0x1.0000000000000p+51, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = { -0x1.ffffffffffffcp+50, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = { -0x1.ffffffffffffep+50, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = { -0x1.ffffffffffffdp+50, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = { -0x1.ffffffffffffcp+50, IGNORED } } },
+
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = { -1.00, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = { -0.75, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = { -0.50, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
+    .value2 = { .f = { -0.25, IGNORED } } }
+};
+
+static union value answers_NEAREST_INT[] = {
+  { .f = {  0.00, PASSTHROUGH } },
+  { .f = {  0.00, PASSTHROUGH } },
+  { .f = {  0.00, PASSTHROUGH } },
+  { .f = {  1.00, PASSTHROUGH } },
+
+  { .f = {  0x1.ffffffffffffcp+50, PASSTHROUGH } },
+  { .f = {  0x1.ffffffffffffcp+50, PASSTHROUGH } },
+  { .f = {  0x1.0000000000000p+51, PASSTHROUGH } },
+  { .f = {  0x1.0000000000000p+51, PASSTHROUGH } },
+  { .f = {  0x1.0000000000000p+51, PASSTHROUGH } },
+  { .f = {  0x1.0000000000000p+51, PASSTHROUGH } },
+  { .f = {  0x1.0000000000002p+51, PASSTHROUGH } },
+  { .f = {  0x1.0000000000004p+51, PASSTHROUGH } },
+
+  { .f = {  0x1.ffffffffffffep+51, PASSTHROUGH } },
+  { .f = {  0x1.0000000000000p+52, PASSTHROUGH } },
+  { .f = {  0x1.0000000000000p+52, PASSTHROUGH } },
+  { .f = {  0x1.0000000000001p+52, PASSTHROUGH } },
+
+  { .f = { -0x1.0000000000001p+52, PASSTHROUGH } },
+  { .f = { -0x1.0000000000000p+52, PASSTHROUGH } },
+  { .f = { -0x1.0000000000000p+52, PASSTHROUGH } },
+  { .f = { -0x1.ffffffffffffep+51, PASSTHROUGH } },
+
+  { .f = { -0x1.0000000000004p+51, PASSTHROUGH } },
+  { .f = { -0x1.0000000000002p+51, PASSTHROUGH } },
+  { .f = { -0x1.0000000000000p+51, PASSTHROUGH } },
+  { .f = { -0x1.0000000000000p+51, PASSTHROUGH } },
+  { .f = { -0x1.ffffffffffffcp+50, PASSTHROUGH } },
+  { .f = { -0x1.0000000000000p+51, PASSTHROUGH } },
+  { .f = { -0x1.ffffffffffffcp+50, PASSTHROUGH } },
+  { .f = { -0x1.ffffffffffffcp+50, PASSTHROUGH } },
+
+  { .f = { -1.00, PASSTHROUGH } },
+  { .f = { -1.00, PASSTHROUGH } },
+  { .f = { -0.00, PASSTHROUGH } },
+  { .f = { -0.00, PASSTHROUGH } }
+};
+
+static union value answers_NEG_INF[] = {
+  { .f = {  0.00, PASSTHROUGH } },
+  { .f = {  0.00, PASSTHROUGH } },
+  { .f = {  0.00, PASSTHROUGH } },
+  { .f = {  0.00, PASSTHROUGH } },
+
+  { .f = {  0x1.ffffffffffffcp+50, PASSTHROUGH } },
+  { .f = {  0x1.ffffffffffffcp+50, PASSTHROUGH } },
+  { .f = {  0x1.ffffffffffffcp+50, PASSTHROUGH } },
+  { .f = {  0x1.ffffffffffffcp+50, PASSTHROUGH } },
+  { .f = {  0x1.0000000000000p+51, PASSTHROUGH } },
+  { .f = {  0x1.0000000000000p+51, PASSTHROUGH } },
+  { .f = {  0x1.0000000000002p+51, PASSTHROUGH } },
+  { .f = {  0x1.0000000000002p+51, PASSTHROUGH } },
+
+  { .f = {  0x1.ffffffffffffep+51, PASSTHROUGH } },
+  { .f = {  0x1.ffffffffffffep+51, PASSTHROUGH } },
+  { .f = {  0x1.0000000000000p+52, PASSTHROUGH } },
+  { .f = {  0x1.0000000000001p+52, PASSTHROUGH } },
+
+  { .f = { -0x1.0000000000001p+52, PASSTHROUGH } },
+  { .f = { -0x1.0000000000000p+52, PASSTHROUGH } },
+  { .f = { -0x1.0000000000000p+52, PASSTHROUGH } },
+  { .f = { -0x1.ffffffffffffep+51, PASSTHROUGH } },
+
+  { .f = { -0x1.0000000000004p+51, PASSTHROUGH } },
+  { .f = { -0x1.0000000000002p+51, PASSTHROUGH } },
+  { .f = { -0x1.0000000000002p+51, PASSTHROUGH } },
+  { .f = { -0x1.0000000000000p+51, PASSTHROUGH } },
+  { .f = { -0x1.ffffffffffffcp+50, PASSTHROUGH } },
+  { .f = { -0x1.0000000000000p+51, PASSTHROUGH } },
+  { .f = { -0x1.0000000000000p+51, PASSTHROUGH } },
+  { .f = { -0x1.ffffffffffffcp+50, PASSTHROUGH } },
+
+  { .f = { -1.00, PASSTHROUGH } },
+  { .f = { -1.00, PASSTHROUGH } },
+  { .f = { -1.00, PASSTHROUGH } },
+  { .f = { -1.00, PASSTHROUGH } }
+};
+
+static union value answers_POS_INF[] = {
+  { .f = {  0.00, PASSTHROUGH } },
+  { .f = {  1.00, PASSTHROUGH } },
+  { .f = {  1.00, PASSTHROUGH } },
+  { .f = {  1.00, PASSTHROUGH } },
+
+  { .f = {  0x1.ffffffffffffcp+50, PASSTHROUGH } },
+  { .f = {  0x1.0000000000000p+51, PASSTHROUGH } },
+  { .f = {  0x1.0000000000000p+51, PASSTHROUGH } },
+  { .f = {  0x1.0000000000000p+51, PASSTHROUGH } },
+  { .f = {  0x1.0000000000000p+51, PASSTHROUGH } },
+  { .f = {  0x1.0000000000002p+51, PASSTHROUGH } },
+  { .f = {  0x1.0000000000002p+51, PASSTHROUGH } },
+  { .f = {  0x1.0000000000004p+51, PASSTHROUGH } },
+
+  { .f = {  0x1.ffffffffffffep+51, PASSTHROUGH } },
+  { .f = {  0x1.0000000000000p+52, PASSTHROUGH } },
+  { .f = {  0x1.0000000000000p+52, PASSTHROUGH } },
+  { .f = {  0x1.0000000000001p+52, PASSTHROUGH } },
+
+  { .f = { -0x1.0000000000001p+52, PASSTHROUGH } },
+  { .f = { -0x1.0000000000000p+52, PASSTHROUGH } },
+  { .f = { -0x1.ffffffffffffep+51, PASSTHROUGH } },
+  { .f = { -0x1.ffffffffffffep+51, PASSTHROUGH } },
+
+  { .f = { -0x1.0000000000004p+51, PASSTHROUGH } },
+  { .f = { -0x1.0000000000002p+51, PASSTHROUGH } },
+  { .f = { -0x1.0000000000000p+51, PASSTHROUGH } },
+  { .f = { -0x1.0000000000000p+51, PASSTHROUGH } },
+  { .f = { -0x1.ffffffffffffcp+50, PASSTHROUGH } },
+  { .f = { -0x1.ffffffffffffcp+50, PASSTHROUGH } },
+  { .f = { -0x1.ffffffffffffcp+50, PASSTHROUGH } },
+  { .f = { -0x1.ffffffffffffcp+50, PASSTHROUGH } },
+
+  { .f = { -1.00, PASSTHROUGH } },
+  { .f = {  0.00, PASSTHROUGH } },
+  { .f = {  0.00, PASSTHROUGH } },
+  { .f = {  0.00, PASSTHROUGH } }
+};
+
+static union value answers_ZERO[] = {
+  { .f = {  0.00, PASSTHROUGH } },
+  { .f = {  0.00, PASSTHROUGH } },
+  { .f = {  0.00, PASSTHROUGH } },
+  { .f = {  0.00, PASSTHROUGH } },
+
+  { .f = {  0x1.ffffffffffffcp+50, PASSTHROUGH } },
+  { .f = {  0x1.ffffffffffffcp+50, PASSTHROUGH } },
+  { .f = {  0x1.ffffffffffffcp+50, PASSTHROUGH } },
+  { .f = {  0x1.ffffffffffffcp+50, PASSTHROUGH } },
+  { .f = {  0x1.0000000000000p+51, PASSTHROUGH } },
+  { .f = {  0x1.0000000000000p+51, PASSTHROUGH } },
+  { .f = {  0x1.0000000000002p+51, PASSTHROUGH } },
+  { .f = {  0x1.0000000000002p+51, PASSTHROUGH } },
+
+  { .f = {  0x1.ffffffffffffep+51, PASSTHROUGH } },
+  { .f = {  0x1.ffffffffffffep+51, PASSTHROUGH } },
+  { .f = {  0x1.0000000000000p+52, PASSTHROUGH } },
+  { .f = {  0x1.0000000000001p+52, PASSTHROUGH } },
+
+  { .f = { -0x1.0000000000001p+52, PASSTHROUGH } },
+  { .f = { -0x1.0000000000000p+52, PASSTHROUGH } },
+  { .f = { -0x1.ffffffffffffep+51, PASSTHROUGH } },
+  { .f = { -0x1.ffffffffffffep+51, PASSTHROUGH } },
+
+  { .f = { -0x1.0000000000004p+51, PASSTHROUGH } },
+  { .f = { -0x1.0000000000002p+51, PASSTHROUGH } },
+  { .f = { -0x1.0000000000000p+51, PASSTHROUGH } },
+  { .f = { -0x1.0000000000000p+51, PASSTHROUGH } },
+  { .f = { -0x1.ffffffffffffcp+50, PASSTHROUGH } },
+  { .f = { -0x1.ffffffffffffcp+50, PASSTHROUGH } },
+  { .f = { -0x1.ffffffffffffcp+50, PASSTHROUGH } },
+  { .f = { -0x1.ffffffffffffcp+50, PASSTHROUGH } },
+
+  { .f = { -1.00, PASSTHROUGH } },
+  { .f = {  0.00, PASSTHROUGH } },
+  { .f = {  0.00, PASSTHROUGH } },
+  { .f = {  0.00, PASSTHROUGH } }
+};
+
+union value *answers[] = {
+  answers_NEAREST_INT,
+  answers_NEG_INF,
+  answers_POS_INF,
+  answers_ZERO,
+  0 /* CUR_DIRECTION answers depend on current rounding mode.  */
+};
+
+#include "sse4_1-round3.h"
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-roundss.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-roundss.c
new file mode 100644
index 000000000000..3154310314a1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-roundss.c
@@ -0,0 +1,208 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-options "-O2 -mvsx" } */
+
+#include <stdio.h>
+#define NO_WARN_X86_INTRINSICS 1
+#include <smmintrin.h>
+
+#define VEC_T __m128
+#define FP_T float
+
+#define ROUND_INTRIN(x, y, mode) _mm_round_ss (x, y, mode)
+
+#include "sse4_1-round-data.h"
+
+static struct data2 data[] = {
+  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+    .value2 = { .f = {  0.00, IGNORED, IGNORED, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+    .value2 = { .f = {  0.25, IGNORED, IGNORED, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+    .value2 = { .f = {  0.50, IGNORED, IGNORED, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+    .value2 = { .f = {  0.75, IGNORED, IGNORED, IGNORED } } },
+
+  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+    .value2 = { .f = {  0x1.fffff8p+21, IGNORED, IGNORED, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+    .value2 = { .f = {  0x1.fffffap+21, IGNORED, IGNORED, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+    .value2 = { .f = {  0x1.fffffcp+21, IGNORED, IGNORED, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+    .value2 = { .f = {  0x1.fffffep+21, IGNORED, IGNORED, IGNORED } } },
+
+  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+    .value2 = { .f = {  0x1.fffffap+22, IGNORED, IGNORED, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+    .value2 = { .f = {  0x1.fffffcp+22, IGNORED, IGNORED, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+    .value2 = { .f = {  0x1.fffffep+22, IGNORED, IGNORED, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+    .value2 = { .f = {  0x1.fffffep+23, IGNORED, IGNORED, IGNORED } } },
+
+  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+    .value2 = { .f = { -0x1.fffffep+23, IGNORED, IGNORED, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+    .value2 = { .f = { -0x1.fffffep+22, IGNORED, IGNORED, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+    .value2 = { .f = { -0x1.fffffcp+22, IGNORED, IGNORED, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+    .value2 = { .f = { -0x1.fffffap+22, IGNORED, IGNORED, IGNORED } } },
+
+  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+    .value2 = { .f = { -0x1.fffffep+21, IGNORED, IGNORED, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+    .value2 = { .f = { -0x1.fffffcp+21, IGNORED, IGNORED, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+    .value2 = { .f = { -0x1.fffffap+21, IGNORED, IGNORED, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+    .value2 = { .f = { -0x1.fffff8p+21, IGNORED, IGNORED, IGNORED } } },
+
+  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+    .value2 = { .f = { -1.00, IGNORED, IGNORED, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+    .value2 = { .f = { -0.75, IGNORED, IGNORED, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+    .value2 = { .f = { -0.50, IGNORED, IGNORED, IGNORED } } },
+  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+    .value2 = { .f = { -0.25, IGNORED, IGNORED, IGNORED } } }
+};
+
+static union value answers_NEAREST_INT[] = {
+  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  1.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+
+  { .f = {  0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0x1.000000p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0x1.000000p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+
+  { .f = {  0x1.fffff8p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0x1.fffffcp+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0x1.000000p+23, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0x1.fffffep+23, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+
+  { .f = { -0x1.fffffep+23, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = { -0x1.000000p+23, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = { -0x1.fffffcp+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = { -0x1.fffff8p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+
+  { .f = { -0x1.000000p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = { -0x1.000000p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = { -0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = { -0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+
+  { .f = { -1.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = { -1.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = { -0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = { -0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } }
+};
+
+static union value answers_NEG_INF[] = {
+  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+
+  { .f = {  0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+
+  { .f = {  0x1.fffff8p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0x1.fffffcp+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0x1.fffffcp+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0x1.fffffep+23, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+
+  { .f = { -0x1.fffffep+23, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = { -0x1.000000p+23, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = { -0x1.fffffcp+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = { -0x1.fffffcp+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+
+  { .f = { -0x1.000000p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = { -0x1.000000p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = { -0x1.000000p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = { -0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+
+  { .f = { -1.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = { -1.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = { -1.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = { -1.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } }
+};
+
+static union value answers_POS_INF[] = {
+  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  1.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  1.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  1.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+
+  { .f = {  0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0x1.000000p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0x1.000000p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0x1.000000p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+
+  { .f = {  0x1.fffffcp+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0x1.fffffcp+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0x1.000000p+23, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0x1.fffffep+23, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+
+  { .f = { -0x1.fffffep+23, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = { -0x1.fffffcp+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = { -0x1.fffffcp+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = { -0x1.fffff8p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+
+  { .f = { -0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = { -0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = { -0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = { -0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+
+  { .f = { -1.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } }
+};
+
+static union value answers_ZERO[] = {
+  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+
+  { .f = {  0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+
+  { .f = {  0x1.fffff8p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0x1.fffffcp+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0x1.fffffcp+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0x1.fffffep+23, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+
+  { .f = { -0x1.fffffep+23, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = { -0x1.fffffcp+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = { -0x1.fffffcp+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = { -0x1.fffff8p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+
+  { .f = { -0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = { -0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = { -0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = { -0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+
+  { .f = { -1.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
+  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } }
+};
+
+union value *answers[] = {
+  answers_NEAREST_INT,
+  answers_NEG_INF,
+  answers_POS_INF,
+  answers_ZERO,
+  0 /* CUR_DIRECTION answers depend on current rounding mode.  */
+};
+
+#include "sse4_1-round3.h"
-- 
2.27.0


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v3 2/6] rs6000: Support SSE4.1 "min" and "max" intrinsics
  2021-08-23 19:03 [PATCH v3 0/6] rs6000: Support more SSE4 intrinsics Paul A. Clarke
  2021-08-23 19:03 ` [PATCH v3 1/6] rs6000: Support SSE4.1 "round" intrinsics Paul A. Clarke
@ 2021-08-23 19:03 ` Paul A. Clarke
  2021-08-27 13:47   ` Bill Schmidt
  2021-10-11 19:28   ` Segher Boessenkool
  2021-08-23 19:03 ` [PATCH v3 3/6] rs6000: Simplify some SSE4.1 "test" intrinsics Paul A. Clarke
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 47+ messages in thread
From: Paul A. Clarke @ 2021-08-23 19:03 UTC (permalink / raw)
  To: gcc-patches; +Cc: segher, wschmidt

Function signatures and decorations match gcc/config/i386/smmintrin.h.

Also, copy tests for _mm_min_epi8, _mm_min_epu16, _mm_min_epi32,
_mm_min_epu32, _mm_max_epi8, _mm_max_epu16, _mm_max_epi32, _mm_max_epu32
from gcc/testsuite/gcc.target/i386.

sse4_1-pmaxsb.c and sse4_1-pminsb.c were modified to use "signed char"
types instead of plain "char" types, because plain "char" is unsigned
by default on powerpc.

2021-08-20  Paul A. Clarke  <pc@us.ibm.com>

gcc
	* config/rs6000/smmintrin.h (_mm_min_epi8, _mm_min_epu16,
	_mm_min_epi32, _mm_min_epu32, _mm_max_epi8, _mm_max_epu16,
	_mm_max_epi32, _mm_max_epu32): New.

gcc/testsuite
	* gcc.target/powerpc/sse4_1-pmaxsb.c: Copy from gcc.target/i386.
	* gcc.target/powerpc/sse4_1-pmaxsd.c: Same.
	* gcc.target/powerpc/sse4_1-pmaxud.c: Same.
	* gcc.target/powerpc/sse4_1-pmaxuw.c: Same.
	* gcc.target/powerpc/sse4_1-pminsb.c: Same.
	* gcc.target/powerpc/sse4_1-pminsd.c: Same.
	* gcc.target/powerpc/sse4_1-pminud.c: Same.
	* gcc.target/powerpc/sse4_1-pminuw.c: Same.
---
v3: No change.
v2:
- Added "extern" to functions to maintain compatible decorations with
  like implementations in gcc/config/i386.
- Removed "-Wno-psabi" from tests as unnecessary, per v1 review.
- Noted testing in patch series cover letter.

 gcc/config/rs6000/smmintrin.h                 | 56 +++++++++++++++++++
 .../gcc.target/powerpc/sse4_1-pmaxsb.c        | 46 +++++++++++++++
 .../gcc.target/powerpc/sse4_1-pmaxsd.c        | 46 +++++++++++++++
 .../gcc.target/powerpc/sse4_1-pmaxud.c        | 47 ++++++++++++++++
 .../gcc.target/powerpc/sse4_1-pmaxuw.c        | 47 ++++++++++++++++
 .../gcc.target/powerpc/sse4_1-pminsb.c        | 46 +++++++++++++++
 .../gcc.target/powerpc/sse4_1-pminsd.c        | 46 +++++++++++++++
 .../gcc.target/powerpc/sse4_1-pminud.c        | 47 ++++++++++++++++
 .../gcc.target/powerpc/sse4_1-pminuw.c        | 47 ++++++++++++++++
 9 files changed, 428 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxsb.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxsd.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxud.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxuw.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pminsb.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pminsd.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pminud.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pminuw.c

diff --git a/gcc/config/rs6000/smmintrin.h b/gcc/config/rs6000/smmintrin.h
index a6b88d313ad0..505fe4ce22a8 100644
--- a/gcc/config/rs6000/smmintrin.h
+++ b/gcc/config/rs6000/smmintrin.h
@@ -408,6 +408,62 @@ _mm_test_mix_ones_zeros (__m128i __A, __m128i __mask)
   return any_ones * any_zeros;
 }
 
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_min_epi8 (__m128i __X, __m128i __Y)
+{
+  return (__m128i) vec_min ((__v16qi)__X, (__v16qi)__Y);
+}
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_min_epu16 (__m128i __X, __m128i __Y)
+{
+  return (__m128i) vec_min ((__v8hu)__X, (__v8hu)__Y);
+}
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_min_epi32 (__m128i __X, __m128i __Y)
+{
+  return (__m128i) vec_min ((__v4si)__X, (__v4si)__Y);
+}
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_min_epu32 (__m128i __X, __m128i __Y)
+{
+  return (__m128i) vec_min ((__v4su)__X, (__v4su)__Y);
+}
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_max_epi8 (__m128i __X, __m128i __Y)
+{
+  return (__m128i) vec_max ((__v16qi)__X, (__v16qi)__Y);
+}
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_max_epu16 (__m128i __X, __m128i __Y)
+{
+  return (__m128i) vec_max ((__v8hu)__X, (__v8hu)__Y);
+}
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_max_epi32 (__m128i __X, __m128i __Y)
+{
+  return (__m128i) vec_max ((__v4si)__X, (__v4si)__Y);
+}
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_max_epu32 (__m128i __X, __m128i __Y)
+{
+  return (__m128i) vec_max ((__v4su)__X, (__v4su)__Y);
+}
+
 /* Return horizontal packed word minimum and its index in bits [15:0]
    and bits [18:16] respectively.  */
 __inline __m128i
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxsb.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxsb.c
new file mode 100644
index 000000000000..7a465b01dd11
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxsb.c
@@ -0,0 +1,46 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 1024
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 16];
+      signed char i[NUM];
+    } dst, src1, src2;
+  int i, sign = 1;
+  signed char max;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src1.i[i] = i * i * sign;
+      src2.i[i] = (i + 20) * sign;
+      sign = -sign;
+    }
+
+  for (i = 0; i < NUM; i += 16)
+    dst.x[i / 16] = _mm_max_epi8 (src1.x[i / 16], src2.x[i / 16]);
+
+  for (i = 0; i < NUM; i++)
+    {
+      max = src1.i[i] <= src2.i[i] ? src2.i[i] : src1.i[i];
+      if (max != dst.i[i])
+	abort ();
+    }
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxsd.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxsd.c
new file mode 100644
index 000000000000..d4947e9dae9a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxsd.c
@@ -0,0 +1,46 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 64
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 4];
+      int i[NUM];
+    } dst, src1, src2;
+  int i, sign = 1;
+  int max;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src1.i[i] = i * i * sign;
+      src2.i[i] = (i + 20) * sign;
+      sign = -sign;
+    }
+
+  for (i = 0; i < NUM; i += 4)
+    dst.x[i / 4] = _mm_max_epi32 (src1.x[i / 4], src2.x[i / 4]);
+
+  for (i = 0; i < NUM; i++)
+    {
+      max = src1.i[i] <= src2.i[i] ? src2.i[i] : src1.i[i];
+      if (max != dst.i[i])
+	abort ();
+    }
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxud.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxud.c
new file mode 100644
index 000000000000..1407ebccacd3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxud.c
@@ -0,0 +1,47 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 64
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 4];
+      unsigned int i[NUM];
+    } dst, src1, src2;
+  int i;
+  unsigned int max;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src1.i[i] = i * i;
+      src2.i[i] = i + 20;
+      if ((i % 4))
+	src2.i[i] |= 0x80000000;
+    }
+
+  for (i = 0; i < NUM; i += 4)
+    dst.x[i / 4] = _mm_max_epu32 (src1.x[i / 4], src2.x[i / 4]);
+
+  for (i = 0; i < NUM; i++)
+    {
+      max = src1.i[i] <= src2.i[i] ? src2.i[i] : src1.i[i];
+      if (max != dst.i[i])
+	abort ();
+    }
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxuw.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxuw.c
new file mode 100644
index 000000000000..73ead0e90683
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxuw.c
@@ -0,0 +1,47 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 64
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 8];
+      unsigned short i[NUM];
+    } dst, src1, src2;
+  int i;
+  unsigned short max;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src1.i[i] = i * i;
+      src2.i[i] = i + 20;
+      if ((i % 8))
+	src2.i[i] |= 0x8000;
+    }
+
+  for (i = 0; i < NUM; i += 8)
+    dst.x[i / 8] = _mm_max_epu16 (src1.x[i / 8], src2.x[i / 8]);
+
+  for (i = 0; i < NUM; i++)
+    {
+      max = src1.i[i] <= src2.i[i] ? src2.i[i] : src1.i[i];
+      if (max != dst.i[i])
+	abort ();
+    }
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pminsb.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pminsb.c
new file mode 100644
index 000000000000..bf491b7d363d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pminsb.c
@@ -0,0 +1,46 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 1024
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 16];
+      signed char i[NUM];
+    } dst, src1, src2;
+  int i, sign = 1;
+  signed char min;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src1.i[i] = i * i * sign;
+      src2.i[i] = (i + 20) * sign;
+      sign = -sign;
+    }
+
+  for (i = 0; i < NUM; i += 16)
+    dst.x[i / 16] = _mm_min_epi8 (src1.x[i / 16], src2.x[i / 16]);
+
+  for (i = 0; i < NUM; i++)
+    {
+      min = src1.i[i] >= src2.i[i] ? src2.i[i] : src1.i[i];
+      if (min != dst.i[i])
+	abort ();
+    }
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pminsd.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pminsd.c
new file mode 100644
index 000000000000..6cb27556a3b0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pminsd.c
@@ -0,0 +1,46 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 64
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 4];
+      int i[NUM];
+    } dst, src1, src2;
+  int i, sign = 1;
+  int min;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src1.i[i] = i * i * sign;
+      src2.i[i] = (i + 20) * sign;
+      sign = -sign;
+    }
+
+  for (i = 0; i < NUM; i += 4)
+    dst.x[i / 4] = _mm_min_epi32 (src1.x[i / 4], src2.x[i / 4]);
+
+  for (i = 0; i < NUM; i++)
+    {
+      min = src1.i[i] >= src2.i[i] ? src2.i[i] : src1.i[i];
+      if (min != dst.i[i])
+	abort ();
+    }
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pminud.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pminud.c
new file mode 100644
index 000000000000..afda4b906599
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pminud.c
@@ -0,0 +1,47 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 64
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 4];
+      unsigned int i[NUM];
+    } dst, src1, src2;
+  int i;
+  unsigned int min;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src1.i[i] = i * i;
+      src2.i[i] = i + 20;
+      if ((i % 4))
+	src2.i[i] |= 0x80000000;
+    }
+
+  for (i = 0; i < NUM; i += 4)
+    dst.x[i / 4] = _mm_min_epu32 (src1.x[i / 4], src2.x[i / 4]);
+
+  for (i = 0; i < NUM; i++)
+    {
+      min = src1.i[i] >= src2.i[i] ? src2.i[i] : src1.i[i];
+      if (min != dst.i[i])
+	abort ();
+    }
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pminuw.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pminuw.c
new file mode 100644
index 000000000000..25cc115285c6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pminuw.c
@@ -0,0 +1,47 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 64
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 8];
+      unsigned short i[NUM];
+    } dst, src1, src2;
+  int i;
+  unsigned short min;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src1.i[i] = i * i;
+      src2.i[i] = i + 20;
+      if ((i % 8))
+	src2.i[i] |= 0x8000;
+    }
+
+  for (i = 0; i < NUM; i += 8)
+    dst.x[i / 8] = _mm_min_epu16 (src1.x[i / 8], src2.x[i / 8]);
+
+  for (i = 0; i < NUM; i++)
+    {
+      min = src1.i[i] >= src2.i[i] ? src2.i[i] : src1.i[i];
+      if (min != dst.i[i])
+	abort ();
+    }
+}
-- 
2.27.0


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v3 3/6] rs6000: Simplify some SSE4.1 "test" intrinsics
  2021-08-23 19:03 [PATCH v3 0/6] rs6000: Support more SSE4 intrinsics Paul A. Clarke
  2021-08-23 19:03 ` [PATCH v3 1/6] rs6000: Support SSE4.1 "round" intrinsics Paul A. Clarke
  2021-08-23 19:03 ` [PATCH v3 2/6] rs6000: Support SSE4.1 "min" and "max" intrinsics Paul A. Clarke
@ 2021-08-23 19:03 ` Paul A. Clarke
  2021-08-27 13:48   ` Bill Schmidt
  2021-10-11 20:50   ` Segher Boessenkool
  2021-08-23 19:03 ` [PATCH v3 4/6] rs6000: Support SSE4.1 "cvt" intrinsics Paul A. Clarke
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 47+ messages in thread
From: Paul A. Clarke @ 2021-08-23 19:03 UTC (permalink / raw)
  To: gcc-patches; +Cc: segher, wschmidt

Copy some simple redirections from i386 <smmintrin.h>, for:
- _mm_test_all_zeros
- _mm_test_all_ones
- _mm_test_mix_ones_zeros

2021-08-20  Paul A. Clarke  <pc@us.ibm.com>

gcc
	* config/rs6000/smmintrin.h (_mm_test_all_zeros,
	_mm_test_all_ones, _mm_test_mix_ones_zeros): Replace.
---
v3: No change.
v2:
- Removed "-Wno-psabi" from tests as unnecessary, per v1 review.
- Noted testing in patch series cover letter.

 gcc/config/rs6000/smmintrin.h | 30 ++++--------------------------
 1 file changed, 4 insertions(+), 26 deletions(-)

diff --git a/gcc/config/rs6000/smmintrin.h b/gcc/config/rs6000/smmintrin.h
index 505fe4ce22a8..363534cb06a2 100644
--- a/gcc/config/rs6000/smmintrin.h
+++ b/gcc/config/rs6000/smmintrin.h
@@ -379,34 +379,12 @@ _mm_testnzc_si128 (__m128i __A, __m128i __B)
   return _mm_testz_si128 (__A, __B) == 0 && _mm_testc_si128 (__A, __B) == 0;
 }
 
-__inline int
-__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
-_mm_test_all_zeros (__m128i __A, __m128i __mask)
-{
-  const __v16qu __zero = {0};
-  return vec_all_eq (vec_and ((__v16qu) __A, (__v16qu) __mask), __zero);
-}
+#define _mm_test_all_zeros(M, V) _mm_testz_si128 ((M), (V))
 
-__inline int
-__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
-_mm_test_all_ones (__m128i __A)
-{
-  const __v16qu __ones = vec_splats ((unsigned char) 0xff);
-  return vec_all_eq ((__v16qu) __A, __ones);
-}
+#define _mm_test_all_ones(V) \
+  _mm_testc_si128 ((V), _mm_cmpeq_epi32 ((V), (V)))
 
-__inline int
-__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
-_mm_test_mix_ones_zeros (__m128i __A, __m128i __mask)
-{
-  const __v16qu __zero = {0};
-  const __v16qu __Amasked = vec_and ((__v16qu) __A, (__v16qu) __mask);
-  const int any_ones = vec_any_ne (__Amasked, __zero);
-  const __v16qu __notA = vec_nor ((__v16qu) __A, (__v16qu) __A);
-  const __v16qu __notAmasked = vec_and ((__v16qu) __notA, (__v16qu) __mask);
-  const int any_zeros = vec_any_ne (__notAmasked, __zero);
-  return any_ones * any_zeros;
-}
+#define _mm_test_mix_ones_zeros(M, V) _mm_testnzc_si128 ((M), (V))
 
 extern __inline __m128i
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
-- 
2.27.0



* [PATCH v3 4/6] rs6000: Support SSE4.1 "cvt" intrinsics
  2021-08-23 19:03 [PATCH v3 0/6] rs6000: Support more SSE4 intrinsics Paul A. Clarke
                   ` (2 preceding siblings ...)
  2021-08-23 19:03 ` [PATCH v3 3/6] rs6000: Simplify some SSE4.1 "test" intrinsics Paul A. Clarke
@ 2021-08-23 19:03 ` Paul A. Clarke
  2021-08-27 13:49   ` Bill Schmidt
  2021-10-11 21:52   ` Segher Boessenkool
  2021-08-23 19:03 ` [PATCH v3 5/6] rs6000: Support more SSE4 "cmp", "mul", "pack" intrinsics Paul A. Clarke
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 47+ messages in thread
From: Paul A. Clarke @ 2021-08-23 19:03 UTC (permalink / raw)
  To: gcc-patches; +Cc: segher, wschmidt

Function signatures and decorations match gcc/config/i386/smmintrin.h.

Also, copy tests for:
- _mm_cvtepi8_epi16, _mm_cvtepi8_epi32, _mm_cvtepi8_epi64
- _mm_cvtepi16_epi32, _mm_cvtepi16_epi64
- _mm_cvtepi32_epi64,
- _mm_cvtepu8_epi16, _mm_cvtepu8_epi32, _mm_cvtepu8_epi64
- _mm_cvtepu16_epi32, _mm_cvtepu16_epi64
- _mm_cvtepu32_epi64

from gcc/testsuite/gcc.target/i386.

sse4_1-pmovsxbd.c, sse4_1-pmovsxbq.c, and sse4_1-pmovsxbw.c were
modified to use "signed char" instead of plain "char", because plain
"char" is unsigned by default on powerpc.

2021-08-20  Paul A. Clarke  <pc@us.ibm.com>

gcc
	* config/rs6000/smmintrin.h (_mm_cvtepi8_epi16, _mm_cvtepi8_epi32,
	_mm_cvtepi8_epi64, _mm_cvtepi16_epi32, _mm_cvtepi16_epi64,
	_mm_cvtepi32_epi64, _mm_cvtepu8_epi16, _mm_cvtepu8_epi32,
	_mm_cvtepu8_epi64, _mm_cvtepu16_epi32, _mm_cvtepu16_epi64,
	_mm_cvtepu32_epi64): New.

gcc/testsuite
	* gcc.target/powerpc/sse4_1-pmovsxbd.c: Copy from gcc.target/i386,
	adjust dg directives to suit.
	* gcc.target/powerpc/sse4_1-pmovsxbq.c: Same.
	* gcc.target/powerpc/sse4_1-pmovsxbw.c: Same.
	* gcc.target/powerpc/sse4_1-pmovsxdq.c: Same.
	* gcc.target/powerpc/sse4_1-pmovsxwd.c: Same.
	* gcc.target/powerpc/sse4_1-pmovsxwq.c: Same.
	* gcc.target/powerpc/sse4_1-pmovzxbd.c: Same.
	* gcc.target/powerpc/sse4_1-pmovzxbq.c: Same.
	* gcc.target/powerpc/sse4_1-pmovzxbw.c: Same.
	* gcc.target/powerpc/sse4_1-pmovzxdq.c: Same.
	* gcc.target/powerpc/sse4_1-pmovzxwd.c: Same.
	* gcc.target/powerpc/sse4_1-pmovzxwq.c: Same.
---
v3: No change.
v2:
- Added "extern" to functions to maintain compatible decorations with
  like implementations in gcc/config/i386.
- Removed "-Wno-psabi" from tests as unnecessary, per v1 review.
- Noted testing in patch series cover letter.

 gcc/config/rs6000/smmintrin.h                 | 138 ++++++++++++++++++
 .../gcc.target/powerpc/sse4_1-pmovsxbd.c      |  42 ++++++
 .../gcc.target/powerpc/sse4_1-pmovsxbq.c      |  42 ++++++
 .../gcc.target/powerpc/sse4_1-pmovsxbw.c      |  42 ++++++
 .../gcc.target/powerpc/sse4_1-pmovsxdq.c      |  42 ++++++
 .../gcc.target/powerpc/sse4_1-pmovsxwd.c      |  42 ++++++
 .../gcc.target/powerpc/sse4_1-pmovsxwq.c      |  42 ++++++
 .../gcc.target/powerpc/sse4_1-pmovzxbd.c      |  43 ++++++
 .../gcc.target/powerpc/sse4_1-pmovzxbq.c      |  43 ++++++
 .../gcc.target/powerpc/sse4_1-pmovzxbw.c      |  43 ++++++
 .../gcc.target/powerpc/sse4_1-pmovzxdq.c      |  43 ++++++
 .../gcc.target/powerpc/sse4_1-pmovzxwd.c      |  43 ++++++
 .../gcc.target/powerpc/sse4_1-pmovzxwq.c      |  43 ++++++
 13 files changed, 648 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbd.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbq.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbw.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxdq.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxwd.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxwq.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbd.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbq.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbw.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxdq.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxwd.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxwq.c

diff --git a/gcc/config/rs6000/smmintrin.h b/gcc/config/rs6000/smmintrin.h
index 363534cb06a2..fdef6674d16c 100644
--- a/gcc/config/rs6000/smmintrin.h
+++ b/gcc/config/rs6000/smmintrin.h
@@ -442,6 +442,144 @@ _mm_max_epu32 (__m128i __X, __m128i __Y)
   return (__m128i) vec_max ((__v4su)__X, (__v4su)__Y);
 }
 
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cvtepi8_epi16 (__m128i __A)
+{
+  return (__m128i) vec_unpackh ((__v16qi)__A);
+}
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cvtepi8_epi32 (__m128i __A)
+{
+  __A = (__m128i) vec_unpackh ((__v16qi)__A);
+  return (__m128i) vec_unpackh ((__v8hi)__A);
+}
+
+#ifdef _ARCH_PWR8
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cvtepi8_epi64 (__m128i __A)
+{
+  __A = (__m128i) vec_unpackh ((__v16qi)__A);
+  __A = (__m128i) vec_unpackh ((__v8hi)__A);
+  return (__m128i) vec_unpackh ((__v4si)__A);
+}
+#endif
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cvtepi16_epi32 (__m128i __A)
+{
+  return (__m128i) vec_unpackh ((__v8hi)__A);
+}
+
+#ifdef _ARCH_PWR8
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cvtepi16_epi64 (__m128i __A)
+{
+  __A = (__m128i) vec_unpackh ((__v8hi)__A);
+  return (__m128i) vec_unpackh ((__v4si)__A);
+}
+#endif
+
+#ifdef _ARCH_PWR8
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cvtepi32_epi64 (__m128i __A)
+{
+  return (__m128i) vec_unpackh ((__v4si)__A);
+}
+#endif
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cvtepu8_epi16 (__m128i __A)
+{
+  const __v16qu __zero = {0};
+#ifdef __LITTLE_ENDIAN__
+  __A = (__m128i) vec_mergeh ((__v16qu)__A, __zero);
+#else /* __BIG_ENDIAN__.  */
+  __A = (__m128i) vec_mergeh (__zero, (__v16qu)__A);
+#endif /* __BIG_ENDIAN__.  */
+  return __A;
+}
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cvtepu8_epi32 (__m128i __A)
+{
+  const __v16qu __zero = {0};
+#ifdef __LITTLE_ENDIAN__
+  __A = (__m128i) vec_mergeh ((__v16qu)__A, __zero);
+  __A = (__m128i) vec_mergeh ((__v8hu)__A, (__v8hu)__zero);
+#else /* __BIG_ENDIAN__.  */
+  __A = (__m128i) vec_mergeh (__zero, (__v16qu)__A);
+  __A = (__m128i) vec_mergeh ((__v8hu)__zero, (__v8hu)__A);
+#endif /* __BIG_ENDIAN__.  */
+  return __A;
+}
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cvtepu8_epi64 (__m128i __A)
+{
+  const __v16qu __zero = {0};
+#ifdef __LITTLE_ENDIAN__
+  __A = (__m128i) vec_mergeh ((__v16qu)__A, __zero);
+  __A = (__m128i) vec_mergeh ((__v8hu)__A, (__v8hu)__zero);
+  __A = (__m128i) vec_mergeh ((__v4su)__A, (__v4su)__zero);
+#else /* __BIG_ENDIAN__.  */
+  __A = (__m128i) vec_mergeh (__zero, (__v16qu)__A);
+  __A = (__m128i) vec_mergeh ((__v8hu)__zero, (__v8hu)__A);
+  __A = (__m128i) vec_mergeh ((__v4su)__zero, (__v4su)__A);
+#endif /* __BIG_ENDIAN__.  */
+  return __A;
+}
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cvtepu16_epi32 (__m128i __A)
+{
+  const __v8hu __zero = {0};
+#ifdef __LITTLE_ENDIAN__
+  __A = (__m128i) vec_mergeh ((__v8hu)__A, __zero);
+#else /* __BIG_ENDIAN__.  */
+  __A = (__m128i) vec_mergeh (__zero, (__v8hu)__A);
+#endif /* __BIG_ENDIAN__.  */
+  return __A;
+}
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cvtepu16_epi64 (__m128i __A)
+{
+  const __v8hu __zero = {0};
+#ifdef __LITTLE_ENDIAN__
+  __A = (__m128i) vec_mergeh ((__v8hu)__A, __zero);
+  __A = (__m128i) vec_mergeh ((__v4su)__A, (__v4su)__zero);
+#else /* __BIG_ENDIAN__.  */
+  __A = (__m128i) vec_mergeh (__zero, (__v8hu)__A);
+  __A = (__m128i) vec_mergeh ((__v4su)__zero, (__v4su)__A);
+#endif /* __BIG_ENDIAN__.  */
+  return __A;
+}
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cvtepu32_epi64 (__m128i __A)
+{
+  const __v4su __zero = {0};
+#ifdef __LITTLE_ENDIAN__
+  __A = (__m128i) vec_mergeh ((__v4su)__A, __zero);
+#else /* __BIG_ENDIAN__.  */
+  __A = (__m128i) vec_mergeh (__zero, (__v4su)__A);
+#endif /* __BIG_ENDIAN__.  */
+  return __A;
+}
+
 /* Return horizontal packed word minimum and its index in bits [15:0]
    and bits [18:16] respectively.  */
 __inline __m128i
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbd.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbd.c
new file mode 100644
index 000000000000..553c8dd84505
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbd.c
@@ -0,0 +1,42 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 128
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 4];
+      int i[NUM];
+      signed char c[NUM * 4];
+    } dst, src;
+  int i, sign = 1;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src.c[(i % 4) + (i / 4) * 16] = i * i * sign;
+      sign = -sign;
+    }
+
+  for (i = 0; i < NUM; i += 4)
+    dst.x [i / 4] = _mm_cvtepi8_epi32 (src.x [i / 4]);
+
+  for (i = 0; i < NUM; i++)
+    if (src.c[(i % 4) + (i / 4) * 16] != dst.i[i])
+      abort ();
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbq.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbq.c
new file mode 100644
index 000000000000..9ec1ab7a4169
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbq.c
@@ -0,0 +1,42 @@
+/* { dg-do run } */
+/* { dg-require-effective-target p8vector_hw } */
+/* { dg-options "-O2 -mpower8-vector" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 128
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 2];
+      long long ll[NUM];
+      signed char c[NUM * 8];
+    } dst, src;
+  int i, sign = 1;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src.c[(i % 2) + (i / 2) * 16] = i * i * sign;
+      sign = -sign;
+    }
+
+  for (i = 0; i < NUM; i += 2)
+    dst.x [i / 2] = _mm_cvtepi8_epi64 (src.x [i / 2]);
+
+  for (i = 0; i < NUM; i++)
+    if (src.c[(i % 2) + (i / 2) * 16] != dst.ll[i])
+      abort ();
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbw.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbw.c
new file mode 100644
index 000000000000..be4cf417ca7e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbw.c
@@ -0,0 +1,42 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 128
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 8];
+      short s[NUM];
+      signed char c[NUM * 2];
+    } dst, src;
+  int i, sign = 1;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src.c[(i % 8) + (i / 8) * 16] = i * i * sign;
+      sign = -sign;
+    }
+
+  for (i = 0; i < NUM; i += 8)
+    dst.x [i / 8] = _mm_cvtepi8_epi16 (src.x [i / 8]);
+
+  for (i = 0; i < NUM; i++)
+    if (src.c[(i % 8) + (i / 8) * 16] != dst.s[i])
+      abort ();
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxdq.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxdq.c
new file mode 100644
index 000000000000..1c263782240a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxdq.c
@@ -0,0 +1,42 @@
+/* { dg-do run } */
+/* { dg-require-effective-target p8vector_hw } */
+/* { dg-options "-O2 -mpower8-vector" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 128
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 2];
+      long long ll[NUM];
+      int i[NUM * 2];
+    } dst, src;
+  int i, sign = 1;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src.i[(i % 2) + (i / 2) * 4] = i * i * sign;
+      sign = -sign;
+    }
+
+  for (i = 0; i < NUM; i += 2)
+    dst.x [i / 2] = _mm_cvtepi32_epi64 (src.x [i / 2]);
+
+  for (i = 0; i < NUM; i++)
+    if (src.i[(i % 2) + (i / 2) * 4] != dst.ll[i])
+      abort ();
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxwd.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxwd.c
new file mode 100644
index 000000000000..f0f31aba44ba
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxwd.c
@@ -0,0 +1,42 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 128
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 4];
+      int i[NUM];
+      short s[NUM * 2];
+    } dst, src;
+  int i, sign = 1;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src.s[(i % 4) + (i / 4) * 8] = i * i * sign;
+      sign = -sign;
+    }
+
+  for (i = 0; i < NUM; i += 4)
+    dst.x [i / 4] = _mm_cvtepi16_epi32 (src.x [i / 4]);
+
+  for (i = 0; i < NUM; i++)
+    if (src.s[(i % 4) + (i / 4) * 8] != dst.i[i])
+      abort ();
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxwq.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxwq.c
new file mode 100644
index 000000000000..67864695a113
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxwq.c
@@ -0,0 +1,42 @@
+/* { dg-do run } */
+/* { dg-require-effective-target p8vector_hw } */
+/* { dg-options "-O2 -mpower8-vector" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 128
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 2];
+      long long ll[NUM];
+      short s[NUM * 4];
+    } dst, src;
+  int i, sign = 1;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src.s[(i % 2) + (i / 2) * 8] = i * i * sign;
+      sign = -sign;
+    }
+
+  for (i = 0; i < NUM; i += 2)
+    dst.x [i / 2] = _mm_cvtepi16_epi64 (src.x [i / 2]);
+
+  for (i = 0; i < NUM; i++)
+    if (src.s[(i % 2) + (i / 2) * 8] != dst.ll[i])
+      abort ();
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbd.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbd.c
new file mode 100644
index 000000000000..098ef6a49cb0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbd.c
@@ -0,0 +1,43 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 128
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 4];
+      unsigned int i[NUM];
+      unsigned char c[NUM * 4];
+    } dst, src;
+  int i;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src.c[(i % 4) + (i / 4) * 16] = i * i;
+      if ((i % 4))
+	src.c[(i % 4) + (i / 4) * 16] |= 0x80;
+    }
+
+  for (i = 0; i < NUM; i += 4)
+    dst.x [i / 4] = _mm_cvtepu8_epi32 (src.x [i / 4]);
+
+  for (i = 0; i < NUM; i++)
+    if (src.c[(i % 4) + (i / 4) * 16] != dst.i[i])
+      abort ();
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbq.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbq.c
new file mode 100644
index 000000000000..7b862767436e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbq.c
@@ -0,0 +1,43 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 128
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 2];
+      unsigned long long ll[NUM];
+      unsigned char c[NUM * 8];
+    } dst, src;
+  int i;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src.c[(i % 2) + (i / 2) * 16] = i * i;
+      if ((i % 2))
+	src.c[(i % 2) + (i / 2) * 16] |= 0x80;
+    }
+
+  for (i = 0; i < NUM; i += 2)
+    dst.x [i / 2] = _mm_cvtepu8_epi64 (src.x [i / 2]);
+
+  for (i = 0; i < NUM; i++)
+    if (src.c[(i % 2) + (i / 2) * 16] != dst.ll[i])
+      abort ();
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbw.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbw.c
new file mode 100644
index 000000000000..9fdbec342d46
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbw.c
@@ -0,0 +1,43 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 128
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 8];
+      unsigned short s[NUM];
+      unsigned char c[NUM * 2];
+    } dst, src;
+  int i;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src.c[(i % 8) + (i / 8) * 16] = i * i;
+      if ((i % 4))
+	src.c[(i % 8) + (i / 8) * 16] |= 0x80;
+    }
+
+  for (i = 0; i < NUM; i += 8)
+    dst.x [i / 8] = _mm_cvtepu8_epi16 (src.x [i / 8]);
+
+  for (i = 0; i < NUM; i++)
+    if (src.c[(i % 8) + (i / 8) * 16] != dst.s[i])
+      abort ();
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxdq.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxdq.c
new file mode 100644
index 000000000000..7a5e7688d9f5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxdq.c
@@ -0,0 +1,43 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 128
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 2];
+      unsigned long long ll[NUM];
+      unsigned int i[NUM * 2];
+    } dst, src;
+  int i;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src.i[(i % 2) + (i / 2) * 4] = i * i;
+      if ((i % 2))
+        src.i[(i % 2) + (i / 2) * 4] |= 0x80000000;
+    }
+
+  for (i = 0; i < NUM; i += 2)
+    dst.x [i / 2] = _mm_cvtepu32_epi64 (src.x [i / 2]);
+
+  for (i = 0; i < NUM; i++)
+    if (src.i[(i % 2) + (i / 2) * 4] != dst.ll[i])
+      abort ();
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxwd.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxwd.c
new file mode 100644
index 000000000000..078a5a45d909
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxwd.c
@@ -0,0 +1,43 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 128
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 4];
+      unsigned int i[NUM];
+      unsigned short s[NUM * 2];
+    } dst, src;
+  int i;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src.s[(i % 4) + (i / 4) * 8] = i * i;
+      if ((i % 4))
+	src.s[(i % 4) + (i / 4) * 8] |= 0x8000;
+    }
+
+  for (i = 0; i < NUM; i += 4)
+    dst.x [i / 4] = _mm_cvtepu16_epi32 (src.x [i / 4]);
+
+  for (i = 0; i < NUM; i++)
+    if (src.s[(i % 4) + (i / 4) * 8] != dst.i[i])
+      abort ();
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxwq.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxwq.c
new file mode 100644
index 000000000000..120d00290faa
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxwq.c
@@ -0,0 +1,43 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 128
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 2];
+      unsigned long long ll[NUM];
+      unsigned short s[NUM * 4];
+    } dst, src;
+  int i;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src.s[(i % 2) + (i / 2) * 8] = i * i;
+      if ((i % 2))
+	src.s[(i % 2) + (i / 2) * 8] |= 0x8000;
+    }
+
+  for (i = 0; i < NUM; i += 2)
+    dst.x [i / 2] = _mm_cvtepu16_epi64 (src.x [i / 2]);
+
+  for (i = 0; i < NUM; i++)
+    if (src.s[(i % 2) + (i / 2) * 8] != dst.ll[i])
+      abort ();
+}
-- 
2.27.0



* [PATCH v3 5/6] rs6000: Support more SSE4 "cmp", "mul", "pack" intrinsics
  2021-08-23 19:03 [PATCH v3 0/6] rs6000: Support more SSE4 intrinsics Paul A. Clarke
                   ` (3 preceding siblings ...)
  2021-08-23 19:03 ` [PATCH v3 4/6] rs6000: Support SSE4.1 "cvt" intrinsics Paul A. Clarke
@ 2021-08-23 19:03 ` Paul A. Clarke
  2021-08-27 15:21   ` Bill Schmidt
  2021-10-11 23:07   ` Segher Boessenkool
  2021-08-23 19:03 ` [PATCH v3 6/6] rs6000: Guard some x86 intrinsics implementations Paul A. Clarke
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 47+ messages in thread
From: Paul A. Clarke @ 2021-08-23 19:03 UTC (permalink / raw)
  To: gcc-patches; +Cc: segher, wschmidt

Function signatures and decorations match gcc/config/i386/smmintrin.h.

Also, copy tests for:
- _mm_cmpeq_epi64
- _mm_mullo_epi32, _mm_mul_epi32
- _mm_packus_epi32
- _mm_cmpgt_epi64 (SSE4.2)

from gcc/testsuite/gcc.target/i386.

2021-08-23  Paul A. Clarke  <pc@us.ibm.com>

gcc
	* config/rs6000/smmintrin.h (_mm_cmpeq_epi64, _mm_cmpgt_epi64,
	_mm_mullo_epi32, _mm_mul_epi32, _mm_packus_epi32): New.
	* config/rs6000/nmmintrin.h: Copy from i386, tweak to suit.

gcc/testsuite
	* gcc.target/powerpc/pr78102.c: Copy from gcc.target/i386,
	adjust dg directives to suit.
	* gcc.target/powerpc/sse4_1-packusdw.c: Same.
	* gcc.target/powerpc/sse4_1-pcmpeqq.c: Same.
	* gcc.target/powerpc/sse4_1-pmuldq.c: Same.
	* gcc.target/powerpc/sse4_1-pmulld.c: Same.
	* gcc.target/powerpc/sse4_2-pcmpgtq.c: Same.
	* gcc.target/powerpc/sse4_2-check.h: Copy from gcc.target/i386,
	tweak to suit.
---
v3:
- Add nmmintrin.h. _mm_cmpgt_epi64 is part of SSE4.2, which is
  ostensibly defined in nmmintrin.h. Following the i386 implementation,
  however, nmmintrin.h only includes smmintrin.h, and the actual
  implementations appear there.
- Add sse4_2-check.h, required by sse4_2-pcmpgtq.c. My testing was
  obviously inadequate.
v2:
- Added "extern" to functions to maintain compatible decorations with
  like implementations in gcc/config/i386.
- Removed "-Wno-psabi" from tests as unnecessary, per v1 review.
- Noted testing in patch series cover letter.

 gcc/config/rs6000/nmmintrin.h                 | 40 ++++++++++
 gcc/config/rs6000/smmintrin.h                 | 41 +++++++++++
 gcc/testsuite/gcc.target/powerpc/pr78102.c    | 23 ++++++
 .../gcc.target/powerpc/sse4_1-packusdw.c      | 73 +++++++++++++++++++
 .../gcc.target/powerpc/sse4_1-pcmpeqq.c       | 46 ++++++++++++
 .../gcc.target/powerpc/sse4_1-pmuldq.c        | 51 +++++++++++++
 .../gcc.target/powerpc/sse4_1-pmulld.c        | 46 ++++++++++++
 .../gcc.target/powerpc/sse4_2-check.h         | 18 +++++
 .../gcc.target/powerpc/sse4_2-pcmpgtq.c       | 46 ++++++++++++
 9 files changed, 384 insertions(+)
 create mode 100644 gcc/config/rs6000/nmmintrin.h
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr78102.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-packusdw.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pcmpeqq.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmuldq.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmulld.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_2-check.h
 create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_2-pcmpgtq.c

diff --git a/gcc/config/rs6000/nmmintrin.h b/gcc/config/rs6000/nmmintrin.h
new file mode 100644
index 000000000000..20a70bee3776
--- /dev/null
+++ b/gcc/config/rs6000/nmmintrin.h
@@ -0,0 +1,40 @@
+/* Copyright (C) 2021 Free Software Foundation, Inc.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   GCC is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef NO_WARN_X86_INTRINSICS
+/* This header is distributed to simplify porting x86_64 code that
+   makes explicit use of Intel intrinsics to powerpc64le.
+   It is the user's responsibility to determine if the results are
+   acceptable and make additional changes as necessary.
+   Note that much code that uses Intel intrinsics can be rewritten in
+   standard C or GNU C extensions, which are more portable and better
+   optimized across multiple targets.  */
+#endif
+
+#ifndef _NMMINTRIN_H_INCLUDED
+#define _NMMINTRIN_H_INCLUDED
+
+/* We just include SSE4.1 header file.  */
+#include <smmintrin.h>
+
+#endif /* _NMMINTRIN_H_INCLUDED */
diff --git a/gcc/config/rs6000/smmintrin.h b/gcc/config/rs6000/smmintrin.h
index fdef6674d16c..c04d2bb5b6d3 100644
--- a/gcc/config/rs6000/smmintrin.h
+++ b/gcc/config/rs6000/smmintrin.h
@@ -386,6 +386,15 @@ _mm_testnzc_si128 (__m128i __A, __m128i __B)
 
 #define _mm_test_mix_ones_zeros(M, V) _mm_testnzc_si128 ((M), (V))
 
+#ifdef _ARCH_PWR8
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cmpeq_epi64 (__m128i __X, __m128i __Y)
+{
+  return (__m128i) vec_cmpeq ((__v2di)__X, (__v2di)__Y);
+}
+#endif
+
 extern __inline __m128i
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm_min_epi8 (__m128i __X, __m128i __Y)
@@ -444,6 +453,22 @@ _mm_max_epu32 (__m128i __X, __m128i __Y)
 
 extern __inline __m128i
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_mullo_epi32 (__m128i __X, __m128i __Y)
+{
+  return (__m128i) vec_mul ((__v4su)__X, (__v4su)__Y);
+}
+
+#ifdef _ARCH_PWR8
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_mul_epi32 (__m128i __X, __m128i __Y)
+{
+  return (__m128i) vec_mule ((__v4si)__X, (__v4si)__Y);
+}
+#endif
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm_cvtepi8_epi16 (__m128i __A)
 {
   return (__m128i) vec_unpackh ((__v16qi)__A);
@@ -607,4 +632,20 @@ _mm_minpos_epu16 (__m128i __A)
   return __r.__m;
 }
 
+__inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_packus_epi32 (__m128i __X, __m128i __Y)
+{
+  return (__m128i) vec_packsu ((__v4si)__X, (__v4si)__Y);
+}
+
+#ifdef _ARCH_PWR8
+__inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cmpgt_epi64 (__m128i __X, __m128i __Y)
+{
+  return (__m128i) vec_cmpgt ((__v2di)__X, (__v2di)__Y);
+}
+#endif
+
 #endif
diff --git a/gcc/testsuite/gcc.target/powerpc/pr78102.c b/gcc/testsuite/gcc.target/powerpc/pr78102.c
new file mode 100644
index 000000000000..56a2d497bbff
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr78102.c
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mvsx" } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+
+#include <x86intrin.h>
+
+__m128i
+foo (const __m128i x, const __m128i y)
+{
+  return _mm_cmpeq_epi64 (x, y);
+}
+
+__v2di
+bar (const __v2di x, const __v2di y)
+{
+  return x == y;
+}
+
+__v2di
+baz (const __v2di x, const __v2di y)
+{
+  return x != y;
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-packusdw.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-packusdw.c
new file mode 100644
index 000000000000..15b8ca418f54
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-packusdw.c
@@ -0,0 +1,73 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -mvsx" } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 64
+
+static unsigned short
+int_to_ushort (int iVal)
+{
+  unsigned short sVal;
+
+  if (iVal < 0)
+    sVal = 0;
+  else if (iVal > 0xffff)
+    sVal = 0xffff;
+  else sVal = iVal;
+
+  return sVal;
+}
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 4];
+      int i[NUM];
+    } src1, src2;
+  union
+    {
+      __m128i x[NUM / 4];
+      unsigned short s[NUM * 2];
+    } dst;
+  int i, sign = 1;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src1.i[i] = i * i * sign;
+      src2.i[i] = (i + 20) * sign;
+      sign = -sign;
+    }
+
+  for (i = 0; i < NUM; i += 4)
+    dst.x[i / 4] = _mm_packus_epi32 (src1.x [i / 4], src2.x [i / 4]);
+
+  for (i = 0; i < NUM; i ++)
+    {
+      int dstIndex;
+      unsigned short sVal;
+
+      sVal = int_to_ushort (src1.i[i]);
+      dstIndex = (i % 4) + (i / 4) * 8;
+      if (sVal != dst.s[dstIndex])
+	abort ();
+
+      sVal = int_to_ushort (src2.i[i]);
+      dstIndex += 4;
+      if (sVal != dst.s[dstIndex])
+	abort ();
+    }
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pcmpeqq.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pcmpeqq.c
new file mode 100644
index 000000000000..39b9f01d64a4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pcmpeqq.c
@@ -0,0 +1,46 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -mpower8-vector" } */
+/* { dg-require-effective-target p8vector_hw } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 64
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 2];
+      long long ll[NUM];
+    } dst, src1, src2;
+  int i, sign = 1;
+  long long is_eq;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src1.ll[i] = i * i * sign;
+      src2.ll[i] = (i + 20) * sign;
+      sign = -sign;
+    }
+
+  for (i = 0; i < NUM; i += 2)
+    dst.x[i / 2] = _mm_cmpeq_epi64 (src1.x[i / 2], src2.x[i / 2]);
+
+  for (i = 0; i < NUM; i++)
+    {
+      is_eq = src1.ll[i] == src2.ll[i] ? 0xffffffffffffffffLL : 0LL;
+      if (is_eq != dst.ll[i])
+	abort ();
+    }
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmuldq.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmuldq.c
new file mode 100644
index 000000000000..6a884f46235f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmuldq.c
@@ -0,0 +1,51 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -mpower8-vector" } */
+/* { dg-require-effective-target p8vector_hw } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 64
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 2];
+      long long ll[NUM];
+    } dst;
+  union
+    {
+      __m128i x[NUM / 2];
+      int i[NUM * 2];
+    } src1, src2;
+  int i, sign = 1;
+  long long value;
+
+  for (i = 0; i < NUM * 2; i += 2)
+    {
+      src1.i[i] = i * i * sign;
+      src2.i[i] = (i + 20) * sign;
+      sign = -sign;
+    }
+
+  for (i = 0; i < NUM; i += 2)
+    dst.x[i / 2] = _mm_mul_epi32 (src1.x[i / 2], src2.x[i / 2]);
+
+  for (i = 0; i < NUM; i++)
+    {
+      value = (long long) src1.i[i * 2] * (long long) src2.i[i * 2];
+      if (value != dst.ll[i])
+	abort ();
+    }
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmulld.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmulld.c
new file mode 100644
index 000000000000..150832915911
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmulld.c
@@ -0,0 +1,46 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -mvsx" } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 64
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 4];
+      int i[NUM];
+    } dst, src1, src2;
+  int i, sign = 1;
+  int value;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src1.i[i] = i * i * sign;
+      src2.i[i] = (i + 20) * sign;
+      sign = -sign;
+    }
+
+  for (i = 0; i < NUM; i += 4)
+    dst.x[i / 4] = _mm_mullo_epi32 (src1.x[i / 4], src2.x[i / 4]);
+
+  for (i = 0; i < NUM; i++)
+    {
+      value = src1.i[i] * src2.i[i];
+      if (value != dst.i[i])
+	abort ();
+    }
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_2-check.h b/gcc/testsuite/gcc.target/powerpc/sse4_2-check.h
new file mode 100644
index 000000000000..f6264e5a1083
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_2-check.h
@@ -0,0 +1,18 @@
+#define NO_WARN_X86_INTRINSICS 1
+
+static void sse4_2_test (void);
+
+static void
+__attribute__ ((noinline))
+do_test (void)
+{
+  sse4_2_test ();
+}
+
+int
+main ()
+{
+  do_test ();
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_2-pcmpgtq.c b/gcc/testsuite/gcc.target/powerpc/sse4_2-pcmpgtq.c
new file mode 100644
index 000000000000..4bfbad885b30
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_2-pcmpgtq.c
@@ -0,0 +1,46 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -mvsx" } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_2-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_2_test
+#endif
+
+#include CHECK_H
+
+#include <nmmintrin.h>
+
+#define NUM 64
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 2];
+      long long ll[NUM];
+    } dst, src1, src2;
+  int i, sign = 1;
+  long long is_eq;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src1.ll[i] = i * i * sign;
+      src2.ll[i] = (i + 20) * sign;
+      sign = -sign;
+    }
+
+  for (i = 0; i < NUM; i += 2)
+    dst.x[i / 2] = _mm_cmpgt_epi64 (src1.x[i / 2], src2.x[i / 2]);
+
+  for (i = 0; i < NUM; i++)
+    {
+      is_eq = src1.ll[i] > src2.ll[i] ? 0xFFFFFFFFFFFFFFFFLL : 0LL;
+      if (is_eq != dst.ll[i])
+	abort ();
+    }
+}
-- 
2.27.0


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v3 6/6] rs6000: Guard some x86 intrinsics implementations
  2021-08-23 19:03 [PATCH v3 0/6] rs6000: Support more SSE4 intrinsics Paul A. Clarke
                   ` (4 preceding siblings ...)
  2021-08-23 19:03 ` [PATCH v3 5/6] rs6000: Support more SSE4 "cmp", "mul", "pack" intrinsics Paul A. Clarke
@ 2021-08-23 19:03 ` Paul A. Clarke
  2021-08-27 15:25   ` Bill Schmidt
  2021-10-12  0:11   ` Segher Boessenkool
  2021-09-16 14:59 ` [PATCH v3 0/6] rs6000: Support more SSE4 intrinsics Paul A. Clarke
  2021-10-07 22:25 ` Segher Boessenkool
  7 siblings, 2 replies; 47+ messages in thread
From: Paul A. Clarke @ 2021-08-23 19:03 UTC (permalink / raw)
  To: gcc-patches; +Cc: segher, wschmidt

Some compatibility implementations of x86 intrinsics include
Power intrinsics which require POWER8.  Guard them.

emmintrin.h:
- _mm_cmpord_pd: Remove code which was ostensibly for pre-POWER8,
  but which indeed depended on POWER8 (vec_cmpgt(v2du)/vcmpgtud).
  The "POWER8" version works fine on pre-POWER8.
- _mm_mul_epu32: vec_mule(v4su) uses vmuleuw.
pmmintrin.h:
- _mm_movehdup_ps: vec_mergeo(v4su) uses vmrgow.
- _mm_moveldup_ps: vec_mergee(v4su) uses vmrgew.
smmintrin.h:
- _mm_cmpeq_epi64: vec_cmpeq(v2di) uses vcmpequd.
- _mm_mul_epi32: vec_mule(v4si) uses vmulesw.
- _mm_cmpgt_epi64: vec_cmpgt(v2di) uses vcmpgtsd.
tmmintrin.h:
- _mm_sign_epi8: vec_neg(v16qi) uses vsububm.
- _mm_sign_epi16: vec_neg(v8hi) uses vsubuhm.
- _mm_sign_epi32: vec_neg(v4si) uses vsubuwm.
  Note that the above three could actually be supported pre-POWER8,
  but current GCC does not implement vec_neg before POWER8.
- _mm_sign_pi8: depends on _mm_sign_epi8.
- _mm_sign_pi16: depends on _mm_sign_epi16.
- _mm_sign_pi32: depends on _mm_sign_epi32.
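
For readers unfamiliar with the intrinsics being guarded, here is a
portable scalar sketch of what _mm_sign_epi32 computes per 32-bit lane
(the helper name is ours, for illustration only; it is not the vector
implementation):

```c
#include <stdint.h>
#include <assert.h>

/* Scalar model of one lane of _mm_sign_epi32: negate __A's lane when
   __B's lane is negative, zero it when __B's lane is zero, and pass it
   through unchanged when __B's lane is positive.  */
static int32_t
sign_epi32_lane (int32_t a, int32_t b)
{
  if (b < 0)
    return -a;   /* control lane negative: negate */
  if (b == 0)
    return 0;    /* control lane zero: zero the result */
  return a;      /* control lane positive: pass through */
}
```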

2021-08-20  Paul A. Clarke  <pc@us.ibm.com>

gcc
	PR target/101893
	* config/rs6000/emmintrin.h: Guard POWER8 intrinsics.
	* config/rs6000/pmmintrin.h: Same.
	* config/rs6000/smmintrin.h: Same.
	* config/rs6000/tmmintrin.h: Same.
---
v3: No change.
v2:
- Ensured that the new "#ifdef _ARCH_PWR8" guards bracket each affected
  function individually, rather than groups of functions, per v1 review.
- Noted testing in patch series cover letter.
- Added PR number to commit message.

 gcc/config/rs6000/emmintrin.h | 12 ++----------
 gcc/config/rs6000/pmmintrin.h |  4 ++++
 gcc/config/rs6000/smmintrin.h |  4 ++++
 gcc/config/rs6000/tmmintrin.h | 12 ++++++++++++
 4 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/gcc/config/rs6000/emmintrin.h b/gcc/config/rs6000/emmintrin.h
index ce1287edf782..32ad72b4cc35 100644
--- a/gcc/config/rs6000/emmintrin.h
+++ b/gcc/config/rs6000/emmintrin.h
@@ -430,20 +430,10 @@ _mm_cmpnge_pd (__m128d __A, __m128d __B)
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_cmpord_pd (__m128d __A, __m128d __B)
 {
-#if _ARCH_PWR8
   __v2du c, d;
   /* Compare against self will return false (0's) if NAN.  */
   c = (__v2du)vec_cmpeq (__A, __A);
   d = (__v2du)vec_cmpeq (__B, __B);
-#else
-  __v2du a, b;
-  __v2du c, d;
-  const __v2du double_exp_mask  = {0x7ff0000000000000, 0x7ff0000000000000};
-  a = (__v2du)vec_abs ((__v2df)__A);
-  b = (__v2du)vec_abs ((__v2df)__B);
-  c = (__v2du)vec_cmpgt (double_exp_mask, a);
-  d = (__v2du)vec_cmpgt (double_exp_mask, b);
-#endif
   /* A != NAN and B != NAN.  */
   return ((__m128d)vec_and(c, d));
 }
@@ -1472,6 +1462,7 @@ _mm_mul_su32 (__m64 __A, __m64 __B)
   return ((__m64)a * (__m64)b);
 }
 
+#ifdef _ARCH_PWR8
 extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_mul_epu32 (__m128i __A, __m128i __B)
 {
@@ -1498,6 +1489,7 @@ _mm_mul_epu32 (__m128i __A, __m128i __B)
   return (__m128i) vec_mule ((__v4su)__A, (__v4su)__B);
 #endif
 }
+#endif
 
 extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_slli_epi16 (__m128i __A, int __B)
diff --git a/gcc/config/rs6000/pmmintrin.h b/gcc/config/rs6000/pmmintrin.h
index eab712fdfa66..83dff1d85666 100644
--- a/gcc/config/rs6000/pmmintrin.h
+++ b/gcc/config/rs6000/pmmintrin.h
@@ -123,17 +123,21 @@ _mm_hsub_pd (__m128d __X, __m128d __Y)
 			    vec_mergel ((__v2df) __X, (__v2df)__Y));
 }
 
+#ifdef _ARCH_PWR8
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_movehdup_ps (__m128 __X)
 {
   return (__m128)vec_mergeo ((__v4su)__X, (__v4su)__X);
 }
+#endif
 
+#ifdef _ARCH_PWR8
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_moveldup_ps (__m128 __X)
 {
   return (__m128)vec_mergee ((__v4su)__X, (__v4su)__X);
 }
+#endif
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_loaddup_pd (double const *__P)
diff --git a/gcc/config/rs6000/smmintrin.h b/gcc/config/rs6000/smmintrin.h
index c04d2bb5b6d3..29719367e205 100644
--- a/gcc/config/rs6000/smmintrin.h
+++ b/gcc/config/rs6000/smmintrin.h
@@ -272,6 +272,7 @@ _mm_extract_ps (__m128 __X, const int __N)
   return ((__v4si)__X)[__N & 3];
 }
 
+#ifdef _ARCH_PWR8
 extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_blend_epi16 (__m128i __A, __m128i __B, const int __imm8)
 {
@@ -283,6 +284,7 @@ _mm_blend_epi16 (__m128i __A, __m128i __B, const int __imm8)
   #endif
   return (__m128i) vec_sel ((__v8hu) __A, (__v8hu) __B, __shortmask);
 }
+#endif
 
 extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_blendv_epi8 (__m128i __A, __m128i __B, __m128i __mask)
@@ -343,6 +345,7 @@ _mm_blend_pd (__m128d __A, __m128d __B, const int __imm8)
   return (__m128d) __r;
 }
 
+#ifdef _ARCH_PWR8
 __inline __m128d
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm_blendv_pd (__m128d __A, __m128d __B, __m128d __mask)
@@ -351,6 +354,7 @@ _mm_blendv_pd (__m128d __A, __m128d __B, __m128d __mask)
   const __vector __bool long long __boolmask = vec_cmplt ((__v2di) __mask, __zero);
   return (__m128d) vec_sel ((__v2du) __A, (__v2du) __B, (__v2du) __boolmask);
 }
+#endif
 
 __inline int
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
diff --git a/gcc/config/rs6000/tmmintrin.h b/gcc/config/rs6000/tmmintrin.h
index 971511260b78..a67d88c8079a 100644
--- a/gcc/config/rs6000/tmmintrin.h
+++ b/gcc/config/rs6000/tmmintrin.h
@@ -350,6 +350,7 @@ _mm_shuffle_pi8 (__m64 __A, __m64 __B)
   return (__m64) ((__v2du) (__C))[0];
 }
 
+#ifdef _ARCH_PWR8
 extern __inline __m128i
 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sign_epi8 (__m128i __A, __m128i __B)
@@ -361,7 +362,9 @@ _mm_sign_epi8 (__m128i __A, __m128i __B)
   __v16qi __conv = vec_add (__selectneg, __selectpos);
   return (__m128i) vec_mul ((__v16qi) __A, (__v16qi) __conv);
 }
+#endif
 
+#ifdef _ARCH_PWR8
 extern __inline __m128i
 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sign_epi16 (__m128i __A, __m128i __B)
@@ -373,7 +376,9 @@ _mm_sign_epi16 (__m128i __A, __m128i __B)
   __v8hi __conv = vec_add (__selectneg, __selectpos);
   return (__m128i) vec_mul ((__v8hi) __A, (__v8hi) __conv);
 }
+#endif
 
+#ifdef _ARCH_PWR8
 extern __inline __m128i
 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sign_epi32 (__m128i __A, __m128i __B)
@@ -385,7 +390,9 @@ _mm_sign_epi32 (__m128i __A, __m128i __B)
   __v4si __conv = vec_add (__selectneg, __selectpos);
   return (__m128i) vec_mul ((__v4si) __A, (__v4si) __conv);
 }
+#endif
 
+#ifdef _ARCH_PWR8
 extern __inline __m64
 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sign_pi8 (__m64 __A, __m64 __B)
@@ -396,7 +403,9 @@ _mm_sign_pi8 (__m64 __A, __m64 __B)
   __C = (__v16qi) _mm_sign_epi8 ((__m128i) __C, (__m128i) __D);
   return (__m64) ((__v2du) (__C))[0];
 }
+#endif
 
+#ifdef _ARCH_PWR8
 extern __inline __m64
 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sign_pi16 (__m64 __A, __m64 __B)
@@ -407,7 +416,9 @@ _mm_sign_pi16 (__m64 __A, __m64 __B)
   __C = (__v8hi) _mm_sign_epi16 ((__m128i) __C, (__m128i) __D);
   return (__m64) ((__v2du) (__C))[0];
 }
+#endif
 
+#ifdef _ARCH_PWR8
 extern __inline __m64
 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sign_pi32 (__m64 __A, __m64 __B)
@@ -418,6 +429,7 @@ _mm_sign_pi32 (__m64 __A, __m64 __B)
   __C = (__v4si) _mm_sign_epi32 ((__m128i) __C, (__m128i) __D);
   return (__m64) ((__v2du) (__C))[0];
 }
+#endif
 
 extern __inline __m128i
 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
-- 
2.27.0


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v3 1/6] rs6000: Support SSE4.1 "round" intrinsics
  2021-08-23 19:03 ` [PATCH v3 1/6] rs6000: Support SSE4.1 "round" intrinsics Paul A. Clarke
@ 2021-08-27 13:44   ` Bill Schmidt
  2021-08-27 13:47     ` Bill Schmidt
  2021-08-30 21:16     ` Paul A. Clarke
  2021-10-07 23:39   ` Segher Boessenkool
  1 sibling, 2 replies; 47+ messages in thread
From: Bill Schmidt @ 2021-08-27 13:44 UTC (permalink / raw)
  To: Paul A. Clarke, gcc-patches; +Cc: segher

Hi Paul,

On 8/23/21 2:03 PM, Paul A. Clarke wrote:
> Suppress exceptions (when specified), by saving, manipulating, and
> restoring the FPSCR.  Similarly, save, set, and restore the floating-point
> rounding mode when required.
>
> No attempt is made to optimize writing the FPSCR (by checking if the new
> value would be the same), other than using lighter weight instructions
> when possible.
>
> The scalar versions naively use the parallel versions to compute the
> single scalar result and then construct the remainder of the result.
>
> Of minor note, the values of _MM_FROUND_TO_NEG_INF and _MM_FROUND_TO_ZERO
> are swapped from the corresponding values on x86 so as to match the
> corresponding rounding mode values in the Power ISA.
>
> Move implementations of _mm_ceil* and _mm_floor* into _mm_round*, and
> convert _mm_ceil* and _mm_floor* into macros. This matches the current
> analogous implementations in config/i386/smmintrin.h.
>
> Function signatures match the analogous functions in config/i386/smmintrin.h.
>
> Add tests for _mm_round_pd, _mm_round_ps, _mm_round_sd, _mm_round_ss,
> modeled after the very similar "floor" and "ceil" tests.
>
> Include basic tests, plus tests at the boundaries for floating-point
> representation, positive and negative, test all of the parameterized
> rounding modes as well as the C99 rounding modes and interactions
> between the two.
>
> Exceptions are not explicitly tested.

Again, please specify where the patch was tested and whether this is for 
trunk, backports, etc.  Thanks!  (I know you aren't asking for 
backports, but in general please get in the habit of this.)
>
> 2021-08-20  Paul A. Clarke  <pc@us.ibm.com>
>
> gcc
> 	* config/rs6000/smmintrin.h (_mm_round_pd, _mm_round_ps,
> 	_mm_round_sd, _mm_round_ss, _MM_FROUND_TO_NEAREST_INT,
> 	_MM_FROUND_TO_ZERO, _MM_FROUND_TO_POS_INF, _MM_FROUND_TO_NEG_INF,
> 	_MM_FROUND_CUR_DIRECTION, _MM_FROUND_RAISE_EXC, _MM_FROUND_NO_EXC,
> 	_MM_FROUND_NINT, _MM_FROUND_FLOOR, _MM_FROUND_CEIL, _MM_FROUND_TRUNC,
> 	_MM_FROUND_RINT, _MM_FROUND_NEARBYINT): New.
> 	* config/rs6000/smmintrin.h (_mm_ceil_pd, _mm_ceil_ps, _mm_ceil_sd,
> 	_mm_ceil_ss, _mm_floor_pd, _mm_floor_ps, _mm_floor_sd, _mm_floor_ss):
> 	Convert from function to macro.
>
> gcc/testsuite
> 	* gcc.target/powerpc/sse4_1-round3.h: New.
> 	* gcc.target/powerpc/sse4_1-roundpd.c: New.
> 	* gcc.target/powerpc/sse4_1-roundps.c: New.
> 	* gcc.target/powerpc/sse4_1-roundsd.c: New.
> 	* gcc.target/powerpc/sse4_1-roundss.c: New.
> ---
> v3: No change.
> v2:
> - Replaced clever (and broken) exception masking with more straightforward
>    implementation, per v1 review and closer inspection. mtfsf was only
>    writing the final nybble (1) instead of the final two nybbles (2), so
>    not all of the exception-enable bits were cleared.
> - Renamed some variables from cryptic "tmp" and "save" to
>    "fpscr_save" and "enables_save".
> - Retained use of __builtin_mffsl, since that is supported pre-POWER8
>    (with an alternate instruction sequence).
> - Added "extern" to functions to maintain compatible decorations with
>    like implementations in gcc/config/i386.
> - Added some additional text to the commit message about some of the
>    (unpleasant?) implementations and decorations coming from
>    like implementations in gcc/config/i386, per v1 review.
> - Removed "-Wno-psabi" from tests as unnecessary, per v1 review.
> - Fixed indentation and other minor formatting changes, per v1 review.
> - Noted testing in patch series cover letter.
>
>   gcc/config/rs6000/smmintrin.h                 | 240 +++++++++++-----
>   .../gcc.target/powerpc/sse4_1-round3.h        |  81 ++++++
>   .../gcc.target/powerpc/sse4_1-roundpd.c       | 143 ++++++++++
>   .../gcc.target/powerpc/sse4_1-roundps.c       |  98 +++++++
>   .../gcc.target/powerpc/sse4_1-roundsd.c       | 256 ++++++++++++++++++
>   .../gcc.target/powerpc/sse4_1-roundss.c       | 208 ++++++++++++++
>   6 files changed, 962 insertions(+), 64 deletions(-)
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-round3.h
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-roundpd.c
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-roundps.c
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-roundsd.c
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-roundss.c
>
> diff --git a/gcc/config/rs6000/smmintrin.h b/gcc/config/rs6000/smmintrin.h
> index 3767a67eada7..a6b88d313ad0 100644
> --- a/gcc/config/rs6000/smmintrin.h
> +++ b/gcc/config/rs6000/smmintrin.h
> @@ -42,6 +42,182 @@
>   #include <altivec.h>
>   #include <tmmintrin.h>
>   
> +/* Rounding mode macros. */
> +#define _MM_FROUND_TO_NEAREST_INT       0x00
> +#define _MM_FROUND_TO_ZERO              0x01
> +#define _MM_FROUND_TO_POS_INF           0x02
> +#define _MM_FROUND_TO_NEG_INF           0x03
> +#define _MM_FROUND_CUR_DIRECTION        0x04
> +
> +#define _MM_FROUND_NINT		\
> +  (_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_RAISE_EXC)
> +#define _MM_FROUND_FLOOR	\
> +  (_MM_FROUND_TO_NEG_INF | _MM_FROUND_RAISE_EXC)
> +#define _MM_FROUND_CEIL		\
> +  (_MM_FROUND_TO_POS_INF | _MM_FROUND_RAISE_EXC)
> +#define _MM_FROUND_TRUNC	\
> +  (_MM_FROUND_TO_ZERO | _MM_FROUND_RAISE_EXC)
> +#define _MM_FROUND_RINT		\
> +  (_MM_FROUND_CUR_DIRECTION | _MM_FROUND_RAISE_EXC)
> +#define _MM_FROUND_NEARBYINT	\
> +  (_MM_FROUND_CUR_DIRECTION | _MM_FROUND_NO_EXC)
> +
> +#define _MM_FROUND_RAISE_EXC            0x00
> +#define _MM_FROUND_NO_EXC               0x08
> +
> +extern __inline __m128d
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_round_pd (__m128d __A, int __rounding)
> +{
> +  __v2df __r;
> +  union {
> +    double __fr;
> +    long long __fpscr;
> +  } __enables_save, __fpscr_save;
> +
> +  if (__rounding & _MM_FROUND_NO_EXC)
> +    {
> +      /* Save enabled exceptions, disable all exceptions,
> +	 and preserve the rounding mode.  */
> +#ifdef _ARCH_PWR9
> +      __asm__ __volatile__ ("mffsce %0" : "=f" (__fpscr_save.__fr));
> +      __enables_save.__fpscr = __fpscr_save.__fpscr & 0xf8;
> +#else
> +      __fpscr_save.__fr = __builtin_mffs ();
> +      __enables_save.__fpscr = __fpscr_save.__fpscr & 0xf8;
> +      __fpscr_save.__fpscr &= ~0xf8;
> +      __builtin_mtfsf (0b00000011, __fpscr_save.__fr);
> +#endif
> +    }
> +
> +  switch (__rounding)
> +    {
> +      case _MM_FROUND_TO_NEAREST_INT:
> +	__fpscr_save.__fr = __builtin_mffsl ();

As pointed out in the v1 review, __builtin_mffsl is enabled (or supposed 
to be) only for POWER9 and later.  This will fail to work on POWER8 and 
earlier when the new builtins support is complete and this is enforced 
more carefully.  Please #ifdef and use __builtin_mffs on earlier 
processors.  Please do this everywhere this occurs.
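
For example, an untested sketch of the shape I mean (reusing the
variable names from the patch):

```
#ifdef _ARCH_PWR9
      __fpscr_save.__fr = __builtin_mffsl ();
#else
      __fpscr_save.__fr = __builtin_mffs ();
#endif
```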

I think you got some contradictory guidance on this, but trust me, this 
will break.

Otherwise it looks to me that comments were all addressed 
appropriately.  Recommend approval with that fixed.

Thanks!
Bill

> +	__attribute__ ((fallthrough));
> +      case _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC:
> +	__builtin_set_fpscr_rn (0b00);
> +	__r = vec_rint ((__v2df) __A);
> +	__builtin_set_fpscr_rn (__fpscr_save.__fpscr);
> +	break;
> +      case _MM_FROUND_TO_NEG_INF:
> +      case _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC:
> +	__r = vec_floor ((__v2df) __A);
> +	break;
> +      case _MM_FROUND_TO_POS_INF:
> +      case _MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC:
> +	__r = vec_ceil ((__v2df) __A);
> +	break;
> +      case _MM_FROUND_TO_ZERO:
> +      case _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC:
> +	__r = vec_trunc ((__v2df) __A);
> +	break;
> +      case _MM_FROUND_CUR_DIRECTION:
> +	__r = vec_rint ((__v2df) __A);
> +	break;
> +    }
> +  if (__rounding & _MM_FROUND_NO_EXC)
> +    {
> +      /* Restore enabled exceptions.  */
> +      __fpscr_save.__fr = __builtin_mffsl ();
> +      __fpscr_save.__fpscr |= __enables_save.__fpscr;
> +      __builtin_mtfsf (0b00000011, __fpscr_save.__fr);
> +    }
> +  return (__m128d) __r;
> +}
> +
> +extern __inline __m128d
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_round_sd (__m128d __A, __m128d __B, int __rounding)
> +{
> +  __B = _mm_round_pd (__B, __rounding);
> +  __v2df __r = { ((__v2df)__B)[0], ((__v2df) __A)[1] };
> +  return (__m128d) __r;
> +}
> +
> +extern __inline __m128
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_round_ps (__m128 __A, int __rounding)
> +{
> +  __v4sf __r;
> +  union {
> +    double __fr;
> +    long long __fpscr;
> +  } __enables_save, __fpscr_save;
> +
> +  if (__rounding & _MM_FROUND_NO_EXC)
> +    {
> +      /* Save enabled exceptions, disable all exceptions,
> +	 and preserve the rounding mode.  */
> +#ifdef _ARCH_PWR9
> +      __asm__ __volatile__ ("mffsce %0" : "=f" (__fpscr_save.__fr));
> +      __enables_save.__fpscr = __fpscr_save.__fpscr & 0xf8;
> +#else
> +      __fpscr_save.__fr = __builtin_mffs ();
> +      __enables_save.__fpscr = __fpscr_save.__fpscr & 0xf8;
> +      __fpscr_save.__fpscr &= ~0xf8;
> +      __builtin_mtfsf (0b00000011, __fpscr_save.__fr);
> +#endif
> +    }
> +
> +  switch (__rounding)
> +    {
> +      case _MM_FROUND_TO_NEAREST_INT:
> +	__fpscr_save.__fr = __builtin_mffsl ();
> +	__attribute__ ((fallthrough));
> +      case _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC:
> +	__builtin_set_fpscr_rn (0b00);
> +	__r = vec_rint ((__v4sf) __A);
> +	__builtin_set_fpscr_rn (__fpscr_save.__fpscr);
> +	break;
> +      case _MM_FROUND_TO_NEG_INF:
> +      case _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC:
> +	__r = vec_floor ((__v4sf) __A);
> +	break;
> +      case _MM_FROUND_TO_POS_INF:
> +      case _MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC:
> +	__r = vec_ceil ((__v4sf) __A);
> +	break;
> +      case _MM_FROUND_TO_ZERO:
> +      case _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC:
> +	__r = vec_trunc ((__v4sf) __A);
> +	break;
> +      case _MM_FROUND_CUR_DIRECTION:
> +	__r = vec_rint ((__v4sf) __A);
> +	break;
> +    }
> +  if (__rounding & _MM_FROUND_NO_EXC)
> +    {
> +      /* Restore enabled exceptions.  */
> +      __fpscr_save.__fr = __builtin_mffsl ();
> +      __fpscr_save.__fpscr |= __enables_save.__fpscr;
> +      __builtin_mtfsf (0b00000011, __fpscr_save.__fr);
> +    }
> +  return (__m128) __r;
> +}
> +
> +extern __inline __m128
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_round_ss (__m128 __A, __m128 __B, int __rounding)
> +{
> +  __B = _mm_round_ps (__B, __rounding);
> +  __v4sf __r = (__v4sf) __A;
> +  __r[0] = ((__v4sf)__B)[0];
> +  return (__m128) __r;
> +}
> +
> +#define _mm_ceil_pd(V)	   _mm_round_pd ((V), _MM_FROUND_CEIL)
> +#define _mm_ceil_sd(D, V)  _mm_round_sd ((D), (V), _MM_FROUND_CEIL)
> +
> +#define _mm_floor_pd(V)	   _mm_round_pd((V), _MM_FROUND_FLOOR)
> +#define _mm_floor_sd(D, V) _mm_round_sd ((D), (V), _MM_FROUND_FLOOR)
> +
> +#define _mm_ceil_ps(V)	   _mm_round_ps ((V), _MM_FROUND_CEIL)
> +#define _mm_ceil_ss(D, V)  _mm_round_ss ((D), (V), _MM_FROUND_CEIL)
> +
> +#define _mm_floor_ps(V)	   _mm_round_ps ((V), _MM_FROUND_FLOOR)
> +#define _mm_floor_ss(D, V) _mm_round_ss ((D), (V), _MM_FROUND_FLOOR)
> +
>   extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
>   _mm_insert_epi8 (__m128i const __A, int const __D, int const __N)
>   {
> @@ -232,70 +408,6 @@ _mm_test_mix_ones_zeros (__m128i __A, __m128i __mask)
>     return any_ones * any_zeros;
>   }
>   
> -__inline __m128d
> -__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> -_mm_ceil_pd (__m128d __A)
> -{
> -  return (__m128d) vec_ceil ((__v2df) __A);
> -}
> -
> -__inline __m128d
> -__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> -_mm_ceil_sd (__m128d __A, __m128d __B)
> -{
> -  __v2df __r = vec_ceil ((__v2df) __B);
> -  __r[1] = ((__v2df) __A)[1];
> -  return (__m128d) __r;
> -}
> -
> -__inline __m128d
> -__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> -_mm_floor_pd (__m128d __A)
> -{
> -  return (__m128d) vec_floor ((__v2df) __A);
> -}
> -
> -__inline __m128d
> -__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> -_mm_floor_sd (__m128d __A, __m128d __B)
> -{
> -  __v2df __r = vec_floor ((__v2df) __B);
> -  __r[1] = ((__v2df) __A)[1];
> -  return (__m128d) __r;
> -}
> -
> -__inline __m128
> -__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> -_mm_ceil_ps (__m128 __A)
> -{
> -  return (__m128) vec_ceil ((__v4sf) __A);
> -}
> -
> -__inline __m128
> -__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> -_mm_ceil_ss (__m128 __A, __m128 __B)
> -{
> -  __v4sf __r = (__v4sf) __A;
> -  __r[0] = __builtin_ceil (((__v4sf) __B)[0]);
> -  return __r;
> -}
> -
> -__inline __m128
> -__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> -_mm_floor_ps (__m128 __A)
> -{
> -  return (__m128) vec_floor ((__v4sf) __A);
> -}
> -
> -__inline __m128
> -__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> -_mm_floor_ss (__m128 __A, __m128 __B)
> -{
> -  __v4sf __r = (__v4sf) __A;
> -  __r[0] = __builtin_floor (((__v4sf) __B)[0]);
> -  return __r;
> -}
> -
>   /* Return horizontal packed word minimum and its index in bits [15:0]
>      and bits [18:16] respectively.  */
>   __inline __m128i
> diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-round3.h b/gcc/testsuite/gcc.target/powerpc/sse4_1-round3.h
> new file mode 100644
> index 000000000000..de6cbf7be438
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-round3.h
> @@ -0,0 +1,81 @@
> +#include <smmintrin.h>
> +#include <fenv.h>
> +#include "sse4_1-check.h"
> +
> +#define DIM(a) (sizeof (a) / sizeof (a)[0])
> +
> +static int roundings[] =
> +  {
> +    _MM_FROUND_TO_NEAREST_INT,
> +    _MM_FROUND_TO_NEG_INF,
> +    _MM_FROUND_TO_POS_INF,
> +    _MM_FROUND_TO_ZERO,
> +    _MM_FROUND_CUR_DIRECTION
> +  };
> +
> +static int modes[] =
> +  {
> +    FE_TONEAREST,
> +    FE_UPWARD,
> +    FE_DOWNWARD,
> +    FE_TOWARDZERO
> +  };
> +
> +static void
> +TEST (void)
> +{
> +  int i, j, ri, mi, round_save;
> +
> +  round_save = fegetround ();
> +  for (mi = 0; mi < DIM (modes); mi++) {
> +    fesetround (modes[mi]);
> +    for (i = 0; i < DIM (data); i++) {
> +      for (ri = 0; ri < DIM (roundings); ri++) {
> +	union value guess;
> +	union value *current_answers = answers[ri];
> +	switch ( roundings[ri] ) {
> +	  case _MM_FROUND_TO_NEAREST_INT:
> +	    guess.x = ROUND_INTRIN (data[i].value1.x, data[i].value2.x,
> +				    _MM_FROUND_TO_NEAREST_INT);
> +	    break;
> +	  case _MM_FROUND_TO_NEG_INF:
> +	    guess.x = ROUND_INTRIN (data[i].value1.x, data[i].value2.x,
> +				    _MM_FROUND_TO_NEG_INF);
> +	    break;
> +	  case _MM_FROUND_TO_POS_INF:
> +	    guess.x = ROUND_INTRIN (data[i].value1.x, data[i].value2.x,
> +				    _MM_FROUND_TO_POS_INF);
> +	    break;
> +	  case _MM_FROUND_TO_ZERO:
> +	    guess.x = ROUND_INTRIN (data[i].value1.x, data[i].value2.x,
> +				    _MM_FROUND_TO_ZERO);
> +	    break;
> +	  case _MM_FROUND_CUR_DIRECTION:
> +	    guess.x = ROUND_INTRIN (data[i].value1.x, data[i].value2.x,
> +				    _MM_FROUND_CUR_DIRECTION);
> +	    switch ( modes[mi] ) {
> +	      case FE_TONEAREST:
> +		current_answers = answers_NEAREST_INT;
> +		break;
> +	      case FE_UPWARD:
> +		current_answers = answers_POS_INF;
> +		break;
> +	      case FE_DOWNWARD:
> +		current_answers = answers_NEG_INF;
> +		break;
> +	      case FE_TOWARDZERO:
> +		current_answers = answers_ZERO;
> +		break;
> +	    }
> +	    break;
> +	  default:
> +	    abort ();
> +	}
> +	for (j = 0; j < DIM (guess.f); j++)
> +	  if (guess.f[j] != current_answers[i].f[j])
> +	    abort ();
> +      }
> +    }
> +  }
> +  fesetround (round_save);
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-roundpd.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-roundpd.c
> new file mode 100644
> index 000000000000..0528c395f233
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-roundpd.c
> @@ -0,0 +1,143 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-options "-O2 -mvsx" } */
> +
> +#define NO_WARN_X86_INTRINSICS 1
> +#include <smmintrin.h>
> +
> +#define VEC_T __m128d
> +#define FP_T double
> +
> +#define ROUND_INTRIN(x, ignored, mode) _mm_round_pd (x, mode)
> +
> +#include "sse4_1-round-data.h"
> +
> +struct data2 data[] = {
> +  { .value1 = { .f = {  0.00,  0.25 } } },
> +  { .value1 = { .f = {  0.50,  0.75 } } },
> +
> +  { .value1 = { .f = {  0x1.ffffffffffffcp+50,  0x1.ffffffffffffdp+50 } } },
> +  { .value1 = { .f = {  0x1.ffffffffffffep+50,  0x1.fffffffffffffp+50 } } },
> +  { .value1 = { .f = {  0x1.0000000000000p+51,  0x1.0000000000001p+51 } } },
> +  { .value1 = { .f = {  0x1.0000000000002p+51,  0x1.0000000000003p+51 } } },
> +
> +  { .value1 = { .f = {  0x1.ffffffffffffep+51,  0x1.fffffffffffffp+51 } } },
> +  { .value1 = { .f = {  0x1.0000000000000p+52,  0x1.0000000000001p+52 } } },
> +
> +  { .value1 = { .f = { -0x1.0000000000001p+52, -0x1.0000000000000p+52 } } },
> +  { .value1 = { .f = { -0x1.fffffffffffffp+51, -0x1.ffffffffffffep+51 } } },
> +
> +  { .value1 = { .f = { -0x1.0000000000004p+51, -0x1.0000000000002p+51 } } },
> +  { .value1 = { .f = { -0x1.0000000000001p+51, -0x1.0000000000000p+51 } } },
> +  { .value1 = { .f = { -0x1.ffffffffffffcp+50, -0x1.ffffffffffffep+50 } } },
> +  { .value1 = { .f = { -0x1.ffffffffffffdp+50, -0x1.ffffffffffffcp+50 } } },
> +
> +  { .value1 = { .f = { -1.00, -0.75 } } },
> +  { .value1 = { .f = { -0.50, -0.25 } } }
> +};
> +
> +union value answers_NEAREST_INT[] = {
> +  { .f = {  0.00,  0.00 } },
> +  { .f = {  0.00,  1.00 } },
> +
> +  { .f = {  0x1.ffffffffffffcp+50,  0x1.ffffffffffffcp+50 } },
> +  { .f = {  0x1.0000000000000p+51,  0x1.0000000000000p+51 } },
> +  { .f = {  0x1.0000000000000p+51,  0x1.0000000000000p+51 } },
> +  { .f = {  0x1.0000000000002p+51,  0x1.0000000000004p+51 } },
> +
> +  { .f = {  0x1.ffffffffffffep+51,  0x1.0000000000000p+52 } },
> +  { .f = {  0x1.0000000000000p+52,  0x1.0000000000001p+52 } },
> +
> +  { .f = { -0x1.0000000000001p+52, -0x1.0000000000000p+52 } },
> +  { .f = { -0x1.0000000000000p+52, -0x1.ffffffffffffep+51 } },
> +
> +  { .f = { -0x1.0000000000004p+51, -0x1.0000000000002p+51 } },
> +  { .f = { -0x1.0000000000000p+51, -0x1.0000000000000p+51 } },
> +  { .f = { -0x1.ffffffffffffcp+50, -0x1.0000000000000p+51 } },
> +  { .f = { -0x1.ffffffffffffcp+50, -0x1.ffffffffffffcp+50 } },
> +
> +  { .f = { -1.00, -1.00 } },
> +  { .f = {  0.00,  0.00 } }
> +};
> +
> +union value answers_NEG_INF[] = {
> +  { .f = {  0.00,  0.00 } },
> +  { .f = {  0.00,  0.00 } },
> +
> +  { .f = {  0x1.ffffffffffffcp+50,  0x1.ffffffffffffcp+50 } },
> +  { .f = {  0x1.ffffffffffffcp+50,  0x1.ffffffffffffcp+50 } },
> +  { .f = {  0x1.0000000000000p+51,  0x1.0000000000000p+51 } },
> +  { .f = {  0x1.0000000000002p+51,  0x1.0000000000002p+51 } },
> +
> +  { .f = {  0x1.ffffffffffffep+51,  0x1.ffffffffffffep+51 } },
> +  { .f = {  0x1.0000000000000p+52,  0x1.0000000000001p+52 } },
> +
> +  { .f = { -0x1.0000000000001p+52, -0x1.0000000000000p+52 } },
> +  { .f = { -0x1.0000000000000p+52, -0x1.ffffffffffffep+51 } },
> +
> +  { .f = { -0x1.0000000000004p+51, -0x1.0000000000002p+51 } },
> +  { .f = { -0x1.0000000000002p+51, -0x1.0000000000000p+51 } },
> +  { .f = { -0x1.ffffffffffffcp+50, -0x1.0000000000000p+51 } },
> +  { .f = { -0x1.0000000000000p+51, -0x1.ffffffffffffcp+50 } },
> +
> +  { .f = { -1.00, -1.00 } },
> +  { .f = { -1.00, -1.00 } }
> +};
> +
> +union value answers_POS_INF[] = {
> +  { .f = {  0.00,  1.00 } },
> +  { .f = {  1.00,  1.00 } },
> +
> +  { .f = {  0x1.ffffffffffffcp+50,  0x1.0000000000000p+51 } },
> +  { .f = {  0x1.0000000000000p+51,  0x1.0000000000000p+51 } },
> +  { .f = {  0x1.0000000000000p+51,  0x1.0000000000002p+51 } },
> +  { .f = {  0x1.0000000000002p+51,  0x1.0000000000004p+51 } },
> +
> +  { .f = {  0x1.ffffffffffffep+51,  0x1.0000000000000p+52 } },
> +  { .f = {  0x1.0000000000000p+52,  0x1.0000000000001p+52 } },
> +
> +  { .f = { -0x1.0000000000001p+52, -0x1.0000000000000p+52 } },
> +  { .f = { -0x1.ffffffffffffep+51, -0x1.ffffffffffffep+51 } },
> +
> +  { .f = { -0x1.0000000000004p+51, -0x1.0000000000002p+51 } },
> +  { .f = { -0x1.0000000000000p+51, -0x1.0000000000000p+51 } },
> +  { .f = { -0x1.ffffffffffffcp+50, -0x1.ffffffffffffcp+50 } },
> +  { .f = { -0x1.ffffffffffffcp+50, -0x1.ffffffffffffcp+50 } },
> +
> +  { .f = { -1.00,  0.00 } },
> +  { .f = {  0.00,  0.00 } }
> +};
> +
> +union value answers_ZERO[] = {
> +  { .f = {  0.00,  0.00 } },
> +  { .f = {  0.00,  0.00 } },
> +
> +  { .f = {  0x1.ffffffffffffcp+50,  0x1.ffffffffffffcp+50 } },
> +  { .f = {  0x1.ffffffffffffcp+50,  0x1.ffffffffffffcp+50 } },
> +  { .f = {  0x1.0000000000000p+51,  0x1.0000000000000p+51 } },
> +  { .f = {  0x1.0000000000002p+51,  0x1.0000000000002p+51 } },
> +
> +  { .f = {  0x1.ffffffffffffep+51,  0x1.ffffffffffffep+51 } },
> +  { .f = {  0x1.0000000000000p+52,  0x1.0000000000001p+52 } },
> +
> +  { .f = { -0x1.0000000000001p+52, -0x1.0000000000000p+52 } },
> +  { .f = { -0x1.ffffffffffffep+51, -0x1.ffffffffffffep+51 } },
> +
> +  { .f = { -0x1.0000000000004p+51, -0x1.0000000000002p+51 } },
> +  { .f = { -0x1.0000000000000p+51, -0x1.0000000000000p+51 } },
> +  { .f = { -0x1.ffffffffffffcp+50, -0x1.ffffffffffffcp+50 } },
> +  { .f = { -0x1.ffffffffffffcp+50, -0x1.ffffffffffffcp+50 } },
> +
> +  { .f = { -1.00,  0.00 } },
> +  { .f = {  0.00,  0.00 } }
> +};
> +
> +union value *answers[] = {
> +  answers_NEAREST_INT,
> +  answers_NEG_INF,
> +  answers_POS_INF,
> +  answers_ZERO,
> +  0 /* CUR_DIRECTION answers depend on current rounding mode.  */
> +};
> +
> +#include "sse4_1-round3.h"
> diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-roundps.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-roundps.c
> new file mode 100644
> index 000000000000..6b5362e07590
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-roundps.c
> @@ -0,0 +1,98 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-options "-O2 -mvsx" } */
> +
> +#define NO_WARN_X86_INTRINSICS 1
> +#include <smmintrin.h>
> +
> +#define VEC_T __m128
> +#define FP_T float
> +
> +#define ROUND_INTRIN(x, ignored, mode) _mm_round_ps (x, mode)
> +
> +#include "sse4_1-round-data.h"
> +
> +struct data2 data[] = {
> +  { .value1 = { .f = {  0.00,  0.25,  0.50,  0.75 } } },
> +
> +  { .value1 = { .f = {  0x1.fffff8p+21,  0x1.fffffap+21,
> +			0x1.fffffcp+21,  0x1.fffffep+21 } } },
> +  { .value1 = { .f = {  0x1.fffffap+22,  0x1.fffffcp+22,
> +			0x1.fffffep+22,  0x1.fffffep+23 } } },
> +  { .value1 = { .f = { -0x1.fffffep+23, -0x1.fffffep+22,
> +		       -0x1.fffffcp+22, -0x1.fffffap+22 } } },
> +  { .value1 = { .f = { -0x1.fffffep+21, -0x1.fffffcp+21,
> +		       -0x1.fffffap+21, -0x1.fffff8p+21 } } },
> +
> +  { .value1 = { .f = { -1.00, -0.75, -0.50, -0.25 } } }
> +};
> +
> +union value answers_NEAREST_INT[] = {
> +  { .f = {  0.00,  0.00,  0.00,  1.00 } },
> +
> +  { .f = {  0x1.fffff8p+21,  0x1.fffff8p+21,
> +            0x1.000000p+22,  0x1.000000p+22 } },
> +  { .f = {  0x1.fffff8p+22,  0x1.fffffcp+22,
> +            0x1.000000p+23,  0x1.fffffep+23 } },
> +  { .f = { -0x1.fffffep+23, -0x1.000000p+23,
> +           -0x1.fffffcp+22, -0x1.fffff8p+22 } },
> +  { .f = { -0x1.000000p+22, -0x1.000000p+22,
> +           -0x1.fffff8p+21, -0x1.fffff8p+21 } },
> +
> +  { .f = { -1.00, -1.00,  0.00,  0.00 } }
> +};
> +
> +union value answers_NEG_INF[] = {
> +  { .f = {  0.00,  0.00,  0.00,  0.00 } },
> +
> +  { .f = {  0x1.fffff8p+21,  0x1.fffff8p+21,
> +            0x1.fffff8p+21,  0x1.fffff8p+21 } },
> +  { .f = {  0x1.fffff8p+22,  0x1.fffffcp+22,
> +            0x1.fffffcp+22,  0x1.fffffep+23 } },
> +  { .f = { -0x1.fffffep+23, -0x1.000000p+23,
> +           -0x1.fffffcp+22, -0x1.fffffcp+22 } },
> +  { .f = { -0x1.000000p+22, -0x1.000000p+22,
> +           -0x1.000000p+22, -0x1.fffff8p+21 } },
> +
> +  { .f = { -1.00, -1.00, -1.00, -1.00 } }
> +};
> +
> +union value answers_POS_INF[] = {
> +  { .f = {  0.00,  1.00,  1.00,  1.00 } },
> +
> +  { .f = {  0x1.fffff8p+21,  0x1.000000p+22,
> +            0x1.000000p+22,  0x1.000000p+22 } },
> +  { .f = {  0x1.fffffcp+22,  0x1.fffffcp+22,
> +            0x1.000000p+23,  0x1.fffffep+23 } },
> +  { .f = { -0x1.fffffep+23, -0x1.fffffcp+22,
> +           -0x1.fffffcp+22, -0x1.fffff8p+22 } },
> +  { .f = { -0x1.fffff8p+21, -0x1.fffff8p+21,
> +           -0x1.fffff8p+21, -0x1.fffff8p+21 } },
> +
> +  { .f = { -1.00,  0.00,  0.00,  0.00 } }
> +};
> +
> +union value answers_ZERO[] = {
> +  { .f = {  0.00,  0.00,  0.00,  0.00 } },
> +
> +  { .f = {  0x1.fffff8p+21,  0x1.fffff8p+21,
> +            0x1.fffff8p+21,  0x1.fffff8p+21 } },
> +  { .f = {  0x1.fffff8p+22,  0x1.fffffcp+22,
> +            0x1.fffffcp+22,  0x1.fffffep+23 } },
> +  { .f = { -0x1.fffffep+23, -0x1.fffffcp+22,
> +           -0x1.fffffcp+22, -0x1.fffff8p+22 } },
> +  { .f = { -0x1.fffff8p+21, -0x1.fffff8p+21,
> +           -0x1.fffff8p+21, -0x1.fffff8p+21 } },
> +
> +  { .f = { -1.00,  0.00,  0.00,  0.00 } }
> +};
> +
> +union value *answers[] = {
> +  answers_NEAREST_INT,
> +  answers_NEG_INF,
> +  answers_POS_INF,
> +  answers_ZERO,
> +  0 /* CUR_DIRECTION answers depend on current rounding mode.  */
> +};
> +
> +#include "sse4_1-round3.h"
> diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-roundsd.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-roundsd.c
> new file mode 100644
> index 000000000000..2b0bad6469df
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-roundsd.c
> @@ -0,0 +1,256 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-options "-O2 -mvsx" } */
> +
> +#include <stdio.h>
> +#define NO_WARN_X86_INTRINSICS 1
> +#include <smmintrin.h>
> +
> +#define VEC_T __m128d
> +#define FP_T double
> +
> +#define ROUND_INTRIN(x, y, mode) _mm_round_sd (x, y, mode)
> +
> +#include "sse4_1-round-data.h"
> +
> +static struct data2 data[] = {
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = {  0.00, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = {  0.25, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = {  0.50, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = {  0.75, IGNORED } } },
> +
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = {  0x1.ffffffffffffcp+50, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = {  0x1.ffffffffffffdp+50, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = {  0x1.ffffffffffffep+50, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = {  0x1.fffffffffffffp+50, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = {  0x1.0000000000000p+51, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = {  0x1.0000000000001p+51, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = {  0x1.0000000000002p+51, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = {  0x1.0000000000003p+51, IGNORED } } },
> +
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = {  0x1.ffffffffffffep+51, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = {  0x1.fffffffffffffp+51, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = {  0x1.0000000000000p+52, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = {  0x1.0000000000001p+52, IGNORED } } },
> +
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = { -0x1.0000000000001p+52, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = { -0x1.0000000000000p+52, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = { -0x1.fffffffffffffp+51, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = { -0x1.ffffffffffffep+51, IGNORED } } },
> +
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = { -0x1.0000000000004p+51, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = { -0x1.0000000000002p+51, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = { -0x1.0000000000001p+51, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = { -0x1.0000000000000p+51, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = { -0x1.ffffffffffffcp+50, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = { -0x1.ffffffffffffep+50, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = { -0x1.ffffffffffffdp+50, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = { -0x1.ffffffffffffcp+50, IGNORED } } },
> +
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = { -1.00, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = { -0.75, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = { -0.50, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH } },
> +    .value2 = { .f = { -0.25, IGNORED } } }
> +};
> +
> +static union value answers_NEAREST_INT[] = {
> +  { .f = {  0.00, PASSTHROUGH } },
> +  { .f = {  0.00, PASSTHROUGH } },
> +  { .f = {  0.00, PASSTHROUGH } },
> +  { .f = {  1.00, PASSTHROUGH } },
> +
> +  { .f = {  0x1.ffffffffffffcp+50, PASSTHROUGH } },
> +  { .f = {  0x1.ffffffffffffcp+50, PASSTHROUGH } },
> +  { .f = {  0x1.0000000000000p+51, PASSTHROUGH } },
> +  { .f = {  0x1.0000000000000p+51, PASSTHROUGH } },
> +  { .f = {  0x1.0000000000000p+51, PASSTHROUGH } },
> +  { .f = {  0x1.0000000000000p+51, PASSTHROUGH } },
> +  { .f = {  0x1.0000000000002p+51, PASSTHROUGH } },
> +  { .f = {  0x1.0000000000004p+51, PASSTHROUGH } },
> +
> +  { .f = {  0x1.ffffffffffffep+51, PASSTHROUGH } },
> +  { .f = {  0x1.0000000000000p+52, PASSTHROUGH } },
> +  { .f = {  0x1.0000000000000p+52, PASSTHROUGH } },
> +  { .f = {  0x1.0000000000001p+52, PASSTHROUGH } },
> +
> +  { .f = { -0x1.0000000000001p+52, PASSTHROUGH } },
> +  { .f = { -0x1.0000000000000p+52, PASSTHROUGH } },
> +  { .f = { -0x1.0000000000000p+52, PASSTHROUGH } },
> +  { .f = { -0x1.ffffffffffffep+51, PASSTHROUGH } },
> +
> +  { .f = { -0x1.0000000000004p+51, PASSTHROUGH } },
> +  { .f = { -0x1.0000000000002p+51, PASSTHROUGH } },
> +  { .f = { -0x1.0000000000000p+51, PASSTHROUGH } },
> +  { .f = { -0x1.0000000000000p+51, PASSTHROUGH } },
> +  { .f = { -0x1.ffffffffffffcp+50, PASSTHROUGH } },
> +  { .f = { -0x1.0000000000000p+51, PASSTHROUGH } },
> +  { .f = { -0x1.ffffffffffffcp+50, PASSTHROUGH } },
> +  { .f = { -0x1.ffffffffffffcp+50, PASSTHROUGH } },
> +
> +  { .f = { -1.00, PASSTHROUGH } },
> +  { .f = { -1.00, PASSTHROUGH } },
> +  { .f = { -0.00, PASSTHROUGH } },
> +  { .f = { -0.00, PASSTHROUGH } }
> +};
> +
> +static union value answers_NEG_INF[] = {
> +  { .f = {  0.00, PASSTHROUGH } },
> +  { .f = {  0.00, PASSTHROUGH } },
> +  { .f = {  0.00, PASSTHROUGH } },
> +  { .f = {  0.00, PASSTHROUGH } },
> +
> +  { .f = {  0x1.ffffffffffffcp+50, PASSTHROUGH } },
> +  { .f = {  0x1.ffffffffffffcp+50, PASSTHROUGH } },
> +  { .f = {  0x1.ffffffffffffcp+50, PASSTHROUGH } },
> +  { .f = {  0x1.ffffffffffffcp+50, PASSTHROUGH } },
> +  { .f = {  0x1.0000000000000p+51, PASSTHROUGH } },
> +  { .f = {  0x1.0000000000000p+51, PASSTHROUGH } },
> +  { .f = {  0x1.0000000000002p+51, PASSTHROUGH } },
> +  { .f = {  0x1.0000000000002p+51, PASSTHROUGH } },
> +
> +  { .f = {  0x1.ffffffffffffep+51, PASSTHROUGH } },
> +  { .f = {  0x1.ffffffffffffep+51, PASSTHROUGH } },
> +  { .f = {  0x1.0000000000000p+52, PASSTHROUGH } },
> +  { .f = {  0x1.0000000000001p+52, PASSTHROUGH } },
> +
> +  { .f = { -0x1.0000000000001p+52, PASSTHROUGH } },
> +  { .f = { -0x1.0000000000000p+52, PASSTHROUGH } },
> +  { .f = { -0x1.0000000000000p+52, PASSTHROUGH } },
> +  { .f = { -0x1.ffffffffffffep+51, PASSTHROUGH } },
> +
> +  { .f = { -0x1.0000000000004p+51, PASSTHROUGH } },
> +  { .f = { -0x1.0000000000002p+51, PASSTHROUGH } },
> +  { .f = { -0x1.0000000000002p+51, PASSTHROUGH } },
> +  { .f = { -0x1.0000000000000p+51, PASSTHROUGH } },
> +  { .f = { -0x1.ffffffffffffcp+50, PASSTHROUGH } },
> +  { .f = { -0x1.0000000000000p+51, PASSTHROUGH } },
> +  { .f = { -0x1.0000000000000p+51, PASSTHROUGH } },
> +  { .f = { -0x1.ffffffffffffcp+50, PASSTHROUGH } },
> +
> +  { .f = { -1.00, PASSTHROUGH } },
> +  { .f = { -1.00, PASSTHROUGH } },
> +  { .f = { -1.00, PASSTHROUGH } },
> +  { .f = { -1.00, PASSTHROUGH } }
> +};
> +
> +static union value answers_POS_INF[] = {
> +  { .f = {  0.00, PASSTHROUGH } },
> +  { .f = {  1.00, PASSTHROUGH } },
> +  { .f = {  1.00, PASSTHROUGH } },
> +  { .f = {  1.00, PASSTHROUGH } },
> +
> +  { .f = {  0x1.ffffffffffffcp+50, PASSTHROUGH } },
> +  { .f = {  0x1.0000000000000p+51, PASSTHROUGH } },
> +  { .f = {  0x1.0000000000000p+51, PASSTHROUGH } },
> +  { .f = {  0x1.0000000000000p+51, PASSTHROUGH } },
> +  { .f = {  0x1.0000000000000p+51, PASSTHROUGH } },
> +  { .f = {  0x1.0000000000002p+51, PASSTHROUGH } },
> +  { .f = {  0x1.0000000000002p+51, PASSTHROUGH } },
> +  { .f = {  0x1.0000000000004p+51, PASSTHROUGH } },
> +
> +  { .f = {  0x1.ffffffffffffep+51, PASSTHROUGH } },
> +  { .f = {  0x1.0000000000000p+52, PASSTHROUGH } },
> +  { .f = {  0x1.0000000000000p+52, PASSTHROUGH } },
> +  { .f = {  0x1.0000000000001p+52, PASSTHROUGH } },
> +
> +  { .f = { -0x1.0000000000001p+52, PASSTHROUGH } },
> +  { .f = { -0x1.0000000000000p+52, PASSTHROUGH } },
> +  { .f = { -0x1.ffffffffffffep+51, PASSTHROUGH } },
> +  { .f = { -0x1.ffffffffffffep+51, PASSTHROUGH } },
> +
> +  { .f = { -0x1.0000000000004p+51, PASSTHROUGH } },
> +  { .f = { -0x1.0000000000002p+51, PASSTHROUGH } },
> +  { .f = { -0x1.0000000000000p+51, PASSTHROUGH } },
> +  { .f = { -0x1.0000000000000p+51, PASSTHROUGH } },
> +  { .f = { -0x1.ffffffffffffcp+50, PASSTHROUGH } },
> +  { .f = { -0x1.ffffffffffffcp+50, PASSTHROUGH } },
> +  { .f = { -0x1.ffffffffffffcp+50, PASSTHROUGH } },
> +  { .f = { -0x1.ffffffffffffcp+50, PASSTHROUGH } },
> +
> +  { .f = { -1.00, PASSTHROUGH } },
> +  { .f = {  0.00, PASSTHROUGH } },
> +  { .f = {  0.00, PASSTHROUGH } },
> +  { .f = {  0.00, PASSTHROUGH } }
> +};
> +
> +static union value answers_ZERO[] = {
> +  { .f = {  0.00, PASSTHROUGH } },
> +  { .f = {  0.00, PASSTHROUGH } },
> +  { .f = {  0.00, PASSTHROUGH } },
> +  { .f = {  0.00, PASSTHROUGH } },
> +
> +  { .f = {  0x1.ffffffffffffcp+50, PASSTHROUGH } },
> +  { .f = {  0x1.ffffffffffffcp+50, PASSTHROUGH } },
> +  { .f = {  0x1.ffffffffffffcp+50, PASSTHROUGH } },
> +  { .f = {  0x1.ffffffffffffcp+50, PASSTHROUGH } },
> +  { .f = {  0x1.0000000000000p+51, PASSTHROUGH } },
> +  { .f = {  0x1.0000000000000p+51, PASSTHROUGH } },
> +  { .f = {  0x1.0000000000002p+51, PASSTHROUGH } },
> +  { .f = {  0x1.0000000000002p+51, PASSTHROUGH } },
> +
> +  { .f = {  0x1.ffffffffffffep+51, PASSTHROUGH } },
> +  { .f = {  0x1.ffffffffffffep+51, PASSTHROUGH } },
> +  { .f = {  0x1.0000000000000p+52, PASSTHROUGH } },
> +  { .f = {  0x1.0000000000001p+52, PASSTHROUGH } },
> +
> +  { .f = { -0x1.0000000000001p+52, PASSTHROUGH } },
> +  { .f = { -0x1.0000000000000p+52, PASSTHROUGH } },
> +  { .f = { -0x1.ffffffffffffep+51, PASSTHROUGH } },
> +  { .f = { -0x1.ffffffffffffep+51, PASSTHROUGH } },
> +
> +  { .f = { -0x1.0000000000004p+51, PASSTHROUGH } },
> +  { .f = { -0x1.0000000000002p+51, PASSTHROUGH } },
> +  { .f = { -0x1.0000000000000p+51, PASSTHROUGH } },
> +  { .f = { -0x1.0000000000000p+51, PASSTHROUGH } },
> +  { .f = { -0x1.ffffffffffffcp+50, PASSTHROUGH } },
> +  { .f = { -0x1.ffffffffffffcp+50, PASSTHROUGH } },
> +  { .f = { -0x1.ffffffffffffcp+50, PASSTHROUGH } },
> +  { .f = { -0x1.ffffffffffffcp+50, PASSTHROUGH } },
> +
> +  { .f = { -1.00, PASSTHROUGH } },
> +  { .f = {  0.00, PASSTHROUGH } },
> +  { .f = {  0.00, PASSTHROUGH } },
> +  { .f = {  0.00, PASSTHROUGH } }
> +};
> +
> +union value *answers[] = {
> +  answers_NEAREST_INT,
> +  answers_NEG_INF,
> +  answers_POS_INF,
> +  answers_ZERO,
> +  0 /* CUR_DIRECTION answers depend on current rounding mode.  */
> +};
> +
> +#include "sse4_1-round3.h"
> diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-roundss.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-roundss.c
> new file mode 100644
> index 000000000000..3154310314a1
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-roundss.c
> @@ -0,0 +1,208 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-options "-O2 -mvsx" } */
> +
> +#include <stdio.h>
> +#define NO_WARN_X86_INTRINSICS 1
> +#include <smmintrin.h>
> +
> +#define VEC_T __m128
> +#define FP_T float
> +
> +#define ROUND_INTRIN(x, y, mode) _mm_round_ss (x, y, mode)
> +
> +#include "sse4_1-round-data.h"
> +
> +static struct data2 data[] = {
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +    .value2 = { .f = {  0.00, IGNORED, IGNORED, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +    .value2 = { .f = {  0.25, IGNORED, IGNORED, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +    .value2 = { .f = {  0.50, IGNORED, IGNORED, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +    .value2 = { .f = {  0.75, IGNORED, IGNORED, IGNORED } } },
> +
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +    .value2 = { .f = {  0x1.fffff8p+21, IGNORED, IGNORED, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +    .value2 = { .f = {  0x1.fffffap+21, IGNORED, IGNORED, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +    .value2 = { .f = {  0x1.fffffcp+21, IGNORED, IGNORED, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +    .value2 = { .f = {  0x1.fffffep+21, IGNORED, IGNORED, IGNORED } } },
> +
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +    .value2 = { .f = {  0x1.fffffap+22, IGNORED, IGNORED, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +    .value2 = { .f = {  0x1.fffffcp+22, IGNORED, IGNORED, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +    .value2 = { .f = {  0x1.fffffep+22, IGNORED, IGNORED, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +    .value2 = { .f = {  0x1.fffffep+23, IGNORED, IGNORED, IGNORED } } },
> +
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +    .value2 = { .f = { -0x1.fffffep+23, IGNORED, IGNORED, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +    .value2 = { .f = { -0x1.fffffep+22, IGNORED, IGNORED, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +    .value2 = { .f = { -0x1.fffffcp+22, IGNORED, IGNORED, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +    .value2 = { .f = { -0x1.fffffap+22, IGNORED, IGNORED, IGNORED } } },
> +
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +    .value2 = { .f = { -0x1.fffffep+21, IGNORED, IGNORED, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +    .value2 = { .f = { -0x1.fffffcp+21, IGNORED, IGNORED, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +    .value2 = { .f = { -0x1.fffffap+21, IGNORED, IGNORED, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +    .value2 = { .f = { -0x1.fffff8p+21, IGNORED, IGNORED, IGNORED } } },
> +
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +    .value2 = { .f = { -1.00, IGNORED, IGNORED, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +    .value2 = { .f = { -0.75, IGNORED, IGNORED, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +    .value2 = { .f = { -0.50, IGNORED, IGNORED, IGNORED } } },
> +  { .value1 = { .f = { IGNORED, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +    .value2 = { .f = { -0.25, IGNORED, IGNORED, IGNORED } } }
> +};
> +
> +static union value answers_NEAREST_INT[] = {
> +  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  1.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +
> +  { .f = {  0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0x1.000000p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0x1.000000p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +
> +  { .f = {  0x1.fffff8p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0x1.fffffcp+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0x1.000000p+23, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0x1.fffffep+23, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +
> +  { .f = { -0x1.fffffep+23, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = { -0x1.000000p+23, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = { -0x1.fffffcp+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = { -0x1.fffff8p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +
> +  { .f = { -0x1.000000p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = { -0x1.000000p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = { -0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = { -0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +
> +  { .f = { -1.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = { -1.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = { -0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = { -0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } }
> +};
> +
> +static union value answers_NEG_INF[] = {
> +  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +
> +  { .f = {  0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +
> +  { .f = {  0x1.fffff8p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0x1.fffffcp+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0x1.fffffcp+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0x1.fffffep+23, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +
> +  { .f = { -0x1.fffffep+23, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = { -0x1.000000p+23, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = { -0x1.fffffcp+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = { -0x1.fffffcp+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +
> +  { .f = { -0x1.000000p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = { -0x1.000000p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = { -0x1.000000p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = { -0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +
> +  { .f = { -1.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = { -1.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = { -1.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = { -1.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } }
> +};
> +
> +static union value answers_POS_INF[] = {
> +  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  1.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  1.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  1.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +
> +  { .f = {  0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0x1.000000p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0x1.000000p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0x1.000000p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +
> +  { .f = {  0x1.fffffcp+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0x1.fffffcp+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0x1.000000p+23, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0x1.fffffep+23, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +
> +  { .f = { -0x1.fffffep+23, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = { -0x1.fffffcp+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = { -0x1.fffffcp+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = { -0x1.fffff8p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +
> +  { .f = { -0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = { -0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = { -0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = { -0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +
> +  { .f = { -1.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } }
> +};
> +
> +static union value answers_ZERO[] = {
> +  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +
> +  { .f = {  0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +
> +  { .f = {  0x1.fffff8p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0x1.fffffcp+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0x1.fffffcp+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0x1.fffffep+23, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +
> +  { .f = { -0x1.fffffep+23, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = { -0x1.fffffcp+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = { -0x1.fffffcp+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = { -0x1.fffff8p+22, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +
> +  { .f = { -0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = { -0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = { -0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = { -0x1.fffff8p+21, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +
> +  { .f = { -1.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } },
> +  { .f = {  0.00, PASSTHROUGH, PASSTHROUGH, PASSTHROUGH } }
> +};
> +
> +union value *answers[] = {
> +  answers_NEAREST_INT,
> +  answers_NEG_INF,
> +  answers_POS_INF,
> +  answers_ZERO,
> +  0 /* CUR_DIRECTION answers depend on current rounding mode.  */
> +};
> +
> +#include "sse4_1-round3.h"

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v3 1/6] rs6000: Support SSE4.1 "round" intrinsics
  2021-08-27 13:44   ` Bill Schmidt
@ 2021-08-27 13:47     ` Bill Schmidt
  2021-08-30 21:16     ` Paul A. Clarke
  1 sibling, 0 replies; 47+ messages in thread
From: Bill Schmidt @ 2021-08-27 13:47 UTC (permalink / raw)
  To: Paul A. Clarke, gcc-patches; +Cc: segher


On 8/27/21 8:44 AM, Bill Schmidt wrote:
>
> Again, please specify where the patch was tested and whether this is for
> trunk, backports, etc.  Thanks!  (I know you aren't asking for
> backports, but in general please get in the habit of this.)
>

Sorry, I see that you did this in the cover letter.  Never mind, sorry 
for the noise.

Bill



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v3 2/6] rs6000: Support SSE4.1 "min" and "max" intrinsics
  2021-08-23 19:03 ` [PATCH v3 2/6] rs6000: Support SSE4.1 "min" and "max" intrinsics Paul A. Clarke
@ 2021-08-27 13:47   ` Bill Schmidt
  2021-10-11 19:28   ` Segher Boessenkool
  1 sibling, 0 replies; 47+ messages in thread
From: Bill Schmidt @ 2021-08-27 13:47 UTC (permalink / raw)
  To: Paul A. Clarke, gcc-patches; +Cc: segher

Hi Paul,

This looks fine to me, recommend approval.

Thanks,
Bill

On 8/23/21 2:03 PM, Paul A. Clarke wrote:
> Function signatures and decorations match gcc/config/i386/smmintrin.h.
>
> Also, copy tests for _mm_min_epi8, _mm_min_epu16, _mm_min_epi32,
> _mm_min_epu32, _mm_max_epi8, _mm_max_epu16, _mm_max_epi32, _mm_max_epu32
> from gcc/testsuite/gcc.target/i386.
>
> sse4_1-pmaxsb.c and sse4_1-pminsb.c were modified to use
> "signed char" instead of plain "char", because plain "char" is
> unsigned by default on powerpc.
>
> 2021-08-20  Paul A. Clarke  <pc@us.ibm.com>
>
> gcc
> 	* config/rs6000/smmintrin.h (_mm_min_epi8, _mm_min_epu16,
> 	_mm_min_epi32, _mm_min_epu32, _mm_max_epi8, _mm_max_epu16,
> 	_mm_max_epi32, _mm_max_epu32): New.
>
> gcc/testsuite
> 	* gcc.target/powerpc/sse4_1-pmaxsb.c: Copy from gcc.target/i386.
> 	* gcc.target/powerpc/sse4_1-pmaxsd.c: Same.
> 	* gcc.target/powerpc/sse4_1-pmaxud.c: Same.
> 	* gcc.target/powerpc/sse4_1-pmaxuw.c: Same.
> 	* gcc.target/powerpc/sse4_1-pminsb.c: Same.
> 	* gcc.target/powerpc/sse4_1-pminsd.c: Same.
> 	* gcc.target/powerpc/sse4_1-pminud.c: Same.
> 	* gcc.target/powerpc/sse4_1-pminuw.c: Same.
> ---
> v3: No change.
> v2:
> - Added "extern" to functions to maintain compatible decorations with
>    like implementations in gcc/config/i386.
> - Removed "-Wno-psabi" from tests as unnecessary, per v1 review.
> - Noted testing in patch series cover letter.
>
>   gcc/config/rs6000/smmintrin.h                 | 56 +++++++++++++++++++
>   .../gcc.target/powerpc/sse4_1-pmaxsb.c        | 46 +++++++++++++++
>   .../gcc.target/powerpc/sse4_1-pmaxsd.c        | 46 +++++++++++++++
>   .../gcc.target/powerpc/sse4_1-pmaxud.c        | 47 ++++++++++++++++
>   .../gcc.target/powerpc/sse4_1-pmaxuw.c        | 47 ++++++++++++++++
>   .../gcc.target/powerpc/sse4_1-pminsb.c        | 46 +++++++++++++++
>   .../gcc.target/powerpc/sse4_1-pminsd.c        | 46 +++++++++++++++
>   .../gcc.target/powerpc/sse4_1-pminud.c        | 47 ++++++++++++++++
>   .../gcc.target/powerpc/sse4_1-pminuw.c        | 47 ++++++++++++++++
>   9 files changed, 428 insertions(+)
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxsb.c
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxsd.c
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxud.c
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxuw.c
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pminsb.c
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pminsd.c
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pminud.c
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pminuw.c
>
> diff --git a/gcc/config/rs6000/smmintrin.h b/gcc/config/rs6000/smmintrin.h
> index a6b88d313ad0..505fe4ce22a8 100644
> --- a/gcc/config/rs6000/smmintrin.h
> +++ b/gcc/config/rs6000/smmintrin.h
> @@ -408,6 +408,62 @@ _mm_test_mix_ones_zeros (__m128i __A, __m128i __mask)
>     return any_ones * any_zeros;
>   }
>   
> +extern __inline __m128i
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_min_epi8 (__m128i __X, __m128i __Y)
> +{
> +  return (__m128i) vec_min ((__v16qi)__X, (__v16qi)__Y);
> +}
> +
> +extern __inline __m128i
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_min_epu16 (__m128i __X, __m128i __Y)
> +{
> +  return (__m128i) vec_min ((__v8hu)__X, (__v8hu)__Y);
> +}
> +
> +extern __inline __m128i
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_min_epi32 (__m128i __X, __m128i __Y)
> +{
> +  return (__m128i) vec_min ((__v4si)__X, (__v4si)__Y);
> +}
> +
> +extern __inline __m128i
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_min_epu32 (__m128i __X, __m128i __Y)
> +{
> +  return (__m128i) vec_min ((__v4su)__X, (__v4su)__Y);
> +}
> +
> +extern __inline __m128i
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_max_epi8 (__m128i __X, __m128i __Y)
> +{
> +  return (__m128i) vec_max ((__v16qi)__X, (__v16qi)__Y);
> +}
> +
> +extern __inline __m128i
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_max_epu16 (__m128i __X, __m128i __Y)
> +{
> +  return (__m128i) vec_max ((__v8hu)__X, (__v8hu)__Y);
> +}
> +
> +extern __inline __m128i
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_max_epi32 (__m128i __X, __m128i __Y)
> +{
> +  return (__m128i) vec_max ((__v4si)__X, (__v4si)__Y);
> +}
> +
> +extern __inline __m128i
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_max_epu32 (__m128i __X, __m128i __Y)
> +{
> +  return (__m128i) vec_max ((__v4su)__X, (__v4su)__Y);
> +}
> +
>   /* Return horizontal packed word minimum and its index in bits [15:0]
>      and bits [18:16] respectively.  */
>   __inline __m128i
> diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxsb.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxsb.c
> new file mode 100644
> index 000000000000..7a465b01dd11
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxsb.c
> @@ -0,0 +1,46 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-options "-O2 -mvsx" } */
> +
> +#ifndef CHECK_H
> +#define CHECK_H "sse4_1-check.h"
> +#endif
> +
> +#ifndef TEST
> +#define TEST sse4_1_test
> +#endif
> +
> +#include CHECK_H
> +
> +#include <smmintrin.h>
> +
> +#define NUM 1024
> +
> +static void
> +TEST (void)
> +{
> +  union
> +    {
> +      __m128i x[NUM / 16];
> +      signed char i[NUM];
> +    } dst, src1, src2;
> +  int i, sign = 1;
> +  signed char max;
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      src1.i[i] = i * i * sign;
> +      src2.i[i] = (i + 20) * sign;
> +      sign = -sign;
> +    }
> +
> +  for (i = 0; i < NUM; i += 16)
> +    dst.x[i / 16] = _mm_max_epi8 (src1.x[i / 16], src2.x[i / 16]);
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      max = src1.i[i] <= src2.i[i] ? src2.i[i] : src1.i[i];
> +      if (max != dst.i[i])
> +	abort ();
> +    }
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxsd.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxsd.c
> new file mode 100644
> index 000000000000..d4947e9dae9a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxsd.c
> @@ -0,0 +1,46 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-options "-O2 -mvsx" } */
> +
> +#ifndef CHECK_H
> +#define CHECK_H "sse4_1-check.h"
> +#endif
> +
> +#ifndef TEST
> +#define TEST sse4_1_test
> +#endif
> +
> +#include CHECK_H
> +
> +#include <smmintrin.h>
> +
> +#define NUM 64
> +
> +static void
> +TEST (void)
> +{
> +  union
> +    {
> +      __m128i x[NUM / 4];
> +      int i[NUM];
> +    } dst, src1, src2;
> +  int i, sign = 1;
> +  int max;
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      src1.i[i] = i * i * sign;
> +      src2.i[i] = (i + 20) * sign;
> +      sign = -sign;
> +    }
> +
> +  for (i = 0; i < NUM; i += 4)
> +    dst.x[i / 4] = _mm_max_epi32 (src1.x[i / 4], src2.x[i / 4]);
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      max = src1.i[i] <= src2.i[i] ? src2.i[i] : src1.i[i];
> +      if (max != dst.i[i])
> +	abort ();
> +    }
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxud.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxud.c
> new file mode 100644
> index 000000000000..1407ebccacd3
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxud.c
> @@ -0,0 +1,47 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-options "-O2 -mvsx" } */
> +
> +#ifndef CHECK_H
> +#define CHECK_H "sse4_1-check.h"
> +#endif
> +
> +#ifndef TEST
> +#define TEST sse4_1_test
> +#endif
> +
> +#include CHECK_H
> +
> +#include <smmintrin.h>
> +
> +#define NUM 64
> +
> +static void
> +TEST (void)
> +{
> +  union
> +    {
> +      __m128i x[NUM / 4];
> +      unsigned int i[NUM];
> +    } dst, src1, src2;
> +  int i;
> +  unsigned int max;
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      src1.i[i] = i * i;
> +      src2.i[i] = i + 20;
> +      if ((i % 4))
> +	src2.i[i] |= 0x80000000;
> +    }
> +
> +  for (i = 0; i < NUM; i += 4)
> +    dst.x[i / 4] = _mm_max_epu32 (src1.x[i / 4], src2.x[i / 4]);
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      max = src1.i[i] <= src2.i[i] ? src2.i[i] : src1.i[i];
> +      if (max != dst.i[i])
> +	abort ();
> +    }
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxuw.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxuw.c
> new file mode 100644
> index 000000000000..73ead0e90683
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxuw.c
> @@ -0,0 +1,47 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-options "-O2 -mvsx" } */
> +
> +#ifndef CHECK_H
> +#define CHECK_H "sse4_1-check.h"
> +#endif
> +
> +#ifndef TEST
> +#define TEST sse4_1_test
> +#endif
> +
> +#include CHECK_H
> +
> +#include <smmintrin.h>
> +
> +#define NUM 64
> +
> +static void
> +TEST (void)
> +{
> +  union
> +    {
> +      __m128i x[NUM / 8];
> +      unsigned short i[NUM];
> +    } dst, src1, src2;
> +  int i;
> +  unsigned short max;
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      src1.i[i] = i * i;
> +      src2.i[i] = i + 20;
> +      if ((i % 8))
> +	src2.i[i] |= 0x8000;
> +    }
> +
> +  for (i = 0; i < NUM; i += 8)
> +    dst.x[i / 8] = _mm_max_epu16 (src1.x[i / 8], src2.x[i / 8]);
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      max = src1.i[i] <= src2.i[i] ? src2.i[i] : src1.i[i];
> +      if (max != dst.i[i])
> +	abort ();
> +    }
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pminsb.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pminsb.c
> new file mode 100644
> index 000000000000..bf491b7d363d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pminsb.c
> @@ -0,0 +1,46 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-options "-O2 -mvsx" } */
> +
> +#ifndef CHECK_H
> +#define CHECK_H "sse4_1-check.h"
> +#endif
> +
> +#ifndef TEST
> +#define TEST sse4_1_test
> +#endif
> +
> +#include CHECK_H
> +
> +#include <smmintrin.h>
> +
> +#define NUM 1024
> +
> +static void
> +TEST (void)
> +{
> +  union
> +    {
> +      __m128i x[NUM / 16];
> +      signed char i[NUM];
> +    } dst, src1, src2;
> +  int i, sign = 1;
> +  signed char min;
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      src1.i[i] = i * i * sign;
> +      src2.i[i] = (i + 20) * sign;
> +      sign = -sign;
> +    }
> +
> +  for (i = 0; i < NUM; i += 16)
> +    dst.x[i / 16] = _mm_min_epi8 (src1.x[i / 16], src2.x[i / 16]);
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      min = src1.i[i] >= src2.i[i] ? src2.i[i] : src1.i[i];
> +      if (min != dst.i[i])
> +	abort ();
> +    }
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pminsd.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pminsd.c
> new file mode 100644
> index 000000000000..6cb27556a3b0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pminsd.c
> @@ -0,0 +1,46 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-options "-O2 -mvsx" } */
> +
> +#ifndef CHECK_H
> +#define CHECK_H "sse4_1-check.h"
> +#endif
> +
> +#ifndef TEST
> +#define TEST sse4_1_test
> +#endif
> +
> +#include CHECK_H
> +
> +#include <smmintrin.h>
> +
> +#define NUM 64
> +
> +static void
> +TEST (void)
> +{
> +  union
> +    {
> +      __m128i x[NUM / 4];
> +      int i[NUM];
> +    } dst, src1, src2;
> +  int i, sign = 1;
> +  int min;
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      src1.i[i] = i * i * sign;
> +      src2.i[i] = (i + 20) * sign;
> +      sign = -sign;
> +    }
> +
> +  for (i = 0; i < NUM; i += 4)
> +    dst.x[i / 4] = _mm_min_epi32 (src1.x[i / 4], src2.x[i / 4]);
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      min = src1.i[i] >= src2.i[i] ? src2.i[i] : src1.i[i];
> +      if (min != dst.i[i])
> +	abort ();
> +    }
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pminud.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pminud.c
> new file mode 100644
> index 000000000000..afda4b906599
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pminud.c
> @@ -0,0 +1,47 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-options "-O2 -mvsx" } */
> +
> +#ifndef CHECK_H
> +#define CHECK_H "sse4_1-check.h"
> +#endif
> +
> +#ifndef TEST
> +#define TEST sse4_1_test
> +#endif
> +
> +#include CHECK_H
> +
> +#include <smmintrin.h>
> +
> +#define NUM 64
> +
> +static void
> +TEST (void)
> +{
> +  union
> +    {
> +      __m128i x[NUM / 4];
> +      unsigned int i[NUM];
> +    } dst, src1, src2;
> +  int i;
> +  unsigned int min;
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      src1.i[i] = i * i;
> +      src2.i[i] = i + 20;
> +      if ((i % 4))
> +	src2.i[i] |= 0x80000000;
> +    }
> +
> +  for (i = 0; i < NUM; i += 4)
> +    dst.x[i / 4] = _mm_min_epu32 (src1.x[i / 4], src2.x[i / 4]);
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      min = src1.i[i] >= src2.i[i] ? src2.i[i] : src1.i[i];
> +      if (min != dst.i[i])
> +	abort ();
> +    }
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pminuw.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pminuw.c
> new file mode 100644
> index 000000000000..25cc115285c6
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pminuw.c
> @@ -0,0 +1,47 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-options "-O2 -mvsx" } */
> +
> +#ifndef CHECK_H
> +#define CHECK_H "sse4_1-check.h"
> +#endif
> +
> +#ifndef TEST
> +#define TEST sse4_1_test
> +#endif
> +
> +#include CHECK_H
> +
> +#include <smmintrin.h>
> +
> +#define NUM 64
> +
> +static void
> +TEST (void)
> +{
> +  union
> +    {
> +      __m128i x[NUM / 8];
> +      unsigned short i[NUM];
> +    } dst, src1, src2;
> +  int i;
> +  unsigned short min;
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      src1.i[i] = i * i;
> +      src2.i[i] = i + 20;
> +      if ((i % 8))
> +	src2.i[i] |= 0x8000;
> +    }
> +
> +  for (i = 0; i < NUM; i += 8)
> +    dst.x[i / 8] = _mm_min_epu16 (src1.x[i / 8], src2.x[i / 8]);
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      min = src1.i[i] >= src2.i[i] ? src2.i[i] : src1.i[i];
> +      if (min != dst.i[i])
> +	abort ();
> +    }
> +}

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v3 3/6] rs6000: Simplify some SSE4.1 "test" intrinsics
  2021-08-23 19:03 ` [PATCH v3 3/6] rs6000: Simplify some SSE4.1 "test" intrinsics Paul A. Clarke
@ 2021-08-27 13:48   ` Bill Schmidt
  2021-10-11 20:50   ` Segher Boessenkool
  1 sibling, 0 replies; 47+ messages in thread
From: Bill Schmidt @ 2021-08-27 13:48 UTC (permalink / raw)
  To: Paul A. Clarke, gcc-patches; +Cc: segher

This looks fine, recommend approval.

Thanks!
Bill

On 8/23/21 2:03 PM, Paul A. Clarke wrote:
> Copy some simple redirections from i386 <smmintrin.h>, for:
> - _mm_test_all_zeros
> - _mm_test_all_ones
> - _mm_test_mix_ones_zeros
>
> 2021-08-20  Paul A. Clarke  <pc@us.ibm.com>
>
> gcc
> 	* config/rs6000/smmintrin.h (_mm_test_all_zeros,
> 	_mm_test_all_ones, _mm_test_mix_ones_zeros): Replace.
> ---
> v3: No change.
> v2:
> - Removed "-Wno-psabi" from tests as unnecessary, per v1 review.
> - Noted testing in patch series cover letter.
>
>   gcc/config/rs6000/smmintrin.h | 30 ++++--------------------------
>   1 file changed, 4 insertions(+), 26 deletions(-)
>
> diff --git a/gcc/config/rs6000/smmintrin.h b/gcc/config/rs6000/smmintrin.h
> index 505fe4ce22a8..363534cb06a2 100644
> --- a/gcc/config/rs6000/smmintrin.h
> +++ b/gcc/config/rs6000/smmintrin.h
> @@ -379,34 +379,12 @@ _mm_testnzc_si128 (__m128i __A, __m128i __B)
>     return _mm_testz_si128 (__A, __B) == 0 && _mm_testc_si128 (__A, __B) == 0;
>   }
>   
> -__inline int
> -__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> -_mm_test_all_zeros (__m128i __A, __m128i __mask)
> -{
> -  const __v16qu __zero = {0};
> -  return vec_all_eq (vec_and ((__v16qu) __A, (__v16qu) __mask), __zero);
> -}
> +#define _mm_test_all_zeros(M, V) _mm_testz_si128 ((M), (V))
>   
> -__inline int
> -__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> -_mm_test_all_ones (__m128i __A)
> -{
> -  const __v16qu __ones = vec_splats ((unsigned char) 0xff);
> -  return vec_all_eq ((__v16qu) __A, __ones);
> -}
> +#define _mm_test_all_ones(V) \
> +  _mm_testc_si128 ((V), _mm_cmpeq_epi32 ((V), (V)))
>   
> -__inline int
> -__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> -_mm_test_mix_ones_zeros (__m128i __A, __m128i __mask)
> -{
> -  const __v16qu __zero = {0};
> -  const __v16qu __Amasked = vec_and ((__v16qu) __A, (__v16qu) __mask);
> -  const int any_ones = vec_any_ne (__Amasked, __zero);
> -  const __v16qu __notA = vec_nor ((__v16qu) __A, (__v16qu) __A);
> -  const __v16qu __notAmasked = vec_and ((__v16qu) __notA, (__v16qu) __mask);
> -  const int any_zeros = vec_any_ne (__notAmasked, __zero);
> -  return any_ones * any_zeros;
> -}
> +#define _mm_test_mix_ones_zeros(M, V) _mm_testnzc_si128 ((M), (V))
>   
>   extern __inline __m128i
>   __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v3 4/6] rs6000: Support SSE4.1 "cvt" intrinsics
  2021-08-23 19:03 ` [PATCH v3 4/6] rs6000: Support SSE4.1 "cvt" intrinsics Paul A. Clarke
@ 2021-08-27 13:49   ` Bill Schmidt
  2021-10-11 21:52   ` Segher Boessenkool
  1 sibling, 0 replies; 47+ messages in thread
From: Bill Schmidt @ 2021-08-27 13:49 UTC (permalink / raw)
  To: Paul A. Clarke, gcc-patches; +Cc: segher

This looks fine, recommend approval.

Thanks!
Bill

On 8/23/21 2:03 PM, Paul A. Clarke wrote:
> Function signatures and decorations match gcc/config/i386/smmintrin.h.
>
> Also, copy tests for:
> - _mm_cvtepi8_epi16, _mm_cvtepi8_epi32, _mm_cvtepi8_epi64
> - _mm_cvtepi16_epi32, _mm_cvtepi16_epi64
> - _mm_cvtepi32_epi64,
> - _mm_cvtepu8_epi16, _mm_cvtepu8_epi32, _mm_cvtepu8_epi64
> - _mm_cvtepu16_epi32, _mm_cvtepu16_epi64
> - _mm_cvtepu32_epi64
>
> from gcc/testsuite/gcc.target/i386.
>
> sse4_1-pmovsxbd.c, sse4_1-pmovsxbq.c, and sse4_1-pmovsxbw.c were
> modified to use "signed char" instead of plain "char", because plain
> "char" is unsigned by default on powerpc.
>
> 2021-08-20  Paul A. Clarke  <pc@us.ibm.com>
>
> gcc
> 	* config/rs6000/smmintrin.h (_mm_cvtepi8_epi16, _mm_cvtepi8_epi32,
> 	_mm_cvtepi8_epi64, _mm_cvtepi16_epi32, _mm_cvtepi16_epi64,
> 	_mm_cvtepi32_epi64, _mm_cvtepu8_epi16, _mm_cvtepu8_epi32,
> 	_mm_cvtepu8_epi64, _mm_cvtepu16_epi32, _mm_cvtepu16_epi64,
> 	_mm_cvtepu32_epi64): New.
>
> gcc/testsuite
> 	* gcc.target/powerpc/sse4_1-pmovsxbd.c: Copy from gcc.target/i386,
> 	adjust dg directives to suit.
> 	* gcc.target/powerpc/sse4_1-pmovsxbq.c: Same.
> 	* gcc.target/powerpc/sse4_1-pmovsxbw.c: Same.
> 	* gcc.target/powerpc/sse4_1-pmovsxdq.c: Same.
> 	* gcc.target/powerpc/sse4_1-pmovsxwd.c: Same.
> 	* gcc.target/powerpc/sse4_1-pmovsxwq.c: Same.
> 	* gcc.target/powerpc/sse4_1-pmovzxbd.c: Same.
> 	* gcc.target/powerpc/sse4_1-pmovzxbq.c: Same.
> 	* gcc.target/powerpc/sse4_1-pmovzxbw.c: Same.
> 	* gcc.target/powerpc/sse4_1-pmovzxdq.c: Same.
> 	* gcc.target/powerpc/sse4_1-pmovzxwd.c: Same.
> 	* gcc.target/powerpc/sse4_1-pmovzxwq.c: Same.
> ---
> v3: No change.
> v2:
> - Added "extern" to functions to maintain compatible decorations with
>    like implementations in gcc/config/i386.
> - Removed "-Wno-psabi" from tests as unnecessary, per v1 review.
> - Noted testing in patch series cover letter.
>
>   gcc/config/rs6000/smmintrin.h                 | 138 ++++++++++++++++++
>   .../gcc.target/powerpc/sse4_1-pmovsxbd.c      |  42 ++++++
>   .../gcc.target/powerpc/sse4_1-pmovsxbq.c      |  42 ++++++
>   .../gcc.target/powerpc/sse4_1-pmovsxbw.c      |  42 ++++++
>   .../gcc.target/powerpc/sse4_1-pmovsxdq.c      |  42 ++++++
>   .../gcc.target/powerpc/sse4_1-pmovsxwd.c      |  42 ++++++
>   .../gcc.target/powerpc/sse4_1-pmovsxwq.c      |  42 ++++++
>   .../gcc.target/powerpc/sse4_1-pmovzxbd.c      |  43 ++++++
>   .../gcc.target/powerpc/sse4_1-pmovzxbq.c      |  43 ++++++
>   .../gcc.target/powerpc/sse4_1-pmovzxbw.c      |  43 ++++++
>   .../gcc.target/powerpc/sse4_1-pmovzxdq.c      |  43 ++++++
>   .../gcc.target/powerpc/sse4_1-pmovzxwd.c      |  43 ++++++
>   .../gcc.target/powerpc/sse4_1-pmovzxwq.c      |  43 ++++++
>   13 files changed, 648 insertions(+)
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbd.c
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbq.c
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbw.c
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxdq.c
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxwd.c
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxwq.c
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbd.c
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbq.c
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbw.c
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxdq.c
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxwd.c
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxwq.c
>
> diff --git a/gcc/config/rs6000/smmintrin.h b/gcc/config/rs6000/smmintrin.h
> index 363534cb06a2..fdef6674d16c 100644
> --- a/gcc/config/rs6000/smmintrin.h
> +++ b/gcc/config/rs6000/smmintrin.h
> @@ -442,6 +442,144 @@ _mm_max_epu32 (__m128i __X, __m128i __Y)
>     return (__m128i) vec_max ((__v4su)__X, (__v4su)__Y);
>   }
>   
> +extern __inline __m128i
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_cvtepi8_epi16 (__m128i __A)
> +{
> +  return (__m128i) vec_unpackh ((__v16qi)__A);
> +}
> +
> +extern __inline __m128i
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_cvtepi8_epi32 (__m128i __A)
> +{
> +  __A = (__m128i) vec_unpackh ((__v16qi)__A);
> +  return (__m128i) vec_unpackh ((__v8hi)__A);
> +}
> +
> +#ifdef _ARCH_PWR8
> +extern __inline __m128i
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_cvtepi8_epi64 (__m128i __A)
> +{
> +  __A = (__m128i) vec_unpackh ((__v16qi)__A);
> +  __A = (__m128i) vec_unpackh ((__v8hi)__A);
> +  return (__m128i) vec_unpackh ((__v4si)__A);
> +}
> +#endif
> +
> +extern __inline __m128i
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_cvtepi16_epi32 (__m128i __A)
> +{
> +  return (__m128i) vec_unpackh ((__v8hi)__A);
> +}
> +
> +#ifdef _ARCH_PWR8
> +extern __inline __m128i
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_cvtepi16_epi64 (__m128i __A)
> +{
> +  __A = (__m128i) vec_unpackh ((__v8hi)__A);
> +  return (__m128i) vec_unpackh ((__v4si)__A);
> +}
> +#endif
> +
> +#ifdef _ARCH_PWR8
> +extern __inline __m128i
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_cvtepi32_epi64 (__m128i __A)
> +{
> +  return (__m128i) vec_unpackh ((__v4si)__A);
> +}
> +#endif
> +
> +extern __inline __m128i
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_cvtepu8_epi16 (__m128i __A)
> +{
> +  const __v16qu __zero = {0};
> +#ifdef __LITTLE_ENDIAN__
> +  __A = (__m128i) vec_mergeh ((__v16qu)__A, __zero);
> +#else /* __BIG_ENDIAN__.  */
> +  __A = (__m128i) vec_mergeh (__zero, (__v16qu)__A);
> +#endif /* __BIG_ENDIAN__.  */
> +  return __A;
> +}
> +
> +extern __inline __m128i
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_cvtepu8_epi32 (__m128i __A)
> +{
> +  const __v16qu __zero = {0};
> +#ifdef __LITTLE_ENDIAN__
> +  __A = (__m128i) vec_mergeh ((__v16qu)__A, __zero);
> +  __A = (__m128i) vec_mergeh ((__v8hu)__A, (__v8hu)__zero);
> +#else /* __BIG_ENDIAN__.  */
> +  __A = (__m128i) vec_mergeh (__zero, (__v16qu)__A);
> +  __A = (__m128i) vec_mergeh ((__v8hu)__zero, (__v8hu)__A);
> +#endif /* __BIG_ENDIAN__.  */
> +  return __A;
> +}
> +
> +extern __inline __m128i
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_cvtepu8_epi64 (__m128i __A)
> +{
> +  const __v16qu __zero = {0};
> +#ifdef __LITTLE_ENDIAN__
> +  __A = (__m128i) vec_mergeh ((__v16qu)__A, __zero);
> +  __A = (__m128i) vec_mergeh ((__v8hu)__A, (__v8hu)__zero);
> +  __A = (__m128i) vec_mergeh ((__v4su)__A, (__v4su)__zero);
> +#else /* __BIG_ENDIAN__.  */
> +  __A = (__m128i) vec_mergeh (__zero, (__v16qu)__A);
> +  __A = (__m128i) vec_mergeh ((__v8hu)__zero, (__v8hu)__A);
> +  __A = (__m128i) vec_mergeh ((__v4su)__zero, (__v4su)__A);
> +#endif /* __BIG_ENDIAN__.  */
> +  return __A;
> +}
> +
> +extern __inline __m128i
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_cvtepu16_epi32 (__m128i __A)
> +{
> +  const __v8hu __zero = {0};
> +#ifdef __LITTLE_ENDIAN__
> +  __A = (__m128i) vec_mergeh ((__v8hu)__A, __zero);
> +#else /* __BIG_ENDIAN__.  */
> +  __A = (__m128i) vec_mergeh (__zero, (__v8hu)__A);
> +#endif /* __BIG_ENDIAN__.  */
> +  return __A;
> +}
> +
> +extern __inline __m128i
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_cvtepu16_epi64 (__m128i __A)
> +{
> +  const __v8hu __zero = {0};
> +#ifdef __LITTLE_ENDIAN__
> +  __A = (__m128i) vec_mergeh ((__v8hu)__A, __zero);
> +  __A = (__m128i) vec_mergeh ((__v4su)__A, (__v4su)__zero);
> +#else /* __BIG_ENDIAN__.  */
> +  __A = (__m128i) vec_mergeh (__zero, (__v8hu)__A);
> +  __A = (__m128i) vec_mergeh ((__v4su)__zero, (__v4su)__A);
> +#endif /* __BIG_ENDIAN__.  */
> +  return __A;
> +}
> +
> +extern __inline __m128i
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_cvtepu32_epi64 (__m128i __A)
> +{
> +  const __v4su __zero = {0};
> +#ifdef __LITTLE_ENDIAN__
> +  __A = (__m128i) vec_mergeh ((__v4su)__A, __zero);
> +#else /* __BIG_ENDIAN__.  */
> +  __A = (__m128i) vec_mergeh (__zero, (__v4su)__A);
> +#endif /* __BIG_ENDIAN__.  */
> +  return __A;
> +}
> +
>   /* Return horizontal packed word minimum and its index in bits [15:0]
>      and bits [18:16] respectively.  */
>   __inline __m128i
> diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbd.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbd.c
> new file mode 100644
> index 000000000000..553c8dd84505
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbd.c
> @@ -0,0 +1,42 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-options "-O2 -mvsx" } */
> +
> +#ifndef CHECK_H
> +#define CHECK_H "sse4_1-check.h"
> +#endif
> +
> +#ifndef TEST
> +#define TEST sse4_1_test
> +#endif
> +
> +#include CHECK_H
> +
> +#include <smmintrin.h>
> +
> +#define NUM 128
> +
> +static void
> +TEST (void)
> +{
> +  union
> +    {
> +      __m128i x[NUM / 4];
> +      int i[NUM];
> +      signed char c[NUM * 4];
> +    } dst, src;
> +  int i, sign = 1;
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      src.c[(i % 4) + (i / 4) * 16] = i * i * sign;
> +      sign = -sign;
> +    }
> +
> +  for (i = 0; i < NUM; i += 4)
> +    dst.x [i / 4] = _mm_cvtepi8_epi32 (src.x [i / 4]);
> +
> +  for (i = 0; i < NUM; i++)
> +    if (src.c[(i % 4) + (i / 4) * 16] != dst.i[i])
> +      abort ();
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbq.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbq.c
> new file mode 100644
> index 000000000000..9ec1ab7a4169
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbq.c
> @@ -0,0 +1,42 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target p8vector_hw } */
> +/* { dg-options "-O2 -mpower8-vector" } */
> +
> +#ifndef CHECK_H
> +#define CHECK_H "sse4_1-check.h"
> +#endif
> +
> +#ifndef TEST
> +#define TEST sse4_1_test
> +#endif
> +
> +#include CHECK_H
> +
> +#include <smmintrin.h>
> +
> +#define NUM 128
> +
> +static void
> +TEST (void)
> +{
> +  union
> +    {
> +      __m128i x[NUM / 2];
> +      long long ll[NUM];
> +      signed char c[NUM * 8];
> +    } dst, src;
> +  int i, sign = 1;
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      src.c[(i % 2) + (i / 2) * 16] = i * i * sign;
> +      sign = -sign;
> +    }
> +
> +  for (i = 0; i < NUM; i += 2)
> +    dst.x [i / 2] = _mm_cvtepi8_epi64 (src.x [i / 2]);
> +
> +  for (i = 0; i < NUM; i++)
> +    if (src.c[(i % 2) + (i / 2) * 16] != dst.ll[i])
> +      abort ();
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbw.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbw.c
> new file mode 100644
> index 000000000000..be4cf417ca7e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbw.c
> @@ -0,0 +1,42 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-options "-O2 -mvsx" } */
> +
> +#ifndef CHECK_H
> +#define CHECK_H "sse4_1-check.h"
> +#endif
> +
> +#ifndef TEST
> +#define TEST sse4_1_test
> +#endif
> +
> +#include CHECK_H
> +
> +#include <smmintrin.h>
> +
> +#define NUM 128
> +
> +static void
> +TEST (void)
> +{
> +  union
> +    {
> +      __m128i x[NUM / 8];
> +      short s[NUM];
> +      signed char c[NUM * 2];
> +    } dst, src;
> +  int i, sign = 1;
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      src.c[(i % 8) + (i / 8) * 16] = i * i * sign;
> +      sign = -sign;
> +    }
> +
> +  for (i = 0; i < NUM; i += 8)
> +    dst.x [i / 8] = _mm_cvtepi8_epi16 (src.x [i / 8]);
> +
> +  for (i = 0; i < NUM; i++)
> +    if (src.c[(i % 8) + (i / 8) * 16] != dst.s[i])
> +      abort ();
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxdq.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxdq.c
> new file mode 100644
> index 000000000000..1c263782240a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxdq.c
> @@ -0,0 +1,42 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target p8vector_hw } */
> +/* { dg-options "-O2 -mpower8-vector" } */
> +
> +#ifndef CHECK_H
> +#define CHECK_H "sse4_1-check.h"
> +#endif
> +
> +#ifndef TEST
> +#define TEST sse4_1_test
> +#endif
> +
> +#include CHECK_H
> +
> +#include <smmintrin.h>
> +
> +#define NUM 128
> +
> +static void
> +TEST (void)
> +{
> +  union
> +    {
> +      __m128i x[NUM / 2];
> +      long long ll[NUM];
> +      int i[NUM * 2];
> +    } dst, src;
> +  int i, sign = 1;
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      src.i[(i % 2) + (i / 2) * 4] = i * i * sign;
> +      sign = -sign;
> +    }
> +
> +  for (i = 0; i < NUM; i += 2)
> +    dst.x [i / 2] = _mm_cvtepi32_epi64 (src.x [i / 2]);
> +
> +  for (i = 0; i < NUM; i++)
> +    if (src.i[(i % 2) + (i / 2) * 4] != dst.ll[i])
> +      abort ();
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxwd.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxwd.c
> new file mode 100644
> index 000000000000..f0f31aba44ba
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxwd.c
> @@ -0,0 +1,42 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-options "-O2 -mvsx" } */
> +
> +#ifndef CHECK_H
> +#define CHECK_H "sse4_1-check.h"
> +#endif
> +
> +#ifndef TEST
> +#define TEST sse4_1_test
> +#endif
> +
> +#include CHECK_H
> +
> +#include <smmintrin.h>
> +
> +#define NUM 128
> +
> +static void
> +TEST (void)
> +{
> +  union
> +    {
> +      __m128i x[NUM / 4];
> +      int i[NUM];
> +      short s[NUM * 2];
> +    } dst, src;
> +  int i, sign = 1;
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      src.s[(i % 4) + (i / 4) * 8] = i * i * sign;
> +      sign = -sign;
> +    }
> +
> +  for (i = 0; i < NUM; i += 4)
> +    dst.x [i / 4] = _mm_cvtepi16_epi32 (src.x [i / 4]);
> +
> +  for (i = 0; i < NUM; i++)
> +    if (src.s[(i % 4) + (i / 4) * 8] != dst.i[i])
> +      abort ();
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxwq.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxwq.c
> new file mode 100644
> index 000000000000..67864695a113
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxwq.c
> @@ -0,0 +1,42 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target p8vector_hw } */
> +/* { dg-options "-O2 -mpower8-vector" } */
> +
> +#ifndef CHECK_H
> +#define CHECK_H "sse4_1-check.h"
> +#endif
> +
> +#ifndef TEST
> +#define TEST sse4_1_test
> +#endif
> +
> +#include CHECK_H
> +
> +#include <smmintrin.h>
> +
> +#define NUM 128
> +
> +static void
> +TEST (void)
> +{
> +  union
> +    {
> +      __m128i x[NUM / 2];
> +      long long ll[NUM];
> +      short s[NUM * 4];
> +    } dst, src;
> +  int i, sign = 1;
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      src.s[(i % 2) + (i / 2) * 8] = i * i * sign;
> +      sign = -sign;
> +    }
> +
> +  for (i = 0; i < NUM; i += 2)
> +    dst.x [i / 2] = _mm_cvtepi16_epi64 (src.x [i / 2]);
> +
> +  for (i = 0; i < NUM; i++)
> +    if (src.s[(i % 2) + (i / 2) * 8] != dst.ll[i])
> +      abort ();
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbd.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbd.c
> new file mode 100644
> index 000000000000..098ef6a49cb0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbd.c
> @@ -0,0 +1,43 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-options "-O2 -mvsx" } */
> +
> +#ifndef CHECK_H
> +#define CHECK_H "sse4_1-check.h"
> +#endif
> +
> +#ifndef TEST
> +#define TEST sse4_1_test
> +#endif
> +
> +#include CHECK_H
> +
> +#include <smmintrin.h>
> +
> +#define NUM 128
> +
> +static void
> +TEST (void)
> +{
> +  union
> +    {
> +      __m128i x[NUM / 4];
> +      unsigned int i[NUM];
> +      unsigned char c[NUM * 4];
> +    } dst, src;
> +  int i;
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      src.c[(i % 4) + (i / 4) * 16] = i * i;
> +      if ((i % 4))
> +	src.c[(i % 4) + (i / 4) * 16] |= 0x80;
> +    }
> +
> +  for (i = 0; i < NUM; i += 4)
> +    dst.x [i / 4] = _mm_cvtepu8_epi32 (src.x [i / 4]);
> +
> +  for (i = 0; i < NUM; i++)
> +    if (src.c[(i % 4) + (i / 4) * 16] != dst.i[i])
> +      abort ();
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbq.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbq.c
> new file mode 100644
> index 000000000000..7b862767436e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbq.c
> @@ -0,0 +1,43 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-options "-O2 -mvsx" } */
> +
> +#ifndef CHECK_H
> +#define CHECK_H "sse4_1-check.h"
> +#endif
> +
> +#ifndef TEST
> +#define TEST sse4_1_test
> +#endif
> +
> +#include CHECK_H
> +
> +#include <smmintrin.h>
> +
> +#define NUM 128
> +
> +static void
> +TEST (void)
> +{
> +  union
> +    {
> +      __m128i x[NUM / 2];
> +      unsigned long long ll[NUM];
> +      unsigned char c[NUM * 8];
> +    } dst, src;
> +  int i;
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      src.c[(i % 2) + (i / 2) * 16] = i * i;
> +      if ((i % 2))
> +	src.c[(i % 2) + (i / 2) * 16] |= 0x80;
> +    }
> +
> +  for (i = 0; i < NUM; i += 2)
> +    dst.x [i / 2] = _mm_cvtepu8_epi64 (src.x [i / 2]);
> +
> +  for (i = 0; i < NUM; i++)
> +    if (src.c[(i % 2) + (i / 2) * 16] != dst.ll[i])
> +      abort ();
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbw.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbw.c
> new file mode 100644
> index 000000000000..9fdbec342d46
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbw.c
> @@ -0,0 +1,43 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-options "-O2 -mvsx" } */
> +
> +#ifndef CHECK_H
> +#define CHECK_H "sse4_1-check.h"
> +#endif
> +
> +#ifndef TEST
> +#define TEST sse4_1_test
> +#endif
> +
> +#include CHECK_H
> +
> +#include <smmintrin.h>
> +
> +#define NUM 128
> +
> +static void
> +TEST (void)
> +{
> +  union
> +    {
> +      __m128i x[NUM / 8];
> +      unsigned short s[NUM];
> +      unsigned char c[NUM * 2];
> +    } dst, src;
> +  int i;
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      src.c[(i % 8) + (i / 8) * 16] = i * i;
> +      if ((i % 4))
> +	src.c[(i % 8) + (i / 8) * 16] |= 0x80;
> +    }
> +
> +  for (i = 0; i < NUM; i += 8)
> +    dst.x [i / 8] = _mm_cvtepu8_epi16 (src.x [i / 8]);
> +
> +  for (i = 0; i < NUM; i++)
> +    if (src.c[(i % 8) + (i / 8) * 16] != dst.s[i])
> +      abort ();
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxdq.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxdq.c
> new file mode 100644
> index 000000000000..7a5e7688d9f5
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxdq.c
> @@ -0,0 +1,43 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-options "-O2 -mvsx" } */
> +
> +#ifndef CHECK_H
> +#define CHECK_H "sse4_1-check.h"
> +#endif
> +
> +#ifndef TEST
> +#define TEST sse4_1_test
> +#endif
> +
> +#include CHECK_H
> +
> +#include <smmintrin.h>
> +
> +#define NUM 128
> +
> +static void
> +TEST (void)
> +{
> +  union
> +    {
> +      __m128i x[NUM / 2];
> +      unsigned long long ll[NUM];
> +      unsigned int i[NUM * 2];
> +    } dst, src;
> +  int i;
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      src.i[(i % 2) + (i / 2) * 4] = i * i;
> +      if ((i % 2))
> +        src.i[(i % 2) + (i / 2) * 4] |= 0x80000000;
> +    }
> +
> +  for (i = 0; i < NUM; i += 2)
> +    dst.x [i / 2] = _mm_cvtepu32_epi64 (src.x [i / 2]);
> +
> +  for (i = 0; i < NUM; i++)
> +    if (src.i[(i % 2) + (i / 2) * 4] != dst.ll[i])
> +      abort ();
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxwd.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxwd.c
> new file mode 100644
> index 000000000000..078a5a45d909
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxwd.c
> @@ -0,0 +1,43 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-options "-O2 -mvsx" } */
> +
> +#ifndef CHECK_H
> +#define CHECK_H "sse4_1-check.h"
> +#endif
> +
> +#ifndef TEST
> +#define TEST sse4_1_test
> +#endif
> +
> +#include CHECK_H
> +
> +#include <smmintrin.h>
> +
> +#define NUM 128
> +
> +static void
> +TEST (void)
> +{
> +  union
> +    {
> +      __m128i x[NUM / 4];
> +      unsigned int i[NUM];
> +      unsigned short s[NUM * 2];
> +    } dst, src;
> +  int i;
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      src.s[(i % 4) + (i / 4) * 8] = i * i;
> +      if ((i % 4))
> +	src.s[(i % 4) + (i / 4) * 8] |= 0x8000;
> +    }
> +
> +  for (i = 0; i < NUM; i += 4)
> +    dst.x [i / 4] = _mm_cvtepu16_epi32 (src.x [i / 4]);
> +
> +  for (i = 0; i < NUM; i++)
> +    if (src.s[(i % 4) + (i / 4) * 8] != dst.i[i])
> +      abort ();
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxwq.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxwq.c
> new file mode 100644
> index 000000000000..120d00290faa
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxwq.c
> @@ -0,0 +1,43 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-options "-O2 -mvsx" } */
> +
> +#ifndef CHECK_H
> +#define CHECK_H "sse4_1-check.h"
> +#endif
> +
> +#ifndef TEST
> +#define TEST sse4_1_test
> +#endif
> +
> +#include CHECK_H
> +
> +#include <smmintrin.h>
> +
> +#define NUM 128
> +
> +static void
> +TEST (void)
> +{
> +  union
> +    {
> +      __m128i x[NUM / 2];
> +      unsigned long long ll[NUM];
> +      unsigned short s[NUM * 4];
> +    } dst, src;
> +  int i;
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      src.s[(i % 2) + (i / 2) * 8] = i * i;
> +      if ((i % 2))
> +	src.s[(i % 2) + (i / 2) * 8] |= 0x8000;
> +    }
> +
> +  for (i = 0; i < NUM; i += 2)
> +    dst.x [i / 2] = _mm_cvtepu16_epi64 (src.x [i / 2]);
> +
> +  for (i = 0; i < NUM; i++)
> +    if (src.s[(i % 2) + (i / 2) * 8] != dst.ll[i])
> +      abort ();
> +}

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v3 5/6] rs6000: Support more SSE4 "cmp", "mul", "pack" intrinsics
  2021-08-23 19:03 ` [PATCH v3 5/6] rs6000: Support more SSE4 "cmp", "mul", "pack" intrinsics Paul A. Clarke
@ 2021-08-27 15:21   ` Bill Schmidt
  2021-08-27 18:52     ` Paul A. Clarke
  2021-10-11 23:07   ` Segher Boessenkool
  1 sibling, 1 reply; 47+ messages in thread
From: Bill Schmidt @ 2021-08-27 15:21 UTC (permalink / raw)
  To: Paul A. Clarke, gcc-patches; +Cc: segher

Hi Paul,

On 8/23/21 2:03 PM, Paul A. Clarke wrote:
> Function signatures and decorations match gcc/config/i386/smmintrin.h.
>
> Also, copy tests for:
> - _mm_cmpeq_epi64
> - _mm_mullo_epi32, _mm_mul_epi32
> - _mm_packus_epi32
> - _mm_cmpgt_epi64 (SSE4.2)
>
> from gcc/testsuite/gcc.target/i386.
>
> 2021-08-23  Paul A. Clarke  <pc@us.ibm.com>
>
> gcc
> 	* config/rs6000/smmintrin.h (_mm_cmpeq_epi64, _mm_cmpgt_epi64,
> 	_mm_mullo_epi32, _mm_mul_epi32, _mm_packus_epi32): New.
> 	* config/rs6000/nmmintrin.h: Copy from i386, tweak to suit.
>
> gcc/testsuite
> 	* gcc.target/powerpc/pr78102.c: Copy from gcc.target/i386,
> 	adjust dg directives to suit.
> 	* gcc.target/powerpc/sse4_1-packusdw.c: Same.
> 	* gcc.target/powerpc/sse4_1-pcmpeqq.c: Same.
> 	* gcc.target/powerpc/sse4_1-pmuldq.c: Same.
> 	* gcc.target/powerpc/sse4_1-pmulld.c: Same.
> 	* gcc.target/powerpc/sse4_2-pcmpgtq.c: Same.
> 	* gcc.target/powerpc/sse4_2-check.h: Copy from gcc.target/i386,
> 	tweak to suit.
> ---
> v3:
> - Add nmmintrin.h. _mm_cmpgt_epi64 is part of SSE4.2, which is
>    ostensibly defined in nmmintrin.h. Following the i386 implementation,
>    however, nmmintrin.h only includes smmintrin.h, and the actual
>    implementations appear there.
> - Add sse4_2-check.h, required by sse4_2-pcmpgtq.c. My testing was
>    obviously inadequate.
> v2:
> - Added "extern" to functions to maintain compatible decorations with
>    like implementations in gcc/config/i386.
> - Removed "-Wno-psabi" from tests as unnecessary, per v1 review.
> - Noted testing in patch series cover letter.
>
>   gcc/config/rs6000/nmmintrin.h                 | 40 ++++++++++
>   gcc/config/rs6000/smmintrin.h                 | 41 +++++++++++
>   gcc/testsuite/gcc.target/powerpc/pr78102.c    | 23 ++++++
>   .../gcc.target/powerpc/sse4_1-packusdw.c      | 73 +++++++++++++++++++
>   .../gcc.target/powerpc/sse4_1-pcmpeqq.c       | 46 ++++++++++++
>   .../gcc.target/powerpc/sse4_1-pmuldq.c        | 51 +++++++++++++
>   .../gcc.target/powerpc/sse4_1-pmulld.c        | 46 ++++++++++++
>   .../gcc.target/powerpc/sse4_2-check.h         | 18 +++++
>   .../gcc.target/powerpc/sse4_2-pcmpgtq.c       | 46 ++++++++++++
>   9 files changed, 384 insertions(+)
>   create mode 100644 gcc/config/rs6000/nmmintrin.h
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/pr78102.c
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-packusdw.c
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pcmpeqq.c
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmuldq.c
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmulld.c
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_2-check.h
>   create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_2-pcmpgtq.c
>
> diff --git a/gcc/config/rs6000/nmmintrin.h b/gcc/config/rs6000/nmmintrin.h
> new file mode 100644
> index 000000000000..20a70bee3776
> --- /dev/null
> +++ b/gcc/config/rs6000/nmmintrin.h
> @@ -0,0 +1,40 @@
> +/* Copyright (C) 2021 Free Software Foundation, Inc.
> +
> +   This file is part of GCC.
> +
> +   GCC is free software; you can redistribute it and/or modify
> +   it under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3, or (at your option)
> +   any later version.
> +
> +   GCC is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +   GNU General Public License for more details.
> +
> +   Under Section 7 of GPL version 3, you are granted additional
> +   permissions described in the GCC Runtime Library Exception, version
> +   3.1, as published by the Free Software Foundation.
> +
> +   You should have received a copy of the GNU General Public License and
> +   a copy of the GCC Runtime Library Exception along with this program;
> +   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#ifndef NO_WARN_X86_INTRINSICS
> +/* This header is distributed to simplify porting x86_64 code that
> +   makes explicit use of Intel intrinsics to powerpc64le.
> +   It is the user's responsibility to determine if the results are
> +   acceptable and make additional changes as necessary.
> +   Note that much code that uses Intel intrinsics can be rewritten in
> +   standard C or GNU C extensions, which are more portable and better
> +   optimized across multiple targets.  */
> +#endif
> +
> +#ifndef _NMMINTRIN_H_INCLUDED
> +#define _NMMINTRIN_H_INCLUDED
> +
> +/* We just include SSE4.1 header file.  */
> +#include <smmintrin.h>
> +
> +#endif /* _NMMINTRIN_H_INCLUDED */

Should there be something in here indicating that nmmintrin.h is for SSE 
4.2?  Otherwise it's a bit of a head-scratcher to a new person wondering 
why this file exists.  No big deal either way.

This looks fine to me with or without that.  Recommend approval.

Thanks!
Bill

> diff --git a/gcc/config/rs6000/smmintrin.h b/gcc/config/rs6000/smmintrin.h
> index fdef6674d16c..c04d2bb5b6d3 100644
> --- a/gcc/config/rs6000/smmintrin.h
> +++ b/gcc/config/rs6000/smmintrin.h
> @@ -386,6 +386,15 @@ _mm_testnzc_si128 (__m128i __A, __m128i __B)
>   
>   #define _mm_test_mix_ones_zeros(M, V) _mm_testnzc_si128 ((M), (V))
>   
> +#ifdef _ARCH_PWR8
> +extern __inline __m128i
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_cmpeq_epi64 (__m128i __X, __m128i __Y)
> +{
> +  return (__m128i) vec_cmpeq ((__v2di)__X, (__v2di)__Y);
> +}
> +#endif
> +
>   extern __inline __m128i
>   __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
>   _mm_min_epi8 (__m128i __X, __m128i __Y)
> @@ -444,6 +453,22 @@ _mm_max_epu32 (__m128i __X, __m128i __Y)
>   
>   extern __inline __m128i
>   __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_mullo_epi32 (__m128i __X, __m128i __Y)
> +{
> +  return (__m128i) vec_mul ((__v4su)__X, (__v4su)__Y);
> +}
> +
> +#ifdef _ARCH_PWR8
> +__inline __m128i
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_mul_epi32 (__m128i __X, __m128i __Y)
> +{
> +  return (__m128i) vec_mule ((__v4si)__X, (__v4si)__Y);
> +}
> +#endif
> +
> +__inline __m128i
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
>   _mm_cvtepi8_epi16 (__m128i __A)
>   {
>     return (__m128i) vec_unpackh ((__v16qi)__A);
> @@ -607,4 +632,20 @@ _mm_minpos_epu16 (__m128i __A)
>     return __r.__m;
>   }
>   
> +__inline __m128i
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_packus_epi32 (__m128i __X, __m128i __Y)
> +{
> +  return (__m128i) vec_packsu ((__v4si)__X, (__v4si)__Y);
> +}
> +
> +#ifdef _ARCH_PWR8
> +__inline __m128i
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_cmpgt_epi64 (__m128i __X, __m128i __Y)
> +{
> +  return (__m128i) vec_cmpgt ((__v2di)__X, (__v2di)__Y);
> +}
> +#endif
> +
>   #endif
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr78102.c b/gcc/testsuite/gcc.target/powerpc/pr78102.c
> new file mode 100644
> index 000000000000..56a2d497bbff
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr78102.c
> @@ -0,0 +1,23 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mvsx" } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +
> +#include <x86intrin.h>
> +
> +__m128i
> +foo (const __m128i x, const __m128i y)
> +{
> +  return _mm_cmpeq_epi64 (x, y);
> +}
> +
> +__v2di
> +bar (const __v2di x, const __v2di y)
> +{
> +  return x == y;
> +}
> +
> +__v2di
> +baz (const __v2di x, const __v2di y)
> +{
> +  return x != y;
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-packusdw.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-packusdw.c
> new file mode 100644
> index 000000000000..15b8ca418f54
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-packusdw.c
> @@ -0,0 +1,73 @@
> +/* { dg-do run } */
> +/* { dg-options "-O2 -mvsx" } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +
> +#ifndef CHECK_H
> +#define CHECK_H "sse4_1-check.h"
> +#endif
> +
> +#ifndef TEST
> +#define TEST sse4_1_test
> +#endif
> +
> +#include CHECK_H
> +
> +#include <smmintrin.h>
> +
> +#define NUM 64
> +
> +static unsigned short
> +int_to_ushort (int iVal)
> +{
> +  unsigned short sVal;
> +
> +  if (iVal < 0)
> +    sVal = 0;
> +  else if (iVal > 0xffff)
> +    sVal = 0xffff;
> +  else sVal = iVal;
> +
> +  return sVal;
> +}
> +
> +static void
> +TEST (void)
> +{
> +  union
> +    {
> +      __m128i x[NUM / 4];
> +      int i[NUM];
> +    } src1, src2;
> +  union
> +    {
> +      __m128i x[NUM / 4];
> +      unsigned short s[NUM * 2];
> +    } dst;
> +  int i, sign = 1;
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      src1.i[i] = i * i * sign;
> +      src2.i[i] = (i + 20) * sign;
> +      sign = -sign;
> +    }
> +
> +  for (i = 0; i < NUM; i += 4)
> +    dst.x[i / 4] = _mm_packus_epi32 (src1.x [i / 4], src2.x [i / 4]);
> +
> +  for (i = 0; i < NUM; i ++)
> +    {
> +      int dstIndex;
> +      unsigned short sVal;
> +
> +      sVal = int_to_ushort (src1.i[i]);
> +      dstIndex = (i % 4) + (i / 4) * 8;
> +      if (sVal != dst.s[dstIndex])
> +	abort ();
> +
> +      sVal = int_to_ushort (src2.i[i]);
> +      dstIndex += 4;
> +      if (sVal != dst.s[dstIndex])
> +	abort ();
> +    }
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pcmpeqq.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pcmpeqq.c
> new file mode 100644
> index 000000000000..39b9f01d64a4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pcmpeqq.c
> @@ -0,0 +1,46 @@
> +/* { dg-do run } */
> +/* { dg-options "-O2 -mpower8-vector" } */
> +/* { dg-require-effective-target p8vector_hw } */
> +
> +#ifndef CHECK_H
> +#define CHECK_H "sse4_1-check.h"
> +#endif
> +
> +#ifndef TEST
> +#define TEST sse4_1_test
> +#endif
> +
> +#include CHECK_H
> +
> +#include <smmintrin.h>
> +
> +#define NUM 64
> +
> +static void
> +TEST (void)
> +{
> +  union
> +    {
> +      __m128i x[NUM / 2];
> +      long long ll[NUM];
> +    } dst, src1, src2;
> +  int i, sign=1;
> +  long long is_eq;
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      src1.ll[i] = i * i * sign;
> +      src2.ll[i] = (i + 20) * sign;
> +      sign = -sign;
> +    }
> +
> +  for (i = 0; i < NUM; i += 2)
> +    dst.x [i / 2] = _mm_cmpeq_epi64(src1.x [i / 2], src2.x [i / 2]);
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      is_eq = src1.ll[i] == src2.ll[i] ? 0xffffffffffffffffLL : 0LL;
> +      if (is_eq != dst.ll[i])
> +	abort ();
> +    }
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmuldq.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmuldq.c
> new file mode 100644
> index 000000000000..6a884f46235f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmuldq.c
> @@ -0,0 +1,51 @@
> +/* { dg-do run } */
> +/* { dg-options "-O2 -mpower8-vector" } */
> +/* { dg-require-effective-target p8vector_hw } */
> +
> +#ifndef CHECK_H
> +#define CHECK_H "sse4_1-check.h"
> +#endif
> +
> +#ifndef TEST
> +#define TEST sse4_1_test
> +#endif
> +
> +#include CHECK_H
> +
> +#include <smmintrin.h>
> +
> +#define NUM 64
> +
> +static void
> +TEST (void)
> +{
> +  union
> +    {
> +      __m128i x[NUM / 2];
> +      long long ll[NUM];
> +    } dst;
> +  union
> +    {
> +      __m128i x[NUM / 2];
> +      int i[NUM * 2];
> +    } src1, src2;
> +  int i, sign = 1;
> +  long long value;
> +
> +  for (i = 0; i < NUM * 2; i += 2)
> +    {
> +      src1.i[i] = i * i * sign;
> +      src2.i[i] = (i + 20) * sign;
> +      sign = -sign;
> +    }
> +
> +  for (i = 0; i < NUM; i += 2)
> +    dst.x[i / 2] = _mm_mul_epi32 (src1.x[i / 2], src2.x[i / 2]);
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      value = (long long) src1.i[i * 2] * (long long) src2.i[i * 2];
> +      if (value != dst.ll[i])
> +	abort ();
> +    }
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmulld.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmulld.c
> new file mode 100644
> index 000000000000..150832915911
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmulld.c
> @@ -0,0 +1,46 @@
> +/* { dg-do run } */
> +/* { dg-options "-O2 -mvsx" } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +
> +#ifndef CHECK_H
> +#define CHECK_H "sse4_1-check.h"
> +#endif
> +
> +#ifndef TEST
> +#define TEST sse4_1_test
> +#endif
> +
> +#include CHECK_H
> +
> +#include <smmintrin.h>
> +
> +#define NUM 64
> +
> +static void
> +TEST (void)
> +{
> +  union
> +    {
> +      __m128i x[NUM / 4];
> +      int i[NUM];
> +    } dst, src1, src2;
> +  int i, sign = 1;
> +  int value;
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      src1.i[i] = i * i * sign;
> +      src2.i[i] = (i + 20) * sign;
> +      sign = -sign;
> +    }
> +
> +  for (i = 0; i < NUM; i += 4)
> +    dst.x[i / 4] = _mm_mullo_epi32 (src1.x[i / 4], src2.x[i / 4]);
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      value = src1.i[i] * src2.i[i];
> +      if (value != dst.i[i])
> +	abort ();
> +    }
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_2-check.h b/gcc/testsuite/gcc.target/powerpc/sse4_2-check.h
> new file mode 100644
> index 000000000000..f6264e5a1083
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_2-check.h
> @@ -0,0 +1,18 @@
> +#define NO_WARN_X86_INTRINSICS 1
> +
> +static void sse4_2_test (void);
> +
> +static void
> +__attribute__ ((noinline))
> +do_test (void)
> +{
> +  sse4_2_test ();
> +}
> +
> +int
> +main ()
> +{
> +  do_test ();
> +
> +  return 0;
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_2-pcmpgtq.c b/gcc/testsuite/gcc.target/powerpc/sse4_2-pcmpgtq.c
> new file mode 100644
> index 000000000000..4bfbad885b30
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/sse4_2-pcmpgtq.c
> @@ -0,0 +1,46 @@
> +/* { dg-do run } */
> +/* { dg-options "-O2 -mvsx" } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +
> +#ifndef CHECK_H
> +#define CHECK_H "sse4_2-check.h"
> +#endif
> +
> +#ifndef TEST
> +#define TEST sse4_2_test
> +#endif
> +
> +#include CHECK_H
> +
> +#include <nmmintrin.h>
> +
> +#define NUM 64
> +
> +static void
> +TEST (void)
> +{
> +  union
> +    {
> +      __m128i x[NUM / 2];
> +      long long ll[NUM];
> +    } dst, src1, src2;
> +  int i, sign = 1;
> +  long long is_eq;
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      src1.ll[i] = i * i * sign;
> +      src2.ll[i] = (i + 20) * sign;
> +      sign = -sign;
> +    }
> +
> +  for (i = 0; i < NUM; i += 2)
> +    dst.x[i / 2] = _mm_cmpgt_epi64 (src1.x[i / 2], src2.x[i / 2]);
> +
> +  for (i = 0; i < NUM; i++)
> +    {
> +      is_eq = src1.ll[i] > src2.ll[i] ? 0xFFFFFFFFFFFFFFFFLL : 0LL;
> +      if (is_eq != dst.ll[i])
> +	abort ();
> +    }
> +}


* Re: [PATCH v3 6/6] rs6000: Guard some x86 intrinsics implementations
  2021-08-23 19:03 ` [PATCH v3 6/6] rs6000: Guard some x86 intrinsics implementations Paul A. Clarke
@ 2021-08-27 15:25   ` Bill Schmidt
  2021-10-12  0:11   ` Segher Boessenkool
  1 sibling, 0 replies; 47+ messages in thread
From: Bill Schmidt @ 2021-08-27 15:25 UTC (permalink / raw)
  To: Paul A. Clarke, gcc-patches; +Cc: segher

Hi Paul,

Thanks for the changes!  This looks fine to me, recommend approval.

Thanks,
Bill

On 8/23/21 2:03 PM, Paul A. Clarke wrote:
> Some compatibility implementations of x86 intrinsics include
> Power intrinsics which require POWER8.  Guard them.
>
> emmintrin.h:
> - _mm_cmpord_pd: Remove code which was ostensibly for pre-POWER8,
>    but which indeed depended on POWER8 (vec_cmpgt(v2du)/vcmpgtud).
>    The "POWER8" version works fine on pre-POWER8.
> - _mm_mul_epu32: vec_mule(v4su) uses vmuleuw.
> pmmintrin.h:
> - _mm_movehdup_ps: vec_mergeo(v4su) uses vmrgow.
> - _mm_moveldup_ps: vec_mergee(v4su) uses vmrgew.
> smmintrin.h:
> - _mm_cmpeq_epi64: vec_cmpeq(v2di) uses vcmpequd.
> - _mm_mul_epi32: vec_mule(v4si) uses vmuluwm.
> - _mm_cmpgt_epi64: vec_cmpgt(v2di) uses vcmpgtsd.
> tmmintrin.h:
> - _mm_sign_epi8: vec_neg(v16qi) uses vsububm.
> - _mm_sign_epi16: vec_neg(v8hi) uses vsubuhm.
> - _mm_sign_epi32: vec_neg(v4si) uses vsubuwm.
>    Note that the above three could actually be supported pre-POWER8,
>    but current GCC does not support them before POWER8.
> - _mm_sign_pi8: depends on _mm_sign_epi8.
> - _mm_sign_pi16: depends on _mm_sign_epi16.
> - _mm_sign_pi32: depends on _mm_sign_epi32.
>
> 2021-08-20  Paul A. Clarke  <pc@us.ibm.com>
>
> gcc
> 	PR target/101893
> 	* config/rs6000/emmintrin.h: Guard POWER8 intrinsics.
> 	* config/rs6000/pmmintrin.h: Same.
> 	* config/rs6000/smmintrin.h: Same.
> 	* config/rs6000/tmmintrin.h: Same.
> ---
> v3: No change.
> v2:
> - Ensured that new "#ifdef _ARCH_PWR8" bracket each function so
>    impacted, rather than groups of functions, per v1 review.
> - Noted testing in patch series cover letter.
> - Added PR number to commit message.
>
>   gcc/config/rs6000/emmintrin.h | 12 ++----------
>   gcc/config/rs6000/pmmintrin.h |  4 ++++
>   gcc/config/rs6000/smmintrin.h |  4 ++++
>   gcc/config/rs6000/tmmintrin.h | 12 ++++++++++++
>   4 files changed, 22 insertions(+), 10 deletions(-)
>
> diff --git a/gcc/config/rs6000/emmintrin.h b/gcc/config/rs6000/emmintrin.h
> index ce1287edf782..32ad72b4cc35 100644
> --- a/gcc/config/rs6000/emmintrin.h
> +++ b/gcc/config/rs6000/emmintrin.h
> @@ -430,20 +430,10 @@ _mm_cmpnge_pd (__m128d __A, __m128d __B)
>   extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
>   _mm_cmpord_pd (__m128d __A, __m128d __B)
>   {
> -#if _ARCH_PWR8
>     __v2du c, d;
>     /* Compare against self will return false (0's) if NAN.  */
>     c = (__v2du)vec_cmpeq (__A, __A);
>     d = (__v2du)vec_cmpeq (__B, __B);
> -#else
> -  __v2du a, b;
> -  __v2du c, d;
> -  const __v2du double_exp_mask  = {0x7ff0000000000000, 0x7ff0000000000000};
> -  a = (__v2du)vec_abs ((__v2df)__A);
> -  b = (__v2du)vec_abs ((__v2df)__B);
> -  c = (__v2du)vec_cmpgt (double_exp_mask, a);
> -  d = (__v2du)vec_cmpgt (double_exp_mask, b);
> -#endif
>     /* A != NAN and B != NAN.  */
>     return ((__m128d)vec_and(c, d));
>   }
> @@ -1472,6 +1462,7 @@ _mm_mul_su32 (__m64 __A, __m64 __B)
>     return ((__m64)a * (__m64)b);
>   }
>   
> +#ifdef _ARCH_PWR8
>   extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
>   _mm_mul_epu32 (__m128i __A, __m128i __B)
>   {
> @@ -1498,6 +1489,7 @@ _mm_mul_epu32 (__m128i __A, __m128i __B)
>     return (__m128i) vec_mule ((__v4su)__A, (__v4su)__B);
>   #endif
>   }
> +#endif
>   
>   extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
>   _mm_slli_epi16 (__m128i __A, int __B)
> diff --git a/gcc/config/rs6000/pmmintrin.h b/gcc/config/rs6000/pmmintrin.h
> index eab712fdfa66..83dff1d85666 100644
> --- a/gcc/config/rs6000/pmmintrin.h
> +++ b/gcc/config/rs6000/pmmintrin.h
> @@ -123,17 +123,21 @@ _mm_hsub_pd (__m128d __X, __m128d __Y)
>   			    vec_mergel ((__v2df) __X, (__v2df)__Y));
>   }
>   
> +#ifdef _ARCH_PWR8
>   extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
>   _mm_movehdup_ps (__m128 __X)
>   {
>     return (__m128)vec_mergeo ((__v4su)__X, (__v4su)__X);
>   }
> +#endif
>   
> +#ifdef _ARCH_PWR8
>   extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
>   _mm_moveldup_ps (__m128 __X)
>   {
>     return (__m128)vec_mergee ((__v4su)__X, (__v4su)__X);
>   }
> +#endif
>   
>   extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
>   _mm_loaddup_pd (double const *__P)
> diff --git a/gcc/config/rs6000/smmintrin.h b/gcc/config/rs6000/smmintrin.h
> index c04d2bb5b6d3..29719367e205 100644
> --- a/gcc/config/rs6000/smmintrin.h
> +++ b/gcc/config/rs6000/smmintrin.h
> @@ -272,6 +272,7 @@ _mm_extract_ps (__m128 __X, const int __N)
>     return ((__v4si)__X)[__N & 3];
>   }
>   
> +#ifdef _ARCH_PWR8
>   extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
>   _mm_blend_epi16 (__m128i __A, __m128i __B, const int __imm8)
>   {
> @@ -283,6 +284,7 @@ _mm_blend_epi16 (__m128i __A, __m128i __B, const int __imm8)
>     #endif
>     return (__m128i) vec_sel ((__v8hu) __A, (__v8hu) __B, __shortmask);
>   }
> +#endif
>   
>   extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
>   _mm_blendv_epi8 (__m128i __A, __m128i __B, __m128i __mask)
> @@ -343,6 +345,7 @@ _mm_blend_pd (__m128d __A, __m128d __B, const int __imm8)
>     return (__m128d) __r;
>   }
>   
> +#ifdef _ARCH_PWR8
>   __inline __m128d
>   __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
>   _mm_blendv_pd (__m128d __A, __m128d __B, __m128d __mask)
> @@ -351,6 +354,7 @@ _mm_blendv_pd (__m128d __A, __m128d __B, __m128d __mask)
>     const __vector __bool long long __boolmask = vec_cmplt ((__v2di) __mask, __zero);
>     return (__m128d) vec_sel ((__v2du) __A, (__v2du) __B, (__v2du) __boolmask);
>   }
> +#endif
>   
>   __inline int
>   __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> diff --git a/gcc/config/rs6000/tmmintrin.h b/gcc/config/rs6000/tmmintrin.h
> index 971511260b78..a67d88c8079a 100644
> --- a/gcc/config/rs6000/tmmintrin.h
> +++ b/gcc/config/rs6000/tmmintrin.h
> @@ -350,6 +350,7 @@ _mm_shuffle_pi8 (__m64 __A, __m64 __B)
>     return (__m64) ((__v2du) (__C))[0];
>   }
>   
> +#ifdef _ARCH_PWR8
>   extern __inline __m128i
>   __attribute__((__gnu_inline__, __always_inline__, __artificial__))
>   _mm_sign_epi8 (__m128i __A, __m128i __B)
> @@ -361,7 +362,9 @@ _mm_sign_epi8 (__m128i __A, __m128i __B)
>     __v16qi __conv = vec_add (__selectneg, __selectpos);
>     return (__m128i) vec_mul ((__v16qi) __A, (__v16qi) __conv);
>   }
> +#endif
>   
> +#ifdef _ARCH_PWR8
>   extern __inline __m128i
>   __attribute__((__gnu_inline__, __always_inline__, __artificial__))
>   _mm_sign_epi16 (__m128i __A, __m128i __B)
> @@ -373,7 +376,9 @@ _mm_sign_epi16 (__m128i __A, __m128i __B)
>     __v8hi __conv = vec_add (__selectneg, __selectpos);
>     return (__m128i) vec_mul ((__v8hi) __A, (__v8hi) __conv);
>   }
> +#endif
>   
> +#ifdef _ARCH_PWR8
>   extern __inline __m128i
>   __attribute__((__gnu_inline__, __always_inline__, __artificial__))
>   _mm_sign_epi32 (__m128i __A, __m128i __B)
> @@ -385,7 +390,9 @@ _mm_sign_epi32 (__m128i __A, __m128i __B)
>     __v4si __conv = vec_add (__selectneg, __selectpos);
>     return (__m128i) vec_mul ((__v4si) __A, (__v4si) __conv);
>   }
> +#endif
>   
> +#ifdef _ARCH_PWR8
>   extern __inline __m64
>   __attribute__((__gnu_inline__, __always_inline__, __artificial__))
>   _mm_sign_pi8 (__m64 __A, __m64 __B)
> @@ -396,7 +403,9 @@ _mm_sign_pi8 (__m64 __A, __m64 __B)
>     __C = (__v16qi) _mm_sign_epi8 ((__m128i) __C, (__m128i) __D);
>     return (__m64) ((__v2du) (__C))[0];
>   }
> +#endif
>   
> +#ifdef _ARCH_PWR8
>   extern __inline __m64
>   __attribute__((__gnu_inline__, __always_inline__, __artificial__))
>   _mm_sign_pi16 (__m64 __A, __m64 __B)
> @@ -407,7 +416,9 @@ _mm_sign_pi16 (__m64 __A, __m64 __B)
>     __C = (__v8hi) _mm_sign_epi16 ((__m128i) __C, (__m128i) __D);
>     return (__m64) ((__v2du) (__C))[0];
>   }
> +#endif
>   
> +#ifdef _ARCH_PWR8
>   extern __inline __m64
>   __attribute__((__gnu_inline__, __always_inline__, __artificial__))
>   _mm_sign_pi32 (__m64 __A, __m64 __B)
> @@ -418,6 +429,7 @@ _mm_sign_pi32 (__m64 __A, __m64 __B)
>     __C = (__v4si) _mm_sign_epi32 ((__m128i) __C, (__m128i) __D);
>     return (__m64) ((__v2du) (__C))[0];
>   }
> +#endif
>   
>   extern __inline __m128i
>   __attribute__((__gnu_inline__, __always_inline__, __artificial__))

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v3 5/6] rs6000: Support more SSE4 "cmp", "mul", "pack" intrinsics
  2021-08-27 15:21   ` Bill Schmidt
@ 2021-08-27 18:52     ` Paul A. Clarke
  0 siblings, 0 replies; 47+ messages in thread
From: Paul A. Clarke @ 2021-08-27 18:52 UTC (permalink / raw)
  To: wschmidt; +Cc: gcc-patches, segher

On Fri, Aug 27, 2021 at 10:21:35AM -0500, Bill Schmidt via Gcc-patches wrote:
> On 8/23/21 2:03 PM, Paul A. Clarke wrote:
> > Function signatures and decorations match gcc/config/i386/smmintrin.h.

> > gcc

> > 	* config/rs6000/nmmintrin.h: Copy from i386, tweak to suit.

> > ---
> > v3:
> > - Add nmmintrin.h. _mm_cmpgt_epi64 is part of SSE4.2, which is
> >    ostensibly defined in nmmintrin.h. Following the i386 implementation,
> >    however, nmmintrin.h only includes smmintrin.h, and the actual
> >    implementations appear there.

> > v2:
> > - Added "extern" to functions to maintain compatible decorations with
> >    like implementations in gcc/config/i386.

> > diff --git a/gcc/config/rs6000/nmmintrin.h b/gcc/config/rs6000/nmmintrin.h
> > new file mode 100644
> > index 000000000000..20a70bee3776
> > --- /dev/null
> > +++ b/gcc/config/rs6000/nmmintrin.h
> > @@ -0,0 +1,40 @@
> > +/* Copyright (C) 2021 Free Software Foundation, Inc.
> > +
> > +   This file is part of GCC.
> > +
> > +   GCC is free software; you can redistribute it and/or modify
> > +   it under the terms of the GNU General Public License as published by
> > +   the Free Software Foundation; either version 3, or (at your option)
> > +   any later version.
> > +
> > +   GCC is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > +   GNU General Public License for more details.
> > +
> > +   Under Section 7 of GPL version 3, you are granted additional
> > +   permissions described in the GCC Runtime Library Exception, version
> > +   3.1, as published by the Free Software Foundation.
> > +
> > +   You should have received a copy of the GNU General Public License and
> > +   a copy of the GCC Runtime Library Exception along with this program;
> > +   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
> > +   <http://www.gnu.org/licenses/>.  */
> > +
> > +#ifndef NO_WARN_X86_INTRINSICS
> > +/* This header is distributed to simplify porting x86_64 code that
> > +   makes explicit use of Intel intrinsics to powerpc64le.
> > +   It is the user's responsibility to determine if the results are
> > +   acceptable and make additional changes as necessary.
> > +   Note that much code that uses Intel intrinsics can be rewritten in
> > +   standard C or GNU C extensions, which are more portable and better
> > +   optimized across multiple targets.  */
> > +#endif
> > +
> > +#ifndef _NMMINTRIN_H_INCLUDED
> > +#define _NMMINTRIN_H_INCLUDED
> > +
> > +/* We just include SSE4.1 header file.  */
> > +#include <smmintrin.h>
> > +
> > +#endif /* _NMMINTRIN_H_INCLUDED */
> 
> Should there be something in here indicating that nmmintrin.h is for SSE
> 4.2?  Otherwise it's a bit of a head-scratcher to a new person wondering why
> this file exists.  No big deal either way.

For good or bad, I have been trying to minimize differences with the
analogous i386 files.  With the exception of the copyright and our annoying
little warning, the only difference was this comment:

--
/* Implemented from the specification included in the Intel C++ Compiler
   User Guide and Reference, version 10.0.  */
--

I didn't find that (1) accurate, since there are no implementations therein,
or (2) particularly informative, as I imagine that document has a much
bigger scope than SSE4.2.  And keeping it would be a bit misleading, I think.
So, I intentionally removed the comment.

> This looks fine to me with or without that.  Recommend approval.

Thanks for the review!

PC


* Re: [PATCH v3 1/6] rs6000: Support SSE4.1 "round" intrinsics
  2021-08-27 13:44   ` Bill Schmidt
  2021-08-27 13:47     ` Bill Schmidt
@ 2021-08-30 21:16     ` Paul A. Clarke
  2021-08-30 21:24       ` Bill Schmidt
  2021-10-07 23:08       ` Segher Boessenkool
  1 sibling, 2 replies; 47+ messages in thread
From: Paul A. Clarke @ 2021-08-30 21:16 UTC (permalink / raw)
  To: wschmidt; +Cc: gcc-patches, segher

On Fri, Aug 27, 2021 at 08:44:43AM -0500, Bill Schmidt via Gcc-patches wrote:
> On 8/23/21 2:03 PM, Paul A. Clarke wrote:
> > +	__fpscr_save.__fr = __builtin_mffsl ();
> 
> As pointed out in the v1 review, __builtin_mffsl is enabled (or supposed to
> be) only for POWER9 and later.  This will fail to work on POWER8 and earlier
> when the new builtins support is complete and this is enforced more
> carefully.  Please #ifdef and use __builtin_mffs on earlier processors. 
> Please do this everywhere this occurs.
> 
> I think you got some contradictory guidance on this, but trust me, this will
> break.

The confusing thing is that __builtin_mffsl is explicitly supported on earlier
processors, if I read the code right (from gcc/config/rs6000/rs6000.md):
--
(define_expand "rs6000_mffsl"
  [(set (match_operand:DF 0 "gpc_reg_operand")
        (unspec_volatile:DF [(const_int 0)] UNSPECV_MFFSL))]
  "TARGET_HARD_FLOAT"
{
  /* If the low latency mffsl instruction (ISA 3.0) is available use it,
     otherwise fall back to the older mffs instruction to emulate the mffsl
     instruction.  */
  
  if (!TARGET_P9_MISC)
    {
      rtx tmp1 = gen_reg_rtx (DFmode);

      /* The mffs instruction reads the entire FPSCR.  Emulate the mffsl 
         instruction using the mffs instruction and masking the result.  */
      emit_insn (gen_rs6000_mffs (tmp1));
...
--

Is that going away?  If so, that would be a possible (undesirable?)
API change, no?

PC


* Re: [PATCH v3 1/6] rs6000: Support SSE4.1 "round" intrinsics
  2021-08-30 21:16     ` Paul A. Clarke
@ 2021-08-30 21:24       ` Bill Schmidt
  2021-10-07 23:08       ` Segher Boessenkool
  1 sibling, 0 replies; 47+ messages in thread
From: Bill Schmidt @ 2021-08-30 21:24 UTC (permalink / raw)
  To: Paul A. Clarke; +Cc: gcc-patches, segher

Hi Paul,

On 8/30/21 4:16 PM, Paul A. Clarke wrote:
> On Fri, Aug 27, 2021 at 08:44:43AM -0500, Bill Schmidt via Gcc-patches wrote:
>> On 8/23/21 2:03 PM, Paul A. Clarke wrote:
>>> +	__fpscr_save.__fr = __builtin_mffsl ();
>> As pointed out in the v1 review, __builtin_mffsl is enabled (or supposed to
>> be) only for POWER9 and later.  This will fail to work on POWER8 and earlier
>> when the new builtins support is complete and this is enforced more
>> carefully.  Please #ifdef and use __builtin_mffs on earlier processors.
>> Please do this everywhere this occurs.
>>
>> I think you got some contradictory guidance on this, but trust me, this will
>> break.
> The confusing thing is that __builtin_mffsl is explicitly supported on earlier
> processors, if I read the code right (from gcc/config/rs6000/rs6000.md):
> --
> (define_expand "rs6000_mffsl"
>    [(set (match_operand:DF 0 "gpc_reg_operand")
>          (unspec_volatile:DF [(const_int 0)] UNSPECV_MFFSL))]
>    "TARGET_HARD_FLOAT"
> {
>    /* If the low latency mffsl instruction (ISA 3.0) is available use it,
>       otherwise fall back to the older mffs instruction to emulate the mffsl
>       instruction.  */
>    
>    if (!TARGET_P9_MISC)
>      {
>        rtx tmp1 = gen_reg_rtx (DFmode);
>
>        /* The mffs instruction reads the entire FPSCR.  Emulate the mffsl
>           instruction using the mffs instruction and masking the result.  */
>        emit_insn (gen_rs6000_mffs (tmp1));
> ...
> --
>
> Is that going away?  If so, that would be a possible (undesirable?)
> API change, no?

Hm, I see.  I missed that in the builtins conversion.  Apparently 
there's nothing in the test suite that verifies this works on P9, which 
is a hole that could use fixing.  This usage isn't documented anywhere 
near the builtin machinery, either.

I'll patch the new builtins code to move this to a more permissive 
stanza and document why.  You can leave your code as is.

Thanks!
Bill

>
> PC


* Re: [PATCH v3 0/6] rs6000: Support more SSE4 intrinsics
  2021-08-23 19:03 [PATCH v3 0/6] rs6000: Support more SSE4 intrinsics Paul A. Clarke
                   ` (5 preceding siblings ...)
  2021-08-23 19:03 ` [PATCH v3 6/6] rs6000: Guard some x86 intrinsics implementations Paul A. Clarke
@ 2021-09-16 14:59 ` Paul A. Clarke
  2021-10-04 18:26   ` Paul A. Clarke
  2021-10-07 22:25 ` Segher Boessenkool
  7 siblings, 1 reply; 47+ messages in thread
From: Paul A. Clarke @ 2021-09-16 14:59 UTC (permalink / raw)
  To: gcc-patches; +Cc: segher

Ping.

On Mon, Aug 23, 2021 at 02:03:04PM -0500, Paul A. Clarke via Gcc-patches wrote:
> v3: Add "nmmintrin.h". _mm_cmpgt_epi64 is part of SSE4.2
> and users will expect to be able to include "nmmintrin.h",
> even though "nmmintrin.h" just includes "smmintrin.h"
> where all of the SSE4.2 implementations actually appear.
> 
> Only patch 5/6 changed from v2.
> 
> Tested ppc64le (POWER9) and ppc64/32 (POWER7).
> 
> OK for trunk?
> 
> Paul A. Clarke (6):
>   rs6000: Support SSE4.1 "round" intrinsics
>   rs6000: Support SSE4.1 "min" and "max" intrinsics
>   rs6000: Simplify some SSE4.1 "test" intrinsics
>   rs6000: Support SSE4.1 "cvt" intrinsics
>   rs6000: Support more SSE4 "cmp", "mul", "pack" intrinsics
>   rs6000: Guard some x86 intrinsics implementations
> 
>  gcc/config/rs6000/emmintrin.h                 |  12 +-
>  gcc/config/rs6000/nmmintrin.h                 |  40 ++
>  gcc/config/rs6000/pmmintrin.h                 |   4 +
>  gcc/config/rs6000/smmintrin.h                 | 427 ++++++++++++++++--
>  gcc/config/rs6000/tmmintrin.h                 |  12 +
>  gcc/testsuite/gcc.target/powerpc/pr78102.c    |  23 +
>  .../gcc.target/powerpc/sse4_1-packusdw.c      |  73 +++
>  .../gcc.target/powerpc/sse4_1-pcmpeqq.c       |  46 ++
>  .../gcc.target/powerpc/sse4_1-pmaxsb.c        |  46 ++
>  .../gcc.target/powerpc/sse4_1-pmaxsd.c        |  46 ++
>  .../gcc.target/powerpc/sse4_1-pmaxud.c        |  47 ++
>  .../gcc.target/powerpc/sse4_1-pmaxuw.c        |  47 ++
>  .../gcc.target/powerpc/sse4_1-pminsb.c        |  46 ++
>  .../gcc.target/powerpc/sse4_1-pminsd.c        |  46 ++
>  .../gcc.target/powerpc/sse4_1-pminud.c        |  47 ++
>  .../gcc.target/powerpc/sse4_1-pminuw.c        |  47 ++
>  .../gcc.target/powerpc/sse4_1-pmovsxbd.c      |  42 ++
>  .../gcc.target/powerpc/sse4_1-pmovsxbq.c      |  42 ++
>  .../gcc.target/powerpc/sse4_1-pmovsxbw.c      |  42 ++
>  .../gcc.target/powerpc/sse4_1-pmovsxdq.c      |  42 ++
>  .../gcc.target/powerpc/sse4_1-pmovsxwd.c      |  42 ++
>  .../gcc.target/powerpc/sse4_1-pmovsxwq.c      |  42 ++
>  .../gcc.target/powerpc/sse4_1-pmovzxbd.c      |  43 ++
>  .../gcc.target/powerpc/sse4_1-pmovzxbq.c      |  43 ++
>  .../gcc.target/powerpc/sse4_1-pmovzxbw.c      |  43 ++
>  .../gcc.target/powerpc/sse4_1-pmovzxdq.c      |  43 ++
>  .../gcc.target/powerpc/sse4_1-pmovzxwd.c      |  43 ++
>  .../gcc.target/powerpc/sse4_1-pmovzxwq.c      |  43 ++
>  .../gcc.target/powerpc/sse4_1-pmuldq.c        |  51 +++
>  .../gcc.target/powerpc/sse4_1-pmulld.c        |  46 ++
>  .../gcc.target/powerpc/sse4_1-round3.h        |  81 ++++
>  .../gcc.target/powerpc/sse4_1-roundpd.c       | 143 ++++++
>  .../gcc.target/powerpc/sse4_1-roundps.c       |  98 ++++
>  .../gcc.target/powerpc/sse4_1-roundsd.c       | 256 +++++++++++
>  .../gcc.target/powerpc/sse4_1-roundss.c       | 208 +++++++++
>  .../gcc.target/powerpc/sse4_2-check.h         |  18 +
>  .../gcc.target/powerpc/sse4_2-pcmpgtq.c       |  46 ++
>  37 files changed, 2407 insertions(+), 59 deletions(-)
>  create mode 100644 gcc/config/rs6000/nmmintrin.h
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr78102.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-packusdw.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pcmpeqq.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxsb.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxsd.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxud.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxuw.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pminsb.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pminsd.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pminud.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pminuw.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbd.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbq.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbw.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxdq.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxwd.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxwq.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbd.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbq.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbw.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxdq.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxwd.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxwq.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmuldq.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmulld.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-round3.h
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-roundpd.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-roundps.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-roundsd.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-roundss.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_2-check.h
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_2-pcmpgtq.c
> 
> -- 
> 2.27.0
> 


* Re: [PATCH v3 0/6] rs6000: Support more SSE4 intrinsics
  2021-09-16 14:59 ` [PATCH v3 0/6] rs6000: Support more SSE4 intrinsics Paul A. Clarke
@ 2021-10-04 18:26   ` Paul A. Clarke
  0 siblings, 0 replies; 47+ messages in thread
From: Paul A. Clarke @ 2021-10-04 18:26 UTC (permalink / raw)
  To: gcc-patches; +Cc: segher

Ping.

On Thu, Sep 16, 2021 at 09:59:39AM -0500, Paul A. Clarke via Gcc-patches wrote:
> Ping.
> 
> On Mon, Aug 23, 2021 at 02:03:04PM -0500, Paul A. Clarke via Gcc-patches wrote:
> > v3: Add "nmmintrin.h". _mm_cmpgt_epi64 is part of SSE4.2
> > and users will expect to be able to include "nmmintrin.h",
> > even though "nmmintrin.h" just includes "smmintrin.h"
> > where all of the SSE4.2 implementations actually appear.
> > 
> > Only patch 5/6 changed from v2.
> > 
> > Tested ppc64le (POWER9) and ppc64/32 (POWER7).
> > 
> > OK for trunk?
> > 
> > Paul A. Clarke (6):
> >   rs6000: Support SSE4.1 "round" intrinsics
> >   rs6000: Support SSE4.1 "min" and "max" intrinsics
> >   rs6000: Simplify some SSE4.1 "test" intrinsics
> >   rs6000: Support SSE4.1 "cvt" intrinsics
> >   rs6000: Support more SSE4 "cmp", "mul", "pack" intrinsics
> >   rs6000: Guard some x86 intrinsics implementations
> > 
> >  gcc/config/rs6000/emmintrin.h                 |  12 +-
> >  gcc/config/rs6000/nmmintrin.h                 |  40 ++
> >  gcc/config/rs6000/pmmintrin.h                 |   4 +
> >  gcc/config/rs6000/smmintrin.h                 | 427 ++++++++++++++++--
> >  gcc/config/rs6000/tmmintrin.h                 |  12 +
> >  gcc/testsuite/gcc.target/powerpc/pr78102.c    |  23 +
> >  .../gcc.target/powerpc/sse4_1-packusdw.c      |  73 +++
> >  .../gcc.target/powerpc/sse4_1-pcmpeqq.c       |  46 ++
> >  .../gcc.target/powerpc/sse4_1-pmaxsb.c        |  46 ++
> >  .../gcc.target/powerpc/sse4_1-pmaxsd.c        |  46 ++
> >  .../gcc.target/powerpc/sse4_1-pmaxud.c        |  47 ++
> >  .../gcc.target/powerpc/sse4_1-pmaxuw.c        |  47 ++
> >  .../gcc.target/powerpc/sse4_1-pminsb.c        |  46 ++
> >  .../gcc.target/powerpc/sse4_1-pminsd.c        |  46 ++
> >  .../gcc.target/powerpc/sse4_1-pminud.c        |  47 ++
> >  .../gcc.target/powerpc/sse4_1-pminuw.c        |  47 ++
> >  .../gcc.target/powerpc/sse4_1-pmovsxbd.c      |  42 ++
> >  .../gcc.target/powerpc/sse4_1-pmovsxbq.c      |  42 ++
> >  .../gcc.target/powerpc/sse4_1-pmovsxbw.c      |  42 ++
> >  .../gcc.target/powerpc/sse4_1-pmovsxdq.c      |  42 ++
> >  .../gcc.target/powerpc/sse4_1-pmovsxwd.c      |  42 ++
> >  .../gcc.target/powerpc/sse4_1-pmovsxwq.c      |  42 ++
> >  .../gcc.target/powerpc/sse4_1-pmovzxbd.c      |  43 ++
> >  .../gcc.target/powerpc/sse4_1-pmovzxbq.c      |  43 ++
> >  .../gcc.target/powerpc/sse4_1-pmovzxbw.c      |  43 ++
> >  .../gcc.target/powerpc/sse4_1-pmovzxdq.c      |  43 ++
> >  .../gcc.target/powerpc/sse4_1-pmovzxwd.c      |  43 ++
> >  .../gcc.target/powerpc/sse4_1-pmovzxwq.c      |  43 ++
> >  .../gcc.target/powerpc/sse4_1-pmuldq.c        |  51 +++
> >  .../gcc.target/powerpc/sse4_1-pmulld.c        |  46 ++
> >  .../gcc.target/powerpc/sse4_1-round3.h        |  81 ++++
> >  .../gcc.target/powerpc/sse4_1-roundpd.c       | 143 ++++++
> >  .../gcc.target/powerpc/sse4_1-roundps.c       |  98 ++++
> >  .../gcc.target/powerpc/sse4_1-roundsd.c       | 256 +++++++++++
> >  .../gcc.target/powerpc/sse4_1-roundss.c       | 208 +++++++++
> >  .../gcc.target/powerpc/sse4_2-check.h         |  18 +
> >  .../gcc.target/powerpc/sse4_2-pcmpgtq.c       |  46 ++
> >  37 files changed, 2407 insertions(+), 59 deletions(-)
> >  create mode 100644 gcc/config/rs6000/nmmintrin.h
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr78102.c
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-packusdw.c
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pcmpeqq.c
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxsb.c
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxsd.c
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxud.c
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxuw.c
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pminsb.c
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pminsd.c
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pminud.c
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pminuw.c
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbd.c
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbq.c
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbw.c
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxdq.c
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxwd.c
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxwq.c
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbd.c
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbq.c
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbw.c
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxdq.c
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxwd.c
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxwq.c
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmuldq.c
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-pmulld.c
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-round3.h
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-roundpd.c
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-roundps.c
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-roundsd.c
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_1-roundss.c
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_2-check.h
> >  create mode 100644 gcc/testsuite/gcc.target/powerpc/sse4_2-pcmpgtq.c
> > 
> > -- 
> > 2.27.0
> > 


* Re: [PATCH v3 0/6] rs6000: Support more SSE4 intrinsics
  2021-08-23 19:03 [PATCH v3 0/6] rs6000: Support more SSE4 intrinsics Paul A. Clarke
                   ` (6 preceding siblings ...)
  2021-09-16 14:59 ` [PATCH v3 0/6] rs6000: Support more SSE4 intrinsics Paul A. Clarke
@ 2021-10-07 22:25 ` Segher Boessenkool
  2021-10-08  0:29   ` Paul A. Clarke
  7 siblings, 1 reply; 47+ messages in thread
From: Segher Boessenkool @ 2021-10-07 22:25 UTC (permalink / raw)
  To: Paul A. Clarke; +Cc: gcc-patches, wschmidt

Hi!

On Mon, Aug 23, 2021 at 02:03:04PM -0500, Paul A. Clarke wrote:
> v3: Add "nmmintrin.h". _mm_cmpgt_epi64 is part of SSE4.2

There should not be a "v3" in the commit message.  The easy way to
achieve this is put it inside the [] in the subject (as you did), and to
mention the version history after a --- (see --notes for git-format-patch
for example).

> Tested ppc64le (POWER9) and ppc64/32 (POWER7).

Please write the full triples -- well at least enough that they are
usable, like, powerpc64-linux.  I'll assume you tested on Linux :-)


Segher


* Re: [PATCH v3 1/6] rs6000: Support SSE4.1 "round" intrinsics
  2021-08-30 21:16     ` Paul A. Clarke
  2021-08-30 21:24       ` Bill Schmidt
@ 2021-10-07 23:08       ` Segher Boessenkool
  1 sibling, 0 replies; 47+ messages in thread
From: Segher Boessenkool @ 2021-10-07 23:08 UTC (permalink / raw)
  To: Paul A. Clarke; +Cc: wschmidt, gcc-patches

On Mon, Aug 30, 2021 at 04:16:43PM -0500, Paul A. Clarke wrote:
> The confusing thing is that __builtin_mffsl is explicitly supported on earlier
> processors, if I read the code right (from gcc/config/rs6000/rs6000.md):

Yes.  It is very simple to implement everywhere, not significantly
slower than mffs.  So allowing this builtin to be used everywhere makes
it easier to use, with no real downsides.


Segher


* Re: [PATCH v3 1/6] rs6000: Support SSE4.1 "round" intrinsics
  2021-08-23 19:03 ` [PATCH v3 1/6] rs6000: Support SSE4.1 "round" intrinsics Paul A. Clarke
  2021-08-27 13:44   ` Bill Schmidt
@ 2021-10-07 23:39   ` Segher Boessenkool
  2021-10-08  1:04     ` Paul A. Clarke
  1 sibling, 1 reply; 47+ messages in thread
From: Segher Boessenkool @ 2021-10-07 23:39 UTC (permalink / raw)
  To: Paul A. Clarke; +Cc: gcc-patches, wschmidt

On Mon, Aug 23, 2021 at 02:03:05PM -0500, Paul A. Clarke wrote:
> No attempt is made to optimize writing the FPSCR (by checking if the new
> value would be the same), other than using lighter weight instructions
> when possible.

__builtin_set_fpscr_rn makes optimised code (using mtfsb[01])
automatically, fwiw.

> Move implementations of _mm_ceil* and _mm_floor* into _mm_round*, and
> convert _mm_ceil* and _mm_floor* into macros. This matches the current
> analogous implementations in config/i386/smmintrin.h.

Hrm.  Using function-like macros is begging for trouble, as usual.  But
the x86 version does this, so meh.

> +extern __inline __m128d
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_round_pd (__m128d __A, int __rounding)
> +{
> +  __v2df __r;
> +  union {
> +    double __fr;
> +    long long __fpscr;
> +  } __enables_save, __fpscr_save;
> +
> +  if (__rounding & _MM_FROUND_NO_EXC)
> +    {
> +      /* Save enabled exceptions, disable all exceptions,
> +	 and preserve the rounding mode.  */
> +#ifdef _ARCH_PWR9
> +      __asm__ __volatile__ ("mffsce %0" : "=f" (__fpscr_save.__fr));

The __volatile__ does likely not do what you want.  As far as I can see
you do not want one here anyway?

"volatile" does not order asm wrt fp insns, which you likely *do* want.

> +  __v2df __r = { ((__v2df)__B)[0], ((__v2df) __A)[1] };

You put spaces after only some casts, btw?  Well maybe I found the one
place you did it wrong, heh :-)  And you can avoid having so many parens
by making extra variables -- much more readable.

> +  switch (__rounding)

You do not need any of that __ either.

> +/* { dg-do run } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-options "-O2 -mvsx" } */

"dg-do run" requires vsx_hw, not just vsx_ok.  Testing on a machine
without VSX (so before p7) would have shown that, but do you have access
to any?  This is one of those things we are only told about a year after
it was added, because no one who tests often does that on so old
hardware :-)
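
[Presumably the fixed directives would read as follows; only the
effective-target line changes:]

```
/* { dg-do run } */
/* { dg-require-effective-target vsx_hw } */
/* { dg-options "-O2 -mvsx" } */
```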

So, okay for trunk (and backports after some burn-in) with that vsx_ok
fixed.  That asm needs fixing, but you can do that later.

Thanks!


Segher


* Re: [PATCH v3 0/6] rs6000: Support more SSE4 intrinsics
  2021-10-07 22:25 ` Segher Boessenkool
@ 2021-10-08  0:29   ` Paul A. Clarke
  2021-10-12  0:15     ` Segher Boessenkool
  0 siblings, 1 reply; 47+ messages in thread
From: Paul A. Clarke @ 2021-10-08  0:29 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: wschmidt, gcc-patches

On Thu, Oct 07, 2021 at 05:25:54PM -0500, Segher Boessenkool wrote:
> On Mon, Aug 23, 2021 at 02:03:04PM -0500, Paul A. Clarke wrote:
> > v3: Add "nmmintrin.h". _mm_cmpgt_epi64 is part of SSE4.2
> 
> There should not be a "v3" in the commit message.  The easy way to
> achieve this is put it inside the [] in the subject (as you did), and to
> mention the version history after a --- (see --notes for git-format-patch
> for example).

This is just a cover letter. Does it matter in that context?
(I have done as described in the patches which followed.)

> > Tested ppc64le (POWER9) and ppc64/32 (POWER7).
> 
> Please write the full triples -- well at least enough that they are
> usable, like, powerpc64-linux.  I'll assume you tested on Linux :-)

Yes, sorry.  All are "-linux", and I'll try to remember that for next time.

PC

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH v3 1/6] rs6000: Support SSE4.1 "round" intrinsics
  2021-10-07 23:39   ` Segher Boessenkool
@ 2021-10-08  1:04     ` Paul A. Clarke
  2021-10-08 17:39       ` Segher Boessenkool
  0 siblings, 1 reply; 47+ messages in thread
From: Paul A. Clarke @ 2021-10-08  1:04 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches, wschmidt

On Thu, Oct 07, 2021 at 06:39:06PM -0500, Segher Boessenkool wrote:
> On Mon, Aug 23, 2021 at 02:03:05PM -0500, Paul A. Clarke wrote:
> > No attempt is made to optimize writing the FPSCR (by checking if the new
> > value would be the same), other than using lighter weight instructions
> > when possible.
> 
> __builtin_set_fpscr_rn makes optimised code (using mtfsb[01])
> automatically, fwiw.
> 
> > Move implementations of _mm_ceil* and _mm_floor* into _mm_round*, and
> > convert _mm_ceil* and _mm_floor* into macros. This matches the current
> > analogous implementations in config/i386/smmintrin.h.
> 
> Hrm.  Using function-like macros is begging for trouble, as usual.  But
> the x86 version does this, so meh.
> 
> > +extern __inline __m128d
> > +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> > +_mm_round_pd (__m128d __A, int __rounding)
> > +{
> > +  __v2df __r;
> > +  union {
> > +    double __fr;
> > +    long long __fpscr;
> > +  } __enables_save, __fpscr_save;
> > +
> > +  if (__rounding & _MM_FROUND_NO_EXC)
> > +    {
> > +      /* Save enabled exceptions, disable all exceptions,
> > +	 and preserve the rounding mode.  */
> > +#ifdef _ARCH_PWR9
> > +      __asm__ __volatile__ ("mffsce %0" : "=f" (__fpscr_save.__fr));
> 
> The __volatile__ does likely not do what you want.  As far as I can see
> you do not want one here anyway?
> 
> "volatile" does not order asm wrt fp insns, which you likely *do* want.

Reading the GCC docs, it looks like the "volatile" qualifier for "asm"
has no effect at all (6.47.1):

| The optional volatile qualifier has no effect. All basic asm blocks are
| implicitly volatile.

So, it could be removed without concern.

> > +  __v2df __r = { ((__v2df)__B)[0], ((__v2df) __A)[1] };
> 
> You put spaces after only some casts, btw?  Well maybe I found the one
> place you did it wrong, heh :-)  And you can avoid having so many parens
> by making extra variables -- much more readable.

I'll fix this.

> > +  switch (__rounding)
> 
> You do not need any of that __ either.

I'm surprised that I don't. A .h file needs to be concerned about the
namespace it inherits, no?

> > +/* { dg-do run } */
> > +/* { dg-require-effective-target powerpc_vsx_ok } */
> > +/* { dg-options "-O2 -mvsx" } */
> 
> "dg-do run" requires vsx_hw, not just vsx_ok.  Testing on a machine
> without VSX (so before p7) would have shown that, but do you have access
> to any?  This is one of those things we are only told about a year after
> it was added, because no one who tests often does that on so old
> hardware :-)
> 
> So, okay for trunk (and backports after some burn-in) with that vsx_ok
> fixed.  That asm needs fixing, but you can do that later.

OK.

Thanks!

PC


* Re: [PATCH v3 1/6] rs6000: Support SSE4.1 "round" intrinsics
  2021-10-08  1:04     ` Paul A. Clarke
@ 2021-10-08 17:39       ` Segher Boessenkool
  2021-10-08 19:27         ` Paul A. Clarke
  0 siblings, 1 reply; 47+ messages in thread
From: Segher Boessenkool @ 2021-10-08 17:39 UTC (permalink / raw)
  To: Paul A. Clarke; +Cc: gcc-patches, wschmidt

On Thu, Oct 07, 2021 at 08:04:23PM -0500, Paul A. Clarke wrote:
> On Thu, Oct 07, 2021 at 06:39:06PM -0500, Segher Boessenkool wrote:
> > > +      __asm__ __volatile__ ("mffsce %0" : "=f" (__fpscr_save.__fr));
> > 
> > The __volatile__ does likely not do what you want.  As far as I can see
> > you do not want one here anyway?
> > 
> > "volatile" does not order asm wrt fp insns, which you likely *do* want.
> 
> Reading the GCC docs, it looks like the "volatile" qualifier for "asm"
> has no effect at all (6.47.1):
> 
> | The optional volatile qualifier has no effect. All basic asm blocks are
> | implicitly volatile.
> 
> So, it could be removed without concern.

This is not a basic asm (it contains a ":"; that is not just an easy way
to see it, it is the *definition* of basic vs. extended asm).

The manual explains:

"""
Note that the compiler can move even 'volatile asm' instructions
relative to other code, including across jump instructions.  For
example, on many targets there is a system register that controls the
rounding mode of floating-point operations.  Setting it with a 'volatile
asm' statement, as in the following PowerPC example, does not work
reliably.

     asm volatile("mtfsf 255, %0" : : "f" (fpenv));
     sum = x + y;

The compiler may move the addition back before the 'volatile asm'
statement.  To make it work as expected, add an artificial dependency to
the 'asm' by referencing a variable in the subsequent code, for example:

     asm volatile ("mtfsf 255,%1" : "=X" (sum) : "f" (fpenv));
     sum = x + y;
"""

> > You do not need any of that __ either.
> 
> I'm surprised that I don't. A .h file needs to be concerned about the
> namespace it inherits, no?

These are local variables in a function though.  You get such
complexities in macros, but never in functions, where everything is
scoped.  Local variables are a great thing.  And macros are a bad thing!


Segher


* Re: [PATCH v3 1/6] rs6000: Support SSE4.1 "round" intrinsics
  2021-10-08 17:39       ` Segher Boessenkool
@ 2021-10-08 19:27         ` Paul A. Clarke
  2021-10-08 22:31           ` Segher Boessenkool
  0 siblings, 1 reply; 47+ messages in thread
From: Paul A. Clarke @ 2021-10-08 19:27 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches, wschmidt

On Fri, Oct 08, 2021 at 12:39:15PM -0500, Segher Boessenkool wrote:
> On Thu, Oct 07, 2021 at 08:04:23PM -0500, Paul A. Clarke wrote:
> > On Thu, Oct 07, 2021 at 06:39:06PM -0500, Segher Boessenkool wrote:
> > > > +      __asm__ __volatile__ ("mffsce %0" : "=f" (__fpscr_save.__fr));
> > > 
> > > The __volatile__ does likely not do what you want.  As far as I can see
> > > you do not want one here anyway?
> > > 
> > > "volatile" does not order asm wrt fp insns, which you likely *do* want.
> > 
> > Reading the GCC docs, it looks like the "volatile" qualifier for "asm"
> > has no effect at all (6.47.1):
> > 
> > | The optional volatile qualifier has no effect. All basic asm blocks are
> > | implicitly volatile.
> > 
> > So, it could be removed without concern.
> 
> This is not a basic asm (it contains a ":"; that is not just an easy way
> to see it, it is the *definition* of basic vs. extended asm).

Ah, basic vs extended. I learned something today... thanks for your
patience!

> The manual explains:
> 
> """
> Note that the compiler can move even 'volatile asm' instructions
> relative to other code, including across jump instructions.  For
> example, on many targets there is a system register that controls the
> rounding mode of floating-point operations.  Setting it with a 'volatile
> asm' statement, as in the following PowerPC example, does not work
> reliably.
> 
>      asm volatile("mtfsf 255, %0" : : "f" (fpenv));
>      sum = x + y;
> 
> The compiler may move the addition back before the 'volatile asm'
> statement.  To make it work as expected, add an artificial dependency to
> the 'asm' by referencing a variable in the subsequent code, for example:
> 
>      asm volatile ("mtfsf 255,%1" : "=X" (sum) : "f" (fpenv));
>      sum = x + y;
> """

I see. Thanks for the reference. If I understand correctly, volatile
prevents some optimizations based on the defined inputs/outputs, but
the asm could still be subject to reordering.

In this particular case, I don't think it's an issue with respect to
reordering.  The code in question is:
+      __asm__ __volatile__ ("mffsce %0" : "=f" (__fpscr_save.__fr));
+      __enables_save.__fpscr = __fpscr_save.__fpscr & 0xf8;

The output (__fpscr_save) is a source for the following assignment,
so the order should be respected, no?

With respect to volatile, I worry about removing it, because I do
indeed need that instruction to execute in order to clear the FPSCR
exception enable bits. That side-effect is not otherwise known to the
compiler.

> > > You do not need any of that __ either.
> > 
> > I'm surprised that I don't. A .h file needs to be concerned about the
> > namespace it inherits, no?
> 
> These are local variables in a function though.  You get such
> complexities in macros, but never in functions, where everything is
> scoped.  Local variables are a great thing.  And macros are a bad thing!

They are local variables in a function *in an include file*, though.
If a user's preprocessor macro just happens to match a local variable name
there could be problems, right?

a.h:
inline void foo () {
  int A = 0;
}

a.c:
#define A a+b
#include <a.h>

$ gcc -c -I. a.c
In file included from a.c:1:
a.c: In function ‘foo’:
a.h:1:12: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘+’ token
 #define A a+b
            ^
a.c:2:17: note: in expansion of macro ‘A’
 int foo() { int A = 0; }
                 ^
a.h:1:13: error: ‘b’ undeclared (first use in this function)
 #define A a+b
             ^
a.c:2:17: note: in expansion of macro ‘A’
 int foo() { int A = 0; }
                 ^
a.h:1:13: note: each undeclared identifier is reported only once for each function it appears in
 #define A a+b
             ^
a.c:2:17: note: in expansion of macro ‘A’
 int foo() { int A = 0; }
                 ^
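The reserved "__" spelling sidesteps exactly this collision, since user code may not define macros with such names (C11 7.1.3). A minimal hypothetical sketch (names invented for illustration):

```c
#define A a+b           /* user macro, defined before the "header" code */

/* Hypothetical header content: because the local names use the
   reserved "__" prefix, the user's "A" macro cannot collide.  */
static inline int __foo (void)
{
  int __A = 0;          /* unaffected by "#define A a+b" above */
  return __A;
}
```

With a plain `int A = 0;` in its place, the macro would expand inside the declaration and the header would fail to compile, as in the error output above.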
PC


* Re: [PATCH v3 1/6] rs6000: Support SSE4.1 "round" intrinsics
  2021-10-08 19:27         ` Paul A. Clarke
@ 2021-10-08 22:31           ` Segher Boessenkool
  2021-10-11 13:46             ` Paul A. Clarke
  0 siblings, 1 reply; 47+ messages in thread
From: Segher Boessenkool @ 2021-10-08 22:31 UTC (permalink / raw)
  To: Paul A. Clarke; +Cc: gcc-patches, wschmidt

On Fri, Oct 08, 2021 at 02:27:28PM -0500, Paul A. Clarke wrote:
> On Fri, Oct 08, 2021 at 12:39:15PM -0500, Segher Boessenkool wrote:
> > This is not a basic asm (it contains a ":"; that is not just an easy way
> > to see it, it is the *definition* of basic vs. extended asm).
> 
> Ah, basic vs extended. I learned something today... thanks for your
> patience!

To expand a little: any asm with operands is extended asm.  One without
operands can be either:  asm("eieio");  is basic, while  asm("eieio" : );
is extended.  This matters because semantics are a bit different.

> I see. Thanks for the reference. If I understand correctly, volatile
> prevents some optimizations based on the defined inputs/outputs, but
> the asm could still be subject to reordering.

"asm volatile" means there is a side effect in the asm.  This means that
it has to be executed on the real machine the same as on the abstract
machine, with the side effects in the same order.

It can still be reordered, modulo those restrictions.  It can be merged
with an identical asm as well.  And the compiler can split this into two
identical asms on two paths.

In this case you might want a side effect (the instructions writes to
the FPSCR after all).  But you need this to be tied to the FP code that
you want the flags to be changed for, and to the restore of the flags,
and finally you need to prevent other FP code from being scheduled in
between.

You need more for that than just volatile, and the solution may well
make volatile not wanted: tying the insns together somehow will
naturally make the flags restored to a sane situation again, so the
whole group can be removed if you want, etc.

> In this particular case, I don't think it's an issue with respect to
> reordering.  The code in question is:
> +      __asm__ __volatile__ ("mffsce %0" : "=f" (__fpscr_save.__fr));
> +      __enables_save.__fpscr = __fpscr_save.__fpscr & 0xf8;
> 
> The output (__fpscr_save) is a source for the following assignment,
> so the order should be respected, no?

Other FP code can be interleaved, and then do the wrong thing.

> With respect to volatile, I worry about removing it, because I do
> indeed need that instruction to execute in order to clear the FPSCR
> exception enable bits. That side-effect is not otherwise known to the
> compiler.

Yes.  But as said above, volatile isn't enough to get this to behave
correctly.

The easiest way out is to write this all in one piece of (inline) asm.
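A hypothetical sketch of that single-asm approach: the FPSCR save/clear, the FP operation, and the restore all live in one asm, so no unrelated FP code can be scheduled between them. This is an illustration only, not the patch's code; a plain-C fallback keeps the sketch self-contained off PowerPC.

```c
#if defined (__powerpc64__) || defined (__PPC__)
static inline double
add_enables_cleared (double __x, double __y)
{
  double __sum, __fpscr_save;
  /* mffsce saves the FPSCR and clears the exception-enable bits;
     the add and the restore sit in the same asm, so the compiler
     cannot interleave other FP code.  "=&f" marks the save register
     as early-clobber since it is written before the inputs are read.  */
  __asm__ ("mffsce %1\n\t"
	   "fadd %0,%2,%3\n\t"
	   "mtfsf 255,%1"
	   : "=f" (__sum), "=&f" (__fpscr_save)
	   : "f" (__x), "f" (__y));
  return __sum;
}
#else
/* Fallback so the sketch compiles elsewhere (no FPSCR involved).  */
static inline double
add_enables_cleared (double __x, double __y)
{
  return __x + __y;
}
#endif
```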

> > > > You do not need any of that __ either.
> > > 
> > > I'm surprised that I don't. A .h file needs to be concerned about the
> > > namespace it inherits, no?
> > 
> > These are local variables in a function though.  You get such
> > complexities in macros, but never in functions, where everything is
> > scoped.  Local variables are a great thing.  And macros are a bad thing!
> 
> They are local variables in a function *in an include file*, though.
> If a user's preprocessor macro just happens to match a local variable name
> there could be problems, right?

Of course.  This is why traditionally macro names are ALL_CAPS :-)  So
in practice it doesn't matter, and in practice many users use __ names
themselves as well.

But you are right.  I just don't see it will help practically :-(


Segher


* Re: [PATCH v3 1/6] rs6000: Support SSE4.1 "round" intrinsics
  2021-10-08 22:31           ` Segher Boessenkool
@ 2021-10-11 13:46             ` Paul A. Clarke
  2021-10-11 16:28               ` Segher Boessenkool
  0 siblings, 1 reply; 47+ messages in thread
From: Paul A. Clarke @ 2021-10-11 13:46 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches, wschmidt

On Fri, Oct 08, 2021 at 05:31:11PM -0500, Segher Boessenkool wrote:
> On Fri, Oct 08, 2021 at 02:27:28PM -0500, Paul A. Clarke wrote:
> > On Fri, Oct 08, 2021 at 12:39:15PM -0500, Segher Boessenkool wrote:
> > I see. Thanks for the reference. If I understand correctly, volatile
> > prevents some optimizations based on the defined inputs/outputs, but
> > the asm could still be subject to reordering.
> 
> "asm volatile" means there is a side effect in the asm.  This means that
> it has to be executed on the real machine the same as on the abstract
> machine, with the side effects in the same order.
> 
> It can still be reordered, modulo those restrictions.  It can be merged
> with an identical asm as well.  And the compiler can split this into two
> identical asms on two paths.

It seems odd to me that the compiler can make any assumptions about
the side-effect(s). How does it know that a side-effect does not alter
computation (as it indeed does in this case), such that reordering is
still correct (which it wouldn't be in this case)?

> In this case you might want a side effect (the instructions writes to
> the FPSCR after all).  But you need this to be tied to the FP code that
> you want the flags to be changed for, and to the restore of the flags,
> and finally you need to prevent other FP code from being scheduled in
> between.
> 
> You need more for that than just volatile, and the solution may well
> make volatile not wanted: tying the insns together somehow will
> naturally make the flags restored to a sane situation again, so the
> whole group can be removed if you want, etc.
> 
> > In this particular case, I don't think it's an issue with respect to
> > reordering.  The code in question is:
> > +      __asm__ __volatile__ ("mffsce %0" : "=f" (__fpscr_save.__fr));
> > +      __enables_save.__fpscr = __fpscr_save.__fpscr & 0xf8;
> > 
> > The output (__fpscr_save) is a source for the following assignment,
> > so the order should be respected, no?
> 
> Other FP code can be interleaved, and then do the wrong thing.
> 
> > With respect to volatile, I worry about removing it, because I do
> > indeed need that instruction to execute in order to clear the FPSCR
> > exception enable bits. That side-effect is not otherwise known to the
> > compiler.
> 
> Yes.  But as said above, volatile isn't enough to get this to behave
> correctly.
> 
> The easiest way out is to write this all in one piece of (inline) asm.

Ugh. I really don't want to go there, not just because it's work, but
I think this is a paradigm that should work without needing to drop
fully into asm.

Is there something unique about using an "asm" statement versus using,
say, a builtin like __builtin_mtfsf or a hypothetical __builtin_mffsce?
Very similar methods are used in glibc today. Are those broken?

Would creating a __builtin_mffsce be another solution?

Would adding memory barriers between the FPSCR manipulations and the
code which is bracketed by them be sufficient?

PC


* Re: [PATCH v3 1/6] rs6000: Support SSE4.1 "round" intrinsics
  2021-10-11 13:46             ` Paul A. Clarke
@ 2021-10-11 16:28               ` Segher Boessenkool
  2021-10-11 17:31                 ` Paul A. Clarke
  0 siblings, 1 reply; 47+ messages in thread
From: Segher Boessenkool @ 2021-10-11 16:28 UTC (permalink / raw)
  To: Paul A. Clarke; +Cc: gcc-patches, wschmidt

On Mon, Oct 11, 2021 at 08:46:17AM -0500, Paul A. Clarke wrote:
> On Fri, Oct 08, 2021 at 05:31:11PM -0500, Segher Boessenkool wrote:
> > "asm volatile" means there is a side effect in the asm.  This means that
> > it has to be executed on the real machine the same as on the abstract
> > machine, with the side effects in the same order.
> > 
> > It can still be reordered, modulo those restrictions.  It can be merged
> > with an identical asm as well.  And the compiler can split this into two
> > identical asms on two paths.
> 
> It seems odd to me that the compiler can make any assumptions about
> the side-effect(s). How does it know that a side-effect does not alter
> computation (as it indeed does in this case), such that reordering is
> still correct (which it wouldn't be in this case)?

Because by definition side effects do not change the computation (where
"computation" means "the outputs of the asm")!

And if you are talking about changing future computations, as floating
point control flags can be used for: this falls outside of the C abstract
machine, other than fe[gs]etround etc.

> > > With respect to volatile, I worry about removing it, because I do
> > > indeed need that instruction to execute in order to clear the FPSCR
> > > exception enable bits. That side-effect is not otherwise known to the
> > > compiler.
> > 
> > Yes.  But as said above, volatile isn't enough to get this to behave
> > correctly.
> > 
> > The easiest way out is to write this all in one piece of (inline) asm.
> 
> Ugh. I really don't want to go there, not just because it's work, but
> I think this is a paradigm that should work without needing to drop
> fully into asm.

Yes.  Let's say GCC still has some challenges here :-(

> Is there something unique about using an "asm" statement versus using,
> say, a builtin like __builtin_mtfsf or a hypothetical __builtin_mffsce?

Nope.

> Very similar methods are used in glibc today. Are those broken?

Maybe.  If you get a real (i.e. not inline) function call there, that
can save you often.

> Would creating a __builtin_mffsce be another solution?

Yes.  And not a bad idea in the first place.

> Would adding memory barriers between the FPSCR manipulations and the
> code which is bracketed by them be sufficient?

No, what you want to order is not memory accesses, but FP computations
relative to the insns that change the FP control bits.  If *both* of
those change memory you can artificially order them with that.  But most
FP computations do not access memory.
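The manual's dependency trick illustrates the alternative: tie the FP result itself to the asm through a dummy output operand, so the ordering comes from data flow rather than a memory clobber. A minimal, target-neutral sketch, with an empty asm standing in for the FPSCR-changing instruction:

```c
static double
sum_after_asm (double __x, double __y)
{
  double __sum;
  /* The "=X" dummy output makes __sum (nominally) written by the asm,
     so the addition below cannot be hoisted above it, even though no
     memory is touched.  The empty template is a stand-in for a real
     control-register write such as mtfsf.  */
  __asm__ __volatile__ ("" : "=X" (__sum));
  __sum = __x + __y;
  return __sum;
}
```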


Segher


* Re: [PATCH v3 1/6] rs6000: Support SSE4.1 "round" intrinsics
  2021-10-11 16:28               ` Segher Boessenkool
@ 2021-10-11 17:31                 ` Paul A. Clarke
  2021-10-11 22:04                   ` Segher Boessenkool
  0 siblings, 1 reply; 47+ messages in thread
From: Paul A. Clarke @ 2021-10-11 17:31 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches, wschmidt

On Mon, Oct 11, 2021 at 11:28:39AM -0500, Segher Boessenkool wrote:
> On Mon, Oct 11, 2021 at 08:46:17AM -0500, Paul A. Clarke wrote:
> > On Fri, Oct 08, 2021 at 05:31:11PM -0500, Segher Boessenkool wrote:
[...]
> > > > With respect to volatile, I worry about removing it, because I do
> > > > indeed need that instruction to execute in order to clear the FPSCR
> > > > exception enable bits. That side-effect is not otherwise known to the
> > > > compiler.
> > > 
> > > Yes.  But as said above, volatile isn't enough to get this to behave
> > > correctly.
> > > 
> > > The easiest way out is to write this all in one piece of (inline) asm.
> > 
> > Ugh. I really don't want to go there, not just because it's work, but
> > I think this is a paradigm that should work without needing to drop
> > fully into asm.
> 
> Yes.  Let's say GCC still has some challenges here :-(
> 
> > Is there something unique about using an "asm" statement versus using,
> > say, a builtin like __builtin_mtfsf or a hypothetical __builtin_mffsce?
> 
> Nope.
> 
> > Very similar methods are used in glibc today. Are those broken?
> 
> Maybe.

Ouch.

> If you get a real (i.e. not inline) function call there, that
> can save you often.

Calling a real function in order to execute a single instruction is
sub-optimal. ;-)

> > Would creating a __builtin_mffsce be another solution?
> 
> Yes.  And not a bad idea in the first place.

The previous "Nope" and this "Yes" seem in contradiction. If there is no
difference between "asm" and builtin, how does using a builtin solve the
problem?

PC


* Re: [PATCH v3 2/6] rs6000: Support SSE4.1 "min" and "max" intrinsics
  2021-08-23 19:03 ` [PATCH v3 2/6] rs6000: Support SSE4.1 "min" and "max" intrinsics Paul A. Clarke
  2021-08-27 13:47   ` Bill Schmidt
@ 2021-10-11 19:28   ` Segher Boessenkool
  2021-10-12  1:42     ` [COMMITTED v4 " Paul A. Clarke
  1 sibling, 1 reply; 47+ messages in thread
From: Segher Boessenkool @ 2021-10-11 19:28 UTC (permalink / raw)
  To: Paul A. Clarke; +Cc: gcc-patches, wschmidt

On Mon, Aug 23, 2021 at 02:03:06PM -0500, Paul A. Clarke wrote:
> gcc
> 	* config/rs6000/smmintrin.h (_mm_min_epi8, _mm_min_epu16,
> 	_mm_min_epi32, _mm_min_epu32, _mm_max_epi8, _mm_max_epu16,
> 	_mm_max_epi32, _mm_max_epu32): New.
> 
> gcc/testsuite
> 	* gcc.target/powerpc/sse4_1-pmaxsb.c: Copy from gcc.target/i386.
> 	* gcc.target/powerpc/sse4_1-pmaxsd.c: Same.
> 	* gcc.target/powerpc/sse4_1-pmaxud.c: Same.
> 	* gcc.target/powerpc/sse4_1-pmaxuw.c: Same.
> 	* gcc.target/powerpc/sse4_1-pminsb.c: Same.
> 	* gcc.target/powerpc/sse4_1-pminsd.c: Same.
> 	* gcc.target/powerpc/sse4_1-pminud.c: Same.
> 	* gcc.target/powerpc/sse4_1-pminuw.c: Same.

Okay for trunk.  Thanks!


Segher


* Re: [PATCH v3 3/6] rs6000: Simplify some SSE4.1 "test" intrinsics
  2021-08-23 19:03 ` [PATCH v3 3/6] rs6000: Simplify some SSE4.1 "test" intrinsics Paul A. Clarke
  2021-08-27 13:48   ` Bill Schmidt
@ 2021-10-11 20:50   ` Segher Boessenkool
  2021-10-12  1:47     ` [COMMITTED v4 " Paul A. Clarke
  1 sibling, 1 reply; 47+ messages in thread
From: Segher Boessenkool @ 2021-10-11 20:50 UTC (permalink / raw)
  To: Paul A. Clarke; +Cc: gcc-patches, wschmidt

On Mon, Aug 23, 2021 at 02:03:07PM -0500, Paul A. Clarke wrote:
> gcc
> 	* config/rs6000/smmintrin.h (_mm_test_all_zeros,
> 	_mm_test_all_ones, _mm_test_mix_ones_zeros): Replace.

"Replace" does not say what it is replaced with.  "Rewrite" maybe?

Okay for trunk either way.  Thanks!


Segher


* Re: [PATCH v3 4/6] rs6000: Support SSE4.1 "cvt" intrinsics
  2021-08-23 19:03 ` [PATCH v3 4/6] rs6000: Support SSE4.1 "cvt" intrinsics Paul A. Clarke
  2021-08-27 13:49   ` Bill Schmidt
@ 2021-10-11 21:52   ` Segher Boessenkool
  2021-10-12  1:51     ` [COMMITTED v4 " Paul A. Clarke
  1 sibling, 1 reply; 47+ messages in thread
From: Segher Boessenkool @ 2021-10-11 21:52 UTC (permalink / raw)
  To: Paul A. Clarke; +Cc: gcc-patches, wschmidt

Hi!

On Mon, Aug 23, 2021 at 02:03:08PM -0500, Paul A. Clarke wrote:
> gcc
> 	* config/rs6000/smmintrin.h (_mm_cvtepi8_epi16, _mm_cvtepi8_epi32,
> 	_mm_cvtepi8_epi64, _mm_cvtepi16_epi32, _mm_cvtepi16_epi64,
> 	_mm_cvtepi32_epi64, _mm_cvtepu8_epi16, _mm_cvtepu8_epi32,
> 	_mm_cvtepu8_epi64, _mm_cvtepu16_epi32, _mm_cvtepu16_epi64,
> 	_mm_cvtepu32_epi64): New.
> 
> gcc/testsuite
> 	* gcc.target/powerpc/sse4_1-pmovsxbd.c: Copy from gcc.target/i386,
> 	adjust dg directives to suit.
> 	* gcc.target/powerpc/sse4_1-pmovsxbq.c: Same.
> 	* gcc.target/powerpc/sse4_1-pmovsxbw.c: Same.
> 	* gcc.target/powerpc/sse4_1-pmovsxdq.c: Same.
> 	* gcc.target/powerpc/sse4_1-pmovsxwd.c: Same.
> 	* gcc.target/powerpc/sse4_1-pmovsxwq.c: Same.
> 	* gcc.target/powerpc/sse4_1-pmovzxbd.c: Same.
> 	* gcc.target/powerpc/sse4_1-pmovzxbq.c: Same.
> 	* gcc.target/powerpc/sse4_1-pmovzxbw.c: Same.
> 	* gcc.target/powerpc/sse4_1-pmovzxdq.c: Same.
> 	* gcc.target/powerpc/sse4_1-pmovzxwd.c: Same.
> 	* gcc.target/powerpc/sse4_1-pmovzxwq.c: Same.

> +extern __inline __m128i
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +_mm_cvtepi8_epi16 (__m128i __A)
> +{
> +  return (__m128i) vec_unpackh ((__v16qi)__A);
> +}

This strange mixture of sometimes writing a cast with a space and
sometimes without one is...  strange :-)

Having up to three unpacks in a row seems suboptimal.  But it certainly
is aesthetically pleasing :-)

> +/* { dg-do run } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-options "-O2 -mvsx" } */

Same as before here too (needs vsx_hw).

Okay for trunk with that fixed.  Thanks!


Segher


* Re: [PATCH v3 1/6] rs6000: Support SSE4.1 "round" intrinsics
  2021-10-11 17:31                 ` Paul A. Clarke
@ 2021-10-11 22:04                   ` Segher Boessenkool
  2021-10-12 19:35                     ` Paul A. Clarke
  0 siblings, 1 reply; 47+ messages in thread
From: Segher Boessenkool @ 2021-10-11 22:04 UTC (permalink / raw)
  To: Paul A. Clarke; +Cc: gcc-patches, wschmidt

Hi!

On Mon, Oct 11, 2021 at 12:31:07PM -0500, Paul A. Clarke wrote:
> On Mon, Oct 11, 2021 at 11:28:39AM -0500, Segher Boessenkool wrote:
> > > Very similar methods are used in glibc today. Are those broken?
> > 
> > Maybe.
> 
> Ouch.

So show the code?

> > If you get a real (i.e. not inline) function call there, that
> > can save you often.
> 
> Calling a real function in order to execute a single instruction is
> sub-optimal. ;-)

Calling a real function (that does not even need a stack frame, just a
blr) is not terribly expensive, either.

> > > Would creating a __builtin_mffsce be another solution?
> > 
> > Yes.  And not a bad idea in the first place.
> 
> The previous "Nope" and this "Yes" seem in contradiction. If there is no
> difference between "asm" and builtin, how does using a builtin solve the
> problem?

You will have to make the builtin solve it.  What a builtin can do is
virtually unlimited.  What an asm can do is not: it just outputs some
assembler language, and does in/out/clobber constraints.  You can do a
*lot* with that, but it is much more limited than everything you can do
in the compiler!  :-)

The fact remains that there is no way in RTL (or Gimple for that matter)
to express things like rounding mode changes.  You will need to
artificially make some barriers.


Segher


* Re: [PATCH v3 5/6] rs6000: Support more SSE4 "cmp", "mul", "pack" intrinsics
  2021-08-23 19:03 ` [PATCH v3 5/6] rs6000: Support more SSE4 "cmp", "mul", "pack" intrinsics Paul A. Clarke
  2021-08-27 15:21   ` Bill Schmidt
@ 2021-10-11 23:07   ` Segher Boessenkool
  2021-10-12  1:55     ` [COMMITTED v4 " Paul A. Clarke
  1 sibling, 1 reply; 47+ messages in thread
From: Segher Boessenkool @ 2021-10-11 23:07 UTC (permalink / raw)
  To: Paul A. Clarke; +Cc: gcc-patches, wschmidt

Hi!

On Mon, Aug 23, 2021 at 02:03:09PM -0500, Paul A. Clarke wrote:
> gcc
> 	* config/rs6000/smmintrin.h (_mm_cmpeq_epi64, _mm_cmpgt_epi64,
> 	_mm_mullo_epi32, _mm_mul_epi32, _mm_packus_epi32): New.
> 	* config/rs6000/nmmintrin.h: Copy from i386, tweak to suit.
> 
> gcc/testsuite
> 	* gcc.target/powerpc/pr78102.c: Copy from gcc.target/i386,
> 	adjust dg directives to suit.
> 	* gcc.target/powerpc/sse4_1-packusdw.c: Same.
> 	* gcc.target/powerpc/sse4_1-pcmpeqq.c: Same.
> 	* gcc.target/powerpc/sse4_1-pmuldq.c: Same.
> 	* gcc.target/powerpc/sse4_1-pmulld.c: Same.
> 	* gcc.target/powerpc/sse4_2-pcmpgtq.c: Same.
> 	* gcc.target/powerpc/sse4_2-check.h: Copy from gcc.target/i386,
> 	tweak to suit.

Okay for trunk (with the vsx_hw thing).  Thanks!


Segher


* Re: [PATCH v3 6/6] rs6000: Guard some x86 intrinsics implementations
  2021-08-23 19:03 ` [PATCH v3 6/6] rs6000: Guard some x86 intrinsics implementations Paul A. Clarke
  2021-08-27 15:25   ` Bill Schmidt
@ 2021-10-12  0:11   ` Segher Boessenkool
  2021-10-13 17:04     ` Paul A. Clarke
  1 sibling, 1 reply; 47+ messages in thread
From: Segher Boessenkool @ 2021-10-12  0:11 UTC (permalink / raw)
  To: Paul A. Clarke; +Cc: gcc-patches, wschmidt

Hi!

On Mon, Aug 23, 2021 at 02:03:10PM -0500, Paul A. Clarke wrote:
> Some compatibility implementations of x86 intrinsics include
> Power intrinsics which require POWER8.  Guard them.

> emmintrin.h:
> - _mm_cmpord_pd: Remove code which was ostensibly for pre-POWER8,
>   but which indeed depended on POWER8 (vec_cmpgt(v2du)/vcmpgtud).
>   The "POWER8" version works fine on pre-POWER8.

Huh.  It just generates xvcmpeqdp I suppose?

> - _mm_mul_epu32: vec_mule(v4su) uses vmuleuw.

Did this fail on p7?  If not, add a test that *does*?

> pmmintrin.h:
> - _mm_movehdup_ps: vec_mergeo(v4su) uses vmrgow.
> - _mm_moveldup_ps: vec_mergee(v4su) uses vmrgew.

Similar.

> smmintrin.h:
> - _mm_cmpeq_epi64: vec_cmpeq(v2di) uses vcmpequd.
> - _mm_mul_epi32: vec_mule(v4si) uses vmuluwm.
> - _mm_cmpgt_epi64: vec_cmpgt(v2di) uses vcmpgtsd.
> tmmintrin.h:
> - _mm_sign_epi8: vec_neg(v4si) uses vsububm.
> - _mm_sign_epi16: vec_neg(v4si) uses vsubuhm.
> - _mm_sign_epi32: vec_neg(v4si) uses vsubuwm.
>   Note that the above three could actually be supported pre-POWER8,
>   but current GCC does not support them before POWER8.
> - _mm_sign_pi8: depends on _mm_sign_epi8.
> - _mm_sign_pi16: depends on _mm_sign_epi16.
> - _mm_sign_pi32: depends on _mm_sign_epi32.

And more.

> gcc
> 	PR target/101893

This is a different bug (the vgbdd one)?

All looks good, but we need such failing tests :-)


Segher


* Re: [PATCH v3 0/6] rs6000: Support more SSE4 intrinsics
  2021-10-08  0:29   ` Paul A. Clarke
@ 2021-10-12  0:15     ` Segher Boessenkool
  0 siblings, 0 replies; 47+ messages in thread
From: Segher Boessenkool @ 2021-10-12  0:15 UTC (permalink / raw)
  To: Paul A. Clarke; +Cc: wschmidt, gcc-patches

On Thu, Oct 07, 2021 at 07:29:26PM -0500, Paul A. Clarke wrote:
> On Thu, Oct 07, 2021 at 05:25:54PM -0500, Segher Boessenkool wrote:
> > On Mon, Aug 23, 2021 at 02:03:04PM -0500, Paul A. Clarke wrote:
> > > v3: Add "nmmintrin.h". _mm_cmpgt_epi64 is part of SSE4.2
> > 
> > There should not be a "v3" in the commit message.  The easy way to
> > achieve this is put it inside the [] in the subject (as you did), and to
> > mention the version history after a --- (see --notes for git-format-patch
> > for example).
> 
> This is just a cover letter. Does it matter in that context?

Ha no, it just confused me apparently :-)


Segher


* Re: [COMMITTED v4 2/6] rs6000: Support SSE4.1 "min" and "max" intrinsics
  2021-10-11 19:28   ` Segher Boessenkool
@ 2021-10-12  1:42     ` Paul A. Clarke
  0 siblings, 0 replies; 47+ messages in thread
From: Paul A. Clarke @ 2021-10-12  1:42 UTC (permalink / raw)
  To: gcc-patches; +Cc: segher, wschmidt

On Mon, Oct 11, 2021 at 02:28:15PM -0500, Segher Boessenkool wrote:
> On Mon, Aug 23, 2021 at 02:03:06PM -0500, Paul A. Clarke wrote:
> > gcc
> > 	* config/rs6000/smmintrin.h (_mm_min_epi8, _mm_min_epu16,
> > 	_mm_min_epi32, _mm_min_epu32, _mm_max_epi8, _mm_max_epu16,
> > 	_mm_max_epi32, _mm_max_epu32): New.
> > 
> > gcc/testsuite
> > 	* gcc.target/powerpc/sse4_1-pmaxsb.c: Copy from gcc.target/i386.
> > 	* gcc.target/powerpc/sse4_1-pmaxsd.c: Same.
> > 	* gcc.target/powerpc/sse4_1-pmaxud.c: Same.
> > 	* gcc.target/powerpc/sse4_1-pmaxuw.c: Same.
> > 	* gcc.target/powerpc/sse4_1-pminsb.c: Same.
> > 	* gcc.target/powerpc/sse4_1-pminsd.c: Same.
> > 	* gcc.target/powerpc/sse4_1-pminud.c: Same.
> > 	* gcc.target/powerpc/sse4_1-pminuw.c: Same.
> 
> Okay for trunk.  Thanks!

The following was committed.

Function signatures and decorations match gcc/config/i386/smmintrin.h.

Also, copy tests for _mm_min_epi8, _mm_min_epu16, _mm_min_epi32,
_mm_min_epu32, _mm_max_epi8, _mm_max_epu16, _mm_max_epi32, _mm_max_epu32
from gcc/testsuite/gcc.target/i386.

sse4_1-pmaxsb.c and sse4_1-pminsb.c were modified from using
"char" types to "signed char" types, because the default is unsigned on
powerpc.

2021-10-11  Paul A. Clarke  <pc@us.ibm.com>

gcc
	* config/rs6000/smmintrin.h (_mm_min_epi8, _mm_min_epu16,
	_mm_min_epi32, _mm_min_epu32, _mm_max_epi8, _mm_max_epu16,
	_mm_max_epi32, _mm_max_epu32): New.

gcc/testsuite
	* gcc.target/powerpc/sse4_1-pmaxsb.c: Copy from gcc.target/i386.
	* gcc.target/powerpc/sse4_1-pmaxsd.c: Same.
	* gcc.target/powerpc/sse4_1-pmaxud.c: Same.
	* gcc.target/powerpc/sse4_1-pmaxuw.c: Same.
	* gcc.target/powerpc/sse4_1-pminsb.c: Same.
	* gcc.target/powerpc/sse4_1-pminsd.c: Same.
	* gcc.target/powerpc/sse4_1-pminud.c: Same.
	* gcc.target/powerpc/sse4_1-pminuw.c: Same.
---
v4: I fixed more "space after cast" and "vsx_hw" issues.

diff --git a/gcc/config/rs6000/smmintrin.h b/gcc/config/rs6000/smmintrin.h
index 3767a67eada7..af782079cbcb 100644
--- a/gcc/config/rs6000/smmintrin.h
+++ b/gcc/config/rs6000/smmintrin.h
@@ -296,6 +296,62 @@ _mm_floor_ss (__m128 __A, __m128 __B)
   return __r;
 }
 
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_min_epi8 (__m128i __X, __m128i __Y)
+{
+  return (__m128i) vec_min ((__v16qi)__X, (__v16qi)__Y);
+}
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_min_epu16 (__m128i __X, __m128i __Y)
+{
+  return (__m128i) vec_min ((__v8hu)__X, (__v8hu)__Y);
+}
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_min_epi32 (__m128i __X, __m128i __Y)
+{
+  return (__m128i) vec_min ((__v4si)__X, (__v4si)__Y);
+}
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_min_epu32 (__m128i __X, __m128i __Y)
+{
+  return (__m128i) vec_min ((__v4su)__X, (__v4su)__Y);
+}
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_max_epi8 (__m128i __X, __m128i __Y)
+{
+  return (__m128i) vec_max ((__v16qi)__X, (__v16qi)__Y);
+}
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_max_epu16 (__m128i __X, __m128i __Y)
+{
+  return (__m128i) vec_max ((__v8hu)__X, (__v8hu)__Y);
+}
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_max_epi32 (__m128i __X, __m128i __Y)
+{
+  return (__m128i) vec_max ((__v4si)__X, (__v4si)__Y);
+}
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_max_epu32 (__m128i __X, __m128i __Y)
+{
+  return (__m128i) vec_max ((__v4su)__X, (__v4su)__Y);
+}
+
 /* Return horizontal packed word minimum and its index in bits [15:0]
    and bits [18:16] respectively.  */
 __inline __m128i
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxsb.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxsb.c
new file mode 100644
index 000000000000..33f168b712ea
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxsb.c
@@ -0,0 +1,46 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_hw } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 1024
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 16];
+      signed char i[NUM];
+    } dst, src1, src2;
+  int i, sign = 1;
+  signed char max;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src1.i[i] = i * i * sign;
+      src2.i[i] = (i + 20) * sign;
+      sign = -sign;
+    }
+
+  for (i = 0; i < NUM; i += 16)
+    dst.x[i / 16] = _mm_max_epi8 (src1.x[i / 16], src2.x[i / 16]);
+
+  for (i = 0; i < NUM; i++)
+    {
+      max = src1.i[i] <= src2.i[i] ? src2.i[i] : src1.i[i];
+      if (max != dst.i[i])
+	abort ();
+    }
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxsd.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxsd.c
new file mode 100644
index 000000000000..60b342587ddb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxsd.c
@@ -0,0 +1,46 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_hw } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 64
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 4];
+      int i[NUM];
+    } dst, src1, src2;
+  int i, sign = 1;
+  int max;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src1.i[i] = i * i * sign;
+      src2.i[i] = (i + 20) * sign;
+      sign = -sign;
+    }
+
+  for (i = 0; i < NUM; i += 4)
+    dst.x[i / 4] = _mm_max_epi32 (src1.x[i / 4], src2.x[i / 4]);
+
+  for (i = 0; i < NUM; i++)
+    {
+      max = src1.i[i] <= src2.i[i] ? src2.i[i] : src1.i[i];
+      if (max != dst.i[i])
+	abort ();
+    }
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxud.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxud.c
new file mode 100644
index 000000000000..a6e9ffa711e1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxud.c
@@ -0,0 +1,47 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_hw } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 64
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 4];
+      unsigned int i[NUM];
+    } dst, src1, src2;
+  int i;
+  unsigned int max;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src1.i[i] = i * i;
+      src2.i[i] = i + 20;
+      if ((i % 4))
+	src2.i[i] |= 0x80000000;
+    }
+
+  for (i = 0; i < NUM; i += 4)
+    dst.x[i / 4] = _mm_max_epu32 (src1.x[i / 4], src2.x[i / 4]);
+
+  for (i = 0; i < NUM; i++)
+    {
+      max = src1.i[i] <= src2.i[i] ? src2.i[i] : src1.i[i];
+      if (max != dst.i[i])
+	abort ();
+    }
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxuw.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxuw.c
new file mode 100644
index 000000000000..826db1efe1f5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmaxuw.c
@@ -0,0 +1,47 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_hw } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 64
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 8];
+      unsigned short i[NUM];
+    } dst, src1, src2;
+  int i;
+  unsigned short max;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src1.i[i] = i * i;
+      src2.i[i] = i + 20;
+      if ((i % 8))
+	src2.i[i] |= 0x8000;
+    }
+
+  for (i = 0; i < NUM; i += 8)
+    dst.x[i / 8] = _mm_max_epu16 (src1.x[i / 8], src2.x[i / 8]);
+
+  for (i = 0; i < NUM; i++)
+    {
+      max = src1.i[i] <= src2.i[i] ? src2.i[i] : src1.i[i];
+      if (max != dst.i[i])
+	abort ();
+    }
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pminsb.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pminsb.c
new file mode 100644
index 000000000000..74a395882e79
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pminsb.c
@@ -0,0 +1,46 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_hw } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 1024
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 16];
+      signed char i[NUM];
+    } dst, src1, src2;
+  int i, sign = 1;
+  signed char min;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src1.i[i] = i * i * sign;
+      src2.i[i] = (i + 20) * sign;
+      sign = -sign;
+    }
+
+  for (i = 0; i < NUM; i += 16)
+    dst.x[i / 16] = _mm_min_epi8 (src1.x[i / 16], src2.x[i / 16]);
+
+  for (i = 0; i < NUM; i++)
+    {
+      min = src1.i[i] >= src2.i[i] ? src2.i[i] : src1.i[i];
+      if (min != dst.i[i])
+	abort ();
+    }
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pminsd.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pminsd.c
new file mode 100644
index 000000000000..36aab228fcf3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pminsd.c
@@ -0,0 +1,46 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_hw } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 64
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 4];
+      int i[NUM];
+    } dst, src1, src2;
+  int i, sign = 1;
+  int min;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src1.i[i] = i * i * sign;
+      src2.i[i] = (i + 20) * sign;
+      sign = -sign;
+    }
+
+  for (i = 0; i < NUM; i += 4)
+    dst.x[i / 4] = _mm_min_epi32 (src1.x[i / 4], src2.x[i / 4]);
+
+  for (i = 0; i < NUM; i++)
+    {
+      min = src1.i[i] >= src2.i[i] ? src2.i[i] : src1.i[i];
+      if (min != dst.i[i])
+	abort ();
+    }
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pminud.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pminud.c
new file mode 100644
index 000000000000..972e15124ca9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pminud.c
@@ -0,0 +1,47 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_hw } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 64
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 4];
+      unsigned int i[NUM];
+    } dst, src1, src2;
+  int i;
+  unsigned int min;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src1.i[i] = i * i;
+      src2.i[i] = i + 20;
+      if ((i % 4))
+	src2.i[i] |= 0x80000000;
+    }
+
+  for (i = 0; i < NUM; i += 4)
+    dst.x[i / 4] = _mm_min_epu32 (src1.x[i / 4], src2.x[i / 4]);
+
+  for (i = 0; i < NUM; i++)
+    {
+      min = src1.i[i] >= src2.i[i] ? src2.i[i] : src1.i[i];
+      if (min != dst.i[i])
+	abort ();
+    }
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pminuw.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pminuw.c
new file mode 100644
index 000000000000..4fe7d3aabf5c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pminuw.c
@@ -0,0 +1,47 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_hw } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 64
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 8];
+      unsigned short i[NUM];
+    } dst, src1, src2;
+  int i;
+  unsigned short min;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src1.i[i] = i * i;
+      src2.i[i] = i + 20;
+      if ((i % 8))
+	src2.i[i] |= 0x8000;
+    }
+
+  for (i = 0; i < NUM; i += 8)
+    dst.x[i / 8] = _mm_min_epu16 (src1.x[i / 8], src2.x[i / 8]);
+
+  for (i = 0; i < NUM; i++)
+    {
+      min = src1.i[i] >= src2.i[i] ? src2.i[i] : src1.i[i];
+      if (min != dst.i[i])
+	abort ();
+    }
+}

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [COMMITTED v4 3/6] rs6000: Simplify some SSE4.1 "test" intrinsics
  2021-10-11 20:50   ` Segher Boessenkool
@ 2021-10-12  1:47     ` Paul A. Clarke
  0 siblings, 0 replies; 47+ messages in thread
From: Paul A. Clarke @ 2021-10-12  1:47 UTC (permalink / raw)
  To: gcc-patches; +Cc: segher, wschmidt

On Mon, Oct 11, 2021 at 03:50:31PM -0500, Segher Boessenkool wrote:
> On Mon, Aug 23, 2021 at 02:03:07PM -0500, Paul A. Clarke wrote:
> > gcc
> > 	* config/rs6000/smmintrin.h (_mm_test_all_zeros,
> > 	_mm_test_all_ones, _mm_test_mix_ones_zeros): Replace.
> 
> "Replace" does not say what it is replaced with.  "Rewrite" maybe?
> 
> Okay for trunk either way.  Thanks!

This was committed:

Copy some simple redirections from i386 <smmintrin.h>, for:
- _mm_test_all_zeros
- _mm_test_all_ones
- _mm_test_mix_ones_zeros

2021-10-11  Paul A. Clarke  <pc@us.ibm.com>

gcc
	* config/rs6000/smmintrin.h (_mm_test_all_zeros,
	_mm_test_all_ones, _mm_test_mix_ones_zeros): Rewrite as macro.
--
v4: tweak commit message.

diff --git a/gcc/config/rs6000/smmintrin.h b/gcc/config/rs6000/smmintrin.h
index af782079cbcb..f935ab060abc 100644
--- a/gcc/config/rs6000/smmintrin.h
+++ b/gcc/config/rs6000/smmintrin.h
@@ -203,34 +203,12 @@ _mm_testnzc_si128 (__m128i __A, __m128i __B)
   return _mm_testz_si128 (__A, __B) == 0 && _mm_testc_si128 (__A, __B) == 0;
 }
 
-__inline int
-__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
-_mm_test_all_zeros (__m128i __A, __m128i __mask)
-{
-  const __v16qu __zero = {0};
-  return vec_all_eq (vec_and ((__v16qu) __A, (__v16qu) __mask), __zero);
-}
+#define _mm_test_all_zeros(M, V) _mm_testz_si128 ((M), (V))
 
-__inline int
-__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
-_mm_test_all_ones (__m128i __A)
-{
-  const __v16qu __ones = vec_splats ((unsigned char) 0xff);
-  return vec_all_eq ((__v16qu) __A, __ones);
-}
+#define _mm_test_all_ones(V) \
+  _mm_testc_si128 ((V), _mm_cmpeq_epi32 ((V), (V)))
 
-__inline int
-__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
-_mm_test_mix_ones_zeros (__m128i __A, __m128i __mask)
-{
-  const __v16qu __zero = {0};
-  const __v16qu __Amasked = vec_and ((__v16qu) __A, (__v16qu) __mask);
-  const int any_ones = vec_any_ne (__Amasked, __zero);
-  const __v16qu __notA = vec_nor ((__v16qu) __A, (__v16qu) __A);
-  const __v16qu __notAmasked = vec_and ((__v16qu) __notA, (__v16qu) __mask);
-  const int any_zeros = vec_any_ne (__notAmasked, __zero);
-  return any_ones * any_zeros;
-}
+#define _mm_test_mix_ones_zeros(M, V) _mm_testnzc_si128 ((M), (V))
 
 __inline __m128d
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))



* Re: [COMMITTED v4 4/6] rs6000: Support SSE4.1 "cvt" intrinsics
  2021-10-11 21:52   ` Segher Boessenkool
@ 2021-10-12  1:51     ` Paul A. Clarke
  0 siblings, 0 replies; 47+ messages in thread
From: Paul A. Clarke @ 2021-10-12  1:51 UTC (permalink / raw)
  To: gcc-patches; +Cc: segher, wschmidt

On Mon, Oct 11, 2021 at 04:52:44PM -0500, Segher Boessenkool wrote:
> On Mon, Aug 23, 2021 at 02:03:08PM -0500, Paul A. Clarke wrote:
[...]
> > +extern __inline __m128i
> > +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> > +_mm_cvtepi8_epi16 (__m128i __A)
> > +{
> > +  return (__m128i) vec_unpackh ((__v16qi)__A);
> > +}
> 
> This strange mixture of sometimes writing a cast with a space and
> sometimes without one is...  strange :-)
> 
> Having up to three unpacks in a row seems suboptimal.  But it certainly
> is aesthetically pleasing :-)
> 
> > +/* { dg-do run } */
> > +/* { dg-require-effective-target powerpc_vsx_ok } */
> > +/* { dg-options "-O2 -mvsx" } */
> 
> Same as before here too (needs vsx_hw).
> 
> Okay for trunk with that fixed.  Thanks!

This was committed:

rs6000: Support SSE4.1 "cvt" intrinsics

Function signatures and decorations match gcc/config/i386/smmintrin.h.

Also, copy tests for:
- _mm_cvtepi8_epi16, _mm_cvtepi8_epi32, _mm_cvtepi8_epi64
- _mm_cvtepi16_epi32, _mm_cvtepi16_epi64
- _mm_cvtepi32_epi64,
- _mm_cvtepu8_epi16, _mm_cvtepu8_epi32, _mm_cvtepu8_epi64
- _mm_cvtepu16_epi32, _mm_cvtepu16_epi64
- _mm_cvtepu32_epi64

from gcc/testsuite/gcc.target/i386.

sse4_1-pmovsxbd.c, sse4_1-pmovsxbq.c, and sse4_1-pmovsxbw.c were
modified from using "char" types to "signed char" types, because
the default is unsigned on powerpc.

2021-10-11  Paul A. Clarke  <pc@us.ibm.com>

gcc
        * config/rs6000/smmintrin.h (_mm_cvtepi8_epi16, _mm_cvtepi8_epi32,
        _mm_cvtepi8_epi64, _mm_cvtepi16_epi32, _mm_cvtepi16_epi64,
        _mm_cvtepi32_epi64, _mm_cvtepu8_epi16, _mm_cvtepu8_epi32,
        _mm_cvtepu8_epi64, _mm_cvtepu16_epi32, _mm_cvtepu16_epi64,
        _mm_cvtepu32_epi64): New.

gcc/testsuite
        * gcc.target/powerpc/sse4_1-pmovsxbd.c: Copy from gcc.target/i386,
        adjust dg directives to suit.
        * gcc.target/powerpc/sse4_1-pmovsxbq.c: Same.
        * gcc.target/powerpc/sse4_1-pmovsxbw.c: Same.
        * gcc.target/powerpc/sse4_1-pmovsxdq.c: Same.
        * gcc.target/powerpc/sse4_1-pmovsxwd.c: Same.
        * gcc.target/powerpc/sse4_1-pmovsxwq.c: Same.
        * gcc.target/powerpc/sse4_1-pmovzxbd.c: Same.
        * gcc.target/powerpc/sse4_1-pmovzxbq.c: Same.
        * gcc.target/powerpc/sse4_1-pmovzxbw.c: Same.
        * gcc.target/powerpc/sse4_1-pmovzxdq.c: Same.
        * gcc.target/powerpc/sse4_1-pmovzxwd.c: Same.
        * gcc.target/powerpc/sse4_1-pmovzxwq.c: Same.
---
v4: Fix "space after cast" and "vsx_ok" issues, per Segher review.

diff --git a/gcc/config/rs6000/smmintrin.h b/gcc/config/rs6000/smmintrin.h
index f935ab060abc..ad6b68e13cce 100644
--- a/gcc/config/rs6000/smmintrin.h
+++ b/gcc/config/rs6000/smmintrin.h
@@ -330,6 +330,144 @@ _mm_max_epu32 (__m128i __X, __m128i __Y)
   return (__m128i) vec_max ((__v4su)__X, (__v4su)__Y);
 }
 
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cvtepi8_epi16 (__m128i __A)
+{
+  return (__m128i) vec_unpackh ((__v16qi) __A);
+}
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cvtepi8_epi32 (__m128i __A)
+{
+  __A = (__m128i) vec_unpackh ((__v16qi) __A);
+  return (__m128i) vec_unpackh ((__v8hi) __A);
+}
+
+#ifdef _ARCH_PWR8
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cvtepi8_epi64 (__m128i __A)
+{
+  __A = (__m128i) vec_unpackh ((__v16qi) __A);
+  __A = (__m128i) vec_unpackh ((__v8hi) __A);
+  return (__m128i) vec_unpackh ((__v4si) __A);
+}
+#endif
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cvtepi16_epi32 (__m128i __A)
+{
+  return (__m128i) vec_unpackh ((__v8hi) __A);
+}
+
+#ifdef _ARCH_PWR8
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cvtepi16_epi64 (__m128i __A)
+{
+  __A = (__m128i) vec_unpackh ((__v8hi) __A);
+  return (__m128i) vec_unpackh ((__v4si) __A);
+}
+#endif
+
+#ifdef _ARCH_PWR8
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cvtepi32_epi64 (__m128i __A)
+{
+  return (__m128i) vec_unpackh ((__v4si) __A);
+}
+#endif
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cvtepu8_epi16 (__m128i __A)
+{
+  const __v16qu __zero = {0};
+#ifdef __LITTLE_ENDIAN__
+  __A = (__m128i) vec_mergeh ((__v16qu) __A, __zero);
+#else /* __BIG_ENDIAN__.  */
+  __A = (__m128i) vec_mergeh (__zero, (__v16qu) __A);
+#endif /* __BIG_ENDIAN__.  */
+  return __A;
+}
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cvtepu8_epi32 (__m128i __A)
+{
+  const __v16qu __zero = {0};
+#ifdef __LITTLE_ENDIAN__
+  __A = (__m128i) vec_mergeh ((__v16qu) __A, __zero);
+  __A = (__m128i) vec_mergeh ((__v8hu) __A, (__v8hu) __zero);
+#else /* __BIG_ENDIAN__.  */
+  __A = (__m128i) vec_mergeh (__zero, (__v16qu) __A);
+  __A = (__m128i) vec_mergeh ((__v8hu) __zero, (__v8hu) __A);
+#endif /* __BIG_ENDIAN__.  */
+  return __A;
+}
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cvtepu8_epi64 (__m128i __A)
+{
+  const __v16qu __zero = {0};
+#ifdef __LITTLE_ENDIAN__
+  __A = (__m128i) vec_mergeh ((__v16qu) __A, __zero);
+  __A = (__m128i) vec_mergeh ((__v8hu) __A, (__v8hu) __zero);
+  __A = (__m128i) vec_mergeh ((__v4su) __A, (__v4su) __zero);
+#else /* __BIG_ENDIAN__.  */
+  __A = (__m128i) vec_mergeh (__zero, (__v16qu) __A);
+  __A = (__m128i) vec_mergeh ((__v8hu) __zero, (__v8hu) __A);
+  __A = (__m128i) vec_mergeh ((__v4su) __zero, (__v4su) __A);
+#endif /* __BIG_ENDIAN__.  */
+  return __A;
+}
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cvtepu16_epi32 (__m128i __A)
+{
+  const __v8hu __zero = {0};
+#ifdef __LITTLE_ENDIAN__
+  __A = (__m128i) vec_mergeh ((__v8hu) __A, __zero);
+#else /* __BIG_ENDIAN__.  */
+  __A = (__m128i) vec_mergeh (__zero, (__v8hu) __A);
+#endif /* __BIG_ENDIAN__.  */
+  return __A;
+}
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cvtepu16_epi64 (__m128i __A)
+{
+  const __v8hu __zero = {0};
+#ifdef __LITTLE_ENDIAN__
+  __A = (__m128i) vec_mergeh ((__v8hu) __A, __zero);
+  __A = (__m128i) vec_mergeh ((__v4su) __A, (__v4su) __zero);
+#else /* __BIG_ENDIAN__.  */
+  __A = (__m128i) vec_mergeh (__zero, (__v8hu) __A);
+  __A = (__m128i) vec_mergeh ((__v4su) __zero, (__v4su) __A);
+#endif /* __BIG_ENDIAN__.  */
+  return __A;
+}
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cvtepu32_epi64 (__m128i __A)
+{
+  const __v4su __zero = {0};
+#ifdef __LITTLE_ENDIAN__
+  __A = (__m128i) vec_mergeh ((__v4su) __A, __zero);
+#else /* __BIG_ENDIAN__.  */
+  __A = (__m128i) vec_mergeh (__zero, (__v4su) __A);
+#endif /* __BIG_ENDIAN__.  */
+  return __A;
+}
+
 /* Return horizontal packed word minimum and its index in bits [15:0]
    and bits [18:16] respectively.  */
 __inline __m128i
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbd.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbd.c
new file mode 100644
index 000000000000..99cca6150ea4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbd.c
@@ -0,0 +1,42 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_hw } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 128
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 4];
+      int i[NUM];
+      signed char c[NUM * 4];
+    } dst, src;
+  int i, sign = 1;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src.c[(i % 4) + (i / 4) * 16] = i * i * sign;
+      sign = -sign;
+    }
+
+  for (i = 0; i < NUM; i += 4)
+    dst.x [i / 4] = _mm_cvtepi8_epi32 (src.x [i / 4]);
+
+  for (i = 0; i < NUM; i++)
+    if (src.c[(i % 4) + (i / 4) * 16] != dst.i[i])
+      abort ();
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbq.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbq.c
new file mode 100644
index 000000000000..9ec1ab7a4169
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbq.c
@@ -0,0 +1,42 @@
+/* { dg-do run } */
+/* { dg-require-effective-target p8vector_hw } */
+/* { dg-options "-O2 -mpower8-vector" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 128
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 2];
+      long long ll[NUM];
+      signed char c[NUM * 8];
+    } dst, src;
+  int i, sign = 1;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src.c[(i % 2) + (i / 2) * 16] = i * i * sign;
+      sign = -sign;
+    }
+
+  for (i = 0; i < NUM; i += 2)
+    dst.x [i / 2] = _mm_cvtepi8_epi64 (src.x [i / 2]);
+
+  for (i = 0; i < NUM; i++)
+    if (src.c[(i % 2) + (i / 2) * 16] != dst.ll[i])
+      abort ();
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbw.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbw.c
new file mode 100644
index 000000000000..805897d929b1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxbw.c
@@ -0,0 +1,42 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_hw } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 128
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 8];
+      short s[NUM];
+      signed char c[NUM * 2];
+    } dst, src;
+  int i, sign = 1;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src.c[(i % 8) + (i / 8) * 16] = i * i * sign;
+      sign = -sign;
+    }
+
+  for (i = 0; i < NUM; i += 8)
+    dst.x [i / 8] = _mm_cvtepi8_epi16 (src.x [i / 8]);
+
+  for (i = 0; i < NUM; i++)
+    if (src.c[(i % 8) + (i / 8) * 16] != dst.s[i])
+      abort ();
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxdq.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxdq.c
new file mode 100644
index 000000000000..1c263782240a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxdq.c
@@ -0,0 +1,42 @@
+/* { dg-do run } */
+/* { dg-require-effective-target p8vector_hw } */
+/* { dg-options "-O2 -mpower8-vector" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 128
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 2];
+      long long ll[NUM];
+      int i[NUM * 2];
+    } dst, src;
+  int i, sign = 1;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src.i[(i % 2) + (i / 2) * 4] = i * i * sign;
+      sign = -sign;
+    }
+
+  for (i = 0; i < NUM; i += 2)
+    dst.x [i / 2] = _mm_cvtepi32_epi64 (src.x [i / 2]);
+
+  for (i = 0; i < NUM; i++)
+    if (src.i[(i % 2) + (i / 2) * 4] != dst.ll[i])
+      abort ();
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxwd.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxwd.c
new file mode 100644
index 000000000000..43f30f024390
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxwd.c
@@ -0,0 +1,42 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_hw } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 128
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 4];
+      int i[NUM];
+      short s[NUM * 2];
+    } dst, src;
+  int i, sign = 1;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src.s[(i % 4) + (i / 4) * 8] = i * i * sign;
+      sign = -sign;
+    }
+
+  for (i = 0; i < NUM; i += 4)
+    dst.x [i / 4] = _mm_cvtepi16_epi32 (src.x [i / 4]);
+
+  for (i = 0; i < NUM; i++)
+    if (src.s[(i % 4) + (i / 4) * 8] != dst.i[i])
+      abort ();
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxwq.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxwq.c
new file mode 100644
index 000000000000..67864695a113
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovsxwq.c
@@ -0,0 +1,42 @@
+/* { dg-do run } */
+/* { dg-require-effective-target p8vector_hw } */
+/* { dg-options "-O2 -mpower8-vector" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 128
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 2];
+      long long ll[NUM];
+      short s[NUM * 4];
+    } dst, src;
+  int i, sign = 1;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src.s[(i % 2) + (i / 2) * 8] = i * i * sign;
+      sign = -sign;
+    }
+
+  for (i = 0; i < NUM; i += 2)
+    dst.x [i / 2] = _mm_cvtepi16_epi64 (src.x [i / 2]);
+
+  for (i = 0; i < NUM; i++)
+    if (src.s[(i % 2) + (i / 2) * 8] != dst.ll[i])
+      abort ();
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbd.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbd.c
new file mode 100644
index 000000000000..643a2a6abf3c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbd.c
@@ -0,0 +1,43 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_hw } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 128
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 4];
+      unsigned int i[NUM];
+      unsigned char c[NUM * 4];
+    } dst, src;
+  int i;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src.c[(i % 4) + (i / 4) * 16] = i * i;
+      if ((i % 4))
+	src.c[(i % 4) + (i / 4) * 16] |= 0x80;
+    }
+
+  for (i = 0; i < NUM; i += 4)
+    dst.x [i / 4] = _mm_cvtepu8_epi32 (src.x [i / 4]);
+
+  for (i = 0; i < NUM; i++)
+    if (src.c[(i % 4) + (i / 4) * 16] != dst.i[i])
+      abort ();
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbq.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbq.c
new file mode 100644
index 000000000000..871f425c80eb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbq.c
@@ -0,0 +1,43 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_hw } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 128
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 2];
+      unsigned long long ll[NUM];
+      unsigned char c[NUM * 8];
+    } dst, src;
+  int i;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src.c[(i % 2) + (i / 2) * 16] = i * i;
+      if ((i % 2))
+	src.c[(i % 2) + (i / 2) * 16] |= 0x80;
+    }
+
+  for (i = 0; i < NUM; i += 2)
+    dst.x [i / 2] = _mm_cvtepu8_epi64 (src.x [i / 2]);
+
+  for (i = 0; i < NUM; i++)
+    if (src.c[(i % 2) + (i / 2) * 16] != dst.ll[i])
+      abort ();
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbw.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbw.c
new file mode 100644
index 000000000000..ee89ebc805fe
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxbw.c
@@ -0,0 +1,43 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_hw } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 128
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 8];
+      unsigned short s[NUM];
+      unsigned char c[NUM * 2];
+    } dst, src;
+  int i;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src.c[(i % 8) + (i / 8) * 16] = i * i;
+      if ((i % 4))
+	src.c[(i % 8) + (i / 8) * 16] |= 0x80;
+    }
+
+  for (i = 0; i < NUM; i += 8)
+    dst.x [i / 8] = _mm_cvtepu8_epi16 (src.x [i / 8]);
+
+  for (i = 0; i < NUM; i++)
+    if (src.c[(i % 8) + (i / 8) * 16] != dst.s[i])
+      abort ();
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxdq.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxdq.c
new file mode 100644
index 000000000000..3ec28ab263bc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxdq.c
@@ -0,0 +1,43 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_hw } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 128
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 2];
+      unsigned long long ll[NUM];
+      unsigned int i[NUM * 2];
+    } dst, src;
+  int i;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src.i[(i % 2) + (i / 2) * 4] = i * i;
+      if ((i % 2))
+        src.i[(i % 2) + (i / 2) * 4] |= 0x80000000;
+    }
+
+  for (i = 0; i < NUM; i += 2)
+    dst.x [i / 2] = _mm_cvtepu32_epi64 (src.x [i / 2]);
+
+  for (i = 0; i < NUM; i++)
+    if (src.i[(i % 2) + (i / 2) * 4] != dst.ll[i])
+      abort ();
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxwd.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxwd.c
new file mode 100644
index 000000000000..decd9ff7f9ef
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxwd.c
@@ -0,0 +1,43 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_hw } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 128
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 4];
+      unsigned int i[NUM];
+      unsigned short s[NUM * 2];
+    } dst, src;
+  int i;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src.s[(i % 4) + (i / 4) * 8] = i * i;
+      if ((i % 4))
+	src.s[(i % 4) + (i / 4) * 8] |= 0x8000;
+    }
+
+  for (i = 0; i < NUM; i += 4)
+    dst.x [i / 4] = _mm_cvtepu16_epi32 (src.x [i / 4]);
+
+  for (i = 0; i < NUM; i++)
+    if (src.s[(i % 4) + (i / 4) * 8] != dst.i[i])
+      abort ();
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxwq.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxwq.c
new file mode 100644
index 000000000000..03830448d173
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmovzxwq.c
@@ -0,0 +1,43 @@
+/* { dg-do run } */
+/* { dg-require-effective-target powerpc_vsx_hw } */
+/* { dg-options "-O2 -mvsx" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 128
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 2];
+      unsigned long long ll[NUM];
+      unsigned short s[NUM * 4];
+    } dst, src;
+  int i;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src.s[(i % 2) + (i / 2) * 8] = i * i;
+      if ((i % 2))
+	src.s[(i % 2) + (i / 2) * 8] |= 0x8000;
+    }
+
+  for (i = 0; i < NUM; i += 2)
+    dst.x [i / 2] = _mm_cvtepu16_epi64 (src.x [i / 2]);
+
+  for (i = 0; i < NUM; i++)
+    if (src.s[(i % 2) + (i / 2) * 8] != dst.ll[i])
+      abort ();
+}


* Re: [COMMITTED v4 5/6] rs6000: Support more SSE4 "cmp", "mul", "pack" intrinsics
  2021-10-11 23:07   ` Segher Boessenkool
@ 2021-10-12  1:55     ` Paul A. Clarke
  0 siblings, 0 replies; 47+ messages in thread
From: Paul A. Clarke @ 2021-10-12  1:55 UTC (permalink / raw)
  To: gcc-patches; +Cc: segher, wschmidt

On Mon, Oct 11, 2021 at 06:07:35PM -0500, Segher Boessenkool wrote:
> On Mon, Aug 23, 2021 at 02:03:09PM -0500, Paul A. Clarke wrote:
> > gcc
> > 	* config/rs6000/smmintrin.h (_mm_cmpeq_epi64, _mm_cmpgt_epi64,
> > 	_mm_mullo_epi32, _mm_mul_epi32, _mm_packus_epi32): New.
> > 	* config/rs6000/nmmintrin.h: Copy from i386, tweak to suit.
> > 
> > gcc/testsuite
> > 	* gcc.target/powerpc/pr78102.c: Copy from gcc.target/i386,
> > 	adjust dg directives to suit.
> > 	* gcc.target/powerpc/sse4_1-packusdw.c: Same.
> > 	* gcc.target/powerpc/sse4_1-pcmpeqq.c: Same.
> > 	* gcc.target/powerpc/sse4_1-pmuldq.c: Same.
> > 	* gcc.target/powerpc/sse4_1-pmulld.c: Same.
> > 	* gcc.target/powerpc/sse4_2-pcmpgtq.c: Same.
> > 	* gcc.target/powerpc/sse4_2-check.h: Copy from gcc.target/i386,
> > 	tweak to suit.
> 
> Okay for trunk (with the vsx_hw thing).  Thanks!

This was committed:

rs6000: Support more SSE4 "cmp", "mul", "pack" intrinsics

Function signatures and decorations match gcc/config/i386/smmintrin.h.

Also, copy tests for:
- _mm_cmpeq_epi64
- _mm_mullo_epi32, _mm_mul_epi32
- _mm_packus_epi32
- _mm_cmpgt_epi64 (SSE4.2)

from gcc/testsuite/gcc.target/i386.

2021-10-11  Paul A. Clarke  <pc@us.ibm.com>

gcc
        * config/rs6000/smmintrin.h (_mm_cmpeq_epi64, _mm_cmpgt_epi64,
        _mm_mullo_epi32, _mm_mul_epi32, _mm_packus_epi32): New.
        * config/rs6000/nmmintrin.h: Copy from i386, tweak to suit.

gcc/testsuite
        * gcc.target/powerpc/pr78102.c: Copy from gcc.target/i386,
        adjust dg directives to suit.
        * gcc.target/powerpc/sse4_1-packusdw.c: Same.
        * gcc.target/powerpc/sse4_1-pcmpeqq.c: Same.
        * gcc.target/powerpc/sse4_1-pmuldq.c: Same.
        * gcc.target/powerpc/sse4_1-pmulld.c: Same.
        * gcc.target/powerpc/sse4_2-pcmpgtq.c: Same.
        * gcc.target/powerpc/sse4_2-check.h: Copy from gcc.target/i386,
        tweak to suit.
---
v4: Fix "space after cast" and "vsx_hw" issues, per Segher review.

diff --git a/gcc/config/rs6000/nmmintrin.h b/gcc/config/rs6000/nmmintrin.h
new file mode 100644
index 000000000000..20a70bee3776
--- /dev/null
+++ b/gcc/config/rs6000/nmmintrin.h
@@ -0,0 +1,40 @@
+/* Copyright (C) 2021 Free Software Foundation, Inc.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   GCC is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef NO_WARN_X86_INTRINSICS
+/* This header is distributed to simplify porting x86_64 code that
+   makes explicit use of Intel intrinsics to powerpc64le.
+   It is the user's responsibility to determine if the results are
+   acceptable and make additional changes as necessary.
+   Note that much code that uses Intel intrinsics can be rewritten in
+   standard C or GNU C extensions, which are more portable and better
+   optimized across multiple targets.  */
+#endif
+
+#ifndef _NMMINTRIN_H_INCLUDED
+#define _NMMINTRIN_H_INCLUDED
+
+/* We just include SSE4.1 header file.  */
+#include <smmintrin.h>
+
+#endif /* _NMMINTRIN_H_INCLUDED */
diff --git a/gcc/config/rs6000/smmintrin.h b/gcc/config/rs6000/smmintrin.h
index ad6b68e13cce..90ce03d22709 100644
--- a/gcc/config/rs6000/smmintrin.h
+++ b/gcc/config/rs6000/smmintrin.h
@@ -274,6 +274,15 @@ _mm_floor_ss (__m128 __A, __m128 __B)
   return __r;
 }
 
+#ifdef _ARCH_PWR8
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cmpeq_epi64 (__m128i __X, __m128i __Y)
+{
+  return (__m128i) vec_cmpeq ((__v2di) __X, (__v2di) __Y);
+}
+#endif
+
 extern __inline __m128i
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm_min_epi8 (__m128i __X, __m128i __Y)
@@ -332,6 +341,22 @@ _mm_max_epu32 (__m128i __X, __m128i __Y)
 
 extern __inline __m128i
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_mullo_epi32 (__m128i __X, __m128i __Y)
+{
+  return (__m128i) vec_mul ((__v4su) __X, (__v4su) __Y);
+}
+
+#ifdef _ARCH_PWR8
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_mul_epi32 (__m128i __X, __m128i __Y)
+{
+  return (__m128i) vec_mule ((__v4si) __X, (__v4si) __Y);
+}
+#endif
+
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm_cvtepi8_epi16 (__m128i __A)
 {
   return (__m128i) vec_unpackh ((__v16qi) __A);
@@ -495,4 +520,20 @@ _mm_minpos_epu16 (__m128i __A)
   return __r.__m;
 }
 
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_packus_epi32 (__m128i __X, __m128i __Y)
+{
+  return (__m128i) vec_packsu ((__v4si) __X, (__v4si) __Y);
+}
+
+#ifdef _ARCH_PWR8
+extern __inline __m128i
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cmpgt_epi64 (__m128i __X, __m128i __Y)
+{
+  return (__m128i) vec_cmpgt ((__v2di) __X, (__v2di) __Y);
+}
+#endif
+
 #endif
diff --git a/gcc/testsuite/gcc.target/powerpc/pr78102.c b/gcc/testsuite/gcc.target/powerpc/pr78102.c
new file mode 100644
index 000000000000..68898c7f9428
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr78102.c
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mvsx" } */
+/* { dg-require-effective-target powerpc_vsx_hw } */
+
+#include <x86intrin.h>
+
+__m128i
+foo (const __m128i x, const __m128i y)
+{
+  return _mm_cmpeq_epi64 (x, y);
+}
+
+__v2di
+bar (const __v2di x, const __v2di y)
+{
+  return x == y;
+}
+
+__v2di
+baz (const __v2di x, const __v2di y)
+{
+  return x != y;
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-packusdw.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-packusdw.c
new file mode 100644
index 000000000000..8b757a267468
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-packusdw.c
@@ -0,0 +1,73 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -mvsx" } */
+/* { dg-require-effective-target powerpc_vsx_hw } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 64
+
+static unsigned short
+int_to_ushort (int iVal)
+{
+  unsigned short sVal;
+
+  if (iVal < 0)
+    sVal = 0;
+  else if (iVal > 0xffff)
+    sVal = 0xffff;
+  else sVal = iVal;
+
+  return sVal;
+}
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 4];
+      int i[NUM];
+    } src1, src2;
+  union
+    {
+      __m128i x[NUM / 4];
+      unsigned short s[NUM * 2];
+    } dst;
+  int i, sign = 1;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src1.i[i] = i * i * sign;
+      src2.i[i] = (i + 20) * sign;
+      sign = -sign;
+    }
+
+  for (i = 0; i < NUM; i += 4)
+    dst.x[i / 4] = _mm_packus_epi32 (src1.x [i / 4], src2.x [i / 4]);
+
+  for (i = 0; i < NUM; i ++)
+    {
+      int dstIndex;
+      unsigned short sVal;
+
+      sVal = int_to_ushort (src1.i[i]);
+      dstIndex = (i % 4) + (i / 4) * 8;
+      if (sVal != dst.s[dstIndex])
+	abort ();
+
+      sVal = int_to_ushort (src2.i[i]);
+      dstIndex += 4;
+      if (sVal != dst.s[dstIndex])
+	abort ();
+    }
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pcmpeqq.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pcmpeqq.c
new file mode 100644
index 000000000000..39b9f01d64a4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pcmpeqq.c
@@ -0,0 +1,46 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -mpower8-vector" } */
+/* { dg-require-effective-target p8vector_hw } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 64
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 2];
+      long long ll[NUM];
+    } dst, src1, src2;
+  int i, sign = 1;
+  long long is_eq;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src1.ll[i] = i * i * sign;
+      src2.ll[i] = (i + 20) * sign;
+      sign = -sign;
+    }
+
+  for (i = 0; i < NUM; i += 2)
+    dst.x [i / 2] = _mm_cmpeq_epi64(src1.x [i / 2], src2.x [i / 2]);
+
+  for (i = 0; i < NUM; i++)
+    {
+      is_eq = src1.ll[i] == src2.ll[i] ? 0xffffffffffffffffLL : 0LL;
+      if (is_eq != dst.ll[i])
+	abort ();
+    }
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmuldq.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmuldq.c
new file mode 100644
index 000000000000..6a884f46235f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmuldq.c
@@ -0,0 +1,51 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -mpower8-vector" } */
+/* { dg-require-effective-target p8vector_hw } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 64
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 2];
+      long long ll[NUM];
+    } dst;
+  union
+    {
+      __m128i x[NUM / 2];
+      int i[NUM * 2];
+    } src1, src2;
+  int i, sign = 1;
+  long long value;
+
+  for (i = 0; i < NUM * 2; i += 2)
+    {
+      src1.i[i] = i * i * sign;
+      src2.i[i] = (i + 20) * sign;
+      sign = -sign;
+    }
+
+  for (i = 0; i < NUM; i += 2)
+    dst.x[i / 2] = _mm_mul_epi32 (src1.x[i / 2], src2.x[i / 2]);
+
+  for (i = 0; i < NUM; i++)
+    {
+      value = (long long) src1.i[i * 2] * (long long) src2.i[i * 2];
+      if (value != dst.ll[i])
+	abort ();
+    }
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_1-pmulld.c b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmulld.c
new file mode 100644
index 000000000000..730334366426
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_1-pmulld.c
@@ -0,0 +1,46 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -mvsx" } */
+/* { dg-require-effective-target powerpc_vsx_hw } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_1-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_1_test
+#endif
+
+#include CHECK_H
+
+#include <smmintrin.h>
+
+#define NUM 64
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 4];
+      int i[NUM];
+    } dst, src1, src2;
+  int i, sign = 1;
+  int value;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src1.i[i] = i * i * sign;
+      src2.i[i] = (i + 20) * sign;
+      sign = -sign;
+    }
+
+  for (i = 0; i < NUM; i += 4)
+    dst.x[i / 4] = _mm_mullo_epi32 (src1.x[i / 4], src2.x[i / 4]);
+
+  for (i = 0; i < NUM; i++)
+    {
+      value = src1.i[i] * src2.i[i];
+      if (value != dst.i[i])
+	abort ();
+    }
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_2-check.h b/gcc/testsuite/gcc.target/powerpc/sse4_2-check.h
new file mode 100644
index 000000000000..f6264e5a1083
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_2-check.h
@@ -0,0 +1,18 @@
+#define NO_WARN_X86_INTRINSICS 1
+
+static void sse4_2_test (void);
+
+static void
+__attribute__ ((noinline))
+do_test (void)
+{
+  sse4_2_test ();
+}
+
+int
+main ()
+{
+  do_test ();
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/sse4_2-pcmpgtq.c b/gcc/testsuite/gcc.target/powerpc/sse4_2-pcmpgtq.c
new file mode 100644
index 000000000000..a8a6a2010f45
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/sse4_2-pcmpgtq.c
@@ -0,0 +1,46 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -mvsx" } */
+/* { dg-require-effective-target powerpc_vsx_hw } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse4_2-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse4_2_test
+#endif
+
+#include CHECK_H
+
+#include <nmmintrin.h>
+
+#define NUM 64
+
+static void
+TEST (void)
+{
+  union
+    {
+      __m128i x[NUM / 2];
+      long long ll[NUM];
+    } dst, src1, src2;
+  int i, sign = 1;
+  long long is_eq;
+
+  for (i = 0; i < NUM; i++)
+    {
+      src1.ll[i] = i * i * sign;
+      src2.ll[i] = (i + 20) * sign;
+      sign = -sign;
+    }
+
+  for (i = 0; i < NUM; i += 2)
+    dst.x[i / 2] = _mm_cmpgt_epi64 (src1.x[i / 2], src2.x[i / 2]);
+
+  for (i = 0; i < NUM; i++)
+    {
+      is_eq = src1.ll[i] > src2.ll[i] ? 0xFFFFFFFFFFFFFFFFLL : 0LL;
+      if (is_eq != dst.ll[i])
+	abort ();
+    }
+}


* Re: [PATCH v3 1/6] rs6000: Support SSE4.1 "round" intrinsics
  2021-10-11 22:04                   ` Segher Boessenkool
@ 2021-10-12 19:35                     ` Paul A. Clarke
  2021-10-12 22:25                       ` Segher Boessenkool
  0 siblings, 1 reply; 47+ messages in thread
From: Paul A. Clarke @ 2021-10-12 19:35 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches, wschmidt

On Mon, Oct 11, 2021 at 05:04:12PM -0500, Segher Boessenkool wrote:
> On Mon, Oct 11, 2021 at 12:31:07PM -0500, Paul A. Clarke wrote:
> > On Mon, Oct 11, 2021 at 11:28:39AM -0500, Segher Boessenkool wrote:
> > > > Very similar methods are used in glibc today. Are those broken?
> > > 
> > > Maybe.
> > 
> > Ouch.
> 
> So show the code?

You asked for it. ;-)  Boiled down to remove macroisms and code that
should be removed by optimization:
--
static __inline __attribute__ ((__always_inline__)) void
libc_feholdsetround_ppc_ctx (struct rm_ctx *ctx, int r)
{
  fenv_union_t old;
  register fenv_union_t __fr;
  __asm__ __volatile__ ("mffscrni %0,%1" : "=f" (__fr.fenv) : "i" (r));
  ctx->env = old.fenv = __fr.fenv; 
  ctx->updated_status = (r != (old.l & 3));
}
static __inline __attribute__ ((__always_inline__)) void
libc_feresetround_ppc (fenv_t *envp)
{ 
  fenv_union_t new = { .fenv = *envp };
  register fenv_union_t __fr;
  __fr.l = new.l & 3;
  __asm__ __volatile__ ("mffscrn %0,%1" : "=f" (__fr.fenv) : "f" (__fr.fenv));
}
double
__sin (double x)
{
  struct rm_ctx ctx __attribute__ ((cleanup (libc_feresetround_ppc_ctx)));
  libc_feholdsetround_ppc_ctx (&ctx, (0));
  /* floating point intensive code.  */
  return retval;
}
--

There's not much to it, really.  "mffscrni" on the way in to save and set
a required rounding mode, and "mffscrn" on the way out to restore it.

> > > If you get a real (i.e. not inline) function call there, that
> > > can save you often.
> > 
> > Calling a real function in order to execute a single instruction is
> > sub-optimal. ;-)
> 
> Calling a real function (that does not even need a stack frame, just a
> blr) is not terribly expensive, either.

Not ideal, better would be better.

> > > > Would creating a __builtin_mffsce be another solution?
> > > 
> > > Yes.  And not a bad idea in the first place.
> > 
> > The previous "Nope" and this "Yes" seem in contradiction. If there is no
> > difference between "asm" and builtin, how does using a builtin solve the
> > problem?
> 
> You will have to make the builtin solve it.  What a builtin can do is
> virtually unlimited.  What an asm can do is not: it just outputs some
> assembler language, and does in/out/clobber constraints.  You can do a
> *lot* with that, but it is much more limited than everything you can do
> in the compiler!  :-)
> 
> The fact remains that there is no way in RTL (or Gimple for that matter)
> to express things like rounding mode changes.  You will need to
> artificially make some barriers.

I know there is __builtin_set_fpscr_rn that generates mffscrn. This
is not used in the code above because I believe it first appears in
GCC 9.1 or so, and glibc still supports GCC 6.2 (and it doesn't define
a return value, which would be handy in this case).  Does the
implementation of that builtin meet the requirements needed here,
to prevent reordering of FP computation across instantiations of the
builtin?  If not, is there a model on which to base an implementation
of __builtin_mffsce (or some preferred name)?

PC


* Re: [PATCH v3 1/6] rs6000: Support SSE4.1 "round" intrinsics
  2021-10-12 19:35                     ` Paul A. Clarke
@ 2021-10-12 22:25                       ` Segher Boessenkool
  2021-10-19  0:36                         ` Paul A. Clarke
  0 siblings, 1 reply; 47+ messages in thread
From: Segher Boessenkool @ 2021-10-12 22:25 UTC (permalink / raw)
  To: Paul A. Clarke; +Cc: gcc-patches, wschmidt

On Tue, Oct 12, 2021 at 02:35:57PM -0500, Paul A. Clarke wrote:
> You asked for it. ;-)  Boiled down to remove macroisms and code that
> should be removed by optimization:

Thanks :-)

> static __inline __attribute__ ((__always_inline__)) void
> libc_feholdsetround_ppc_ctx (struct rm_ctx *ctx, int r)
> {
>   fenv_union_t old;
>   register fenv_union_t __fr;
>   __asm__ __volatile__ ("mffscrni %0,%1" : "=f" (__fr.fenv) : "i" (r));
>   ctx->env = old.fenv = __fr.fenv; 
>   ctx->updated_status = (r != (old.l & 3));
> }

(Should use "n", not "i", only numbers are allowed, not e.g. the address
of something.  This actually can matter, in unusual cases.)

This orders the updating of RN before the store to __fr.fenv .  There is
no other ordering ensured here.

The store to __fr.fenv obviously has to stay in order with anything that
can alias it, if that store isn't optimised away completely later.

> static __inline __attribute__ ((__always_inline__)) void
> libc_feresetround_ppc (fenv_t *envp)
> { 
>   fenv_union_t new = { .fenv = *envp };
>   register fenv_union_t __fr;
>   __fr.l = new.l & 3;
>   __asm__ __volatile__ ("mffscrn %0,%1" : "=f" (__fr.fenv) : "f" (__fr.fenv));
> }

This both reads from and stores to __fr.fenv, the asm has to stay
between those two accesses (in the machine code).  If the code that
actually depends on the modified RN depends on that __fr.fenv in some way,
all will be fine.

> double
> __sin (double x)
> {
>   struct rm_ctx ctx __attribute__ ((cleanup (libc_feresetround_ppc_ctx)));
>   libc_feholdsetround_ppc_ctx (&ctx, (0));
>   /* floating point intensive code.  */
>   return retval;
> }

... but there is no such dependency.  The cleanup attribute does not
give any such ordering either afaik.

> There's not much to it, really.  "mffscrni" on the way in to save and set
> a required rounding mode, and "mffscrn" on the way out to restore it.

Yes.  But the code making use of the modified RN needs to have some
artificial dependencies with the RN setters, perhaps via __fr.fenv .

> > Calling a real function (that does not even need a stack frame, just a
> > blr) is not terribly expensive, either.
> 
> Not ideal, better would be better.

Yes.  But at least it *works* :-)  I'll take a stupid, simple,
*robust* solution over some nice, fast way of doing the wrong thing.

> > > > > Would creating a __builtin_mffsce be another solution?
> > > > 
> > > > Yes.  And not a bad idea in the first place.
> > > 
> > > The previous "Nope" and this "Yes" seem in contradiction. If there is no
> > > difference between "asm" and builtin, how does using a builtin solve the
> > > problem?
> > 
> > You will have to make the builtin solve it.  What a builtin can do is
> > virtually unlimited.  What an asm can do is not: it just outputs some
> > assembler language, and does in/out/clobber constraints.  You can do a
> > *lot* with that, but it is much more limited than everything you can do
> > in the compiler!  :-)
> > 
> > The fact remains that there is no way in RTL (or Gimple for that matter)
> > to express things like rounding mode changes.  You will need to
> > artificially make some barriers.
> 
> I know there is __builtin_set_fpscr_rn that generates mffscrn.

Or some mtfsb[01]'s, or nasty mffs/mtfsf code, yeah.  And it does not
provide the ordering either.  It *cannot*: you need to cooperate with
whatever you are ordering against.  There is no way in GCC to say "this
is an FP insn and has to stay in order with all FP control writes and FP
status reads".

Maybe now you see why I like external functions for this :-)

> This
> is not used in the code above because I believe it first appears in
> GCC 9.1 or so, and glibc still supports GCC 6.2 (and it doesn't define
> a return value, which would be handy in this case).  Does the
> implementation of that builtin meet the requirements needed here,
> to prevent reordering of FP computation across instantiations of the
> builtin?  If not, is there a model on which to base an implementation
> of __builtin_mffsce (or some preferred name)?

It depends on what you are actually ordering, unfortunately.


Segher


* Re: [PATCH v3 6/6] rs6000: Guard some x86 intrinsics implementations
  2021-10-12  0:11   ` Segher Boessenkool
@ 2021-10-13 17:04     ` Paul A. Clarke
  2021-10-13 23:47       ` Segher Boessenkool
  0 siblings, 1 reply; 47+ messages in thread
From: Paul A. Clarke @ 2021-10-13 17:04 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches, wschmidt

On Mon, Oct 11, 2021 at 07:11:13PM -0500, Segher Boessenkool wrote:
> On Mon, Aug 23, 2021 at 02:03:10PM -0500, Paul A. Clarke wrote:
> > Some compatibility implementations of x86 intrinsics include
> > Power intrinsics which require POWER8.  Guard them.
> 
> > emmintrin.h:
> > - _mm_cmpord_pd: Remove code which was ostensibly for pre-POWER8,
> >   but which indeed depended on POWER8 (vec_cmpgt(v2du)/vcmpgtud).
> >   The "POWER8" version works fine on pre-POWER8.
> 
> Huh.  It just generates xvcmpeqdp I suppose?

Yes.

> > - _mm_mul_epu32: vec_mule(v4su) uses vmuleuw.
> 
> Did this fail on p7?  If not, add a test that *does*?

Do you mean fail if not for "dg-require-effective-target p8vector_hw"?
We have that, in gcc/testsuite/gcc.target/powerpc/sse2-pmuludq-1.c.

> > pmmintrin.h:
> > - _mm_movehdup_ps: vec_mergeo(v4su) uses vmrgow.
> > - _mm_moveldup_ps: vec_mergee(v4su) uses vmrgew.
> 
> Similar.

gcc/testsuite/gcc.target/powerpc/sse3-movshdup.c
gcc/testsuite/gcc.target/powerpc/sse3-movsldup.c

> > smmintrin.h:
> > - _mm_cmpeq_epi64: vec_cmpeq(v2di) uses vcmpequd.
> > - _mm_mul_epi32: vec_mule(v4si) uses vmuluwm.
> > - _mm_cmpgt_epi64: vec_cmpgt(v2di) uses vcmpgtsd.
> > tmmintrin.h:
> > - _mm_sign_epi8: vec_neg(v4si) uses vsububm.
> > - _mm_sign_epi16: vec_neg(v4si) uses vsubuhm.
> > - _mm_sign_epi32: vec_neg(v4si) uses vsubuwm.
> >   Note that the above three could actually be supported pre-POWER8,
> >   but current GCC does not support them before POWER8.
> > - _mm_sign_pi8: depends on _mm_sign_epi8.
> > - _mm_sign_pi16: depends on _mm_sign_epi16.
> > - _mm_sign_pi32: depends on _mm_sign_epi32.
> 
> And more.

gcc/testsuite/gcc.target/powerpc/sse4_1-pcmpeqq.c
gcc/testsuite/gcc.target/powerpc/sse4_1-pmuldq.c
gcc/testsuite/gcc.target/powerpc/sse4_2-pcmpgtq.c
- although this one will _actually_ fail on P7, as it only requires
"vsx_hw". I'll fix this.
gcc/testsuite/gcc.target/powerpc/ssse3-psignb.c
gcc/testsuite/gcc.target/powerpc/ssse3-psignw.c
gcc/testsuite/gcc.target/powerpc/ssse3-psignd.c

> > gcc
> > 	PR target/101893
> 
> This is a different bug (the vgbdd one)?

PR 101893 is the same issue: things not being properly masked by
#ifdefs.

> All looks good, but we need such failing tests :-)

Thanks for the review! Let me know what you mean by "failing tests".
("Would fail if not for ..."?)

PC


* Re: [PATCH v3 6/6] rs6000: Guard some x86 intrinsics implementations
  2021-10-13 17:04     ` Paul A. Clarke
@ 2021-10-13 23:47       ` Segher Boessenkool
  2021-10-19  0:26         ` Paul A. Clarke
  0 siblings, 1 reply; 47+ messages in thread
From: Segher Boessenkool @ 2021-10-13 23:47 UTC (permalink / raw)
  To: Paul A. Clarke; +Cc: gcc-patches, wschmidt

On Wed, Oct 13, 2021 at 12:04:39PM -0500, Paul A. Clarke wrote:
> On Mon, Oct 11, 2021 at 07:11:13PM -0500, Segher Boessenkool wrote:
> > > - _mm_mul_epu32: vec_mule(v4su) uses vmuleuw.
> > 
> > Did this fail on p7?  If not, add a test that *does*?
> 
> Do you mean fail if not for "dg-require-effective-target p8vector_hw"?
> We have that, in gcc/testsuite/gcc.target/powerpc/sse2-pmuludq-1.c.

"Some compatibility implementations of x86 intrinsics include
Power intrinsics which require POWER8."

Plus, everything this patch does.  None of that would be needed if it
worked on p7!

So things in this patch are either not needed (so add noise only, and
reduce functionality on older systems for no reason), or they do fix a
bug.  It would be nice if we could have detected such bugs earlier.

> > > gcc
> > > 	PR target/101893
> > 
> > This is a different bug (the vgbdd one)?
> 
> PR 101893 is the same issue: things not being properly masked by
> #ifdefs.

But PR101893 does not mention anything you touch here, and this patch
does not fix PR101893.  The main purpose of bug tracking systems is the
tracking part!


Segher


* Re: [PATCH v3 6/6] rs6000: Guard some x86 intrinsics implementations
  2021-10-13 23:47       ` Segher Boessenkool
@ 2021-10-19  0:26         ` Paul A. Clarke
  0 siblings, 0 replies; 47+ messages in thread
From: Paul A. Clarke @ 2021-10-19  0:26 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches, wschmidt

On Wed, Oct 13, 2021 at 06:47:21PM -0500, Segher Boessenkool wrote:
> On Wed, Oct 13, 2021 at 12:04:39PM -0500, Paul A. Clarke wrote:
> > On Mon, Oct 11, 2021 at 07:11:13PM -0500, Segher Boessenkool wrote:
> > > > - _mm_mul_epu32: vec_mule(v4su) uses vmuleuw.
> > > 
> > > Did this fail on p7?  If not, add a test that *does*?
> > 
> > Do you mean fail if not for "dg-require-effective-target p8vector_hw"?
> > We have that, in gcc/testsuite/gcc.target/powerpc/sse2-pmuludq-1.c.
> 
> "Some compatibility implementations of x86 intrinsics include
> Power intrinsics which require POWER8."
> 
> Plus, everything this patch does.  None of that would be needed if it
> worked on p7!

The tests that are permitted to compile/link on P7, gated by dg directives,
work on P7.

> So things in this patch are either not needed (so add noise only, and
> reduce functionality on older systems for no reason), or they do fix a
> bug.  It would be nice if we could have detected such bugs earlier.

Most, if not all of the intrinsics tests were originally limited to
P8 and up, 64bit, and little-endian. At your request, I have lowered
many of those restrictions in areas that are capable of support.
Such is the case here, to enable compiling and running as much as
possible on P7.

If you want a different approach, do let me know.

> > > > gcc
> > > > 	PR target/101893
> > > 
> > > This is a different bug (the vgbdd one)?
> > 
> > PR 101893 is the same issue: things not being properly masked by
> > #ifdefs.
> 
> But PR101893 does not mention anything you touch here, and this patch
> does not fix PR101893.  The main purpose of bug tracking systems is the
> tracking part!

The error message in PR101893 is in smmintrin.h:
| gcc/include/smmintrin.h:103:3: error: AltiVec argument passed to unprototyped function
| 
| That line is
| 
|   __charmask = vec_gb (__charmask);

smmintrin.h is changed by this patch, including `#ifdef _ARCH_PWR8` around
the code which has vec_gb.

PC


* Re: [PATCH v3 1/6] rs6000: Support SSE4.1 "round" intrinsics
  2021-10-12 22:25                       ` Segher Boessenkool
@ 2021-10-19  0:36                         ` Paul A. Clarke
  0 siblings, 0 replies; 47+ messages in thread
From: Paul A. Clarke @ 2021-10-19  0:36 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches, wschmidt

On Tue, Oct 12, 2021 at 05:25:32PM -0500, Segher Boessenkool wrote:
> On Tue, Oct 12, 2021 at 02:35:57PM -0500, Paul A. Clarke wrote:
> > static __inline __attribute__ ((__always_inline__)) void
> > libc_feholdsetround_ppc_ctx (struct rm_ctx *ctx, int r)
> > {
> >   fenv_union_t old;
> >   register fenv_union_t __fr;
> >   __asm__ __volatile__ ("mffscrni %0,%1" : "=f" (__fr.fenv) : "i" (r));
> >   ctx->env = old.fenv = __fr.fenv; 
> >   ctx->updated_status = (r != (old.l & 3));
> > }
> 
> (Should use "n", not "i", only numbers are allowed, not e.g. the address
> of something.  This actually can matter, in unusual cases.)

Noted, will submit a change to glibc when I get a chance. Thanks!

> This orders the updating of RN before the store to __fr.fenv .  There is
> no other ordering ensured here.
> 
> The store to __fr.fenv obviously has to stay in order with anything that
> can alias it, if that store isn't optimised away completely later.
> 
> > static __inline __attribute__ ((__always_inline__)) void
> > libc_feresetround_ppc (fenv_t *envp)
> > { 
> >   fenv_union_t new = { .fenv = *envp };
> >   register fenv_union_t __fr;
> >   __fr.l = new.l & 3;
> >   __asm__ __volatile__ ("mffscrn %0,%1" : "=f" (__fr.fenv) : "f" (__fr.fenv));
> > }
> 
> This both reads from and stores to __fr.fenv, the asm has to stay
> between those two accesses (in the machine code).  If the code that
> actually depends on the modified RN depends on that __fr.fenv in some way,
> all will be fine.
> 
> > double
> > __sin (double x)
> > {
> >   struct rm_ctx ctx __attribute__ ((cleanup (libc_feresetround_ppc_ctx)));
> >   libc_feholdsetround_ppc_ctx (&ctx, (0));
> >   /* floating point intensive code.  */
> >   return retval;
> > }
> 
> ... but there is no such dependency.  The cleanup attribute does not
> give any such ordering either afaik.
> 
> > There's not much to it, really.  "mffscrni" on the way in to save and set
> > a required rounding mode, and "mffscrn" on the way out to restore it.
> 
> Yes.  But the code making use of the modified RN needs to have some
> artificial dependencies with the RN setters, perhaps via __fr.fenv .
> 
> > > Calling a real function (that does not even need a stack frame, just a
> > > blr) is not terribly expensive, either.
> > 
> > Not ideal, better would be better.
> 
> Yes.  But at least it *works* :-)  I'll take a stupid, simple,
> *robust* solution over some nice, fast way of doing the wrong thing.

Understand, and agree. 

> > > > > > Would creating a __builtin_mffsce be another solution?
> > > > > 
> > > > > Yes.  And not a bad idea in the first place.
> > > > 
> > > > The previous "Nope" and this "Yes" seem in contradiction. If there is no
> > > > difference between "asm" and builtin, how does using a builtin solve the
> > > > problem?
> > > 
> > > You will have to make the builtin solve it.  What a builtin can do is
> > > virtually unlimited.  What an asm can do is not: it just outputs some
> > > assembler language, and does in/out/clobber constraints.  You can do a
> > > *lot* with that, but it is much more limited than everything you can do
> > > in the compiler!  :-)
> > > 
> > > The fact remains that there is no way in RTL (or Gimple for that matter)
> > > to express things like rounding mode changes.  You will need to
> > > artificially make some barriers.
> > 
> > I know there is __builtin_set_fpscr_rn that generates mffscrn.
> 
> Or some mtfsb[01]'s, or nasty mffs/mtfsf code, yeah.  And it does not
> provide the ordering either.  It *cannot*: you need to cooperate with
> whatever you are ordering against.  There is no way in GCC to say "this
> is an FP insn and has to stay in order with all FP control writes and FP
> status reads".
> 
> Maybe now you see why I like external functions for this :-)
> 
> > This
> > is not used in the code above because I believe it first appears in
> > GCC 9.1 or so, and glibc still supports GCC 6.2 (and it doesn't define
> > a return value, which would be handy in this case).  Does the
> > implementation of that builtin meet the requirements needed here,
> > to prevent reordering of FP computation across instantiations of the
> > builtin?  If not, is there a model on which to base an implementation
> > of __builtin_mffsce (or some preferred name)?
> 
> It depends on what you are actually ordering, unfortunately.

What I hear is that, for the specific requirements and restrictions here,
there is nothing special that another builtin, like a theoretical
__builtin_mffsce implemented like __builtin_set_fpscr_rn, could provide
to solve the issue under discussion.  The dependencies need to be expressed
such that the compiler understands them, and there is no way to do so
with the current implementation of __builtin_set_fpscr_rn.

With some effort, and proper visibility, the dependencies can be expressed
using "asm". I believe that's the case here, and will submit a v2 for
review shortly.

For the general case of inlines, builtins, or asm without visibility,
I've opened an issue for GCC to consider an accommodation:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102783.

Thanks so much for your help!

PC



Thread overview: 47+ messages
2021-08-23 19:03 [PATCH v3 0/6] rs6000: Support more SSE4 intrinsics Paul A. Clarke
2021-08-23 19:03 ` [PATCH v3 1/6] rs6000: Support SSE4.1 "round" intrinsics Paul A. Clarke
2021-08-27 13:44   ` Bill Schmidt
2021-08-27 13:47     ` Bill Schmidt
2021-08-30 21:16     ` Paul A. Clarke
2021-08-30 21:24       ` Bill Schmidt
2021-10-07 23:08       ` Segher Boessenkool
2021-10-07 23:39   ` Segher Boessenkool
2021-10-08  1:04     ` Paul A. Clarke
2021-10-08 17:39       ` Segher Boessenkool
2021-10-08 19:27         ` Paul A. Clarke
2021-10-08 22:31           ` Segher Boessenkool
2021-10-11 13:46             ` Paul A. Clarke
2021-10-11 16:28               ` Segher Boessenkool
2021-10-11 17:31                 ` Paul A. Clarke
2021-10-11 22:04                   ` Segher Boessenkool
2021-10-12 19:35                     ` Paul A. Clarke
2021-10-12 22:25                       ` Segher Boessenkool
2021-10-19  0:36                         ` Paul A. Clarke
2021-08-23 19:03 ` [PATCH v3 2/6] rs6000: Support SSE4.1 "min" and "max" intrinsics Paul A. Clarke
2021-08-27 13:47   ` Bill Schmidt
2021-10-11 19:28   ` Segher Boessenkool
2021-10-12  1:42     ` [COMMITTED v4 " Paul A. Clarke
2021-08-23 19:03 ` [PATCH v3 3/6] rs6000: Simplify some SSE4.1 "test" intrinsics Paul A. Clarke
2021-08-27 13:48   ` Bill Schmidt
2021-10-11 20:50   ` Segher Boessenkool
2021-10-12  1:47     ` [COMMITTED v4 " Paul A. Clarke
2021-08-23 19:03 ` [PATCH v3 4/6] rs6000: Support SSE4.1 "cvt" intrinsics Paul A. Clarke
2021-08-27 13:49   ` Bill Schmidt
2021-10-11 21:52   ` Segher Boessenkool
2021-10-12  1:51     ` [COMMITTED v4 " Paul A. Clarke
2021-08-23 19:03 ` [PATCH v3 5/6] rs6000: Support more SSE4 "cmp", "mul", "pack" intrinsics Paul A. Clarke
2021-08-27 15:21   ` Bill Schmidt
2021-08-27 18:52     ` Paul A. Clarke
2021-10-11 23:07   ` Segher Boessenkool
2021-10-12  1:55     ` [COMMITTED v4 " Paul A. Clarke
2021-08-23 19:03 ` [PATCH v3 6/6] rs6000: Guard some x86 intrinsics implementations Paul A. Clarke
2021-08-27 15:25   ` Bill Schmidt
2021-10-12  0:11   ` Segher Boessenkool
2021-10-13 17:04     ` Paul A. Clarke
2021-10-13 23:47       ` Segher Boessenkool
2021-10-19  0:26         ` Paul A. Clarke
2021-09-16 14:59 ` [PATCH v3 0/6] rs6000: Support more SSE4 intrinsics Paul A. Clarke
2021-10-04 18:26   ` Paul A. Clarke
2021-10-07 22:25 ` Segher Boessenkool
2021-10-08  0:29   ` Paul A. Clarke
2021-10-12  0:15     ` Segher Boessenkool
